Preparing a Project Gutenberg ebook for use on a 6″ ereader

For a while I’ve been trying to find a nice way to convert project Gutenberg books to look pleasant on a BeBook One. I’ve finally hit on the perfect combination of tools, that produces documents ideally suited to 6″ eInk ebook readers like my BeBook. The tool chain involves using GutenMark to convert the file into LaTeX and then TeXworks to modify the geometry and typography of the LaTeX file to suit the dimensions of the document to suit the small screen of the BeBook, then MiKTeX to convert the resultant LaTeX files into PDF (using pdfLaTeX).  Go to GutenMark (plus GUItenMark) for windows, MikTeX which includes the powerful TeX editor TeXworks, install them, and ensure they are on the path.

Here’s an example of the usual LaTeX output gtom GUItenMark. Note that this is configured for double-sided printed output.

\geometry{verbose,paperwidth=5.5in,paperheight=8.5in, tmargin=0.75in,bmargin=0.75in, lmargin=1in,rmargin=1in}
\evensidemargin = -0.25in
\oddsidemargin = 0.25in

We don’t need the margins to be so large, and we don’t need a difference in the odd and even side margins, since all pages on an ereader need to look the same. Modify the geometry of the page to the following:

\geometry{verbose,paperwidth=3.5in,paperheight=4.72in, tmargin=0.5in,bmargin=0in, lmargin=0.2in,rmargin=0.2in}

This has the added benefit of slightly increasing the perceived size of the text when displayed on the screen. Comment out the odd and even side margins like so:

%\evensidemargin = -0.25in
%\oddsidemargin = 0.25in

And here is what you get:

The finished product

Since both gutenmark and pdflatex are command line tools, we can script the conversion process. The editing is done with Sed (the stream editor). I get mine from cygwin, though there are plenty of ways to get the Gnu toolset onto a windows machine these days.

/c/Program\ Files/GutenMark/binary/GutenMark.exe --config="C:\Program Files\GutenMark\GutConfigs\GutenMark.cfg" --ron --latex "$1.txt" "$1.tex"

sed 's/paperwidth=5.5in/paperwidth=3.5in/
s/\\evensidemargin/%\\evensidemargin/' <"$1.tex" >"$1.bebook.tex"

pdflatex  -interaction nonstopmode "$1.bebook.tex"

rm *.aux *.log *.toc *.tex

Now all you need to do is invoke this bash script with the (extensionless) name of the gutenberg text file, and it will give you a PDF file in return. nice.

Some pictures of Carlton Gardens


Carlton Gardens, a set on Flickr.

This was my first outing with the Pentax K-x that I got recently. In these pictures, I’m trying to get to grips with the camera, so I didn’t have any particular objective other than to take pictures.

The light was so harsh it was very difficult for me to gauge whether the exposures were working – I couldn’t see the live views or previews at all! All in all I was very surprised that any of them were worth looking at.

Note to Self: Convert UTF-8 w/ BOM to ASCII (WIX + DB) using GNU uconv

This one took me a long time to work out, and it took a non-latin alphabet user (Russian) to point me at the right tools. Yet again, I’m guilty of being a complacent anglophone.

I was producing a database installer project using WIX 3.5, and ran into all sorts of inexplicable problems, which I finally tracked down to the Byte Order Mark (BOM) on my SQL update files that I was importing into my MSI file. See here for more on that.

I discovered that the ‘varied’ toolset used in our dev environments (i.e. VS 2010, Cygwin, VIM, GIT, SVN, NAnt, MSBuild, R# etc) meant that the update scripts had steadily diffused out into Unicode space. You can find out (approximately) what the encodings are for a directory of files using the GNU file command. Here’s a selection of files that I was including in my installer:

$ file *
01.sql:          ASCII text, with CRLF line terminators
02.sql:          Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminator
03.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
05.sql:          ASCII English text, with CRLF line terminators
06.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
11.sql:          ASCII C program text, with CRLF line terminators
12.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
23.sql:          ASCII text, with CRLF line terminators
24.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
25.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
26.sql:          ASCII text, with CRLF line terminators
27.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
28.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
29.sql:          Little-endian UTF-16 Unicode C program text, with very long lines, with CRLF, CR line
30.sql:          UTF-8 Unicode (with BOM) C program text, with very long lines, with CRLF line terminat
37.sql:          UTF-8 Unicode (with BOM) English text, with CRLF line terminators
38.sql:          Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
39.sql:          Little-endian UTF-16 Unicode text, with CRLF line terminators
44.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
AlwaysRun0001.sql: ASCII C program text, with CRLF line terminators
AlwaysRun0002.sql: UTF-8 Unicode (with BOM) C program text, with CRLF line terminators
TestData0001.sql:        UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators

You can see that there appear to be a variety of encodings. I initially assumed that a quick run through d2u or u2d would fix them up, but that did nothing to change the encoding or remove the BOM. In the end I found the IBM uconv command, that has the handy ‘–remove-signature’ option that was the key to the solution. Don’t confuse this with the GNU iconv app, that doesn’t allow you to strip the BOM from the front of your files.

$ uconv --remove-signature -t ASCII TestData0001.sql > TestData0001.sql2
$ rm TestData0001.sql
$ mv TestData0001.sql2 TestData0001.sql

After that, the WIX installer worked OK, and all was right with the world. I hope this helps you if you run into the same problem.

I can’t answer the question of why WIX/MSI fails to work with non-ASCII files (other than to say that Unicode blondness is a common problem of software written by Anglophones).

Automata-Based Programming With Petri Nets – Part 1

Petri Nets are extremely powerful and expressive, but they are not as widely used in the software development community as deterministic state machines. That’s a pity – they allow us to solve problems beyond the reach of conventional state machines. This is the first in a mini-series on software development with Petri Nets. All of the code for a full feature-complete Petri Net library is available online at GitHub. You’re welcome to take a copy, play with it and use it in your own projects. Code for this and subsequent articles can be found at

Continue reading

Quantum Reasoners Hold Key to Future Web

Last year, a company called DWave Systems announced their quantum computer (the ‘Orion’) – another milestone on the road to practical quantum computing. Their controversial claims seem worthy in their own right but they are particularly important to the semantic web (SW) community. The significance to the SW community was that their quantum computer solved problems akin to Grover’s Algorithm speeding up queries of disorderly databases.

Semantic web databases are not (completely) disorderly and there are many ways to optimize the search for matching triples to a graph pattern. What strikes me is that the larger the triple store, the more compelling the case for using some kind of quantum search algorithm to find matches. DWave are currently trialing 128qbit processors, and they claim their systems can scale, so I (as a layman) can see no reason why such computers couldn’t be used to help improve the performance of queries in massive triple stores.

 What I wonder is:

  1. what kind of indexing schemes can be used to impose structure on the triples in a store?
  2. how can one adapt a B-tree to index each element of a triple rather than just a single primary key – three indexes seems extravagant.
  3. are there quantum algorithms that can beat the best of these schemes?
  4. is there is a place for quantum superposition in a graph matching algorithm (to simultaneously find matching triples then cancel out any that don’t match all the basic graph patterns?)
  5. if DWave’s machines could solve NP-Complete problems, does that mean that we would then just use OWL-Full?
  6. would the speed-ups then be great enough to consider linking everyday app data to large scale-upper ontologies?
  7. is a contradiction in a ‘quantum reasoner’ (i.e. a reasoner that uses a quantum search engine) something that can never occur because it just cancels out and never appears in the returned triples? Would any returned conclusion be necessarily true (relative to the axioms of the ontology?)

Any thoughts?

DWave are now working with Google to help them improve some of their machine learning algorithms. I wonder whether there will be other research into the practicality of using DWave quantum computing systems in conjunction with inference engines? This could, of course, open up whole new vistas of services that could be provided by Google (or their competitors). Either way, it gives me a warm feeling to know that every time I do a search, I’m getting the results from a quantum computer (no matter how indirectly). Nice.

Semantic Overflow Highlights I

Semantic Overflow has been active for a couple of weeks. We now have 155 users and 53 questions. We’ve already had some very interesting questions and some excellent detailed and thoughtful responses. I thought, on Egon’s instigation, to  bring together, from the site’s BI stats, some of the highlights of last week.

The best loved question this week came from Jerven Bolleman who wanted to know whether there was a “Simple CLI useable OWL Reasoner“. The most popular answer (and highest voted answer) came from Ivan Herman who provided tool suggestions, guidance and insights into the current research directions.

The most viewed question was from Akshat Shrivastava who asked “How do I make my homepage/personal blog semantic ?“. This question garnered 5 very good answers, including a particularly good one from Bill Roberts, who provided a lot of detail on his own experiences of doing just that.

The highest voted answer was from Egon Willighagen in answer to a question by ‘dusoft‘ as to “Why launching SemanticOverflow with too little users with zero reputation sucks?”. He very helpfully explained how to bootstrap your reputation and make a success of the site. I’m glad to say that dusoft’s was the only negative comment so far, and that the general response of other site users has been very positive. Mike McClintock even told me that “SemanticOverflow is my new favorite thing“. Thanks, Mike, I hope it stays that way!

In terms of gathering reputation, Ian Davis of Talis is the clear front runner with the following questions:

He also provided 17 very good answers. Thanks, Ian, and thanks to all the others who are already making Semantic Overflow a great site! – the Web 2.0 Q&A site for all things Web 3.0. is a new site based on the hugely popular, devoted to Q&A on anything related to the semantic web. The site is very new (created today) and I’m trying to get as many people to visit as I can, so please come and post your questions and together we’ll create a thriving community dedicated just to the semantic web. It’s free to join, free to ask questions, and most important of all – free to see the answers. You don’t even have to sign up to view, ask or answer questions. If you do join, all you get is kudos.
StackOverflow has been an overnight sensation because it combines many of the traditional features of wikis, newgroups and social media platforms combined with a reputation system that promotes community involvement and high quality discussion. I’m sure that as the Semantic Web starts to go exponential, Semantic Overflow will be a useful forum for more than just technical questions. I hope you’ll use it for speculation about the directions of the field itself. Unlike StackOverflow, non-technical discussions won’t be moderated out, provided they are on-topic. I think that in such an interesting and emerging space, discussion and dissemination of knowledge is critical. If you agree, please come and join us for a question or two.
Please also tell your friends, colleagues, relatives, acquaintances, pets and neighbors and ask them all to visit the site. If you have any suggestions about other places I could promote the site in – feel free to provide an answer here.