Convert pdf to text on Ubuntu

We had a rather ugly scanned pdf of a very lovely poem over on our feedback website, so I thought I would try to post the text from it, using tesseract for the Optical Character Recognition (OCR).

On Ubuntu start with:

sudo apt-get install tesseract-ocr imagemagick

You’ll need to convert the pdf to an image file:

convert -density 600 input.pdf output.tif

and then run tesseract on that image, which writes the text to output.txt (tesseract adds the .txt extension itself):

tesseract output.tif output

Do note that if an empty text file is returned (as it was for me), this could be because the margins around the image are too large. Crop it down (not too tightly) and it should work nicely. I would say it returned “l” (lowercase L) a lot where in natural English it is obviously going to be an “I”. That seems obvious to a human reader, but I suppose there must be some technical reason for it.
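
If you would rather not crop by hand or fix every stray “l” yourself, something like the following might help with both. It is only a sketch, not what I actually ran: the function names trim_margins and fix_lone_l are made up for illustration, the 10% fuzz and 50 pixel border are guesses that will need tuning for your scan, and the sed rule is a blunt heuristic that will also change any standalone “l” that really was meant to be an l.

```shell
#!/bin/bash

# Trim the near-uniform margin off a scan, then add a modest border back
# so the text is not cropped too tightly.
# $1 = input image, $2 = output image
# Assumption: the margins are roughly one colour; noisy scans may need a
# larger -fuzz value or manual cropping instead.
trim_margins() {
    convert "$1" -fuzz 10% -trim +repage -bordercolor white -border 50 "$2"
}

# Replace a lowercase "l" that stands alone as a word with "I".
# Reads the named files, or standard input if none are given.
# \b is a GNU sed word boundary, so "hello" and "I'll" are left alone.
fix_lone_l() {
    sed -E 's/\bl\b/I/g' "$@"
}
```

With those defined, the steps would run in this order:

trim_margins output.tif cropped.tif
tesseract cropped.tif output
fix_lone_l output.txt > fixed.txt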
