Saturday, June 2, 2018

Turn PDF to text on Mac

To turn a PDF into text on a Mac, first install ImageMagick and Tesseract using Homebrew. ImageMagick renders the PDF pages as an image, and Tesseract runs OCR on that image. For English text, download "eng.traineddata" and put it in Tesseract's tessdata directory.
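
As a rough sketch of the setup step, the two helper functions below install the tools and report anything still missing from the PATH. The function names install_ocr_tools and missing_tools are made up for this example. Ghostscript (gs) is included on the assumption that ImageMagick hands PDF rendering to it, and the check looks for "convert", the ImageMagick command used in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original post): install the tools with Homebrew.
# Ghostscript is added on the assumption that ImageMagick uses it to read PDFs.
install_ocr_tools() {
  brew install imagemagick tesseract ghostscript
}

# Hypothetical helper: print the required commands that are not found on PATH.
missing_tools() {
  missing=""
  for tool in convert gs tesseract; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Drop the leading space before printing the list.
  echo "${missing# }"
}
```

Run install_ocr_tools once, then missing_tools; an empty line means all three commands were found.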

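Then run the following two commands. The first renders the PDF as a 300 dpi TIFF, and the second runs OCR on the TIFF and writes the text to [output].txt: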

convert -density 300 [input.pdf] -depth 8 -strip -background white -alpha off [intermediate.tiff]
tesseract [intermediate.tiff] [output] -c load_system_dawg=false -c language_model_penalty_non_freq_dict_word=0.1 -c language_model_penalty_non_dict_word=0.15 -c matcher_bad_match_pad=0.20
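
The two commands can be wrapped in a small shell function so the intermediate TIFF name always matches between them. This is only a sketch: pdf_to_text is a made-up name, and it assumes the input file name ends in ".pdf".

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the original post): OCR one PDF into a text file.
# Assumes the argument ends in ".pdf"; the TIFF and text files are written next to it.
pdf_to_text() {
  pdf="$1"
  base="${pdf%.pdf}"   # strip the .pdf suffix
  tiff="$base.tiff"    # intermediate image shared by both steps

  # Step 1: render the PDF at 300 dpi on a white background.
  convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$tiff" || return 1

  # Step 2: run OCR; tesseract appends .txt to the output base name.
  tesseract "$tiff" "$base" \
    -c load_system_dawg=false \
    -c language_model_penalty_non_freq_dict_word=0.1 \
    -c language_model_penalty_non_dict_word=0.15 \
    -c matcher_bad_match_pad=0.20
}
```

For example, "pdf_to_text scan.pdf" would leave scan.tiff and scan.txt in the same directory as scan.pdf.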