ocrmypdf Documentation


Release 5.2

James R. Barlow

2017-06-13

1 Introduction

2 Release notes

3 Installation

4 Installing additional language packs

5 Cookbook

6 Advanced features

7 Batch processing

8 PDF security issues

9 Common error messages

10 Indices and tables



OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

PDFs are the best format for scanned documents. Unfortunately, PDFs can be difficult to work with. OCRmyPDF makes it easy to apply image processing and OCR to existing PDFs.


Chapter 1. Introduction

OCRmyPDF is a Python 3 package that adds OCR layers to PDFs.

About OCR

Optical character recognition is technology that converts images of typed or handwritten text, such as in a scanned document, to computer text that can be searched and copied.

OCRmyPDF uses Tesseract, the best available open source OCR engine, to perform OCR.

About PDFs

PDFs are page description files that attempt to preserve a layout exactly. They can contain vector graphics as well as raster objects such as scanned images. Because PDFs can contain multiple pages (unlike many image formats) and can contain fonts and text, they are a good format for exchanging scanned documents.

A PDF page might contain multiple images, even if it only appears to have one image. Some scanners or scanning software will, for example, segment pages into monochromatic text and color regions to improve the compression ratio and appearance of the page.

Rasterizing a PDF is the process of generating an image suitable for display or analyzing with an OCR engine. OCR engines like Tesseract work with images, not vector objects.

About PDF/A

PDF/A is an ISO-standardized subset of the full PDF specification that is designed for archiving (the ‘A’ stands for Archive). PDF/A differs from PDF primarily by omitting features that would make it difficult to read the file in the future, such as embedded JavaScript, video, audio and references to external fonts. All fonts and resources needed to interpret the PDF must be contained within it. Because PDF/A disables JavaScript and other types of embedded content, it is probably more secure.

There are various conformance levels and versions, such as “PDF/A-2b”.

Generally speaking, the best format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, mandate the use of PDF/A for scanned documents.

Since most people who scan documents are interested in reading them indefinitely into the future, OCRmyPDF generates PDF/A-2b by default.

PDF/A has a few drawbacks. Some PDF viewers include an alert that the file is a PDF/A, which may confuse some users. It also tends to produce larger files than PDF, because it embeds certain resources even if they are commonly available. PDF/A files can be digitally signed, but may not be encrypted, to ensure they can be read in the future.

Fortunately, converting from PDF/A to a regular PDF is trivial, and any PDF viewer can view PDF/A.

What OCRmyPDF does

OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content. It uses Ghostscript to rasterize the page, and then performs OCR on the rasterized image. It is not enough to simply extract the images from each page and run OCR on them individually. Of course, one could use Ghostscript or another PDF rasterizer and then pass the image to Tesseract.

OCRmyPDF automates this process and produces a minimally changed output file that contains the same information, colorspace and resolution.
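To make the resolution analysis more concrete, the sketch below estimates the effective DPI of an image from its pixel size and the size at which the page draws it, then picks a rasterization DPI that covers every image on the page. It relies on the default PDF user space of 72 units per inch. The function names and sample numbers are illustrative assumptions, not OCRmyPDF's internal code.

```python
from typing import Iterable, Tuple

POINTS_PER_INCH = 72.0  # default PDF user space: 1 unit = 1/72 inch


def image_dpi(pixels: int, drawn_points: float) -> float:
    """Effective resolution of an image along one axis."""
    return pixels / (drawn_points / POINTS_PER_INCH)


def page_raster_dpi(images: Iterable[Tuple[int, int, float, float]]) -> float:
    """Pick a rasterization DPI high enough for every image on the page.

    Each image is (width_px, height_px, drawn_width_pt, drawn_height_pt).
    """
    return max(
        max(image_dpi(w_px, w_pt), image_dpi(h_px, h_pt))
        for w_px, h_px, w_pt, h_pt in images
    )


# A US Letter page (612 x 792 pt) holding a full-page scan plus a small
# logo drawn at one inch square (hypothetical values).
images = [(2550, 3300, 612.0, 792.0), (150, 150, 72.0, 72.0)]
print(page_raster_dpi(images))
```

Taking the maximum means the lower-resolution logo is upsampled slightly, but no detail from the full-page scan is lost.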

The Tesseract OCR engine can output ‘hOCR’ files, which are XML files that contain a description of the text it found on the page. OCRmyPDF will render a new PDF that contains only the hidden text layer, and merge this with the original page.
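As a rough illustration of what an hOCR file carries, the sketch below reads a tiny hand-written hOCR fragment and lists each word with its bounding box, which hOCR stores in the `title` attribute. The sample markup and the helper names `parse_bbox` and `extract_words` are assumptions made for this example; it is not how OCRmyPDF itself processes hOCR.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written hOCR fragment (hypothetical sample, not real Tesseract output).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 2550 3300">
   <span class="ocrx_word" title="bbox 300 400 520 450; x_wconf 93">Hello</span>
   <span class="ocrx_word" title="bbox 540 400 760 450; x_wconf 91">world</span>
  </div>
 </body>
</html>"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:5])
    raise ValueError("no bbox in title: %r" % title)


def extract_words(hocr_text):
    """Return a list of (word, bbox) pairs found in the hOCR document."""
    root = ET.fromstring(hocr_text)
    words = []
    for element in root.iter():
        if element.get("class") == "ocrx_word":
            words.append((element.text, parse_bbox(element.get("title"))))
    return words


for word, bbox in extract_words(SAMPLE_HOCR):
    print(word, bbox)
```

The bounding boxes are in pixels of the rasterized image, so placing the text on a PDF page also requires the DPI used for rasterization.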

Alternately, OCRmyPDF can use the Tesseract OCR engine to directly output PDFs for each page, then merge them.

By default, OCRmyPDF will convert the file to a PDF/A. This behavior can be disabled with the --output-type pdf argument.

Depending on the settings selected, OCRmyPDF may “graft” the OCR layer into the existing PDF, or reconstruct a visually equivalent new PDF.

Why you shouldn’t do this manually

A PDF is similar to an HTML file, in that it contains document structure along with images. Sometimes a PDF does nothing more than present a full page image, but often there is additional content that would be lost.

A manual process could work like either of these:

1. Rasterize each page as an image, OCR the images, and combine the output into a PDF. This preserves the layout of each page, but resamples all images (possibly losing quality, increasing file size, introducing compression artifacts, etc.).

2. Extract each image, OCR, and combine the output into a PDF. This loses the context in which images are used in the PDF, meaning that cropping, rotation and scaling of pages may be lost. Some scanned PDFs use multiple images segmented into black and white, grayscale and color regions, with stencil masks to prevent overlap, as this can enhance the appearance of a file while reducing file size. Reassembling these images correctly is not easy. This approach also loses any text or vector art on any pages in a PDF with both scanned and pure digital content.

In the case of a PDF that is nothing other than a container of images (no rotation, scaling, cropping, one image per page), the second approach can be lossless.



OCRmyPDF uses several strategies depending on input options and the input PDF itself, but generally speaking it rasterizes a page for OCR and then grafts the OCR back onto the original. As such it can handle complex PDFs and still preserve their contents as much as possible.

Limitations

OCRmyPDF is limited by the Tesseract OCR engine. As such it experiences these limitations, as do any other programs that rely on Tesseract:

• The OCR is not as accurate as commercial solutions such as Abbyy.

• It is not capable of recognizing handwriting.

• It may find gibberish and report this as OCR output.

• If a document contains languages outside of those given in the -l LANG arguments, results may be poor.

• It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns and join text across the columns.

• Poor quality scans may produce poor quality OCR. Garbage in, garbage out.

• PDFs that use transparent layers are not currently checked in the test suite, so they may not work correctly.

OCRmyPDF is also limited by the PDF specification:

• PDF encodes the position of text glyphs but does not encode document structure. There is no markup that divides a document into sections, paragraphs, sentences, or even words (since blank spaces are not represented).

As such all elements of document structure including the spaces between words must be derived heuristically.

Some PDF viewers do a better job of this than others.
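To illustrate the kind of heuristic this implies, the sketch below rebuilds a line of text from glyph positions and guesses a space wherever the gap between two glyphs is wide relative to the average glyph width. The glyph tuples, the `gap_ratio` threshold and the function name are all assumptions for this example; real viewers use their own, more elaborate rules.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, left edge x, width), in page units


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a line of text, inserting a space where the gap is wide.

    A space is guessed whenever the distance between two glyphs exceeds
    gap_ratio times the average glyph width on the line.
    """
    if not glyphs:
        return ""
    glyphs = sorted(glyphs, key=lambda glyph: glyph[1])
    average_width = sum(glyph[2] for glyph in glyphs) / len(glyphs)
    text = glyphs[0][0]
    for previous, current in zip(glyphs, glyphs[1:]):
        gap = current[1] - (previous[1] + previous[2])
        if gap > gap_ratio * average_width:
            text += " "
        text += current[0]
    return text


# Five glyphs with one wide gap between "R" and "m" (hypothetical positions).
line = [("O", 0, 10), ("C", 11, 10), ("R", 22, 10), ("m", 40, 10), ("y", 51, 10)]
print(join_glyphs(line))
```

A threshold that is too low splits words apart and one that is too high merges them, which is one reason different viewers can extract different text from the same file.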

Ghostscript also imposes some limitations:

• PDFs containing JBIG2-encoded content will be converted to CCITT Group4 encoding, which has lower compression ratios, if Ghostscript PDF/A is enabled.

• PDFs containing JPEG 2000-encoded content will be converted to JPEG encoding, which may introduce compression artifacts, if Ghostscript PDF/A is enabled.

• Ghostscript may transcode grayscale and color images, either lossy to lossless or lossless to lossy, based on an internal algorithm. This behavior can be suppressed by setting --pdfa-image-compression to jpeg or lossless to set all images to one type or the other. Ghostscript has no option to maintain the input image’s format.

OCRmyPDF is currently not designed to be used as a Python API; it is designed to be run as a command line tool.

import ocrmypdf currently attempts to process the command line on sys.argv at import time so it has side effects that will interfere with its use as a package. The API it presents should not be considered stable.
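Since the command line is the supported interface, a script can call it in a child process instead of importing the package. The following is a minimal sketch under that assumption: the helper names are hypothetical, the file names are placeholders, and repeating `-l` once per language is an assumption to verify against `ocrmypdf --help`.

```python
import subprocess
from typing import List, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = (),
    output_type: str = "pdfa",
) -> List[str]:
    """Assemble an argument list for the ocrmypdf command line tool."""
    command = ["ocrmypdf", "--output-type", output_type]
    for language in languages:
        command += ["-l", language]
    command += [input_pdf, output_pdf]
    return command


def run_ocrmypdf(command: List[str]) -> int:
    """Run the tool in a child process and return its exit code."""
    return subprocess.run(command).returncode


command = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", languages=["deu"])
print(command)
# run_ocrmypdf(command)  # uncomment on a system where ocrmypdf is installed
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.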

Similar programs

To the author’s knowledge, OCRmyPDF is the most feature-rich and thoroughly tested command line OCR PDF conversion tool. If it does not meet your needs, contributions and suggestions are welcome. Alternatively, consider one of these similar open source programs:

• pdf2pdfocr

• pdfsandwich

• pypdfocr



• pdfbeads

If you are looking for a micro web-frontend for OCRmyPDF, consider the third-party OCRmyPDF-web. Bear in mind that OCRmyPDF is not designed to be secure against malware-bearing PDFs (see Using OCRmyPDF online).


Chapter 2. Release notes

OCRmyPDF uses semantic versioning for its command line interface.

The OCRmyPDF package itself does not contain a public API, although it is fairly stable and breaking changes are usually timed with a major release. A future release will clearly define the stable public API.

v5.2

• When using Tesseract 3.05.01 or newer, OCRmyPDF will select the “sandwich” PDF renderer by default, unless another PDF renderer is specified with the --pdf-renderer argument. The previous behavior was to select --pdf-renderer=hocr.

• The “tesseract” PDF renderer is now deprecated, since it can cause problems with Ghostscript on Tesseract 3.05.00

• The “tess4” PDF renderer has been renamed to “sandwich”. “tess4” is now a deprecated alias for “sandwich”.

v5.1

• Files with pages larger than 200” (5080 mm) in either dimension are now supported with --output-type=pdf with the page size preserved (in the PDF specification this feature is called UserUnit scaling). Due to Ghostscript limitations this is not available in conjunction with PDF/A output.
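The arithmetic behind UserUnit scaling can be sketched as follows. PDF user space defaults to 1/72 inch per unit, so 200 inches corresponds to 14400 units; a UserUnit greater than 1 makes each unit larger so that a bigger page still fits within that range. The function names are hypothetical, and restricting UserUnit to whole numbers is a simplification for this example rather than a requirement of the format or a description of OCRmyPDF's implementation.

```python
import math

POINTS_PER_INCH = 72    # default PDF user space unit is 1/72 inch
MAX_PAGE_UNITS = 14400  # 200 inches * 72 units per inch


def user_unit_for(width_in: float, height_in: float) -> int:
    """Smallest whole-number UserUnit that keeps both dimensions within the limit."""
    largest_units = max(width_in, height_in) * POINTS_PER_INCH
    return max(1, math.ceil(largest_units / MAX_PAGE_UNITS))


def scaled_page_units(width_in: float, height_in: float):
    """Page size in user space units after applying UserUnit, plus the UserUnit."""
    unit = user_unit_for(width_in, height_in)
    return (
        width_in * POINTS_PER_INCH / unit,
        height_in * POINTS_PER_INCH / unit,
        unit,
    )


print(user_unit_for(8.5, 11))      # an ordinary page needs no scaling
print(scaled_page_units(36, 500))  # a 500-inch-long plot (hypothetical)
```

A viewer multiplies the stored page dimensions by UserUnit, so the physical page size is preserved even though the numbers in the file stay within the limit.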

v5.0.1

• Fixed issue #169, exception due to failure to create sidecar text files on some versions of Tesseract 3.04, including the jbarlow83/ocrmypdf Docker image



v5.0

• Backward incompatible changes

– Support for Python 3.4 dropped. Python 3.5 is now required.

– Support for Tesseract 3.02 and 3.03 dropped. Tesseract 3.04 or newer is required. Tesseract 4.00 (alpha) is supported.

– The OCRmyPDF.sh script was removed.

• Add a new feature, --sidecar, which allows creating “sidecar” text files which contain the OCR results in plain text. This OCR text is more reliable than extracting text from PDFs. Closes #126.

• New feature: --pdfa-image-compression, which allows overriding Ghostscript’s lossy-or-lossless image encoding heuristic and making all images JPEG encoded or lossless encoded as desired. Fixes #163.

• Fixed issue #143, added --quiet to suppress “INFO” messages

• Fixed issue #164, a typo

• Removed the command line parameters -n and --just-print since they have not worked for some time (reported as Ubuntu bug #1687308)

v4.5.6

• Fixed issue #156, ‘NoneType’ object has no attribute ‘getObject’ on pages with no optional /Contents record. This should resolve all issues related to pages with no /Contents record.

• Fixed issue #158, ocrmypdf now stops and terminates if Ghostscript fails on an intermediate step, as it is not possible to proceed.

• Fixed issue #160, exception thrown on certain invalid arguments instead of error message

v4.5.5

• Automated update of macOS homebrew tap

• Fixed issue #154, KeyError ‘/Contents’ when searching for text on blank pages that have no /Contents record. Note: incomplete fix for this issue.

v4.5.4

• Fix --skip-big raising an exception if a page contains no images (#152) (thanks to @TomRaz)

• Fix an issue where pages with no images might trigger “cannot write mode P as JPEG” (#151)

v4.5.3

• Added a workaround for an issue where Ghostscript 9.21 and probably earlier versions would fail with the error message “VMerror -25”, due to a Ghostscript bug in XMP metadata handling



• High Unicode characters (U+10000 and up) are no longer accepted for setting metadata on the command line, as Ghostscript may not handle them correctly.

• Fixed an issue where the tess4 renderer would duplicate content onto output pages if tesseract failed or timed out

• Fixed tess4 renderer not recognized when lossless reconstruction is possible
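The restriction on high Unicode characters in metadata noted above can be pictured with a small check like the one below. This is only a sketch of the idea, with a hypothetical function name and sample strings; it is not OCRmyPDF's actual validation code.

```python
def has_high_unicode(text: str) -> bool:
    """True if text contains any code point at or above U+10000."""
    return any(ord(character) >= 0x10000 for character in text)


# Hypothetical metadata values a user might pass on the command line.
print(has_high_unicode("Jahresbericht 2017"))
print(has_high_unicode("Report \U0001F4C4"))
```

Accented letters and most CJK characters sit below U+10000, so ordinary titles and author names are unaffected by such a check.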

v4.5.2

• Fix issue #147. --pdf-renderer tess4 --clean will produce an oversized page containing the original image in the bottom left corner, due to loss of DPI information.

• Make “using Tesseract 4.0” warning less ominous

• Set up machinery for homebrew OCRmyPDF tap

v4.5.1

• Fix issue #137, proportions of images with a non-square pixel aspect ratio would be distorted in output for --force-ocr and some other combinations of flags
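To see why a non-square pixel aspect ratio needs special care, the sketch below computes the physical size of an image whose horizontal and vertical resolutions differ, and the pixel dimensions it would have after both axes are brought to the higher resolution. The function names and the sample scan are assumptions for illustration, not the code that fixed this issue.

```python
def page_size_inches(width_px, height_px, xdpi, ydpi):
    """Physical size of an image whose axes have different resolutions."""
    return (width_px / xdpi, height_px / ydpi)


def resample_to_square_pixels(width_px, height_px, xdpi, ydpi):
    """Pixel dimensions after bringing both axes up to the higher resolution."""
    target_dpi = max(xdpi, ydpi)
    new_width = round(width_px * target_dpi / xdpi)
    new_height = round(height_px * target_dpi / ydpi)
    return (new_width, new_height, target_dpi)


# A fax-like scan: 1700 x 1100 pixels at 200 x 100 DPI (hypothetical values).
print(page_size_inches(1700, 1100, 200, 100))
print(resample_to_square_pixels(1700, 1100, 200, 100))
```

Treating such an image as if it had a single DPI value would either squash or stretch the page, which is the kind of distortion the fix addresses.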

v4.5

• Exotic PDFs containing “Form XObjects” are now supported (issue #134; PDF reference manual 8.10), and images they contain are taken into account when determining the resolution for rasterizing

• The Tesseract 4 Docker image no longer includes all languages, because it took so long to build that something would tend to fail

• OCRmyPDF now warns about using --pdf-renderer tesseract with Tesseract 3.04 or lower due to issues with Ghostscript corrupting the OCR text in these cases

v4.4.2

• The Docker images (ocrmypdf, ocrmypdf-polyglot, ocrmypdf-tess4) are now based on Ubuntu 16.10 instead of Debian stretch

– This makes supporting the Tesseract 4 image easier

– This could be a disruptive change for any Docker users who customized these images with their own changes, and made those changes in a way that depends on Debian and not Ubuntu

• OCRmyPDF now prevents running the Tesseract 4 renderer with Tesseract 3.04, which was permitted in v4.4 and v4.4.1 but will not work

v4.4.1

• To prevent a TIFF output error caused by img2pdf >= 0.2.1 and Pillow <= 3.4.2, dependencies have been tightened



• The Tesseract 4.00 simultaneous process limit was increased from 1 to 2, since it was observed that 1 lowers performance

• Documentation improvements to describe the --tesseract-config feature

• Added test cases and fixed error handling for --tesseract-config

• Tweaks to setup.py to deal with issues in the v4.4 release

v4.4

• Tesseract 4.00 is now supported on an experimental basis.

– A new rendering option --pdf-renderer tess4 exploits Tesseract 4’s new text-only output PDF mode. See the documentation on PDF Renderers for details.

– The --tesseract-oem argument allows control over the Tesseract 4 OCR engine mode (tesseract’s --oem). Use --tesseract-oem 2 to enforce the new LSTM mode.

– Fixed poor performance with Tesseract 4.00 on Linux

• Fixed an issue that caused corruption of output to stdout in some cases

• Removed test for Pillow JPEG and PNG support, as the minimum supported version of Pillow now enforces this

• OCRmyPDF now tests that the intended destination file is writable before proceeding

• The test suite now requires pytest-helpers-namespace to run (but not install)

• Significant code reorganization to make OCRmyPDF re-entrant and improve performance. All changes should be backward compatible for the v4.x series.

– However, OCRmyPDF’s dependency “ruffus” is not re-entrant, so no Python API is available. Scripts should continue to use the command line interface.

v4.3.5

• Update documentation to confirm Python 3.6.0 compatibility. No code changes were needed, so many earlier versions are likely supported.

v4.3.4

• Fixed “decimal.InvalidOperation: quantize result has too many digits” for high DPI images

v4.3.3

• Fixed PDF/A creation with Ghostscript 9.20 properly

• Fixed an exception on inline stencil masks with a missing optional parameter



v4.3.2

• Fixed a PDF/A creation issue with Ghostscript 9.20 (note: this fix did not actually work)

v4.3.1

• Fixed an issue where pages produced by the “hocr” renderer after a Tesseract timeout would be rotated incorrectly if the input page was rotated with a /Rotate marker

• Fixed a file handle leak in LeptonicaErrorTrap that would cause a “too many open files” error for files around a hundred pages long when --deskew, --remove-background or other Leptonica-based image processing features were in use, depending on the system value of ulimit -n

• Ability to specify multiple languages for multilingual documents is now advertised in documentation

• Reduced the file sizes of some test resources

• Cleaned up debug output

• Tesseract caching in test cases is now more cautious about false cache hits and reproducing exact output, not that any problems were observed

v4.3

• New feature --remove-background to detect and erase the background of color and grayscale images

• Better documentation

• Fixed an issue with PDFs that draw images when the raster stack depth is zero

• ocrmypdf can now redirect its output to stdout for use in a shell pipeline

– This does not improve performance since temporary files are still used for buffering

– Some output validation is disabled in this mode

v4.2.5

• Fixed an issue (#100) with PDFs that omit the optional /BitsPerComponent parameter on images

• Removed non-free file milk.pdf

v4.2.4

• Fixed an error (#90) caused by PDFs that use stencil masks properly

• Fixed handling of PDFs that try to draw images or stencil masks without properly setting up the graphics state (such images are now ignored for the purposes of calculating DPI)



v4.2.3

• Fixed an issue with PDFs that store page rotation (/Rotate) in an indirect object

• Integrated a few fixes to simplify downstream packaging (Debian)

– The test suite no longer assumes it is installed

– If running Linux, skip a test that passes Unicode on the command line

• Added a test case to check explicit masks and stencil masks

• Added a test case for indirect objects and linearized PDFs

• Deprecated the OCRmyPDF.sh shell script

v4.2.2

• Improvements to documentation

v4.2.1

• Fixed an issue where PDF pages that contained stencil masks would report an incorrect DPI and cause Ghostscript to abort

• Implemented stdin streaming

v4.2

• ocrmypdf will now try to convert single image files to PDFs if they are provided as input (#15)

– This is a basic convenience feature. It only supports a single image and always makes the image fill the whole page.

– For better control over image to PDF conversion, use img2pdf (one of ocrmypdf’s dependencies)

• New argument --output-type {pdf|pdfa} allows disabling Ghostscript PDF/A generation

– pdfa is the default, consistent with past behavior

– pdf provides a workaround for users concerned about the increase in file size from Ghostscript forcing JBIG2 images to CCITT and transcoding JPEGs

– pdf preserves as much as it can about the original file, including problems that PDF/A conversion fixes

• PDFs containing images with “non-square” pixel aspect ratios, such as 200x100 DPI, are now handled and converted properly (fixing a bug that caused them to be cropped)

• --force-ocr rasterizes pages even if they contain no images

– supports users who want to use OCRmyPDF to reconstruct text information in PDFs with damaged Unicode maps (copy and paste text does not match displayed text)

– supports reinterpreting PDFs where text was rendered as curves for printing, and text needs to be recovered

– fixes issue #82



• Fixes an issue where, with certain settings, monochrome images in PDFs would be converted to 8-bit grayscale, increasing file size (#79)

• Support for Ubuntu 12.04 LTS “precise” has been dropped in favor of (roughly) Ubuntu 14.04 LTS “trusty”

– Some Ubuntu “PPAs” (backports) are needed to make it work

• Support for some older dependencies dropped

– Ghostscript 9.15 or later is now required (available in Ubuntu trusty with backports)

– Tesseract 3.03 or later is now required (available in Ubuntu trusty)

• Ghostscript now runs in “safer” mode where possible

v4.1.4

• Bug fix: monochrome images with an ICC profile attached were incorrectly converted to full color images if lossless reconstruction was not possible due to other settings; consequence was increased file size for these images

v4.1.3

• More helpful error message for PDFs with version 4 security handler

• Update usage instructions for Windows/Docker users

• Fix order of operations for matrix multiplication (no effect on most users)

• Add a few leptonica wrapper functions (no effect on most users)

v4.1.2

• Replace IEC sRGB ICC profile with Debian’s sRGB (from icc-profiles-free) which is more compatible with the MIT license

• More helpful error message for an error related to certain types of malformed PDFs

v4.1

• --rotate-pages now only rotates pages when reasonably confident in the orientation. This behavior can be adjusted with the new argument --rotate-pages-threshold

• Fixed problems in error checking if unpaper is uninstalled or missing at run-time

• Fixed problems with “RethrownJobError” errors during error handling that suppressed the useful error messages

v4.0.7

• Minor correction to Ghostscript output settings



v4.0.6

• Update install instructions

• Provide a sRGB profile instead of using Ghostscript’s

v4.0.5

• Remove some verbose debug messages from v4.0.4

• Fixed a temporary file that wasn’t being deleted

• DPI is now calculated correctly for cropped images, along with other image transformations

• Inline images are now checked during DPI calculation instead of rejecting the image

v4.0.4

Released with verbose debug messages turned on. Do not use. Skip to v4.0.5.

v4.0.3

New features

• Page orientations detected are now reported in a summary comment

Fixes

• Show stack trace if unexpected errors occur

• Treat “too few characters” error message from Tesseract as a reason to skip that page rather than abort the file

• Docker: fix blank JPEG2000 issue by insisting on Ghostscript versions that have this fixed

v4.0.2

Fixes

• Fixed compatibility with Tesseract 3.04.01 release, particularly its different way of outputting orientation information

• Improved handling of Tesseract errors and crashes

• Fixed use of chmod on Docker that broke most test cases



v4.0.1

Fixes

• Fixed a KeyError if tesseract fails to find page orientation information

v4.0

New features

• Automatic page rotation (-r) is now available. It ignores any prior rotation information on PDFs and sets rotation based on the dominant orientation of detectable text. This feature is fairly reliable but some false positives occur, especially if there is not much text to work with. (#4)

• Deskewing is now performed using Leptonica instead of unpaper. Leptonica is faster and more reliable at image deskewing than unpaper.

Fixes

• Fixed an issue where lossless reconstruction could cause some pages to appear incorrectly if the page was rotated by the user in Acrobat after being scanned (specifically if it has a /Rotate tag)

• Fixed an issue where lossless reconstruction could misalign the graphics layer with respect to text layer if the page had been cropped such that its origin is not (0, 0) (#49)

Changes

• Logging output is now much easier to read

• --deskew is now performed by Leptonica instead of unpaper (#25)

• libffi is now required

• Some changes were made to the Docker and Travis build environments to support libffi

• --pdf-renderer=tesseract now displays a warning if the Tesseract version is less than 3.04.01, the planned release that will include fixes to an important OCR text rendering bug in Tesseract 3.04.00. You can also manually install ./share/sharp2.ttf on top of pdf.ttf in your Tesseract tessdata folder to correct the problem.

v3.2.1

Changes

• Fixed issue #47 “convert() got an unexpected keyword argument ‘dpi’” by upgrading to img2pdf 0.2

• Tweaked the Dockerfiles



v3.2

New features

• Lossless reconstruction: when possible, OCRmyPDF will inject text layers without otherwise manipulating the content and layout of a PDF page. For example, a PDF containing a mix of vector and raster content would see the vector content preserved. Images may still be transcoded during PDF/A conversion. (--deskew and --clean-final disable this mode, necessarily.)

• New argument --tesseract-pagesegmode allows you to pass page segmentation arguments to Tesseract OCR. This helps for two column text and other situations that confuse Tesseract.

• Added a new “polyglot” version of the Docker image, which includes Tesseract with all language packs installed, for the polyglots among us. It is much larger.

Changes

• JPEG transcoding quality is now 95 instead of the default 75. Bigger file sizes for less degradation.

v3.1.1

Changes

• Fixed bug that caused incorrect page size and DPI calculations on documents with mixed page sizes

v3.1

Changes

• Default output format is now PDF/A-2b instead of PDF/A-1b

• Python 3.5 and macOS El Capitan are now supported platforms - no changes were needed to implement support

• Improved some error messages related to missing input files

• Fixed issue #20 - uppercase .PDF extension not accepted

• Fixed an issue where OCRmyPDF failed to detect that certain pages contained previously OCR’ed text, such as OCR text produced by Tesseract 3.04

• Inserts /Creator tag into PDFs so that errors can be traced back to this project

• Added new option --pdf-renderer=auto, to let OCRmyPDF pick the best PDF renderer. Currently it always chooses the ‘hocrtransform’ renderer but that behavior may change.

• Set up Travis CI automatic integration testing



v3.0

New features

• Easier installation with a Docker container or Python’s pip package manager

• Eliminated many external dependencies, so it’s easier to setup

• Now installs ocrmypdf to /usr/local/bin or equivalent for system-wide access and easier typing

• Improved command line syntax and usage help (--help)

• Tesseract 3.03+ PDF page rendering can be used instead for better positioning of recognized text (--pdf-renderer tesseract)

• PDF metadata (title, author, keywords) are now transferred to the output PDF

• PDF metadata can also be set from the command line (--title, etc.)

• Automatically repairs malformed input PDFs if possible

• Added test cases to confirm everything is working

• Added option to skip extremely large pages that take too long to OCR and are often not OCRable (e.g. large scanned maps or diagrams); other pages are still processed (--skip-big)

• Added option to kill Tesseract OCR process if it seems to be taking too long on a page, while still processing other pages (--tesseract-timeout)

• Less common colorspaces (CMYK, palette) are now supported by conversion to RGB

• Multiple images on the same PDF page are now supported

Changes

• New, robust rewrite in Python 3.4+ with ruffus pipelines

• Now uses Ghostscript 9.14’s improved color conversion model to preserve PDF colors

• OCR text is now rendered in the PDF as invisible text. Previous versions of OCRmyPDF incorrectly rendered visible text with an image on top.

• All “tasks” in the pipeline can be executed in parallel on any available CPUs, increasing performance

• The -o DPI argument has been phased out, in favor of --oversample DPI, in case we need -o OUTPUTFILE in the future

• Removed several dependencies, so it’s easier to install. We no longer use:

– GNU parallel

– ImageMagick

– Python 2.7

– Poppler

– MuPDF tools

– shell scripts

– Java and JHOVE

– libxml2



• Some new external dependencies are required or optional, compared to v2.x:

– Ghostscript 9.14+

– qpdf 5.0.0+

– Unpaper 6.1 (optional)

– some automatically managed Python packages

Release candidates

• rc9:

– fix issue #118: report error if ghostscript iccprofiles are missing

– fixed another issue related to #111: PDF rasterized to palette file

– add support for image files with a palette

– don’t try to validate PDF file after an exception occurs

• rc8:

– fix issue #111: exception thrown if PDF is missing DocumentInfo dictionary

• rc7:

– fix error when installing direct from pip, “no such file ‘requirements.txt”’

• rc6:

– dropped libxml2 (Python lxml) since Python 3’s internal XML parser is sufficient

– set up Docker container

– fix Unicode errors if recognized text contains Unicode characters and system locale is not UTF-8

• rc5:

– dropped Java and JHOVE in favour of qpdf

– improved command line error output

– additional tests and bug fixes

– tested on Ubuntu 14.04 LTS

• rc4:

– dropped MuPDF in favour of qpdf

– fixed some installer issues and errors in installation instructions

– improve performance: run Ghostscript with multithreaded rendering

– improve performance: use multiple cores by default

– bug fix: checking for wrong exception on process timeout

• rc3: skipping version number intentionally to avoid confusion with Tesseract

• rc2: first release for public testing to test-PyPI, Github

• rc1: testing release process



Compatibility notes

• ./OCRmyPDF.sh script is still available for now

• Stacking the verbosity option like -vvv is no longer supported

• The configuration file config.sh has been removed. Instead, you can feed a file to the arguments for common settings:

    ocrmypdf input.pdf output.pdf @settings.txt

  where settings.txt contains one argument per line, for example:

    -l
    deu
    --author
    A. Merkel
    --pdf-renderer
    tesseract
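The `@settings.txt` convention matches the behaviour of Python's standard `argparse` module when `fromfile_prefix_chars` is set. The sketch below uses a small stand-in parser to show how one argument per line is read, including a value that contains a space. That OCRmyPDF relies on this mechanism is an assumption, and the parser here is hypothetical, defining only a few of the options named in this section.

```python
import argparse
import os
import tempfile

# Stand-in parser with a few options named in this section; the real tool
# defines many more. fromfile_prefix_chars enables the @file syntax.
parser = argparse.ArgumentParser(prog="ocrmypdf-sketch", fromfile_prefix_chars="@")
parser.add_argument("-l", "--language", action="append")
parser.add_argument("--author")
parser.add_argument("--pdf-renderer", default="auto")
parser.add_argument("input_file")
parser.add_argument("output_file")

# One argument per line; a value containing spaces stays on a single line.
settings = "-l\ndeu\n--author\nA. Merkel\n--pdf-renderer\ntesseract\n"

with tempfile.TemporaryDirectory() as folder:
    settings_path = os.path.join(folder, "settings.txt")
    with open(settings_path, "w", encoding="utf-8") as handle:
        handle.write(settings)
    args = parser.parse_args(["input.pdf", "output.pdf", "@" + settings_path])

print(args.language, args.author, args.pdf_renderer)
```

Because each line is taken as exactly one argument, an option and its value must be on separate lines, and no shell quoting is needed around `A. Merkel`.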

Fixes

• Handling of filenames containing spaces: fixed

Notes and known issues

• Some dependencies may work with lower versions than tested, so try overriding dependencies if they are “in the way” to see if they work.

• --pdf-renderer tesseract will output files with an incorrect page size in Tesseract 3.03, due to a bug in Tesseract.

• PDF files containing “inline images” are not supported and won’t be for the 3.0 release. Scanned images almost never contain inline images.

v2.2-stable (2014-09-29)

OCRmyPDF versions 1 and 2 were implemented as shell scripts. OCRmyPDF 3.0+ is a fork that gradually replaced all shell scripts with Python while maintaining the existing command line arguments. No one is maintaining old versions.

For details on older versions, see the final version of its release notes.




Chapter 3. Installation

OCRmyPDF requires Python 3.5 (or newer) and Tesseract 3.04 (or newer).

Installing on Debian and Ubuntu 16.10 or newer

Users of Debian 9 (“stretch”) or later or Ubuntu 16.10 or later may simply run:

    apt-get install ocrmypdf

Installing on macOS

A Homebrew tap is available for macOS:

    brew tap jbarlow83/ocrmypdf
    brew install ocrmypdf

Warning: Users who previously installed OCRmyPDF on macOS using pip install ocrmypdf should remove the pip version (pip3 uninstall ocrmypdf) before switching to the Homebrew version.

Installing the Docker image

For many users, installing the Docker image will be easier than installing all of OCRmyPDF’s dependencies. For Windows, it is the only option.

If you have Docker installed on your system, you can install a Docker image of the latest release.



Follow the Docker installation instructions for your platform. If you can run this command successfully, your system is ready to download and execute the image:

    docker run hello-world

OCRmyPDF will use all available CPU cores. By default, the VirtualBox machine instance on Windows and macOS has only a single CPU core enabled. Use the VirtualBox Manager to determine the name of your Docker engine host, and then follow these optional steps to enable multiple CPUs:

    # Optional step for Mac OS X users
    docker-machine stop "yourVM"
    VBoxManage modifyvm "yourVM" --cpus 2  # or whatever number of cores is desired
    docker-machine start "yourVM"
    eval $(docker-machine env "yourVM")

Assuming you have a Docker engine running, you can download one of the three available images:

Image name          Download command                           Notes
ocrmypdf            docker pull jbarlow83/ocrmypdf             Latest ocrmypdf with Tesseract 3.04. Includes English, French, German, Spanish.
ocrmypdf-polyglot   docker pull jbarlow83/ocrmypdf-polyglot    As above, with all available language packs.
ocrmypdf-tess4      docker pull jbarlow83/ocrmypdf-tess4       Latest ocrmypdf with Tesseract 4.00.00alpha and English, French, German, Spanish, Portuguese, Chinese Simplified, Arabic and Russian (the top 8).

For example:

    docker pull jbarlow83/ocrmypdf-tess4

Then tag it to give a more convenient name, just ocrmypdf:

    docker tag jbarlow83/ocrmypdf-tess4 ocrmypdf

The alternative “polyglot” image provides all available language packs.

You can then run ocrmypdf using the command:

    docker run --rm ocrmypdf --help

To run OCRmyPDF on a local file, you must provide a writable volume to the Docker image, and both the input and output file must be inside the writable volume. This example command uses the current working directory as the writable volume:

    docker run --rm -v "$(pwd):/home/docker" <other docker arguments> ocrmypdf <your arguments to ocrmypdf>

In this worked example, the current working directory contains an input file called test.pdf and the output will go to output.pdf:

    docker run --rm -v "$(pwd):/home/docker" ocrmypdf --skip-text test.pdf output.pdf

Note: The working directory should be a writable local volume or Docker may not have permission to access it.
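As a rough sketch of how the parts of the docker run command fit together (Docker arguments first, then the image name, then the ocrmypdf arguments), the following Python helper assembles the argument list. The function name docker_ocrmypdf_command and its parameters are hypothetical and exist only for illustration; the image name ocrmypdf assumes the image was tagged as described above.

```python
import os


def docker_ocrmypdf_command(ocr_args, workdir=None, image="ocrmypdf"):
    """Build a `docker run` argument list that mounts workdir as /home/docker.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    workdir = os.path.abspath(workdir or os.getcwd())
    return [
        "docker", "run", "--rm",
        "-v", workdir + ":/home/docker",  # Docker arguments come first
        image,                            # then the image name
        *ocr_args,                        # then the arguments for ocrmypdf
    ]
```

The resulting list can be passed to subprocess.run(); the input and output file names are interpreted relative to the mounted directory.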



Note that ocrmypdf has its own separate -v VERBOSITYLEVEL argument to control debug verbosity. All Docker arguments should be listed before the ocrmypdf image name and all arguments to ocrmypdf should be listed after it.

For convenience, a shell alias can hide the docker command:

    alias ocrmypdf='docker run --rm -v "$(pwd):/home/docker" ocrmypdf'
    ocrmypdf --version  # runs docker version

Or in the wonderful fish shell:

    alias ocrmypdf 'docker run --rm -v (pwd):/home/docker ocrmypdf'
    funcsave ocrmypdf

Manual installation on macOS

These instructions probably work on all macOS supported by Homebrew.

If it’s not already present, install Homebrew.

Update Homebrew:

    brew update

Install or upgrade the required Homebrew packages, if any are missing:

    # image libraries
    brew install libpng openjpeg jbig2dec libtiff
    brew install qpdf
    brew install ghostscript
    brew install python3
    brew install libxml2 libffi leptonica
    brew install unpaper  # optional

Python 3.5 and 3.6 are supported.

Install the required Tesseract OCR engine with the language packs you plan to use:

    brew install tesseract                       # Option 1: for English, French, German, Spanish
    brew install tesseract --with-all-languages  # Option 2: for all language packs

Update the Homebrew pip and install Pillow:

    pip3 install --upgrade pip
    pip3 install --upgrade pillow

You can then install OCRmyPDF from PyPI:

    pip3 install ocrmypdf

The command line program should now be available:

    ocrmypdf --help



Installing on Ubuntu 16.04 LTS

No package is currently available for Ubuntu 16.04, but you can install the dependencies manually:

    sudo apt-get update
    sudo apt-get install \
        unpaper \
        ghostscript \
        tesseract-ocr \
        qpdf \
        python3-pip \
        python3-cffi

If you wish to install OCRmyPDF to the system Python, then install as follows (note this installs new packages into your system Python, which could interfere with other programs):

    sudo pip3 install ocrmypdf

If you wish to install OCRmyPDF to a virtual environment to isolate the system Python, you can follow these steps:

    python3 -m venv venv-ocrmypdf
    source venv-ocrmypdf/bin/activate
    pip3 install ocrmypdf

Installing on Ubuntu 14.04 LTS

Installing on Ubuntu 14.04 LTS (trusty) is more difficult than some other options, because of bugs in Python package installation and because OCRmyPDF depends on some packages newer than are available in the main distribution.

Add new “apt” repositories needed for backports of Ghostscript 9.16, libav-11 (for unpaper 6.1) and Tesseract 4.00 (alpha). This will replace Ghostscript and Tesseract 3.x on your system. If you prefer not to modify your system in this manner, consider using a Docker container.

    sudo add-apt-repository ppa:vshn/ghostscript -y
    sudo add-apt-repository ppa:heyarje/libav-11 -y
    sudo add-apt-repository ppa:alex-p/tesseract-ocr

Update apt-get:

    sudo apt-get update

Install system dependencies:

    sudo apt-get install \
        software-properties-common python-software-properties \
        zlib1g-dev \
        libjpeg-dev \
        libffi-dev \
        libavformat56 libavcodec56 libavutil54 \
        ghostscript \
        qpdf \
        python3-pip \
        python3-pil \
        python3-pytest \
        python3-reportlab \
        python3-wheel \
        python3-venv \
        tesseract-ocr \
        tesseract-ocr-eng

If you wish to install OCRmyPDF to the system Python, then install as follows (note this installs new packages into your system Python, which could interfere with other programs):

    sudo pip3 install ocrmypdf

If you wish to install OCRmyPDF to a virtual environment to isolate the system Python, you can follow these steps.

This includes a workaround for a known, unresolved issue in Ubuntu 14.04’s ensurepip package:

    sudo apt-get install python3-venv
    python3 -m venv venv-ocrmypdf --without-pip
    source venv-ocrmypdf/bin/activate
    wget -O - -o /dev/null https://bootstrap.pypa.io/get-pip.py | python
    deactivate
    python3 -m venv --system-site-packages venv-ocrmypdf
    source venv-ocrmypdf/bin/activate
    pip install ocrmypdf

These installation instructions omit the optional dependency unpaper, which is only available at version 0.4.2 in Ubuntu 14.04. The author could not find a backport of unpaper, and created a .deb package to do the job of installing unpaper 6.1 (for x86 64-bit only):

    wget -q 'https://www.dropbox.com/s/vaq0kbwi6e6au80/unpaper_6.1-1.deb?raw=1' -O unpaper_6.1-1.deb
    sudo dpkg -i unpaper_6.1-1.deb

Installing on ArchLinux

The author is aware of an ArchLinux package for ocrmypdf. It seems like the following command might work.

    pacman -S ocrmypdf

Installing on Windows

Direct installation on Windows is not possible. Install the Docker container as described above. Ensure that your command prompt can run the docker “hello world” container.

Running on Windows

The command line syntax to run ocrmypdf from a command prompt will resemble:

    docker run -v /c/Users/sampleuser:/home/docker ocrmypdf --skip-text test.pdf output.pdf

where /c/Users/sampleuser is a Unix representation of the Windows path C:\Users\sampleuser, assuming a user named “sampleuser” is running ocrmypdf on a file in their home directory, and the files “test.pdf” and “output.pdf” are in the sampleuser folder. The Windows user must have read and write permissions.
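The path translation described here (drive letter lowered and moved into the path, backslashes turned into forward slashes) can be sketched in Python. The helper name docker_volume_path is hypothetical, and the sketch assumes an ordinary drive-letter path; network (UNC) paths are rejected.

```python
from pathlib import PureWindowsPath


def docker_volume_path(windows_path):
    """Return the Unix-style form of a Windows drive-letter path."""
    path = PureWindowsPath(windows_path)
    if len(path.drive) != 2:  # expect something like "C:"
        raise ValueError("expected a path that starts with a drive letter")
    drive_letter = path.drive[0].lower()
    # parts[0] is the drive anchor; the remaining parts are folder names
    return "/" + "/".join([drive_letter, *path.parts[1:]])
```

The result can be used on the left-hand side of the -v argument, followed by :/home/docker.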



Bash on Ubuntu on Windows should also be a viable route for running the OCRmyPDF Docker container.

Installing HEAD revision from sources

If you have git and Python 3.5 or newer installed, you can install from source. When the pip installer runs, it will alert you if dependencies are missing.

To install the HEAD revision from sources in the current Python 3 environment:

    pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git

Or, to install in development mode, allowing customization of OCRmyPDF, use the -e flag:

    pip3 install -e git+https://github.com/jbarlow83/OCRmyPDF.git

On certain Linux distributions such as Ubuntu, you may need to run the install command as superuser:

    sudo pip3 install [-e] git+https://github.com/jbarlow83/OCRmyPDF.git

Note that this will alter your system’s Python distribution. If you prefer to not install as superuser, you can install the package in a Python virtual environment:

    git clone -b master https://github.com/jbarlow83/OCRmyPDF.git
    python3 -m venv venv
    source venv/bin/activate
    cd OCRmyPDF
    pip3 install .

However, ocrmypdf will only be accessible on the system PATH after you activate the virtual environment.

To run the program:

    ocrmypdf --help

If dependencies are not yet installed, the script will notify you about those that need to be installed. The script requires specific versions of the dependencies. Versions older than the ones mentioned in the release notes are unlikely to be compatible with OCRmyPDF.
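Before running the installer, it can be handy to check which external programs are on the PATH at all. The sketch below is not OCRmyPDF's own dependency check and does not verify versions; the program names are those used elsewhere in this manual, and the helper name missing_programs is hypothetical.

```python
import shutil

REQUIRED_PROGRAMS = ("tesseract", "gs", "qpdf")
OPTIONAL_PROGRAMS = ("unpaper",)


def missing_programs(names, which=shutil.which):
    """Return the names that cannot be found on the PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in names if which(name) is None]
```

Calling missing_programs(REQUIRED_PROGRAMS) returns an empty list when every required program was found.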


CHAPTER

4

Installing additional language packs

OCRmyPDF uses Tesseract for OCR, and relies on its language packs for languages other than English.

Tesseract supports most languages.

For Linux users, you can often find packages that provide language packs:

Debian and Ubuntu users

    # Display a list of all Tesseract language packs
    apt-cache search tesseract-ocr

    # Debian/Ubuntu users
    apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for.

Multiple languages can be requested using either -l eng+fre (English and French) or -l eng -l fre.
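The two equivalent spellings of the language option can be illustrated with a small Python helper that builds the argument list either way. The name language_args and its combined parameter are hypothetical, chosen for this sketch only.

```python
def language_args(languages, combined=True):
    """Build ocrmypdf language arguments from a list of language codes."""
    if not languages:
        return []
    if combined:
        # Single option, codes joined with "+": -l eng+fre
        return ["-l", "+".join(languages)]
    # Repeated option: -l eng -l fre
    args = []
    for code in languages:
        args += ["-l", code]
    return args
```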

macOS users

You can install additional language packs by installing Tesseract using Homebrew with all language packs.

Docker users

Users of the Docker image may use the alternative “polyglot” container which includes all languages.



Known limitations

As of v4.2, users of ocrmypdf working with languages outside the Latin alphabet should use the following syntax:

    ocrmypdf -l eng+gre --output-type pdf --pdf-renderer tesseract

The reasons for this are:

• The latest version of Ghostscript (9.19 as of this writing) has unfixed bugs in Unicode handling that generate invalid character maps, so Ghostscript cannot be used for PDF/A conversion

• The default “hocr” PDF renderer does not handle Asian fonts properly


CHAPTER

5

Cookbook

Basic examples

Help!

ocrmypdf has built-in help.

ocrmypdf --help

Add an OCR layer and convert to PDF/A

ocrmypdf input.pdf output.pdf

Add an OCR layer and output a standard PDF

ocrmypdf --output-type pdf input.pdf output.pdf

Create a PDF/A with all color and grayscale images converted to JPEG

ocrmypdf --output-type pdfa --pdfa-image-compression jpeg input.pdf output.pdf

Modify a file in place

The file will only be overwritten if OCRmyPDF is successful.



ocrmypdf myfile.pdf myfile.pdf

Correct page rotation

OCRmyPDF will attempt to automatically correct the rotation of each page. This can help fix a scanning job that contains a mix of landscape and portrait pages.

ocrmypdf --rotate-pages myfile.pdf myfile.pdf

You can increase (decrease) the parameter --rotate-pages-threshold to make page rotation more (less) aggressive.

OCR languages other than English

By default OCRmyPDF assumes the document is English.

ocrmypdf -l fre LeParisien.pdf LeParisien.pdf

ocrmypdf -l eng+fre Bilingual-English-French.pdf Bilingual-English-French.pdf

Language packs must be installed for all languages specified. See Installing additional language packs.

Produce PDF and text file containing OCR text

This produces a file named “output.pdf” and a companion text file named “output.txt”. The pdftotext program from Poppler is used to extract text from the finished PDF.

ocrmypdf input.pdf - | tee output.pdf | pdftotext - output.txt

Note: To get pdftotext, Debian/Ubuntu users may apt-get install poppler-utils and macOS users may brew install poppler respectively.

OCR images, not PDFs

Use a program like img2pdf to convert your images to PDFs, and then pipe the results to ocrmypdf:

    img2pdf my-images*.jpg | ocrmypdf - myfile.pdf

img2pdf also has features to control the position of images on a page, if desired.

For convenience, OCRmyPDF can convert single images to PDFs on its own. If the resolution (dots per inch, DPI) of an image is not set or is incorrect, it can be overridden with --image-dpi. (As 1 inch is 2.54 cm, 1 dpi = 0.39 dpcm).

ocrmypdf --image-dpi 300 image.png myfile.pdf
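If a scanner reports resolution in dots per centimetre, the conversion mentioned above is a single division or multiplication by 2.54. A minimal sketch, with hypothetical function names:

```python
CM_PER_INCH = 2.54


def dpi_to_dpcm(dpi):
    """Convert dots per inch to dots per centimetre."""
    return dpi / CM_PER_INCH


def dpcm_to_dpi(dpcm):
    """Convert dots per centimetre to dots per inch, e.g. for --image-dpi."""
    return dpcm * CM_PER_INCH
```

For instance, an image scanned at roughly 118 dpcm corresponds to about 300 DPI.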

If you have multiple images, you must use img2pdf to convert the images to PDF.



Note: ImageMagick convert can also convert a group of images to PDF, but in the author’s experience it takes a long time, transcodes unnecessarily and gives poor results.

You can also use Tesseract 3.04+ directly to convert single page images or multi-page TIFFs to PDF:

    tesseract my-image.jpg output-prefix pdf

Image processing

OCRmyPDF can perform some image processing on each page of a PDF, if desired. The same processing is applied to each page. It is suggested that the user review files after image processing as these commands might remove desirable content, especially from poor quality scans.

• --rotate-pages attempts to determine the correct orientation for each page and rotates the page if necessary.

• --remove-background attempts to detect and remove a noisy background from grayscale or color images. Monochrome images are ignored. This should not be used on documents that contain color photos as it may remove them.

• --deskew will correct pages that were scanned at a skewed angle by rotating them back into place. Skew determination and correction is performed using Postl’s variance of line sums algorithm as implemented in Leptonica.

• --clean uses unpaper to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.

• --clean-final uses unpaper to clean up pages before OCR and inserts the page into the final output. You will want to review each page to ensure that unpaper did not remove something important.

Note: In many cases image processing will rasterize PDF pages as images, potentially losing quality.

Warning: --clean-final and --remove-background may leave undesirable visual artifacts in some images where their algorithms have shortcomings. Files should be visually reviewed after using these options.

OCR and correct document skew (crooked scan)

Deskew:

    ocrmypdf --deskew input.pdf output.pdf

Image processing commands can be combined. The order in which options are given does not matter. OCRmyPDF always applies the steps of the image processing pipeline in the same order (rotate, remove background, deskew, clean).

ocrmypdf --deskew --clean --rotate-pages input.pdf output.pdf
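To illustrate that the order of options on the command line does not affect the order of processing, the sketch below sorts a set of requested options into the fixed pipeline order stated above. This is an illustration only, not OCRmyPDF's implementation; the name effective_order is hypothetical, and --clean-final is left out because the text above does not say where it falls.

```python
# Fixed order stated above: rotate, remove background, deskew, clean
PIPELINE_ORDER = ("--rotate-pages", "--remove-background", "--deskew", "--clean")


def effective_order(options):
    """Return the requested image processing steps in pipeline order."""
    requested = set(options)
    return [step for step in PIPELINE_ORDER if step in requested]
```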



Improving OCR quality

The Image processing features can improve OCR quality.

Rotating pages and deskewing helps to ensure that the page orientation is correct before OCR begins. Removing the background and/or cleaning the page can also improve results. The --oversample DPI argument can be specified to resample images to higher resolution before attempting OCR; this can improve results as well.

OCR quality will suffer if the resolution of input images is not correct (since the range of pixel sizes that will be checked for possible fonts will also be incorrect).


CHAPTER

6

Advanced features

Control of OCR options

OCRmyPDF provides many features to control the behavior of the OCR engine, Tesseract.

When OCR is skipped

If a page in a PDF seems to have text, by default OCRmyPDF will exit without modifying the PDF. This is to ensure that PDFs that were previously OCRed or were “born digital” rather than scanned are not processed.

If --skip-text is issued, then no OCR will be performed on pages that already have text. The page will be copied to the output. This may be useful for documents that contain both “born digital” and scanned content, or to use OCRmyPDF to normalize documents and convert them to PDF/A regardless of their contents.

If --force-ocr is issued, then all pages will be rasterized to images, discarding any hidden OCR text, and rasterizing any printable text. This is useful for redoing OCR, for fixing OCR text with a damaged character map (text is selectable but not searchable), and destroying redacted information.
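The three behaviors described in this section can be summarized as a small decision function. This is a sketch of the rules as written above, not OCRmyPDF's code; the function name, the action labels, and the choice to reject both options at once are assumptions made for illustration.

```python
def page_action(has_text, skip_text=False, force_ocr=False):
    """Summarize what happens to a page, following the rules described above."""
    if skip_text and force_ocr:
        # Assumption of this sketch: treat the two options as exclusive
        raise ValueError("choose either skip_text or force_ocr, not both")
    if force_ocr:
        return "rasterize and OCR"     # any existing text is discarded
    if not has_text:
        return "OCR"                   # ordinary scanned page
    if skip_text:
        return "copy without OCR"      # page already has text
    return "exit without modifying"    # default when text is found
```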

Time and image size limits

By default, OCRmyPDF permits tesseract to run for only three minutes (180 seconds) per page. This is usually more than enough time to find all text on a reasonably sized page with modern hardware.

If a page is skipped, it will be inserted without OCR. If preprocessing was requested, the preprocessed image layer will be inserted.

If you want to adjust the amount of time spent on OCR, change --tesseract-timeout. You can also automatically skip images that exceed a certain number of megapixels with --skip-big. (A 300 DPI, 8.5×11” page is 8.4 megapixels.)

    # Allow 300 seconds for OCR; skip any page larger than 50 megapixels
    ocrmypdf --tesseract-timeout 300 --skip-big 50 bigfile.pdf output.pdf
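The megapixel figure quoted above follows from multiplying the page dimensions in pixels. A short sketch of the arithmetic, useful when choosing a --skip-big value (the function names are hypothetical):

```python
def page_megapixels(width_in, height_in, dpi):
    """Megapixels of a page of the given size in inches at the given DPI."""
    return (width_in * dpi) * (height_in * dpi) / 1e6


def exceeds_limit(width_in, height_in, dpi, limit_megapixels):
    """True if the page is larger than the chosen megapixel limit."""
    return page_megapixels(width_in, height_in, dpi) > limit_megapixels
```

A letter page at 300 DPI is 2550 × 3300 pixels, or about 8.4 megapixels, well below a limit of 50.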



Overriding default tesseract

OCRmyPDF checks the environment variable OCRMYPDF_TESSERACT for the full path to the tesseract executable first.

For example, if you are testing tesseract 4.00 and don’t wish to disturb your tesseract 3.04 installation, you can launch OCRmyPDF as follows:

    env \
        OCRMYPDF_TESSERACT=/home/user/src/tesseract4/api/tesseract \
        TESSDATA_PREFIX=/home/user/src/tesseract4 \
        ocrmypdf --tesseract-oem 2 input.pdf output.pdf

• TESSDATA_PREFIX directs tesseract 4.0 to use LSTM training data. This is a tesseract environment variable.

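• --tesseract-oem selects the OCR engine mode (Tesseract 4.0 only). The value 1 requests tesseract 4.0’s new LSTM engine on its own; the value 2 used in the example above combines the legacy and LSTM engines.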

Overriding other support programs

In addition to tesseract, OCRmyPDF uses the following external binaries:

• gs (Ghostscript)

• unpaper

• qpdf

In each case OCRmyPDF will check the environment variable OCRMYPDF_{program} before asking the system to find {program} on the PATH. For example, you could set OCRMYPDF_GS to override Ghostscript.
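The lookup order described above (environment variable first, then the PATH) can be sketched as follows. This is an illustration of the rule, not OCRmyPDF's code; the helper name find_program is hypothetical, and it assumes the variable name is formed by upper-casing the program name, as in OCRMYPDF_GS and OCRMYPDF_TESSERACT.

```python
import os
import shutil


def find_program(program, environ=None):
    """Return the executable to use: the override if set, else a PATH lookup."""
    environ = os.environ if environ is None else environ
    override = environ.get("OCRMYPDF_" + program.upper())
    if override:
        return override
    return shutil.which(program)  # None if the program is not on the PATH
```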

Changing tesseract configuration variables

You can override tesseract’s default control parameters with a configuration file.

As an example, this configuration will disable Tesseract’s dictionary for the current language. Normally the dictionary is helpful for interpolating words that are unclear, but it may interfere with OCR if the document does not contain many words (for example, a list of part numbers).

Create a file named “no-dict.cfg” with these contents:

    load_system_dawg 0
    language_model_penalty_non_dict_word 0
    language_model_penalty_non_freq_dict_word 0

then run ocrmypdf as follows (along with any other desired arguments):

    ocrmypdf --tesseract-config no-dict.cfg input.pdf output.pdf
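When several variants of such a configuration are needed, the file can be generated rather than typed by hand. The sketch below assumes the simple "name value" line format shown in the example; the helper name render_tesseract_config is hypothetical.

```python
NO_DICT = {
    "load_system_dawg": 0,
    "language_model_penalty_non_dict_word": 0,
    "language_model_penalty_non_freq_dict_word": 0,
}


def render_tesseract_config(params):
    """Render parameters as one 'name value' pair per line."""
    return "".join("{} {}\n".format(name, value) for name, value in params.items())
```

Writing render_tesseract_config(NO_DICT) to no-dict.cfg reproduces the file described above.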

Warning: Some combinations of control parameters will break Tesseract or break assumptions that OCRmyPDF makes about Tesseract’s output.



Changing the PDF renderer

rasterizing: Converting a PDF to an image for display.

rendering: Creating a new PDF from other data (such as an existing PDF).

OCRmyPDF has three PDF renderers: sandwich, hocr, tesseract. The renderer may be selected using --pdf-renderer. The default is auto, which lets OCRmyPDF select the renderer to use. Currently, auto selects sandwich for Tesseract 3.05.01 and newer, and hocr for older versions of Tesseract.
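The selection rule for auto can be expressed as a version comparison. This sketch is not OCRmyPDF's implementation; the function names are hypothetical, and the version parsing simply takes the first three numbers found in the string.

```python
import re


def parse_version(text):
    """Turn a string such as '3.05.01' into a comparable tuple (3, 5, 1)."""
    numbers = [int(n) for n in re.findall(r"\d+", text)][:3]
    return tuple(numbers + [0] * (3 - len(numbers)))


def auto_renderer(tesseract_version):
    """Pick the renderer that auto would select, per the rule stated above."""
    if parse_version(tesseract_version) >= (3, 5, 1):
        return "sandwich"
    return "hocr"
```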

The sandwich renderer

The sandwich renderer uses Tesseract’s new text-only PDF feature, which produces a PDF page that lays out the OCR in invisible text. This page is then “sandwiched” onto the original PDF page, allowing lossless application of OCR even to PDF pages that contain other vector objects.

When image preprocessing features like --deskew are used, the original PDF will be rendered as a full page and the OCR layer will be placed on top.

This renderer requires Tesseract 3.05.01 or newer.

The hocr renderer

The hocr renderer works with older versions of Tesseract. The image layer is copied from the original PDF page if possible, avoiding potentially lossy transcoding or loss of other PDF information. If preprocessing is specified, then the image layer is a new PDF.

This works in all versions of Tesseract.

The tesseract renderer

The tesseract renderer creates a PDF with the image and text layers precomposed, meaning that it always transcodes, loses image quality and rasterizes any vector objects. It does a better job on non-Latin text and document structure than hocr.

If a PDF created with this renderer using Tesseract versions older than 3.05.00 is then passed through Ghostscript’s pdfwrite feature, the OCR text may be corrupted. The --output-type=pdfa argument will produce a warning in this situation.

This renderer is deprecated and will be removed whenever support for older versions of Tesseract is dropped.


CHAPTER

7

Batch processing

This article provides information about running OCRmyPDF on multiple files or configuring it as a service triggered by file system events.

Batch jobs

Consider using the excellent GNU Parallel to apply OCRmyPDF to multiple files at once.

Both parallel and ocrmypdf will try to use all available processors. To maximize parallelism without overloading your system with processes, consider using parallel -j 2 to limit parallel to running two jobs at once.

This command will run ocrmypdf on all files named *.pdf in the current directory and write the results to the previously created output/ folder. It will not search subdirectories.

The --tag argument tells parallel to print the filename as a prefix whenever a message is printed, so that one can trace any errors to the file that produced them.

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf
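Where GNU Parallel is not available, a similar two-jobs-at-a-time batch can be approximated in Python with a thread pool. This is a sketch under stated assumptions, not part of OCRmyPDF: the function names are hypothetical, the output folder must already exist, and the runner parameter exists only so that subprocess.run can be swapped out for testing.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ocr_one(src, out_dir, runner=subprocess.run):
    """Run ocrmypdf on one file and return (source path, exit code)."""
    dest = Path(out_dir) / Path(src).name
    completed = runner(["ocrmypdf", str(src), str(dest)])
    return str(src), completed.returncode


def ocr_all(folder=".", out_dir="output", jobs=2, runner=subprocess.run):
    """OCR every *.pdf directly inside folder, running `jobs` at a time."""
    files = sorted(Path(folder).glob("*.pdf"))  # does not search subdirectories
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda f: ocr_one(f, out_dir, runner), files))
```

The returned list pairs each file name with its exit code, which serves the same purpose as the --tag prefix: errors can be traced back to the file that produced them.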

Directory trees

This will walk through a directory tree and run OCR on all files in place, printing the name of each file as it is processed:

    find . -name '*.pdf' -printf '%p\n' -exec ocrmypdf '{}' '{}' \;

This only runs one ocrmypdf process at a time. This variation uses find to create a directory list and parallel to parallelize runs of ocrmypdf, again updating files in place.

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf '{}' '{}'



Sample script

This user-contributed script also provides an example of batch processing.

    #!/usr/bin/env python3
    # Walk through directory tree, replacing all files with OCR'd version
    # Contributed by DeliciousPickle@github

    import logging
    import os
    import subprocess
    import sys

    script_dir = os.path.dirname(os.path.realpath(__file__))
    print(script_dir + '/ocr-tree.py: Start')

    # Optional arguments: directory to walk, then log file location
    if len(sys.argv) > 1:
        start_dir = sys.argv[1]
    else:
        start_dir = '.'

    if len(sys.argv) > 2:
        log_file = sys.argv[2]
    else:
        log_file = script_dir + '/ocr-tree.log'

    logging.basicConfig(
        level=logging.INFO, format='%(asctime)s %(message)s',
        filename=log_file, filemode='w')

    for dir_name, subdirs, file_list in os.walk(start_dir):
        logging.info('\n')
        logging.info(dir_name + '\n')
        for filename in file_list:
            file_ext = os.path.splitext(filename)[1]
            if file_ext == '.pdf':
                full_path = dir_name + '/' + filename
                print(full_path)
                cmd = ["ocrmypdf", "--deskew", filename, filename]
                logging.info(cmd)
                # Run inside dir_name instead of calling os.chdir(), which
                # would break the relative paths that os.walk() yields
                proc = subprocess.Popen(
                    cmd, cwd=dir_name,
                    stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
                # communicate() waits for exit, so returncode is populated
                result, _ = proc.communicate()
                if proc.returncode == 6:
                    print("Skipped document because it already contained text")
                elif proc.returncode == 0:
                    print("OCR complete")
                logging.info(result)

API

OCRmyPDF is currently supported as a command line interface. Due to limitations in one of the libraries OCRmyPDF depends on, it is not yet usable as an API.



Huge batch jobs

If you have thousands of files to work with, contact the author.

Hot (watched) folders

To set up a “hot folder” that will trigger OCR for every file inserted, use a program like Python watchdog (supports all major OS).

One could then configure a scanner to automatically place scanned files in a hot folder, so that they will be queued for OCR and copied to the destination.

    pip install watchdog

watchdog installs the command line program watchmedo, which can be told to run ocrmypdf on any .pdf added to the current directory (.) and place the result in the previously created out/ folder.

    cd hot-folder
    mkdir out
    watchmedo shell-command \
        --patterns="*.pdf" \
        --ignore-directories \
        --command='ocrmypdf "${watch_src_path}" "out/${watch_src_path}" ' \
        .  # don't forget the final dot

For more complex behavior you can write a Python script around the watchdog API.
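A minimal sketch of such a script follows, assuming the watchdog package is installed. It reacts only to newly created .pdf files and writes results to a separate out/ folder. The helper names output_path, should_process and watch are hypothetical; a real deployment would also need to wait until a file has been completely written before starting OCR.

```python
import subprocess
from pathlib import Path


def output_path(src_path, out_dir="out"):
    """Place the result in out_dir under the same file name."""
    return str(Path(out_dir) / Path(src_path).name)


def should_process(src_path, is_directory=False):
    """Only plain files with a .pdf extension are sent to ocrmypdf."""
    return not is_directory and Path(src_path).suffix.lower() == ".pdf"


def watch(folder=".", out_dir="out"):
    """Watch folder (non-recursively) and OCR each PDF created in it."""
    # Imported here so the helpers above work without watchdog installed
    import time
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class PdfCreatedHandler(FileSystemEventHandler):
        def on_created(self, event):  # ignore moves, deletes and edits
            if should_process(event.src_path, event.is_directory):
                subprocess.run(
                    ["ocrmypdf", event.src_path, output_path(event.src_path, out_dir)])

    observer = Observer()
    observer.schedule(PdfCreatedHandler(), folder, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Calling watch("hot-folder", "hot-folder/out") runs until interrupted. Because the observer is not recursive, files written to the output folder do not trigger further events.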

On file servers, you could configure watchmedo as a system service so it will run all the time.

Caveats

• watchmedo may not work properly on a networked file system, depending on the capabilities of the file system client and server.

• This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.

• If the source and destination directory are the same, watchmedo may create an infinite loop.

• On BSD, FreeBSD and older versions of macOS, you may need to increase the number of file descriptors to monitor more files, using ulimit -n 1024 to watch a folder of up to 1024 files.

Alternatives

• Watchman is a more powerful alternative to watchmedo.


CHAPTER

8

PDF security issues

OCRmyPDF should only be used on PDFs you trust. It is not designed to protect you against malware.

Recognizing that many users have an interest in handling PDFs and applying OCR to PDFs they did not generate themselves, this article discusses the security implications of PDFs and how users can protect themselves.

The disclaimer applies: this software has no warranties of any kind.

PDFs may contain malware

PDF is a rich, complex file format. The official PDF 1.7 specification, ISO 32000-1:2008, is hundreds of pages long and references several annexes, each of which is similar in length. PDFs can contain video, audio, JavaScript and other programming, and forms. In some cases, they can open internet connections to pre-selected URLs. All of these are possible attack vectors.

In short, PDFs may contain viruses.

This article describes a high-paranoia method which allows potentially hostile PDFs to be viewed and rasterized safely in a disposable virtual machine. A trusted PDF created in this manner is converted to images, so it loses all compression and all of the information that made it searchable. OCRmyPDF could be used to restore searchability.

How OCRmyPDF processes PDFs

OCRmyPDF must open and interpret your PDF in order to insert an OCR layer. First, it runs all PDFs through qpdf, a program that repairs PDFs with syntax errors. This is done because, in the author’s experience, a significant number of PDFs in the wild, especially those created by scanners, are not well-formed files. qpdf makes it more likely that OCRmyPDF will succeed, but offers no security guarantees. qpdf is also used to split the PDF into single page PDFs.

After qpdf, OCRmyPDF examines each page using PyPDF2. This library also has no warranties or guarantees.

Finally, OCRmyPDF rasterizes each page of the PDF using Ghostscript in -dSAFER mode.



Depending on the options specified, OCRmyPDF may graft the OCR layer into the existing PDF or it may essentially reconstruct (“re-fry”) a visually identical PDF that may be quite different at the binary level. That said, OCRmyPDF is not a tool designed for sanitizing PDFs.

Using OCRmyPDF online

OCRmyPDF is not designed to be deployed “as a service”, in a setting where a user/attacker could upload a file for OCR processing online. It is not designed to be secure in this case.

Abbyy Cloud OCR is a viable commercial alternative with a web services API. The author also provides professional services that include OCR and building databases around PDFs, and is happy to provide consultation.

Password protection, digital signatures and certification

OCRmyPDF cannot remove password protection from a PDF. qpdf, one of its dependencies, has this capability. After OCR is applied, password protection is not permitted on PDF/A documents but the file can be converted to regular PDF.

Many programs exist which are capable of inserting an image of someone’s signature. On its own, this practice offers no real security: it is trivial to remove the signature image and apply it to other files.

Important documents can be digitally signed and certified to attest to their authorship. OCRmyPDF cannot do this.

Open source tools such as pdfbox (Java) have this capability as does Adobe Acrobat.


CHAPTER

9

Common error messages

Page already has text

ERROR 1: page already has text! - aborting (use --force-ocr to force OCR)

You ran ocrmypdf on a file that already contains printable text or a hidden OCR text layer (it can’t quite tell the difference). You probably don’t want to do this, because the file is already searchable.

As the error message suggests, your options are:

• ocrmypdf --force-ocr to rasterize all vector content and run OCR on the images. This is useful if a previous OCR program failed, or if the document contains a text watermark.

• ocrmypdf --skip-text to skip OCR and other processing on any pages that contain text. Text pages will be copied into the output PDF without modification.

Input file ‘filename’ is not a valid PDF

OCRmyPDF passes files through qpdf, a program that fixes errors in PDFs, before it tries to work on them. This error is reported when qpdf cannot process the file. In most cases this happens because the PDF is corrupt and truncated (incomplete file copying) and not much can be done.

You can try rewriting the file with Ghostscript or pdftk:

• gs -o output.pdf -dSAFER -sDEVICE=pdfwrite input.pdf

• pdftk input.pdf cat output output.pdf

Sometimes Acrobat can repair PDFs with its Preflight tool.
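The two command line rewrite attempts listed above can be tried in turn from a script. The sketch below builds exactly those two commands and stops at the first one that exits successfully; the function names are hypothetical, and a zero exit code does not guarantee that the rewritten file is complete.

```python
import subprocess


def repair_commands(src, dest):
    """The Ghostscript and pdftk rewrite commands listed above."""
    return [
        ["gs", "-o", dest, "-dSAFER", "-sDEVICE=pdfwrite", src],
        ["pdftk", src, "cat", "output", dest],
    ]


def try_repair(src, dest, runner=subprocess.run):
    """Return the name of the first tool that exits with 0, or None."""
    for cmd in repair_commands(src, dest):
        try:
            if runner(cmd).returncode == 0:
                return cmd[0]
        except FileNotFoundError:
            continue  # tool is not installed; try the next one
    return None
```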


CHAPTER

10

Indices and tables

• genindex

• modindex

• search
