2021.2

PDF text extraction tolerance factors

When extracting text from a PDF (for example, through a data selection), a lot more happens in

the background than what can be seen on the surface. Reading a PDF file for text will generally

return text fragments, separated by a certain amount of space. Sometimes the text will be

shifted up or down, spacing will be different, etc. In some cases, every letter is considered to be

a different fragment.

Text formatting features such as kerning, bold, superscript, etc., may cause these fragments to

be considered as separate even if, to the naked eye, they obviously belong together.

The PDF Text Extraction Tolerance Factors window is used to modify the behavior of data selections

made from PDF data files from within PlanetPress Workflow. Each factor available in this

window will determine if two fragments of text in the PDF should be part of the same data

selection or not.

Warning

The default values are generally correct for the vast majority of PDF data files. Only

change these values if you understand what they are for.

Delta Width

Defines the tolerance for the distance between two text fragments, either positive (space

between fragments) or negative (kerning text where letters overlap). When this value is at 0, the

two fragments will need to be exactly one beside the other with no space or overlap between

them.

When this value is at 1, a very large space or overlap will be accepted. This may cause "false

positives": if the value is too high, separate words and text blocks may be considered to be a

single word.

Accepted values range from 0 to 1. The default value is 0.3; recommended values are between

0.05 and 0.30.
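
The effect of Delta Width can be illustrated with a minimal Python sketch. This is not PlanetPress Workflow code: the Fragment class, the should_merge and merge_line functions, and in particular the choice to scale the tolerance by the font size are assumptions made for illustration only, since this page does not state what unit the factor is measured against.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment as returned by a PDF reader."""
    text: str
    x: float          # left edge, in points
    width: float      # horizontal extent, in points
    font_size: float  # in points


def should_merge(left: Fragment, right: Fragment, delta_width: float) -> bool:
    """Return True if two fragments are close enough to be joined."""
    if not 0.0 <= delta_width <= 1.0:
        raise ValueError("delta_width must be between 0 and 1")
    # Positive gap: space between fragments. Negative gap: overlap (kerning).
    gap = right.x - (left.x + left.width)
    # ASSUMPTION: the tolerance is taken relative to the font size.
    return abs(gap) <= delta_width * left.font_size


def merge_line(fragments: list[Fragment], delta_width: float) -> list[str]:
    """Group the fragments of one text line, left to right."""
    groups: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.x):
        if groups and should_merge(groups[-1][-1], frag, delta_width):
            groups[-1].append(frag)
        else:
            groups.append([frag])
    return ["".join(f.text for f in group) for group in groups]


# "Invoice" split into three fragments (one slightly overlapping),
# followed by a separate word further to the right.
line = [
    Fragment("In", x=0.0, width=8.0, font_size=10.0),
    Fragment("vo", x=7.5, width=10.0, font_size=10.0),    # overlaps by 0.5
    Fragment("ice", x=18.5, width=12.0, font_size=10.0),  # gap of 1.0
    Fragment("Total", x=36.5, width=22.0, font_size=10.0),  # gap of 6.0
]

print(merge_line(line, 0.0))  # ['In', 'vo', 'ice', 'Total']
print(merge_line(line, 0.3))  # ['Invoice', 'Total']
print(merge_line(line, 1.0))  # ['InvoiceTotal']
```

Under these assumptions, a value of 0 leaves every fragment separate unless the fragments touch exactly, the default of 0.3 joins the kerned letters of one word, and a value of 1 produces the "false positive" described above by fusing two separate words.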
