2021.2

PDF text extraction tolerance factors
When extracting text from a PDF (for example, through a data selection), a lot more happens in
the background than what can be seen on the surface. Reading a PDF file for text will generally
return text fragments, separated by a certain amount of space. Sometimes the text will be
shifted up or down, spacing will be different, etc. In some cases, every letter is considered to be
a different fragment.
Text formatting features such as kerning, bold, superscript, etc., may cause these fragments to
be considered separate even if, to the naked eye, they obviously belong together.
The PDF Text Extraction Tolerance Factors window is used to modify the behavior of data
selections made from PDF data files within PlanetPress Workflow. Each factor available in this
window determines whether two fragments of text in the PDF should be part of the same data
selection.
Warning
The default values are correct for the vast majority of PDF data files. Only
change these values if you understand what they are for.
Delta Width
Defines the tolerance for the distance between two text fragments, either positive (space
between fragments) or negative (kerned text where letters overlap). When this value is at 0, the
two fragments must sit exactly side by side, with no space or overlap between
them.
When this value is at 1, a very large space or overlap will be accepted. This may cause "false
positives": if the value is too high, separate words and text blocks may be considered a
single word.
Accepted values range from 0 to 1. The default value is 0.3; recommended values are between
0.05 and 0.30.
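
The effect of Delta Width can be illustrated with a minimal Python sketch. This is not the
PlanetPress Workflow implementation, which is not described here. The Fragment class, the
should_merge and merge_fragments functions, and the choice to scale the tolerance by the font
size are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment on one line of a PDF page."""
    text: str
    x_start: float    # left edge, in points
    x_end: float      # right edge, in points
    font_size: float  # in points


def should_merge(left: Fragment, right: Fragment, delta_width: float = 0.3) -> bool:
    """Return True if two neighbouring fragments are close enough to join.

    Assumption: the tolerance is delta_width multiplied by the font size.
    The real product may scale the factor differently.
    """
    if not 0.0 <= delta_width <= 1.0:
        raise ValueError("delta_width must be between 0 and 1")
    # Positive gap = space between fragments, negative gap = overlap (kerning).
    gap = right.x_start - left.x_end
    return abs(gap) <= delta_width * left.font_size


def merge_fragments(fragments: list[Fragment], delta_width: float = 0.3) -> list[Fragment]:
    """Join fragments from left to right whenever should_merge accepts the pair."""
    merged: list[Fragment] = []
    for frag in sorted(fragments, key=lambda f: f.x_start):
        if merged and should_merge(merged[-1], frag, delta_width):
            prev = merged[-1]
            merged[-1] = Fragment(prev.text + frag.text, prev.x_start,
                                  frag.x_end, prev.font_size)
        else:
            merged.append(frag)
    return merged


# Sample line: "Inv" and "oice" are 0.5 pt apart, "Total" is 10 pt further right.
sample = [
    Fragment("Inv", 0.0, 15.0, 10.0),
    Fragment("oice", 15.5, 35.0, 10.0),
    Fragment("Total", 45.0, 70.0, 10.0),
]
for factor in (0.0, 0.3, 1.0):
    print(factor, [f.text for f in merge_fragments(sample, factor)])
```

Under these assumptions, a factor of 0 leaves all three fragments separate, the default of 0.3
joins only the two halves of the first word, and a factor of 1 also absorbs the following word,
which is the kind of false positive described above.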