• Field separator: Defines which character separates the fields in the file.
• Text delimiter: Defines which character surrounds text fields in the file, so that a
Field separator appearing between those text delimiters is not interpreted as a separator.
• Comment delimiter: Defines which character starts a comment line.
• Encoding: Defines which encoding is used to read the Data Source (US-ASCII,
ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE).
• Lines to skip: Defines the number of lines at the start of the CSV that are skipped and
not used as Source Records.
• Set tabs as a field separator: Overrides the Field separator option and uses the Tab
character instead, for tab-delimited files.
• First row contains field names: Uses the first line of the CSV as headers, which
automatically names all extracted fields.
• Ignore unparseable lines: Ignores any line that does not correspond to the settings
above.
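As a rough illustration of how the CSV options above interact, the following Python sketch applies equivalent settings using the standard `csv` module. It is not the application's own implementation: the function name `read_records`, its parameter names, the generated `Field1`, `Field2` names, and the rule that a line is "unparseable" when its field count differs from the header are all assumptions made for this example. For simplicity it also assumes that quoted fields do not contain line breaks.

```python
import csv


def read_records(raw_bytes, *, separator=",", text_delimiter='"',
                 comment_delimiter="#", encoding="utf-8", lines_to_skip=0,
                 tabs_as_separator=False, first_row_has_names=True,
                 ignore_unparseable=True):
    """Hypothetical helper mirroring the CSV input options."""
    # "Set tabs as a field separator" overrides the Field separator option.
    if tabs_as_separator:
        separator = "\t"

    # "Encoding" decides how the raw bytes become text.
    lines = raw_bytes.decode(encoding).splitlines()

    # "Lines to skip" drops leading lines before anything else is parsed.
    lines = lines[lines_to_skip:]

    # "Comment delimiter": lines starting with it are not records.
    if comment_delimiter:
        lines = [ln for ln in lines if not ln.startswith(comment_delimiter)]

    # "Field separator" and "Text delimiter" map to delimiter and quotechar.
    rows = [row for row in csv.reader(lines, delimiter=separator,
                                      quotechar=text_delimiter) if row]
    if not rows:
        return []

    # "First row contains field names": use it as headers, else generate names.
    if first_row_has_names:
        names, rows = rows[0], rows[1:]
    else:
        names = [f"Field{i + 1}" for i in range(len(rows[0]))]

    records = []
    for row in rows:
        # Assumption: a row with the wrong number of fields is "unparseable".
        if len(row) != len(names):
            if ignore_unparseable:
                continue
            raise ValueError(f"Unparseable line: {row!r}")
        records.append(dict(zip(names, row)))
    return records
```

For example, calling `read_records(data, separator=";", text_delimiter="'", lines_to_skip=1)` on a semicolon-separated file skips the first line, drops comment lines, and keeps a `;` that appears inside single quotes as part of the field value.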
For a PDF File
PDF files already have a clear, fixed delimiter: pages. The settings in the input area are
therefore not used to set delimiters for PDF files. Instead, they provide options that control
how text is read from the PDF when creating data selections. These options determine how
PDF words, lines and paragraphs are detected. For instance, the line spacing option
determines the spacing between lines of text. The default value is "1", meaning the distance
between the tops of two consecutive lines must be at least the average character height.
Note
PDF Files have a natural, static delimiter in the form of Pages, so the options here are interpretation
settings for text in the PDF file. Each value represents a fraction of the average font size of text in a
data selection, meaning "0.3" represents 30% of the height or width.
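To make the fractional values more concrete, here is a minimal Python sketch of the line spacing rule described above. It is only an illustration under stated assumptions, not the detection logic the application actually uses: the function `group_into_lines`, the `(text, top, height)` fragment tuples, and the convention that `top` is measured downward from the top of the page are all hypothetical.

```python
def group_into_lines(fragments, line_spacing=1.0):
    """Group text fragments into lines.

    fragments: list of (text, top, height) tuples, sorted top to bottom and
    then left to right. 'top' is assumed to grow downward from the page top.
    """
    if not fragments:
        return []

    # The option is a fraction of the average character height.
    average_height = sum(height for _, _, height in fragments) / len(fragments)
    threshold = line_spacing * average_height

    lines = [[fragments[0][0]]]
    current_top = fragments[0][1]
    for text, top, _ in fragments[1:]:
        if top - current_top >= threshold:
            # Far enough below the current line: start a new line.
            lines.append([text])
            current_top = top
        else:
            # Close enough to the current line's top: same line.
            lines[-1].append(text)
    return [" ".join(line) for line in lines]
```

With fragments whose average height is 10 units, the default value of "1" starts a new line only when a fragment's top is at least 10 units below the top of the current line. Raising the value to "1.5" would require 15 units and would merge lines that are closer together than that.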
• Word spacing: Determines the spacing between words. Because PDF text spacing is
done through positioning rather than actual space characters, text position is what is
used to find new words. This option determines what fraction of the average width of a
single character needs to be empty for a new word to be considered started. The default
value is "0.3", meaning a space is assumed if there is a blank area of 30% of the width of the