User guide
Creating a Discovery Job
www.iprotech.com Ipro eCapture User Guide 5-45
877-324-4776 Q1 2014
for the email (headers...). Any attachments are not included in
that index.
OCR
• OCR images as necessary - Images are OCRed for indexing or
language identification when necessary. The OCR text obtained from
the image is passed to dtSearch for indexing, and can then be
searched in the Flex Processor.
• OCR PDF documents - PDFs with no embedded text are OCRed
prior to indexing or language identification. PDFs with embedded
text (text-behind) have that text extracted regardless. The OCR
text is added to any text extracted from the PDF, and both are
passed to dtSearch for indexing. The OCR text is indexed and
available to be searched in the Flex Processor. Note: Selecting
this option increases the time required for the Discovery process.
Because the OCR text is appended to the extracted text file, the
result may contain duplicate words, which can inflate search hits.
Optionally, select PDF page character threshold to perform OCR
on image-based PDFs that contain only a small amount of embedded
text, such as an image key. The default value is 25. The maximum
value is 10000. If fewer characters than this value are retrieved,
the PDF will be OCRed.
• OCR PowerPoint Documents: Turn this option on to perform
OCR on Microsoft PowerPoint files during indexing to get text from
embedded content in the slides. This will result in slower indexing
speeds for PowerPoint files, but more accurate search results.
• Minimum average OCR confidence [1-100]: The setting ranges
from 1 to 100. The default is 50. The confidence level is the
average confidence percentage across all pages of a document on
which OCR was performed. Success or failure of a document for
index preparation is based on this average. If the average
confidence level is below the selected threshold, the document is
treated as an indexing error and is available for re-queueing. The
Discovery Job information panel displays OCR Applied[Errors], where
Applied shows the number of documents that required OCR (not











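The two threshold rules described in the OCR options above can be sketched as simple decision functions. This is only an illustration of the documented behavior, not eCapture code: the function names, parameter names, and input validation are hypothetical, and the sketch assumes that the character count and the per-page confidence values are already available as plain numbers.

```python
from statistics import mean

# Values taken from the option descriptions above.
PDF_CHAR_THRESHOLD_DEFAULT = 25
PDF_CHAR_THRESHOLD_MAX = 10000
MIN_CONFIDENCE_DEFAULT = 50


def pdf_needs_ocr(extracted_chars, threshold=PDF_CHAR_THRESHOLD_DEFAULT):
    """Return True when fewer characters than the threshold were retrieved.

    Hypothetical helper: the lower bound of 0 is an assumption, since the
    guide states only the default (25) and the maximum (10000).
    """
    if not 0 <= threshold <= PDF_CHAR_THRESHOLD_MAX:
        raise ValueError("threshold must be between 0 and 10000")
    return extracted_chars < threshold


def ocr_confidence_ok(page_confidences, minimum=MIN_CONFIDENCE_DEFAULT):
    """Return False when the document's average OCR confidence is below
    the selected minimum, i.e. when it would count as an indexing error.

    page_confidences holds one percentage per OCRed page (assumed input).
    """
    if not 1 <= minimum <= 100:
        raise ValueError("minimum confidence must be between 1 and 100")
    if not page_confidences:
        raise ValueError("no OCRed pages to average")
    return mean(page_confidences) >= minimum


if __name__ == "__main__":
    # A PDF with only a 10-character image key falls under the default of 25.
    print(pdf_needs_ocr(10))
    # Pages at 40% and 50% average 45%, which is below the default of 50.
    print(ocr_confidence_ok([40, 50]))
```

Both comparisons are strict, following the wording above: a PDF is OCRed only when fewer characters than the threshold are retrieved, and a document is an error only when its average is below the minimum.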