User Guide
Ensuring Data Quality
Image preprocessing
Very often form images will contain "garbage" in the form of
excess dots introduced by scanning. Sometimes an image may be
skewed or rotated by 90 degrees from its normal orientation. It is very
important that the influence of such external factors be minimized.
ABBYY FormReader can do the following:
despeckle images, i.e. remove excess dots that hamper recognition
(the size of the dots to be removed can be adjusted);
deskew images that have a skew angle of up to 10 degrees;
rotate images by 90 degrees;
invert images, i.e. turn black pixels into white and vice versa.
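To make these operations concrete, here is a minimal sketch in Python. It is not FormReader's implementation: it assumes a black-and-white image stored as a list of rows in which 1 is a black pixel and 0 is a white one, and the function names and the `max_size` parameter are invented for illustration. Deskewing is left out because it needs skew-angle estimation, which does not fit in a short example.

```python
from typing import List

Image = List[List[int]]  # assumption: 1 = black pixel, 0 = white pixel


def invert(img: Image) -> Image:
    """Turn black pixels into white and vice versa."""
    return [[1 - px for px in row] for row in img]


def rotate90(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def despeckle(img: Image, max_size: int = 1) -> Image:
    """Remove groups of connected black pixels no larger than max_size."""
    if not img:
        return []
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill (4-connected) to collect one blob of black pixels.
            blob, stack = [], [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and img[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Small blobs are treated as scanning noise and erased.
            if len(blob) <= max_size:
                for by, bx in blob:
                    out[by][bx] = 0
    return out
```

Here `max_size` plays the role of the adjustable dot size: raising it removes larger specks, at the risk of erasing small genuine marks such as full stops.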
The program can also detect textured backgrounds consisting
of dots or lines that are much thinner than the characters to be
recognized. FormReader will remove such textures before it starts
analysing and recognizing the text. Excess dots will be removed
during preprocessing, and grids of hairline-width lines will be detected
and removed when analysing the structure of the document.
Ensuring the quality of data
Defining data quality
In the previous sections we have often used the phrase "quality
of data". By the quality of data we mean the completeness and
accuracy of captured information: the more closely the data exported
into the database (or other target system) corresponds to the data
entered into the fields of the paper forms, the higher the quality
of data. The quality of data is one of the most important
parameters of a forms processing application.
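One simple way to put a number on this correspondence is to compare exported records with a manually checked reference, field by field. The sketch below is only an illustration of the definition above; the function name and the field names are invented, and FormReader does not necessarily measure quality this way.

```python
from typing import Dict


def field_accuracy(exported: Dict[str, str], reference: Dict[str, str]) -> float:
    """Fraction of reference fields whose exported value matches exactly.

    A field missing from the exported record counts as a mismatch, so the
    measure reflects completeness as well as accuracy.
    """
    if not reference:
        raise ValueError("reference record has no fields")
    matches = sum(
        1 for name, value in reference.items() if exported.get(name) == value
    )
    return matches / len(reference)
```

For a form whose fields were filled in as `{"surname": "Smith", "zip": "10115", "date": "01.02.2003"}`, an exported record containing only a correct surname and a wrong zip code would score one field out of three.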
The following factors may have an adverse effect on the quality
of data:
Sloppy writing.
If someone writes carelessly, makes corrections
or merges some letters, the chances of recognition errors
will increase. There is an obvious remedy: when designing a
form make sure that there is a separate character space for each
letter and digit on the form and that complex fields are broken
down into simpler ones, which are easier for the program to
handle. Follow the recommendations given in "Developing the
Logical Structure of the Form", and sloppy writing will have a
minimal impact on recognition accuracy.
Typos.
When entering data from forms manually, typos are an
important factor. Keyers will inevitably get tired and make
more mistakes towards the end of the day. The only solution is
to give up manual processing altogether. Operators of automated
data capture systems experience much less strain, and
even if they do get tired this will have almost no impact on the
quality of the resulting data. ABBYY FormReader will use validation
rules to ensure data integrity. Even if an operator
makes a mistake, the program will easily detect it and alert the
operator.
Recognition errors.
When reading information from the
fields, the program will mark some characters as "uncertainly
recognized". These will be passed on to the operator for verification.
But suppose the program is overconfident about
some characters, even though they have been recognized
wrongly. They would not be submitted for manual verification
and incorrect data would be exported into the database. This is
the bane of all data capture applications, but ABBYY developers
have successfully tackled this problem of "hidden" errors.
Tests show that chances of error are as low as 0.5% for letters
and 0.1% for marks in check boxes.
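The way uncertain characters are routed to the operator can be pictured with a short sketch. It assumes that each recognized character carries a confidence score between 0 and 1 and that a fixed threshold decides what goes to verification; the score, the threshold value and all names are assumptions made for this example, not a description of FormReader's internals.

```python
from typing import List, NamedTuple


class RecognizedChar(NamedTuple):
    char: str
    confidence: float  # hypothetical score between 0 and 1


def uncertain_positions(chars: List[RecognizedChar], threshold: float = 0.9) -> List[int]:
    """Return the positions of characters to pass on for manual verification."""
    return [i for i, rc in enumerate(chars) if rc.confidence < threshold]
```

A "hidden" error in these terms is a wrongly recognized character whose score is at or above the threshold: it never appears in the returned list, so the operator never sees it. Raising the threshold catches more of these errors but sends more characters to the operator.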
To sum up:
FormReader has special methods and techniques
to ensure the high quality of data. These include:
image preprocessing;
data type checks;
data verification;
data format checks;
validation rules;
document assembly rules
(in ABBYY FormReader 6.0 Enterprise Edition).
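To show how a data type check, a data format check and a validation rule differ, here is a small sketch. The field names (`quantity`, `unit_price`, `total`, `date`), the day.month.year date form and the rule itself are invented for illustration; in FormReader such checks are set up in the program and do not use this code.

```python
import re
from datetime import datetime
from typing import Dict, List

NUMERIC_FIELDS = ("quantity", "unit_price", "total")  # hypothetical field names


def check_record(record: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one captured record."""
    problems: List[str] = []
    numbers: Dict[str, int] = {}

    # Data type check: these fields must contain whole numbers only.
    for name in NUMERIC_FIELDS:
        value = record.get(name, "")
        if re.fullmatch(r"[0-9]+", value):
            numbers[name] = int(value)
        else:
            problems.append(f"{name}: not a whole number")

    # Data format check: the date must be a real day.month.year date.
    try:
        datetime.strptime(record.get("date", ""), "%d.%m.%Y")
    except ValueError:
        problems.append("date: not a valid day.month.year date")

    # Validation rule: a relationship between fields must hold.
    if len(numbers) == len(NUMERIC_FIELDS):
        if numbers["quantity"] * numbers["unit_price"] != numbers["total"]:
            problems.append("total: does not equal quantity x unit_price")

    return problems
```

A record with quantity 3, unit price 20 and total 80 passes the type checks but fails the validation rule, which is how a mistyped or misrecognized digit can be caught even when every field looks plausible on its own.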
An image with a textured background.