User's Guide
How the Matching Algorithm Works
The Chawathe algorithm matches elements of a particular label by extracting a flat
sequence of those elements from each hierarchical document and then attempting to match
the elements of the two sequences. In the example above, elements of the sequence
(<C> First </C>, <C> Second </C>, <C> Third </C>)
are matched against elements of the sequence
(<C> First </C>, <C> Third </C>, <C> Fourth </C>)
Sequences are matched using a Longest Common Subsequence (LCS) algorithm. For
example, if C elements are matched on their text content, the LCS of the above sequences
is given by:
(<C> First </C>, <C> Third </C>)
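The LCS step can be illustrated with a short Python sketch. It uses the textbook dynamic-programming formulation and is not necessarily how the comparison tool implements it. The function name lcs is hypothetical, and the sketch assumes each C element is reduced to its text content and that two elements match only when their text is identical.

```python
def lcs(left, right):
    """Return one longest common subsequence of two lists (illustrative sketch)."""
    rows, cols = len(left), len(right)
    # table[i][j] holds the LCS length of left[i:] and right[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if left[i] == right[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover the matched items in order.
    matched = []
    i = j = 0
    while i < rows and j < cols:
        if left[i] == right[j]:
            matched.append(left[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping left[i] loses nothing
        else:
            j += 1  # skipping right[j] loses nothing
    return matched


# Text content of the C elements from the two sequences above.
old_items = ["First", "Second", "Third"]
new_items = ["First", "Third", "Fourth"]
print(lcs(old_items, new_items))  # ['First', 'Third']
```

Elements that do not appear in the result ("Second" and "Fourth" here) are the ones left unmatched on each side.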
You can define a score for matching elements of a particular label in different ways. For
instance, in the above example, C elements can be matched on text content, B can be
matched on text content and on Name, and A on the number of B and C elements they
have in common. To determine whether two elements match, the Chawathe algorithm
compares their score to a threshold.
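As a rough illustration of score-and-threshold matching, the following Python sketch defines two scoring functions and a threshold test. All names here (text_score, common_children_score, is_match) and the threshold values are assumptions made for illustration. The guide does not state the actual scoring formulas or whether the comparison is "greater than or equal to", so the sketch assumes scores between 0 and 1 where a higher score means a better match.

```python
from difflib import SequenceMatcher


def text_score(left_text: str, right_text: str) -> float:
    """Similarity of two text values in [0, 1] (assumed scoring method)."""
    return SequenceMatcher(None, left_text, right_text).ratio()


def common_children_score(common: int, left_total: int, right_total: int) -> float:
    """Fraction of child elements that two parents share (assumed formula)."""
    total = left_total + right_total
    return 2 * common / total if total else 1.0


def is_match(score: float, threshold: float) -> bool:
    """Assumption: elements match when the score reaches the threshold."""
    return score >= threshold


# C elements compared on text content, with a hypothetical threshold of 0.9.
print(is_match(text_score("First", "First"), 0.9))   # True
print(is_match(text_score("First", "Fourth"), 0.9))  # False

# A elements compared on shared children: 2 in common out of 3 on each side.
print(is_match(common_children_score(2, 3, 3), 0.5))  # True
```

Keeping the scoring function separate from the threshold test mirrors the text: each label can use its own scoring method while the match decision stays the same comparison.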
The implementation can specify scoring methods, thresholds, the definition of labels, and
the order in which labels are processed. These can be defined separately for each problem
domain or type of XML file. The XML comparison tool provides suitable definitions
for a set of common XML file types, and uses a default definition for any type of XML
document it does not recognize.
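One way to picture per-type definitions with a default fallback is a simple lookup table, sketched below in Python. Everything in it is hypothetical: the MatchDefinition class, the "example-type" key, the label order, and the threshold values are invented for illustration and do not describe the tool's actual configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class MatchDefinition:
    """Hypothetical container for the settings described above."""
    label_order: list = field(default_factory=list)  # order in which labels are processed
    thresholds: dict = field(default_factory=dict)   # match threshold per label


# Used for any document type that has no dedicated definition.
DEFAULT_DEFINITION = MatchDefinition()

# Illustrative entry only; real file types and values are not given in this guide.
DEFINITIONS = {
    "example-type": MatchDefinition(
        label_order=["C", "B", "A"],
        thresholds={"C": 1.0, "B": 0.8, "A": 0.5},
    ),
}


def definition_for(file_type: str) -> MatchDefinition:
    """Return the definition for a known type, or the default otherwise."""
    return DEFINITIONS.get(file_type, DEFAULT_DEFINITION)


print(definition_for("example-type").label_order)
print(definition_for("unrecognized-type") is DEFAULT_DEFINITION)
```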