User's Guide

How the Matching Algorithm Works

The Chawathe algorithm matches elements with a particular label by extracting a flat
sequence of those elements from each hierarchical document and attempting to match the
elements in the two sequences. In the example above, elements of the sequence

(<C> First </C>, <C> Second </C>, <C> Third </C>)

are matched against elements of the sequence

(<C> First </C>, <C> Third </C>, <C> Fourth </C>)

Sequences are matched using a Longest Common Subsequence (LCS) algorithm. For
example, if C elements are matched on their text content, the LCS of the above sequences
is given by:

(<C> First </C>, <C> Third </C>)
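
The LCS step can be illustrated with a minimal sketch. It uses the textbook dynamic-programming LCS, not necessarily the tool's own implementation, and the names lcs_pairs and matches are hypothetical. The elements are represented here by their text content only.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def lcs_pairs(left: Sequence[T], right: Sequence[T],
              matches: Callable[[T, T], bool]) -> List[Tuple[T, T]]:
    """Return the (left, right) pairs of a longest common subsequence.

    `matches` is a hypothetical predicate deciding whether two elements
    count as the same; here it stands in for the score/threshold test.
    """
    n, m = len(left), len(right)
    # table[i][j] holds the LCS length of left[i:] and right[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if matches(left[i], right[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover the matched pairs.
    pairs: List[Tuple[T, T]] = []
    i = j = 0
    while i < n and j < m:
        if matches(left[i], right[j]):
            pairs.append((left[i], right[j]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Text content of the C elements in the two sequences above.
old_c = ["First", "Second", "Third"]
new_c = ["First", "Third", "Fourth"]
matched = lcs_pairs(old_c, new_c, lambda a, b: a == b)
```

In this sketch, elements left out of the returned pairs ("Second" on one side, "Fourth" on the other) are the unmatched ones.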

You can define a score for matching elements of a particular label in different ways. For
instance, in the above example, C elements can be matched on text content, B elements
on text content and on Name, and A elements on the number of B and C elements they
have in common. To determine whether two elements match, the Chawathe algorithm
compares the score to a threshold.
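
As an illustration of per-label scoring, the following sketch scores C and B elements and compares the score to a threshold. Everything in it is an assumption made for the example: the function names, the score values, the thresholds, the treatment of Name as an XML attribute, and the rule that a score at or above the threshold counts as a match.

```python
import xml.etree.ElementTree as ET


def text_of(elem: ET.Element) -> str:
    """Text content of an element with surrounding whitespace removed."""
    return (elem.text or "").strip()


def score_c(left: ET.Element, right: ET.Element) -> float:
    # C elements: scored on text content only (assumed 0.0 or 1.0).
    return 1.0 if text_of(left) == text_of(right) else 0.0


def score_b(left: ET.Element, right: ET.Element) -> float:
    # B elements: scored on text content and on Name, weighted equally.
    # Name is assumed here to be an attribute of the element.
    text_score = 1.0 if text_of(left) == text_of(right) else 0.0
    name_score = 1.0 if left.get("Name") == right.get("Name") else 0.0
    return (text_score + name_score) / 2


# Hypothetical (scoring function, threshold) pair for each label.
SCORERS = {
    "C": (score_c, 1.0),
    "B": (score_b, 0.5),
}


def is_match(left: ET.Element, right: ET.Element) -> bool:
    """Elements match if they share a label and score at or above threshold."""
    if left.tag != right.tag:
        return False
    scorer, threshold = SCORERS[left.tag]
    return scorer(left, right) >= threshold


c_old = ET.fromstring("<C> First </C>")
c_new = ET.fromstring("<C> First </C>")
b_old = ET.fromstring('<B Name="alpha">one</B>')
b_renamed_text = ET.fromstring('<B Name="alpha">two</B>')
b_different = ET.fromstring('<B Name="beta">three</B>')
```

With these assumed thresholds, a B element whose text changed but whose Name is unchanged still matches, while one that differs in both does not.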

The implementation can specify scoring methods, thresholds, the definition of labels, and
the order in which labels are processed. These can be defined separately for each problem
domain or type of XML file. The XML comparison tool provides suitable definitions
for a set of common XML file types, and uses a default definition for any type of XML
document it does not recognize.
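
One way to picture per-type definitions with a default fallback is a simple lookup table. This is only a sketch: MatchDefinition, DEFINITIONS, definition_for, the "example-type" key, the label order, and the threshold values are all hypothetical and do not describe the tool's actual configuration format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MatchDefinition:
    """Hypothetical bundle of matching settings for one XML file type."""
    label_order: List[str] = field(default_factory=list)        # processing order
    thresholds: Dict[str, float] = field(default_factory=dict)  # per-label threshold


# Fallback used for any document type that is not recognized.
DEFAULT_DEFINITION = MatchDefinition()

# Hypothetical registry keyed by an assumed document-type identifier.
DEFINITIONS: Dict[str, MatchDefinition] = {
    "example-type": MatchDefinition(
        label_order=["C", "B", "A"],
        thresholds={"C": 1.0, "B": 0.5, "A": 0.5},
    ),
}


def definition_for(document_type: str) -> MatchDefinition:
    """Return the definition for a known type, or the default otherwise."""
    return DEFINITIONS.get(document_type, DEFAULT_DEFINITION)
```

Looking up an unrecognized type returns DEFAULT_DEFINITION, which mirrors the fallback behavior described above.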