User`s guide
9 Comparing XML Files
9-12
Why Use a Heuristic Algorithm?
Chawathe’s algorithm is a heuristic. That is, it cannot guarantee to return the optimal
matching between two sequences. It is the use of a threshold mechanism in combination
with an LCS algorithm that makes the algorithm a heuristic. A heuristic algorithm is
preferable to an optimal matching because the heuristic is much faster.
An algorithm can only guarantee a mathematically optimal matching by exhaustively
computing the score between all pairs of elements in the two sequences and choosing
those pairs that maximize an overall matching score between the two sequences.
This exhaustive approach is computationally very expensive because its running time
increases exponentially with the length of the sequences to be matched.
Also a user's expectations can depend on context information that is not available
to the matching algorithm (e.g., prior knowledge of the precise sequence of changes
applied). This means even a mathematically optimal algorithm might match elements
unexpectedly from a user's perspective.
In contrast with the mathematically optimal approach, Chawathe’s algorithm guarantees
linear running time for sequences that are the same or very similar. The worst-case
scenario is quadratic running time for sequences that are entirely different.
The XML comparison tool performs best when the files to be compared are mostly
similar. It becomes slower for files that contain more differences.
Examples of Unexpected Results
• “Elements Matched in Previous Comparisons Fail to Match” on page 9-12
• “Elements Matched Across Different Parts of the Hierarchy” on page 9-13
• “Two Sequences of Elements Are Cross-Matched” on page 9-15
Elements Matched in Previous Comparisons Fail to Match
Elements could fail to match even if they were matched in comparisons of previous
versions of the documents. A seemingly small change in one of the properties used for
matching can cause this to happen if it tips the score under the threshold.
Consider the following example where
• B elements are scored on the value of x