Smart version controlling: why diff is just not good enough
Everyone who has been involved in a software engineering project has probably worked with CVS or SVN. There is a central repository; you can do checkouts, updates and commits; and the version control system saves every revision and reports the changes that have been made using the well-known Unix tool diff, which states the differences between two text files in a line-by-line fashion. diff is useful. Its brother in arms is patch, which takes a file produced by diff and applies the changes to a set of files. Great stuff, but diff sadly has some severe limitations, because it does nothing more than line-by-line comparison.
Scenario 1. Suppose you’re working on a project; let’s say it’s a wiki. Some biologists carry a copy of the wiki database on their phones, which will sync with the actual wiki served on the web as soon as they return from the field, where they have no internet connection. Beforehand they have no idea what to expect, so they have no idea what changes they will make. Chances are that as soon as the biologists get home and commit their changes, a lot of conflicts need to be merged, which is a very tedious thing to do. Line-by-line comparison helps only a little: it knows neither grammar nor semantics. To make matters worse, it does not detect similarities within a single line. Putting one simple character in front of a line causes diff to report a replacement of the whole line, while one would want it to detect similarities and shifted sections, and to report the column numbers where changes have been made.
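To make the single-character case concrete, here is a minimal sketch using Python’s standard difflib module. The sample lines and the helper name column_changes are made up for illustration; the point is only to contrast what a line-based diff reports with what a character-level comparison of the same two lines can recover.

```python
import difflib

old = ["The quick brown fox", "jumps over the lazy dog"]
new = ["The quick brown fox", ">jumps over the lazy dog"]

# Line-based view: the second line is reported as removed and re-added in full.
line_diff = list(difflib.unified_diff(old, new, lineterm="", n=0))


def column_changes(a, b):
    """Return (tag, start column in a, end column in a, new text) per change."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, i1, i2, b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    print("\n".join(line_diff))
    print(column_changes(old[1], new[1]))
```

For the second line, column_changes returns [('insert', 0, 0, '>')]: one character inserted at column 0, which is the kind of report one would like a version control system to give.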
A frustrating example in which an understanding of grammar and semantics would be really helpful is scenario 2. I translate Firefox extensions into Dutch, and the locales for such extensions are mostly written as XML DTDs. The drill is that I get the DTD from an en-US package, copy it, translate all the entities into Dutch and give the modified DTD to the author of the extension so he/she can incorporate it into the product. Of course, these extension developers strive for perfection and therefore release regular updates. Some of these affect the user interface and hence the strings needed in the locales. Entities in the en-US DTD are added, removed and rephrased, and their position in the file will probably change as well. And it is up to me to see what has happened between the versions.
Using diff is not much help here. I could run it on my own nl-NL DTD and the new en-US DTD, but then diff reports every single line, because the languages differ. A more sensible approach is to compare the old and new versions of the en-US DTD to detect what has changed between them. But this is still not ideal, because the line-by-line comparison makes no use of the characteristics of the texts being compared. XML (like any other formal language) can be parsed. Parsing can disregard white space, comments and other things irrelevant to what we’re actually interested in, and XML diff tools are already available (see links below).
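As a rough illustration of comparing two versions of an en-US DTD by entity instead of by line, here is a small Python sketch. It is not a real DTD parser: it assumes simple declarations of the form <!ENTITY name "value">, ignores parameter entities, and the function names and sample entities are invented for this example.

```python
import re

# Matches simple declarations such as <!ENTITY menu.label "Options">.
ENTITY_RE = re.compile(
    r'<!ENTITY\s+([\w.:-]+)\s+(?:"([^"]*)"|\'([^\']*)\')\s*>'
)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def parse_entities(dtd_text):
    """Map entity names to values, ignoring comments, order and spacing."""
    entities = {}
    for match in ENTITY_RE.finditer(COMMENT_RE.sub("", dtd_text)):
        name, double_quoted, single_quoted = match.groups()
        entities[name] = double_quoted if double_quoted is not None else single_quoted
    return entities


def compare_entities(old_text, new_text):
    """Report which entities were added, removed or rephrased."""
    old, new = parse_entities(old_text), parse_entities(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "rephrased": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


if __name__ == "__main__":
    old_dtd = '<!ENTITY menu.label "Options">\n<!ENTITY ok.label "OK">'
    new_dtd = '<!ENTITY ok.label   "OK">\n<!ENTITY menu.label "Preferences">'
    print(compare_entities(old_dtd, new_dtd))
```

Because the comparison works on names and values, moving an entity to another position or changing the spacing around it produces no noise; only the additions, removals and rephrasings a translator cares about are listed.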
As far as I know, no version control system uses the specific characteristics of the files it manages to offer more sophisticated change reporting and merging facilities. Of course such facilities would make a system larger, but there certainly are additional benefits. One is that coding standards can be enforced, and that different users of the repository can view code formatted to their own taste without forcing it upon the others. Having a parser in the versioning system also means that commits which do not parse can be rejected, so builds cannot be broken by syntax errors (that is, at compile time). These things may not add much value, because project participants are supposed to adhere to good programming practices and project standards anyway. The true power of such a system, and the one that may save lots of time and effort, is full traceability. Having the versioning system parse files committed to the repository allows for detailed analysis of who is responsible for which changes in the source code. It would be instantly clear who is responsible for certain pieces of a system and what dependencies there are, and even better, the system could report which other components are affected by a certain change. The possibilities are endless.
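To give an idea of what parsing at commit time could offer, here is a minimal sketch that uses Python’s standard ast module on Python source. The function names are hypothetical and this is not how any existing version control system works; it merely shows that, once files are parsed, formatting-only edits disappear and changes can be attributed to named pieces of the code.

```python
import ast


def function_fingerprints(source):
    """Map each function name to a dump of its syntax tree.

    ast.dump leaves out line and column positions by default, and comments
    never reach the tree, so layout-only edits give identical fingerprints.
    Same-named functions in different scopes collide in this simple sketch.
    """
    tree = ast.parse(source)  # raises SyntaxError for code that does not parse
    return {
        node.name: ast.dump(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def changed_functions(old_source, new_source):
    """Return the names of functions that were added, removed or altered."""
    old = function_fingerprints(old_source)
    new = function_fingerprints(new_source)
    return sorted(n for n in old.keys() | new.keys() if old.get(n) != new.get(n))


if __name__ == "__main__":
    before = "def area(w, h):\n    return w * h\n"
    after = "def area(w,h):\n    # tidy-up only\n    return w*h\n"
    print(changed_functions(before, after))
```

A commit that only reformats area reports no changed functions, while a commit that rewrites an expression reports exactly the function that contains it. Combined with the committer’s name, that is the raw material for the kind of traceability described above.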
One question may arise, though: won’t it make a versioning system slow and bloated? I don’t think it will. Modularity is the keyword here: tailor the facilities of the system to the (expected) contents of the repository. Sure, the server will have to do more work generating reports and processing commits, but when it’s implemented properly I’m sure that the benefits will greatly outweigh the costs.
Links:
- X-Diff — Detecting Changes in XML Documents (the paper is interesting in particular)
- xml diff demo (apparently made in C#)
- diffxml (made in Python; when trying it out on some rdf files it seems to crash though)
In XML, white space is actually quite significant, except in attribute values, which are normalized by a conforming XML parser.
A (non-existent) diff tool for SVGs
See also Logilab’s XmlDiff. Recent work is Erich Schubert’s thesis Structure-Preserving Difference Search in Semistructured Data and the accompanying SSDDiff implementation.
Another example where the usual diff capabilities of version control systems fail is Microsoft Word documents, which have built-in support for change and user tracking. No general-purpose versioning system I know of exploits this. I’m sure various other data formats provide similar things.
Hmmm, once the various office formats become XML (which I really hope), extracting such information should be a piece of cake. And when the printers work at the university I’ll see if I can print Schubert’s paper; it looks too interesting to be read from a screen ;) . I heard from Zef that someone at our university (the RuG) recently got his MSc degree for investigating a similar subject. I’ll ask him about it when I see him.
Eclipse has Java structure compare for this purpose: http://www.awprofessional.com/content/images/chap3_0321159640/elementLinks/03fig34.gif
They might expand this to any language for which they have an AST, who knows.
Semantic diff and merge would be a huge improvement.
Just to add: Philip Greenspun recently made a case that we really needed versioned file systems.
I think it will be a great advantage when versioning is part of the file system and when applications can be written to take advantage of this ability, since the application knows the semantic meaning. This would then also solve the file type problem once and for all.
(There is, however, much more room for improvement in data storage and retrieval than this alone.)