Robert, 25 March 2010

Displaying the differences between versions of a document

Two questions we're often asked are: "What happens if I upload a new version of a document. Do the notes carry over?" and "If I change the document and upload again, can you show what has changed?".

Until now, the answer has had to be: "Sorry. Each upload is treated as an independent document so you can't see the earlier notes or the differences." For many people, this works OK. You get a batch of notes on one version, process all those notes making changes in the document as you go, and then start the next review cycle with a clean new version.

But in other cases, such as legal documents, annotations are so important that the counterparty wants to know exactly how their comments were addressed, and, indeed, if anything else has changed in the document.

We've been thinking about how best to do this for some time. Do you focus on MS Word or OpenOffice documents which are drafted with track changes? Or can you do something more generic that works equally well? The APIs for getting the changes out of OpenOffice documents ("redlines" in OpenOffice parlance) look to be still under development, and are accompanied by a few less than encouraging comments. My favorite is in the OO documentation: "Calling XPropertySet.getPropertySetInfo() on a redline property set crashes the office.". Nice to know.

The alternative is to work just with PDF, which seems a little rash, given that you are deliberately ignoring information that might be present in the upload. But if it can be made to work on the PDF alone, then there are big advantages too as it makes for a much more general solution.

So, given two PDFs of different version of a document, how do you find the differences? It turns out to be rather easier than we expected. There is an extensive literature on text differencing algorithms, and the place to start is definitely Eugene W. Myers "An O(ND) Difference Algorithm and its Variations" (1986). This is a neat, and very short, algorithm for finding the shortest edit script that will turn one list of characters or words into another. Once you have the edit script, it can be used both to display how the text has changed, and to work out where the text that a note was attached to is located in the new document. So it provides the core of a solution for both the document differences and note positions, something that would be a lot tricker with the redlines because they are so far removed from the final PDF where the notes are applied.

For PDFs, there's rather more to do than the basic Myers algorithm, because of things like headers and footers. For example, you probably don't want to know that there used to be a page number in the middle of paragraph two but now it is in paragraph four. This gives some chunks of text that should be excluded from the matching, based on their positions on the page and similarity across versions. As usual, the volume of code required for this kind of logic needed to actually apply the algorithm for a real world case far exceeds that for the basic algorithm itself (and displaying the differences takes more again) but essentially it is all down to a neat idea developed by Eugene Myers in the 80's.

Here's how it looks at present, with the insertions marked on the document, and deletions in movable boxes with arrows pointing to where they were.

N.B. This is still in development and will be out later this year. Get in touch if you want to preview it before then.