Patterns in encoded data

Matching diagrams based on exact repetitions are well-suited to analyzing binary files and other encoded data. While they probably won't be of much use in cryptanalysis, they do have the potential to help in understanding and describing complex data formats of all types.

To see how matching diagrams can find structure in sequences that might otherwise be difficult to decipher, consider the next diagram, which shows the main Java class file for the diagram-generating application itself:


Figure: Java class file, patterns of 5 or more bytes

The diagram shows that the file has two main sections. This won't be a surprise to Java hackers; class files do indeed have two large areas of information, the constant pool and the bytecode. What is surprising, at least to me, is that the constant pool takes up significantly more of the file than the code itself! Although this picture can't replace a straightforward byte-level specification, it does provide interesting quantitative and qualitative information about the file format, and is obviously preferable to a hexadecimal file printout.
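To make "patterns of 5 or more bytes" concrete, here is a minimal sketch of one way to find exact, non-overlapping repeats in a byte string. It is a naive illustration, not the algorithm the diagram application uses; the function name `repeated_runs` and the parameter `min_len` are hypothetical.

```python
from collections import defaultdict


def repeated_runs(data: bytes, min_len: int = 5):
    """Return sorted (first_start, second_start, length) triples for
    exact repeats of at least min_len bytes (naive sketch)."""
    # Index every window of min_len bytes by its content.
    positions = defaultdict(list)
    for i in range(len(data) - min_len + 1):
        positions[data[i:i + min_len]].append(i)

    matches = set()
    for starts in positions.values():
        # Pair each occurrence of a window with the next occurrence.
        for a, b in zip(starts, starts[1:]):
            if b - a < min_len:
                continue  # the two copies would overlap; skip them
            length = min_len
            # Grow the match to the left while the bytes agree and the
            # two copies still do not overlap.
            while a > 0 and data[a - 1] == data[b - 1] and length < b - a:
                a -= 1
                b -= 1
                length += 1
            # Grow the match to the right under the same conditions.
            while (b + length < len(data)
                   and data[a + length] == data[b + length]
                   and length < b - a):
                length += 1
            matches.add((a, b, length))
    return sorted(matches)
```

Each triple would correspond to one arc in a diagram: the two start offsets give the endpoints, and the length gives the arc's width. For a real class file, something like `repeated_runs(open(path, "rb").read())` would produce the raw material for a picture like the one above, although this sketch makes no attempt to be fast on large inputs.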

Matching diagrams can also show structure in text. Take the HTML file that you're reading now, whose diagram is below.


Figure: This file, patterns of 10 or more characters

The three basic sections of this page are delineated clearly by wide arcs. (These correspond to the HTML code for images and tables.) At the same time, the finer-grained detail is also revealing. For instance, the diagram shows that the last section of text has many connections to previous parts of the page, with especially strong connections to the beginning. So just from looking at a picture you know that I won't finish up with a digression!

One of the most important sources of sequence data is DNA. There has already been a great deal of work on visualizing strings of base pairs, but matching diagrams may be useful as well. The next picture is a matching diagram for a section of DNA containing 200 base pairs. A variant of this picture might be helpful to a biologist studying the human genome. (To make it genuinely useful, of course, it should match codons instead of base pairs.)


Figure: DNA, 200 base-pair sequence, showing patterns of 5 or more base pairs
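As a rough sketch of what matching codons instead of base pairs could look like, the code below splits a sequence into triplets and looks for repeated runs of whole codons. The function names, the `frame` parameter (the offset at which reading starts), and the `min_codons` threshold are all assumptions made for illustration; nothing here comes from the diagram application itself.

```python
def codons(seq: str, frame: int = 0) -> list[str]:
    """Split seq into complete triplets starting at offset frame."""
    usable = len(seq) - frame
    end = frame + usable - usable % 3  # drop any trailing partial codon
    return [seq[i:i + 3] for i in range(frame, end, 3)]


def repeated_codon_runs(seq: str, min_codons: int = 2, frame: int = 0):
    """Return (first_index, second_index) pairs, in codon units, where a
    run of min_codons codons repeats without overlapping its previous copy."""
    toks = codons(seq, frame)
    last_seen = {}  # run of codons -> codon index of its latest occurrence
    hits = []
    for i in range(len(toks) - min_codons + 1):
        key = tuple(toks[i:i + min_codons])
        prev = last_seen.get(key)
        if prev is None:
            last_seen[key] = i
        elif i - prev >= min_codons:
            hits.append((prev, i))
            last_seen[key] = i
        # Otherwise the copies overlap; keep the earlier position.
    return hits
```

Because matches are found on codon boundaries only, a repeat that is shifted by one or two bases would not be reported. Whether that is desirable depends on the question being asked, which is exactly the kind of domain-specific choice a tailored matching criterion has to make.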

These examples show how visualizing repetitive structure can pick out interesting features in binary files and other encoded data. Obviously, for a detailed study of a particular kind of data, it would make sense to tailor the matching criterion to the problem domain. But matching diagrams based on exact repetitions are a handy tool for providing a quick overview of information whose structure is not known in advance.


Martin Wattenberg, July 1999