Seeing Sequences

From melodies to text to DNA, many interesting types of information come in the form of a sequence of symbols. Although it's natural to try to analyze these sequences graphically, they present a unique challenge for data visualization. A good visualization method should provide a broad overview of the "shape" of the data--yet with sequences, the only structure ab initio is the completely uniform micro-level connection of each symbol to its immediate neighbors.

This page describes a technique, the matching diagram, designed to deduce and display macro-level structure in long sequences. (If you want to skip this description, go directly to these pictures of music.)

Matching diagrams are based on the fact that sequences often represent a hierarchy of ideas. Melodies, for instance, are usually based on combinations of smaller repeated musical passages; text has repeated words and phrases. A natural way to find macro-level structure, therefore, is to use these repeated units as signposts.

Unfortunately the human eye is bad at spotting repetition. (Try to find the longest repeated subsequence in 28746391479735648274639137. It's irritating.) Matching diagrams avoid this difficulty by having the computer find the repetitions and highlighting them explicitly.

To create the diagram, we start by representing the sequence as a line. If two subsequences "match" according to a given criterion, connect the corresponding intervals on the line with an arc. For instance, here is the matching diagram for the numerical sequence in the previous paragraph, where "match" means an exact repetition.

The arc makes it obvious where the repeated subsequence is. What if a sequence is repeated more than once? In that case I connect each successive repetition with an arc, as in the next picture.

An interesting ambiguity arises when a pattern is repeated many times in immediate succession. Take the sequence 1010101010101010. Should it be viewed as "10" repeated 8 times? As "10101010" repeated twice? Or something else, perhaps a combination of overlapping repeated areas? Informal experimentation suggests the best way to go is to break a sequence of repetition into the smallest possible units, so the diagram for this sequence would look like the picture below, with the sequence parsed as "10" repeated 8 times.

Most interesting sequences will combine several different repeated patterns. To make a diagram for a complex sequence, we overlay all the appropriate arcs, but with a degree of translucency so no match is completely obscured. For instance the sequence abcd111110000011111abcd would have this diagram:

This diagram shows how a pattern of matches can provide a bird's-eye view of the sequence's structure. For instance, although the symbols don't form an exact palindrome, it's visually obvious that at the macro level the sequence is symmetric.

A long sequence made from a small set of symbols generally has a huge number of repeated sequences, creating an uninformative jumble as in this example:

One way of reducing this complexity is to display only repeated subsequences that are longer than a given limit. For instance, below is the same sequence but this time I've filtered out all repetitions of less than 10 symbols. The result is a simple diagram that highlights a single repeated region of 15 symbols.

Applied to real-life data, matching diagrams produce some intriguing pictures. I'm most interested in their applications to visualizing music but they also can reveal structure in many kinds of encoded data. Of course, this is not the first attempt to visualize sequences; I've written a quick summary of some other methods. Finally, you can read some implementation details if you wish.

Intro | Musical structure | Patterns in encoded data | Other methods | Implementation
Martin Wattenberg, July 1999