A recent article in the Economist summarizes work done in Cambridge by Dr. Goldman, Dr. Birney and their teams. They came up with a mapping for data into base pairs for storage in DNA; while today DNA storage is quite expensive, it is quickly decreasing in cost...and DNA has the advantage that information does not decay anywhere near as fast as other current storage mediums; 10's of thousands of years vs a handful. The DNA format is also unlikely to change, so it may remain a more consistent technology than others.
The team in Cambridge came up with a way to ensure that base pairs do not repeat, as sequencing errors are much higher when there are strings of the same base. The encoding table is shown above. First the digital data is converted to base 3, and then a differential coding is used to write so that no base pair ever repeats. For example, if the last base was A, and we are looking to encode a 2, then we would write a T.
It struck me immediately that this was not a very efficient coding. Two digits in base 3 have 3^2 = 9 permutations, but the encoding table above has 12 possible transitions. So, there is 25% wasted space in this encoding mechanism.
The obvious way to optimize this is to use a (digital) base of sqrt(12) = 2*sqrt(3). With sqrt(3) being irrational, this might be a challenge :-) However, it is easy to take a fixed estimate slightly smaller than sqrt(12), say 3.4641, and use this as the digital base; then we are only wasting a tiny fraction of the 12 transitions. [It would be even easier to use 3.45; the "Digits to be encoded", in the table above, would then be 0, 1.15, and 2.30, as shown below]
Previous | 0.00 | 1.15 | 2.30 |
---|---|---|---|
A | C | G | T |
C | G | T | A |
G | T | A | C |
T | A | C | G |
This also points to the fact that although the DNA "format" may be static for many years, the encoding mechanism is most likely going to change over time. Perhaps we need a standard header using a very simple A and C are 0, G and T are 1, in order to specify the encoding format for the rest of the string.