Storing Information in DNA (efficiently)

A recent article in the Economist summarizes work done in Cambridge by Dr. Goldman, Dr. Birney and their teams.  They came up with a mapping of data into base pairs for storage in DNA.  DNA storage is quite expensive today, but its cost is falling quickly, and DNA has the advantage that information decays far more slowly than on other current storage media: tens of thousands of years versus a handful.  The DNA format is also unlikely to change, so it may remain a more consistent technology than others.
The team in Cambridge came up with a way to ensure that the same base never appears twice in a row, as sequencing errors are much higher when there are runs of the same base.  The encoding table is shown above.  First the digital data is converted to base 3, and then a differential coding is used, so that the base written depends on both the digit and the previous base and is never the same as the previous base.  For example, if the last base was A, and we are looking to encode a 2, then we would write a T.
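
As a rough sketch of the differential coding described above (not the team's actual software), the lookup can be written in a few lines of Python. The table rows follow the example in the text (previous base A, digit 2, write T). The starting "previous" base of A and the fixed six base-3 digits per byte are my own assumptions, made only to keep the example self-contained.

```python
# NEXT_BASE[previous_base][digit] gives the base to write.
# Each row omits the previous base itself, so no base ever repeats.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def byte_to_trits(b):
    """Turn one byte into six base-3 digits, most significant first.

    Assumption for illustration only: 3**6 = 729 >= 256, so six digits
    are always enough for one byte.
    """
    digits = []
    for _ in range(6):
        b, r = divmod(b, 3)
        digits.append(r)
    return digits[::-1]


def encode(trits, start="A"):
    """Write one base per base-3 digit, chosen from the previous base's row."""
    prev = start  # assumed initial "previous" base
    out = []
    for t in trits:
        prev = NEXT_BASE[prev][t]
        out.append(prev)
    return "".join(out)


def decode(dna, start="A"):
    """Recover the base-3 digits by looking each base up in the previous base's row."""
    prev = start
    trits = []
    for base in dna:
        trits.append(NEXT_BASE[prev].index(base))
        prev = base
    return trits


if __name__ == "__main__":
    trits = [t for b in b"DNA" for t in byte_to_trits(b)]
    dna = encode(trits)
    print(dna)
    print(decode(dna) == trits)
```

Because every row leaves out the previous base, the output cannot contain two identical bases in a row, whatever digits go in.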

It struck me immediately that this was not a very efficient coding. Two digits in base 3 have 3^2 = 9 combinations, but the encoding table above has 12 possible transitions.  So 3 of the 12, or 25%, are wasted in this encoding mechanism.

The obvious way to optimize this is to use a (digital) base of sqrt(12) = 2*sqrt(3).  With sqrt(3) being irrational, this might be a challenge :-)  However, it is easy to take a fixed approximation slightly smaller than sqrt(12), say 3.4641, and use this as the digital base; then we are only wasting a tiny fraction of the 12 transitions. [It would be even easier to use 3.45; the "Digits to be encoded" in the table above would then be 0, 1.15, and 2.30, as shown below.]

Previous   0.00   1.15   2.30
A          C      G      T
C          G      T      A
G          T      A      C
T          A      C      G

This also points to the fact that although the DNA "format" may be static for many years, the encoding mechanism is most likely going to change over time.  Perhaps we need a standard header, using a very simple mapping (A and C are 0, G and T are 1), to specify the encoding format for the rest of the string.
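
To make the header idea concrete, here is a minimal sketch in Python. Only the mapping itself (A and C are 0, G and T are 1) comes from the suggestion above. The function names, and the rule of picking whichever of the two candidate bases differs from the previous one, are hypothetical additions of mine, there only to show that such a header can also avoid repeated bases.

```python
# Two candidate bases per bit, so a repeat can always be avoided.
BIT_TO_BASES = {0: "AC", 1: "GT"}
BASE_TO_BIT = {"A": 0, "C": 0, "G": 1, "T": 1}


def encode_header(bits):
    """Write one base per bit, never repeating the previous base."""
    out = []
    prev = None
    for bit in bits:
        first, second = BIT_TO_BASES[bit]
        # Assumed tie-break: prefer the first candidate unless it would repeat.
        base = second if prev == first else first
        out.append(base)
        prev = base
    return "".join(out)


def decode_header(dna):
    """Read the header back; either base of a pair maps to the same bit."""
    return [BASE_TO_BIT[base] for base in dna]


if __name__ == "__main__":
    header = encode_header([0, 0, 1, 1, 0])
    print(header)
    print(decode_header(header))
```

A reader needs only the four-entry table to decode such a header, whatever scheme the rest of the string uses.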

