
Storing Information in DNA (efficiently)

A recent article in the Economist summarizes work done in Cambridge by Dr. Goldman, Dr. Birney and their teams. They came up with a mapping of data into base pairs for storage in DNA. DNA storage is quite expensive today, but its cost is decreasing quickly, and DNA has the advantage that information does not decay anywhere near as fast as in other current storage media: tens of thousands of years versus a handful. The DNA format is also unlikely to change, so it may remain a more consistent technology than others.
The team in Cambridge came up with a way to ensure that bases do not repeat, as sequencing errors are much higher when there are runs of the same base. The encoding table is shown above. First the digital data is converted to base 3, and then a differential coding is used so that the same base is never written twice in a row: each base-3 digit selects one of the three bases that differ from the previous one. For example, if the last base was A, and we are looking to encode a 2, then we would write a T.
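
The differential step described above can be sketched in a few lines of Python. This is only an illustration, not the Cambridge team's code: the function names (`next_base`, `to_trits`, `encode_trits`, `decode_bases`) and the choice of "A" as the assumed base before the first digit are assumptions made for the example. The transitions follow the rule quoted above (after an A, a 2 is written as T) and the rotation table reproduced later in this post.

```python
BASES = "ACGT"  # cyclic order A -> C -> G -> T -> A

def next_base(previous, trit):
    """Return the base that encodes `trit` (0, 1 or 2) after `previous`."""
    if trit not in (0, 1, 2):
        raise ValueError("trit must be 0, 1 or 2")
    # Step 1, 2 or 3 places around the cycle, so the result never equals `previous`.
    return BASES[(BASES.index(previous) + trit + 1) % 4]

def to_trits(value):
    """Convert a non-negative integer to base-3 digits, most significant first."""
    if value == 0:
        return [0]
    digits = []
    while value > 0:
        digits.append(value % 3)
        value //= 3
    return digits[::-1]

def encode_trits(trits, start="A"):
    """Encode base-3 digits as bases; `start` is an assumed initial 'previous' base."""
    previous = start
    out = []
    for trit in trits:
        previous = next_base(previous, trit)
        out.append(previous)
    return "".join(out)

def decode_bases(bases, start="A"):
    """Invert encode_trits; a repeated base is rejected as invalid."""
    previous = start
    trits = []
    for base in bases:
        trit = (BASES.index(base) - BASES.index(previous) - 1) % 4
        if trit == 3:
            raise ValueError("repeated base: not a valid encoding")
        trits.append(trit)
        previous = base
    return trits

encoded = encode_trits(to_trits(5))      # 5 is "12" in base 3
assert decode_bases(encoded) == [1, 2]   # round trip recovers the digits
```

Because the digit is recovered from the step between two neighbouring bases, the decoder needs nothing beyond the sequence itself and the agreed starting base.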

It struck me immediately that this was not a very efficient coding. Two digits in base 3 have 3^2 = 9 possible values, but the encoding table above has 12 possible transitions (4 previous bases, each followed by one of 3 others). So, by this count, 3 of the 12 transitions, or 25%, are wasted space in this encoding mechanism.
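
The counts in this argument can be checked directly. The short sketch below only reproduces the arithmetic as stated in the paragraph above (9 two-digit base-3 values against 12 allowed transitions); the variable names are illustrative.

```python
BASES = "ACGT"

# Every ordered pair (previous, next) in which the base does not repeat.
transitions = [(p, n) for p in BASES for n in BASES if p != n]

two_digit_values = 3 ** 2  # values expressible with two base-3 digits
unused_fraction = 1 - two_digit_values / len(transitions)

print(len(transitions), two_digit_values, unused_fraction)  # prints: 12 9 0.25
```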

The obvious way to optimize this is to use a (digital) base of sqrt(12) = 2*sqrt(3). With sqrt(3) being irrational, this might be a challenge :-) However, it is easy to take a fixed estimate slightly smaller than sqrt(12), say 3.4641, and use this as the digital base; then we are only wasting a tiny fraction of the 12 transitions. [It would be even easier to use 3.45; the "Digits to be encoded" in the table above would then be 0, 1.15, and 2.30 (multiples of 3.45/3), as shown below.]

Previous   0.00   1.15   2.30
A          C      G      T
C          G      T      A
G          T      A      C
T          A      C      G

This also points to the fact that although the DNA "format" may be static for many years, the encoding mechanism is most likely going to change over time. Perhaps we need a standard header that uses a very simple mapping, in which A and C are 0 and G and T are 1, to specify the encoding format for the rest of the string.
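
A header along these lines could look roughly like the sketch below. Only the mapping itself (A and C are 0, G and T are 1) comes from the suggestion above; the function names and the rule of alternating between the two bases for a bit, so that the header also avoids repeated bases, are assumptions added for the example.

```python
BIT_OF_BASE = {"A": 0, "C": 0, "G": 1, "T": 1}
BASES_OF_BIT = {0: "AC", 1: "GT"}

def encode_header(bits):
    """Write each bit as one of its two bases, never repeating the previous base."""
    previous = ""
    out = []
    for bit in bits:
        first, second = BASES_OF_BIT[bit]
        # Assumption of this sketch: use the alternative base if the first would repeat.
        base = second if previous == first else first
        out.append(base)
        previous = base
    return "".join(out)

def decode_header(bases):
    """Read the bits back: A or C means 0, G or T means 1."""
    return [BIT_OF_BASE[base] for base in bases]

header = encode_header([0, 0, 1, 1, 0])
assert decode_header(header) == [0, 0, 1, 1, 0]
```

Since each bit has two possible bases, the header can be read without knowing anything about the encoding that follows it, which is the point of keeping it this simple.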

