Skip to main content

Storing Information in DNA (efficiently)

A recent article in the Economist summarizes work done in Cambridge by Dr. Goldman, Dr. Birney and their teams.  They came up with a mapping for data into base pairs for storage in DNA; while today DNA storage is quite expensive, it is quickly decreasing in cost...and DNA has the advantage that information does not decay anywhere near as fast as other current storage mediums; 10's of thousands of years vs a handful.  The DNA format is also unlikely to change, so it may remain a more consistent technology than others.  
The team in Cambridge came up with a way to ensure that base pairs do not repeat, as sequencing errors are much higher when there are strings of the same base.  The encoding table is shown above.  First the digital data is converted to base 3, and then a differential coding is used to write so that no base pair ever repeats.  For example, if the last base was A, and we are looking to encode a 2, then we would write a T.

It struck me immediately that this was not a very efficient coding. Two digits in base 3 have 3^2 = 9 permutations, but the encoding table above has 12 possible transitions.  So, there is 25% wasted space in this encoding mechanism.  

The obvious way to optimize this is to use a (digital) base of  sqrt(12) = 2*sqrt(3).  With sqrt(3) being irrational, this might be a challenge :-)  However, it is easy to take a fixed estimate slightly smaller than sqrt(12), say 3.4641, and use this as the digital base; then we are only wasting a tiny fraction of the 12 transitions. [It would be even easier to use 3.45; the "Digits to be encoded", in the table above, would then be 0, 1.15, and 2.30, as shown below]

Previous 0.00   1.15 2.30 
A C G T
C G T A
G T A C
T A C G

This also points to the fact that although the DNA "format" may be static for many years, the encoding mechanism is most likely going to change over time.  Perhaps we need a standard header using a very simple A and C are 0, G and T are 1, in order to specify the encoding format for the rest of the string.


Popular posts from this blog

The Fourth R.

Reading, wRiting, aRithmetic, and algoRithms.  My wife and I were just brainstorming about this: how coding should be the next "basic" skill.  Of course, someone was ahead of us and posted this .  It is awesome to see Mozilla Hackasaurus referenced in this article.  It is a small world. In the early days of the printing press, scholars wrote the books; the press was simply used for production (see this article ).  As time went on, "average" people became familiar with the medium, and used it for their own messages.  We are at just that point with the Web.  Software Engineers write the code, and the Web distributes it.   Software Engineers are the algoRithm scholars of today.  They won't be for long.  Soon algoRithms will be taught starting in elementary school, along with the other three R's.

Timed math tests

You have 3.2 seconds to figure out the problem below. Alan knows 90% of the concepts behind the math test, and can do those 90% very quickly.  He always gets 90% on timed math tests. Bob knows 100% of the concepts, but is a slow worker.  In the timed math test, he gets 75%, but, if given an extra 10 minutes, would get 100%. Alan graduates with an A; Bob with a C. You are building a bridge. Who would you hire? Seems like everyone from Gates to Zuckerberg has problems with how education is carried out today.  I wish I had some of their clout and could help to change the system.