
Storing Information in DNA (efficiently)

A recent article in the Economist summarizes work done in Cambridge by Dr. Goldman, Dr. Birney and their teams. They came up with a mapping from data to bases for storage in DNA. DNA storage is quite expensive today, but its cost is falling quickly, and DNA has the advantage that information decays far more slowly than on other current storage media: tens of thousands of years versus a handful. The DNA format is also unlikely to change, so it may remain a more consistent technology than others.
The team in Cambridge came up with a way to ensure that the same base never appears twice in a row, as sequencing errors are much higher when there are runs of the same base. The encoding table is shown above. First the digital data is converted to base 3, and then a differential coding is used so that each base-3 digit selects one of the three bases that differ from the previous one. For example, if the last base was A, and we are looking to encode a 2, then we would write a T.
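
To make the differential step concrete, here is a minimal Python sketch. It is not the Cambridge team's implementation: the transition rows follow the encoding table (previous base, then the base written for digits 0, 1 and 2), while the function names and the choice of A as the assumed starting "previous" base are hypothetical, picked only for illustration.

```python
# Base written for each (previous base, base-3 digit) pair.
# Each string lists the bases for digits 0, 1, 2; a base never maps to itself.
TRANSITIONS = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def to_trits(value):
    """Return the base-3 digits of a non-negative integer, most significant first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return [0]
    digits = []
    while value:
        value, remainder = divmod(value, 3)
        digits.append(remainder)
    return digits[::-1]


def encode_trits(trits, previous="A"):
    """Write one base per digit, always choosing a base that differs from the last one.

    `previous` is an assumed starting base; it is not part of the output.
    """
    bases = []
    for trit in trits:
        previous = TRANSITIONS[previous][trit]
        bases.append(previous)
    return "".join(bases)


def decode_bases(bases, previous="A"):
    """Invert encode_trits: recover the base-3 digits from a string of bases."""
    trits = []
    for base in bases:
        trits.append(TRANSITIONS[previous].index(base))
        previous = base
    return trits
```

With these assumptions, `encode_trits([2], previous="A")` gives `"T"`, matching the example above, and `encode_trits(to_trits(5))` gives `"GC"`, since 5 is 12 in base 3.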

It struck me immediately that this was not a very efficient coding. Two digits in base 3 have 3^2 = 9 possible combinations, but the encoding table above has 4 x 3 = 12 possible transitions. So, by that count, 3 of the 12 transitions, or 25%, are wasted space in this encoding mechanism.

The obvious way to optimize this is to use a (digital) base of sqrt(12) = 2*sqrt(3). With sqrt(3) being irrational, this might be a challenge :-)  However, it is easy to take a fixed estimate slightly smaller than sqrt(12), say 3.4641, and use this as the digital base; then we are only wasting a tiny fraction of the 12 transitions. [It would be even easier to use 3.45; the "Digits to be encoded" in the table above would then be 0, 1.15, and 2.30, as shown below.]

Previous   0.00   1.15   2.30
A          C      G      T
C          G      T      A
G          T      A      C
T          A      C      G

This also points to the fact that although the DNA "format" may be static for many years, the encoding mechanism is most likely going to change over time. Perhaps we need a standard header using a very simple code, in which A and C are 0 and G and T are 1, to specify the encoding format for the rest of the string.
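
As a rough illustration of such a header, the Python sketch below maps bits to bases with A and C as 0 and G and T as 1. Everything beyond that mapping is an assumption made for the example: the function names are hypothetical, and the rule of picking whichever of the two candidate bases differs from the previous one (so the header itself never repeats a base) is one possible choice, not something any standard specifies.

```python
# The simple header code: A and C read as 0, G and T read as 1.
BIT_OF_BASE = {"A": 0, "C": 0, "G": 1, "T": 1}
BASES_FOR_BIT = {0: "AC", 1: "GT"}


def encode_header(bits):
    """Write one base per bit, avoiding a repeat of the previous base.

    Assumption: because each bit has two candidate bases, we take the first
    candidate unless it equals the previous base, in which case we take the other.
    """
    bases = []
    previous = ""
    for bit in bits:
        first, second = BASES_FOR_BIT[bit]
        base = second if previous == first else first
        bases.append(base)
        previous = base
    return "".join(bases)


def decode_header(bases):
    """Read the header back: only the A/C versus G/T distinction matters."""
    return [BIT_OF_BASE[base] for base in bases]
```

For example, `encode_header([0, 0, 1, 1, 0])` gives `"ACGTA"`, and `decode_header("ACGTA")` returns the original bits. A reader that knows nothing about the payload encoding could still parse such a header and then look up the format it names.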

