Skip to main content

Storing Information in DNA (efficiently)

A recent article in the Economist summarizes work done in Cambridge by Dr. Goldman, Dr. Birney and their teams.  They came up with a mapping for data into base pairs for storage in DNA; while today DNA storage is quite expensive, it is quickly decreasing in cost...and DNA has the advantage that information does not decay anywhere near as fast as other current storage mediums; 10's of thousands of years vs a handful.  The DNA format is also unlikely to change, so it may remain a more consistent technology than others.  
The team in Cambridge came up with a way to ensure that base pairs do not repeat, as sequencing errors are much higher when there are strings of the same base.  The encoding table is shown above.  First the digital data is converted to base 3, and then a differential coding is used to write so that no base pair ever repeats.  For example, if the last base was A, and we are looking to encode a 2, then we would write a T.

It struck me immediately that this was not a very efficient coding. Two digits in base 3 have 3^2 = 9 permutations, but the encoding table above has 12 possible transitions.  So, there is 25% wasted space in this encoding mechanism.  

The obvious way to optimize this is to use a (digital) base of  sqrt(12) = 2*sqrt(3).  With sqrt(3) being irrational, this might be a challenge :-)  However, it is easy to take a fixed estimate slightly smaller than sqrt(12), say 3.4641, and use this as the digital base; then we are only wasting a tiny fraction of the 12 transitions. [It would be even easier to use 3.45; the "Digits to be encoded", in the table above, would then be 0, 1.15, and 2.30, as shown below]

Previous 0.00   1.15 2.30 
A C G T
C G T A
G T A C
T A C G

This also points to the fact that although the DNA "format" may be static for many years, the encoding mechanism is most likely going to change over time.  Perhaps we need a standard header using a very simple A and C are 0, G and T are 1, in order to specify the encoding format for the rest of the string.


Popular posts from this blog

Timed math tests

You have 3.2 seconds to figure out the problem below. Alan knows 90% of the concepts behind the math test, and can do those 90% very quickly.  He always gets 90% on timed math tests. Bob knows 100% of the concepts, but is a slow worker.  In the timed math test, he gets 75%, but, if given an extra 10 minutes, would get 100%. Alan graduates with an A; Bob with a C. You are building a bridge. Who would you hire? Seems like everyone from Gates to Zuckerberg has problems with how education is carried out today.  I wish I had some of their clout and could help to change the system.

Decentralization, Democracy, and Well-Being

Those of us raised in Democratic societies take it for granted that those societies provide better well-being (for common individuals) than other forms of governance. At the heart of democracy is personal freedom and autonomy, backed by the rule of law. We also take for granted the interplay of decentralized versus centralized authority. Decentralization can mean many things, but here we refer to it in terms of power, authority, and decision making. The more authority individuals have, the more decentralized the power system in which they are operating.  Almost by definition the more democratic a system, the more decentralized it is, with the caveat that some agreed upon axioms exist, such as the rule of law and its enforcement. Of course, authority can be too decentralized leading to "every man for themselves", so we put limits on decentralization through that same rule of law. With the advent of decentralizing technologies , which make possible more decentraliz...