Could DNA store all the world's data?

Several news outlets this week have reported on the idea of storing the world’s data in strands of DNA. You can read one of those articles here. Below, Neil Lamb, HudsonAlpha’s director of educational outreach, gives a bit more insight into exactly how the process would work.

Scientists have developed a method for storing documents, images and sound files inside the strands of the DNA double helix. The technology could open new avenues to keep copies of your favorite photos, that short story you wrote in fifth grade or those home movies of Christmas and birthday parties. Best of all, the stored data would be safe for thousands of years and would take up less space than a tube of lipstick.
Let’s back up for a moment and discuss storing data. Information, whether from text, image or sound, is digitally encoded as long strings of 0s and 1s. Each of these digits is a “bit,” and eight bits make up a “byte” of information. A typed page is made up of about 2,000 bytes, while a movie download contains about a billion bytes. It’s been estimated that all of the world’s digital data takes up roughly three zettabytes (one zettabyte is a billion trillion bytes).
DNA also uses a code to store information. In this case the code is four chemical “bases”: adenine (A), thymine (T), guanine (G) and cytosine (C). Several years ago, scientists began to look at how the digital code of 1s and 0s could be stored inside DNA. The digital string of 0s and 1s is rewritten as a series of As, Ts, Cs and Gs. (Keep in mind, the DNA fragments used for storage have no biological function and are kept inside a vial rather than inside a cell.) When stored under particular conditions, the DNA is stable for tens of thousands of years. When it’s time to recover the information, the DNA is sequenced and the order of the bases is converted back to the corresponding bytes.
Early attempts to store information as DNA code directly mapped 0s and 1s onto the bases: for example, a 0 was represented by A or C and a 1 by T or G. Unfortunately, this approach is problematic when the string of 0s and 1s leads to a repeat in the DNA sequence, like CCCCC. Current DNA sequencing technology struggles to correctly identify these repeat regions, miscounting how many “Cs” are present and introducing errors into the numerical data.
Here’s where the recent media attention comes into play. Nick Goldman and colleagues at the European Bioinformatics Institute in the UK have devised a method to minimize the likelihood of these errors. Rather than map 0s and 1s directly onto DNA bases, they devised an intermediate code that prevents repeating bases. To further reduce errors, the code is split into fragments four different ways, with the breakpoints occurring at different locations each time. This way, if an error does occur, other copies of the same region can be used for comparison.
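
One way a repeat-free code with overlapping copies could work is sketched below in Python. This is a simplified illustration, not the researchers' actual scheme. It assumes each byte is rewritten as six base-3 digits, and that each digit picks one of the three bases that differ from the previous base. The fragment length of 100 and step of 25 are illustrative values chosen so that interior bases land in four fragments, and all function names are hypothetical.

```python
BASES = "ACGT"


def bytes_to_trits(data: bytes) -> list:
    """Write each byte as six base-3 digits (3**6 = 729 covers 0..255)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            byte, digit = divmod(byte, 3)
            digits.append(digit)
        trits.extend(reversed(digits))
    return trits


def trits_to_dna(trits: list, previous: str = "A") -> str:
    """Each digit selects one of the three bases that differ from the last one."""
    out = []
    for trit in trits:
        choices = [base for base in BASES if base != previous]
        previous = choices[trit]
        out.append(previous)
    return "".join(out)


def dna_to_trits(dna: str, previous: str = "A") -> list:
    """Reverse of trits_to_dna: recover which of the three choices was taken."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return trits


def trits_to_bytes(trits: list) -> bytes:
    """Regroup base-3 digits, six at a time, into bytes."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for trit in trits[i:i + 6]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def overlapping_fragments(dna: str, length: int = 100, step: int = 25) -> list:
    """Cut windows of `length` bases that start every `step` bases."""
    last_start = max(len(dna) - length, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail end is covered
    return [dna[s:s + length] for s in starts]


message = b"\x00\x00DNA"
dna = trits_to_dna(bytes_to_trits(message))
print(dna)
print(all(a != b for a, b in zip(dna, dna[1:])))     # True: no base repeats
print(trits_to_bytes(dna_to_trits(dna)) == message)  # True: round trip works
```

Even the two zero bytes at the start, which produced a long run of Cs under a direct mapping, now come out as an alternating pattern, because a base is never allowed to follow itself.
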
The scientific team encoded multiple files, including part of an MP3 recording of Martin Luther King’s “I have a dream” speech, a text file of all the sonnets of William Shakespeare and a PDF of the 1953 paper by Francis Crick and James Watson describing the structure of DNA. All told, 757,000 bytes of information were encoded on over 153,000 DNA fragments. The scientists estimate their approach, which is described online in the journal Nature, can store over two petabytes (or two million billion bytes) of information on a single gram of DNA. That’s a mind-boggling amount of information contained in something about the size of 15 grains of sugar.
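
For a rough sense of scale, the short Python calculation below combines two figures from this article: the estimate of three zettabytes for the world's digital data and the researchers' estimate of two petabytes per gram. It is back-of-the-envelope arithmetic only. The function name is made up for this example, and because the researchers say "over" two petabytes, the result is best read as an upper limit on the weight.

```python
ZETTABYTE = 10**21  # a billion trillion bytes
PETABYTE = 10**15   # a million billion bytes

WORLD_DATA_BYTES = 3 * ZETTABYTE       # the article's estimate of the world's data
DENSITY_BYTES_PER_GRAM = 2 * PETABYTE  # the researchers' estimate, rounded down


def grams_of_dna(total_bytes: int, bytes_per_gram: int) -> float:
    """Grams of DNA needed to hold total_bytes at the given density."""
    return total_bytes / bytes_per_gram


grams = grams_of_dna(WORLD_DATA_BYTES, DENSITY_BYTES_PER_GRAM)
print(f"{grams:,.0f} grams, or about {grams / 1000:,.0f} kilograms")
# 1,500,000 grams, or about 1,500 kilograms
```

By these numbers, all of the world's digital data would fit in roughly a ton and a half of DNA.
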
Speed and cost are the two biggest drawbacks to DNA-based storage. It took four days to synthesize the code into DNA, and the process of sequencing and decoding the fragments required two weeks. The synthesis and decoding process costs $12,620 per megabyte of information, millions of times more expensive than storing data on magnetic tape. However, as technology continues to improve, both the price and timeframe are expected to drop dramatically. If current trends continue, the researchers estimate that in less than a decade DNA-based storage will be cost-effective for information that needs to be stored for 50 years or more. This could be especially useful for long-term archiving of governmental, historical or scientific data that would only rarely be accessed.
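
To see what that price means in practice, here is a small Python calculation for the billion-byte movie download mentioned earlier. It assumes a megabyte is one million bytes and that cost scales in direct proportion to file size. Both are simplifications made for this example, and the function name is hypothetical.

```python
COST_PER_MEGABYTE_USD = 12_620  # synthesis plus decoding, from the article
BYTES_PER_MEGABYTE = 1_000_000  # assumption: decimal megabyte


def storage_cost_usd(n_bytes: int) -> float:
    """Estimated cost of writing and reading n_bytes, assuming linear scaling."""
    return n_bytes / BYTES_PER_MEGABYTE * COST_PER_MEGABYTE_USD


movie_bytes = 1_000_000_000  # "about a billion bytes"
print(f"${storage_cost_usd(movie_bytes):,.0f}")  # $12,620,000
```

At today's prices, archiving a single movie this way would cost more than twelve million dollars, which is why the expected drop in cost matters so much.
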
If you’ve ever had to search for a way to pull data from an old floppy disk, zip drive or VHS tape, you know how quickly digital storage technologies change. The researchers note DNA has been storing biological information for more than 3 billion years, meaning the odds are high it will be around in the future, available for conversion into whatever new technology civilizations are using to share data. Hang on to your CDs, DVDs and thumb drives a little bit longer, but this technology is certainly worth watching.

Dr. Neil Lamb is HudsonAlpha’s director of educational outreach. Trained as a human geneticist, he now focuses his energy on creating programs and activities that help Alabama’s teachers, students and the public understand genetics and biotechnology.