Two Ways to Solve a Genomics Jigsaw Puzzle

How do we determine the sequence of a DNA molecule that might be hundreds of thousands to millions of base pairs in size? Not all at one time. In this episode, we will learn about two strategies that emerged for sequencing small pieces of DNA and then merging them together into a virtual copy of an entire genome.
Once the goal of obtaining a human genome sequence had been set by research scientists and several government agencies around the world, the big question was how to organize the effort. Any genome of a cellular organism, but especially the human genome, is a massive amount of information. How do you gather the information and how do you piece it all back together at the end? There was no technology available in the late 1980s and early 1990s, and there is still none to this day, that allows one to jump onto a giant DNA strand and read its sequence from one end to the other. You have to break the genome into lots of pieces, figure out the sequence of all the pieces, and put all the sequences back together in the right order so that the virtual genome equals the real physical genome.

Two approaches ended up in a race with each other to sequence the human genome. The larger group was a public consortium of government-funded labs around the world, but mainly in the U.S., the U.K., and Japan. This effort was first led by James Watson of Watson & Crick fame, then by Francis Collins, who saw it through to completion. The public effort focused on separating the human genome into individual chromosomes and sub-chromosome pieces to organize the sequencing, while simultaneously developing really fine-scale physical maps of each chromosome to help assemble the sequence reads back in the right order. Now, quite a bit of mapping information was already known for the human genome, but much more detail was needed for this map-based strategy.

The second, smaller effort was a private affair led by the for-profit company Celera Genomics and several big corporate donors. Celera Genomics and its sister non-profit research organization called The Institute for Genomic Research, TIGR for short, were founded by Craig Venter, a very successful biochemist turned entrepreneur who had once worked at NIH. Venter and his team felt that they had a better strategy: faster, cheaper, and more applicable to any genome of interest. Why wait to develop fine-scale physical maps of a genome? Why not just break the genome into random pieces and sequence them? But here is the rub. You don’t know which random pieces you are sequencing until you have sequenced them. How many random pieces do you have to sequence in order to get virtually all of them? In other words, how hard do you have to work to achieve your goal?

This is actually a problem we have all dealt with on more than one occasion since we were little kids. Think about a really big bag of M&Ms of your favorite flavor. You know that there are seven colors represented in the bag. If you randomly pour out seven M&Ms into your hand, the probability that each color will be represented exactly once is not one. There is an element of random chance in terms of which M&Ms fall out of the bag or, in the case of a genome, which DNA fragments you randomly sequence. Craig Venter and his colleagues knew this was true of their so-called shotgun strategy for genome sequencing. In fact, they made use of a statistical distribution, called the Poisson distribution, that models such random events. The Poisson distribution gives us an equation for calculating the probability of a particular outcome. For example, if the average number of any particular M&M color in your sample is one, what is the probability that any given M&M color was not seen at all? Using m to represent the average and x to represent the number of interest, the Poisson distribution equation is:

$$P_{x,m} = \frac{m^{x} \cdot e^{-m}}{x!}$$

For the M&M question:

$$P_{0,1} = \frac{1^{0} \cdot e^{-1}}{0!} = \frac{1 \cdot 0.37}{1} = 0.37$$
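If you would like to check that arithmetic yourself, here is a minimal Python sketch of the same equation. The function name `poisson_probability` is just an illustrative choice for this episode and does not come from any library.

```python
import math

def poisson_probability(x, m):
    """Probability of exactly x hits when the average number of hits is m."""
    return (m ** x) * math.exp(-m) / math.factorial(x)

# The M&M question: with an average of 1 per color,
# how likely is it that a given color is not seen at all?
p_missing = poisson_probability(0, 1)
print(round(p_missing, 2))  # 0.37
```

Changing the two arguments lets you ask other questions of the same kind, such as the chance of seeing a color exactly twice when the average is one.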

This means that if we pour only 7 M&Ms out of the bag, there is a 37% probability that a given color will not be represented. That is not good enough, whether our goal is getting one of each color of M&M or getting every piece of a genome represented. We can use the Poisson distribution equation to determine how hard we would need to look. We can try different values for m, the average number of times we have seen a particular M&M color or genome fragment, in order to determine the probability of getting no hits. As we have already seen, the probability of getting no hits for a particular M&M color or genome fragment, given an average number of hits of 1, is 0.37. For an average number of hits of 2, the probability of getting no hits for a particular genome fragment is 0.14. For an average number of hits of 3, the probability of no hits for a particular genome fragment is 0.05. Now we are getting somewhere. If we sequence enough genome fragments to represent 3 times the number needed to cover the entire genome, we should have about 95% of it done. If we go up to an average number of hits of 5, the probability of no hits for a particular genome fragment is about 0.01. Now we have about 99% of it done. No need to map a genome first. Just sequence enough pieces to cover the genome 5 times or more.
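To see how well the Poisson prediction holds up, here is a rough simulation sketch in Python. It rests on a simplifying assumption that is not part of the actual shotgun method: the genome is treated as a row of equal-sized slots, and each sequenced fragment lands on exactly one slot chosen at random. The function name `simulate_missed_fraction` and the slot count are made up for illustration.

```python
import math
import random

def simulate_missed_fraction(genome_slots, coverage, seed=0):
    """Fraction of slots never hit after genome_slots * coverage random fragments.

    Toy model (an assumption): each fragment covers exactly one slot.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = [0] * genome_slots
    for _ in range(genome_slots * coverage):
        hits[rng.randrange(genome_slots)] += 1
    return sum(1 for h in hits if h == 0) / genome_slots

# Compare the Poisson prediction e^(-m) with the simulated result
# for the same average numbers of hits used above.
for m in (1, 2, 3, 5):
    predicted = math.exp(-m)
    simulated = simulate_missed_fraction(50_000, m)
    print(f"{m}x coverage: predicted missed {predicted:.3f}, simulated {simulated:.3f}")
```

In this toy model the simulated fraction of missed slots should land close to the predicted value at each coverage level, which is the same pattern described above: the more times over you sample, the less of the genome is left unseen.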

We will see the shotgun strategy in action in our next episode – the first genome sequenced from a cellular organism. See you then.