Genetics Before Genomes

How genes first came to be described, then connected to the proteins they usually encode and eventually to two methods for sequencing DNA, the stuff of genes.
I am Brad Goodner. Welcome back to Genomics Revolution. To fully understand the impact of having an organism’s complete DNA sequence, its genome, we need to put it into the proper context set by the previous 150 years. Genetics as an experimental science got its start in the middle of the 19th century with Mendel’s inheritance trials on pea plant phenotypes and with Miescher’s biochemical isolation of nuclein, what we now call DNA. In the early 20th century, before DNA was confirmed as the genetic material, Mendel’s ideas on the rules of inheritance in sexually reproducing eukaryotes were generalized into the concept of a gene as a definable unit of genetic information controlling a particular phenotype. The work of Beadle and Tatum cemented this concept into “one gene encodes one protein which catalyzes one particular biochemical reaction, typically one step in a biochemical pathway.”

For most of the 20th century, scientists studied one gene at a time. Their typical approach involved isolating mutants – individual organisms with one or more mutations in a gene of interest that had a noticeable impact on a particular organismal phenotype. Mutations are nothing more than changes in a DNA sequence, but we didn’t have ways to determine a DNA sequence until the 1960’s. Scientists figured out that changes in a DNA sequence can potentially change the sequence of amino acid residues in a protein encoded by that DNA sequence.

By the time I was in high school in the late 1970’s, two groups had worked out methods that allowed labs all over the world to sequence DNA routinely. One method, the Maxam-Gilbert chemical method, started with a DNA strand labeled at one end with a radioactive phosphorus in the 5’ phosphate group. Four tubes containing large amounts of the labeled DNA strand were each exposed to different chemical conditions that led to breaks in a DNA strand at specific nucleotide residues. In one tube, breaks occurred at purine nucleotide residues. Remember that A and G are the bases in purine nucleotides. In another tube, breaks occurred only at G residues. In a 3rd tube, breaks occurred at pyrimidine, C or T, nucleotide residues. In a 4th tube, breaks occurred only at C residues. Imagine a DNA strand 24 nucleotide residues long with A, C, G, and T residues alternating. ACGTACGTACGT… and so on.
In the first tube, breaks will be induced at A or G purines. Some of the DNA strands will be broken at position 1, others at position 3, others at position 5, others at position 7, and so on. In the second tube where breaks only occur at G residues, some of the strands will be broken at position 3, others at 7, and so on at a 4-base interval. In the 3rd tube, breaks will occur at C or T pyrimidines. Some DNA strands will be broken at position 2, others at position 4, and so on. In the 4th tube, breaks will occur at C residues – some at position 2, others at position 6, and so on at a 4-base interval. If we run the contents of each tube through a jello-like sieving matrix that separates DNA molecules on the basis of size, the smallest DNA fragments will run fastest. Remember our starting DNA strand, a 24-mer ACGTACGTACGT… The fragment breaking at position 1 will run the fastest and will only show up in the tube 1 lane. The fragment breaking at position 2 will be next but it will show up in both the tube 3 lane and the tube 4 lane. You could usually read 100-200 bases of sequence from one gel run. The drawbacks were that there were several reactions to carry out and lots of radioactivity involved that no one wanted to be exposed to.
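The four-tube logic can be sketched in a few lines of Python. This is only an illustration of the worked example, not a model of the real chemistry: the names TUBES, break_positions, and read_gel are made up for this sketch, and it assumes every break gives a clean band at exactly that position. For the repeating ACGT strand, the C residues sit at positions 2, 6, 10, and so on.

```python
# Hypothetical sketch of the four Maxam-Gilbert reactions (names are illustrative).
# Each tube breaks the labeled strand at the bases listed for it.
TUBES = {
    "A+G": "AG",  # tube 1: purines
    "G": "G",     # tube 2: G only
    "C+T": "CT",  # tube 3: pyrimidines
    "C": "C",     # tube 4: C only
}


def break_positions(strand):
    """Return, per tube, the 1-based positions where breaks can occur."""
    return {
        tube: [pos for pos, base in enumerate(strand, start=1) if base in bases]
        for tube, bases in TUBES.items()
    }


def read_gel(lanes, length):
    """Read the gel from the fastest band (position 1) to the slowest."""
    sequence = []
    for pos in range(1, length + 1):
        if pos in lanes["G"]:        # band in both the G and A+G lanes
            sequence.append("G")
        elif pos in lanes["A+G"]:    # band in the A+G lane only
            sequence.append("A")
        elif pos in lanes["C"]:      # band in both the C and C+T lanes
            sequence.append("C")
        elif pos in lanes["C+T"]:    # band in the C+T lane only
            sequence.append("T")
        else:
            sequence.append("?")     # no band seen at this position
    return "".join(sequence)


strand = "ACGT" * 6  # the 24-mer from the example
lanes = break_positions(strand)
for tube, positions in lanes.items():
    print(tube, positions)
print(read_gel(lanes, len(strand)))
```

Notice that a base is never read from one lane alone. An A is a band in the purine lane with no partner in the G lane, and a T is a band in the pyrimidine lane with no partner in the C lane.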

In 1979, Sutcliffe published the complete DNA sequence of one of the earliest recombinant DNA molecules, the cloning plasmid pBR322. Plasmids are nonessential extra DNA molecules, usually circles, found in Bacteria, Archaea, and some Eucarya. The plasmid pBR322 is a man-made recombinant molecule, built from several natural DNA pieces. Sutcliffe sequenced pBR322 using the Maxam-Gilbert chemical method. Its 4362-base-pair sequence was one of the first DNA sequences I worked with when I started graduate school in 1983. The next year, I went to my first research conference where I heard Richard Barker give a talk about the sequence of a key piece of DNA, the T-DNA. The T-DNA or transferred DNA is a piece of bacterial DNA involved in the plant disease crown gall. Barker had almost single-handedly used the Maxam-Gilbert chemical method to determine a sequence of 24,595 nucleotide residues that encoded 14+ proteins. This was a tremendous feat at that time. My fellow grad students and I were awed, but we also fearfully joked that we hoped Barker had already sired his children because of all the radioactivity and nasty chemicals involved. Because of these risks, the Maxam-Gilbert chemical method lost out over time to an enzymatic method of DNA sequencing perfected by Fred Sanger and colleagues.

Sanger’s group built their enzymatic method around the way that cells naturally make new DNA strands by using the enzyme DNA Polymerase. This enzyme needs three components to build a new DNA strand. One, an old single strand of DNA is needed as a template. The template strand is complementary to the new strand that will be made. By that I mean, A’s on the template strand will interact with T’s on the new strand and vice versa. G’s on the template strand will interact with C’s on the new strand and vice versa. Two, DNA Polymerase cannot start a new DNA strand from scratch, rather it has to add onto a pre-existing piece of single-stranded RNA in the cell or a piece of single-stranded DNA in the test tube. This starting piece is called a sequencing primer. By choosing the right sequencing primer, one can determine the sequence at different places along a large DNA strand. Three, DNA Polymerase catalyzes the formation of the new DNA strand using deoxyribonucleotides, the monomer subunits in a DNA strand polymer. In the Sanger enzymatic method, 4 tubes were set up with the same template DNA strand, the same starting complementary sequencing primer, and all 4 deoxyribonucleotides. The sequencing primer carried a radioactive label on one end. In the first tube, a little bit of a modified A nucleotide was added. The A was different in that once it was added onto a growing DNA strand, no more nucleotides could be added to it. So in this tube, the new DNA strands will each end with a modified A residue, but since the modified A is rare the termination of DNA strand growth will be rare and random in terms of which A is the termination point. In the second tube, a little bit of a modified C nucleotide was added. A little bit of modified G went into tube 3 and a little bit of modified T into tube 4. Similar to the Maxam-Gilbert chemical method, the final labeled DNA strands in each tube were separated by size by running them through a jello-like sieving matrix called a sequencing gel.
If we take the same 24-base sequence as before, ACGTACGTACGT…, to be the sequence of the new strand being made, then the lane on the sequencing gel using the results from tube 1 will show labeled fragments of the sequencing primer plus 1, plus 5, plus 9, plus 13, plus 17, and plus 21 in size. The lane for results from tube 2 will show fragments of the sequencing primer plus 2, plus 6, plus 10, plus 14, plus 18, and plus 22 in size. Eventually, the Sanger method became the method of choice for automated DNA sequencing machines and the radioactivity involved was replaced with four different fluorescent tags added to the four modified DNA terminating nucleotides. A laser at the end of the sequencing gel excites the fluorescent tag on each DNA fragment as it exits and the resulting fluorescence color tells us which nucleotide was at the end of the fragment.
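The chain-termination idea can also be sketched in Python. Treat this as a rough illustration under stated assumptions: the function names are invented for this sketch, the new strand is assumed to read ACGTACGT…, the template is paired with it position by position (ignoring the fact that real strands run in opposite directions), and every possible termination point is assumed to show up as a band.

```python
# Hypothetical sketch of Sanger chain termination (names are illustrative).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Pair each base with its partner, position by position (orientation ignored)."""
    return "".join(COMPLEMENT[base] for base in strand)


def sanger_lanes(new_strand):
    """For each modified nucleotide, list fragment sizes as 'primer plus N'."""
    return {
        terminator: [n for n, base in enumerate(new_strand, start=1) if base == terminator]
        for terminator in "ACGT"  # A in tube 1, C in tube 2, G in tube 3, T in tube 4
    }


def read_gel(lanes):
    """Order all bands from shortest to longest and note the lane of each."""
    bands = sorted((size, base) for base, sizes in lanes.items() for size in sizes)
    return "".join(base for _, base in bands)


new_strand = "ACGT" * 6            # the 24-base sequence being read
template = complement(new_strand)  # what DNA Polymerase copies from
lanes = sanger_lanes(new_strand)
print("template:", template)
print("tube 1 (modified A):", lanes["A"])
print("tube 2 (modified C):", lanes["C"])
print("read from gel:", read_gel(lanes))
```

Unlike the chemical method, each band here appears in exactly one lane, so reading the gel is just a matter of walking up the ladder one fragment size at a time. The fluorescent version works the same way, with four colors in a single lane standing in for the four separate tubes.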

In 1978, Sanger and coworkers published a paper reporting the complete sequence of the DNA virus phiX174, the first viral genome sequenced. The viral DNA of 5386 nucleotide residues encodes 10 proteins.

In 1982, the U.S. National Institutes of Health, NIH, established a public database for DNA and protein sequences that came to be called GenBank. By the end of 1982, there were 606 DNA sequences deposited in GenBank totaling over 680 thousand bases. Within a year, the amount of DNA available had tripled. By the end of 1987, there were over 10 million bases of DNA sequence deposited. By 1992, there were over 100 million bases available in GenBank. Most scientists sequenced just one to a few genes at a time, but change was coming and in 1995 it happened. More on that in later episodes. See you next time.