The First Cellular Genome - Part 2
The first genome sequence of a cellular organism was published in 1995. In this episode, we see what a sequence can tell us about the biology of an organism.
Welcome back to Genomics Revolution. I am Brad Goodner. Last time we were together, we walked through the strategy used by Craig Venter’s team at TIGR, The Institute for Genomic Research, to sequence the first genome of a cellular organism, Haemophilus influenzae strain Rd.
Today, we will finish up our analysis of the July 1995 SCIENCE article by focusing on the biological implications of knowing the complete sequence of an organismal genome. The H. influenzae strain Rd genome is a single circular chromosome of 1,830,137 base pairs. Before this work, the sequences of 122 protein-coding genes and their surrounding noncoding regions had been deposited in GenBank, the world’s foremost database of gene data. The authors of the genome paper, Robert Fleischmann and 39 coworkers, used a published computer algorithm and the previously known coding and noncoding sequences from H. influenzae to construct a model of how the coding sequences differed from the noncoding sequences. This may sound odd, but it turns out that the parts of any given genome that code for proteins, regardless of the specific proteins involved, share key characteristics: certain dinucleotides, trinucleotides, tetranucleotides, and so on are more or less abundant than predicted from the single-nucleotide base composition of the genome. These characteristics are specific to each species. Once the computer algorithm had been “trained” to distinguish coding from noncoding regions, Fleischmann and coworkers put the entire genome sequence through the algorithm to predict putative protein-coding genes. For the H. influenzae Rd genome, the algorithm predicted 1743 protein-coding genes, or about 1 protein-coding gene per 1000 base pairs. This rough estimate has held up remarkably well since then across the Bacteria and Archaea domains, but the spacing is much tighter than that seen in the Eucarya domain, where genes are typically spread much farther apart.
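To make the idea of over- and under-represented short "words" concrete, here is a minimal Python sketch. It is not the algorithm Fleischmann and coworkers used; it only illustrates the underlying statistic, comparing how often each k-mer is observed with how often it would be expected from single-base frequencies alone. The function name `kmer_odds_ratios` and the toy sequence are assumptions made up for this illustration.

```python
from collections import Counter


def kmer_odds_ratios(seq, k=2):
    """Return observed/expected frequency for each k-mer present in seq.

    'Expected' is the product of the single-base frequencies of the
    k-mer's letters, i.e. what base composition alone would predict.
    A ratio above 1 means the k-mer is over-represented; below 1,
    under-represented. K-mers that never occur are simply absent.
    """
    seq = seq.upper()
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}

    windows = n - k + 1  # number of overlapping k-mer positions
    kmer_counts = Counter(seq[i:i + k] for i in range(windows))

    ratios = {}
    for kmer, count in kmer_counts.items():
        expected = 1.0
        for base in kmer:
            expected *= base_freq[base]
        ratios[kmer] = (count / windows) / expected
    return ratios


# Hypothetical toy sequence, not real H. influenzae data.
toy = "ATATATAT"
ratios = kmer_odds_ratios(toy, k=2)
```

In a real gene finder, ratios like these would be tabulated separately for known coding and known noncoding sequence, and a stretch of new sequence would be scored by which table it resembles more closely. That training and scoring step is deliberately left out of this sketch.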
Of the 1743 predicted protein-coding genes, 1354 had 30% or greater protein sequence identity to genes previously sequenced in other organisms. Evolution keeps what works! However, that does not mean that we know what all of these proteins actually do. Of those 1354 genes, 1007 were similar to genes that encode proteins of known function, while 347 were similar to genes encoding “hypothetical” or “conserved hypothetical” proteins. That leaves 389 protein-coding genes with no similarity to previously sequenced genes. Some of these genes turned out to be shared with other organisms whose versions just hadn’t been sequenced yet. However, some of them appear to be unique to the genus Haemophilus. The same pattern appears to hold for every sequenced genome: each seems to carry some genes unique to its own lineage. Evolution is also eternally creative.
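The counts in this episode fit together arithmetically, and a few lines of Python make that easy to check. The numbers below are the ones quoted in the episode; the variable names are just labels chosen for this sketch.

```python
# Figures quoted in the episode for H. influenzae Rd.
genome_bp = 1_830_137        # genome size in base pairs
predicted_genes = 1743       # predicted protein-coding genes
similar_to_known = 1354      # >= 30% protein identity to a sequenced gene
known_function = 1007        # similar to genes of known function
hypothetical = 347           # similar to "hypothetical" genes

# Genes with no similarity to anything previously sequenced.
no_match = predicted_genes - similar_to_known

# Average spacing, the "about 1 gene per 1000 base pairs" rule of thumb.
bp_per_gene = genome_bp / predicted_genes

# Share of predicted genes in each category, as percentages.
pct_similar = 100 * similar_to_known / predicted_genes
pct_no_match = 100 * no_match / predicted_genes

print(f"No match: {no_match} genes")
print(f"Average spacing: {bp_per_gene:.0f} bp per gene")
print(f"Similar to known genes: {pct_similar:.1f}%")
print(f"No similarity: {pct_no_match:.1f}%")
```

The two similarity categories add up to the 1354 genes with matches, the remainder is the 389 genes with no match, and the average spacing works out to roughly 1,050 base pairs per gene.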
Fleischmann and coauthors found many other interesting biological features from their analysis of the H. influenzae Rd genome. Remember that the Rd strain is a nonpathogenic relative of known pathogenic strains. The Rd genome shows evidence of its pathogenic heritage as some virulence genes and regulatory sequences remain, but it also shows several losses of key virulence genes.
Every genome sequenced since this 1995 breakthrough answers some longstanding questions, illuminates some previously unknown biological capacities, and brings up even more questions and hypotheses for future work.
In future episodes, we will learn more about both well-known organisms and recently discovered ones through their genomes. See you next time.