There is a realistic chance that we will sequence the entire diversity of genomes on Earth during our lifetime—or, in the worst-case scenario, the next generation. The Human Genome Project was completed back in 2003, after all. That’s quite an achievement: In less than 20 years, we have progressed from the first draft of a single human genome to being able to easily and cheaply sequence the genome of any interested individual.
This gives us the potential to revolutionize our understanding of the biosphere and will allow us to harness microbes for all kinds of purposes. Very soon, for instance, genome sequencing and analysis will become an integral part of a standard medical check-up and enable personalized medicine.
However, all this will only happen if we master the analysis of the deluge of data we’re creating. And that might be a bigger challenge than sequencing all the genomes themselves.
The genomic revolution started in 1995, when J. Craig Venter’s team at The Institute for Genomic Research (TIGR) published the first genome sequences of two tiny, humble bacteria: Haemophilus influenzae and Mycoplasma genitalium. Since then, the accumulation of genome sequences of many life forms, from bacteria to fungi to plants to animals (including, of course, Homo sapiens), has grown exponentially. Thanks to the startling advances of metagenomics, whereby researchers sequence all the DNA from a particular environmental sample without isolating organisms and growing them in the laboratory, we’re closer to achieving our goal of total genomic understanding.
These sequencing methods have made realistic a prospect that, until now, would have been considered sheer fantasy: decoding the language that makes up the entire diversity of life on Earth. This does not mean that literally all unique genomes will be sequenced, however. The number of species on Earth is unknown, but current estimates point to several million species of eukaryotes (organisms with complex cells containing a nucleus, like ourselves or fungi), and perhaps several billion species of prokaryotes (organisms with simple cells, like bacteria). Still, a collection of representative genomes for all existing species seems a realistic goal, and for some particularly important species, including humans, there will be not one but thousands, if not millions, of representative genome sequences.
And then what? What could, and should, we do with all that genomic data, which stores the information of all organisms on Earth?
To begin with, we will know the entire range of biodiversity on Earth. This will allow reconstruction of the biochemical networks that define the functioning of every ecosystem and will eventually allow us to manipulate ecology.
An unprecedented, deep understanding of the evolution of life will become possible as well. Complete knowledge of the history of life is unachievable because we will never have access to extinct evolutionary intermediates, but reconstruction by comparison of extant genomes will yield detailed information on these long-gone life forms. This might seem a somewhat abstract goal, but as the geneticist Theodosius Dobzhansky famously put it, “Nothing in biology makes sense except in the light of evolution.” Such understanding will revolutionize biological science.
Analysis of the complete genomic database will also enable understanding of the emergence of pathogenic bacteria and viruses. The relevance of such advances in this case is obvious—what if we could predict and control potential viral outbreaks?
But the most immediate value will be a complete catalog of all the proteins genes encode. Scientists are already well aware of the most abundant genes and proteins (think, for instance, of hemoglobin, which fills our blood and allows us to breathe). However, there is a huge mass of rare ones that could prove fascinating and, in quite a few cases, very useful for various technologies.
Take CRISPR. The recent identification of the relatively rare genes that encode CRISPR enzymes is a perfect example of this type of protein discovery. Discovered only a year or two ago, these enzymes are already changing the practice of genome engineering. The same applies to genes responsible for the synthesis of novel antibiotics, which will be essential for coping with the antibiotic-resistant bacteria that are currently on the rise.
But will these bright prospects be realized?
It’s far from being a given. Even though we live in an information-dominated era, the principal bottleneck lies in the area of algorithms and computing power.
The computational power of human civilization keeps growing exponentially, but the amount of data we are generating grows even faster. And algorithms are struggling to keep up.
Think of it this way: The more genomes we sequence, the more time is required to compare each genome sequence with every other one. Because the computation required for this all-against-all comparison grows in proportion to the square of the number of sequences, doubling the number of genomes roughly quadruples the work, and this basic task becomes intractable quite quickly. This problem plagues genome research even today, and it will become central to all genome analysis.
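To get a feel for how quickly this quadratic growth bites, here is a minimal sketch in Python. It only counts the genome pairs in a naive all-against-all comparison. The function name and the collection sizes are illustrative assumptions, not figures from any real database, and real pipelines use indexing and clustering shortcuts to avoid comparing every pair.

```python
def pairwise_comparisons(n_genomes: int) -> int:
    """Count the unordered genome pairs in a naive all-against-all comparison."""
    # Each of the n genomes is paired with the other n - 1;
    # dividing by 2 avoids counting the pair (A, B) twice.
    return n_genomes * (n_genomes - 1) // 2


# Hypothetical collection sizes, chosen only to illustrate the trend.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} genomes: {pairwise_comparisons(n):>15,} pairwise comparisons")
```

Each tenfold increase in the number of genomes multiplies the number of comparisons roughly a hundredfold, which is why hardware that merely keeps pace with the growth of the data still falls behind.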
Up to now, all true achievements of genome analysis, be it inference of key evolutionary events or discovery of new enzymes for genome editing, have been driven by a combination of automatic computation and human expertise. This approach is being challenged and will become completely impractical in the new era of total sequencing. Because of the sheer number of sequenced genomes, all analysis will have to be fully automated.
If this sounds like a call for replacement of human expertise with artificial intelligence, it is. To master the giga-databases of the future, new generations of AI are essential, as is hardware such as quantum computers.
Sequencing the entire range of Earth’s biodiversity is not a pipe dream anymore. In fact, it might become tangible reality within the lifetime of the current generation of scientists. However, to turn this wealth of data into valuable information that can help society, as opposed to useless noise, tremendous—and not entirely foreseeable—advances in computer science and technology are essential. Such are the paradoxes of the information age.