Scientists’ success in adapting Covid-19 vaccines for use against coronavirus variants of concern will turn in part on how quickly they can spot concerning mutations in the virus’s genetic makeup. For that, a computer that comprehends human language may help.
Arrangements of amino acids that form viral proteins can be analogized to sequences of words that imbue languages such as English with meaning, according to researchers at the Massachusetts Institute of Technology, who are using machine-learning algorithms developed for natural language to assess which mutations hold the potential to evade the body’s immune defenses.
The MIT researchers have trained such algorithms in a task they call constrained semantic change search (CSCS) that enables them to study viral mutations, including those that develop into highly infectious coronavirus variants such as those that first emerged in the UK and South Africa. The insights carry particular urgency for regions such as Africa, where the spread of the novel coronavirus among largely unvaccinated populations increases the opportunity for concerning mutations to occur.
“A good analogy can go a long way,” Bryan Bryson, a researcher at Boston’s Ragon Institute of MGH, MIT and Harvard and one of the scientists leading the initiative, explained recently. “A virus can mutate to retain functions required for survival, or preserve grammar, while managing to look different to the immune system and undergo high semantic change.”
Bryson compares the process of viral evolution to the structure of a sentence that relies on grammatical rules and sequence, or semantics, to convey meaning. He illustrates such evolution as follows:
Sticking with the analogy, a viral mutation must be grammatically correct and retain meaning to be able to replicate successfully. As with the change in the second sentence (from left) above, the so-called spike protein on the surface of the coronavirus that enables it to latch on to human receptor cells may mutate slightly but still resemble the original enough for the immune system to recognize and attack it.
In contrast, the protein may deviate, as suggested by the third sentence from left, so that, by analogy, it is neither grammatically correct nor sensible, and can no longer be “read” by receptors; that is, bind to them. Or, as with “eats,” in the sentence at far right, a mutation may observe “protein grammar” but change sufficiently that antibodies made by the immune system may no longer bind to it, as if the virus appears in disguise. That can result in a more infectious variant.
“We can think about this landscape that a virus explores as it mutates as subject to constraints, where we want to preserve grammar but change semantics in order to survive,” says Bryson. “Our language model is learning the probability of a specific amino acid given the sequence context.”
Bryson and his colleagues trained the algorithms to assess mutations in three proteins: one found on the surface of the influenza virus, another found on the surface of HIV, and a third on the coronavirus spike.
For all three viruses, CSCS identified mutations that showed the highest potential for escape based on variations in their sequences. Among 891 distinct coronavirus spike protein sequences the researchers surveyed, one came from a strain that reinfected someone who had recovered last year from Covid-19. Only three other sequences in the set showed both higher semantic change and so-called grammaticality.
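To make the idea of ranking mutations on both measures concrete, here is a minimal Python sketch. It rests on assumptions not spelled out above: that each candidate mutation has already been given a semantic-change score and a grammaticality score by a trained language model, and that candidates are prioritized by adding their ranks on the two measures. The function names, the `beta` weight, the mutation labels, and the scores are all hypothetical and invented for illustration; they are not the researchers' data or code.

```python
from typing import Dict, List, Tuple


def rank_ascending(values: Dict[str, float]) -> Dict[str, int]:
    """Map each key to its rank, where 1 is the smallest value."""
    ordered = sorted(values, key=lambda name: values[name])
    return {name: position + 1 for position, name in enumerate(ordered)}


def prioritize(
    semantic_change: Dict[str, float],
    grammaticality: Dict[str, float],
    beta: float = 1.0,
) -> List[Tuple[str, float]]:
    """Order mutations by combined rank, highest escape potential first.

    Assumption: the combined score is rank(semantic change) plus
    beta times rank(grammaticality); beta is a hypothetical weight.
    """
    sem_rank = rank_ascending(semantic_change)
    gram_rank = rank_ascending(grammaticality)
    combined = {
        name: sem_rank[name] + beta * gram_rank[name]
        for name in semantic_change
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for four hypothetical mutations (not real data).
semantic_change = {"mut_A": 0.9, "mut_B": 0.2, "mut_C": 0.8, "mut_D": 0.1}
grammaticality = {"mut_A": 0.7, "mut_B": 0.9, "mut_C": 0.05, "mut_D": 0.1}

ranking = prioritize(semantic_change, grammaticality)
for name, score in ranking:
    print(name, score)
```

In this toy data, the hypothetical `mut_A` comes out on top because it ranks high on both measures, while `mut_C` changes a great deal semantically but falls down the list because it scores poorly on grammaticality, much like the sentence in the analogy that no longer makes sense.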
Besides being able to quantify the potential for mutations to escape, the research may pave the way for vaccines that broaden the body’s defenses against variants or that protect recipients against more than one virus, such as flu and the novel coronavirus, in a single shot.