With a strongly interdisciplinary approach to its work, DeepMind has brought together experts from the fields of structural biology, physics, and machine learning to apply cutting-edge techniques to predict the 3D structure of a protein based solely on its genetic sequence.
The system, AlphaFold, which we have been working on for the past two years, builds on years of prior research in using vast genomic data to predict protein structure. The 3D models of proteins that AlphaFold generates are far more accurate than any that have come before – making significant progress on one of the core challenges in biology.
What is the protein folding problem?
Proteins are large, complex molecules essential in sustaining life. Nearly every function our body performs – contracting muscles, sensing light, or turning food into energy – can be traced back to one or more proteins and how they move and change. The recipes for those proteins – called genes – are encoded in our DNA.
What any given protein can do depends on its unique 3D structure. For example, antibody proteins that make up our immune systems are ‘Y-shaped’, and are akin to unique hooks. By latching on to viruses and bacteria, antibody proteins are able to detect and tag disease-causing microorganisms for extermination. Similarly, collagen proteins are shaped like cords, which transmit tension between cartilage, ligaments, bones, and skin. Other types of proteins include CRISPR and Cas9, which act like scissors and cut and paste DNA; antifreeze proteins, whose 3D structure allows them to bind to ice crystals and prevent organisms from freezing; and ribosomes that act like a programmed assembly line, which help build proteins themselves.
But figuring out the 3D shape of a protein purely from its genetic sequence is a complex task that scientists have found challenging for decades. The challenge is that DNA only contains information about the sequence of a protein’s building blocks called amino acid residues, which form long chains. Predicting how those chains will fold into the intricate 3D structure of a protein is what’s known as the “protein folding problem”.
The bigger the protein, the more complicated and difficult it is to model because there are more interactions between amino acids to take into account. As noted in Levinthal’s paradox, it would take longer than the age of the universe to enumerate all the possible configurations of a typical protein before reaching the right 3D structure.
Why is protein folding important?
The ability to predict a protein’s shape is useful to scientists because it is fundamental to understanding its role within the body, as well as diagnosing and treating diseases believed to be caused by misfolded proteins, such as Alzheimer’s, Parkinson’s, Huntington’s and cystic fibrosis.
DeepMind is especially excited about how it might improve its understanding of the body and how it works, enabling scientists to design new, effective cures for diseases more efficiently. As DeepMind acquires more knowledge about the shapes of proteins and how they operate through simulations and models, it opens up new potential within drug discovery while also reducing the costs associated with experimentation. That could ultimately improve the quality of life for millions of patients around the world.
An understanding of protein folding will also assist in protein design, which could unlock a tremendous number of benefits. For example, advances in biodegradable enzymes – which can be enabled by protein design – could help manage pollutants like plastic and oil, helping us break down waste in ways that are more friendly to our environment. In fact, researchers have already begun engineering bacteria to secrete proteins that will make waste biodegradable, and easier to process.
To catalyse research and measure progress on the newest methods for improving the accuracy of predictions, a biennial global competition called the Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP) was established in 1994, and has become the gold standard for assessing techniques.
How can AI make a difference?
Over the past five decades, scientists have been able to determine shapes of proteins in labs using experimental techniques like cryo-electron microscopy, nuclear magnetic resonance or X-ray crystallography, but each method depends on a lot of trial and error, which can take years and cost tens of thousands of dollars per structure. This is why biologists are turning to AI methods as an alternative to this long and laborious process for difficult proteins.
Fortunately, the field of genomics is quite rich in data thanks to the rapid reduction in the cost of genetic sequencing. As a result, deep learning approaches to the prediction problem that rely on genomic data have become increasingly popular in the last few years. DeepMind’s work on this problem resulted in AlphaFold, which it submitted to CASP. It is proud to be part of what the CASP organisers have called “unprecedented progress in the ability of computational methods to predict protein structure,” placing first in rankings among the teams that entered (DeepMind’s entry is A7D).
DeepMind’s team focused specifically on the hard problem of modelling target shapes from scratch, without using previously solved proteins as templates. It achieved a high degree of accuracy when predicting the physical properties of a protein structure, and then used two distinct methods to construct predictions of full protein structures.
Using neural networks to predict physical properties
Both of these methods relied on deep neural networks that are trained to predict properties of the protein from its genetic sequence. The properties DeepMind’s networks predict are: (a) the distances between pairs of amino acids and (b) the angles between chemical bonds that connect those amino acids. The first development is an advance on commonly used techniques that estimate whether pairs of amino acids are near each other.
DeepMind trained a neural network to predict a separate distribution of distances between every pair of residues in a protein. These probabilities were then combined into a score that estimates how accurate a proposed protein structure is. DeepMind also trained a separate neural network that uses all distances in aggregate to estimate how close the proposed structure is to the right answer.