Algorithms Reveal DNA

“Cracking” the code of our genome, understanding the external factors that control gene expression, determining the mechanisms behind genetic diseases: this delicate and laborious work could soon be automated thanks to new techniques in artificial intelligence, bringing genomics into a whole new era.


In this image, each line corresponds to a breast tumour. Each column represents a gene that is more or less expressed in that tumour: green means strongly expressed, red weakly. Unsupervised classification algorithms reveal groups of tumours with similar genetic characteristics and sort them into five categories, which helps doctors to choose a suitable treatment.


Imagine a text of 6 billion letters, 600 times more than in Marcel Proust’s Remembrance of Things Past. Now imagine that this text uses a four-letter alphabet (A, T, C, G) instead of our 26-letter Latin alphabet. This text, previously indecipherable, is our genome. Specific to each individual, it encodes a message needed by our cells to operate properly. Some sequences in the code may also be harmful and cause the appearance of diseases. Understanding this text is a sort of Holy Grail in biology, and especially in genomics: the study of the structure, function and evolution of genomes. It took half a century of scientific discovery and technological prowess before we first managed to sequence the human genome at the start of the 2000s. This project, sometimes known as the “Apollo project of biology”, opened the way to analysing this immense text.

Since then, technology has come on in leaps and bounds, so much so that sequencing a human (or non-human) genome has almost become routine, possible in a few hours at a reasonable cost. At the same time, other technologies have been developed. Analysing the epigenome (the molecular modifications that affect how DNA operates without altering the code) is one such technology. Another is studying the transcriptome, i.e. all the RNA molecules produced by transcription of the genome, which play a crucial role in the production of proteins and the operation of the cell. Together, all this data forms what is known as a molecular portrait.

How can we analyse and make sense of the huge quantities of data produced by these high-throughput technologies? Using artificial intelligence, of course! And in particular statistical learning algorithms, which “learn” and improve as they process vast quantities of data. In this way, they can accomplish complex tasks, such as annotating genomic data. This delicate work consists of identifying functional elements in the genome: genes, or gene-regulating sequences, that fulfil a specific biological function.

Imagine opening one of the chapters in the human genome: a long series of letters A, T, C, G, with no apparent structure, appears before you. How can you decipher this language and understand the message encoded in the text? How can you identify the regions coding for genes and their fine structure, or the positions on the DNA where the proteins regulating the expression of these genes bind? Following the biologist’s approach, you would probably start by looking for repetitions or irregularities on various scales, to gradually identify hidden structures and work out a sort of grammar.
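To give a flavour of this first step, here is a minimal sketch (in Python, on an invented toy sequence) of how one might count the occurrences of every short “word” of k letters, the usual starting point for spotting over-represented motifs:

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every k-letter word (k-mer) in a DNA sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Invented toy sequence in which the motif "TATA" is over-represented.
seq = "ATGTATATATACGCGTATAGGC"
counts = kmer_counts(seq, 4)
print(counts.most_common(3))
```

On real genomes the same idea is applied at scale, comparing observed k-mer frequencies with what chance alone would predict.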


This network represents the interactions (arrows) between proteins in a human cell. It was deduced from the analysis of messenger RNAs by a learning algorithm. This approach could make it possible to identify new therapeutic targets for certain diseases.




The strength of statistical learning algorithms lies in replicating this approach to automatically process the 6 billion letters of the genome. One class of algorithms, called graphical models, is especially efficient at this. They enable scientists to build their knowledge into a probabilistic model of the data, and to infer relevant information by letting the algorithm optimise the model’s parameters on real data. For annotating DNA, special graphical models called Markov random fields are used: they infer the annotation of the genome automatically, from regularities the model discovers in the DNA sequence. These models belong to a category known as unsupervised learning methods, as they learn to annotate the genome without being supplied with examples of regions whose annotation is already known. Graphical models are highly flexible and can be used in a variety of situations. For example, another application of these methods consists of incorporating epigenetic data, i.e. information about the molecular modifications around the DNA. The international Encode project did this in 2012, aiming to draw up a precise annotation of the functional parts of the human genome using molecular portraits measured in various cell types (1).
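To make this concrete, here is a toy sketch of a close cousin of these models: a two-state hidden Markov model whose hidden states play the role of annotation labels, decoded with the classic Viterbi algorithm. All the probabilities are invented for illustration (real annotation tools are far richer):

```python
import numpy as np

# A toy two-state hidden Markov model: the hidden states play the role of
# annotation labels. All probabilities here are invented for illustration.
STATES = ["intergenic", "gene"]
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

log_start = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.95, 0.05],     # intergenic -> ...
                             [0.10, 0.90]]))   # gene -> ...
# Emissions: "gene" regions are assumed GC-rich in this toy model.
log_emit = np.log(np.array([[0.35, 0.15, 0.15, 0.35],    # intergenic
                            [0.10, 0.40, 0.40, 0.10]]))  # gene

def viterbi(sequence):
    """Most probable hidden-state path for a DNA string under the toy HMM."""
    obs = [ALPHABET[c] for c in sequence]
    score = log_start + log_emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        step = score[:, None] + log_trans   # step[i, j] = score[i] + trans[i, j]
        back.append(step.argmax(axis=0))
        score = step.max(axis=0) + log_emit[:, o]
    path = [int(score.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    path.reverse()
    return [STATES[s] for s in path]

labels = viterbi("ATATATATGCGCGCGCGCATATATAT")
print("".join("G" if s == "gene" else "-" for s in labels))
```

The algorithm labels the GC-rich middle of the sequence as “gene” without ever having been shown an annotated example, which is precisely the unsupervised logic described above.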

However, the best way of decoding DNA is to compare genomes. Continuing the literary metaphor, analysing a single long book can partly reveal the secrets of a language, through the words or grammatical structures repeated within the text. But it is only when two or more books are compared that meaning really begins to emerge. By grouping words that frequently appear together by subject, we see similarities emerge between books, depending on their content or their author. In the same way, comparative genomics analyses genomes by comparing them, and is one of the most powerful approaches for extracting knowledge from genomic data.




Historically, comparative genomics first compared species, making it possible to reconstruct the tree of life proposed by Darwin and to identify the genes whose functions are specific to a family of species. The graphical models used to identify the structure of a single genome can also be extended to process several genomes simultaneously. Rather than comparing the genomes of several species, such as humans and mice, we can also compare the molecular portraits of different individuals within the same species. Using this approach, correlations can be drawn between the variations observed in a molecular portrait and properties such as the yield of a plant or the risk of developing a disease.

To do so, comparative genomics mainly uses statistical models and unsupervised learning algorithms. The goal? To identify the similarities and differences between genomes. For example, dimension-reduction or unsupervised classification techniques can be used to identify homogeneous subgroups within a heterogeneous population. These techniques were first used in cancer research at the start of the 2000s, when it became possible to analyse the full transcriptomes of several hundred tumours. The comparisons they enabled revealed huge molecular differences between certain types of tumours. Breast cancers, for instance, were divided into five major classes according to their molecular profile; the prognosis and recommended treatment differ from one class to another (2).
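A minimal sketch of such an unsupervised classification (plain k-means, on a small simulated expression matrix) may help fix ideas. The “tumours”, “genes” and subtypes below are entirely synthetic, and real studies use far richer data and methods:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expression matrix: 20 "tumours" x 50 "genes". The first 10
# rows over-express one block of genes, the last 10 another block.
profiles = rng.normal(0.0, 0.5, size=(20, 50))
profiles[:10, :25] += 3.0   # simulated subtype 1 signature
profiles[10:, 25:] += 3.0   # simulated subtype 2 signature

def kmeans(X, k, n_iter=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[dists.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(profiles, k=2)
print("cluster labels:", labels)
```

The algorithm recovers the two simulated subtypes without ever being told they exist; the breast-cancer classification mentioned above follows the same logic, on real transcriptomes and with more classes.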

Today, this classification goes even further. We are now able to sequence different samples within the same tumour, and even to sequence single cells. This sheds light on the molecular differences within the tumour of a single patient. Using unsupervised learning tools such as graphical models or matrix factorisation techniques, we can reconstruct the molecular history of the tumour from these data, and automatically identify the processes involved in its appearance and progression. For example, by analysing the mutations observed in a tumour’s DNA, we can determine whether a cancer appeared following exposure to the sun or to tobacco. Surprisingly, the matrix factorisation techniques used in this type of investigation are similar to those used by on-demand video platforms such as Netflix to personalise their recommendations. In genomics, these precious data can help doctors to characterise the disease of a given patient more precisely, and thus offer personalised treatment. Beyond medical data, some statistical learning algorithms can also infer more fundamental knowledge.
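By way of illustration, the core of such a matrix factorisation can be sketched in a few lines. Here a toy mutation-count matrix is generated from two invented “signatures” and then re-factorised using the classic Lee and Seung multiplicative updates for non-negative matrix factorisation; real mutational-signature analyses are considerably more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy mutation-count matrix: 30 tumours x 12 mutation types, generated as
# a mix of two invented "signatures" (think UV-like and tobacco-like).
true_sigs = np.array([[5, 5, 5, 5, 0, 0, 0, 0, 1, 1, 1, 1],
                      [0, 0, 0, 0, 5, 5, 5, 5, 1, 1, 1, 1]], dtype=float)
exposures = rng.uniform(0, 2, size=(30, 2))
counts = exposures @ true_sigs + rng.uniform(0, 0.1, size=(30, 12))

def nmf(V, rank, n_iter=1000, seed=0):
    """Lee & Seung multiplicative updates for V ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(V.shape[0], rank))
    H = rng.uniform(0.1, 1.0, size=(rank, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

W, H = nmf(counts, rank=2)
err = np.linalg.norm(counts - W @ H) / np.linalg.norm(counts)
print(f"relative reconstruction error: {err:.3f}")
```

The rows of H play the role of mutational signatures and the rows of W the exposure of each tumour to them; the non-negativity constraint is what makes the factors interpretable.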

As in any science, biology accumulates knowledge by confronting hypotheses with observations. Historically, hypotheses were formulated by scientists using their intuition, and experiments were conducted to confirm or refute them. By producing huge quantities of data, genomics has somewhat reversed this model of research; it is now common to start by generating a lot of data, for example sequencing hundreds of genomes, and then analyse them using automatic methods based on statistics and artificial intelligence. In this way, hypotheses emerge from the data.




Of course, these hypotheses then need to be confirmed by further targeted experiments. Let’s look at the example of gene expression regulation. Since the work of François Jacob, Jacques Monod and André Lwoff, who won the Nobel Prize for Medicine in 1965, we know that each of the 20,000 genes encoded in our DNA may or may not be expressed, i.e. copied in the form of messenger RNA to produce a protein, depending on the presence or absence of other proteins known as transcription factors. By binding to a strand of DNA, transcription factors control the expression of their target gene. For a given target gene, how can we identify the transcription factors that control it, and more generally all the factors that affect its expression? One solution consists of collecting transcriptomics data from several hundred samples subjected to varying experimental conditions, and comparing them. If we observe that a target gene A is systematically expressed in conditions where a transcription factor B is also expressed, we can suppose that B controls A. But when several target genes and several transcription factors have to be considered at the same time, the situation becomes more complicated. And this is where algorithms can be very useful.
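The co-expression reasoning described above can be sketched in a few lines. The gene names and the linear relationship below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy transcriptomics: expression of a transcription factor "B" across
# 200 conditions, a target gene "A" driven by B, and an unrelated gene "C".
tf_b = rng.normal(size=200)
gene_a = 0.8 * tf_b + rng.normal(scale=0.3, size=200)   # regulated by B
gene_c = rng.normal(size=200)                           # independent of B

corr_ab = np.corrcoef(tf_b, gene_a)[0, 1]
corr_cb = np.corrcoef(tf_b, gene_c)[0, 1]
print(f"corr(B, A) = {corr_ab:.2f}   corr(B, C) = {corr_cb:.2f}")
```

The strong correlation between B and A, and the near-zero one between B and C, is the statistical signal from which regulation is hypothesised; correlation alone cannot prove causation, which is exactly why the richer models described next are needed.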




Bayesian networks in particular offer a rigorous statistical framework for inferring the interactions between several genes and clarifying the relations between a specific gene and a specific transcription factor. Bayesian networks are special graphical models that combine graph theory (*) with statistics to infer causal relations, such as the fact that one gene is regulated by another.

For several years, other methods based on random forests or lasso regression (two popular statistical learning techniques) have also proved their worth for this task: they received a best-performance award in an international competition to reconstruct, as accurately as possible, the regulatory networks of bacteria and yeast (3). This opens the way to numerous applications in biotechnology and medicine, such as identifying new therapeutic targets.
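To illustrate the lasso side of this approach (the random-forest variants follow the same logic of one regression per target gene), here is a minimal coordinate-descent lasso selecting the candidate regulators of one simulated target gene. The data and the sparsity parameter are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: expression of 5 candidate transcription factors across 100
# samples; the simulated target gene is driven by factors 0 and 3 only.
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=100)

def lasso(X, y, lam, n_sweeps=200):
    """Coordinate descent for: argmin_w ||y - Xw||^2 / (2n) + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's own contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

weights = lasso(X, y, lam=0.1)
regulators = [j for j in range(len(weights)) if abs(weights[j]) > 1e-6]
print("candidate regulators:", regulators)
```

The L1 penalty drives the weights of irrelevant factors exactly to zero, so the surviving non-zero coefficients directly name the candidate regulators; repeating this regression for every target gene assembles the full regulatory network.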

As well as understanding these interactions, artificial intelligence excels in the art of prediction: predicting the yield of a plant from its DNA; evaluating the risk of a cancer recurring, using the gene expression and DNA mutations measured in a biopsy, and adapting the treatment accordingly; predicting the efficacy of a treatment from the molecular portrait of a cancer; and so on.

Today these predictive tasks are mainly carried out by supervised statistical learning methods. Take the example of evaluating the risk of recurrence of a cancer. This approach consists of collecting molecular portraits of the tumours of a group of patients at the moment of initial diagnosis, and then following these patients over several years. A “recurrence” label is associated with the molecular portraits of patients who suffer a new cancer within five years, and a “no recurrence” label with the others. Then, using these labelled data, a learning algorithm is trained to predict the category of a tumour (recurrence or no recurrence) from the molecular portrait made at the moment of initial diagnosis. In practice, these genomic data are combined with other available information about the disease, such as the size of the tumour or the patient’s age, which can also affect the risk of recurrence. This classification task is often characterised by the fact that, for each patient, we have a very large number of molecular measurements (the expression level of 20,000 genes, mutations at millions of positions in the DNA, etc.), while the number of patients included in such studies is often limited to a few hundred.
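A minimal sketch of such a supervised classifier (a plain logistic regression trained by gradient descent, on entirely simulated patients) may help fix ideas; real studies would add cross-validation, regularisation and far more careful modelling:

```python
import numpy as np

rng = np.random.default_rng(4)

# Entirely simulated cohort: 200 "patients" x 30 molecular features; in
# this simulation the relapse risk depends on the first two features only.
X = rng.normal(size=(200, 30))
true_w = np.zeros(30)
true_w[:2] = [2.0, -2.0]
prob = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=200) < prob).astype(float)   # 1 = "recurrence"

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Logistic regression by plain gradient descent (no intercept, no penalty)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

w = fit_logistic(X, y)
pred = (1 / (1 + np.exp(-(X @ w))) > 0.5).astype(float)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Note that even this toy example has 30 features for only 200 patients; with 20,000 genes and a few hundred patients, the situation inverts, which is exactly the imbalance discussed next.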

This imbalance between the incredible quantity of data per individual and the more modest number of individuals poses a problem for the effectiveness of learning algorithms. To overcome what statisticians call the “curse of dimensionality”, projects are under way to collect data from large cohorts of individuals (see opposite). At the same time, research in mathematics and computer science to improve high-dimensional statistical learning techniques is continuing apace!


(1) M. Hoffman et al., Nature Methods, 9, 473, 2012.

(2) C. Perou et al., Nature, 406, 747, 2000.

(3) D. Marbach et al., Nature Methods, 9, 796, 2012.

(*) Graph theory is a discipline that studies abstract models of networks comprised of nodes connected by lines. Studying the interactions between genes can be formalised as a graph problem.



Like the United States, the United Kingdom and China, France launched in 2016 the Genomic Medicine France 2025 plan, an ambitious programme to develop the use of genomics in the care pathway. Sequencing platforms capable of sequencing 235,000 genomes a year by 2020 have been created throughout the country.

The huge quantities of data collected will be processed in computing centres and used to improve knowledge of the diseases treated, in particular rare diseases and cancers. This approach will help to personalise the treatment of patients.



Strength in unity. This saying is especially true of the International Cancer Genome Consortium, which brings together 88 research teams around the world. The aim of this organisation? To study in depth the genome of 25,000 tumours and all the factors that regulate the expression of their genes. All these molecular portraits of cancers, obtained using techniques including artificial intelligence, are used to understand the mechanisms at work in the development of various categories of tumours. And to treat them better in the future.




Jean-Philippe Vert


Jean-Philippe Vert is a professor in the mathematics and applications department of the École Normale Supérieure, Director of Research at Mines ParisTech where he leads the bioinformatics centre, and is head of a team working on modelling cancer at Institut Curie Paris.



November 11, 2018

