Genome Data Science


// Institute for Research in Biomedicine (IRB Barcelona) / Cancer Science programme / @GenomeDataLab

We're funded by the ERC StG "HYPER-INSIGHT"

In the @GenomeDataLab, we strive to understand the links between mutational processes, natural selection, gene function and phenotype by means of statistical genome analyses.

In particular, we use cutting-edge computational techniques and machine learning methodologies for analyses of massive genomic, epigenomic and transcriptomic data sets.

We aim to answer outstanding questions in biology and medicine by insightful analysis of data originating from human tumors (somatic mutations, chromosomal alterations, transcriptomes), human populations (germline variants), metagenomes (including human microbiomes), microbial genomes and phenomics data.

Group members:

Fran Supek

IRB group leader, ICREA Professor, EMBO YIP

publication list


PhD student

FPI/Severo Ochoa fellow


Thomas Vatter


Juan de la Cierva fellow


PROBIST Marie Curie fellow

"The best thing about being a statistician is that you get to play in everyone's backyard." -- John Tukey.

Marcel McCullough Figureras

PhD student

Ignasi Toledano Martín

PhD student

Daniel Ortiz Martinez

senior research assistant

Miguel Martín Álvarez


PhD student

AGAUR/FI fellow

Bruno Fito Lopez

MSc student

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." -- John Tukey

Associated lab members:

Francisco Fuster Tormo

bioinformatician / PhD student

based at the Solé lab @ Institut Josep Carreras

Mischan Vali-Pour Jamnani

PhD student

based at Lehner lab @ Centre for Genomic Regulation (CRG)

Former GenomeDataScience group members:

  • Jose Espinosa Carrasco, senior research assistant / bioinformatician
  • Ekaterina Zhuravleva, visiting PhD student
  • Michel Owusu, postdoc
  • Aleksandra Karolak, postdoc
  • Marta Consuegra Martinez, postdoc
  • Ingrid Tomljanović, ERASMUS+ Master Student
  • Jordi Piqué Sellés, summer student (Math4life program)
  • Matej Mihelčić, visiting PhD student
  • Albert Lahat, research assistant

The research interests of the Genome Data Science group are organized into four themes:


Unraveling mutational processes. Mutations are the fuel of carcinogenesis and it is imperative to learn what causes them and how they drive evolution in general, and cancer evolution in particular. We have shown that somatic mutations are unevenly distributed across the human genome due to differential activity of DNA mismatch repair (MMR), which preferentially protects gene-rich regions (Supek & Lehner 2015 Nature).

Moreover, motivated by the discoveries of APOBEC3 mutagenesis in tumors, we found another prevalent process that creates clustered mutations in many cancer types -- error-prone MMR, evident as the mutational signature of DNA polymerase eta (POLH). The histone mark H3K36me3 is an important determinant of both the standard, error-free MMR and the non-canonical, error-prone MMR (Supek & Lehner 2017 Cell).


Genomic signatures of natural selection. Most somatic mutations found in cancer cells are ‘passengers’, with little phenotypic consequence. Detecting the few ‘driver’ mutations among them is challenging, yet crucial to understanding carcinogenic transformation. We have previously discovered that synonymous mutations, i.e. those that occur in gene coding regions but do not change the amino acid sequence, commonly drive cancer by affecting splicing patterns of oncogenes (Supek et al. 2014 Cell).

Moreover, we have learnt how the quality control pathway of nonsense-mediated mRNA decay (NMD) decides which mRNAs to degrade (Lindeboom et al. 2016 Nat Genet), and used these rules of NMD to reveal patterns of positive and negative selection on tumor suppressor genes and on essential genes.


Automated inference of gene function. Genome sequencing technologies are rapidly advancing, providing an abundance of genomes of prokaryotic and eukaryotic species, and also of populations thereof. This presents an opportunity to learn about the function of the ~1/3 of the genes for which, remarkably, a biological role is still not known.

We have devised a methodology to infer gene function from evolutionary patterns in codon biases, which serve as a proxy for the evolution of gene expression levels (Krisko et al. 2014 Genome Biol). We also proposed 'metagenome phyletic profiles', a compact representation of environmental DNA sequencing data -- including human microbiomes -- that can predict gene function using machine learning (Vidulin et al. 2018 Microbiome).


Genetic basis of phenotypes. Various kinds of -omics data accumulate rapidly and are increasingly organized into tidy, structured repositories. In contrast, phenomics data, while very valuable, are less often collected in a systematic manner and encoded in computable formats. This hampers the discovery of genes that underlie various phenotypes.

We have used machine learning to text-mine the scientific literature and annotate microbial species with >400 phenotypic traits (Brbic et al. 2016 Nucl Acids Res), and to suggest their genetic basis, including prevalent epistasis in gene repertoires. One example is the genomes of pathogenic bacteria, which tend to encode proteomes resistant to unfolding, thereby protecting the microbes from oxidative stress (Vidović et al. 2014 Cell Rep).

Open source software from the group:

  • HyperClust by David Mas-Ponte.
    • A statistical framework to detect clustered mutations in genomes, while accounting for mutation rate heterogeneity and for the estimated timing of the mutations.
  • BioPanPipe by Daniel Ortiz-Martinez.
    • A genomics pipeline implementing a variety of tools for variant calling, covering point mutations, indels, copy number changes and LOH, MSI analysis and structural variants. It also includes tools for downloading data from genomics databases.
  • FastRandomForest2 (beta) by Jordi Piqué Sellés.
    • A re-implementation of the Random Forest classifier (RF) for the Weka machine learning environment, bringing massive speed and memory use improvements.

Highlighted publications:

Human genetic diseases differ in whether NMD typically aggravates or alleviates the effects of premature termination codons (PTCs) // Failure to trigger NMD is a cause of ineffective gene inactivation by CRISPR–Cas9 gene editing // NMD strongly determines the efficacy of cancer immunotherapy, with only transcripts that escape NMD predicting a response

Density of somatic mutations across megabase-sized chromosomal domains can differentiate human tissues // Driver mutations are poor classifiers of cancer (sub)type, while passenger mutations are highly predictive // Mutational signatures and regional mutation density are highly complementary in classifying tumors

Loss of activity of an H3K9 methyltransferase doesn't alter the global landscape of mutations in chemically induced tumors // DNA replication time and H3K36me3 histone mark, not chromatin accessibility, are determinants of mutation rates // H3K9me2/3-depleted tumors are genomically unstable and, after a prolonged latency, very aggressive

An increasing availability of microbiome DNA sequencing data provides an opportunity to infer gene function in a systematic manner // Metagenome phyletic profiles (MPPs) can accurately predict 826 Gene Ontology functional categories // MPPs derived from diverse environments infer distinct, non-overlapping sets of gene functions

A statistical method, ALFRED, tests Knudson’s two-hit hypothesis to systematically identify inherited cancer predisposing genes // We identify novel genes, such as the chromatin modifier NSD1, which cause cancer through germline variants and somatic loss-of-heterozygosity // 1 in 50 tumors is associated with novel ALFRED genes

Mutation clusters in cancer genomes provide fingerprints of mutagenic mechanisms // Error-free mismatch repair lowers the mutation rate in H3K36me3-marked active genes // Error-prone repair using POLH also targets H3K36me3, contributing driver mutations // UV and alcohol increase error-prone repair, targeting mutations toward active genes.

Matched exome and transcriptome data can systematically elucidate the rules of NMD targeting in human tumors, explaining ¾ of the variance in NMD efficiency. Applying our NMD model identifies signatures of positive and negative selection on nonsense mutations in human tumors and provides a classification for tumor-suppressor genes.

"Prediction is very difficult, especially about the future." -- Niels Bohr.

Somatic mutation rates exhibit tissue-specificity coupled to regional changes in DNA replication timing and gene expression. A temporal deconvolution of mutational signatures in microsatellite-instable tumors of the colon, stomach and uterus demonstrates that post-replicative MMR is the cause of the megabase-scale mutation rate variability in the human genome.

Enrichments of somatic mutations indicate that ~1 in 5 synonymous mutations in oncogenes are cancer drivers. Involvement in known exonic splicing motifs and association with RNA-Seq data implicate many causal synonymous mutations in altered splicing. The 3’ UTRs of dosage-sensitive oncogenes also harbour causal mutations.

The changes in codon adaptation in orthologous gene families can systematically predict function of many genes by employing machine learning to rule out confounding variables. We have experimentally validated novel roles in adaptation to environmental stressors (oxygen, heat, salinity) for tens of E. coli genes.

We have systematically annotated >3,000 prokaryotic taxa with >400 phenotypes, while drawing on comparative genomics and text mining techniques. This reveals thousands of gene families causally involved in various microbial traits, as well as pervasive epistasis that has shaped gene repertoires of these organisms. 

Comparative analyses of genomes, from bacteria through fungi to humans and human tumors, have revealed many links between genes' biological roles and the accrual of synonymous mutations. The evolutionary trace of codon bias patterns across homologous genes may be examined to learn about a gene’s relevance to various phenotypes, or, more generally, its function in the cell.

We gratefully acknowledge our funders:

European Research Council

ERC Starting Grant #757700 HYPER-INSIGHT "Insight into genome maintenance and cancer vulnerabilities provided by an extreme burden of somatic mutations"

The Spanish Ministry of Science, Innovation and Universities, via grant BFU2017-89833-P "RegioMut".

Core funding and a student fellowship are provided by the Severo Ochoa excellence award to the IRB Barcelona.

Fran Supek is funded by the ICREA Research Professor program.

Fran Supek is an EMBO Young Investigator.

The Croatian Science Foundation, via grant AIGEN "Augmented intelligence for prediction, discovery and understanding in genomics and pharmacogenomics"

"Try not. Do… or do not. There is no try." -- Yoda.