Research lines

The research interests of the Genome Data Science laboratory are organized into four broad themes:


Unraveling mutational processes. Mutations are the fuel of carcinogenesis and it is imperative to learn what causes them and how they drive evolution in general, and cancer evolution in particular. We have shown that somatic mutations are unevenly distributed across the human genome due to differential activity of DNA mismatch repair (MMR), which preferentially protects gene-rich regions (Supek & Lehner 2015 Nature).

Moreover, motivated by the discoveries of APOBEC3 mutagenesis in tumors, we found another prevalent process that creates clustered mutations in many cancer types -- error-prone MMR, evident as the mutational signature of DNA polymerase eta (POLH). The histone mark H3K36me3 is an important determinant of both the standard, error-free MMR and the non-canonical, error-prone MMR (Supek & Lehner 2017 Cell).


Genomic signatures of natural selection. Most somatic mutations found in cancer cells are ‘passengers’, with little phenotypic consequence. Detecting the few ‘driver’ mutations among them is challenging, yet crucial for understanding carcinogenic transformation. We have previously discovered that synonymous mutations, i.e. those that occur in gene coding regions but do not change the amino acid sequence, commonly drive cancer by affecting splicing patterns of oncogenes (Supek et al. 2014 Cell).

Moreover, we have learnt how the quality control pathway of nonsense-mediated mRNA decay (NMD) decides which mRNAs to degrade (Lindeboom et al. 2016 Nat Genet), and used these rules of NMD to reveal patterns of positive and negative selection on tumor suppressor genes and on essential genes.


Automated inference of gene function. Genome sequencing technologies are rapidly advancing, providing an abundance of genomes of prokaryotic and eukaryotic species, and also of populations thereof. This presents an opportunity to learn about the function of the ~1/3 of the genes for which, remarkably, a biological role is still not known.

We have devised a methodology to infer gene function from evolutionary patterns in codon biases, which serve as a proxy for the evolution of gene expression levels (Krisko et al. 2014 Genome Biol). We also proposed 'metagenome phyletic profiles', a compact representation of environmental DNA sequencing data -- including human microbiomes -- that can predict gene function using machine learning (Vidulin et al. 2018 Microbiome).


Genetic basis of phenotypes. Various kinds of -omics data accumulate rapidly and are increasingly organized into tidy, structured repositories. In contrast, phenomics data, while very valuable, are less often collected in a systematic manner and encoded in computable formats. This hampers the discovery of genes that underlie various phenotypes.

We have used machine learning to text-mine the scientific literature and annotate microbial species with >400 phenotypic traits (Brbic et al. 2016 Nucl Acids Res) and to suggest their genetic basis, including prevalent epistasis in gene repertoires. One example is the genomes of pathogenic bacteria, which tend to encode proteomes resistant to unfolding, thereby protecting the microbes from oxidative stress (Vidović et al. 2014 Cell Rep).