[4] mining microbial genomes

From E. coli to humans, the function of a substantial fraction of genes remains unknown; we think that machine learning approaches can help close this gap.

Shedding light on 'dark matter' genes. Genome sequencing technologies provide an abundance of data, presenting an opportunity to learn about the function of the roughly one-third of genes in genomes that remain poorly characterized. We inferred gene function from evolutionary patterns in codon usage biases, which can serve as a proxy for the evolution of gene expression levels; these patterns suggested mechanisms of stress adaptation in E. coli (Krisko et al. 2014 Genome Biol). We also proposed 'metagenome phyletic profiles', a compact representation of environmental DNA sequencing data (including human microbiomes) that can be used to predict gene function with machine learning (Vidulin et al. 2018 Microbiome).
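To give a sense of what a codon usage bias looks like in practice, here is a minimal Python sketch that computes, for one coding sequence, how often each codon is used relative to the other codons for the same amino acid. It is an illustration only and not the method of the papers cited above; the function name `synonymous_codon_fractions` and the example sequence are made up for this sketch, and it assumes the standard genetic code and an in-frame DNA sequence.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def synonymous_codon_fractions(cds):
    """Return {codon: fraction of its amino acid's codons} for one sequence.

    Hypothetical helper for illustration; assumes `cds` is in frame.
    Codons containing ambiguous bases (e.g. N) are skipped.
    """
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable_length, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Total number of observed codons per amino acid ('*' marks stop codons).
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n

    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence: Met, Ala, Ala, Ala, Lys, stop.
    example = "ATGGCTGCTGCCAAATAA"
    for codon, fraction in sorted(synonymous_codon_fractions(example).items()):
        print(codon, CODON_TABLE[codon], round(fraction, 3))
```

In the toy sequence, alanine is encoded twice by GCT and once by GCC, so the two codons get fractions of 2/3 and 1/3. Comparing such fractions across genes and genomes is one simple way to start looking for the kind of bias discussed above.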

Genetic basis of phenotypes. Various kinds of -omics data are increasingly organized into tidy, structured repositories, whereas phenomics data are less often collected systematically and encoded in computable formats.

We have used machine learning to text-mine the scientific literature and annotate microbial species with >400 phenotypic traits (Brbic et al. 2016. Nucl Acids Res), and to suggest the genetic basis of these traits, including prevalent epistasis in microbial gene repertoires. One example is the genomes of pathogenic bacteria, which tend to encode proteomes resistant to unfolding, thereby protecting the microbes from oxidative stress (Vidović et al. 2014 Cell Rep).

(figure from Brbic et al. 2016. Nucl Acids Res quantifying the presence of epistasis in microbial gene repertoires)

For more reading on the association of codon usage biases with gene function, have a look at our review article: "The Code of Silence: Widespread Associations Between Synonymous Codon Biases and Gene Function". Supek F (2016) J Mol Evol.

"Try not. Do… or do not. There is no try." -- Yoda.