Phylogenetic profiling

Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as the involvement of two different proteins in the same biological pathway. Along with examination of conserved synteny, conserved operon structure, and "Rosetta Stone" domain fusions, comparison of phylogenetic profiles is a so-called "post-homology" technique: the computation essential to the method begins only after it has been determined which proteins are homologous to which. A number of these techniques were developed by David Eisenberg and colleagues; phylogenetic profile comparison was introduced in 1999 by Pellegrini et al.^[1]

Method

Complete DNA genome sequences are now available for thousands of species of bacteria, archaea, and eukaryotes. Typically, each gene in a genome encodes a protein that can be assigned to a particular protein family on the basis of homology. In the original, binary formulation, the presence or absence of a given protein family in each genome is recorded as 1 (present) or 0 (absent). The phylogenetic distribution of the protein family can therefore be represented by a long binary string with one digit per genome; such strings are easily compared with each other to search for correlated phylogenetic distributions. The large number of complete genomes makes these profiles rich in information. The advantage of using only complete genomes is that the 0 values, representing the absence of a trait, tend to be reliable, because an apparent absence cannot be attributed to an unsequenced part of the genome.
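The binary formulation can be illustrated with a minimal Python sketch. The genome names, family names, and the use of Hamming distance as the comparison measure are assumptions made for this example only; real profiles span far more genomes, and other similarity measures are also used.

```python
from itertools import combinations

# Hypothetical toy input: each complete genome mapped to the set of
# protein families detected in it (names are invented for illustration).
genomes = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2"},
    "genome_C": {"fam3"},
    "genome_D": {"fam1", "fam2", "fam3"},
}


def build_profiles(genomes):
    """Return {family: tuple of 0/1}, one digit per genome in sorted genome order."""
    order = sorted(genomes)
    families = set().union(*genomes.values())
    return {
        fam: tuple(int(fam in genomes[g]) for g in order)
        for fam in sorted(families)
    }


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


profiles = build_profiles(genomes)

# Rank family pairs from most to least similar distribution.
ranked = sorted(
    combinations(profiles, 2),
    key=lambda pair: hamming(profiles[pair[0]], profiles[pair[1]]),
)

for fam_a, fam_b in ranked:
    print(fam_a, fam_b, hamming(profiles[fam_a], profiles[fam_b]))
```

In this toy data set, fam1 and fam2 have identical profiles and therefore rank first as the most strongly co-distributed pair.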

Theory

Closely related species should be expected to have very similar sets of genes. However, changes accumulate between more distantly related species by processes that include horizontal gene transfer and gene loss. Individual proteins have specific molecular functions, such as carrying out a single enzymatic reaction or serving as one subunit of a larger protein complex. A biological process such as photosynthesis, methanogenesis, or histidine biosynthesis may require the concerted action of many proteins. If some protein critical to a process is lost, other proteins dedicated to that process would become useless; natural selection makes it unlikely these useless proteins will be retained over evolutionary time. Therefore, should two different protein families consistently tend to be either present or absent together, a likely hypothesis is that the two proteins cooperate in some biological process.
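One possible way to quantify how consistently two families are "present or absent together" is the mutual information between their binary profiles. The sketch below is illustrative only: the function name and the toy profiles are assumptions, and this is not necessarily the score used in the original publication.

```python
from collections import Counter
from math import log2


def mutual_information(p, q):
    """Mutual information (in bits) between two equal-length 0/1 profiles."""
    n = len(p)
    if n == 0 or n != len(q):
        raise ValueError("profiles must be non-empty and of equal length")
    joint = Counter(zip(p, q))  # counts of (digit in p, digit in q) per genome
    count_p = Counter(p)        # marginal counts for the first profile
    count_q = Counter(q)        # marginal counts for the second profile
    mi = 0.0
    for (a, b), c in joint.items():
        # (c / n) is the joint frequency; the log term compares it with the
        # frequency expected if the two families were distributed independently.
        mi += (c / n) * log2(c * n / (count_p[a] * count_q[b]))
    return mi


# Hypothetical profiles over four genomes.
together = mutual_information((1, 1, 0, 0), (1, 1, 0, 0))
unrelated = mutual_information((1, 1, 0, 0), (1, 0, 1, 0))
print(together, unrelated)
```

Mutual information is also high for perfectly complementary profiles (one family present exactly where the other is absent), so a score of this kind flags any strong dependence between two distributions, not only co-occurrence.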

Advances and challenges

Phylogenetic profiling has led to numerous discoveries in biology, including previously unknown enzymes in metabolic pathways, transcription factors that bind to conserved regulatory sites, and explanations for the roles of certain mutations in human disease.^[2] Improving the method is an active area of research because it faces several limitations. First, co-occurrence of two protein families often reflects the recent common ancestry of two species rather than a conserved functional relationship; disambiguating these two sources of correlation may require improved statistical methods. Second, proteins grouped as homologs may differ in function, and proteins with conserved function may fail to register as homologs; better methods for tailoring the size of each protein family to reflect functional conservation should improve results.

Tools

Tools include PLEX (Protein Link Explorer, now defunct)^[3] and the JGI IMG (Integrated Microbial Genomes) Phylogenetic Profiler, which handles both single genes and gene cassettes.^[4]

Notes

[1] Pellegrini, Matteo; Marcotte, Edward M; Thompson, Michael J; Eisenberg, David; Yeates, Todd O (1999). "Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles". Proceedings of the National Academy of Sciences USA. 96 (8): 4285–4288. doi:10.1073/pnas.96.8.4285. PMC 16324. PMID 10200254.
[2] Kensche, Philip R; van Noort, Vera; Dutilh, Bas E; Huynen, Martijn A (2008). "Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution". Journal of the Royal Society Interface. 5 (19): 151–170. doi:10.1098/rsif.2007.1047. PMC 2405902. PMID 17535793.
[3] Date, Shailesh V.; Marcotte, Edward M. (2005-05-15). "Protein function prediction using the Protein Link EXplorer (PLEX)". Bioinformatics. 21 (10): 2558–2559. doi:10.1093/bioinformatics/bti313. ISSN 1367-4803. PMID 15701682.
[4] Chen, I.-Min A.; Chu, Ken; Palaniappan, Krishna; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Huntemann, Marcel; Varghese, Neha; White, James R. (2018-10-05). "IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes". Nucleic Acids Research. 47 (D1): D666–D677. doi:10.1093/nar/gky901. ISSN 1362-4962. PMC 6323987. PMID 30289528.
