A fully automated tool for species tree inference
ScienceDaily
Source: University of California - San Diego
Summary: Engineers are making it easier for researchers from a broad range of backgrounds to understand how different species are evolutionarily related and to support the transformative biological and medical applications that rely on these species trees. The researchers developed a scalable, automated and user-friendly tool called ROADIES that allows scientists to infer species trees directly from raw genome data, with less reliance on the domain expertise and computational resources currently required.
A team of engineers at the University of California San Diego is making it easier for researchers from a broad range of backgrounds to understand how different species are evolutionarily related and to support the transformative biological and medical applications that rely on these species trees. The researchers developed a scalable, automated and user-friendly tool called ROADIES that allows scientists to infer species trees directly from raw genome data, with less reliance on the domain expertise and computational resources currently required.
Species trees are critical to solidifying our understanding of how species evolved on a broad scale, but can also help find functional regions of the genome that could serve as drug targets; link physical traits to genomic changes; predict and respond to zoonotic outbreaks; and even guide conservation efforts.
In a new paper published May 2 in the journal Proceedings of the National Academy of Sciences, the researchers, led by UC San Diego electrical and computer engineering professor Yatish Turakhia, showed that ROADIES infers species trees comparable in quality to those from state-of-the-art studies, with a fraction of the time and effort. The paper focused on four diverse life forms (placental mammals, pomace flies, birds and budding yeasts), though ROADIES can be used for any species.
"Rapid advances in high-throughput sequencing and computational tools have enabled genome assemblies to be produced at scale," said Anshu Gupta, a computer science PhD student at the Jacobs School of Engineering and the study's first author. "However, accurately inferring species trees is still beyond the reach of many researchers."
"ROADIES is a timely and transformative solution to this problem," added Turakhia. "With its speed, accuracy, and automation, ROADIES has the potential to vastly simplify species tree inference, making it accessible to a broader range of scientists and applications."
A truly automated process
ROADIES, short for "Reference-free, Orthology-free, Annotation-free, Discordance-aware Estimation of Species Trees," differs from existing phylogenetic tools in that it uses a completely automated pipeline yet produces highly accurate results.
One of ROADIES' key innovations is that instead of using predefined genomic regions with specific characteristics, such as protein-coding genes, ROADIES is based on a random sampling of loci from input genomes. This eliminates the need for genome annotation prior to species tree inference.
"It may seem surprising that reconstructing species trees from randomly selected loci can yield highly accurate results," said Turakhia. "But our results show that this simple approach is effective, and we believe it can even offer unique benefits, including better adherence to models of sequence evolution."
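To make the idea of random locus sampling concrete, here is a minimal sketch in Python. It is not the ROADIES implementation, which the article does not describe. The function name `sample_loci`, its parameters, and the toy genomes are all hypothetical and chosen only for illustration. The sketch picks fixed-length windows at random positions in the input genomes, with no reference to annotations or gene boundaries.

```python
import random


def sample_loci(genomes, num_loci, locus_length, seed=None):
    """Pick `num_loci` random fixed-length windows from a dict of genome sequences.

    Hypothetical helper for illustration only; not part of ROADIES.
    """
    rng = random.Random(seed)
    # Only genomes long enough to contain a full window are eligible.
    eligible = [name for name, seq in genomes.items() if len(seq) >= locus_length]
    if not eligible:
        raise ValueError("no genome is long enough for the requested locus length")

    loci = []
    for _ in range(num_loci):
        name = rng.choice(eligible)
        # Choose a start position so the window fits entirely inside the genome.
        start = rng.randrange(len(genomes[name]) - locus_length + 1)
        loci.append((name, start, genomes[name][start:start + locus_length]))
    return loci


# Toy input: made-up sequences standing in for genome assemblies.
genomes = {"species_a": "ACGT" * 50, "species_b": "TTGACA" * 40}
loci = sample_loci(genomes, num_loci=5, locus_length=20, seed=0)
```

Each returned tuple records which genome the window came from, where it starts, and the sequence itself. Nothing in the selection step depends on knowing where genes are, which is why no prior annotation is needed.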
Another strategy that proved key to automation is that ROADIES, unlike many existing methods, is able to take advantage of genes that are present in multiple copies across the genome, a phenomenon that is prevalent for many species. ROADIES does this by integrating methods developed at UC San Diego in the lab of Siavash Mirarab, a professor of electrical and computer engineering and co-author of this PNAS study. This strategy allows ROADIES to eliminate the need to infer orthology, or determine the correspondence of individual gene copies in different species.
By removing the need for two cumbersome steps (genome annotation and orthology inference), ROADIES not only overcomes a major barrier to building reliable, fully automated pipelines, but it also requires significantly less computing power than existing tools. The study highlights ROADIES' scalability to datasets with hundreds of genomes, inferring phylogenies that are concordant with expert-led, large-scale studies, yet requiring a fraction of the time and effort.
The researchers are continuing to improve the capability of ROADIES, including the placement of new taxa on existing species trees and the potential use of GPUs to allow for the processing of tens of thousands of genomes and beyond.
"Large-scale initiatives are already underway to sequence thousands of species — and eventually, potentially every extant eukaryotic species on Earth," said Turakhia. "We want to ensure ROADIES is ready to meet that scale."
This work is supported by an Amazon Research Award (Fall 2022 Call for Proposals), NIH grant 1R35GM142725, and funding from the Hellman Fellowship.