Bayesian inference in phylogenetics
Published: October 26, 2008
Tags: evolution molecular biology phylogenetics science bayesian statistics
This blog entry represents quite a substantial departure from my usual subject matter, in that it has a lot to do with molecular biology. To say that this is not my area of expertise would be an understatement. I have no formal education in biology beyond the bare minimum that every Australian high school graduate must get - I ditched it for physics and chemistry at the first possible opportunity. My entry point into the material discussed here is this paper, which I found by virtue of it being cited in this paper, which is substantially more relevant to my current field. So I make no guarantees of complete factual accuracy in what follows, although I'd like to think I haven't misunderstood anything too severely.
Phylogenetics is the study of evolutionary relatedness between organisms - identifying which plants or animals have common ancestors. The end result of such study is the production of phylogenetic trees, or "trees of life", which look something like this and diagrammatically convey our best estimate as to when and where two closely related species diverged in evolutionary history. Historically, I get the impression this has been a very labour-intensive process, and one where it has been difficult to get any sort of objective measure of reliability: evolutionary relatedness has had to be inferred from observed similarities in things like large-scale physical traits or behaviour.
Modern molecular biology, however, has opened the door to, appropriately enough, so-called molecular phylogenetics, in which evolutionary relatedness is determined in a more direct and objective way by comparing organisms in terms of molecular structure - specifically, DNA or some form of RNA. In this case each organism is represented by a GATTACA-style sequence of base pairs and we assess relatedness by using some sort of mathematical model of the evolutionary process itself. As a simplest approximation, we might represent evolution by a sequence of independent mutations, where a particular base pair is picked uniformly at random from the entire sequence and replaced by another picked uniformly at random from the four possibilities (since A always pairs with T and G with C, fixing one base at a site fixes its partner, so each site has four possible states rather than sixteen). It is possible to refine this first-order approximation by making mutations in some areas of the sequence more likely than others and/or making some mutations more likely than others.
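To make the simplest model concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: each site is written as one letter from A, C, G, T (the partner base on the other strand is implied), every mutation really does change the site it hits, and the names `mutate` and `hamming` are invented for this example rather than taken from any phylogenetics package.

```python
import random

BASES = "ACGT"  # assumption: one letter per site, complementary strand implied


def mutate(sequence, n_mutations, rng):
    """Apply n_mutations independent point mutations to a sequence.

    Each mutation picks a site uniformly at random and replaces its base
    with a different base chosen uniformly from the remaining three.
    """
    seq = list(sequence)
    for _ in range(n_mutations):
        site = rng.randrange(len(seq))
        alternatives = [b for b in BASES if b != seq[site]]
        seq[site] = rng.choice(alternatives)
    return "".join(seq)


def hamming(a, b):
    """Count the sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "GATTACA" * 10
descendant = mutate(ancestor, 15, rng)
print(ancestor)
print(descendant)
print("observed differences:", hamming(ancestor, descendant))
```

Note that the observed number of differences can be smaller than the number of mutations applied, because the same site can be hit more than once and may even revert to its original base. That is one reason inference needs a model of the process rather than a simple count of differences.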
With a collection of aligned sequences and such a mathematical model of evolution, we can then proceed to infer phylogenetic trees by any one of a number of possible techniques - by appealing to parsimony, and seeking the tree which minimises the total number of required mutations; by appealing to maximum likelihood, and seeking the tree whose underlying sequence of mutations is more probable than that underlying any other; or by using Bayesian inference to compute a posterior probability distribution over trees. These heavily computational approaches to molecular phylogenetics are called (once again, appropriately enough) computational phylogenetics. The Bayesian case is the most immediately interesting to me, because it's the only one I've read about in significant detail and because the specific algorithms used typically belong to the same broad class of algorithms that I use in my own work (Markov Chain Monte Carlo approximations), although Wikipedia suggests that this technique is controversial at best. The trees produced via Bayesian methods in the paper cited above, though, certainly seem reasonable to my non-expert eye.
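Of the three criteria, parsimony is the easiest to show in a few lines. The sketch below uses Fitch's algorithm, a standard way of counting the minimum number of mutations a given tree needs at each site; that choice is mine for illustration and not necessarily what the cited paper does. The four sequences are made up, and the nested-tuple tree representation and the names `fitch` and `parsimony` are hypothetical.

```python
def fitch(tree, column):
    """Return (possible_states, mutation_count) for one alignment column.

    tree is either a leaf name (a string) or a pair (left, right) of subtrees.
    column maps each leaf name to the base observed at this site.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        # The two subtrees can agree on a state, so no extra mutation is needed.
        return shared, left_cost + right_cost
    # No state in common: at least one mutation separates the subtrees.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony(tree, alignment):
    """Sum the Fitch mutation counts over every column of the alignment."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    total = 0
    for i in range(n_sites):
        column = {name: alignment[name][i] for name in names}
        total += fitch(tree, column)[1]
    return total


# Invented toy data: four aligned sequences of ten sites each.
alignment = {
    "human": "GATTACAGAT",
    "chimp": "GATTACAGAC",
    "mouse": "GCTTGCAGAT",
    "rat":   "GCTTGCTGAT",
}

# The three possible ways of pairing up four taxa.
candidates = [
    (("human", "chimp"), ("mouse", "rat")),
    (("human", "mouse"), ("chimp", "rat")),
    (("human", "rat"), ("chimp", "mouse")),
]

for tree in candidates:
    print(tree, parsimony(tree, alignment))
best = min(candidates, key=lambda t: parsimony(t, alignment))
print("most parsimonious:", best)
```

With only four taxa there are just three groupings to compare, so exhaustive search is trivial. The number of possible trees grows extremely quickly with the number of taxa, which is part of why real analyses fall back on heuristic searches or on sampling methods such as Markov Chain Monte Carlo.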
The awesome explanatory power of the ideas of evolution mapped onto genetics is poignantly apparent when we step back and think about what is being achieved here. We are providing as input to a computation the structure of certain molecules sampled from organisms - invisible, tiny fragments of whole animals - and, making only assumptions about mutation processes on these molecules (which are known for a fact to occur), we are receiving as output tree structures of relatedness which are in excellent broad agreement with the tree structures we made in the pre-genetic days based on our observations of macroscopic physical features of whole animals. How inspiring! How giddily exciting it is to think that it is by no means impossible that in the future, after we have refined our techniques, developed more efficient and accurate numerical methods for Bayesian inference or other statistical techniques, and greatly increased the processing speeds of our computers, we may be able to input into a great machine cell samples from all the plants and animals we have discovered and, after a lengthy process of computation, be rewarded one day with a glance at the complete history of all life currently found on this planet!