Wednesday, December 09, 2015

Visualising the difference between two taxonomic classifications

It's a nice feeling when work that one did ages ago seems relevant again. Markus Döring has been working on a new backbone classification of all the species which occur in taxonomic checklists harvested by GBIF. After building a new classification the obvious question arises "how does this compare to the previous GBIF classification?" A simple question, answering it however is a little tricky. It's relatively easy to compare two text files -- and this function appears in places such as Wikipedia and GitHub -- but comparing trees is a little trickier. Ordering in trees is less meaningful than in text files, which have a single linear order. In other words, as text strings "(a,b,c)" and "(c,b,a)" are different, but as trees they are the same.

Classifications can be modelled as a particular kind of tree where (unlike, say, phylogenies) every node has a unique label. For example, the tips may be species and the internal nodes may be higher taxa such as genera, families, etc. So, what we need is a way of comparing two rooted, labelled trees and finding the differences. Turns out, this is exactly what Gabriel Valiente and I worked on in this paper doi:10.1186/1471-2105-6-208. The code for that paper (available on GitHub) computes an "edit script" that gives a set of operations to convert one fully labelled tree into another. So I brushed up my rusty C++ skills (I'm using "skills" loosely here) and wrote some code to take two trees and the edit script, and create a simple web page that shows the two trees and their differences. Below is a screen shot showing a comparison between the classification of whales in the Mammals Species of the World, and one from GBIF (you can see a live version here).

Treediff

The display uses colours to show whether a nodes has been deleted from the first tree, inserted into the second tree, or moved to a different position. Clicking on a node in one tree scrolls the corresponding node in the other tree (if it exists) to scroll into view. Most of the differences between the two trees are due to the absence of fossils from Mammals Species of the World, but there are other issues such as GBIF ignoring tribes, and a few taxa that are duplicated due to spelling typos, etc.