Monday, June 29, 2009

Wikipedia taxonomy, the good, the bad, and the very ugly

In the previous post I suggested that a productive way to meet EOL's goal of a web page per taxon would be to build upon Wikipedia, rather than go it alone. In a nutshell the arguments were:

  1. Wikipedia has considerable traction and has some richly populated taxon pages

  2. The Linked Data community uses DBpedia.org as a core source of URIs for entities, and as DBpedia is derived from Wikipedia, the latter will be the core source of identifiers for taxa

To explore this a little further I grabbed two files from the 20090618 Wikipedia dump, namely page.sql and templatelinks.sql, and extracted page ids and titles for Wikipedia pages containing the Taxobox template. I then queried Wikipedia for the source for each of these pages, and tried to extract the taxonomic information from each page (a tedious and error-prone process at best).
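
(For the record, the extraction query itself is simple. Here's a minimal sketch, assuming page.sql and templatelinks.sql have been loaded into a local MySQL database; the table and column names follow the MediaWiki schema, and MySQLdb is the MySQL-python driver.)

    # Find every page that transcludes Template:Taxobox. Templates live in
    # namespace 10. Note that this also catches documentation and user pages,
    # which is why the taxon page count is approximate.
    import MySQLdb

    db = MySQLdb.connect(host="localhost", user="root", passwd="", db="wikipedia")
    cursor = db.cursor()
    cursor.execute("""
        SELECT page.page_id, page.page_title
        FROM page
        INNER JOIN templatelinks ON page.page_id = templatelinks.tl_from
        WHERE templatelinks.tl_namespace = 10
          AND templatelinks.tl_title = 'Taxobox'
    """)
    for page_id, title in cursor.fetchall():
        print("%s\t%s" % (page_id, title))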

I've put together a shockingly crude web page where you can browse the results (warning, this page is a 10 minute hack with little error checking).

There is some good news. There are over 120,000 taxon pages (I've not got an exact figure because the Taxobox template occurs on some pages that aren't taxon pages, such as documentation and user pages). Some pages are extensive (the largest page is Dinosaur for which the source text is 128K in size), and there are lots of links to external references (I counted 7205 distinct DOIs to papers and/or books, and 3248 distinct ISBNs). This represents a degree of external linkage that puts EOL to shame.
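
(In case you're wondering how the counting was done: rough regular expressions over the scraped wikitext are enough for a ballpark figure. A minimal sketch, where page_sources is a hypothetical dict mapping page title to wikitext; the patterns are approximations, not validators.)

    import re

    # Rough patterns: good enough for counting distinct identifiers,
    # not for validating them.
    DOI_RE = re.compile(r'10\.\d{4,9}/[^\s|}\]]+')
    ISBN_RE = re.compile(r'ISBN\s*=?\s*([\d\-Xx]{10,17})')

    def count_external_links(page_sources):
        dois, isbns = set(), set()
        for text in page_sources.values():
            dois.update(DOI_RE.findall(text))
            isbns.update(isbn.replace("-", "") for isbn in ISBN_RE.findall(text))
        return len(dois), len(isbns)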

However, there are also some major problems. Firstly, Wikipedia does not have a single, internally consistent classification (i.e., the classification is not a tree). This is not unexpected, given that Wikipedia pages comprise semi-structured text that is (largely) manually entered. It's not a database. If it were, the simplest way to ensure consistency would be to have each child node include a pointer to its parent; when we wanted a list of the children of a node we would simply query the database ("which nodes have this node as their parent?"). Because Wikipedia isn't a database, authors have entered the two relationships ("has parent" and "has child") on different pages, and these often conflict.

For a spectacular example of this, take a look at the page for Amphibia. When I scraped Wikipedia I extracted the "has parent" link, as this is the simplest way to create a tree. This results in over 200 child taxa for Amphibia, yet the Wikipedia page for Amphibia lists only four child taxa. What appears to be happening is that many fossil taxa are being added to Wikipedia, and since we are often hazy about where they go in the tree, authors are listing their parent taxon as (in this case) "Amphibia". Given this direct link, they should also be listed as children of Amphibia (although, of course, that would make a mess of the Amphibia page). Perhaps the solution is to add an "incertae sedis" taxon page for each taxon, and make that the parent of all the taxa we aren't sure where to put. This would ensure consistency without making the current taxon pages unreadable.
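
(To make this kind of inconsistency concrete, here's a minimal sketch of the check involved, assuming the scraped taxoboxes have been boiled down to two hypothetical mappings. The data below is a toy example, not the real Amphibia page.)

    # parent_of holds each page's "has parent" link; children_listed holds
    # the child taxa a page itself lists. A conflict is a child whose claimed
    # parent doesn't list it back.
    def find_conflicts(parent_of, children_listed):
        conflicts = []
        for child, parent in sorted(parent_of.items()):
            if child not in children_listed.get(parent, set()):
                conflicts.append((child, parent))
        return conflicts

    # Toy data: a fossil taxon claims Amphibia as its parent, but isn't
    # among the children the Amphibia page lists.
    parent_of = {"Gerobatrachus": "Amphibia", "Anura": "Amphibia"}
    children_listed = {"Amphibia": {"Anura", "Caudata", "Gymnophiona"}}
    print(find_conflicts(parent_of, children_listed))  # [('Gerobatrachus', 'Amphibia')]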

Homonymy (the same name for different taxa) also raises its ugly head. For example, the page for the crab family Latreilliidae lists the genus Latreillia, which is a fly. In this case, the fly genus Latreillia Robineau-Desvoidy is a junior homonym of the crab genus Latreillia Roux (see http://biodiversitylibrary.org/page/12221111).

Finally, the page titles (which become the basis of DBpedia.org URIs) are a muddled mixture of common and scientific names.

So, what to do? Well, the idea of simply using Wikipedia as-is isn't going to fly; it's too broken. We will have to contemplate a concerted effort to fix it (which will require using bots to clean up the inconsistencies). Another option (assuming that we like the wiki-style environment) is to use a semantic wiki (see my earlier post), which constrains some of the possible markup but retains a lot of the freedom that makes wikis so powerful.

This isn't an argument for not using Wikipedia as such; it's arguably still much more informative than, say, EOL. It's just that it's showing signs of the limitations of free-form text entry. The trick is to find a way to combine the obvious strengths of this approach (ease of creating and editing pages, massive community support) with the more structured approach needed to avoid the internal inconsistencies that currently bedevil Wikipedia.

Thursday, June 25, 2009

EOL, Wikipedia, TDWG, LinkedData, and the Vision Thing

Time for more half-baked ideas. There's been a lot of discussion on Twitter about EOL, Linked Data (sometimes abbreviated LOD), and Wikipedia. Pete DeVries (@pjd) is keen on LOD, and has been asking why TDWG isn't playing in this space. I've been muttering dark thoughts about EOL, and singing the praises of Wikipedia. And so it goes on. So, here's one vision of where we could (?should) be going with this.

Let's imagine that we do indeed want to play in the Linked Data space. The concern that tends to be raised the most is that biodiversity informatics uses LSIDs as the standard GUID, and this doesn't play nice with Linked Data. This is true, but not life threatening. There are various hacks (like this and this) that deal with it.

But, the real concern (I think) is that we need a way to link our stuff to the rest of the Linked Data cloud. That is, wherever possible we need to reuse existing identifiers. In the LOD diagram below (for the latest version see here) DBpedia.org is key to linking much of this together, and major players (such as the BBC) are now using DBpedia.org to make connections.



DBpedia.org is based on Wikipedia, so I think you can see where this is going. There are some 120,000+ taxon pages in Wikipedia, so that's some 120,000+ identifiers in DBpedia.org that others interested in organisms can (and will) use to refer to taxa. Given the centrality of Wikipedia and DBpedia to LOD, why don't we adopt DBpedia.org URIs as the default GUID for our taxa? At present we have numerous, competing identifiers (e.g., NCBI taxonomy ids, ITIS TSNs, Catalogue of Life LSIDs, uBio NameBankIDs, plus LSIDs from various nomenclators). For users this is a mess -- which one do I use? Deciding requires dealing with issues (such as the differences between nomenclatural codes, and between taxonomic names and concepts) that, frankly, nobody outside our community cares about.
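
(To make this concrete: the Wikipedia page Dinosaur corresponds to the URI http://dbpedia.org/resource/Dinosaur and, assuming DBpedia's usual content negotiation, resolving that URI for RDF is nearly a one-liner. A sketch:)

    # Ask for RDF rather than HTML; DBpedia should redirect (303) an
    # RDF-aware client from the resource URI to the actual data.
    from urllib.request import Request, urlopen

    uri = "http://dbpedia.org/resource/Dinosaur"
    req = Request(uri, headers={"Accept": "application/rdf+xml"})
    with urlopen(req) as response:
        print(response.geturl())   # the post-redirect URL, e.g. under /data/
        print(response.read(500))  # first few hundred bytes of RDF/XML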

So, if we want to play with LOD, we need to make our identifiers play nice (straightforward), and we should think seriously about adopting DBpedia.org URIs as the default GUID for taxa.

Now, where does this leave EOL? Well, frankly, it should get out of the business of making web pages for taxa, because Wikipedia owns that space already. Wikipedia's pages are fewer, but often much more detailed than the corresponding EOL pages, and Wikipedia reacts faster to new discoveries. Wikipedia supports community editing, versioning, and quite sophisticated tools for handling bibliographic references.

There's plenty of scope for useful tools and services for EOL to develop, but I think the real game is elsewhere. Now, Wikipedia is far from perfect. It's basically semi-structured text with a God-awful template language, and it would benefit greatly from more structure (e.g., as could be provided by Semantic MediaWiki), but I think we should think about building upon it. We could build our own (and my experiments over at itaxon.org explore this), but the big challenge is getting a community around a project, and if David Shorthouse's pronouncement that The Community is Dead is correct, then maybe we should get on board with the community that already exists. Perhaps what EOL should be doing is talking to Wikipedia, improving the existing templates for taxon pages, and creating bots to automatically populate Wikipedia with more taxon pages.

Sunday, June 14, 2009

Visualising taxonomic classifications using SpaceTrees

The problem of displaying large taxonomic classifications on a web page continues to be an on-again, off-again obsession. My latest experiment makes use of Nicolas Garcia Belmonte's wonderful JavaScript InfoVis Toolkit (JIT), which provides implementations of classic visualisations such as treemaps, hyperbolic trees, and SpaceTrees.

SpaceTrees were developed at Maryland's HCIL lab, which has applied them to biodiversity informatics. The LepTree project has also used them (see LepTaxonTree). I've not been a huge fan, mainly because the existing implementation is a stand-alone Java program, which somewhat limits its utility. But JIT changes all that.

To get a sense of whether SpaceTrees would be useful, I took Belmonte's second SpaceTree example as a starting point. In this example, nodes are created on demand (rather than loading the entire tree into memory). It proved relatively straightforward (after getting my head around making Ajax requests using MooTools) to modify the example to load nodes from a local copy of the Catalogue of Life 2008 classification.
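
(The server side of this amounts to very little. A minimal sketch, assuming a hypothetical SQLite table taxa with taxon_id, name, and parent_id columns, that returns a node and its immediate children in roughly the {id, name, children} JSON shape the SpaceTree consumes:)

    import json
    import sqlite3

    def get_node(db_path, taxon_id):
        # Fetch one node plus its immediate children; the SpaceTree asks
        # for deeper levels on demand, so we never load the whole tree.
        conn = sqlite3.connect(db_path)
        name = conn.execute("SELECT name FROM taxa WHERE taxon_id = ?",
                            (taxon_id,)).fetchone()[0]
        children = [{"id": str(tid), "name": n, "children": []}
                    for tid, n in conn.execute(
                        "SELECT taxon_id, name FROM taxa WHERE parent_id = ?",
                        (taxon_id,))]
        conn.close()
        return json.dumps({"id": str(taxon_id), "name": name, "children": children})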



I've put a live version of the Catalogue of Life SpaceTree up at http://bioguid.info/demos/spacetree. It doesn't do much beyond displaying the tree, together with some basic information about the node. But I think it shows the power of JavaScript to create pleasing visualisations, and the potential of SpaceTrees as a simple tool to browse large taxonomic classifications.

Friday, June 05, 2009

ChrisFreeland.com: #ebio09, silverbacks, & haiku

Chris Freeland has written a thoughtful summary of his experiences of the two-day closed session to create a road map for biodiversity informatics, entitled #ebio09, silverbacks, & haiku.

Taxonomy on a hard disk

This post is likely to seem somewhat off the wall, given the rush to get everything into the cloud, but it's Friday, so let's give it a whirl.

One idea I've been toying with is dispensing with relational databases, wikis, etc. and just storing taxonomic data using files and folders on a disk. There are several reasons for this:

  • The file system naturally enforces hierarchy

  • There are existing systems for putting files and folders under version control (e.g., CVS, Subversion, git)

  • Native text and image editors handily beat web-based ones

  • Some file systems have great tools for searching on metadata (e.g., "smart folders" and Spotlight on Mac OS X)

  • Some of the visualisations that we would like for classifications (such as treemaps) already exist in very polished form for viewing file systems

By way of background, I've been prompted to think along these lines by David Shorthouse's observation that we could place a taxonomic hierarchy under version control (e.g., on GitHub) and deal with changes/multiple versions that way. I've also been inspired by tools such as CouchDB, a schema-less database that one can talk to directly via HTTP. This reflects a trend where people are starting to exploit the untapped power of some basic, well-known technologies (such as the HTTP protocol), avoiding the need for lots of middleware in the process (why write stuff in Java/Ruby/PHP, etc. when HTTP GET, PUT, POST, etc. cover the bases?). Another inspiration is Dropbox, which enables replication of files across multiple machines and the web. The web interface to Dropbox is very clean, and essentially mirrors the local folder structure.

So, in some ways this probably sounds silly, and closely resembles the naive way many of us started making digital versions of taxonomies, and it will have many database people rolling their eyes and muttering about "data consistency" and "queries". But a key thing to remember is that the file system is a database that resides under a graphical user interface, and it maintains some forms of consistency that classical relational databases are poor at handling. For example, file systems enforce hierarchical consistency: if I move a folder to another folder, all the files and folders below that folder move as well. Of course, we can program this with a relational database, but our track record in doing so is pretty miserable. I've found inconsistencies in versions of ITIS (I haven't checked recently), and last year's Catalogue of Life database had all sorts of orphans lurking in the tree table.

Then there's the GUI. If I write a taxonomic database in the classical way, I need to write code to talk to the database, edit records, support user authentication, data versioning, etc. If I use the file system, I get this pretty much for free. Authentication? It's called the login screen. Versioning? I put it in a public repository like Google Code or GitHub, and that takes care of that (plus I get online authentication for free). Editing? Well, I can drag and drop items onto folders, and I can open them in native editors.

What I envisage is replicating a taxonomic hierarchy on disk, and representing key-value pairs of attributes (such as taxon name authorship, bibliographic details) as text files where the name of the file is the key (e.g., publishedIn) and the content of the file is the value (e.g., doi:10.1590/S0101-81752005000300004). I could add images and PDFs, and the neat thing is that they have lots of useful metadata embedded inside (where, arguably, it belongs).
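
(As a sketch of what this might look like in practice, with an entirely hypothetical folder layout:)

    import os

    # Each taxon is a folder; each attribute is a file whose name is the key
    # and whose content is the value.
    def write_taxon(root, path, attributes):
        folder = os.path.join(root, *path)      # e.g. taxonomy/Pinnixa/behreae
        os.makedirs(folder, exist_ok=True)
        for key, value in attributes.items():
            with open(os.path.join(folder, key), "w") as f:
                f.write(value)

    write_taxon("taxonomy", ["Pinnixa", "behreae"],
                {"publishedIn": "doi:10.1590/S0101-81752005000300004"})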

I'm also toying with the idea of using symbolic links (Windows users, look away now) to represent relationships such as basionym links to original names.

This is all a bit half-baked at present, but it seems worth pursuing. One could argue that having a full taxonomic hierarchy is overkill (and raises the issue of which one to use), but binomial names are themselves hierarchical (the species epithet is nested inside the genus name), so we need some degree of hierarchy anyway. I like the idea that copying a folder called "behreae" in the folder "Pinnixa", placing the copy under "Austinixa", then within Austinixa/behreae adding a symbolic link to Pinnixa/behreae pretty much takes care of synonymy. I also like the idea that one could download an entire taxonomy and, using just the native tools on your computer, edit and annotate it, then merge changes with a remote copy. It makes managing the data little different from writing code.
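
(The Pinnixa/Austinixa example as a sketch, continuing the hypothetical layout above; a symbolic link resolves relative to the folder that contains it:)

    import os
    import shutil

    # Copy the species folder to the new combination...
    shutil.copytree("taxonomy/Pinnixa/behreae", "taxonomy/Austinixa/behreae")

    # ...then link back to the original combination from inside it;
    # "../../Pinnixa/behreae" resolves from within taxonomy/Austinixa/behreae.
    os.symlink("../../Pinnixa/behreae", "taxonomy/Austinixa/behreae/basionym")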

In practice we'll want to add some things. It would be nice to have a web interface for browsing, but this could be as trivial as a script that reads the contents of a folder, displays subfolders as HTML links, and lists the files (keys) and their contents (values) in the web page.
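
(A sketch of just how trivial, rendering a single folder as HTML:)

    import os

    def folder_to_html(path):
        # Subfolders (child taxa) become links; files become key-value pairs.
        items = []
        for entry in sorted(os.listdir(path)):
            full = os.path.join(path, entry)
            if os.path.isdir(full):
                items.append('<li><a href="%s/">%s</a></li>' % (entry, entry))
            else:
                with open(full) as f:
                    items.append("<li>%s: %s</li>" % (entry, f.read().strip()))
        return "<ul>\n%s\n</ul>" % "\n".join(items)

    print(folder_to_html("taxonomy/Pinnixa/behreae"))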

Perhaps this is a little silly, but I like the idea of having data on my machine that is trivially easy to edit. I also like the idea of getting functionality for free, rather than having to invent it from scratch.

Wednesday, June 03, 2009

e-Biosphere '09: Twitter rules, and all that


So, e-Biosphere '09 is over (at least for the plebs like me; the grown-ups get to spend two days charting the future of biodiversity informatics). It was an interesting event, on several levels. It's late, and I'm shattered, so this post will cover only a few things.

This was the first conference I'd attended where some of the participants twittered during proceedings. A bunch of us settled on the hashtag #ebio09 (you can also see the tweets at search.twitter.com). For the uninitiated, a "hashtag" is a string preceded by a hash symbol (#) to indicate that it is a tag, such as #fail. It provides a simple way to tag tweets so that others interested in that topic can find them.

Twittering created a whole additional layer to the conference.

Twitter greatly enhanced the conversation, notably when a speaker said something controversial (all too rare, sadly), or when a group rapporteur's summary didn't reflect all the views in that group. It also helped document what was going on, and this can be further exploited. For fun, I grabbed tweets from days 2 and 3 and made a wordle:
As @edwbaker noted: "@rdmpage The size of 'together', 'people' & 'visionary' is somewhat telling......". In case you're wondering about the prominence of "Knowlton", it's because Nancy Knowlton gave a nice talk highlighting the ever-increasing number of cases where we have no names for the things we are encountering (for example, when barcoding fresh samples from poorly studied environments). This is just one example of the huge disconnect between the obsession with taxonomic names in biodiversity informatics and the reality of metagenomics and DNA barcoding.

Just as worrying is the lack of resemblance between the taxonomic classification used by the Encyclopedia of Life and our notion of the evolutionary tree of those organisms. A systematist would find much of EOL's classification laughable. I don't want to bash EOL, but it's worrying that they can continue to crank out press releases, yet fail to provide something like a modern classification.

But I digress. In many ways this was less a scientific conference and more an event to birth a discipline, namely "biodiversity informatics" (which I'm sure some would claim has been around for quite a while). So, the event was to attract attention to the topic, and to assure the outside world (and those attending) that the field exists and has something to say. It was also billed as a forum to discuss strategies for the field's future. Sadly, much of this discussion will take place behind closed doors, and will feature the major players who bring money and influence (but not much innovation) to the table.

Symptomatic of this lack of innovation, in a sense, was the contrast between the official "Online Conference Community" and the Twitter feed. When I asked if anybody on Twitter had used the official forum, @fak3r replied tellingly: "@rdmpage thought we were on it ;) #ebio09". As fun as it is to use the new hotness to conduct a parallel (and slightly subversive) discussion at a conference, it's worrying that, in a field that calls itself "informatics", the big beasts probably had little idea what was going on. If we are going to exploit the tools the web provides, we need people who "get it", and I'm unconvinced that the big players in this area truly grasp the web (in all its forms). There's also a worrying degree of physics envy, which might be cured by reading The Unreasonable Effectiveness of Data (doi:10.1109/mis.2009.36).

I tried to stir things up a little (almost literally, as captured in this photo by Chris Freeland) with a couple of questions, but to not much effect (other than apparently driving the poor chap behind me to despair).


But enough grumbling. It was great to see lots of people attending the event, there were lots of interesting posters and booths (creating a market for this field may go some way towards providing an incentive to provide better, more reliable services), and my challenge entry won joint first prize, so perhaps I should sit back, enjoy the wine Joel Sachs chose as the prize (many thanks for his efforts in putting the challenge event together), and let others say what they thought of the meeting.