Wednesday, November 06, 2013

ZooKeys, GBIF, and GitHub: fixing Darwin Core Archives part 2

Here's another example of a Darwin Core Archive that is "broken" in a way that leaves GBIF missing some information. The GBIF data set "A checklist to the wasps of Peru (Hymenoptera, Aculeata)" comes from Pensoft, and corresponds to the paper:
Rasmussen, C., & Asenjo, A. (2009). A checklist to the wasps of Peru (Hymenoptera, Aculeata). ZooKeys, 15(0). doi:10.3897/zookeys.15.196

As with the previous example, GBIF says there are 0 georeferenced records in this dataset. This is odd, because the ZooKeys page for this article lists three supplementary files, including KML files for Google Earth. I've used one to create the image below:

GoogleEarth Image

So, clearly there is georeferenced data here. Looking at the Darwin Core Archive (which I've put on GitHub), there are a bunch of issues with this data. The occurrence.txt file has decimal latitude and longitude values with a comma rather than a decimal point, the file has some character encoding issues, and the columns with latitude and longitude data are labelled as "verbatim" fields, not "decimal" fields. All of this means GBIF lacks all the point data for this dataset (over 2000 records). If we fix these problems, we get a map like this:
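
As a rough sketch of two of the fixes described above (comma decimals and "verbatim" labelling), here is some Python that adds decimalLatitude and decimalLongitude columns to the occurrence data. It assumes a tab-delimited file with a header row that names the columns verbatimLatitude and verbatimLongitude; the function names are mine, not part of any Darwin Core tooling. In a real archive the meta.xml would also need matching field entries for the new columns. The character encoding problem is left alone, because the right fix depends on how the file was actually mis-encoded.

```python
def normalise_coordinate(value, limit):
    """Turn '-12,0464' into '-12.0464'; return '' if it is not a usable number."""
    text = value.strip().replace(",", ".")
    try:
        number = float(text)
    except ValueError:
        return ""
    # abs(nan) <= limit is False, so NaN and infinities are rejected too.
    return text if abs(number) <= limit else ""


def fix_occurrences(text):
    """Append decimalLatitude/decimalLongitude columns to tab-delimited text.

    Assumes the first line is a header containing verbatimLatitude and
    verbatimLongitude (an assumption about the file layout).
    """
    rows = [line.rstrip("\r").split("\t")
            for line in text.split("\n") if line.strip()]
    header, records = rows[0], rows[1:]
    lat_col = header.index("verbatimLatitude")
    lon_col = header.index("verbatimLongitude")

    out = ["\t".join(header + ["decimalLatitude", "decimalLongitude"])]
    for record in records:
        # Pad short rows so the column lookups below cannot fail.
        record = record + [""] * (len(header) - len(record))
        lat = normalise_coordinate(record[lat_col], 90)
        lon = normalise_coordinate(record[lon_col], 180)
        # Only emit a coordinate pair when both halves are usable.
        if not (lat and lon):
            lat = lon = ""
        out.append("\t".join(record + [lat, lon]))
    return "\n".join(out) + "\n"
```

The original verbatim columns are kept untouched, so nothing is lost if a value turns out not to be a coordinate after all.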



This illustrates one problem with publishing data, namely that the data is rarely checked in the same way a manuscript is. Peer-review of data is a phrase that has always struck me as odd, because you can only really evaluate a data set by using it. In other words, data almost demands post- rather than pre-publication review. It's only when people start trying to use the data that problems emerge.

At the same time, we could improve checking of data prior to publication. In the case of the Darwin Core Archives I've looked at so far, it would be easier to find the problems if we had a simple tool that could take a Darwin Core Archive, extract the information and display it in various ways. If, for example, we have georeferenced records but we don't get a map, we would immediately wonder why that was, and figure out what the problem was. At the moment it seems easy to send data to GBIF, thinking you are contributing important information, whereas in fact that information never makes it onto a GBIF map.
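
To give a flavour of what such a tool might check, here is a minimal Python sketch that opens an archive, reads meta.xml to see which columns are mapped to which Darwin Core terms, and counts how many records could be mapped. The function name and the shape of the report are hypothetical. It assumes the core file is tab-delimited and that its fields are not quoted, and it ignores extension files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"  # namespace used by meta.xml


def _cell(row, columns, term):
    """Value of the column mapped to `term`, or '' if there is none."""
    index = columns.get(term)
    if index is None or index >= len(row):
        return ""
    return row[index].strip()


def _is_coordinate(text, limit):
    try:
        return abs(float(text)) <= limit
    except ValueError:
        return False


def coordinate_report(source):
    """Summarise coordinates in a Darwin Core Archive (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        core = ET.fromstring(archive.read("meta.xml")).find(DWC_TEXT_NS + "core")
        location = core.find(f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location").text
        # Map short term names (e.g. "decimalLatitude") to column indexes.
        columns = {
            field.get("term").rsplit("/", 1)[-1]: int(field.get("index"))
            for field in core.findall(DWC_TEXT_NS + "field")
            if field.get("index") is not None
        }
        skip = int(core.get("ignoreHeaderLines", "0"))
        encoding = core.get("encoding", "utf-8")
        text = archive.read(location).decode(encoding, errors="replace")

    rows = [line.rstrip("\r").split("\t")
            for line in text.split("\n")[skip:] if line.strip()]
    report = {"records": len(rows), "mappable": 0, "verbatim_only": 0}
    for row in rows:
        lat = _cell(row, columns, "decimalLatitude")
        lon = _cell(row, columns, "decimalLongitude")
        if _is_coordinate(lat, 90) and _is_coordinate(lon, 180):
            report["mappable"] += 1
            continue
        # Not mappable as is: would the verbatim fields work with a decimal point?
        vlat = _cell(row, columns, "verbatimLatitude").replace(",", ".")
        vlon = _cell(row, columns, "verbatimLongitude").replace(",", ".")
        if _is_coordinate(vlat, 90) and _is_coordinate(vlon, 180):
            report["verbatim_only"] += 1
    return report
```

A report with "mappable" at 0 and "verbatim_only" in the thousands would be exactly the warning sign described above: the coordinates are in the archive, but not in a form that ends up on a map.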