Tuesday, June 19, 2012

70,000 articles extracted from the Biodiversity Heritage Library

Just noticed that BioStor now has just over 70,000 articles extracted from the Biodiversity Heritage Library. This number is a little "soft" as there are some duplicates in the database that I need to clean out, but it's a nice-sounding number. Each article has full text available, and in most cases reasonably complete metadata.

Most of the articles in BioStor have been added using semi-automated methods, but there's been rather more manual entry than I'd like to admit. One task that does have to be done manually is attaching plates to papers. This is largely an issue for older publications, where printing text and figures required different processes, so the text and its figures often ended up widely separated in the publication. Technology evolved, and the more recent literature doesn't have this problem.

Future plans include adding the ability to download the articles as searchable PDFs, and to support OCR correction, amongst other things. BioStor also underpins some of my other projects, such as the EOL Challenge entry, which as of now has around 80,000 animal names linked to their original descriptions in BioStor (and some 300,000 in total linked to some form of digital identifier). One day I may also manage to get the article locations into BHL itself, so that when you browse a scanned item in BHL you can quickly find individual articles. Oh, and it would be cool to have all this on the iPad...