Tuesday, October 25, 2011

Accepted Poster Session: Primary Biodiversity Data Records in Legacy Literature Databases

Title: Primary Biodiversity Data Records in Legacy Literature Databases

Authors: Arturo H. Ariño, Museum of Zoology and Ecology of the University of Navarra
Estrella Robles, Museum of Zoology and Ecology of the University of Navarra

Abstract: Automated extraction of primary biodiversity data records (PBDR), i.e. the basic triad of taxon/location/time, from existing literature is desirable as a way to help fill gaps and increase the fitness-for-use of global repositories of biodiversity data digitally stored from specimens and observations. Current efforts at extracting taxonomic data and their context from legacy literature through digitization and OCR, such as the Global Biodiversity Information Facility’s (GBIF) Global Names Architecture (GNA), TaxonX, Innotaxa, Plazi, Fieldjournal, and other automated XML markup and tagging procedures applied to the digitized literature increasingly available at the Biodiversity Heritage Library (BHL), have yet to produce unambiguous PBDR. Existing historical literature presents a high degree of formal variation, which makes modelling in an XML schema quite difficult, so we still rely on manual parsing and digitization or markup for each complete PBDR.

This labor-intensive effort entails selective digitization because of its associated cost, and therefore may result in patterning of the acquired data, with high potential for gaps in knowledge. We explore some of these potential gaps by looking at patterns resulting from the manual digitization of primary biodiversity data records into Zootron 4, a vintage taxonomic database. It holds about 200,000 worldwide occurrence records of fauna, manually captured from scientific literature over a period of more than two decades by biodiversity researchers according to their own selective interests.


Four broad classes of patterns were found: taxonomic, geospatial, human-dependent, and chronological. However, these may reflect both intrinsic patterns existing in the examined literature and biases introduced by the researchers’ selective processes. Incremental analysis involving other similarly recorded PBDRs, as well as comparisons with patterns resulting from alternative sources of PBDRs such as collections and observations, may help identify the main source of each pattern. To that end, a standardization of datasets, for example through an extension of Darwin Terms, may be desirable.
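
As a rough illustration of how such patterns might be tallied, the following Python sketch counts records along the four classes named above. It is not the authors' method and does not reflect the Zootron 4 schema: the `PBDR` class, its field names (including the `recorder` field standing in for the human-dependent dimension), the `tally` helper, and the three sample records are all invented for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PBDR:
    """Hypothetical record: the taxon/location/time triad plus who captured it."""
    taxon: str      # scientific name as captured from the source
    location: str   # locality or country of the occurrence
    year: int       # year of the occurrence
    recorder: str   # person who digitized the record (assumed field)


def tally(records, key):
    """Count how many records share each value returned by key(record)."""
    return Counter(key(r) for r in records)


# Invented sample records, for illustration only.
records = [
    PBDR("Apis mellifera", "Spain", 1987, "A"),
    PBDR("Apis mellifera", "Spain", 1992, "A"),
    PBDR("Bombus terrestris", "France", 1992, "B"),
]

by_taxon = tally(records, lambda r: r.taxon)             # taxonomic
by_location = tally(records, lambda r: r.location)       # geospatial
by_recorder = tally(records, lambda r: r.recorder)       # human-dependent
by_decade = tally(records, lambda r: r.year // 10 * 10)  # chronological
```

Uneven counts in any of these tallies would only flag a candidate pattern. Whether it originates in the literature itself or in the researchers' selection still requires the comparisons with other sources described above.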
