Data Tester

 

Summary

Type of tool: Framework with tools

Function: Data cleaning and validation

Online / Desktop: Desktop

Computer infrastructure: Platform independent

Development status: Operational and expandable. Last update Oct 2006

Time of use: Pre-filter

Licence: Open Source

A set of tools to assist in checking the quality of biodiversity datasets

 

Description

Data Tester is a generic Java framework for data cleaning and data validation. The project was originally conceived within the biodiversity informatics field, following the establishment of the first global networks that served primary data from biological collections. As the amount of shared data grew and its users came to include researchers and policy makers, data quality gained importance. In this context, some networks started to develop tools and interfaces to help with data cleaning and data validation. The main idea of this project was to gather the knowledge from those first data cleaning tools and to produce a new framework that could serve as common ground for implementing and running a large number of data tests.

 

The framework was originally developed as open source software by the Reference Center on Environmental Information (CRIA) with funding from the Global Biodiversity Information Facility (GBIF) and the Gordon and Betty Moore Foundation. Although it originated in the biodiversity informatics field, it is by no means bound or limited to this area. Its design pursued the following goals:

  • To provide standard ways of interacting with the main components such as data tests, test results, records and record sets, allowing different implementations for all of them.
  • To be extensible and allow unlimited creation of new data tests that could be readily plugged into the framework.
  • To be able to process record sets coming in different formats and from different sources (XML, relational database, etc).
  • To allow parameterised data tests so that the same implementation can accept different configurations without the need to write new tests.
  • To make all data tests produce results in a standard format so that they can be handled programmatically.
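
The goals above can be illustrated with a minimal sketch. It is not the framework's actual API: every type and method name here (DataRecord, MapRecord, TestResult, DataTest, AllowedValuesTest, TestRunner) is hypothetical and was chosen only to show how standard interfaces, a parameterised test and a uniform result format might fit together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical record abstraction: a source-agnostic view of named fields. */
interface DataRecord {
    String get(String field); // returns null when the field is absent
}

/** Hypothetical map-backed record; an XML- or database-backed record could implement the same interface. */
final class MapRecord implements DataRecord {
    private final Map<String, String> values;
    MapRecord(Map<String, String> values) { this.values = new HashMap<>(values); }
    @Override public String get(String field) { return values.get(field); }
}

/** Hypothetical standard result, so that the output of every test can be handled the same way. */
final class TestResult {
    final String testName;
    final boolean passed;
    final String message;
    TestResult(String testName, boolean passed, String message) {
        this.testName = testName;
        this.passed = passed;
        this.message = message;
    }
}

/** Hypothetical test contract: one record in, one result out. */
interface DataTest {
    TestResult run(DataRecord record);
}

/** Parameterised test: the field and its accepted values are configuration, not code. */
final class AllowedValuesTest implements DataTest {
    private final String field;
    private final Set<String> allowed;
    AllowedValuesTest(String field, String... allowed) {
        this.field = field;
        this.allowed = new HashSet<>(Arrays.asList(allowed));
    }
    @Override public TestResult run(DataRecord record) {
        String value = record.get(field);
        boolean ok = value != null && allowed.contains(value);
        String message = ok ? "ok" : "unrecognized value for " + field + ": " + value;
        return new TestResult("AllowedValues(" + field + ")", ok, message);
    }
}

/** Runs every test against every record and collects the results in record order. */
final class TestRunner {
    static List<TestResult> runAll(List<DataTest> tests, List<DataRecord> records) {
        List<TestResult> results = new ArrayList<>();
        for (DataRecord record : records) {
            for (DataTest test : tests) {
                results.add(test.run(record));
            }
        }
        return results;
    }
}
```

In this sketch, the same AllowedValuesTest class could check country names in one configuration and basis of record values in another, and a caller can filter the returned list on the passed flag without knowing which test produced each entry.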

 

Two Java packages were created: one containing the framework itself, and another containing a set of generic tests that can be useful in different situations. [1]

 

Function

This is a suite of data cleaning and data validation tools.

Tests that can be executed include the following [2]:

  • Reporting unrecognized values for data elements (e.g. country names or basis of record values)
  • Checking that coordinates fall within the boundaries of named geographic areas
  • Finding scientific names that are not known to external lists such as the Catalogue of Life or nomenclators
  • Checking that scientific names have an appropriate format
  • Detecting numerical outliers
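
As a rough illustration of three of the checks listed above, the sketch below shows one possible way to write them. It makes several simplifying assumptions that are not taken from the tool itself: the class and method names are hypothetical, the name-format pattern only covers simple binomials, the geographic check uses a bounding box instead of real area boundaries, and outliers are detected with Tukey's fences (1.5 times the interquartile range), which is one common technique and not necessarily the one the tool uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper class; none of these methods belong to the Data Tester packages. */
final class ExampleChecks {
    /** Simplified binomial pattern: capitalised genus, then a lower-case epithet (optionally hyphenated). */
    private static final Pattern BINOMIAL = Pattern.compile("[A-Z][a-z]+ [a-z]+(-[a-z]+)?");

    /** Format check only; it does not confirm that the name exists in any external list. */
    static boolean looksLikeBinomial(String name) {
        return name != null && BINOMIAL.matcher(name.trim()).matches();
    }

    /** Bounding-box check in decimal degrees; a real area check would use polygon boundaries. */
    static boolean insideBox(double lat, double lon,
                             double minLat, double maxLat, double minLon, double maxLon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }

    /** Returns the values lying outside Tukey's fences, in their original order. */
    static List<Double> outliers(double[] values) {
        List<Double> found = new ArrayList<>();
        if (values.length == 0) {
            return found; // nothing to test
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double low = q1 - 1.5 * iqr;
        double high = q3 + 1.5 * iqr;
        for (double v : values) {
            if (v < low || v > high) {
                found.add(v);
            }
        }
        return found;
    }

    /** Linear-interpolation quantile on a sorted, non-empty array. */
    private static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }
}
```

Checks against external lists such as the Catalogue of Life are left out of the sketch because they depend on a lookup service or a local copy of the list.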

 

Why use this tool?

Data quality is extremely important to both data users and data providers.

 

Who will use this tool?

Data Tester can be employed directly by data providers, by other portals, or by anyone preparing to analyse retrieved data. The software is not limited to biodiversity data types: those working in fields other than biodiversity informatics can add tests for the kinds of errors that might be found in their own data sets. [3]

 

How will the tool be used?

The software is particularly suited to reporting on XML data sets, but can be applied to other data formats or relational databases. It allows programmers to develop new tests and to generalize tests so that they can work against multiple data standards (e.g. Darwin Core and ABCD schema). Each test may be associated with a severity (error, warning, info) to make it easier to focus on the most significant issues. [4]
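
One way to picture a test that works against several data standards and carries a severity is sketched below. This is an assumption-laden illustration and not the tool's real design: the names Severity, Finding, ConceptMapping, MissingCountryTest and Findings are hypothetical, and any field identifiers passed to the mapping are placeholders supplied by the caller, not verified Darwin Core or ABCD paths.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** The three severity levels named in the text, declared from most to least severe. */
enum Severity { ERROR, WARNING, INFO }

/** Hypothetical finding reported by a test. */
final class Finding {
    final Severity severity;
    final String message;
    Finding(Severity severity, String message) {
        this.severity = severity;
        this.message = message;
    }
}

/** Hypothetical mapping from an abstract concept to the field name used by one data standard. */
final class ConceptMapping {
    private final Map<String, String> conceptToField = new HashMap<>();
    ConceptMapping map(String concept, String field) {
        conceptToField.put(concept, field);
        return this;
    }
    String fieldFor(String concept) { return conceptToField.get(concept); }
}

/** A test written against the concept "country" and not against a concrete field name. */
final class MissingCountryTest {
    private final ConceptMapping mapping;
    private final Severity severity;
    MissingCountryTest(ConceptMapping mapping, Severity severity) {
        this.mapping = mapping;
        this.severity = severity;
    }

    /** Returns a finding when the mapped field is absent or blank, otherwise null. */
    Finding check(Map<String, String> record) {
        String field = mapping.fieldFor("country");
        String value = field == null ? null : record.get(field);
        if (value == null || value.trim().isEmpty()) {
            return new Finding(severity, "missing country (field: " + field + ")");
        }
        return null;
    }
}

/** Helper for focusing on the most significant issues. */
final class Findings {
    /** Keeps only the findings that are at least as severe as the threshold. */
    static List<Finding> atLeast(List<Finding> all, Severity threshold) {
        List<Finding> kept = new ArrayList<>();
        for (Finding f : all) {
            // Enum order is ERROR, WARNING, INFO, so a smaller ordinal means more severe.
            if (f.severity.compareTo(threshold) <= 0) {
                kept.add(f);
            }
        }
        return kept;
    }
}
```

With this arrangement, supporting another data standard would mean supplying another ConceptMapping, while the test logic and the severity filter stay unchanged.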

 

Written in Java, this is a desktop application. The tester comes as four files, all platform independent:

  • Framework
  • Tests
  • Source
  • Documents

 

Where in the data chain could this tool be used?

  • Data source
  • ALA central
  • User’s machine

 

When could this tool be used?

  • Before data is made available to ALA
  • While data is stored with ALA
  • As a post process, after data is with the user

 

Availability

 

Comments

GBIF has released three papers that discuss issues related to data quality. [5]

 

 

