speciesLink Data Cleaning

 

Summary

Type of tool: Set of tools

Function: Data cleaning

Online / Desktop: Online

Computer infrastructure:

Development status: Established

Time of use: Before the data is made available to ALA; while data is with the ALA

Licence: Negotiate to use locally

Data Cleaning aims to help curators identify possible errors and standardize data. Records are not modified. The system only presents "suspect" records and recommends that each author or curator check them.1

 

Description

 

Geographic distribution of all records within the speciesLink network.2 This map shows several sets of suspect data, including the following. Data points on the Greenwich meridian probably have a missing or zero longitude; similarly, data points on the equator probably have a missing or zero latitude. Data points on the line at a 45° angle to the Greenwich meridian and the equator have the same value for both latitude and longitude. Sea-based records concentrated on the southern side of this 45° line may have their latitude and longitude reversed.
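The coordinate patterns described for the map can be illustrated with a few simple checks. The following is a minimal sketch in Python, not the code used by speciesLink; the function name `coordinate_flags` and the flag labels are hypothetical. The check for reversed latitude and longitude is left out because it would need a land/sea reference layer.

```python
from typing import List, Optional


def coordinate_flags(lat: Optional[float], lon: Optional[float]) -> List[str]:
    """Return the reasons a coordinate pair looks suspect (empty list if none).

    Illustrative only: the names and labels here are assumptions,
    not part of the speciesLink Data Cleaning tool.
    """
    if lat is None or lon is None:
        return ["missing"]
    flags = []
    if lon == 0:
        # Point falls on the Greenwich meridian: longitude may be absent or zero-filled.
        flags.append("zero_longitude")
    if lat == 0:
        # Point falls on the equator: latitude may be absent or zero-filled.
        flags.append("zero_latitude")
    if lat == lon and lat != 0:
        # Point falls on the 45-degree line: one value may have been copied into both fields.
        flags.append("lat_equals_lon")
    return flags
```

As in the tool itself, a check like this only marks records for a curator to review; it does not change them.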

 

The Data Cleaning tool will summarise and report on:

  • records without coordinates
  • records in the sea
  • repeated records/fields
  • suspect taxonomy: family, genus, species, subspecies, author, duplicate
  • suspect locality names of country/municipality
  • suspect latitude and longitude
  • outliers
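To give a feel for the first and third items in the list above, here is a minimal Python sketch of a summary count. It assumes a record is a simple tuple of name, latitude and longitude; that layout, the `summarise` function and the sample names are invented for illustration and do not reflect the speciesLink data model.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record layout: (scientific name, latitude, longitude)
Record = Tuple[str, Optional[float], Optional[float]]


def summarise(records: Iterable[Record]) -> Dict[str, int]:
    """Count records without coordinates and records that occur more than once."""
    records = list(records)
    without_coordinates = sum(
        1 for _, lat, lon in records if lat is None or lon is None
    )
    # A record counts as repeated when an identical tuple appears at least twice.
    counts = Counter(records)
    repeated = sum(n for n in counts.values() if n > 1)
    return {
        "total": len(records),
        "without_coordinates": without_coordinates,
        "repeated": repeated,
    }


# Made-up sample data, for illustration only
sample = [
    ("Species A", -10.0, 130.0),
    ("Species A", -10.0, 130.0),
    ("Species B", None, None),
    ("Species C", -12.5, 131.0),
]
report = summarise(sample)
```

The report is a set of counts only, in keeping with the tool's approach of presenting suspect records without modifying them.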

 

Function

  • Data cleaning and manipulation
    • Data cleaning
    • Data validating – geography
  • Visualisation tools
    • Maps
  • User interface
    • Personal use
    • Data summary and visual presentation

 

Why use this tool?

  • To help curators identify data errors

 

Who will use this tool?

  • Data capture
    • Curators
  • Data providers
    • Institutions
    • Private collections
    • Casual users
  • Some skills are required

 

How will the tool be used?

  • Online tool when used for querying results of analysis
  • User input is required
  • Data is returned as a visual representation on a map, a summary report and data
  • This tool is run on pre-loaded datasets, probably overnight.3
  • This tool should be modified for the ALA and run locally (see discussion below)
  • Data Cleaning includes/links to the tools spOutlier and infoXY.4
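This page does not describe how spOutlier detects outliers. As a general illustration of outlier flagging only, the Python sketch below applies Tukey's interquartile-range rule to a list of latitudes for one species. This is a stand-in technique, not spOutlier's method, and the function name `iqr_outliers` and the sample values are hypothetical.

```python
from statistics import quantiles
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Tukey's rule is used here as a generic example; it is an assumption,
    not the algorithm implemented by spOutlier.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up latitudes for one species; the last value sits far from the rest
latitudes = [-23.1, -22.9, -23.0, -22.8, -23.2, -22.7, 12.0]
suspects = iqr_outliers(latitudes)
```

A flagged index marks a record to check, not necessarily an error: a genuine range extension would be flagged in the same way.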

 

Where in the data chain could this tool be used?

  • Data source
  • ALA central

 

When could this tool be used?

  • Before data is made available to ALA
  • While data is stored with ALA

 

Availability

 

Comments

  • These are online tools for collections held by CRIA - Centro de Referência em Informação Ambiental, Brazil, and others.
  • See also: Chapman, A.D. (2004). Environmental Data Quality – b. Data Cleaning Tools. Appendix I to /Sistema de Informação Distribuído para Coleções Biológicas: A Integração do Species Analyst e SinBiota. FAPESP/Biota process no. 2001/02175-5 March 2003 – March 2004./ Campinas, Brazil: CRIA 57 pp. http://splink.cria.org.br/docs/appendix_i.pdf.

 

Arthur Chapman, Australian Biodiversity Information Services, has suggested that Data Cleaning is definitely the type of tool that the ALA needs and should use, and that it would be best to obtain the code and run it, or a modified version, on the ALA. Although CRIA use some external datasets, they would probably not want the responsibility of running Australian data through the same tool, but that would need to be explored between the ALA and CRIA. It would be best to license and use Data Cleaning for Australian collections.6

 

 


3 Arthur Chapman, Australian Biodiversity Information Services, January 2008

4 Arthur Chapman, Australian Biodiversity Information Services, January 2008

5 Arthur Chapman, Australian Biodiversity Information Services, January 2008

6 Arthur Chapman, Australian Biodiversity Information Services, January 2008

Comments (4)

Paul Flemons said

at 10:02 am on Feb 4, 2008

John - are the Data Tester and Species Link Data Cleaning the same thing?

John Tann said

at 1:40 pm on Feb 4, 2008

Not quite. Data Cleaning, spOutlier and infoXY were all developed at CRIA. Data Tester was later created for GBIF by CRIA along the same lines. But they remain distinct.
