What is it?
Assure and control the quality of your data to prevent errors during data transfer and to eliminate existing errors from a dataset. Quality assurance matters throughout the whole data life cycle: data quality is one of the main challenges in promoting data reuse, so quality should be controlled at every step of data handling - when data are collected, entered, and analyzed. Data that are 'fit for use' should be accessible, accurate, complete, consistent with other sources, relevant, comprehensive, readable, and interpretable.
How to do it?
As soon as the data are entered in spreadsheets/databases, you can start with basic quality assurance.
- Data entered by hand should be double-checked.
- Check the consistency of the format throughout the data set.
- Perform a data-cleaning pass: detect and remove errors (data entry, measurement, distillation, and data integration errors) and discrepancies to enhance data quality.
- Use visualization tools (like the GFBio VAT system) that can create a graphical overlay with other datasets to detect missing, impossible, or anomalous values and discrepancies, and to identify outliers.
- Transform the data, then verify the transformation: ensure it cures all errors found.
- Feed the cleaned data back into the source, replacing the erroneous values.
- Save the cleaned dataset as a new version (versioning), so that errors introduced during cleaning remain reparable.
- Document the cleaning process in a script (transformation workflow/mapping rules).
- At later stages of the data life cycle, e.g. during analysis or interpretation of results for a subsequent publication, further discrepancies could be detected and eliminated. Document each change to the data carefully!
- Communicate data quality using either coding/flagging within the data set, or in the metadata. For some data, errors can be indicated very precisely, e.g. for most measurement instruments a measuring accuracy is defined.
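The basic checks above - spotting missing, implausible, and outlying values, and flagging rather than silently deleting them - can be sketched in a short script. This is a minimal illustration, not a GFBio tool: the sample records, column name, and plausibility threshold are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical field measurements: one implausible entry (likely a data
# entry error, 185.0 instead of 18.5) and one missing value.
RAW = """site,date,temp_c
A,2021-06-01,18.2
A,2021-06-02,19.1
B,2021-06-02,185.0
B,2021-06-03,17.9
C,2021-06-03,
"""

def quality_check(text, column="temp_c", max_plausible=60.0):
    """Flag suspect values in one column instead of silently dropping them."""
    rows = list(csv.DictReader(io.StringIO(text)))
    values = [float(r[column]) for r in rows if r[column]]
    mean, sd = statistics.mean(values), statistics.stdev(values)
    for r in rows:
        if not r[column]:
            r["flag"] = "missing"
        elif float(r[column]) > max_plausible:
            r["flag"] = "implausible"  # outside the physically plausible range
        elif abs(float(r[column]) - mean) > 2 * sd:
            r["flag"] = "outlier"      # statistically anomalous but possible
        else:
            r["flag"] = "ok"
    return rows

for row in quality_check(RAW):
    print(row["site"], row["date"], row["temp_c"], row["flag"])
```

Writing the flags into a new column keeps the original values intact and communicates data quality inside the dataset itself, as recommended above.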
Who does it?
Currently every data producer and re-user who does their own research or is part of a research program.
- Check for accuracy and consistency of data structure and format.
- Detect errors (statistical/graphical analysis).
- Go through the data cleaning process and create a script of it.
- Version your data sets.
- Document the quality of data by flags, metadata, coding.
- Use tools like BExIS2 or Diversity Workbench (DWB) to make quality assurance and control easier.
- Use the GFBio VAT system for visualization, analysis, and transformation: create a graphical overlay with other data to detect errors, transform between different coordinate systems, and display the distribution of your data graphically.
http://www.youtube.com/watch?v=i2jcOJOFUZg (MANTRA Video with Jeff Haywood - Importance of good file management in research)
http://openrefine.org/ (useful tool)