The qualities of good data

How to ensure your data is high-quality

Open data management is ultimately about providing high-quality, highly usable data. There are several indicators to help us think about and assess data quality and usability.

We’ll discuss each one in detail, and explain how each is addressed in the typical open data workflow.

In this lesson, you will learn

  • What formats to use for open data
  • How to make open data accessible
  • Why it’s important to keep open data up-to-date
  • What documentation should be supplied with open data

Good data is accessible

Perhaps the most obvious indicator of quality is whether data is easily accessible. In other words, users need to be able to find it and understand the terms under which they can use it.

Open data platforms should include search functionality, be compatible with commonly used search engines, and provide clear terms of use. This is the responsibility of the data curators who manage the open data catalogue, a role we discuss in more detail in Developing your open data programme.

The second part of accessible data is that it should be provided in one or more open data formats. Providing data in open formats is typically the role of data producers, with guidance from data curators.

Good data is well-documented

Well-documented data includes metadata — additional data that describes the data you are presenting.

Metadata tells the user what the data is about, who collected it, when it was last updated, and how to use and interpret it. Metadata may also include methodology notes or variable documentation.

Each dataset should be accompanied by a set of standard metadata to provide documentation and context. Both data producers and data curators have responsibility for providing metadata.

Essential metadata fields are the dataset title, description, owner, contact information for questions, license and dates of first publication and last modification.
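The essential fields above can be sketched as a simple record. The dataset, contact address, and field names below are illustrative placeholders, not a formal standard:

```python
# A minimal sketch of essential metadata fields for a dataset,
# represented as a Python dictionary. All values are hypothetical.
dataset_metadata = {
    "title": "City Air Quality Measurements",
    "description": "Hourly PM2.5 readings from city monitoring stations.",
    "owner": "City Environment Department",
    "contact": "opendata@example.gov",   # placeholder address
    "license": "CC-BY-4.0",
    "first_published": "2021-03-01",
    "last_modified": "2024-06-15",
}

# Every essential field should be present before publication.
required = {"title", "description", "owner", "contact",
            "license", "first_published", "last_modified"}
missing = required - dataset_metadata.keys()
print(sorted(missing))  # an empty list means the record is complete
```

A data curator can run a check like this against every dataset in the catalogue before it is published.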

There are many additional metadata fields that you could choose to include, but many of them will only be relevant to certain types of data.

An example of good metadata

Metadata in the description of the Global Mines Action Database

An example of how to implement metadata: Dublin Core and DCAT

The Dublin Core Schema provides a set of standard metadata vocabulary terms. DCAT (the Data Catalog Vocabulary) is a W3C standard that builds on Dublin Core terms to describe datasets in a machine-readable way. Open data catalogues that support DCAT allow users to read metadata in a standardized format. Many off-the-shelf data catalogue platforms provide DCAT compatibility as a standard feature.
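As a rough illustration, a DCAT dataset record is often serialized as JSON-LD. The sketch below uses real DCAT (`dcat:`) and Dublin Core (`dct:`) terms, but the dataset itself is hypothetical:

```python
import json

# A sketch of a DCAT dataset record serialized as JSON-LD.
# "dct:" terms come from Dublin Core; "dcat:" terms from the
# W3C Data Catalog Vocabulary. The dataset is made up.
record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "City Air Quality Measurements",
    "dct:description": "Hourly PM2.5 readings from monitoring stations.",
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "dct:issued": "2021-03-01",
    "dct:modified": "2024-06-15",
    "dcat:keyword": ["air quality", "environment"],
}

# Because the record is plain JSON, any catalogue or client can parse it.
parsed = json.loads(json.dumps(record))
print(parsed["@type"])  # dcat:Dataset
```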

Good data is timely

Datasets that are time-dependent must be updated at regular intervals in order to remain relevant and useful. Very few users will want last week’s weather forecast, or yesterday’s public transportation schedules.

A snapshot of The World Bank’s update feed.

Data producers are responsible for keeping data up-to-date. Data curators must ensure that the open data catalogue contains the latest dataset and that the metadata reflects the most recent update. For datasets that change regularly, the update schedule should be included in the metadata.

Good data is well-formatted

Well-formatted data is published in a format that is easily usable by all users. Your data should be machine-readable: stored in a file format that software can parse directly, with no need for proprietary tools.

This satisfies the concept of “technically open” that we introduced in Lesson 1. Providing data in open formats is typically the role of data producers, with guidance from data curators.

Some data file formats satisfy the standard of “technically open” better than others. The best format for your data depends on the type of data. It is common for datasets to be provided in multiple formats for greatest flexibility.

Highly recommended formats

Highly recommended formats are designed to be as open as possible.

Here are the file formats that are highly recommended for open data, with a brief description of each.

Formats for tabular data

CSV Comma Separated Value files are simple text files in which each line is a record and fields are separated by commas.

TSV Tab Separated Value files are simple text files in which each line is a record and fields are separated by tabs.

Fixed-width text Fixed-width files are simple text files in which each line is a record and each field occupies the same number of characters in every record.

TDP The Tabular Data Package is a special case that combines a CSV data file with a JSON-formatted metadata file.
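To see why plain-text tabular formats like CSV count as machine-readable, consider this round trip: the same records can be written and read back with any standard library, no proprietary software required. The station data below is made up for illustration:

```python
import csv
import io

# Hypothetical tabular records: one dict per row.
rows = [
    {"station": "North", "date": "2024-06-15", "pm25": "12.4"},
    {"station": "South", "date": "2024-06-15", "pm25": "9.8"},
]

# Write the rows as CSV text (here to an in-memory buffer;
# a real publisher would write to a .csv file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["station", "date", "pm25"])
writer.writeheader()
writer.writerows(rows)

# Reading the CSV back recovers the original records exactly.
buffer.seek(0)
recovered = list(csv.DictReader(buffer))
print(recovered == rows)  # True
```

A TSV file works the same way with `delimiter="\t"`; the Tabular Data Package simply pairs a CSV file like this with a JSON metadata file.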

Formats for multi-dimensional data

JSON JavaScript Object Notation is popular with software developers for publishing complex data structures.

XML Extensible Markup Language resembles HTML source code, but is designed for data instead of text.

RDF Resource Description Framework provides a standard model for the Semantic Web, often called “linked data”.
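The advantage of a format like JSON over flat tabular files is nesting: a record can contain lists and sub-objects that a single CSV row cannot express. A sketch, using a made-up monitoring station:

```python
import json

# A hypothetical nested record: the station has a location object
# and a list of readings, which would not fit in one CSV row.
station = {
    "name": "North",
    "location": {"lat": 51.5, "lon": -0.1},
    "readings": [
        {"date": "2024-06-15", "pm25": 12.4},
        {"date": "2024-06-16", "pm25": 11.1},
    ],
}

text = json.dumps(station, indent=2)    # publishable as a .json file
restored = json.loads(text)             # any consumer can parse it back
print(restored["readings"][0]["pm25"])  # 12.4
```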

Formats for geospatial data

GeoJSON A variation of JSON specifically for geospatial data.

KML Keyhole Markup Language, a variation of XML specifically for geospatial data, originally developed for Google Earth.
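A minimal GeoJSON record shows how the format attaches properties to a geometry. Note that GeoJSON orders coordinates as [longitude, latitude]; the location and values below are made up:

```python
import json

# A minimal GeoJSON Feature: a point geometry plus properties.
# Coordinates are [longitude, latitude]. Values are hypothetical.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-0.1, 51.5]},
    "properties": {"name": "North station", "pm25": 12.4},
}

# Because GeoJSON is valid JSON, generic (non-GIS) tools can read it too.
parsed = json.loads(json.dumps(feature))
print(parsed["geometry"]["type"])  # Point
```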

Somewhat recommended formats

“Somewhat recommended” formats are usually sufficient because they are supported by freely available software, even though some formats began as proprietary products.

Tabular data

Microsoft Excel files are sufficient for tabular data because they can be read by many different software products. The ability to store multiple sheets can also be advantageous for including metadata.

Multi-dimensional data

HTML is sometimes used to publish data if the HTML code is very well structured, but XML usually provides a superior approach.

Geospatial data

The shapefile format was originally developed as a proprietary format and is maintained by Esri. Many software products can read and write shapefiles easily.

Formats to avoid

“Not Recommended” formats are not advised for publishing open data. They are only acceptable if data is separately made available in one of the recommended formats.

Here are the formats that are not recommended, and the reasons why:

Stata

Stata files (with the “.dta” file extension) can only be read by proprietary software.

SAS

Statistical Analysis System (SAS) files can only be read by proprietary software. SAS files use a variety of file extensions, including “.sas7bdat” for datasets and “.sas” for program files.

SPSS

These files can only be read by proprietary software. Files in this format typically end in the “.sav” file extension.

PDF

Portable Document Format files are designed for page layout, not data exchange. It is prohibitively difficult to extract data reliably from a PDF.

JPEG/JPG, PNG and TIFF

These formats are designed for images, not data. Computers cannot easily extract words or numbers embedded in images.