How to ensure your data is high-quality
Open data management is ultimately about providing high quality, highly usable data. There are several indicators to help us think about and assess data quality and usability.
We’ll discuss each one in detail, and explain how each is addressed in the typical open data workflow.
In this lesson, you will learn
- What formats to use for open data
- How to make open data accessible
- Why it’s important to keep open data up-to-date
- What documentation should be supplied with open data
Good data is accessible
Accessible data should also be provided in one or more open data formats. Providing data in open formats is typically the role of data producers, with guidance from data curators.
Good data is well-documented
Metadata tells the user what the data is about, who collected it, when it was last updated, and how to use and interpret it. Metadata may also include methodology notes or variable documentation.
Each dataset should be accompanied by a set of standard metadata to provide documentation and context. Both data producers and data curators have responsibility for providing metadata.
Essential metadata fields are the dataset title, description, owner, contact information for questions, license and dates of first publication and last modification.
There are many additional metadata fields that you could choose to include, but many of them are only relevant to certain types of data.
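To make the essential fields concrete, here is a minimal sketch of a metadata record in Python, serialized to JSON. The dataset name, contact address, and dates are illustrative placeholders, not a real dataset.

```python
import json

# A hypothetical metadata record covering the essential fields:
# title, description, owner, contact, license, and the dates of
# first publication and last modification.
metadata = {
    "title": "City Bicycle Counts 2023",
    "description": "Daily bicycle counts from automated counters.",
    "owner": "City Transportation Department",
    "contact": "opendata@example.gov",
    "license": "CC-BY-4.0",
    "issued": "2023-01-15",
    "modified": "2024-03-01",
}

print(json.dumps(metadata, indent=2))
```

Storing metadata as a simple key-value structure like this makes it easy to validate that every dataset in a catalog carries the same essential fields.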
An example of good metadata
An example of how to implement metadata is the Data Catalog Vocabulary (DCAT), which builds on the Dublin Core Metadata Schema.
The Dublin Core Schema provides a set of standard metadata vocabulary terms. DCAT is a machine-readable W3C vocabulary for describing datasets and catalogs that reuses Dublin Core terms. Open data catalogs that support DCAT allow users to read metadata in a standardized format. Many off-the-shelf data catalog platforms provide DCAT compatibility as a standard feature.
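Here is a sketch of what a DCAT dataset description can look like when serialized as JSON-LD. The dataset details and download URL are illustrative placeholders; the `dcat:` and `dct:` namespace URLs are the standard W3C and Dublin Core ones.

```python
import json

# A hypothetical DCAT dataset record in JSON-LD form.
# dcat: and dct: are the standard DCAT and Dublin Core namespaces.
dcat_record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "City Bicycle Counts 2023",
    "dct:description": "Daily bicycle counts from automated counters.",
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "dct:issued": "2023-01-15",
    "dct:modified": "2024-03-01",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "https://data.example.gov/bike-counts.csv",
        "dcat:mediaType": "text/csv",
    },
}

print(json.dumps(dcat_record, indent=2))
```

Because the vocabulary terms are standardized, any DCAT-aware catalog or harvester can interpret a record like this without custom parsing.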
Good data is timely
Datasets that are time dependent must be updated at regular intervals in order to remain relevant and useful. Very few users will want last week’s weather forecast, or yesterday’s public transportation schedules.
Data publishers are responsible for keeping data up-to-date. Data curators must ensure that the open data catalog contains the latest dataset and that metadata reflects the most recent update. The update schedule should be included in the metadata for datasets that change regularly.
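One practical use of an update schedule in metadata is automated staleness checking. The helper below is a hypothetical sketch: the frequency labels and day thresholds are assumptions, not part of any standard.

```python
from datetime import date, timedelta

# Assumed mapping from update-frequency labels to maximum allowed
# gaps between updates; real catalogs may define these differently.
UPDATE_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
}

def is_stale(last_modified: date, frequency: str, today: date) -> bool:
    """Return True if the dataset has missed its promised update window."""
    return today - last_modified > UPDATE_INTERVALS[frequency]

# A weekly dataset last updated 10 days ago is overdue.
print(is_stale(date(2024, 3, 1), "weekly", date(2024, 3, 11)))  # True
```

A curator could run a check like this across a whole catalog to flag datasets whose metadata promises updates that have not arrived.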
Good data is well-formatted
Well-formatted data is published in a format that all users can easily work with: it should be machine-readable, require no proprietary software, and use a file format that software can parse directly.
This satisfies the concept of “technically open” that we introduced in Lesson 1. Providing data in open formats is typically the role of data producers, with guidance from data curators.
Some data file formats satisfy the standard of “technically open” better than others. The best format for your data depends on the type of data. It is common for datasets to be provided in multiple formats for greatest flexibility.
Highly recommended formats
Highly recommended formats are designed to be as open as possible.
Here are the file formats that are highly recommended for open data, with a description of each.
Formats for tabular data
CSV: Comma-Separated Values files are simple text files in which each line is a record and fields are separated by commas.
TSV: Tab-Separated Values files are simple text files in which each line is a record and fields are separated by tabs.
Fixed-width text: Fixed-width files are simple text files in which each line is a record and each field occupies the same number of characters in every record.
TDP: The Tabular Data Package is a special case that combines a CSV data file with a JSON-formatted metadata file.
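To make the CSV/TSV distinction concrete, here is a short sketch using Python's standard `csv` module. The file names and records are made up for illustration; the only difference between the two outputs is the field delimiter.

```python
import csv

# Hypothetical tabular records: a header row followed by data rows.
records = [
    ["station", "date", "count"],
    ["Main St", "2023-06-01", "412"],
    ["Harbor Rd", "2023-06-01", "389"],
]

# CSV: fields separated by commas (the csv module's default).
with open("counts.csv", "w", newline="") as f:
    csv.writer(f).writerows(records)

# TSV: the same records, with tabs as the delimiter.
with open("counts.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(records)

print(open("counts.csv").read())
```

Because both files are plain text, any spreadsheet, statistics package, or programming language can read them without proprietary software.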
Formats for multi-dimensional data
XML: Extensible Markup Language resembles HTML source code, but is designed for data instead of text.
RDF: Resource Description Framework provides a standard model for the Semantic Web, often called “linked data”.
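As a sketch of how XML carries data rather than page content, here is one of the tabular records from earlier expressed as XML using Python's standard library. The element names are illustrative, not part of any standard schema.

```python
import xml.etree.ElementTree as ET

# Build a hypothetical XML record; each field becomes a named element,
# so the structure is self-describing and machine-readable.
record = ET.Element("record")
ET.SubElement(record, "station").text = "Main St"
ET.SubElement(record, "date").text = "2023-06-01"
ET.SubElement(record, "count").text = "412"

print(ET.tostring(record, encoding="unicode"))
```

Unlike a CSV row, each value here is labeled by its enclosing element, which makes XML a natural fit for nested, multi-dimensional data.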
Formats for geospatial data
GeoJSON: A variation of JSON specifically for geospatial data.
KML: A variation of XML specifically for geospatial data, originally designed by Google.
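Here is a minimal GeoJSON document with a single point feature, built as a Python dictionary. The coordinates and properties are illustrative placeholders; note that GeoJSON coordinates are ordered longitude first, then latitude.

```python
import json

# A hypothetical GeoJSON FeatureCollection containing one point.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [-122.42, 37.77]},
            "properties": {"station": "Main St", "count": 412},
        }
    ],
}

print(json.dumps(geojson, indent=2))
```

Because GeoJSON is ordinary JSON, it can be parsed by any JSON library and is supported directly by most web mapping tools.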
Somewhat recommended formats
“Somewhat recommended” formats are usually sufficient because they are supported by freely available software, even though some formats began as proprietary products.
Microsoft Excel files are sufficient for tabular data because they can be read by many different software products. The ability to store multiple sheets can also be advantageous for including metadata.
HTML is sometimes used to publish data if the HTML code is very well structured, but XML usually provides a superior approach.
The shapefile format was originally developed as a proprietary format, and its specification is controlled by Esri. Nonetheless, many software products can read and write shapefiles easily.
Formats to avoid
“Not Recommended” formats are not advised for publishing open data. They are only acceptable if data is separately made available in one of the recommended formats.
Here are the formats that are not recommended and the reasons why:
Stata files (files with the “.dta” file extension) can only be read by proprietary software.
SAS (Statistical Analysis System) files can only be read by proprietary software. SAS files use a variety of file extensions, including “.sas” and others.
SPSS files can only be read by proprietary software. Files in this format typically end in the “.sav” file extension.
Portable Document Format files are designed for page layout. It is prohibitively difficult to extract data from a PDF.
JPEG/JPG, PNG and TIFF
These files are designed for images, not data: computers cannot easily read words or numbers embedded in images.