Why standards are important
The goal of publishing data is to make it as useful as possible for the end user, without compromising the identity of any of the subjects included within the data set. Many open data portals around the world, such as the Kenya Open Data Initiative, include advanced tools for visualisations – such as maps – so that users can generate graphics and charts without having to download anything at all. Before you think about including these kinds of tools, however, it’s important to get the basics of data publishing right.
These include making sure comprehensive metadata is included, and making sure data is clean and tidy before it is published.
In this lesson, you will learn:
- What it means to publish complete, tidy data
- How to compile and publish metadata
- Why data should be disaggregated as far as possible
- Why data should be anonymised
Good data is complete
There are several kinds of gaps that a dataset can contain. Prior to publishing your dataset, check to make sure it includes all the intended records and variables.
Most datasets contain at least some gaps, and incomplete data should not be a deterrent to publication. However, data producers should try to address data gaps as much as possible.
Blank or missing records in an Excel document may suggest that separate tables aren’t aligning correctly, or that database keys aren’t accurate. Perhaps the data processing system should be reviewed.
Individual blank cells in an Excel document often result from missing or unavailable data, which is to be expected to some degree. Too many missing cells may indicate a production issue.
Blank or missing columns in an Excel document may suggest that variables aren’t being recognised or calculated correctly. Perhaps the data processing system should be reviewed.
Clean and tidy data
When working with data, you should adhere to the principles of “tidy data”. Your objective when working with raw data is to tidy it into a structured dataset which can be used for analysis. The principles of tidy data are:
- Each variable you measure should be in one column.
- Each different observation of that variable should be in a different row.
- There should be one table for each “kind” of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
In practice, this might mean a table in which one column – for example – is the name of a product and the next is the price. Each row tells you the name and price of a product. The names should all be written with the same spelling conventions and capitalisations, and the prices should all be decimal numbers with no spaces or other characters.
Importantly, in this example, the currency symbol should be included in the column heading, not in the column values, so that you can add the values together or perform other calculations on them.
A human may be able to add “Sch2.12 + Sch3.45”, but a computer needs to see “2.12 + 3.45” and read “schillings” from the column heading.
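The pricing example above can be sketched in a few lines of Python. The product names and the column name `price_sch` are invented for illustration; the point is that keeping the currency in the column name leaves the values as plain numbers:

```python
# A minimal sketch of a tidy product/price table: one variable per
# column, one observation per row. "price_sch" carries the currency
# in the column name, so the values themselves stay numeric.
products = [
    {"product": "Maize flour", "price_sch": 2.12},
    {"product": "Cooking oil", "price_sch": 3.45},
]

# Because the values are plain numbers, they can be summed directly.
total = sum(row["price_sch"] for row in products)
print(f"Total: Sch{total:.2f}")
```

Had the cells contained strings such as “Sch2.12”, the sum would have required stripping the currency symbol from every value first.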
Good data is “disaggregated”
Data in publications are often presented as aggregated tables or charts. But open data should be published at the smallest possible aggregation level that doesn’t compromise individual privacy.
For instance, data may be aggregated by gender, region, or economic sector, using averages rather than single observations. This is good for broad analysis, but users should be able to go as far into the original dataset as possible in order to draw their own conclusions.
Say you’re looking at a dataset about wealth distribution. It could be aggregated at the national level (GDP per capita), the provincial level (average income per province), the municipal level or at the township or suburb level. Publishers in this case should look to publish the most detailed level (township or suburb) and allow users to aggregate as they need.
Any further disaggregation – to the street, house or individual level – would be unwise to publish as it would infringe on individual privacy.
Disaggregated data are also more re-usable than their aggregated counterparts.
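Re-usability here means that users of detailed data can roll it up to any coarser level themselves. A Python sketch, using invented province, township and income figures:

```python
# Hypothetical township-level records; all names and figures are
# invented for illustration.
from collections import defaultdict

records = [
    {"province": "Central", "township": "A", "avg_income": 14200},
    {"province": "Central", "township": "B", "avg_income": 9800},
    {"province": "Coast",   "township": "C", "avg_income": 11600},
]

# Roll the detailed data up to a coarser level -- here, a simple
# (unweighted) mean per province.
by_province = defaultdict(list)
for row in records:
    by_province[row["province"]].append(row["avg_income"])

provincial_avg = {p: sum(v) / len(v) for p, v in by_province.items()}
print(provincial_avg)
```

The reverse is impossible: a user who only has provincial averages cannot recover the township-level detail.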
Good data is accurate
Sometimes accuracy and precision can be lost as files are converted for publication. Data producers and curators should watch for these common issues.
Here are a few best practices to follow to make sure your data is accurate:
- Use the highest appropriate level of decimal point precision
- Include 4-digit years and times, if relevant
- Use UTF-8/Unicode to avoid character loss
Below you can view a table comparing imprecise and precise data points:
| | Imprecise | Precise |
| --- | --- | --- |
| Decimal point precision | 45 | 45.2396 |
| Text encoding | S□o Paulo | São Paulo |
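The text-encoding loss above typically happens when a file is written or read with the wrong character set. Being explicit about UTF-8 at both ends avoids it; a minimal Python sketch (the file name is illustrative):

```python
# Write and re-read a CSV with an explicit UTF-8 encoding so that
# accented characters such as "São Paulo" survive the round trip.
import csv
import os
import tempfile

rows = [["city", "value"], ["São Paulo", "45.2396"]]

path = os.path.join(tempfile.gettempdir(), "cities.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open(path, encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back[1])  # the accented city name is intact
```

Relying on the platform's default encoding is what produces mangled values like “S□o Paulo” when the file is opened elsewhere.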
Good data is standardised
Standard formats exist for many kinds of data variables to ensure legibility, compatibility and efficient data processing. Adhering to standards is the responsibility of data producers, with guidance from data curators.
Here are some rules and standards for common types of data:
- Dates — Years should include four digits, while months and days should include two digits each. Values should be ordered from largest to smallest (following ISO 8601 principles).
- Coordinates — Latitude and longitude should be in signed degree format, not compass directions or Degree/Minute/Second.
- Numbers — Use full decimal values, not scientific notation. Measurement units should be placed in separate variables or columns, or as metadata, and NOT included with values.
- Undefined or Missing Values — Use an empty field. Values such as 0 or -1 are prone to misinterpretation.
Below is a comparison between non-standard and standard variables in datasets:
| | Non-standard | Standard |
| --- | --- | --- |
| Dates | MARCH 20, 2009; 3/20/2009 | 2009-03-20 |
| Coordinates | 41°25′01″N, 120°58′57″W | 41.41694, -120.9825 |
| Numbers | 3.50E+04; 35000 ML | 35000 |
| Undefined or missing values | 0, -1 | (empty field) |
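These conversions can be automated during data preparation. A Python sketch of two of them, date normalisation and degree/minute/second coordinates to signed decimal degrees (the helper name `dms_to_decimal` is invented, not a standard library function):

```python
from datetime import datetime

def dms_to_decimal(deg, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees.
    South and West hemispheres become negative values."""
    value = deg + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value

# 41°25'01"N, 120°58'57"W in signed decimal degrees
lat = round(dms_to_decimal(41, 25, 1, "N"), 5)
lon = round(dms_to_decimal(120, 58, 57, "W"), 5)

# US-style date to ISO 8601 (largest unit first)
iso_date = datetime.strptime("3/20/2009", "%m/%d/%Y").date().isoformat()

print(lat, lon, iso_date)
```

Doing such conversions once, before publication, spares every downstream user from repeating them.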
Organisation and government standards
Many types of data have established standards that specify which types of information to include and how to represent them.
It is worthwhile to do some research to see if there is a standard for your data, and adopt it if possible.
Here are a few examples of standards used in data published by governments and organisations:
| Acronym | Name | Scope | Website |
| --- | --- | --- | --- |
| OCDS | Open Contracting Data Standard | Public procurement and contracting | http://standard.open-contracting.org |
| HXL | Humanitarian Exchange Language | Lightweight standard to improve information sharing in humanitarian crises | http://hxlstandard.org/ |
| GTFS | General Transit Feed Specification | Common format for public transportation schedules and associated geographic information | https://developers.google.com/transit/gtfs/ |
| IATI | International Aid Transparency Initiative | Publishing framework allowing comparisons of international aid data from different organisations | http://iatistandard.org/ |
Good data protects confidentiality
Open data must always protect the privacy and confidentiality of any individuals described in a dataset. Privacy and confidentiality are also important for businesses. It is particularly important to protect privacy in household and business surveys.
Protecting confidentiality starts with knowing what sorts of information could potentially identify an individual or firm, either alone or in combination. These identifiers can either be direct or indirect.
People can be identified in many ways. Names, email addresses, identity numbers or cellphone numbers can all be used to link open data to individuals. But knowing a few seemingly anonymous details can also reveal identities: a household may be the only one on a street with a certain number of children, for example.
We call these “indirect identifiers”.
Direct versus indirect identifiers
Direct identifiers are bits of information that uniquely identify specific individuals or firms. Examples include:
- Phone numbers
- Role title and employer
- Tax or health registration number
- Passport or national ID number
- Rare health conditions
Indirect identifiers delineate groups, for example based on age, race or income. While this does not immediately identify individuals, a combination of indirect identifiers in a dataset could lead to them being identified. Examples include:
- Geo-locators (eg. map data)
- Unusual education
- Unusual occupation
- Unusual physical characteristics
- Dates (eg. birth, admission, discharge)
- Salaries or revenues
The process of using indirect identifiers to reveal the identities of people included in a dataset is called “de-anonymisation”. When the city of New York released open data about taxi journeys, for example, it did so because it believed researchers would be able to use this information to improve city planning.
Although passenger details were removed, however, it was possible to identify individual fares based on where the taxis regularly stopped.
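Before publishing, a data provider can estimate this risk by counting how many records share each combination of indirect identifiers: any combination that appears only once is uniquely re-identifiable. A minimal Python sketch of such a check, with invented records and column names:

```python
# Count how many records share each combination of indirect
# (quasi-) identifiers. Groups of size 1 are uniquely
# re-identifiable and should be suppressed or generalised.
from collections import Counter

records = [
    {"suburb": "Hillview", "age_band": "30-39", "children": 2},
    {"suburb": "Hillview", "age_band": "30-39", "children": 2},
    {"suburb": "Hillview", "age_band": "60-69", "children": 5},  # unique
]

quasi = ("suburb", "age_band", "children")
groups = Counter(tuple(r[k] for k in quasi) for r in records)

at_risk = [key for key, n in groups.items() if n == 1]
print(at_risk)
```

This is the intuition behind k-anonymity: every combination of indirect identifiers should be shared by at least k records before release.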
Protecting confidentiality in datasets
It is the responsibility of data providers to ensure that appropriate measures are taken to protect confidentiality. The process of data anonymisation is highly specialised, and consists of many different techniques.
Data aggregation is one common anonymisation technique: when data are aggregated into zones or regions, many individual characteristics are replaced with averages.
Even then, small areas or extreme outliers can make it possible to identify individuals.
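Both points can be illustrated in a short Python sketch: individual incomes are replaced with zone averages, and zones with too few respondents are suppressed rather than published. The threshold of three and all figures are invented for illustration:

```python
# Aggregate individual incomes to zone averages, suppressing zones
# with fewer than a minimum number of respondents -- a common
# disclosure-control rule (the threshold of 3 is illustrative).
from collections import defaultdict

individuals = [
    ("Zone A", 21000), ("Zone A", 24000), ("Zone A", 30000),
    ("Zone B", 95000),  # a single respondent: publishing this
                        # "average" would reveal one person's income
]

MIN_GROUP = 3
zones = defaultdict(list)
for zone, income in individuals:
    zones[zone].append(income)

published = {
    z: (sum(v) / len(v) if len(v) >= MIN_GROUP else None)  # None = suppressed
    for z, v in zones.items()
}
print(published)
```

Without the suppression rule, the “average” for Zone B would simply be the outlier's own income, defeating the purpose of aggregation.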