Standards and anonymisation: best practice when publishing data

Why standards are important

The goal of publishing data is to make it as useful as possible for the end user, without compromising the identity of any of the subjects included within the data set. Many open data portals around the world, such as the Kenya Open Data Initiative, include advanced tools for visualisations – such as maps – so that users can generate graphics and charts without having to download anything at all. Before you think about including these kinds of tools, however, it’s important to get the basics of data publishing right.

These include making sure comprehensive metadata is included, and making sure data is clean and tidy before it is published.

In this lesson, you will learn:

  • What it means to publish complete, tidy data
  • How to compile and publish metadata
  • Why data should be disaggregated as far as possible
  • Why data should be anonymised

Good data is complete

There are several kinds of gaps that a dataset can contain. Prior to publishing your dataset, check to make sure it includes all the intended records and variables.

Most datasets contain at least some gaps, and incomplete data should not be a deterrent to publication. However, data producers should try to address data gaps as much as possible.

Blank or missing records in an Excel document may suggest that separate tables aren’t aligning correctly, or that database keys aren’t accurate. Perhaps the data processing system should be reviewed.

Individual blank cells in an Excel document often result from missing or unavailable data, which is to be expected to some degree. Too many missing cells may indicate a production issue.

Blank or missing columns in an Excel document may suggest that variables aren’t being recognised or calculated correctly. Perhaps the data processing system should be reviewed.
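The three kinds of gaps described above can be counted before publication. The following is a minimal sketch using only Python's standard library; the function name `summarise_gaps`, the sample column names and the sample values are made up for illustration, and it assumes the data has been exported from Excel to CSV.

```python
import csv
import io

def summarise_gaps(csv_text):
    """Count blank rows, blank columns and blank cells in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []

    def is_blank(value):
        # A short row yields None for the missing fields.
        return value is None or value.strip() == ""

    blank_rows = sum(1 for row in rows if all(is_blank(row[c]) for c in columns))
    blank_columns = [c for c in columns if all(is_blank(row[c]) for row in rows)]
    blank_cells = sum(1 for row in rows for c in columns if is_blank(row[c]))
    return {
        "records": len(rows),
        "blank_rows": blank_rows,
        "blank_columns": blank_columns,
        "blank_cells": blank_cells,
    }

# Hypothetical sample: one empty record and one column that was never filled in.
SAMPLE = """district,year,population,households
North,2019,1200,
South,2019,,
,,,
East,2019,900,
"""

result = summarise_gaps(SAMPLE)
print(result)
```

A blank row or an entirely blank column is worth investigating in the processing system, while a handful of blank cells is usually acceptable.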

Clean and tidy data

When working with data, you should adhere to the principles of “tidy data”. Your objective when working with raw data is to tidy it into a structured dataset which can be used for analysis. The principles of tidy data are:

  • Each variable you measure should be in one column.
  • Each different observation of that variable should be in a different row.
  • There should be one table for each “kind” of variable.
  • If you have multiple tables, they should include a column in the table that allows them to be linked.

In practice, this might mean a table in which one column – for example – is the name of a product and the next is the price. Each row tells you the name and price of a product. The names should all be written with the same spelling conventions and capitalisations, and the prices should all be decimal numbers with no spaces or other characters.

Importantly, in this example, the currency symbol should be included in the column heading, not in the column values, so that you can add the values together or perform other calculations on them.

A human may be able to add “Sch2.12 + Sch3.45”, but a computer needs to see “2.12+3.45” schillings.
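As a rough illustration of the point about currency symbols, the sketch below strips the currency label from each value and moves it into the column name. The product names and the column name `price_sch` are hypothetical, and the code assumes a full stop is used as the decimal separator.

```python
from decimal import Decimal
import re

def parse_price(raw):
    """Remove currency labels, spaces and thousands separators; return a Decimal."""
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    return Decimal(cleaned)

# Untidy input: the currency is mixed into the values, with inconsistent spacing.
raw_rows = [("Maize flour", "Sch2.12"), ("Cooking oil", "Sch 3.45")]

# Tidy output: one variable per column, with the currency named in the heading.
tidy_rows = [
    {"product": name, "price_sch": parse_price(price)}
    for name, price in raw_rows
]

total = sum(row["price_sch"] for row in tidy_rows)
print(total)
```

Using `Decimal` rather than a floating-point number keeps money values exact when they are added together.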

Good data is “disaggregated”

Figure: Climate change predictions using a variety of different models. Here the data is disaggregated, so you can see every model rather than one “average change” line. Source: CGIAR Research Program on Climate Change, Agriculture and Food Security.

Data in publications are often presented as aggregated tables or charts. But open data should be published at the smallest possible aggregation level that doesn’t compromise individual privacy.

For instance, data may be aggregated by gender, region, or economic sector, using averages rather than single observations. This is good for broad analysis, but users should be able to go as far into the original dataset as possible in order to draw their own conclusions.

Say you’re looking at a dataset about wealth distribution. It could be aggregated at the national level (GDP per capita), the provincial level (average income per province), the municipal level or at the township or suburb level. Publishers in this case should look to publish the most detailed level (township or suburb) and allow users to aggregate as they need.

Any further disaggregation – to the street, house or individual level – would be unwise to publish as it would infringe on individual privacy.

Disaggregated data possess greater re-usability than their aggregated counterparts.
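The sketch below shows why the most detailed level is the most reusable: suburb-level records can be rolled up to any higher level, but a provincial average cannot be split back into suburbs. All place names, field names and figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical suburb-level records (the most detailed level published).
records = [
    {"province": "A", "municipality": "A1", "suburb": "A1-x", "households": 100, "total_income": 500000},
    {"province": "A", "municipality": "A1", "suburb": "A1-y", "households": 300, "total_income": 900000},
    {"province": "A", "municipality": "A2", "suburb": "A2-x", "households": 100, "total_income": 200000},
]

def average_income(rows, level):
    """Average household income at the chosen level, weighted by household count."""
    totals = defaultdict(lambda: {"households": 0, "total_income": 0})
    for row in rows:
        group = totals[row[level]]
        group["households"] += row["households"]
        group["total_income"] += row["total_income"]
    return {key: t["total_income"] / t["households"] for key, t in totals.items()}

by_municipality = average_income(records, "municipality")
by_province = average_income(records, "province")
print(by_municipality, by_province)
```

Note that the averages are computed from totals and household counts. Averaging the suburb averages directly would give the wrong provincial figure whenever suburbs differ in size.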

Good data is accurate

Sometimes accuracy and precision can be lost as files are converted for publication. Data producers and curators should watch for these common issues.

Here are a few best practices to follow to make sure your data is accurate:

  • Use the highest appropriate level of decimal point precision
  • Include 4-digit years and times, if relevant
  • Use UTF-8/Unicode to avoid character loss

Below you can view a table comparing imprecise and precise data points:

Data type        Imprecise    Precise
Timestamps       94-03-12     1994-03-12
Text encoding    S□o Paulo    São Paulo
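Both problems in the table can be reproduced in a few lines. This is only an illustrative sketch: it shows how a two-digit year forces software to guess the century, and how saving text in an encoding that cannot represent a character loses it.

```python
from datetime import datetime

# Two-digit years are ambiguous: Python's %y maps 69-99 to 1969-1999
# and 00-68 to 2000-2068, which may not be what the publisher meant.
year_from_94 = datetime.strptime("94-03-12", "%y-%m-%d").year
year_from_30 = datetime.strptime("30-03-12", "%y-%m-%d").year

# A four-digit year leaves nothing to guess.
year_explicit = datetime.strptime("1994-03-12", "%Y-%m-%d").year

# UTF-8 keeps every character; ASCII cannot represent the a-with-tilde.
name = "S\u00e3o Paulo"
utf8_round_trip = name.encode("utf-8").decode("utf-8")
ascii_lossy = name.encode("ascii", errors="replace").decode("ascii")

print(year_from_94, year_from_30, year_explicit)
print(utf8_round_trip, ascii_lossy)
```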

Good data is standardised

Standard formats exist for many kinds of data variables to ensure legibility, compatibility and efficient data processing. Adhering to standards is the responsibility of data producers, with guidance from data curators.

Here are some rules and standards for common types of data:

  • Dates: Years should include four digits, while months and days should include two digits each. Values should be ordered from largest to smallest (following ISO 8601 principles).
  • Coordinates: Latitude and longitude should be in signed degree format, not compass directions or Degree/Minute/Second.
  • Numbers: Use full decimal values, not scientific notation. Measurement units should be placed in separate variables or columns, or as metadata, and NOT included with values.
  • Undefined or Missing Values: Use an empty field. Values such as 0 or -1 are prone to misinterpretation.

Below is a comparison between non-standard and standard variables in datasets:

Data type                      Non-standard                 Standard
Dates                          MARCH 20, 2009; 3/20/2009    2009-03-20
Coordinates                    41°25’01″N, 120°58’57″W      41.41694, -120.9825
Numbers                        3.50E+04; 35000 ML           35000
Undefined or missing values    0; -1                        (empty field)
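The conversions in the table can be sketched in Python as follows. The function names are hypothetical, and the code makes two assumptions that should be checked against your own data: slash-separated dates are month-first, and month names are in English.

```python
from datetime import datetime
from decimal import Decimal
import re

def to_iso_date(text, formats=("%B %d, %Y", "%m/%d/%Y")):
    """Try each known input format and return an ISO 8601 date (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {text!r}")

DMS_PATTERN = re.compile(r"(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*([NSEW])", re.IGNORECASE)

def dms_to_signed_degrees(text):
    """Convert Degree/Minute/Second with a compass letter to signed decimal degrees."""
    match = DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative in signed degree format.
    if hemisphere.upper() in ("S", "W"):
        value = -value
    return round(value, 5)

def to_plain_number(text):
    """Expand scientific notation into a full decimal string."""
    return format(Decimal(text), "f")

print(to_iso_date("MARCH 20, 2009"), to_iso_date("3/20/2009"))
print(dms_to_signed_degrees("41°25’01″N"), dms_to_signed_degrees("120°58’57″W"))
print(to_plain_number("3.50E+04"))
```

A date such as 3/4/2009 cannot be interpreted reliably without knowing the source convention, which is one reason to publish in ISO 8601 order from the start.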

Organisation and government standards

Many types of data have established standards that specify which types of information to include and how to represent them.

It is worthwhile to do some research to see if there is a standard for your data, and adopt it if possible.

Here are a few examples of standards used in data published by governments and organisations:

Acronym    Standard name                                Description
OCDS       Open Contracting Data Standard               Public procurement and contracting
HXL        Humanitarian Exchange Language               Lightweight standard to improve information sharing in humanitarian crises
GTFS       General Transit Feed Specification           Common format for public transportation schedules and associated geographic information
IATI       International Aid Transparency Initiative    Publishing framework allowing comparisons of international aid data from different organisations

Good data protects confidentiality

Open data must always protect the privacy and confidentiality of any individuals described in a dataset. Confidentiality matters for businesses too. Protecting privacy is particularly important in household and business surveys.

Protecting confidentiality starts with knowing what sorts of information could potentially identify an individual or firm, either alone or in combination. These identifiers can either be direct or indirect.

People can be identified in lots of ways. Names, email addresses, identity numbers or cellphone numbers can all be used to link open data to individuals. But knowing a few seemingly anonymous details can also reveal identities: a household may be the only one on a street with a certain number of children, for example.

We call these “indirect identifiers”.

Direct versus indirect identifiers

Direct identifiers are bits of information that uniquely identify specific individuals or firms. Examples include:

  • Names
  • Addresses
  • Phone numbers
  • Role title and employer
  • Tax or health registration number
  • Passport or national ID number
  • Rare health conditions

Indirect identifiers delineate groups, for example based on age, race or income. While a single indirect identifier doesn’t immediately identify an individual, a combination of several in one dataset could lead to individuals being identified. Examples include:

  • Geo-locators (e.g. map data)
  • Unusual education
  • Unusual occupation
  • Unusual physical characteristics
  • Dates (e.g. birth, admission, discharge)
  • Salaries or revenues

The process of using indirect identifiers to reveal the identities of people included in a dataset is called “de-anonymisation”. When the city of New York released open data about taxi journeys, for example, it did so because it believed researchers would be able to use this information to improve city planning.

Although passenger details were removed, it was still possible to identify individual fares based on where they regularly stopped.
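One way to look for this risk before publishing is to count how many records share each combination of indirect identifiers. The sketch below flags combinations held by fewer than k records, an idea usually referred to as k-anonymity. It is not a method prescribed by this lesson, and the function name, field names and data are invented for illustration.

```python
from collections import Counter

def risky_records(rows, indirect_identifiers, k=2):
    """Return rows whose combination of indirect identifiers is shared by fewer than k rows."""
    def combination(row):
        return tuple(row[field] for field in indirect_identifiers)

    counts = Counter(combination(row) for row in rows)
    return [row for row in rows if counts[combination(row)] < k]

# Hypothetical household records with no names attached.
households = [
    {"street": "Acacia Road", "children": 2},
    {"street": "Acacia Road", "children": 2},
    {"street": "Acacia Road", "children": 5},
    {"street": "Baobab Lane", "children": 1},
    {"street": "Baobab Lane", "children": 1},
]

flagged = risky_records(households, ["street", "children"], k=2)
print(flagged)
```

Here the only household on Acacia Road with five children is flagged, mirroring the example given earlier of a household that stands out on its street.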

Protecting confidentiality in datasets

It is the responsibility of data providers to ensure that appropriate measures are taken to protect confidentiality. The process of data anonymisation is highly specialised, and consists of many different techniques.

Data anonymisation techniques include:

  • Aggregation
  • Suppression (i.e. simply removing the data)
  • Top-coding
  • Generalisation
  • Sampling

Data aggregation is one common way to protect confidentiality. For instance, when data are aggregated into zones or regions, many individual characteristics are replaced with averages.

Even then, small areas or extreme outliers can make it possible to identify individuals.
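To make some of these techniques concrete, here is a simplified sketch of suppression, top-coding, generalisation and aggregation with a minimum group size. The column names, the salary cap and the minimum group size are all hypothetical choices made for this example; real anonymisation needs specialist review and is not reducible to a few lines of code.

```python
from collections import defaultdict

DIRECT_IDENTIFIERS = {"name", "phone"}  # hypothetical column names
SALARY_CAP = 100_000                    # hypothetical top-coding threshold
MIN_GROUP_SIZE = 3                      # hypothetical minimum records per published average

def anonymise(row):
    """Apply suppression, top-coding and generalisation to one record."""
    # Suppression: remove direct identifiers entirely.
    safe = {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
    # Top-coding: cap extreme values so outliers do not stand out.
    safe["salary"] = min(safe["salary"], SALARY_CAP)
    # Generalisation: replace exact age with a ten-year band.
    lower = (safe.pop("age") // 10) * 10
    safe["age_band"] = f"{lower}-{lower + 9}"
    return safe

def average_salary_by_region(rows):
    """Aggregate to regional averages, withholding regions with too few records."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["region"]].append(row["salary"])
    return {
        region: (sum(values) / len(values) if len(values) >= MIN_GROUP_SIZE else None)
        for region, values in groups.items()
    }

person = {"name": "A. Person", "phone": "000", "region": "North", "age": 37, "salary": 250000}
print(anonymise(person))

sample = [
    {"region": "North", "salary": 10}, {"region": "North", "salary": 20},
    {"region": "North", "salary": 30}, {"region": "South", "salary": 40},
]
print(average_salary_by_region(sample))
```

Withholding the average for a region with a single record addresses the small-area problem noted above, since an average over one record is simply that record's value.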