Evaluating data and metadata

Is this the right dataset?

This lesson will explain the features you should look for when evaluating data to work out whether or not it is right for your purpose.

As discussed in the previous lesson, if you wish to republish data, the first thing to check is that it is legally open, meaning that you can use it with few restrictions and at no cost.

Republishing copyrighted data without permission may result in a fine or even a prison sentence. Journalists may be able to use a public interest defence for publishing restricted material (such as evidence of corruption in protected data that was acquired as part of a leak), but for many applications or stories this defence will not apply.

In this lesson, you will learn:

  • The importance of legally and technically open data
  • How to read metadata
  • How to check that data is timely and accurate

Technically open data

As well as being correctly licensed and legally open, open data should also be technically open, meaning that you can access, download and use it easily. A well-designed open data catalogue should meet these criteria already.

Data should also be relevant to your purpose. Topical relevance is obviously important, but so are geography and time. For instance, you may need data for a particular country, or data that is aggregated at a national or local level. You may also need data for a particular time period. If you are interested in the present state of affairs, you want the most recent data you can find.

Data should also be complete, with as few gaps as possible. For data that is time-sensitive, or which changes frequently, you should check when it was last updated.

Finally, you want data that is authoritative, meaning it is published by recognised and respected individuals or organisations.

Finding metadata

Metadata in the description of the Global Mines Action Database

Metadata should be published as part of the dataset and is usually available in the overview section of the dataset’s entry in the data catalogue.

It may also appear as a separate sheet or in the header of a CSV file, along with definitions of the variables included in the dataset.

Interpreting metadata

How do you know if data is relevant, timely and authoritative? Look for metadata.

Metadata is data that describes a dataset. It is crucial for determining which datasets will be most useful to you.

Here are some of the most common metadata fields you are likely to encounter.

Title

A name for the data. The title tells you something about what the dataset contains and where it comes from.

Description

A description or abstract with sufficient detail so that you can quickly understand whether the data is relevant to you.

Licence

Datasets on an open data catalogue often share the same licence. In some cases, the catalogue will include a mix of licences.

Publisher

The publisher or organisation field tells you where the dataset originated and who is responsible for maintaining it. For some data catalogues, the publisher is the same for all datasets. The publisher is used in citations and helps establish the data’s credibility.

Contact information

The name and email address of the dataset’s publisher. Contact information is important if you have questions or other metadata is incomplete.

Checking whether data is timely

There are a few metadata categories you should consider when deciding whether the data you are looking at is up-to-date and relevant for your purpose.

Data that hasn’t been updated in years, or data that covers a period too far in the past, will most likely not be relevant to you, unless you are using the data to look at a historical period or to measure a trend over a long period of time.

Here is some of the metadata you should check…

Frequency

The interval at which the dataset is updated. Frequency tells you when you should check for updates. If more time has passed since the modification date than the stated update interval, the dataset may be out of date.

Modification date

When the dataset was last updated. Depending on your work, older datasets may not be as relevant. If a dataset is regularly updated, any work based on it may need to change in the future: indicators published in the World Bank’s data catalogue, for example, are often revised for previous years based on new evidence.

Temporal coverage/range

The range of time for which data was collected in this dataset. This is often different from the date the data was last updated.
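The three checks above can be sketched in a few lines of Python. This is only an illustration: the frequency labels, the day counts and the sample dates are assumptions made for the example, and each data catalogue uses its own vocabulary for these fields.

```python
from datetime import date, timedelta

# Assumed mapping from a frequency label to the longest expected gap
# between updates. Labels and day counts are illustrative only.
UPDATE_INTERVALS = {
    "daily": timedelta(days=1),
    "monthly": timedelta(days=31),
    "quarterly": timedelta(days=92),
    "annual": timedelta(days=366),
}


def is_stale(last_modified: date, frequency: str, today: date) -> bool:
    """True if more time has passed since the last update than the stated interval."""
    return today - last_modified > UPDATE_INTERVALS[frequency]


def covers_period(coverage_start: int, coverage_end: int,
                  first_year: int, last_year: int) -> bool:
    """True if the temporal coverage includes every year you need."""
    return coverage_start <= first_year and last_year <= coverage_end


# Hypothetical values, not taken from a real catalogue entry.
stale = is_stale(date(2016, 3, 1), "annual", today=date(2018, 6, 1))
covered = covers_period(1990, 2015, first_year=2000, last_year=2010)
print(stale, covered)
```

Here the dataset would be flagged as possibly out of date, because more than a year has passed since its last update, even though its temporal coverage includes the years of interest.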

How to use metadata: An example

Let’s take a research example to learn how to use metadata to evaluate datasets.

Say you want to learn about trends in maternal mortality in Uganda. You could start with a search engine query, as described in Lesson 1. But for this tutorial, let’s visit the World Bank’s open data website, which provides data on many economic and human development issues.

We’ll start by searching for “maternal mortality Uganda”. The first thing you notice is that there are two data indicators available that look very similar. One is described as a “modelled estimate” and the other as a “national estimate”. You will need to look at each one individually.

Maternal mortality ratio, modelled estimate, Uganda

The first search result shows a chart of maternal mortality in Uganda, described as a “modelled estimate”.

You can also see that the data is licensed as “open data” under a CC BY 4.0 licence, and that it originates from multiple sources (WHO, UNICEF, UNFPA, World Bank Group, United Nations Population Division).

You can also look at individual data points on the chart by moving the mouse.

This data looks pretty good. There are continuous annual data points showing a steady decline in maternal mortality, from about 687 deaths per 100 000 live births in 1990 to 343 deaths per 100 000 in 2015.
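To put a number on that decline, you can calculate the percentage change between the two end points. The short sketch below takes the 1990 figure as 687 and the 2015 figure as 343, as read from the chart; the function name is just a label chosen for this example.

```python
def percent_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative means a decline)."""
    return (new - old) / old * 100


mmr_1990 = 687  # deaths per 100 000 live births, modelled estimate
mmr_2015 = 343  # deaths per 100 000 live births, modelled estimate

change = percent_change(mmr_1990, mmr_2015)
print(f"{change:.1f}%")  # -50.1%
```

In other words, the modelled estimate suggests that the maternal mortality ratio roughly halved over the 25 years.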

Selecting the “Details” button gives more extensive metadata, including information on methodology, limitations and other factors.

Modelled estimate versus the national estimate

From this information, you learn that the data is estimated from a regression model, using data such as GDP, overall female death rates, fertility and other variables. A regression model is a statistical tool for building data estimates using multiple independent variables.

Now let’s go back and look at the other option. The second search result also shows a chart of maternal mortality in Uganda, this time described as a “national estimate.” This data is also licensed as “open data,” but originates from only one source – UNICEF.

This time, there are only four data points between 1995 and 2012. This data also shows a decline in maternal mortality. But with only a few data points, the decline is more difficult to see.

How does the “national estimate” differ from the “modelled estimate”?

Again, the Details button gives you information on this data’s methodology. You learn that this data is obtained from household surveys, which are often only available for certain years. This explains why less data is available and why the numbers vary from the “modelled estimate.”

Which data is preferable depends on how you intend to use it.

In this case, the metadata indicates that both data series are credible, although both have important limitations. If you want the mortality rate in a given year, you could use either series, provided it contains that year. If you want to compare mortality between different years or countries, the “modelled estimate” would probably work better since it has fewer data gaps.
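One simple way to compare gaps is to list the years in your period of interest for which a series has no value. The sketch below assumes the modelled series has a value for every year from 1990 to 2015, as the chart suggests. The four survey years used for the national estimate are placeholders chosen for illustration, not the actual years in the World Bank data.

```python
from typing import Iterable, List


def missing_years(years_with_data: Iterable[int],
                  first_year: int, last_year: int) -> List[int]:
    """Years in [first_year, last_year] for which the series has no value."""
    available = set(years_with_data)
    return [y for y in range(first_year, last_year + 1) if y not in available]


modelled_years = range(1990, 2016)          # one value per year, 1990 to 2015
national_years = [1995, 2001, 2006, 2011]   # placeholder survey years

modelled_gaps = missing_years(modelled_years, 1995, 2012)
national_gaps = missing_years(national_years, 1995, 2012)
print(len(modelled_gaps), len(national_gaps))  # 0 14
```

With these placeholder years, the national estimate has no value for 14 of the 18 years between 1995 and 2012, which is why it is harder to use for year-on-year comparisons.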

Source: World Bank Open Data
