Is this the right dataset?
This lesson will explain the features you should look for when evaluating data to work out whether or not it is right for your purpose.
As discussed in the previous lesson, if you wish to republish data, the first thing to check is that it is legally open, meaning that you can use it with few restrictions and at no cost.
Republishing data that is subject to copyright may result in a fine or even a prison sentence. Journalists may be able to mount a public interest defence for publishing restricted materials (such as evidence of corruption found in protected data acquired as part of a leak), but this defence will not apply to many applications or stories.
In this lesson, you will learn:
- The importance of legally and technically open data
- How to read metadata
- How to check that data is timely and accurate
Technically open data
As well as being correctly licensed and legally open, open data should also be technically open, meaning that you can access, download and use it easily. A well-designed open data catalogue should meet these criteria already.
Data should also be relevant to your purpose. Topical relevance is obviously important, but so are geography and time. For instance, you may need data for a particular country, or data aggregated at a national or local level. You may also be interested in a particular time period: if you want the present state of affairs, look for the most recent data you can find.
Data should also be complete, with as few gaps as possible. For data that is time-sensitive, or which changes frequently, you should check when it was last updated.
Finally, you want data that is authoritative, meaning it is published by recognised and respected individuals or organisations.
Metadata is the set of data that describes the data, and it is crucial for determining which datasets will be most useful.
It should be published as part of the dataset and is available in the overview section of the data catalogue.
It may also appear as a separate sheet or at the header of a CSV file, along with definitions of the variables included in the dataset.
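Metadata embedded at the top of a CSV file can be pulled out programmatically before reading the data itself. A minimal sketch in Python, assuming a hypothetical file layout in which commented lines carry the metadata above the column header:

```python
import csv
import io

# Hypothetical CSV: metadata rows (prefixed with "#") come first,
# followed by the column header and the data itself.
raw = """\
# Title: Maternal mortality ratio
# Publisher: Example Statistics Office
# Licence: CC BY-4.0
year,deaths_per_100k
1990,687
2015,343
"""

metadata = {}
data_lines = []
for line in io.StringIO(raw):
    if line.startswith("#"):
        # Metadata row: split "Key: value" into a dictionary entry.
        key, _, value = line.lstrip("# ").partition(":")
        metadata[key.strip()] = value.strip()
    else:
        data_lines.append(line)

rows = list(csv.DictReader(data_lines))
print(metadata["Licence"])   # the licence recorded in the header
print(rows[0]["year"])       # the first data row
```

The same pattern works for any catalogue that ships metadata as comment lines; only the prefix and the key separator would change.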
Here are some of the most common metadata fields you are likely to encounter.
- Title: A name for the data. The title tells you something about what the dataset contains and where it comes from.
- Description: An abstract with sufficient detail so that you can quickly understand whether the data is relevant to you.
- Licence: Datasets on an open data catalogue often share the same licence. In some cases, the catalogue will include a mix of licences.
- Publisher: The publisher or organisation field tells you where the dataset originated and who is responsible for maintaining it. For some data catalogues, the publisher is the same for all datasets. The publisher is used in citations and helps establish the data’s credibility.
- Contact: The name and email address of the dataset’s publisher. Contact information is important if you have questions or if other metadata is incomplete.
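When triaging many datasets at once, these checks can be automated. A minimal sketch, using a made-up catalogue record and field names (real catalogues name and structure these fields differently):

```python
# Fields we expect a well-documented dataset to carry (illustrative names).
REQUIRED_FIELDS = ["title", "description", "licence", "publisher", "contact"]

# A hypothetical catalogue record, deliberately missing its contact field.
record = {
    "title": "Maternal mortality ratio",
    "description": "Deaths per 100,000 live births.",
    "licence": "CC BY-4.0",
    "publisher": "Example Statistics Office",
}

# Collect any field that is absent or empty.
missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
if missing:
    print("Incomplete metadata, missing:", ", ".join(missing))
```

A record that fails this check is not necessarily unusable, but incomplete metadata is exactly the situation where the contact field matters most.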
Checking whether data is timely
There are a few metadata categories you should consider when deciding whether the data you are looking at is up-to-date and relevant for your purpose.
Data that hasn’t been updated in years, or data that covers a period too far in the past, will most likely not be relevant to you, unless you are using the data to look at a historical period or to measure a trend over a long period of time.
Here is some of the metadata you should check…
- Frequency: The interval at which the dataset is updated. Frequency tells you when to check for updates. If the time since the dataset was last modified exceeds the update frequency, the dataset may be out of date.
- Last updated: When the dataset was last updated. Depending on your work, older datasets may not be as relevant. If a dataset is regularly updated, any work based on it is liable to change in the future: indicators published in The World Bank’s data catalogue, for example, are often revised for previous years based on new evidence.
- Time period: The range of time over which the data in the dataset was collected. This is often different from when the data was last updated.
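The frequency rule above (flag a dataset when the time since its last modification exceeds its stated update frequency) can be sketched in a few lines. The dates, frequency labels and day counts here are illustrative, not taken from any real catalogue:

```python
from datetime import date, timedelta

# Illustrative mapping from an update-frequency label to a maximum
# acceptable age in days (with some slack for month/year length).
FREQUENCIES = {"daily": 1, "monthly": 31, "annual": 366}

def possibly_stale(last_modified: date, frequency: str, today: date) -> bool:
    """Flag the dataset if it is older than its stated update frequency."""
    return today - last_modified > timedelta(days=FREQUENCIES[frequency])

# An "annual" dataset last touched in early 2020 is overdue by mid-2022.
print(possibly_stale(date(2020, 1, 15), "annual", date(2022, 6, 1)))   # True
# A "monthly" dataset modified twelve days ago is still current.
print(possibly_stale(date(2022, 5, 20), "monthly", date(2022, 6, 1)))  # False
```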
How to use metadata: An example
Let’s take a research example to learn how to use metadata to evaluate datasets.
Say we want to learn about trends in maternal mortality in Uganda. You could start with a search engine query, as described in Lesson 1. But for this tutorial, let’s visit the World Bank’s open data website, which provides data for many economic and human development issues.
We’ll start by searching for “maternal mortality Uganda”. The first thing you notice is that there are two data indicators available that look very similar. One is described as a “modelled estimate” and the other as a “national estimate”. You will need to look at each one individually.
The first search result shows a chart of maternal mortality in Uganda, described as a “modelled estimate”.
You can also see that the data is licensed as “open data” under a CC BY-4.0 licence, and that it originates from multiple sources (WHO, UNICEF, UNFPA, World Bank Group, United Nations Population Division).
You can also look at individual data points on the chart by moving the mouse.
This data looks pretty good. There are continuous annual data points showing a steady decline in maternal mortality from just over 687 deaths per 100 000 live births in 1990 to 343 deaths per 100 000 in 2015.
Selecting the “Details” button gives more extensive metadata, including information on methodology, limitations and other factors.
Modelled estimate versus the national estimate
From this information, you learn that the data is estimated from a regression model, using data such as GDP, overall female death rates, fertility and other variables. A regression model is a statistical tool for building data estimates using multiple independent variables.
Now let’s go back and look at the other option. The second search result also shows a chart of maternal mortality in Uganda, this time described as a “national estimate.” This data is also licensed as “open data,” but originates from only one source – UNICEF.
This time, there are only four data points between 1995 and 2012. This data also shows a decline in maternal mortality. But with only a few data points, the decline is more difficult to see.
How does “national estimate” differ from “modelled estimate?”
Again, the Details button gives you information on this data’s methodology. You learn that this data is obtained from household surveys, which are often only available for certain years. This explains why less data is available and why the numbers vary from the “modelled estimate.”
Which data is preferable depends on how you intend to use it.
In this case, the metadata indicate that both datasets are credible, although both have important limitations. If you want the mortality rate for a given year, you could use either series that includes that year. If you want to compare mortality across different years or countries, the “modelled estimate” would probably work better since it has fewer data gaps.
Source: World Bank Open Data
Test your knowledge
Evaluating data and metadata
Question 1 of 5
Which of the following are typical metadata fields? (Mark all that apply)
Question 2 of 5
Which of the following describes metadata?
Question 3 of 5
Which of the following metadata fields would tell you the most recent year included in a dataset?
Question 4 of 5
One factor that indicates a dataset’s credibility is…
Question 5 of 5
Which of the following definitions best describes the metadata category “Frequency”?