Exploring data reliability


Over 100 years ago, the American writer and essayist Mark Twain popularised a phrase that is as relevant today as it was in the nineteenth century.

“There are lies, damned lies, and statistics”

The public is often led astray by numbers used to make claims that the data do not fully support. Statistics may be deliberately manipulated to suggest they show something other than what was actually recorded. They may also suffer from omission: certain groups may have been left out of a population because their data was difficult to collect, for example.

Then there are statistics quoted by public figures that are simply untrue and have no basis in data, yet are picked up by others and repeated as truth.

Understanding metadata

In previous lessons, we have learned the importance of checking the metadata as part of our data processing pipeline. The best data will include a full description of what each indicator is showing and the methodology used to arrive at the numbers shown. In the example above, we can see the metadata for the World Bank dataset showing the number of traffic deaths per 100,000 people in Nigeria.

What we can quickly tell is that the numbers reported are estimates rather than raw figures. Why is this?

Census versus sampling

It’s very rare that we are able to collect or find data that is a 100% representation of the population we are investigating, and it’s important to understand the difference between census and sampled data.

Census data includes the entire population in its results. In a village of 100 people, we might be able to ask everyone in the village their employment status, and then calculate the local unemployment rate. This is census data – the same information has been collected from everyone covered.

At the national scale, however, it's simply not feasible to survey several million people at regular intervals to state the employment rate with 100% accuracy, so national statistics agencies use sampled data. They look at the employment status of a small number of people, and then extrapolate national figures from there.

Mathematical modelling is used to determine numbers for a whole population from a small sample, and it's important to be aware that two different models can produce two different results. In our road traffic data, for example, there is mention of a weighted average.

Weighted models apply different calculations to different parts of the sample. For example, we might sample 50 people in each of towns A and B and ask them about their employment status. But if there are twice as many people of working age in town A, we would give town A's results correspondingly more weight when calculating the estimated total number of employed people in the area.
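As a sketch of the idea (the towns, sample sizes, and population figures below are all hypothetical):

```python
# Hypothetical samples: 50 people surveyed in each town, but town A
# has twice as many people of working age as town B.
sample_a = {"employed": 30, "sampled": 50, "working_age_pop": 20_000}
sample_b = {"employed": 20, "sampled": 50, "working_age_pop": 10_000}

def weighted_employment_rate(samples):
    # Weight each town's sample rate by its working-age population
    # instead of averaging the two rates directly.
    total_pop = sum(s["working_age_pop"] for s in samples)
    return sum(
        (s["employed"] / s["sampled"]) * s["working_age_pop"]
        for s in samples
    ) / total_pop

naive = (30 / 50 + 20 / 50) / 2                            # 0.5, ignores town sizes
weighted = weighted_employment_rate([sample_a, sample_b])  # ~0.533
```

Because town A is bigger, its 60% sample rate pulls the weighted estimate above the naive average of the two rates.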


Missing data

We have to be careful even when working with data taken from a national census. Most countries hold a census every ten years or so, in which every household is asked the same set of questions so that statistics agencies can get a snapshot of the state of the nation to help steer policy.

But is everyone included in the census? Households with a large number of undocumented migrants might avoid replying to census requests for fear of drawing attention to themselves. If the census is carried out door-to-door, who counts the households in which everyone is at work when the survey team visits? How well covered are rural and hard-to-reach areas?

Even census data should come with metadata that includes the methodology used to fill in these gaps.

Random and non-random samples

The mathematics that underlies statistical sampling gives us a very good idea of how many people we need to survey, or observations we need to make, in order to produce a sample that is representative of the whole and can be used to create estimates.

This is an example of an online calculator designed to help researchers decide this sample size.
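Calculators like this typically rest on a standard formula such as Cochran's. A minimal sketch (the defaults below, 95% confidence and a ±5% margin of error, are common conventions rather than values taken from any particular calculator):

```python
import math

def sample_size(z=1.96, p=0.5, margin=0.05):
    # Cochran's formula for a large population:
    #   z      - z-score for the confidence level (1.96 ~ 95%)
    #   p      - expected proportion (0.5 is the most conservative choice)
    #   margin - acceptable margin of error (0.05 = +/-5 percentage points)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

sample_size()             # 385 people for +/-5% at 95% confidence
sample_size(margin=0.03)  # a tighter margin requires a larger sample
```

Note that the required sample size depends on the margin of error and confidence level, not (for large populations) on the population size itself.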

The gold standard for populations is that samples should be taken at random. If we were to question a sample of 2,000 people about access to amenities in Lagos, but 1,500 of those quizzed live in Makoko because that's where the surveyors are based, we will get very different results compared to a similar sample taken from Lekki or Victoria Island, two of the most expensive neighbourhoods in Lagos.

This would be a non-random sample, and it could lead us to misleading conclusions about the overall population.

A random sample could only be created by taking a list containing the names of everyone who lives in Lagos and selecting 2,000 at random, regardless of where they live. This would give us more confidence that our sample is representative.
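In code, drawing such a sample amounts to giving every entry on the list an equal chance of selection. A sketch using Python's standard library (the population size here is purely illustrative):

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Stand-in for a register of residents: ID numbers 0..N-1.
N = 15_000_000
sample_ids = random.sample(range(N), k=2000)  # every resident equally likely

len(sample_ids)       # 2000 people drawn
len(set(sample_ids))  # 2000 -- sampled without replacement, no duplicates
```

Where surveyors are based, which households are home during the day, and who answers the door play no role in the selection, which is exactly the point.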

As Professor Alberto Cairo explains in his book "How Charts Lie: Getting Smarter about Visual Information" (W. W. Norton & Company, 2019), "If random sampling is rigorously conducted, it's likely that the average you compute based on it will be close to the average of the population the sample was drawn from."

What else can the metadata show us?

Metadata can also answer other important questions that we need to ask of our dataset before we can accept it as reliable. Who published the data, for example, and why?

Has the data publisher released this data as part of a political campaign, or because it creates a particular impression that they want the public to form? Is there a critical set of observations that has been left out?

As journalists, we are taught that every story should have multiple sources, and that we should use our own investigative skills to verify what a source tells us. We should treat data the same way we would treat any other source.

And the golden rule is that if we aren't absolutely confident in our ability to verify the reliability of data, we can always ask an expert who is. As you work with data, it's useful to cultivate relationships with professionals and academics who you can call on to help with this aspect of your work.

Missing values

[Source: Nepal Data Literacy Project]

One important question when assessing data reliability is the treatment of missing values. Consider the table above, which shows the results of recording the IQs of a group of people. What does a missing value mean? Has an IQ test not been completed, or has its result not been recorded? If we exclude the people without a recorded IQ score, how does that affect our sample population? Would it still be possible to calculate a meaningful average IQ, for example?

What about in the next table, where results have clearly been omitted from a particular age group?

In many datasets, there may be confusion between missing values and zero. In the employment survey discussed earlier, what does a missing value in a household's entry mean? Does it mean no one in that household has a job, in which case the value should be zero, or that everyone was out working when the survey took place?

How we treat missing values could give us a very different interpretation of the total number of jobless people in the population.
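The difference is easy to see with a toy example (the job counts below are invented; None marks a household with no recorded answer):

```python
# Jobs held per household; None = household did not answer the survey.
jobs = [2, 1, None, 0, 3, None, 1]

# Treating every missing value as zero pulls the average down.
as_zero = sum(j if j is not None else 0 for j in jobs) / len(jobs)  # 1.0

# Excluding non-responding households averages over respondents only.
answered = [j for j in jobs if j is not None]
excluded = sum(answered) / len(answered)                            # 1.4
```

The same raw figures yield two different averages, so the metadata should always state which convention was used.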