It’s always important to understand the limitations of any data set that you are working with. As we saw in the previous lesson, how data is gathered and sampled, and issues such as inconsistencies or missing values, greatly affect our ability to use that data to make a claim.
In this lesson, we will explore some of the techniques we can use to test claims made using data, and be wary of making the same mistakes ourselves.
Correlation and causation
One of the most critical lessons we need to learn is that correlation does not equal causation. Often, when we are looking at data, we want to understand how two variables interact with each other.
For example, we might want to look at the relationship between crime in different areas around the country and average incomes, to see if there is a pattern in the volume or types of crime depending on whether it is a rich or poor area. What we are looking for is potential cause and effect, such as the well-documented drop in family size that occurs in regions where more women finish schooling.
What we can say with confidence in both cases is that the variables correlate – lower-income areas tend to have higher crime rates, and the higher a woman’s educational level, the fewer children she is likely to have.
Despite this correlation, we cannot jump to a conclusion about which variable is causing the other to change. Causation is often more complex, as this article on female fertility shows.
Chocolate doesn’t make you smart
In 2012, the New England Journal of Medicine published this chart showing the correlation between countries that consume a lot of chocolate and countries that win lots of Nobel Prizes. Many misunderstood this to suggest that eating chocolate makes you clever, and news organisations around the world ran headlines that reinforced this belief.
But chocolate consumption also correlates strongly with wealth per capita, and wealth per capita correlates strongly with educational achievement. So which variable is actually the main cause of creating Nobel Prize winners?
Want to see some other spurious correlations? Check out this site.
The gold standard of science
To really attribute causation, scientists perform randomised controlled trials (RCT) to verify or refute a particular hypothesis.
An RCT involves splitting a whole population into groups based on random assignment, and then ensuring that one group – the control – continues as normal while a single variable that we want to observe is changed in the other.
For example, if we took the population of Switzerland and split it in two, then allowed one half to consume chocolate as normal while we starved the other half of cocoa-derived food, we could wait and see which group won the most Nobel Prizes over time. If the chocolate-eating group proved more successful, we could conclude that there is a causal link between chocolate and cleverness.
Randomised controlled trials are the gold standard of science and the only way to be sure of causal effects. Before new medications are licensed, drug companies must test them in an RCT, giving one group of patients the medicine and the control group a placebo. If the group on the new drug reports better health outcomes, it’s likely that the drug works.
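The mechanics of random assignment can be sketched in a few lines of Python. Everything here is invented for illustration – the participant IDs, the outcome model, and the assumed +5 treatment effect:

```python
import random
import statistics

random.seed(1)

# Hypothetical trial: 100 participants, randomly split into two groups.
ids = list(range(100))
random.shuffle(ids)
treatment, control = ids[:50], ids[50:]

def outcome(treated):
    # Simulated health score: assume the treatment adds a small real effect.
    return random.gauss(50 + (5 if treated else 0), 10)

t_scores = [outcome(True) for _ in treatment]
c_scores = [outcome(False) for _ in control]

# Because assignment was random, any systematic difference in means can be
# attributed to the treatment rather than to pre-existing differences.
diff = statistics.mean(t_scores) - statistics.mean(c_scores)
print(f"difference in means: {diff:.1f}")
```

The key step is the shuffle: randomisation ensures the two groups are alike on average in every respect except the variable being tested.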
When publishing data that is based on sampling, data producers should ideally include a confidence measure to help you decide how far you should trust the data in your own work.
A confidence measure, or confidence interval, is a statistical calculation that produces a range of values which, with a stated probability, is expected to contain the true value being estimated. The narrower the interval, the more precise the estimate.
Sometimes this is expressed as a percentage – and 95% confidence is a good minimum for using the data in your own work.
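A 95% confidence interval for a sample mean can be calculated with the standard library alone. This is a minimal sketch using the normal approximation; the income figures are invented, and for a sample this small a t-multiplier would give a slightly wider interval:

```python
import math
import statistics

# Hypothetical sample: estimated household incomes (thousands) from a survey
sample = [31.2, 28.4, 35.1, 29.9, 33.0, 27.8, 30.5, 34.2, 32.1, 29.3]

mean = statistics.mean(sample)
# Standard error of the mean: sample spread shrinks with sample size
sem = statistics.stdev(sample) / math.sqrt(len(sample))

# 95% confidence interval using the normal approximation (z = 1.96)
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

The interval, not the single point estimate, is what you should carry into a story: the true value could plausibly lie anywhere inside it.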
Other times, this may be expressed as a range of values. In the chart above, produced by the UN to show its prediction of global population growth from 2020 to 2100, you can see multiple lines. Most commentators focus on the solid red line which shows that population growth will slow, but continue upward to about 11 billion people by the end of the century.
But look carefully: the median line is created by modelling many different scenarios around health, education and economic well-being and their effect on family size over the next 80 years, and then taking an average of all those scenarios.
There are two lines on the chart which show the lower and upper limits within which the UN is confident its model is correct. By the end of the century, these enclose the total range between 9.5 billion and around 13.5 billion. The further away from the present day you travel, the wider the lines become – this indicates how difficult it is to predict even basic statistics far into the future.
The two blue lines show other models – specifically what happens if average family sizes rise or fall by just 0.5 children per woman over the course of 80 years in a way that is not predicted. The high and low estimates are between 7 billion and 16 billion!
That doesn’t mean we have to ignore the UN’s data – but we have to be cognisant of what it actually can and can’t tell us before incorporating it into a story, and not mislead readers with dramatic figures that may not mean what they think they mean.
Looking for inconsistencies
It’s easy to make mistakes when publishing data, and no one gets it right all the time, so we should always be on the lookout for errors that may have been made.
We can use the same tools that we use for cleaning and analysis to do this. For example, the table above shows the number of confirmed Covid-19 cases in Ghana at the provincial level. A national total is included at the bottom. To check for inconsistencies, we might create a new row with a total we calculate using the =SUM formula, just to check that the numbers have been added up correctly.
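The same check works outside a spreadsheet. This sketch uses invented provincial figures (not Ghana’s actual data) to show the idea: recompute the total yourself and compare it with the total the publisher reports:

```python
# Hypothetical provincial case counts -- illustrative figures only
provinces = {
    "Greater Accra": 10421,
    "Ashanti": 4102,
    "Western": 1530,
    "Eastern": 987,
    "Central": 1204,
}
reported_total = 18250  # the total printed at the bottom of the table

# Recalculate the total independently, like a =SUM check in a spreadsheet
calculated = sum(provinces.values())
if calculated != reported_total:
    print(f"Discrepancy: rows sum to {calculated}, "
          f"but the table reports {reported_total}")
```

A mismatch doesn’t always mean an error – rounding or suppressed small counts can explain gaps – but it always deserves a question to the data producer.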
When we see a large discrepancy in the data we should question whether or not it is really there. In this article, Dataphyte looks at national statistics relating to narcotics-related arrests in Nigeria, and questions whether or not it’s really possible that there was an increase from 10 000 offences to 621 000 in a single year.
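Expressing a jump like that as a percentage change makes its implausibility concrete. A minimal sketch, using the two figures from the Dataphyte story:

```python
# Year-over-year check on the arrest figures questioned by Dataphyte
arrests = {2017: 10_000, 2018: 621_000}

# Relative change: (new - old) / old
change = (arrests[2018] - arrests[2017]) / arrests[2017]
print(f"year-over-year change: {change:.0%}")
```

A six-thousand-percent increase in a single year is the kind of number that should send you back to the source before it goes anywhere near a headline.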
The sanity check
[image – population in Sub-Saharan Africa, source: World Bank]
When questioning the reliability of data, one of the most powerful tools we have is our own knowledge and experience. We can continually ask whether or not the numbers look or sound believable. If they seem outlandish, but we can verify them, we have a good story on our hands. All too often, however, it’s just too good to be true.
Take this example: in 2018, the South African commissioner of police claimed that there are 11 million undocumented migrants living in the country. This claim was used by the right-wing political party Freedom Front Plus in its election manifesto the next year.
But a simple sanity check should lead us to be highly sceptical of this number. After all, the official number of people living in South Africa is 57 million. Is it feasible that roughly one in every five people living there is an undocumented migrant?
Likewise, we know from other sources that most migrants to South Africa come from neighbouring Zimbabwe – itself a country of only 14 million people. Officially there are around 650,000 Zimbabweans living in South Africa. If that number were higher by a factor of ten, it might well be noticed.
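The arithmetic behind this sanity check takes only a few lines, using the figures quoted above:

```python
# Sanity check on the 11 million undocumented-migrants claim
population = 57_000_000        # official South African population
claimed_migrants = 11_000_000  # the police commissioner's figure
zim_in_sa = 650_000            # officially recorded Zimbabwean migrants

share = claimed_migrants / population
print(f"The claim implies {share:.0%} of all residents -- "
      f"roughly 1 in {population / claimed_migrants:.0f} people.")

# Compare with the largest documented migrant group
print(f"It is also about {claimed_migrants / zim_in_sa:.0f}x the number "
      f"of officially recorded Zimbabweans in the country.")
```

Neither comparison disproves the claim on its own, but together they show how far it sits from every other number we can verify.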
Use fact-checking sites
Undocumented migrants create a problem for data collectors since they are, by their nature, not easily counted. It’s still possible to use other data to estimate their numbers.
AfricaCheck put the South African police commissioner’s claims to the test and found that most estimates for the total number of documented and undocumented migrants in the country are between 1.9 million and just over 3 million people – a huge difference from the 11 million claim.
Fact-checking sites can be an important source of information for testing data reliability.