When we work with data, we have to understand what is and isn’t shown in the data.
In data analysis, we usually start with a question or hypothesis to guide us – and it’s easy to jump to answers if the data starts to agree with what we expected to see.
It’s easy to fall into this trap of confirmation bias. If we expect to see a trend in the data, we may not question our findings closely enough. In a famous study conducted at Stanford University in 1979, two groups of students were chosen based on their attitude towards the effectiveness of capital punishment – whether they were for or against the death penalty. The two groups were shown the same research on the subject, and both strongly agreed that the evidence backed up their opinion.
We see this conscious and unconscious cherry-picking of data to support existing preconceptions all the time: whether it’s climate change or social policy, two people can see different things in the same data.
Confirmation bias is just one of many challenges in interpreting data objectively. Another is selection bias, where a sample is chosen for convenience: it fits the time or effort available but does not reflect the true population. Conclusions drawn from such a sample are then wrongly extrapolated to the whole population.
It is essential that we take these biases into account when considering claims made with data.
How data is gathered can change it
The methods used to gather data can also strongly affect the results. For example, a questionnaire conducted face-to-face is more likely to suffer from social desirability bias, where respondents give the answers they think will make them seem more acceptable or likeable.
This is especially true of questions on personal or sensitive topics, where most of us want to present ourselves in the most positive light.
Question order bias can also change responses: earlier questions shape how respondents think about later ones.
Question-wording bias is also important to watch out for when interpreting survey results.
If an audience is asked “Do you agree that welfare payments help people get back on their feet?” then the response may seem like an overwhelming endorsement of extending welfare payments. If the same audience is asked to agree or disagree with the statement “welfare pays people who don’t work”, then the interpretation of their answers may be very different (Statistics how-to).
To really understand a claim, we need to go back to the original data and metadata and see whether the claim stands up to scrutiny.
Gaps in the data
One critical challenge when supporting or refuting claims is to be aware of gaps in the data. For example, reporting on crimes of gender-based violence is notoriously difficult and numbers can be highly unreliable. A dataset may show the number of crimes reported to the police over a period, but say nothing about the level of underreporting – many sexual offences don’t get reported, for a number of reasons.
As described in this AfricaCheck article, a decline in the number of rapes reported may not indicate a fall in the actual number of crimes committed: conversely, a rise in crime numbers may indicate women feel safer coming forward and reporting an attacker.
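The reasoning above can be made concrete with a little arithmetic. The sketch below is illustrative only: the incident count and reporting rates are invented, and `reported_count` is a hypothetical helper, not something taken from the article. It assumes the actual number of incidents stays the same while the share that gets reported changes.

```python
def reported_count(actual_incidents, reporting_rate_percent):
    """Incidents that reach police records, given the share that is reported."""
    return actual_incidents * reporting_rate_percent // 100

# Hypothetical numbers for illustration only; not real crime statistics.
ACTUAL_INCIDENTS = 1000  # assumed constant across all three years

year_1 = reported_count(ACTUAL_INCIDENTS, 30)  # baseline reporting rate
year_2 = reported_count(ACTUAL_INCIDENTS, 25)  # fewer victims come forward
year_3 = reported_count(ACTUAL_INCIDENTS, 40)  # more victims come forward

print(year_1, year_2, year_3)
```

Under these assumptions the recorded figure falls and then rises even though the underlying number of incidents never changes, which is why a dataset of reported crimes cannot be read as a count of crimes committed.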
The case of the armoured aeroplane
Gaps in data may give us important clues to guide our analysis. A final type of bias is survivor bias: what assumptions are we making when members of our dataset are missing because they didn’t pass some sort of selection process?
US armed forces faced a dilemma during the Second World War: returning bomber planes were riddled with bullet holes, and they needed better ways to protect them.
The military knew it needed armour to protect its planes, but the question was where to put it.
When they plotted out the damage these planes were incurring, it was spread out, but largely concentrated around the tail, body and wings. Engineers were then tasked with armouring these areas that seemed to be most vulnerable to damage.
Abraham Wald, a statistician, made a striking observation: the military would be making a terrible mistake by upgrading the armour along these sections of the plane. Why? Because the military was only looking at the damage on returned planes. They hadn’t factored in damage on planes that didn’t return.
Planes that didn’t return were the ones that had sustained damage in places not seen on returned planes: their engines. Unlike the body, tail and wings, the engine was extremely vulnerable. Once hit there, planes went down, and they didn’t make it back home to have their damage charted.
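A small simulation can show why the returned planes were misleading. This is a minimal sketch under invented assumptions, not a reconstruction of Wald's analysis: hits are spread evenly across four sections, and the loss probabilities in `LOSS_PROBABILITY` are made up purely to make engine hits far more dangerous than the rest.

```python
import random

SECTIONS = ["engine", "body", "tail", "wings"]
# Hypothetical chance that a single hit to each section brings the plane down.
LOSS_PROBABILITY = {"engine": 0.8, "body": 0.1, "tail": 0.1, "wings": 0.1}

def simulate(num_planes=10_000, hits_per_plane=3, seed=0):
    """Count hits per section on all planes and on returned planes only."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    returned_hits = {s: 0 for s in SECTIONS}
    for _ in range(num_planes):
        # Every section is equally likely to be hit.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() > LOSS_PROBABILITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                returned_hits[s] += 1
    return all_hits, returned_hits

all_hits, returned_hits = simulate()
for s in SECTIONS:
    print(f"{s:>6}: all planes {all_hits[s]:>5}, returned planes {returned_hits[s]:>5}")
```

Across all planes the engine is hit about as often as any other section, but among the planes that return, engine hits are rare. Anyone who sees only the returned planes would wrongly conclude that the engine is the section least in need of armour.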
Filling in the gaps
[Nigeria Data Portal]
Often, we have to combine datasets to fill gaps in an analysis. For example, we might have a chart like this one, which shows the GDP of Nigeria by state. But what does this really tell us about the wealth of individual Nigerians?
We can conclude that the economy of Lagos State is larger than the economy of Delta State but not much more than that.
To compare the states meaningfully, we need to add another dataset to our table: the population of each state. From that, we can calculate the GDP per person in each state and see where, on average, Nigerians are better off.
To do this, we combine the table from the previous example with this one in a single spreadsheet file. We can merge them, as in the image above, using the =VLOOKUP function in column C to match state data in the two tables.
Now we can create another column, dividing GDP by population to get a per capita value. Sorting this data tells us that, on average, GDP per person is almost 50% higher in Delta State than it is in Lagos.
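The same merge-and-divide steps can be sketched outside a spreadsheet. The example below is a minimal illustration in Python, assuming two small tables keyed by state name. The state names and all figures are placeholders, not data from the Nigeria Data Portal, and `gdp_per_capita` is a hypothetical helper. The dictionary lookup plays the role of an exact-match =VLOOKUP.

```python
# Placeholder figures for illustration only; these are NOT real statistics.
gdp_by_state = {          # GDP in billions of US dollars (hypothetical)
    "State A": 80.0,
    "State B": 20.0,
    "State C": 12.0,      # no population row, so it cannot be matched
}
population_by_state = {   # population in millions of people (hypothetical)
    "State A": 20.0,
    "State B": 4.0,
    "State D": 3.0,       # no GDP row, so it never appears in the result
}

def gdp_per_capita(gdp, population):
    """Join the two tables on state name and divide GDP by population."""
    result = {}
    for state, gdp_value in gdp.items():
        pop = population.get(state)   # exact-match lookup, like VLOOKUP
        if pop is None or pop == 0:
            continue                  # skip states we cannot match
        result[state] = gdp_value * 1e9 / (pop * 1e6)
    return result

per_capita = gdp_per_capita(gdp_by_state, population_by_state)
# Sort from highest to lowest GDP per person.
ranked = sorted(per_capita.items(), key=lambda item: item[1], reverse=True)
for state, value in ranked:
    print(f"{state}: {value:,.0f} USD per person")
```

With these placeholder values, State A has the larger economy but State B comes out ahead per person, the same kind of reversal the lesson describes for Lagos and Delta. States missing from either table are dropped rather than guessed at, which is worth checking for after any merge.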