How to identify data to include in visualizations

When we move from analysing data to visualising it, we need to extract the data sample that will illustrate our insights from the larger dataset. We will then need to prepare this subsample ready for importing into our visualisation application.

What we are looking for is the subset of data that provides clarity. Often this will be based on a data summary and used to illustrate a comparison.

The visualisation above, for example, compares African countries by GDP in a clear manner (source). But think of all the elements and statistical tables that are used to calculate GDP for just one of those countries, and all the data that lies behind this simple chart. How much has been left out?

Could we create a single visualisation covering all 52 countries in Africa that also included net revenues for each nation’s agricultural, tourism, and manufacturing industries? Not if we wanted our readers to follow along.
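A minimal Python sketch of this kind of selection: reduce a wide table to one measure and keep only the few rows worth comparing. The country names and figures are placeholders invented for illustration, not real statistics, and `top_n` is a hypothetical helper rather than part of any library.

```python
# Placeholder records: names and figures are invented for illustration only.
records = [
    {"country": "Country A", "gdp_usd_bn": 120.0, "agriculture": 14.0, "tourism": 6.5},
    {"country": "Country B", "gdp_usd_bn": 45.5, "agriculture": 9.2, "tourism": 1.1},
    {"country": "Country C", "gdp_usd_bn": 310.2, "agriculture": 22.8, "tourism": 12.0},
    {"country": "Country D", "gdp_usd_bn": 80.1, "agriculture": 7.4, "tourism": 3.3},
    {"country": "Country E", "gdp_usd_bn": 15.3, "agriculture": 4.0, "tourism": 0.6},
]


def top_n(rows, measure, n):
    """Return the n rows with the largest value of `measure`, keeping only
    the country label and that single measure."""
    ranked = sorted(rows, key=lambda row: row[measure], reverse=True)
    return [{"country": row["country"], measure: row[measure]} for row in ranked[:n]]


# Keep three countries and one measure; everything else stays in the full dataset.
subset = top_n(records, "gdp_usd_bn", 3)
for row in subset:
    print(row["country"], row["gdp_usd_bn"])
```

The other columns are not deleted from the source data; they are simply left out of the subsample that goes to the visualisation tool.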

Consider these statements

Both of these statements say the same thing and are based on the same data. Which do you think is going to grab the reader’s attention and help them understand the insight we are trying to share?

Statement 1

“Our brains have the ability to process visuals a lot faster than text. It’s been reported that 70 percent of all our sensory receptors are in our eyes and that we can usually get the sense of a visual scene in less than 1/10 of a second. That’s a lot faster than how long it typically takes us to read and comprehend text-only information.”

Statement 2

“According to the University of Minnesota, visual data is processed by human brains 60,000 times faster than written data.” 


Compared with what?

As the statistician Edward Tufte says, “The fundamental task in data analysis is to make smart comparisons – we’re always trying to answer the question ‘compared with what?’… It always comes down to making and showing smart comparisons.”

In the chart above, readers can get an immediate impression not just of where coffee beans come from (i.e. the position on the map) but also of the comparative output of each of the top eight producing countries. The size of the dots shows just how dominant the top three are before you even read the numbers.

A good data narrative is layered

When journalists are taught how to produce a story for news production, they are taught the principle of the inverted pyramid of news. The most important details – the “who, when, what, how, and why” – are included as succinctly as possible in the first paragraph. 

The middle section of the story provides the critical context and evidence for those details, while the last part of the story offers extra information that isn’t critical to understanding but is useful for readers to know if they want to explore the subject further.

A data narrative is similarly layered. We pick out the visualisations that tell the most important parts of the story first, we offer context and critical extra information through annotations, and finally, we link back to the full data set so that readers can explore it for more information by themselves.

Selecting data

Consider the chart above, which shows the value of exports (in dollars) of precious metals from Ghana over a three-year period. The data is sourced from CEPII.

There are 11 lines on this visualisation, and some of them do not have complete data for the period because of changes in the way the data was collected. Why doesn’t this chart work as an effective visualisation?

For starters, the value of gold exports vastly outweighs all the others. The remaining metals are clustered into such a small group of lines at the bottom that we learn nothing about their relative value.

We also need to change the units we use on the y-axis. Wouldn’t it be easier to read if it was in billions of dollars rather than single greenbacks? (The 2016 value on the far left is $9.41bn, for example. Can you tell that from an axis labelled 1.00E+10?)
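The rescaling is simple arithmetic: divide by one billion and label the unit. The sketch below uses the $9.41bn figure quoted above, written out in single dollars; `to_billions` is a hypothetical helper name, not a function from any charting library.

```python
def to_billions(value_usd, decimals=2):
    """Format a raw dollar amount as a short label in billions."""
    return f"${value_usd / 1e9:.{decimals}f}bn"


# The 2016 gold export value quoted in the text, in single dollars.
gold_2016_usd = 9_410_000_000

print(f"{gold_2016_usd:.2E}")      # 9.41E+09, the style used on the original axis
print(to_billions(gold_2016_usd))  # $9.41bn
```

Most spreadsheet and charting tools offer an equivalent number format, so the same change can usually be made without touching the underlying data.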

A different viewpoint

Here is the data visualisation created by the OEC. Instead of trying to show the differing value of exports over time, it selects a single year’s worth of data and groups export types by colour in a treemap. There is much more data in this chart than in the one on the previous slide, but it is much easier to read.

Viewers can change the year they are looking at using a pull-down menu. There are thousands of data points to explore, and if readers want to see how certain commodity exports change over time, they can download the original data for themselves.
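As a rough sketch of the preparation behind a chart like this, the Python below filters a long table to one year and sums values per category, which is the shape a treemap expects. The rows and figures are invented for illustration and are not the OEC's data; `category_shares` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical rows: (year, category, product, export value in arbitrary units).
# Invented for illustration; these are not real export figures.
EXPORTS = [
    (2016, "Precious metals", "Gold", 800),
    (2016, "Precious metals", "Silver", 20),
    (2016, "Foodstuffs", "Cocoa beans", 150),
    (2016, "Mineral products", "Crude petroleum", 30),
    (2017, "Precious metals", "Gold", 900),
    (2017, "Foodstuffs", "Cocoa beans", 100),
]


def category_shares(rows, year):
    """Sum values per category for one year and return each category's
    share of that year's total, largest first."""
    totals = defaultdict(float)
    for row_year, category, _product, value in rows:
        if row_year == year:
            totals[category] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no data for that year
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {category: value / grand_total for category, value in ranked}


# Changing the year argument plays the role of the pull-down menu.
shares = category_shares(EXPORTS, 2016)
for category, share in shares.items():
    print(f"{category}: {share:.0%}")
```

Showing shares of a single year sidesteps the problem of one dominant line flattening all the others, because each tile is sized relative to the whole.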

Managing variables

A visualisation can generally carry up to five variables: a y-value, an x-value, a size value (on a scatterplot or treemap), a colour variable and, if you are animating over frames, a change in time as well.
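One way to keep track of which variable goes where is to write the mapping down before opening a charting tool. The sketch below is an assumption about how that might look in Python: the channel names follow the five listed above, while the column names and the `check_encoding` helper are hypothetical.

```python
# The five channels named in the text; "frame" stands for change over time
# in an animated chart.
CHANNELS = ("x", "y", "size", "colour", "frame")


def check_encoding(encoding, columns):
    """Return a list of problems with a channel -> column mapping.
    An empty list means every channel is known and every column exists."""
    problems = []
    for channel, column in encoding.items():
        if channel not in CHANNELS:
            problems.append(f"unknown channel: {channel}")
        if column not in columns:
            problems.append(f"missing column: {column}")
    return problems


# Hypothetical column names for an exports dataset.
columns = {"year", "product", "category", "value_usd"}
encoding = {"x": "year", "y": "value_usd", "colour": "category"}

print(check_encoding(encoding, columns))  # prints []
```

This example uses three of the five channels, which leaves room to add size or animation later if the story needs it.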

Remember, the more variables you include in a visualisation, the harder you will have to work to keep it navigable. Check our lessons on narrative to learn more about breaking complex data down into understandable chunks.

See this link for an animated version of the image above, and how the New York Times visualised the initial stages of the spread of coronavirus out of China.