Topic

Cleaning, filtering and sorting data

Topic Progress:

Click on the numbered tabs to navigate this lesson.

This lesson has been adapted from materials prepared for the Sudan Evidence Base Programme, created by Eva Constantaras with support from the World Bank and the UK Department for International Development (DFID).

Spreadsheets are the single most useful tool you have for working with data. Not only are they useful for reading and analysing information on your desktop, they can be used to produce simple charts and graphics as either images or embedded content in a story with interactive elements.

For the purposes of this lesson, we’re going to be using the Google Sheets application. It’s free to use so long as you have a Google (ie Gmail) account, and runs inside a web browser. Any changes you make are automatically saved on Google’s servers and can be accessed from any networked PC or mobile device.

Other spreadsheet software with similar features that can be used offline include Microsoft Excel (paid for) and Libre Office Calc (free).

This lesson follows on from Stories in numbers: Sums and rates  and Calculating percentages. It is highly recommended that you read through those lessons first.

In this lesson, you will learn:

  • How to find relevant data from the World Bank website
  • How to import this data into a spreadsheet
  • How to carry out basic data cleaning
  • How to filter data
  • How to sort data

Getting data

In this example, you will conduct a regional comparison to investigate trends in under-five mortality using data from the World Bank.

You will first find and download the appropriate data to analyse, and then look for pattern on change in under-five mortality rate over time for Sudan and its neighbours. You will also look at individual countries and find out by how much have they reduced this rate over the decade.

To start with, let’s find the data that we want. Open up http://databank.worldbank.org in your browser, and type Health Nutrition and Population in the Explore databases search field.

The result should look like this:

Select the first option in the search results.

Selecting the data

Your table should look like this at the end of this slide.

In the screen that opens, look at the search area on the left. Use it to select the following countries:

  • Burundi
  • The Central African Republic
  • Chad
  • Congo, Dem. Rep.
  • Djibouti
  • Eritrea
  • Ethiopia
  • Kenya
  • Rwanda
  • South Sudan
  • Sudan
  • Tanzania
  • Uganda

Now, under Series, type mortality and press Enter. Now select the following indicators.

  • Maternal mortality ratio (modelled estimate, per 100,000 live births)
  • Mortality rate, infant (per 1,000 live births)
  • Mortality rate, neonatal (per 1,000 live births)
  • Mortality rate, under-5 (per 1,000)

Finally, under Year, select all years from 2006 to 2016.

Click Apply Changes.

It’s important to note that World Bank datasets are not set in stone. Numbers can be – and often are – revised on an annual basis. Always cite your source materials in data stories, and be aware that the source may change, which may impact your claims!

Importing data to a spreadsheet

Now you have the data in the World Bank viewer, you can look at it in different ways.

The buttons at the top of the page allow you to view it as a table or a variety of charts, and you can navigate between the different health indicators using the pull down menu which currently says Maternal mortality ratio (modeled estimate, per 100,000 live births).

We want to examine this data in a spreadsheet application, though, so click on the Download button and select CSV format.

Tip: CSV stands for Comma Separated Values. It’s a plaintext file in which datasets are represented as cells split up by commas rather than a grid. It’s hard to read with the naked eye, but is compatible with every spreadsheet software available.

What data have you downloaded?

In the previous step you chose to download a CSV file, but what you got was a compressed ZIP file called Data_Extract_From_Health_Nutrition_and_Population_Statistics.zip. If you open this file, you’ll see that it contains two separate CSV files. Both have a long series of characters followed by Definition and Source.csv and Data.csv.

For now, extract both these files to you desktop, and rename them to remove the long character string.

Now open up a tab in your browser for Google Sheets, and create a new Sheet using the plus button in the bottom right-hand corner. In this new Sheet, go to File>Import and select Upload. Point the file browser menu that opens at the Definition and Source.csv file we just extracted.

Now select the option to Replace spreadsheet. You should see a small table appear with some metadata for the statistics you were looking at on the World Bank pages. These tell you where the data for each indicator came from, which you’ll need later on when you are crediting your sources.

You’ll see that most of the data was originally calculated by www.childmortality.org. Click on the heading Untitled spreadsheet to rename this Sheet CMR Metadata.

The World Bank data portal packages metadata information in a separate spreadsheet.

Basic data cleaning (1)

Repeat the steps in the last slide for the CSV file data.csv. You should have a Sheet that looks something like this.

What do you notice about the columns with numbers in (E-N)?

Basic data cleaning (2)

Did you work out the answer to the last question? Some of the numbers in those columns are aligned to the left of the cell, and some of them aligned to the right. This is because when Sheets imported the data, it took at guess at the contents of each cell. Where there are only numbers present, the cell is formatted for numbers. But because of the decimal point, Sheets has been cautious and thinks these cells contain plain text.

The distinction is important, because the way in which cell contents are marked up will change any calculations you try to carry out.

Try this and see. In cell E55 type the formula =sum(E2:E53). The number 9341 should appear – but this is not the total of all the numbers in that column.

Now click on the column label (E) to highlight the whole column, and then click on the Format menu at the top of the screen. From the menu that appears, select Number>Number (you can see Automatic was ticked).

Basic data cleaning

It’s important to remember that this number is meaningless as we’re adding together different indicators. It’s just to show you how spreadsheets work!

Basic data cleaning (3)

As you can see from the previous slide, data in a spreadsheet has to be formatted correctly or you run the risk of drawing the wrong conclusions in your analysis. Before we work with data, it’s vitally important to double check that columns and rows are correct. In the Format menu, you’ll find lots of options for text, numbers, currencies and percentages.

In the third submenu, More formats, you can create custom formats which include the number of decimal places shown in a cell as well.

You can delete the contents of cell E55.

Now click on the header for column E to select all contents in that column. Hold down the shift key on your keyboard and click on the header for column O. Sheets should have selected all columns from E to O.

Using the steps outlined previously, change all the cells you’ve just selected to number format.

Filtering data (1)

Cleaning data can take a lot longer than that, and we’ll be covering other techniques in more advanced courses.

One thing that you should have noticed, however, is that all the data for all the indicators is included in the same table. This isn’t useful if we want to look compare specific indicators across countries.

If we want to look at the Mortality rate, under -5 for, example, we can filter out the other indicators.

At the top of the Sheets window, in the toolbar, you’ll see an icon that looks like an upside down triangle with a line below it. Like this:

This is the Filter button. If you click it, you should notice that the borders of your table turn green, and small arrows appear next to each column header. Like this.

Filtering data (2)

Using the filter function to hide rows.

Now click on the arrow that has appeared next to the title Series Name in column A. The menu that appears will give you some options for displaying or hiding data in that column. In the window at the bottom of the menu, you’ll see a list of all the values in the column.

Deselect all the values except for Mortality rate, under -5 (per 1,000).

Click OK, and you should be left with a table that only shows rows for that dataset. Note that the row numbers haven’t changed – the data that you can’t see is hidden, not deleted.

Note the change in row numbers.

Sorting data

Computers are very good at finding information and patterns from unsorted data, but humans are not quite so adept. For example, it’s not too hard from this table to see which country had the most problems with under-5s mortality in 2015, but imagine if there were a hundred countries in our spreadsheet. Or 190.

How would you read 190 rows of data and quickly find the best and worst performers in a column?

To do this, we need to sort the table into order, listing the countries by their “position” in the relevant column.

Say you want to arrange these countries by their under-5 mortality rankings from best to worst. Staying in the Filter view (see previous slide) you can click the arrow next to the column title 2015 [YR2015]. Now click Sort A->Z.

You’ll see the table rearranges itself completely, with the rows (countries) that have the lowest under-5 mortality rate at the top of the chart.

Exploring alternatives

As with most software, Sheets provides several different ways to perform the same task. If you haven’t filtered your data, you can sort columns simply by right-clicking on the letter at the top of the column, and selecting the option to Sort Sheet A->Z (or Z->A, if you prefer higher numbers at the top).

There are lots of resources online for mastering spreadsheets and their basic functions, including a Google Learning Centre dedicated to Sheets.

Don’t be afraid to ask for help, either. There’s always someone who knows more than you and will give you a hand understanding how to manipulate data, either in your newsroom, social media networks or via the Code for Africa Slack channel or your local Hacks/Hackers Africa Google Group.

For now, use your Sheet to answer these questions.

Remember that analysing data is only a part of the overall story.

What questions does the data you’ve seen here raise? Where might you go next to start building your story?