Topic

What is data cleaning?

Topic Progress:

Click on the numbered tabs to navigate this lesson.

An introduction to data cleaning

After data has been acquired, it often must be cleaned and validated prior to analysis, especially if it has been scraped. Even data from good open sources will usually need some attention.

Just a quick glance will tell you that this data is not “tidy”: some entries are blank or formatted differently to others in the same column.

“Cleaning” data is the process of finding and correcting glitches and other imperfections in your data. The goal in cleaning data is to make it as consistent as possible, and remove unnecessary details that might complicate your analysis later. Let’s look at some examples.

In this lesson, you will learn:

  • What it means to have clean and tidy data
  • Standard formats for numbers, dates and letters

Clean and tidy data

In this example from the Kenya Open Data Initiative, data is “tidy”. Each entry in a column has data, in the same format as the rest of the column.

When working with data, you should adhere to the principles of “tidy data”. Your objective when working with raw data is to tidy it into a structured dataset which can be used for analysis. The principles of tidy data are:

  • Each variable you measure should be in one column.
  • Each different observation of that variable should be in a different row.
  • There should be one table for each “kind” of variable.
  • If you have multiple tables, they should include a column in the table that allows them to be linked.

In practice, this might mean a table in which one column – for example – is the name of a product and the next is the price. Each row tells you the name and price of a product. The names should all be written with the same spelling conventions and capitalisations, and the prices should all be decimal numbers with no spaces or other characters.

Importantly, in this example, the currency symbol should be included in the column heading, not in the column values, so that you can add the values together or perform other calculations on them.

A human may be able to add “Sch2.12 + Sch3.45”, but a computer needs to see “2.12+3.45” schillings.

Letters or numbers?

The most basic step in data cleaning is to make sure that data is in the right format for your software to analyse. A spreadsheet application may recognise a column of dates or Kwacha amounts as times or currency if all are written in the same format, but often individual entries may be written in a variety of ways.

For example, 10th April is the same as 10/4 or 4/10/18 to human eyes, but a computer may need a specific format (dd/mm/yyyy) to be able to understand it.

INCORRECT NUMERIC FORMATS CORRECT NUMERIC FORMATS
6.8973e4 68973
54 43 621.25 5443621.25
25.10  C 25.1
50.17 km 50.17
12 miles 19.31

Numeric data sometimes includes extra characters or strange formats that can make it difficult to read or analyse.

In scientific notation, for example, 6.8973e4 is the same as 68973. Scientific notation should be converted to decimal form.

Spaces and other non-numeric characters should be removed, except the decimal character. Units of measurement, such as inches, kilometers, or degrees Celsius, should also be removed or moved to a separate column.

A spreadsheet won’t recognise 25.1 degrees C as a unit of temperature: you’ll need to convert this to 25.1 in a column marked Degrees Celsius to begin analysing it.

If your data includes mixed measurement systems, such as miles and kilometres, you may need to convert the values to the same unit of measurement.

Cleaning date fields

Dates and times should follow a uniform format. Although several date formats are in use, the most common is ISO 8601, in which date and time elements are ordered from largest to smallest. Years should always be written as four-digit numbers and all other elements written as two-digit numbers. Date elements are usually separated with hyphens, while time elements, if included for precision, are separated by colons, as shown here.

INCORRECT DATE/TIME FORMATS CORRECT DATE/TIME FORMATS
9/16/12 2012-09-16
5:22pm 17:22
3/22/2017, 10am 2017-03-22 10:00
Sep 3, 2017 4:15PM CAT 2017-09-03 16:15+02:00

If times are in a mix of different time zones, you may need to include the time zone offset in hours and minutes from Greenwich Mean Time/Coordinated Universal Time (UTC).

ISO 8601 Standard Date Format

5:22PM on March 25th, 2017 is written as 2017-03-25 17:22

5:22PM on March 25th, 2017 is written as 2017-03-25 17:22+08:00

Using a standard date and time format ensures that dates can be interpreted correctly.

Cleaning text values

Text fields can contain ordinary misspellings that can lead to inaccuracies, calculation errors, or sorting mistakes. Inconsistencies in capitalisation, spacing and hyphenation, such as with the names of cities or countries, can also be problematic.

PROBLEMATIC CLEANED
beijin Beijing
Mount vernon avenue Mt Vernon Ave
Johns on Johnson
SÃ¥o Tomé São Tomé
□□□□□□□ Россия

Pay careful attention to text that could contain international characters. Characters from different alphabets such as Cyrillic or Arabic sometimes result in illegible characters. Whenever possible, try to adhere to text encoding standards such as Unicode, or utf-8. Otherwise, you may need to make corrections manually, which can be very tedious.

Test your knowledge

An introduction to data cleaning