Google Sheets: Scraping data from the internet

Introduction

By now you should be familiar with Google Sheets as a tool for managing, cleaning and manipulating data for analysis using basic commands and formulae. Most often, we import data into Google Sheets either by copying and pasting from another file or by uploading a CSV or XLS file.

Sheets can also import data directly from an online source, or by “scraping” a website. Scraping simply means copying data from a live website into a tool for analysis.

Set it up correctly, and when the data source updates, so will your spreadsheet.

To get started, open up this webpage in a new browser window: https://en.wikipedia.org/wiki/Trafficking_in_Persons_Report

Finding tables in a webpage

This Wikipedia page provides information on the US State Department’s Trafficking in Persons (TIP) Report, indicating each nation’s ranking. The ranks run from one to three, with three indicating the highest likely prevalence of trafficking. A “w” indicates that a country is tier 2, but in danger of slipping to tier 3.

It’s a useful aggregation of data published over a ten year period – and we might want to use it for our own reporting.

We can import the table directly into a spreadsheet for manipulation. To do so, open a new spreadsheet in Google Sheets and put your cursor in cell A1.

Importing tables

To scrape data from a webpage, we use the IMPORTHTML formula. This formula is simple to use – we need to reference a webpage URL, tell it what type of data we are looking for, and the position of that data on the page. 

In this case, our formula in A1 will be:

=IMPORTHTML("https://en.wikipedia.org/wiki/Trafficking_in_Persons_Report", "table", 1)

The 1 at the end indicates that we want the first table on the webpage.
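IMPORTHTML can also pull bulleted or numbered lists rather than tables, by changing the second argument to "list". As an illustrative sketch (the index here is an assumption – you would need to count the lists on your target page to find the right one):

=IMPORTHTML("https://en.wikipedia.org/wiki/Trafficking_in_Persons_Report", "list", 2)

Changing the final number selects a different list or table on the page, counting from the top.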


Importing CSV files

We can also import CSV files in a similar manner. You will often find CSV files of useful, regularly updated data in repositories such as GitHub. For example, this is a repository of datasets in CSV format containing information about human trafficking cases in the US, compiled from FBI reports for a project by Sean F Larson.

The direct link to the CSV file can be obtained by clicking the Raw button on GitHub. Copy this link to your clipboard.

IMPORTDATA formula

We use a slightly different formula to import a CSV file, IMPORTDATA. In this case, the formula is:

=IMPORTDATA("https://github.com/DataScienceSean/human_trafficking-/raw/master/Human_Trafficking_Offenses_and_Clearances_by_State_2014_to_2017.csv")

Working with imported data

Note that importing data in this manner should only be done into an empty area of the sheet. If any of the cells the import would write over already contain data, the formula will return an error instead of importing.

For the same reason, you cannot perform operations such as sorting directly on an imported table.

You can copy the imported data into a new tab by pressing CTRL + A to select the whole table, then creating a new tab and pressing CTRL + SHIFT + V (this is Paste values only in the menu). This removes the import formula and leaves you with just the data, to work with as normal. Remember, though, that if you create a visualisation based on this new tab, it will not update when the source webpage does.

When not to use Sheets

Working with data imported or scraped in this way can be incredibly powerful. We can use formulas to subset or reshape it in other tabs, creating visualisations and charts that update automatically when the source data does. This makes it possible to build live dashboards showing the latest statistics.
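As a sketch of this kind of subsetting, you can wrap the import in a QUERY formula; because QUERY is working on an array rather than a range, columns are referred to as Col1, Col2 and so on. The column number and filter value here are illustrative assumptions, not taken from the actual table:

=QUERY(IMPORTHTML("https://en.wikipedia.org/wiki/Trafficking_in_Persons_Report", "table", 1), "select Col1, Col2 where Col2 = '3'", 1)

Because the QUERY runs over the live import, its output – and any chart built on it – refreshes when the source table changes.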

However, it does have limitations. Because scraping or importing into Sheets does not create a copy of the original data, if the original page is removed, or edited so that your target table is deleted, you will no longer be able to access the information, and any visualisations you have created will stop working.
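One modest safeguard is to wrap the import in IFERROR, so that a broken source displays a message of your choosing rather than an error (the message text here is just an example):

=IFERROR(IMPORTHTML("https://en.wikipedia.org/wiki/Trafficking_in_Persons_Report", "table", 1), "Source table unavailable")

This does not preserve the data – it only makes the failure clearer – so keeping a pasted-values copy of anything important is still wise.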

Let’s review