Do you need to be a data scientist to work in data-driven journalism? Interview with import.io’s Beatrice Schofield


Picture: import.io

Do you need to be a data scientist to work in data journalism? What is the difference between data analysis and data science? Beatrice Schofield, Head of Data Intelligence at import.io, debunks the data science myth.

What does your job as Head of Data Intelligence at import.io consist of?

On a day to day basis I think about new things we can do with data and how to engage new areas whereby people who are not technically trained can start using open data for their fields of research. I also work on news cases by approaching NGOs and data journalists with ideas for stories with data sets. A lot of it is content-driven. It is exploring open data, how to better use it, extract it from sites and build data sets – much of it has traditionally been the realm of people who can programme. I make sure we get the data and give it to people who would be interested to use it but have previously been unable to because they lack this skill and are not data scientists.

Do you approach journalists or media organisations?

It depends. If there is something big coming up like the budget, I quickly build an extractor which for instance allows us to get the data off the BBC on a minute-by-minute update, which within an hour we give to the Guardian. They then inform their sentiment analysis whereby they could read what was happening. We often take a pro-active approach. We are responsive: when Nelson Mandela died and the Guardian wanted data quickly, we could respond by predetermining what data might be interesting at that time and providing it to journalists.

Who has import.io cooperated with thus far?

We provided data for the Financial Times, the Guardian, the New York Times. The big data story that has recently made the news is Oxfam’s analysis which shows that the five richest families in the UK have more money than the poorest 20%. We worked with Oxfam to get this data before it became a media sensation. We’re after pre-determining things like this as well.

Do you hack to get data?

I am not technically trained so I do all my scraping via our tool. On an analytical level, I rely solely on this to get large amounts of data and give it to people in whatever form, so there is no need for me to concentrate any attention on developing skills which aren’t necessary with the tool we’ve got.

What kind of skill set do you use in your work?

I have been doing data analysis since university in different roles. But Excel is where it begins and ends. A lot of it is qualitative and quantitative research because much of my work is content-driven. And on a day to day basis I am very much operating as any other data analyst without the need to delve into the realms of data science. It’s pretty much beyond me.

What would you recommend that a trainee data journalist learn in terms of software and skill?

From my perspective it is important to have written something before and to be on the sharp edge of data analysis. Data journalism is now a fundamental part of journalism and you can’t be a journalist without being data-savvy. In terms of developing the right skill set, I don’t think it is necessary to be a good programmer. I think you can focus on other areas. Tools are now here, like import.io to access the data and Tableau to visualise it, and all that is left is analysis and seeing where the stories are. This is what data journalism is about. Being quite academic, realising where the holes in the data are, seeing how the bias is created by certain data sets. Because there is a tendency for people to see data as fact and not as a socially constructed set of numbers or letters. It is important to be very critical of what we are being presented with and to look at what is missing as opposed to just what is there.

I certainly think that with data journalism moving forward, you have to have the ability to engage wholly with the amount of data that there is on the web, and have the ability to look into it and see what you can do. Because at the moment we are still – for various reasons – only looking at a tiny section of what’s available. It is key to think imaginatively and creatively about how we can build data sets over time and to focus your skills qualitatively and quantitatively as opposed to focusing all our attention on being a good programmer when it’s no longer the time to be it. There are now tools that allow you to have data sets and spend time focusing on stories.

Is statistical knowledge key, then?

Mostly for a journalist’s own time management. No one wants to spend a lot of time in untidy spreadsheets, cleaning data sets and thinking: “This is a bore”. To be able to do the analysis, you can spot trends and patterns and have insights early on but in terms of advanced statistical knowledge, I don’t think it’s necessary. I don’t have it myself. Data science is pretty much a fashion statement now.

You mentioned before a line that should be drawn between a data scientist and a data analyst. Where does it lie?

I believe the split lies in the technical skill set. Data scientists traditionally write a lot of scripts and are able to do mining on huge data sets using them, while I see a data analyst as being able to perform the same analysis as a data scientist without having the programming skills and science degrees under their belt. But the two come from the same realm.

Do you think newsrooms will start employing data scientists?

I don’t think they can afford them. A data analyst could easily perform the same job by using freely available tools as opposed to using their own technical know-how. In terms of mining large data sets, it can be a collaborative work of scientists and analysts, but not in terms of assistance to data journalism, which is spotting what you want to see in the stories as opposed to delivering a very methodical, technical approach. I think we are now developing tools that might almost push data scientists to the side.

What would be a prerequisite for becoming a data analyst?

You need to be quantitatively trained in some sense. It doesn’t need to be a degree. For instance, social sciences usually require a quantitative approach. Personally, I have learnt a lot about data analysis while being on the job. You can’t really set aside a certain skill set. Obviously there are certain skills like Excel that are needed to advance but beyond that, analysis can be done at a very qualitative level as well. And then you back it up with figures.

You have told me about your 6-month long project of monitoring alcohol prices on the Tesco website. What happens when such a time-consuming undertaking does not yield the results you expected?

That’s the nature of it. What you presume might happen might not always happen and your assumptions might be wrong. But with tools like import.io you can run a couple of projects at the same time, so it’s not as if you’re banking on one data set to provide you with the story that you want.

How do you go about generating an initial idea for a project?

I approach my work with an inquisitive approach. I wonder “what could you find out from that?”. Sometimes I don’t start with a pre-determined outcome, but just with creating databases over time and at some point a story is bound to come out of one of them. It’s just all about being imaginative.

And I am having a lot of fun with it. I know data analysis is considered a bit of a dull area but then if you draw the content out of it, you can make it fun. We have been looking at Dulux colours and names of paints because they are absolutely ridiculous and we made a game that pulls the names apart, for example “pomegranate champagne”. Previously we made a game which made people guess which newspaper said which headline. You just need to be creative with it.

I think the Guardian did well. They were the first ones to really push it to the front and say they are very much a data-driven newspaper. But it can be anyone who has the ability to see something unique in data, to bring different insight, different experience and apply it in the data set. This is what I believe sets people apart: the ability to communicate well through visualisation and good analysis and seeing possibilities in data.

Data journalism sits on the split between sciences and humanities – it relies on both to be able to be performed well. It does not require heaviness in the scientific field. It requires intuitive questioning and thinking about external factors that come from humanities.

 

Hint: If you want to learn about data and visualisation, check out my list of best tutorials here.

Best tutorials for data journalists

I compiled a round-up of video tutorials and webinars which I found most useful during the last couple of months of my training to become a data journalist.

Data scraping

A series of webinars by Alex Gimson from import.io on:

  1. Auto table extraction
  2. Building a data crawler
  3. Getting data behind passwords
  4. Datasets

And good news – there will be more! Watch this space: http://blog.import.io/

Data visualisation

A series of webinars by Jewel Loree from Tableau on:

  1. Basic Tableau Proficiency 
  2. Actions, Filters and Parameters in Tableau Public
  3. Data Formatting, joins, blends, and table calculations

Two more to come, stay tuned on Tableau Software YouTube channel.

Mapping

A tutorial by Andrew Hill on using CartoDB for mapping:

Online mapping for beginners

Two Google Fusion Tables tutorials which will teach you how to make:

  1. a point map
  2. a polygon map

and here come two webinars you can still take part in:

Obviously, the list is not exhaustive and you would need to do some more reading around the content of the tutorials. Blogs run by the people behind the software should be very helpful in getting more insight into the particular problems you might encounter on the way.

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi told Journalism.co.uk: The idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

After downloading and opening the import.io browser, copy the URL of the page you want to scrape into it. I decided to scrape the search results page for orphanages in London:

001 Orphanages in London

After opening the website, press the tiny pink button in the top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site).

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm.

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for multiple as my URL is a listing of multiple search results.

Now, the time has come to “train your rows” i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”…

…and the entry will be highlighted in pink or blue. Press “Train rows”.

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If they are, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of the name of the orphanage, address, phone number and a short description so I will extract all those to separate columns. Let’s start by adding a column “name”.

Next, highlight the name of the first orphanage in the list and press “Train”.

Your table should automatically fill in with names of all orphanages in the list.

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of address and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should now have four filled columns.

*Before passing on to the next column it is worth checking that all rows have filled up. If not, highlighting and training of individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to do this, go back to your website in your regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”.

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.
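If the final table is large, deleting the repeated rows by hand gets tedious. Once the table has been exported (as CSV, for instance), the clean-up could be scripted. Below is a minimal sketch in Python, assuming the column names used in this tutorial; the function name and the sample rows are made up for illustration.

```python
import csv
import io


def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        # A row counts as a duplicate only if every column matches.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample standing in for an export where page 2 was added twice.
sample_csv = """name,address,phone number,description
Home A,1 Example Road,020 0000 0001,First entry
Home B,2 Example Road,020 0000 0002,Second entry
Home B,2 Example Road,020 0000 0002,Second entry
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
unique_rows = drop_duplicate_rows(rows)
print(len(rows), "rows before,", len(unique_rows), "after")
```

With a real export, you would replace `io.StringIO(sample_csv)` with the opened CSV file.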

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

Give your Crawler a name, e.g. “Orphanages in London”, and wait for import.io to upload your data. Then run the crawler.

Make sure that the page depth is 10 and then click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with a black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.
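The list of URLs can also be produced without Excel. Here is a minimal sketch in Python, assuming the page number appears as a `page` parameter in the URL. The template below is hypothetical, so swap in the pattern you see in your own search results.

```python
def build_page_urls(template, first_page, last_page):
    """Return one URL per results page by filling the page number into the template."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


# Hypothetical URL pattern; copy the real one from your browser's address bar
# and put {page} where the page number goes.
template = "http://www.example.com/search?what=orphanages&where=london&page={page}"
urls = build_page_urls(template, 1, 5)
print("\n".join(urls))
```

The printed list can then be pasted into the “Where to start?” window.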

After the crawling is complete, you can download your data in Excel, HTML, JSON or CSV.
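With a CSV download in hand, you can double-check that every column filled up for every row, the same check recommended during training. This is a rough sketch in Python, assuming the four column names from this tutorial; the function name and the sample data are invented for illustration.

```python
import csv
import io


def find_incomplete_rows(rows, columns):
    """Return (row_number, missing_columns) pairs for rows with empty cells.

    Row numbers start at 1 and do not count the header line.
    """
    problems = []
    for number, row in enumerate(rows, start=1):
        missing = [c for c in columns if not (row.get(c) or "").strip()]
        if missing:
            problems.append((number, missing))
    return problems


# Hypothetical sample standing in for the downloaded CSV.
sample_csv = """name,address,phone number,description
Home A,1 Example Road,020 0000 0001,First entry
Home B,2 Example Road,,Second entry
"""

columns = ["name", "address", "phone number", "description"]
rows = list(csv.DictReader(io.StringIO(sample_csv)))
for number, missing in find_incomplete_rows(rows, columns):
    print("Row", number, "is missing:", ", ".join(missing))
```

Any row it reports is a candidate for going back and training that element individually.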

As a result, we obtain a data set which can be easily turned into a map of orphanages in London.

Do you have any further tips for import.io extraction? Do you know any other good scrapers? Share your thoughts in the comments below.

Hint: If you need to structure and clean your data, here’s how to do it.

In the meantime, look out for another post in which I will explain the next step: how to visualise the data you have.