Do you need to be a data scientist to work in data journalism? What is the difference between data analysis and data science? Beatrice Schofield, Head of Data Intelligence at import.io, debunks the data science myth.
What does your job as Head of Data Intelligence at import.io involve?
On a day-to-day basis I think about new things we can do with data and how to open up new areas where people who are not technically trained can start using open data in their fields of research. I also work on news cases by approaching NGOs and data journalists with ideas for stories with data sets. A lot of it is content-driven. It is exploring open data, how to better use it, extract it from sites and build data sets. Much of this has traditionally been the realm of people who can programme. I make sure we get the data and give it to people who would be interested in using it but have previously been unable to because they lack this skill and are not data scientists.
Do you approach journalists or media organisations?
It depends. If there is something big coming up like the budget, I quickly build an extractor which, for instance, allows us to get the data off the BBC minute by minute, and within an hour we give it to the Guardian. They then used it to inform their sentiment analysis, through which they could read what was happening. We often take a proactive approach. We are also responsive: when Nelson Mandela died and the Guardian wanted data quickly, we could respond by predetermining what data might be interesting at that time and providing it to journalists.
Who has import.io cooperated with thus far?
We provided data for the Financial Times, the Guardian and the New York Times. The big data story that has recently made the news is Oxfam’s analysis, which shows that the five richest families in the UK have more money than the poorest 20%. We worked with Oxfam to get this data before it became a media sensation. We’re after pre-determining things like this as well.
Do you hack to get data?
I am not technically trained, so I do all my scraping via our tool. On an analytical level, I rely solely on this to get large amounts of data and give it to people in whatever form, so there is no need for me to concentrate any attention on developing skills which aren’t necessary with the tool we’ve got.
What kind of skill set do you use in your work?
I have been doing data analysis since university in different roles. But Excel is where it begins and ends. A lot of it is qualitative and quantitative research because much of my work is content-driven. And on a day-to-day basis I am very much operating like any other data analyst, without the need to delve into the realms of data science. It’s pretty much beyond me.
What would you recommend that a trainee data journalist learn in terms of software and skill?
From my perspective it is important to have written something before and to be at the sharp edge of data analysis. Data journalism is now a fundamental part of journalism and you can’t be a journalist without being data-savvy. In terms of developing the right skill set, I don’t think it is necessary to be a good programmer. I think you can focus on other areas. Tools are now here, like import.io to access the data and Tableau to visualise it, and all that is left is analysis and seeing where the stories are. This is what data journalism is about: being quite academic, realising where the holes in the data are, seeing how bias is created by certain data sets. There is a tendency for people to see data as fact and not as a socially constructed set of numbers or letters. It is important to be very critical of what we are being presented with and to look at what is missing as opposed to just what is there.
I certainly think that with data journalism moving forward, you have to have the ability to engage wholly with the amount of data that there is on the web, and have the ability to look into it and see what you can do. Because at the moment we are still, for various reasons, only looking at a tiny section of what’s available. It is key to think imaginatively and creatively about how you can build data sets over time and to focus your skills qualitatively and quantitatively, as opposed to focusing all your attention on being a good programmer when that is no longer necessary. There are now tools that allow you to have data sets and spend time focusing on stories.
Is statistical knowledge key, then?
Mostly for a journalist’s own time management. No one wants to spend a lot of time in untidy spreadsheets, cleaning data sets and thinking: “This is a bore”. To do the analysis, you need to be able to spot trends and patterns and have insights early on, but in terms of advanced statistical knowledge, I don’t think it’s necessary. I don’t have it myself. Data science is pretty much a fashion statement now.
You mentioned before a line that should be drawn between a data scientist and a data analyst. Where does it lie?
I believe the split lies in the technical skill set. Data scientists traditionally write a lot of scripts and are able to mine huge data sets using them. I see a data analyst, by contrast, as being able to perform the same analysis as a data scientist without having the programming skills and science degrees under their belt. But the two come from the same realm.
Do you think newsrooms will start employing data scientists?
I don’t think they can afford them. A data analyst could easily perform the same job by using freely available tools as opposed to using their own technical know-how. In terms of mining large data sets, it can be a collaborative work of scientists and analysts, but not in terms of assistance to data journalism, which is about spotting what you want to see in the stories as opposed to delivering a very methodical, technical approach. I think we are now developing tools that might almost push data scientists to the side.
What would be a prerequisite for becoming a data analyst?
You need to be quantitatively trained in some sense. It doesn’t need to be a degree. For instance, social sciences usually require a quantitative approach. Personally, I have learnt a lot about data analysis while being on the job. You can’t really set aside a certain skill set. Obviously there are certain skills like Excel that are needed to advance, but beyond that, analysis can be done at a very qualitative level as well. And then you back it up with figures.
You have told me about your six-month-long project of monitoring alcohol prices on the Tesco website. What happens when such a time-consuming undertaking does not yield the results you expected?
That’s the nature of it. What you presume might happen might not always happen and your assumptions might be wrong. But with tools like import.io you can run a couple of projects at the same time, so it’s not as if you’re banking on one data set to provide you with the story that you want.
How do you go about generating an initial idea for a project?
I approach my work inquisitively. I wonder: “What could you find out from that?” Sometimes I don’t start with a pre-determined outcome, but just with creating databases over time, and at some point a story is bound to come out of one of them. It’s just all about being imaginative.
And I am having a lot of fun with it. I know data analysis is considered a bit of a dull area but then if you draw the content out of it, you can make it fun. We have been looking at Dulux colours and names of paints because they are absolutely ridiculous and we made a game that pulls the names apart, for example “pomegranate champagne”. Previously we made a game which made people guess which newspaper said which headline. You just need to be creative with it.
I think the Guardian did well. They were the first ones to really push it to the front and say they are very much a data-driven newspaper. But it can be anyone who has the ability to see something unique in data, to bring different insight and different experience and apply it to the data set. This is what I believe sets people apart: the ability to communicate well through visualisation and good analysis and to see possibilities in data.
Data journalism sits on the split between sciences and humanities; it relies on both to be performed well. It does not require a heavy scientific background. It requires intuitive questioning and thinking about external factors, which come from the humanities.