Russian speakers in Ukraine – the media’s take

As tensions escalated in Ukraine, more and more media organisations took to data visualisation to convey the actual state of affairs in the country. One of the arguments in the dispute over Crimea was the large number of people inhabiting the region for whom Russian is the native language. Here’s how different media organisations pictured it:

CNN:

Russian speakers in Ukraine

The New York Times:

Russian speakers in Ukraine

The Guardian:

Russian speakers in Ukraine

Non-media outlet map (the only interactive one, with a pop-out info-window):

Russian speakers in Ukraine

(This map serves only as a reference point for my comparison.)

Legends and Scales

Clearly, the maps differ significantly from one another, mainly because of the different number of colour buckets and colour ramps they use. The Guardian opted for the simplest division into “Predominantly Russian-/Ukrainian-speaking”, which gives a clear picture of the most general trends but does not allow more nuanced insights into the situation. Especially when compared to the interactive map, the Guardian’s one seems over-simplified. Upon exploring the detailed information in the pop-out window, it’s easy to spot that no region in the territory corresponding to the Guardian’s blue field has more than 10% of Russian speakers (with the exception of Sumska Oblast in the north). The distribution is therefore relatively even across the regions, which legitimises using just one colour for this part of Ukraine. However, in the territory corresponding to the Guardian’s yellow field, only three regions have as much as 68-77% of Russian speakers, three have a share of around 50% and a further three fall in a bracket of only 25-33%. This shows that the Guardian’s take is correct in broad terms but does not accurately reflect the linguistic divisions in the country.

CNN’s map is a bit better in this respect, but the choice of scale seems puzzling, with the bright red colour encompassing a rather broad bracket of 25-74% of Russian speakers. Splitting it into two buckets (25-50% and 51-74%) would probably make the picture clearer and at the same time show where Russian speakers predominate.

The NYT map, on the other hand, does not picture the whole of Ukraine, only its easternmost regions, where the tensions escalated in an especially violent way. What’s more, the map does not explicitly showcase the language division; it rather gives it as context for where the clashes took place. The scale it uses is the most accurate of all, but not the most readable (especially against a greyish background, which sometimes intensifies the shades of blue). All in all, though, it is enough to give the reader the general picture of the situation and it therefore fulfils its function as background information.

Colours

In terms of colour ramps used, my personal preference is with the NYT, because a sequential single-colour scale is best suited to ordered data that progress from low to high. The same applies to the CNN map, which may even be considered better because it does not suffer from the colour opacity problem.
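To make the idea of a sequential single-colour scale concrete, here is a minimal Python sketch that builds one by interpolating evenly between a light and a dark shade of the same hue. The two hex endpoints and the function name `sequential_ramp` are my own illustrative choices, not the palette used by either outlet.

```python
def hex_to_rgb(colour):
    """Convert '#rrggbb' to a tuple of three integers."""
    colour = colour.lstrip("#")
    return tuple(int(colour[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert a tuple of three integers (0-255) back to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def sequential_ramp(light, dark, steps):
    """Return `steps` colours running evenly from `light` to `dark`."""
    if steps < 2:
        raise ValueError("a ramp needs at least two steps")
    start, end = hex_to_rgb(light), hex_to_rgb(dark)
    ramp = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 for the first colour, 1.0 for the last
        mixed = tuple(round(s + (e - s) * t) for s, e in zip(start, end))
        ramp.append(rgb_to_hex(mixed))
    return ramp


# Hypothetical endpoints: a pale blue and a dark blue.
ramp = sequential_ramp("#eff3ff", "#08519c", 5)
print(ramp)
```

Plain interpolation in RGB is the simplest approach, but the resulting steps are not guaranteed to look evenly spaced to the eye, so a hand-tuned palette may still read better on a map.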

The Guardian’s choice of Ukraine’s national colours is a nice pick, too. The choice of a divergent scheme (two colours) emphasises the extreme ends of the scale (in this case the only two categories: Ukrainian/Russian) but, as mentioned above, it fails to convey a more detailed picture.

Data

One more important aspect to touch on is the source of data used for the visualisation. Unfortunately, only CNN provided it for their map:

Publication       Data source
CNN               2001 Ukraine Census
NYT               not given
The Guardian      based on Washington Post map
StoryMap/Esri     not given

However, it can be suspected that all of the media outlets based their stories on the 2001 census, since it is the only official data source available at this point in time. The problem with the data is that it’s 13 years old, and the current situation might be a far cry from what it was at the beginning of the 2000s. The question is: can the language division therefore be an argument in the case of the conflict in Ukraine?

Conclusion

It is difficult to assess which map did best. As mentioned above, each of them has some shortcomings, so it should be the purpose for which they were created that decides their usefulness for the reader. I think the most accurate one is the StoryMap/Esri one, but I do not particularly like the scale it adopted (accurate as it is). I think clear milestones (e.g. <10%, 25%, 50%, 75% and >90%) would do a better job.
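As a rough illustration of what such milestones would mean in practice, the Python sketch below assigns a percentage to one of six legend buckets. The exact bucket boundaries (10, 25, 50, 75 and 90, with the lower bound included in each bucket) are my own reading of the milestones, and the sample shares are made up for illustration; they are not census figures.

```python
from bisect import bisect_right

# Assumed bucket boundaries, read from the milestones suggested above.
BREAKS = [10, 25, 50, 75, 90]
LABELS = ["<10%", "10-25%", "25-50%", "50-75%", "75-90%", "90% and over"]


def bucket(share):
    """Return the legend label for a share of Russian speakers (0-100)."""
    if not 0 <= share <= 100:
        raise ValueError("share must be a percentage between 0 and 100")
    # bisect_right counts how many boundaries are less than or equal to share.
    return LABELS[bisect_right(BREAKS, share)]


# Made-up shares for illustration only.
for share in (4.0, 25.0, 49.9, 68.0, 77.0, 95.0):
    print(share, bucket(share))
```

With these boundaries, a region at 49.9% and one at 68% land in different buckets, which is exactly the distinction a single broad bracket hides.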

Hint: If you need a hand in choosing colours for your maps, check out Colorbrewer – it’s a nice little tool to solve all your shades and hues problems.

Hint 2: If you want to learn more about mapping, check out tutorials I listed here.

 

Interview with Kiln’s Duncan Clark

 


Photo: kiln.it

Kiln is a design studio specialising in data visualisation, digital storytelling, maps and animation. It was founded and is run by Duncan Clark and Robin Houston, creators behind such projects as Women’s Rights or In flight for the Guardian. In this short interview Duncan Clark talks about how they go about their projects.

How do you choose what subjects to cover in your visualisations?

It’s a mix. Sometimes we have an idea that we know we want to pursue; sometimes the Guardian or another client will approach us with an idea.

What is key for you in the process of designing information?

One golden rule is to let the information speak for itself. There’s no point making a pretty visualisation if it doesn’t make the data clearer to understand and easier to interrogate.

What is your favourite project that kiln.it worked on so far and why? What do you think makes it interesting for people to explore?

“In flight” is certainly the most ambitious thing we’ve done so far and possibly my favourite. I like the fact that almost everyone says “wow” at seeing the sheer number of planes that have flown through the air in the last 24 hours. But I also think it’s interesting as an experiment in combining different approaches to storytelling: it takes elements from documentary making, data visualisation, radio production, live mapping and tries to combine them into a coherent whole.

What’s your work process? How much leeway do you have in your work? Do you get precise instructions for your projects or do you only accept broadly defined commissions?

It varies. Sometimes the starting point of a commission is just a broad subject area; at other times a client might have a very specific visualisation technique in mind from the outset. Most commonly, though, we’re given a dataset and asked to work out how best to turn it into something compelling.

What advice would you give to a budding data journalist?

It depends what kind of data journalist you want to be. If you’re mainly interested in breaking stories, then getting acquainted with how to get unexplored data via Freedom of Information requests might be a good idea. If you’re more interested in interactives and visualisations then learning to code can’t hurt: access to good developers is always a bottleneck for journalists, so being able to do at least some of the coding yourself is a huge advantage. Try getting started with a free HTML, CSS and JavaScript course at Codecademy.

Do you need to be a data scientist to work in data-driven journalism? Interview with import.io’s Beatrice Schofield


Picture: import.io

Do you need to be a data scientist to work in data journalism? What is the difference between data analysis and data science? Beatrice Schofield, Head of Data Intelligence at import.io debunks the data science myth.

What does your job as Head of Data Intelligence at import.io consist in?

On a day to day basis I think about new things we can do with data and how to engage new areas whereby people who are not technically trained can start using open data for their fields of research. I also work on news cases by approaching NGOs and data journalists with ideas for stories with data sets. A lot of it is content-driven. It is exploring open data, how to better use it, extract it from sites and build data sets – much of it has traditionally been the realm of people who can programme. I make sure we get the data and give it to people who would be interested to use it but have previously been unable to because they lack this skill and are not data scientists.

Do you approach journalists or media organisations?

It depends. If there is something big coming up like the budget, I quickly build an extractor which for instance allows us to get the data off the BBC on a minute-by-minute update, which within an hour we give to the Guardian. They then inform their sentiment analysis whereby they could read what was happening. We often take a pro-active approach. We are responsive: when Nelson Mandela died and the Guardian wanted data quickly, we could respond by predetermining what data might be interesting at that time and providing it to journalists.

Who has import.io cooperated with thus far?

We provided data for the Financial Times, the Guardian, the New York Times. The big data story that has recently made the news is Oxfam’s analysis which shows that the five richest families in the UK have more money than the poorest 20%. We worked with Oxfam to get this data before it became a media sensation. We’re after pre-determining things like this as well.

Do you hack to get data?

I am not technically trained so I do all my scraping via our tool. On an analytical level, I rely solely on this to get large amounts of data and give it to people in whatever form, so there is no need for me to concentrate any attention on developing skills which aren’t necessary with the tool we’ve got.

What kind of skill set do you use in your work?

I have been doing data analysis since university in different roles. But Excel is where it begins and ends. A lot of it is qualitative and quantitative research because much of my work is content-driven. And on a day to day basis I am very much operating as any other data analyst without the need to delve into the realms of data science. It’s pretty much beyond me.

What would you recommend that a trainee data journalist learn in terms of software and skill?

From my perspective it is important to have written something before and to be on the sharp edge of data analysis. Data journalism is now a fundamental part of journalism and you can’t be a journalist without being data-savvy. In terms of developing the right skill set, I don’t think it is necessary to be a good programmer. I think you can focus on other areas. Tools are now here, like import.io to access the data and Tableau to visualise it, and all that is left is analysis and seeing where the stories are. This is what data journalism is about. Being quite academic, realising where the holes in the data are, seeing how the bias is created by certain data sets. Because there is a tendency for people to see data as fact and not as a socially constructed set of numbers or letters. It is important to be very critical with what we are being presented with and to look at what is missing as opposed to just what is there.

I certainly think that with data journalism moving forward, you have to have the ability to engage wholly with the amount of data that there is on the web, and have the ability to look into it and see what you can do. Because at the moment we are still – for various reasons – only looking at a tiny section of what’s available. It is key to think imaginatively and creatively about how we can build data sets over time and to focus your skills qualitatively and quantitatively as opposed to focusing all our attention on being a good programmer when it’s no longer the time to be it. There are now tools that allow you to have data sets and spend time focusing on stories.

Is statistical knowledge key, then?

Mostly for a journalist’s own time management. No one wants to spend a lot of time in untidy spreadsheets, cleaning data sets and thinking: “This is a bore”. To be able to do the analysis, you can spot trends and patterns and have insights early on but in terms of advanced statistical knowledge, I don’t think it’s necessary. I don’t have it myself. Data science is pretty much a fashion statement now.

You mentioned before a line that should be drawn between a data scientist and a data analyst. Where does it lie?

I believe the split lies in the technical skill set. Data scientists traditionally write a lot of scripts and are able to do mining on huge data sets using scripts. I see a data analyst, on the other hand, as being able to perform the same analysis as a data scientist without having the programming skills and science degrees under their belt. But the two come from the same realm.

Do you think newsrooms will start employing data scientists?

I don’t think they can afford them. A data analyst could easily perform the same job by using freely available tools as opposed to using their own technical know-how. In terms of mining large data sets, it can be a collaborative work of scientists and analysts, but not in terms of assistance to data journalism, which is spotting what you want to see in the stories as opposed to delivering a very methodical, technical approach. I think we are now developing tools that might almost push data scientists to the side.

What would be a prerequisite for becoming a data analyst?

You need to be quantitatively trained in some sense. It doesn’t need to be a degree. For instance, social sciences usually require a quantitative approach. Personally, I have learnt a lot about data analysis while being on the job. You can’t really set aside a certain skill set. Obviously there are certain skills like Excel that are needed to advance but beyond that, analysis can be done at a very qualitative level as well. And then you back it up with figures.

You have told me about your six-month-long project of monitoring alcohol prices on the Tesco website. What happens when such a time-consuming undertaking does not yield the results you expected?

That’s the nature of it. What you presume might happen might not always happen and your assumptions might be wrong. But with tools like import.io you can run a couple of projects at the same time, so it’s not as if you’re banking on one data set to provide you with the story that you want.

How do you go about generating an initial idea for a project?

I approach my work with an inquisitive approach. I wonder “what could you find out from that?”. Sometimes I don’t start with a pre-determined outcome, but just with creating databases over time and at some point a story is bound to come out of one of them. It’s just all about being imaginative.

And I am having a lot of fun with it. I know data analysis is considered a bit of a dull area but then if you draw the content out of it, you can make it fun. We have been looking at Dulux colours and names of paints because they are absolutely ridiculous and we made a game that pulls the names apart, for example “pomegranate champagne”. Previously we made a game which made people guess which newspaper said which headline. You just need to be creative with it.

I think the Guardian did well. They were the first ones to really push it to the front and say they are very much a data-driven newspaper. But it can be anyone who has the ability to see something unique in data, to bring different insight, different experience and apply it in the data set. This is what I believe sets people apart: the ability to communicate well through visualisation and good analysis and seeing possibilities in data.

Data journalism sits on the split between sciences and humanities – it relies on both to be able to be performed well. It does not require heaviness in the scientific field. It requires intuitive questioning and thinking about external factors that come from humanities.

 

Hint: If you want to learn about data and visualisation, check out my list of best tutorials here.

Best tutorials for data journalists

I compiled a round-up of video tutorials and webinars which I found most useful during the last couple of months of my training to become a data journalist.

Data scraping

A series of webinars by Alex Gimson from import.io on:

  1. Auto table extraction
  2. Building a data crawler
  3. Getting data behind passwords
  4. Datasets

And good news – there will be more! Watch this space: http://blog.import.io/

Data visualisation

A series of webinars by Jewel Loree from Tableau on:

  1. Basic Tableau Proficiency 
  2. Actions, Filters and Parameters in Tableau Public
  3. Data Formatting, joins, blends, and table calculations

Two more to come, stay tuned on the Tableau Software YouTube channel.

Mapping

A tutorial by Andrew Hill on using CartoDB for mapping:

Online mapping for beginners

Two Google Fusion Tables tutorials which will teach you how to make:

  1. a point map
  2. a polygon map

and here come two webinars you can still take part in:

Obviously, the list is not exhaustive and you would need to do some more reading around the content of the tutorials. Blogs run by the people behind the software should be very helpful in getting more insight into the particular problems you might encounter on the way.

How to make a data visualisation with Infogr.am

Infogr.am is a free online tool that helps you make quick and beautiful interactive data visualisations like the one I prepared for my online journalism blog. Its interface is intuitive and user-friendly, and the majority of its tools are drag-and-drop, which makes Infogr.am so easy to operate.

The first step is to choose your data and plan what you want to present in your interactive visualisation. I opted for a data set from the Organic Market Report 2014 compiled by the Soil Association (available on demand).

Once you sign up to Infogr.am and start your creative process, you are invited to choose one of the ready-made templates:


Choose a colour palette that you want to go for and click “Use design”. A dashboard with editable elements appears, to which you can add a chart, a map, text, a photo or a video from the menu on the right.


Double-click on each element to edit it: change text or open a chart menu. First, give a title to your visualisation. Edit the existing chart or add a new one – make sure you choose the right type of chart for the type of data you have. Double-click on the chart. An Excel-like spreadsheet appears where you can paste your data:

After the final tweaks to your data, go to the second tab, “Settings”. Depending on the chart you chose, you will find different editing options here: colours, directions, the chart’s size and others.


Pay close attention to how you manipulate your chart. It is important that it present the data in a clear and easily understandable way.

After you have finished adjusting your chart, click “Done” and go on to add more elements to your visualisation.

Infogr.am is a great tool, especially for beginners in data-driven journalism, yet it has a couple of major limitations:

  1. It is impossible to copy-paste text to and from text boxes, which makes typing time-consuming and rather laborious.
  2. As you manipulate the data in the Excel-like spreadsheet, the preview of the chart is unavailable, which makes you save and re-edit the chart a couple of times before you achieve the effect you want.
  3. It would be useful to be able to caption the charts directly, as opposed to having to add chart titles and captions as separate elements to your visualisation.

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi told Journalism.co.uk: The idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

After downloading and opening the import.io browser, copy the URL of the page you want to scrape into it. I decided to scrape the search results page for orphanages in London:


After opening the website, press the tiny pink button in the top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site).

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm the following.

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for “multiple” as my URL is a listing of multiple search results.

Now, the time has come to “train your rows”, i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”…

…and the entry will be highlighted in pink or blue. Press “Train rows”.

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If they are, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of the name of the orphanage, an address, a phone number and a short description, so I will extract all those into separate columns. Let’s start by adding a column “name”.

Next, highlight the name of the first orphanage in the list and press “Train”.

Your table should automatically fill in with the names of all orphanages in the list.

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of address and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should now have four filled-in columns.

*Before passing on to the next column it is worth checking that all rows have filled up. If not, highlighting and training of individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to do this, go back to your website in your regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”.

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

Give a name to your Crawler, e.g. “Orphanages in London”, and wait for import.io to upload your data. Then run the crawler.

Make sure that the page depth is 10 and then click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with the black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.
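If you would rather not use Excel for this, the same list of URLs can be produced with a few lines of Python. This is only a sketch: the URL pattern below is hypothetical (every site numbers its result pages differently), the helper name `page_urls` is my own, and I am assuming the “Where to start?” window accepts one URL per line.

```python
def page_urls(template, first_page, last_page):
    """Fill the page number into `template` for every page in the range."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


# Hypothetical URL pattern; replace it with the one your site actually uses.
TEMPLATE = "https://www.example.com/search?q=orphanages+london&page={page}"

urls = page_urls(TEMPLATE, 1, 5)
print("\n".join(urls))
```

Copy the printed lines and paste them in place of the list you would otherwise have built by dragging down in Excel.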

After the crawling is complete, you can download your data in Excel, HTML, JSON or CSV format.

As a result, we obtain a data set which can be easily turned into a map of orphanages in London.
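If you used the workaround of adding the same page several times, the downloaded table will contain repeated rows. They can be deleted by hand, but here is a minimal Python sketch of doing it automatically on a CSV export. The column names follow the ones used in this tutorial, the rows are invented placeholders, and the real export may well contain additional columns, so treat the layout as an assumption.

```python
import csv
import io

# Stand-in for the downloaded CSV file; the rows are invented placeholders.
EXPORT = """name,address,phone number,description
Example Home A,1 Sample Street,000 0000 0001,First placeholder entry
Example Home B,2 Sample Street,000 0000 0002,Second placeholder entry
Example Home B,2 Sample Street,000 0000 0002,Second placeholder entry
"""


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())  # a hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = list(csv.DictReader(io.StringIO(EXPORT)))
unique_rows = drop_duplicate_rows(rows)
print(len(rows), "rows before,", len(unique_rows), "after")
```

To run it on a real download, replace `io.StringIO(EXPORT)` with a file opened via `open("your_file.csv", newline="")`.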

Do you have any further tips for import.io extraction? Do you know any other good scrapers? Share your thoughts in the comments below.

Hint: If you need to structure and clean your data, here’s how to do it.

In the meantime, look out for another post in which I will explain the next step: how to visualise the data you have.

Structuring data: the basics

In order to properly analyse data, you need to structure it first. Here are a couple of tips and tricks on how to do it in an Excel table if you are only at the beginning of your adventure with data journalism.

  1. You will want to start with a table, which contains rows and columns. Each column corresponds to a variable, and each row corresponds to a record.
  2. Make sure you include only one header row at the very top of the spreadsheet. It should contain column names, one next to another. If you come across a table with multiple headers, simplify it into a single header or divide the data into multiple tables.
  3. Remember to include only one type of data per variable – one column should only include one type of data.
  4. Make copies of spreadsheets with data that you are about to analyse. You might want to use your raw data for another analysis at a later stage so keep the original file untouched.
  5. Add new data to the table as new rows, not new columns. Columns correspond to new variables (which you haven’t looked at before), not new “data entries” or “data records”.
  6. Once the data is structured into an orderly table, it is time to decide what’s needed and what’s simply obscuring the big picture. Remove or modify any rows and columns which are not necessary or sufficiently accurate.
  7. Take care to name your variables (columns) in a clear and concise way. You might not be the only person dealing with the file so making the names as straightforward as possible is key to making it work.
  8. Make sure your data for each variable is clear and readable. It must be entered in a uniform way into each column.
  9. Look out for any missing data and handle it as appropriate. Leaving a cell empty is in most cases safer than inserting a “0”.
  10. Finally, make sure you format all data according to its type (date/number/location/text…) so that Excel and any other processing software read it correctly.
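The same rules apply if you process a table outside Excel. The Python sketch below illustrates three of them on an invented sample table (the column names and values are made up): a single header row, one type per column, and an empty cell kept as “missing” rather than replaced with 0.

```python
import csv
import io
from datetime import datetime

# Invented sample table: one header row, one record per row.
RAW = """name,visitors,opened
Site A,120,2014-01-15
Site B,,2014-02-01
Site C,80,2014-03-10
"""


def parse_row(row):
    """Convert each column to its type; empty cells become None, not 0."""
    visitors = row["visitors"].strip()
    return {
        "name": row["name"].strip(),
        "visitors": int(visitors) if visitors else None,
        "opened": datetime.strptime(row["opened"].strip(), "%Y-%m-%d").date(),
    }


records = [parse_row(r) for r in csv.DictReader(io.StringIO(RAW))]

# Average over the known values only.
known = [r["visitors"] for r in records if r["visitors"] is not None]
mean_known = sum(known) / len(known)

# What the average would be if the empty cell had been filled with 0.
mean_with_zero = sum(r["visitors"] or 0 for r in records) / len(records)

print(mean_known)
print(mean_with_zero)
```

In this sample the average of the known values is 100, while treating the empty cell as 0 drags it down to about 66.7, which is why an empty cell is usually the safer choice.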

Hint: All data structured and cleaned? Visualise it. In this post, I explain how to do it quickly and easily.