UPDATED 12:56 EDT / NOVEMBER 24 2017

How big data made the ‘Paradise Papers’ revelations real

When more than 13 million secret documents detailing the offshore activities of some of the world’s most prominent people and organizations fall into a journalist’s lap, it presents an exciting dilemma. The data is potentially explosive, but how do you even extract the information tied up in millions of PDF files and emails, much less tease out the complex relationships that stakeholders have intentionally tried to hide?

That was the challenge that faced the International Consortium of Investigative Journalists a year ago when the German newspaper Süddeutsche Zeitung obtained a cache of records now known as the Paradise Papers. The documents included nearly 7 million loan agreements, financial statements, emails, trust deeds and other paperwork accumulated over the course of nearly 50 years by the offshore law firm Appleby Group Ltd. and Asiaciti Trust Singapore Pte. Ltd., a family-owned trust company.

“It was a mess of unstructured, different sources. We had to find leads, names and connections,” said Emilia Díaz-Struck, lead researcher for cross-border investigations at the ICIJ, a global network of more than 200 investigative journalists in 70 countries who collaborate on investigative reporting into cross-border issues.

But the organization had experience and technology to help. In a yearlong investigation aided by big data and collaboration software, a team of 380 journalists around the globe sorted through the documents to uncover the ways in which wealthy and prominent individuals and organizations (including Apple Inc., Nike Inc., U.S. Commerce Secretary Wilbur Ross and a host of big U.S. political donors) have used offshore holding companies and havens to hide investments or avoid paying U.S. taxes. Their work was picked up by news organizations across the globe and has resulted in executive resignations and calls for tax reform.

Data journalism

Their work is an example of how big data is intersecting with old-fashioned shoe leather to revolutionize the craft of reporting. “I don’t know how we could have done a project this large without technology,” said reporter Spencer Woodman, who reported on investments that Kremlin-owned firms have made in Twitter Inc. and Facebook Inc., among other topics.

The ICIJ is on the leading edge of what’s being called “data journalism.” Two years ago, the organization was the beneficiary of more than 11.5 million financial and legal records that documented widespread crime and corruption involving prominent public figures and foreign governments. The resulting Panama Papers investigation captured a Pulitzer Prize for the organization and is considered a milestone in data journalism.

Some of the tools the ICIJ developed for the Panama Papers were brought back to help organize the Paradise Papers project. One is Extract, an open-source utility that can pull text and metadata from a wide variety of document formats, including PDF files and images. That was useful in combing through the millions of PDF documents contained in the Paradise Papers trove. Extract can even apply optical character recognition to scanned documents, turning rivers of text into neatly structured rows and columns.

“The technology helps you extract information based on key terms or the way a file is structured,” Díaz-Struck said. “For example, you can generate a spreadsheet of people connected to specific companies from a PDF.”
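As a rough illustration of the idea in that quote, the Python sketch below turns text that has already been pulled out of a document into spreadsheet-style rows. It is not the ICIJ’s Extract tool: the line layout, the sample names and the functions extract_rows and to_csv are all hypothetical, and real filings are far less regular.

```python
import csv
import io
import re

# Hypothetical line layout, invented for illustration; real filings vary widely.
OFFICER_PATTERN = re.compile(
    r"^(?P<role>Director|Shareholder|Secretary):\s*(?P<person>.+?)"
    r"\s*\|\s*Company:\s*(?P<company>.+?)\s*$"
)

def extract_rows(text):
    """Return (person, role, company) tuples for lines matching the assumed layout."""
    rows = []
    for line in text.splitlines():
        match = OFFICER_PATTERN.match(line.strip())
        if match:
            rows.append((match["person"], match["role"], match["company"]))
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["person", "role", "company"])
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample text standing in for the output of a PDF text extractor.
SAMPLE_TEXT = """\
Register of officers (made-up sample)
Director: Jane Example | Company: Alpha Holdings Ltd.
Shareholder: John Sample | Company: Alpha Holdings Ltd.
This line is free text and is skipped.
Secretary: Jane Example | Company: Beta Trust Ltd.
"""

if __name__ == "__main__":
    print(to_csv(extract_rows(SAMPLE_TEXT)))
```

Keying on the structure of a line rather than on a fixed list of names is what lets a reporter get a first-pass table out of documents whose contents are not known in advance.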

Seeing connections

Interactive graphic of Trump donors from Big U.S. Political Donors Play The Offshore Game. Source: ICIJ

Another critical tool is Neo4j, a graph database engine from Neo4j Inc. Graph databases are a type of NoSQL database that represents data as nodes, the edges that connect them, and properties attached to both. Rather than storing information in rows and columns, they manage unstructured data as connections between objects. For example, a graph database can establish that an account that’s in one person’s name may be indirectly controlled by that person’s spouse. A relational database indexed by account holder would not readily surface that linkage unless it was explicitly stated.
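The spouse example can be sketched in a few lines of plain Python. This is a toy model of nodes, edges and properties, not Neo4j’s API or the ICIJ’s data model; the node names, the HOLDS and MARRIED_TO relationship labels and the helper functions are all made up for illustration.

```python
# Nodes carry properties; edges are (source, relationship, target) triples.
# Every name and value here is invented for illustration.
NODES = {
    "p1": {"label": "Person", "name": "Alex Example"},
    "p2": {"label": "Person", "name": "Sam Example"},
    "a1": {"label": "Account", "number": "000-TEST"},
}
EDGES = [
    ("p1", "HOLDS", "a1"),
    ("p1", "MARRIED_TO", "p2"),
]

def neighbors(node_id):
    """Yield (relationship, other_node) for every edge touching node_id, in either direction."""
    for source, rel, target in EDGES:
        if source == node_id:
            yield rel, target
        elif target == node_id:
            yield rel, source

def accounts_linked_through_spouse(person_id):
    """Follow MARRIED_TO, then HOLDS, to list accounts held in a spouse's name."""
    found = []
    for rel, spouse in neighbors(person_id):
        if rel != "MARRIED_TO":
            continue
        for rel2, other in neighbors(spouse):
            if rel2 == "HOLDS" and NODES[other]["label"] == "Account":
                found.append((NODES[spouse]["name"], NODES[other]["number"]))
    return found

if __name__ == "__main__":
    # Sam holds no account directly, but is two hops away from Alex's account.
    print(accounts_linked_through_spouse("p2"))
```

The account never lists Sam Example as its holder, so a lookup by holder name would miss the link. The traversal finds it because the relationships themselves are stored as data.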

This capability proved critical to Paradise Papers journalists who needed to scope out relationships that often spanned multiple degrees of separation. “Graph databases bring you a lot of clarity to find connections that aren’t obvious in millions of documents,” Díaz-Struck said. “We can see the connections between officers and shareholders, directors and foundations. These are a key part of the research.”
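To make “degrees of separation” concrete, the sketch below runs a breadth-first search over a small undirected graph and reports the shortest chain between two parties. It is an assumption-laden illustration: the entities and links are invented, and a graph database would run this kind of query with its own engine rather than with hand-written Python.

```python
from collections import deque

# Invented links between people, companies and a foundation.
EDGES = [
    ("Person A", "Company X"),
    ("Company X", "Foundation Y"),
    ("Foundation Y", "Company Z"),
    ("Person B", "Company Z"),
    ("Person C", "Company Q"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map from pairs of connected entities."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the shortest chain of entities, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in sorted(adjacency[current]):
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None

if __name__ == "__main__":
    graph = build_adjacency(EDGES)
    path = shortest_path(graph, "Person A", "Person B")
    print(path, "degrees of separation:", len(path) - 1)
```

In this made-up data, Person A and Person B never appear in the same record, yet they sit four links apart through two companies and a foundation. Person C has no chain to either of them.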

To visualize those connections, the team used Linkurious, a visualization tool for graph database engines from Paris-based Linkurious SAS. Visualization was important in helping reporters who aren’t data scientists see how the entities in a complex web are related to each other. The task was too big for a human to process, Díaz-Struck said.

“If you search for ‘Wilbur Ross,’ you’ll find thousands of documents, but with Linkurious you can see all the companies Appleby was servicing offshore that listed Ross as a shareholder,” she said. “That shows you how big and important his connection to Appleby was.” Many of the Linkurious visualizations appear as informational graphics in profiles of prominent figures from the papers on the ICIJ website.

Tying everything together was the ICIJ’s Global I-Hub, a Facebook-like collaboration platform that reporters could use to share findings and ask for help.

“We created lots of groups – by countries, entities and subject matter like yachts and private planes,” said Martha Hamilton, an ICIJ editor and writer. “As people came across things that pertained to that topic, we put it in the group. You could post a question, and since other people were likely to be looking at the same data, you wouldn’t have to go find it yourself.”

Some things don’t change

The Associated Press may be writing corporate earnings stories and high school sports with robots, but investigative journalism still requires a human touch. As great as the tools were at decoding and organizing information, it was still up to the individual reporters and editors to slog through the data, an often tedious task. “When you come right down to it, a lot of what I did was searching,” Woodman said.

“It’s tedious work, going through a lot of documents, discarding most of them and trying to figure out if you’ve found the person you’re looking for, or the 52 other people who have the same name,” Hamilton said.

Technology helped find offshore account holders, but reporters still had to call subjects to verify information and get responses. Some chose not to reply at first but relented when confronted with documentation, Woodman said. That’s another way data journalism has changed the rules.

The Paradise Papers revelations aren’t over. More than 20 additional stories have appeared since the bombshell report about Apple’s offshore holdings first appeared on Nov. 6. The ICIJ has committed to releasing most of the source data for anyone to download and analyze, meaning that future discoveries could come from anywhere. The organization has also open-sourced Extract and Global I-Hub for use by other resource-strapped media organizations.

Does data journalism change the game? Veteran writer and editor Martha Hamilton thinks so. “It’s a new style of journalism,” she said. “This kind of collaboration wouldn’t have been possible 15 years ago.”

Image: International Consortium of Investigative Journalists
