Data mining for good: the Panama Papers

Every time I accept or decline cookies on a website (based on about the same amount of logic I use to decide whether or not I want a receipt), I have the troubling feeling my ignorance will come back to bite me. So maybe that’s why, in an ignorance-is-bliss way, this week I’ve researched a positive instance of data analysis: the Panama Papers.

My awareness of the fear surrounding data mining began with the 2018 Cambridge Analytica scandal, when Facebook users’ personal data was secretly collected – or rather ‘harvested’, like an organ I didn’t know I had. However, the same data mining tools that tech corporations use for harvesting have also been turned on firms such as Mossack Fonseca, as in the case of the Panama Papers. That case provides a counterpoint to the data scare: digital technologies helped journalists make an enormous trove of documents legible and revealed financial fraud on an unprecedented scale.

Not to be confused with the Pandora Papers or the Paradise Papers (who is naming these??), the Panama Papers were a collection of 11.5 million documents, equalling 2.6 terabytes of data, leaked from the offshore Panamanian law firm Mossack Fonseca by an anonymous source to Süddeutsche Zeitung, a German newspaper that worked in collaboration with the ICIJ (International Consortium of Investigative Journalists). After a year of analysis, the results of the ICIJ’s investigation were published in April 2016. The Panama Papers have been described as the largest data leak in the history of journalism, and their analysis would not have been possible without digital technologies, owing to the sheer quantity of information.

The ICIJ team used several open-source components through a cross-platform tool called Extract. Initially, they had to identify duplicate documents to reduce the mass of information; the eDiscovery analytics software Nuix identified and removed a third of the data as duplicates. Next, OCR (optical character recognition) technology, via Apache Tika and Tesseract, was used to extract searchable text from PDFs, images, and other files. This turned the documents into machine-readable text, which was essential for the next step: identifying names within the data and arranging them into the nodes of an information network using the graph database Neo4j. There were over 840,000 nodes, but the graph database enabled journalists to easily uncover a person’s web of connections by linking, for instance, people with matching addresses or regular correspondence. This hugely sped up the ICIJ’s process of establishing relationships between the various individuals and companies whose shell corporations were being used for fraud, tax evasion and other financial crimes.
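To make those two core steps more concrete, here is a minimal sketch – emphatically not the ICIJ’s actual code – of how OCR and a graph database can work together. It assumes the pytesseract wrapper for Tesseract (rather than Tika), the official neo4j Python driver, a local Neo4j instance, and made-up file names, credentials, node labels, relationship types and sample data.

```python
# A minimal sketch, not the ICIJ's actual pipeline: OCR one scanned page with
# Tesseract, load (person, address) pairs into Neo4j, then ask the graph who
# shares an address. File names, credentials, labels and sample data are all
# illustrative assumptions.
from PIL import Image
import pytesseract                 # Python wrapper around the Tesseract OCR engine
from neo4j import GraphDatabase    # official Neo4j Python driver

# Step 1: turn a scanned document into machine-readable text.
page_text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(page_text[:200])  # first few characters of the recognised text

# Step 2: suppose names and addresses have been extracted from that text.
records = [
    ("Person A", "123 Example Street, Panama City"),
    ("Person B", "123 Example Street, Panama City"),
    ("Person C", "45 Another Road, Zurich"),
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One node per person, one per address, and a relationship between them.
    for name, address in records:
        session.run(
            "MERGE (p:Person {name: $name}) "
            "MERGE (a:Address {value: $address}) "
            "MERGE (p)-[:REGISTERED_AT]->(a)",
            name=name,
            address=address,
        )

    # Step 3: find pairs of people registered at the same address.
    result = session.run(
        "MATCH (p1:Person)-[:REGISTERED_AT]->(a:Address)"
        "<-[:REGISTERED_AT]-(p2:Person) "
        "WHERE p1.name < p2.name "
        "RETURN p1.name AS first, p2.name AS second, a.value AS address"
    )
    for row in result:
        print(f"{row['first']} and {row['second']} share {row['address']}")

driver.close()
```

The point is the shape of the workflow: once scanned pages become text and names become nodes, a one-line query can surface connections that would take a human researcher weeks to find by hand.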

In their statement, The Revolution Will Be Digitized, the anonymous whistle-blower who leaked the Panama Papers explained that, although it endangered their life, they were motivated to leak the data by growing global income inequality and the financial corruption enabled by firms such as Mossack Fonseca. As a direct consequence of the ICIJ’s work to decipher the leaked documents, Mossack Fonseca was forced to shut down, and several high-profile figures have been arrested or have resigned. After the discovery of an undeclared offshore company, Iceland’s Prime Minister Sigmundur Davíð Gunnlaugsson was forced to step down. Nawaz Sharif, former Prime Minister of Pakistan, was jailed for corruption. Wirecard’s COO Jan Marsalek is still a fugitive wanted by Interpol after the Panama Papers revealed his links to Russian intelligence. Without question, the Panama Papers represent a unique collaboration between investigative journalism and data analysis technologies on a global scale, and they give the same technologies used to harvest your personal information and mine something of a redemptive arc.