Everyone working with data knows the problem: you have found some interesting data for a journalistic project, or statistics for a nice map, but the data is messy and hidden in PDF files that your program cannot read automatically. Normally you have to copy out and clean up the data by hand. But there are tools for that…
OK, let’s say you found some data in a PDF file on the Internet, but the publisher didn’t really want anybody to ever be able to use it. The data is hidden in badly formatted tables, and nobody cared about consistency. This is often an easy way for companies or governments to pretend they are open with their data. In reality, no one wants to clean up that mess manually just to dig into the data and check it with statistical methods.
Today I found some nice pieces of software that aim to remove this limitation and give data journalists a way to set the data free. I want to introduce you to “Tabula” and “OpenRefine” (formerly Google Refine).
Below I give short descriptions of the programs and how to use them, but due to limited time I have not tested them myself.
Extracting tabular data from PDF files with “Tabula”
For downloads and installation instructions, please go to http://tabula.nerdpower.org/
- Upload a file with tables you would like to copy.
- Draw a box around the area of the table you would like to copy. (Note: currently, Tabula can’t select tables over multiple pages)
- You will be given the option to copy the table as CSV (comma-separated values) or to download it as a CSV or TSV (tab-separated values) file. If you notice any errors in the table, you can edit the selected text before copying or downloading.
- Now you can work with your data in a spreadsheet or text file rather than a PDF!
Note: Tabula only works on text-based PDFs at this time, not scanned documents.
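Once the table has been exported, reading it into a program takes only a few lines. The following is a minimal Python sketch using only the standard library. It is not part of Tabula: the sample text, the column names and the helper `read_table` are invented for illustration and stand in for a CSV or TSV file downloaded from Tabula.

```python
import csv
import io

# Invented sample standing in for a CSV file exported from Tabula.
SAMPLE_CSV = """Town,Population,Year
Alphaville,"12,500",2012
Betatown,"8,300",2012
"""

def read_table(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

rows = read_table(SAMPLE_CSV)
for row in rows:
    print(row["Town"], row["Population"])
```

For a TSV download, the same helper could be called with `delimiter="\t"`. To read a real file, pass the contents of the file instead of `SAMPLE_CSV`.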
Clean up data sheets with “OpenRefine”
For downloads and installation instructions, please go to http://openrefine.org/
OpenRefine is a former Google project that was released as open source software and is now developed by the community. It is a sophisticated program that loads tabular data and helps you clean up inconsistencies, wrong data types and other common problems. See it in action in the following videos from the project homepage:
Clean and Transform Data
Reconcile / Match
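To make “data inconsistencies” and “data type failures” a bit more concrete, here is a small Python sketch of the kind of cleanup meant above. It does not use OpenRefine and does not show how OpenRefine works internally. The rows, the column names and the helper functions are made up for illustration.

```python
import re

def clean_text(value):
    """Trim the value and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", value).strip()

def parse_number(value):
    """Turn strings like ' 12,500 ' into an int; return None if not numeric."""
    digits = value.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None

def clean_rows(rows):
    """Normalize the spelling of names and convert numbers to integers."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "town": clean_text(row["town"]).title(),
            "population": parse_number(row["population"]),
        })
    return cleaned

# Invented messy rows: inconsistent case, stray spaces, a non-numeric cell.
messy = [
    {"town": "  alphaville ", "population": "12,500"},
    {"town": "ALPHAVILLE", "population": "n/a"},
    {"town": "beta   town", "population": " 8300 "},
]

result = clean_rows(messy)
for row in result:
    print(row)
```

After cleaning, both spellings of the first town become the same string, so they can be grouped, and the cell that is not a number is marked as missing instead of breaking later calculations.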