open source tool for liberating data tables trapped inside PDF files

From Mozilla
opennews.org

Introducing Tabula
Upload a PDF, get back tabular CSV data. Poof!

April 3, 2013
By Manuel Aristarán, Mike Tigas…

As many developers and data reporters know, dealing with data tables in Adobe Acrobat PDF files is a pain in the rear (to put it lightly). Part of the problem is that PDF is not a data format so much as an electronic paper format. Another part: existing extraction tools, such as xpdf/Poppler’s pdftotext, aren’t designed for data tables and aren’t exactly human-friendly.

Jeremy B. Merrill recently wrote a first-hand account of some of the difficulties ProPublica encountered as they released a massive update to their Dollars for Docs interactive database. During this project, ProPublica used an internally-developed command-line utility named Farrago, which utilized computer vision techniques to detect and extract data from tables in PDF files.