We evaluated other great open source solutions like Tabula, but they just couldn’t handle the structure of some of the PDFs we needed to scrape. The Apache PDFBox™ library is an open source Java tool for working with PDF documents.

CaLi2CoPi is a multiplatform PDF parser library programmed in PostScript. PDFMiner has an extensible PDF parser that can be used for purposes other than text analysis. PDF_Parser is available as a free download.

PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML).

PDFjet Open Source Edition is a library for dynamic generation of PDF documents from Java and .NET. It has the following features:
* Drawing support for points, lines, boxes, circles, Bezier curves, polygons, stars, complex paths and shapes.
* Text: Unicode support, text kerning when using the Helvetica and Times-Roman families of core fonts, embedding hyperlinks.

Apache PDFBox allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. It also includes several command-line utilities and is published under the Apache License v2.0.

CaLi2CoPi also supports online execution through a web-based user interface via Ghostscript. PDF_Parser is a PDF parser for a security analyzer.

Fillable PDF form processing: extract data from PDF form fields, in bulk, with our batch parsing software. Automate your PDF form processing for download, or for integration wherever you need the data.

Other parsing packages come with drawbacks:
1. pdf2json: buggy, no support anymore, memory leak, throws non-catchable fatal errors
2. j-pdfjson: fork of pdf2json
3. pdf-parser: buggy, no tests
4. pdfreader: uses pdf2json
5. pdf-extract: not cross-platform, uses xpdf

pdfid.py and pdf-parser.py are two PDF tools by Didier Stevens written in Python. Their background is also to help explore malicious PDFs, but I also find them useful for analyzing the structure and contents of benign PDF files. For example, they can extract the uncompressed stream of PDF object no. 5 into a *.dump file.
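To make the idea of dumping one object's uncompressed stream more concrete, here is a minimal sketch in plain Python. It is not how pdf-parser.py is implemented; it is a simplified illustration that assumes the object can be found by scanning the file for its "obj" keyword and that its stream is either unfiltered or compressed with a single /FlateDecode filter. The function names extract_stream and dump_object_stream are hypothetical and were chosen for this example.

```python
import re
import zlib


def extract_stream(pdf_bytes: bytes, obj_num: int, gen_num: int = 0) -> bytes:
    """Return the decoded stream of indirect object `obj_num gen_num`.

    Sketch only: handles unfiltered and single-filter /FlateDecode streams.
    """
    # Naive scan for "<num> <gen> obj ... endobj"; a real parser would follow
    # the cross-reference table instead. The lookbehind keeps "15 0 obj"
    # from matching a request for object 5.
    obj_re = re.compile(
        rb"(?<!\d)%d\s+%d\s+obj\b(.*?)endobj" % (obj_num, gen_num), re.DOTALL
    )
    obj_match = obj_re.search(pdf_bytes)
    if obj_match is None:
        raise ValueError(f"object {obj_num} {gen_num} not found")

    body = obj_match.group(1)
    stream_match = re.search(rb"stream\r?\n(.*?)endstream", body, re.DOTALL)
    if stream_match is None:
        raise ValueError(f"object {obj_num} {gen_num} has no stream")

    header = body[: stream_match.start()]  # the stream dictionary
    raw = stream_match.group(1)

    if b"/FlateDecode" in header:
        # decompressobj stops at the end of the zlib data, so the end-of-line
        # marker that precedes "endstream" is ignored.
        return zlib.decompressobj().decompress(raw)

    # Unfiltered stream: drop the single end-of-line marker before "endstream".
    if raw.endswith(b"\r\n"):
        return raw[:-2]
    if raw.endswith((b"\n", b"\r")):
        return raw[:-1]
    return raw


def dump_object_stream(pdf_path: str, obj_num: int, dump_path: str) -> int:
    """Write the decoded stream of `obj_num` to `dump_path`; return bytes written."""
    with open(pdf_path, "rb") as src:
        data = extract_stream(src.read(), obj_num)
    with open(dump_path, "wb") as dst:
        return dst.write(data)
```

A call such as dump_object_stream("my.pdf", 5, "obj5.dump") would then write the decoded stream of object 5 to a dump file; both file names are placeholders. Encrypted files, objects stored inside object streams, and other filters are out of scope for this sketch, which is where a dedicated tool earns its keep.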

“Textricator is both flexible and powerful and has cut the time we spend to process large datasets from days to hours,” says Andrew Branch, director of technology.

CaLi2CoPi works with several specialized switches to verify, add, extract, or change any PDF content.