PDFs - Editing / Extracting From...

Here is the lowdown on PDFs.


PDFs are not special documents; they can be edited and changed by anyone.
A PDF is really an "envelope": whatever is inside the PDF is what we can take out.

Extracting from a PDF

Any images and any text stored in a PDF can normally be extracted.
Text that only appears inside an image (a scanned page, for example) normally cannot be extracted as text.
Converting an image to text (OCR) is error-prone even with the most expensive software. External document management specialists are usually needed to do it with any accuracy, and it is usually done in bulk.


Here is a flow chart of the possibilities.






Techie corner
PDFs are "envelope" type documents in which all the text, images and other data sit in rectangular blocks positioned using a coordinate system. This is why they are good for printing.
Imagine a sheet of paper: the bottom left-hand corner is point 0,0 on the coordinate grid.
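
To make the coordinate grid a bit more concrete, here is a rough sketch in Python. It assumes the usual PDF default of measuring in points (72 to the inch) and an A4 page of roughly 595 x 842 points. The function names are made up for illustration and do not come from any PDF tool.

```python
# PDF coordinates start at the bottom-left corner of the page (0,0).
# Assumption: default units of points, 72 points to the inch.
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4

# Assumption: an A4 page, roughly 595 x 842 points.
A4_WIDTH_PT = 595
A4_HEIGHT_PT = 842


def points_to_mm(points):
    """Convert a length in PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def to_top_left_origin(x, y, page_height=A4_HEIGHT_PT):
    """Convert a bottom-left-origin point to a top-left-origin point.

    Screen and image tools usually count y downwards from the top,
    so only the y value needs flipping.
    """
    return (x, page_height - y)


if __name__ == "__main__":
    # The bottom-left corner of the page, seen from the top-left.
    print(to_top_left_origin(0, 0))
    # One inch expressed in millimetres.
    print(round(points_to_mm(72), 1))
```

The flip matters when comparing a position reported by a PDF tool with the same spot in an image editor, which usually measures from the top of the page.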

e.g.
If the PDF is a few pages scanned from an invoice, it will consist of
page1, image1, coords 0,0, page break
page2, image2, coords 0,0, page break
page3, image3, coords 0,0, page break
etc...

Since no text is involved, we cannot extract any. However, OCR might be able to recognise some of the text in the images.
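
The scanned invoice example can be sketched as a small Python model. This is only an illustration of the "envelope" idea: the Block class and both functions are hypothetical and do not read real PDF files.

```python
# Hypothetical model of what is inside the "envelope": each page is a
# list of blocks, and each block is either text or an image.
from dataclasses import dataclass


@dataclass
class Block:
    kind: str      # "text" or "image"
    x: float       # position on the page, measured from bottom-left
    y: float
    content: str   # the text itself, or an image name


def what_can_be_extracted(page):
    """Say what can be pulled out of one page (a list of Blocks)."""
    kinds = {block.kind for block in page}
    result = []
    if "text" in kinds:
        result.append("text")
    if "image" in kinds:
        result.append("images")
    return result


def needs_ocr(pages):
    """True if no page holds any real text, as with a scanned invoice."""
    return not any(block.kind == "text" for page in pages for block in page)


# The scanned invoice: one full-page image per page, no text blocks.
scanned_invoice = [
    [Block("image", 0, 0, "image1")],
    [Block("image", 0, 0, "image2")],
    [Block("image", 0, 0, "image3")],
]

if __name__ == "__main__":
    print(what_can_be_extracted(scanned_invoice[0]))
    print(needs_ocr(scanned_invoice))
```

A page typed in a word processor and saved as PDF would hold "text" blocks instead, so its text could be taken out directly without OCR.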


