Extracting text from PDFs

Unwanted line breaks in text copied from PDF
Anybody working with information sooner or later have to copy and paste text from PDF-files. And anybody who has done that knows what a pain in the a** that is! You get actual line breaks from the visual line breaks in the PDF. The other day I began a job where I have to copy and paste text from a whole bunch of PDF files and it didn’t take long before I almost exploded with anger
So I thought: Why not make a simple application that extracts the text from the PDF and – to the most possible degree – normalizes the unwanted line breaks.
And then there was Textifyer
So I fired up Visual C# Express and started hacking. I soon found the PDFbox component – using IKVM.NET – and it didn’t take long before I had some code that actually extracted the text from a PDF file. (a PDF extraction in C# howto)
I figured out how to detect unwanted line breaks: Each line with an unwanted line break ends with a space character. Lines with a wanted line break doesn’t (in 99% of the cases). So it is just a matter of of looping over the lines and if it ends with a space skip adding a line break and just append it to the previous text buffer.

Unwanted line breaks removed
So now I just have to clean up the interface and bug test the program – which will happen automatically since I’m copy and paste from a whole bunch of PDFs at the moment. When I feel it’s working alright I will release the program. It’s really nothing hardcore about it anyway

Of course there’s drag-n-drop support!


