<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pontus Östlund &#187; Textifyer</title>
	<atom:link href="http://www.poppa.se/blog/tag/textifyer/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.poppa.se/blog</link>
	<description>My blog about web development and such</description>
	<lastBuildDate>Mon, 16 Jan 2012 00:38:45 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Extracting text from PDFs</title>
		<link>http://www.poppa.se/blog/extracting-text-from-pdfs/</link>
		<comments>http://www.poppa.se/blog/extracting-text-from-pdfs/#comments</comments>
		<pubDate>Mon, 11 Jan 2010 16:24:09 +0000</pubDate>
		<dc:creator>Pontus</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[Textifyer]]></category>

		<guid isPermaLink="false">http://www.poppa.se/blog/?p=355</guid>
		<description><![CDATA[
Unwanted line breaks in text copied from PDF
Anybody working with information sooner or later has to copy and paste text from PDF files. And anybody who has done that knows what a pain in the a** that is! The visual line breaks in the PDF become actual line breaks in the pasted text. The other day I began [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/blog/data/images/textifyer-3.png/680" alt="Unwanted line breaks in text copied from PDF" /><br />
<small><em>Unwanted line breaks in text copied from PDF</em></small></p>
<p>Anybody working with information sooner or later has to copy and paste text from PDF files. And anybody who has done that knows what a pain in the a** that is! The visual line breaks in the PDF become actual line breaks in the pasted text. The other day I began a job where I have to copy and paste text from a whole bunch of PDF files, and it didn&#8217;t take long before I almost exploded with anger <img src='http://www.poppa.se/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p><strong>So I thought:</strong> Why not make a simple application that extracts the text from the PDF and, as far as possible, normalizes the unwanted line breaks?</p>
<h2>And then there was Textifyer</h2>
<p>So I fired up <em>Visual C# Express</em> and started hacking. I soon found the <a href="http://www.pdfbox.org">PDFBox</a> component, which can be used from .NET through <a href="http://ikvm.net">IKVM.NET</a>, and it didn&#8217;t take long before I had some code that actually extracted the text from a PDF file (see this <a href="http://www.codeproject.com/KB/string/pdf2text.aspx">how-to on PDF text extraction in C#</a>).</p>
<p>I figured out how to detect unwanted line breaks: Each line with an unwanted line break ends with a space character. Lines with a wanted line break don&#8217;t (in 99% of the cases). So it is just a matter of looping over the lines: if a line ends with a space, skip adding a line break and just append the line to the text buffer. </p>
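<p>As a rough sketch of that loop in C# (not the actual Textifyer code; the class and method names are made up for illustration, and it assumes the extracted text is already available as a single string):</p>
<pre><code>using System;
using System.Text;

// Hypothetical helper, not taken from Textifyer itself.
static class LineBreakNormalizer
{
    // Joins every line that ends with a space (assumed to be a visual
    // wrap in the PDF) with the line that follows it. All other line
    // breaks are kept as they are.
    public static string Normalize(string text)
    {
        var buffer = new StringBuilder();
        string[] lines = text.Replace("\r\n", "\n").Split('\n');

        for (int i = 0; i != lines.Length; i++)
        {
            string line = lines[i];
            buffer.Append(line);

            // No break after the last line, and none after a line that
            // ends with a space: the next line continues the sentence.
            bool isLast = i == lines.Length - 1;
            if (isLast || line.EndsWith(" ", StringComparison.Ordinal))
                continue;

            buffer.Append('\n');
        }

        return buffer.ToString();
    }
}
</code></pre>
<p>With this sketch, <code>LineBreakNormalizer.Normalize("foo \nbar\nbaz")</code> gives <code>foo bar</code> on the first line and <code>baz</code> on the second. A line that really ends a paragraph but happens to have a trailing space would still be joined with the next one, which is the remaining 1% of the cases.</p>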
<p><img src="/blog/data/images/textifyer-2.png/680" alt="Unwanted line breaks removed" /><br />
<small><em>Unwanted line breaks removed</em></small></p>
<p>So now I just have to clean up the interface and bug test the program &#8211; which will happen automatically since I&#8217;m copying and pasting from a whole bunch of PDFs at the moment. When I feel it&#8217;s working all right I will release the program. There&#8217;s really nothing hardcore about it anyway <img src='http://www.poppa.se/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p><img src="/blog/data/images/textifyer.png/680" alt="Textifyer: Drag-n-drop enabled" /><br />
<small><em>Of course there&#8217;s drag-n-drop support!</em></small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.poppa.se/blog/extracting-text-from-pdfs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

