ASP.NET – Convert PDF to TXT (Plain-Text) or HTML in C# with iTextSharp

A useful C# code snippet to convert PDF files into plain text (TXT) or HTML with iTextSharp, an open-source PDF management library for ASP.NET


Today I had to find a quick way to programmatically convert a bunch of PDF files into plain-text (TXT) format within an ASP.NET web application. Unfortunately, there aren’t many open-source libraries that can do that.

After some time struggling with Google, I stumbled upon an old friend of mine: iTextSharp, a great PDF management library for ASP.NET that I used a while ago to fulfill a rather different task involving PDF parsing. Reading the updated SourceForge page, I learned that the (once) open-source code has evolved into a commercial product called iText, available for Java and, through a .NET port of the Java library that is still called iTextSharp, for .NET as well. Luckily enough, iText also offers a Community Edition released under the AGPL license.

Long story short, I installed iTextSharp 5.5.13 from NuGet and used it to put together a simple helper class that extracts the text from any PDF file.
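A minimal sketch of such a helper, written against the iTextSharp 5.5.x API, might look like the following. The class and method names (`PdfTextHelper`, `ExtractText`) are hypothetical and chosen only for illustration, and the use of `SimpleTextExtractionStrategy` is an assumption; `LocationTextExtractionStrategy` is an alternative that may give better results on documents with a complex layout.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class: the names are illustrative, not part of iTextSharp.
public static class PdfTextHelper
{
    // Returns the text of every page in the PDF at pdfPath, one page after another.
    public static string ExtractText(string pdfPath)
    {
        if (string.IsNullOrEmpty(pdfPath))
            throw new ArgumentException("A PDF file path is required.", "pdfPath");

        var text = new StringBuilder();

        // PdfReader is disposable in iTextSharp 5.x, so "using" releases the file handle.
        using (var reader = new PdfReader(pdfPath))
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Assumption: a fresh strategy per page, so text from one page
                // never leaks into the next.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }

        return text.ToString();
    }
}
```

Calling `PdfTextHelper.ExtractText(path)` returns a single string with the pages in order. Keep in mind that only PDF files with an actual text layer will yield text: scanned documents that contain nothing but images would need OCR, which this sketch does not cover.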

Needless to say, once we have extracted the plain text, we can easily format and/or style it with some fancy HTML markup.
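One possible way to do that, shown here as a sketch rather than a definitive implementation, is to HTML-encode the extracted text, wrap each blank-line-separated block in a `<p>` element and turn the remaining line breaks into `<br />` tags. The `PdfTextFormatter` class and its `ToHtml` method are hypothetical names; the only framework call involved is `System.Net.WebUtility.HtmlEncode`.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical formatter class: the names are illustrative only.
public static class PdfTextFormatter
{
    // Converts plain text into simple HTML paragraphs.
    public static string ToHtml(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return string.Empty;

        // Normalize Windows and old Mac line endings to "\n".
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');

        var html = new StringBuilder();

        // Assumption: a blank line separates two paragraphs.
        string[] blocks = normalized.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string block in blocks)
        {
            string trimmed = block.Trim();
            if (trimmed.Length == 0)
                continue;

            // Encode first, so that characters such as '<' or '&' in the PDF text
            // cannot break (or inject) markup; then keep single line breaks visible.
            string encoded = WebUtility.HtmlEncode(trimmed).Replace("\n", "<br />");
            html.Append("<p>").Append(encoded).Append("</p>\n");
        }

        return html.ToString();
    }
}
```

Encoding before adding the tags matters whenever the resulting HTML is rendered in a page: text extracted from an arbitrary PDF should be treated as untrusted input.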

That’s about it: I sincerely hope that this simple class will help those who are looking for an easy way to convert PDF files into plain text or HTML.



About Ryan

IT Project Manager, Web Interface Architect and Lead Developer for many high-traffic web sites & services hosted in Italy and Europe. Since 2010 he has also been a lead designer of many apps and games for Android, iOS and Windows Phone mobile devices for a number of Italian companies.
