Ryadel

ASP.NET - Convert PDF to TXT (Plain-Text) or HTML in C# with iTextSharp


Today I had to find a quick way to programmatically convert a bunch of PDF files into plain-text (TXT) format within an ASP.NET web application. Unfortunately, there aren't many open-source libraries that can do that.

After some time struggling with Google, I stumbled upon an old friend of mine - iTextSharp, a great PDF management library for ASP.NET that I used a while ago to fulfill a rather different task involving PDF parsing. By reading the updated SourceForge page I learned that the (once) open-source code has evolved into a commercial product called iText, available for Java and, through a .NET port which is still called iTextSharp, for .NET as well. Luckily enough, iText also offers a Community Edition released under the AGPL license.

Long story short, I installed iTextSharp 5.5.13 from NuGet and used it to put together this simple helper class that extracts the text from any PDF file:
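A minimal sketch of such a helper, assuming iTextSharp 5.5.x is referenced through NuGet. The class name `PdfTextHelper` and the method name `ExtractText` are placeholders chosen for this example; `PdfReader`, `PdfTextExtractor.GetTextFromPage` and `SimpleTextExtractionStrategy` come from the iTextSharp 5 API:

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper: the class and method names are illustrative only.
public static class PdfTextHelper
{
    // Returns the text of every page of the PDF at pdfPath, one page after another.
    public static string ExtractText(string pdfPath)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy for each page: a reused strategy instance
                // keeps the text it collected from the previous pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            // Release the underlying file handle.
            reader.Close();
        }
        return text.ToString();
    }
}
```

A call such as `string txt = PdfTextHelper.ExtractText(@"C:\docs\sample.pdf");` (the path is a made-up example) returns the whole document as a single string. For PDFs with multi-column or otherwise complex layouts, swapping in `LocationTextExtractionStrategy` may give a more natural reading order.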

Needless to say, once we extract the plain-text we can easily format and/or style it using some fancy HTML markup in the following way:
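One possible way to do it, shown here as a sketch rather than a definitive implementation: HTML-encode the extracted text, wrap each blank-line-separated block in a `<p>` element and turn the remaining line breaks into `<br />` tags. The class name `PlainTextHtmlFormatter` and the method name `ToHtml` are hypothetical; `WebUtility.HtmlEncode` is part of the .NET base class library:

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical formatter: the class and method names are illustrative only.
public static class PlainTextHtmlFormatter
{
    // Converts plain text into simple HTML paragraphs.
    public static string ToHtml(string plainText)
    {
        if (string.IsNullOrEmpty(plainText)) return string.Empty;

        // Normalize Windows and old Mac line endings to '\n'.
        string normalized = plainText.Replace("\r\n", "\n").Replace('\r', '\n');

        var html = new StringBuilder();
        // A blank line (two consecutive line breaks) separates paragraphs.
        string[] blocks = normalized.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string block in blocks)
        {
            string trimmed = block.Trim('\n');
            if (trimmed.Length == 0) continue;

            // Encode first, so that characters such as '<' or '&' found in the
            // PDF text cannot break the markup; then add the line-break tags.
            string encoded = WebUtility.HtmlEncode(trimmed).Replace("\n", "<br />");
            html.Append("<p>").Append(encoded).Append("</p>");
        }
        return html.ToString();
    }
}
```

The result can then be styled with CSS like any other HTML fragment. Encoding before inserting the tags matters because text extracted from a PDF can contain arbitrary characters.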

That's about it: I sincerely hope that this simple class will help those who are looking for an easy way to convert PDF files into plain-text or HTML.

 
