ASP.NET - Convert PDF to TXT or HTML in C# with iTextSharp

Today I had to find a quick way to programmatically convert a bunch of PDF files into plain-text (TXT) format within an ASP.NET web application. Unfortunately, there aren't many open-source libraries that can do that.

After some time struggling with Google, I stumbled upon an old friend of mine - iTextSharp, a great PDF management library for ASP.NET that I used a while ago to fulfill a rather different task involving PDF parsing. By reading the updated SourceForge page I learned that the (once) open-source code has evolved into a commercial product called iText, available for Java and for .NET through a port which is still called iTextSharp. Luckily enough, iText also offers a Community Edition released under the AGPL license.

Long story short, I installed iTextSharp 5.5.13 from NuGet and used it to put together this simple helper class that extracts the text from any PDF file:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDF
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public static class PDFParser
    {
        /// <summary>
        /// Extracts the text from a PDF file.
        /// </summary>
        /// <param name="filePath">The full path to the PDF file.</param>
        /// <returns>The extracted text.</returns>
        public static string GetText(string filePath)
        {
            var sb = new StringBuilder();
            // PdfReader is disposable: the using block closes it,
            // so no explicit Close() call is needed.
            using (PdfReader reader = new PdfReader(filePath))
            {
                string prevPage = "";
                // Page numbers are 1-based.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // Use a fresh strategy for each page, since it accumulates the text it reads.
                    ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
                    var s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    // Skip a page whose text is identical to the previous one.
                    if (prevPage != s) sb.Append(s);
                    prevPage = s;
                }
            }
            // Exceptions are left to bubble up to the caller with their
            // original stack trace (a "throw e;" would reset it).
            return sb.ToString();
        }
    }
}
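
Since the original goal was to convert a bunch of PDF files, here is a minimal sketch of how the helper could be applied to a whole folder. The PdfBatchConverter class, the ConvertFolder method and its parameters are hypothetical names that are not part of iTextSharp; the text extractor is passed in as a delegate, so PDFParser.GetText (or any other extractor) can be plugged in.

```csharp
using System;
using System.IO;

public static class PdfBatchConverter
{
    /// <summary>
    /// Converts every *.pdf file found in sourceDir into a .txt file in targetDir.
    /// </summary>
    /// <param name="sourceDir">Folder containing the PDF files.</param>
    /// <param name="targetDir">Folder that will receive the TXT files.</param>
    /// <param name="extractText">Function that turns a PDF path into plain text.</param>
    /// <returns>The number of files written.</returns>
    public static int ConvertFolder(string sourceDir, string targetDir, Func<string, string> extractText)
    {
        // Does nothing if the target folder already exists.
        Directory.CreateDirectory(targetDir);

        int converted = 0;
        foreach (string pdfPath in Directory.EnumerateFiles(sourceDir, "*.pdf"))
        {
            // "report.pdf" becomes "report.txt" in the target folder.
            string txtName = Path.GetFileNameWithoutExtension(pdfPath) + ".txt";
            File.WriteAllText(Path.Combine(targetDir, txtName), extractText(pdfPath));
            converted++;
        }
        return converted;
    }
}
```

Assuming the PDFParser class above is referenced, a call could look like PdfBatchConverter.ConvertFolder(sourcePath, targetPath, PDF.PDFParser.GetText). A file that fails to parse will stop the loop with an exception, so a real application may want to catch and log errors per file.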

Needless to say, once we extract the plain-text we can easily format and/or style it using some fancy HTML markup in the following way:

public static string GetHTMLText(string sourceFilePath)
{
    var txt = PDFParser.GetText(sourceFilePath);
    var sb = new StringBuilder();
    // Wrap each line of the extracted text in its own paragraph.
    foreach (string s in txt.Split('\n'))
    {
        sb.AppendFormat("<p>{0}</p>", s);
    }
    return sb.ToString();
}
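
One thing to keep in mind is that the extracted text may contain characters such as <, > or &, which would break the markup if written as-is. The following is a minimal sketch of a safer variant, not the only way to do it: the HtmlTextFormatter class and the ToHtmlParagraphs method are hypothetical names, while WebUtility.HtmlEncode is the standard .NET encoder from System.Net. It works on any string, so it can be fed with the output of PDFParser.GetText.

```csharp
using System.Net;
using System.Text;

public static class HtmlTextFormatter
{
    /// <summary>
    /// Wraps each non-empty line of a plain text in an HTML-encoded paragraph.
    /// </summary>
    public static string ToHtmlParagraphs(string text)
    {
        var sb = new StringBuilder();
        foreach (string line in text.Split('\n'))
        {
            // Trim removes the '\r' left over from Windows line endings
            // as well as leading and trailing spaces.
            string trimmed = line.Trim();
            // Skip blank lines instead of emitting empty paragraphs.
            if (trimmed.Length == 0) continue;
            // Encode <, > and & so the text cannot break the markup.
            sb.Append("<p>").Append(WebUtility.HtmlEncode(trimmed)).Append("</p>");
        }
        return sb.ToString();
    }
}
```

For example, HtmlTextFormatter.ToHtmlParagraphs("a < b\r\n\r\nx & y") returns "<p>a &lt; b</p><p>x &amp; y</p>".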


That's about it: I sincerely hope that this simple class will help those who are looking for an easy way to convert PDF files into plain-text or HTML.


3 Comments on “ASP.NET - Convert PDF to TXT (Plain-Text) or HTML in C# with iTextSharp”

noobs101 says:

December 26, 2018 at 11:28

how about the images? how to convert pdf (with images) to html?

Ryan says:

December 26, 2018 at 14:08

Hello there,

this post is specifically about PDF-to-plain-text conversion: if PDF-to-HTML is what you’re looking for, check out the following post:

https://www.lifewire.com/pdf-to-html-conversion-tools-3469173

Subhash says:

June 8, 2020 at 22:02

Please give full example, after conversion pdf text to string, how we can display it in Big Text Editor as a proper html input in browser and after that how we can change the content and rewrite the pdf again, with new text.
