How to Convert a PDF File to Text in ASP.NET

In this post I will show how to convert a PDF document to a text file using pdftotext. pdftotext is an open-source command-line utility for converting PDF files to plain text files, i.e. extracting the text data from PDF files. It is freely available and included with many Linux distributions; on Windows it is installed as part of the Xpdf package. Download pdftotext before running the example, and place pdftotext.exe in the same folder as the page, because the code resolves it with Page.MapPath("pdftotext.exe").

Page markup:

<%@ Page Language="C#" AutoEventWireup="true" CodeFile="pdf2tex.aspx.cs" Inherits="pdf2tex"
   ValidateRequest="False" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
   <title>Untitled Page</title>
</head>
<body>
   <form id="form1" runat="server">
       <div>
           <asp:FileUpload ID="FileUpload1" runat="server" />
           <br />
           <asp:Button ID="btnRead" Text="Convert" runat="Server" OnClick="btnRead_Click" />
           <br />
           <asp:TextBox ID="txtContent" runat="Server" TextMode="MultiLine" Height="376px" Width="411px"></asp:TextBox>
       </div>
   </form>
</body>
</html>

Code-behind (pdf2tex.aspx.cs):

using System;
using System.Data;
using System.Configuration;
using System.Collections;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.IO;

public partial class pdf2tex : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {

    }
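    protected void btnRead_Click(object sender, EventArgs e)
    {
        if (!FileUpload1.HasFile)
        {
            return;
        }

        // PostedFile.FileName is only the client-side name, and the file does not
        // exist on the server until it is saved, so save the upload first.
        string pdfPath = Path.GetTempFileName();
        string outputPath = Path.GetTempFileName();
        try
        {
            FileUpload1.SaveAs(pdfPath);

            using (System.Diagnostics.Process p = new System.Diagnostics.Process())
            {
                p.StartInfo.FileName = Page.MapPath("pdftotext.exe");
                // Quote both paths so that spaces do not split the arguments.
                p.StartInfo.Arguments = "-raw -htmlmeta \"" + pdfPath + "\" \"" + outputPath + "\"";
                p.StartInfo.UseShellExecute = false;
                p.StartInfo.CreateNoWindow = true; // no console window flashing on the server
                p.StartInfo.RedirectStandardOutput = false;
                p.Start();
                p.WaitForExit(); // returns once pdftotext has exited, so no Sleep is needed
            }

            txtContent.Text = ReadFile(outputPath);
        }
        finally
        {
            // Remove the temporary files whether or not the conversion succeeded.
            File.Delete(pdfPath);
            File.Delete(outputPath);
        }
    }

    public string ReadFile(string s)
    {
        // Dispose the reader so the file handle is released before the file is deleted.
        using (StreamReader sr = new StreamReader(s))
        {
            return sr.ReadToEnd();
        }
    }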
}
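
As an alternative to writing an intermediate output file, pdftotext sends the text to standard output when the output file name is given as "-". The sketch below is only an illustration of that idea, not part of the original post: the class name PdfTextExtractor and the methods BuildArguments and Extract are hypothetical, and it assumes the PDF has already been saved on the server and that exePath points to pdftotext.exe.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not from the original post): runs pdftotext and
// captures its output from stdout instead of reading an intermediate file.
public static class PdfTextExtractor
{
    // Builds the argument string. The trailing "-" asks pdftotext to write
    // the text to stdout; the PDF path is quoted in case it contains spaces.
    public static string BuildArguments(string pdfPath)
    {
        return "-raw \"" + pdfPath + "\" -";
    }

    // Assumes exePath is the full path to pdftotext.exe and pdfPath is a
    // file that already exists on the server.
    public static string Extract(string exePath, string pdfPath)
    {
        ProcessStartInfo info = new ProcessStartInfo();
        info.FileName = exePath;
        info.Arguments = BuildArguments(pdfPath);
        info.UseShellExecute = false;        // required for stream redirection
        info.CreateNoWindow = true;          // no console window on the server
        info.RedirectStandardOutput = true;

        using (Process p = Process.Start(info))
        {
            // Read before WaitForExit so a full output buffer cannot block the child.
            string text = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return text;
        }
    }
}
```

In the button handler this could be called as txtContent.Text = PdfTextExtractor.Extract(Page.MapPath("pdftotext.exe"), savedPdfPath), where savedPdfPath is whatever server path the upload was saved to. Reading from stdout avoids a shared output file such as c:\output.htm, which two simultaneous requests would otherwise overwrite.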

Comments

  1. Hi,
    First of all, thanks for the post.
    I am using pdftohtml to convert PDF files into HTML. I have some issues:
    1. It works for some PDF files but not for others. By "not working" I mean that no content was added to the HTML file.
    2. Each time I execute the code from my ASP.NET page, a command prompt window flickers. Is that acceptable when the page is accessed from the web? We can ignore it for now, however.

    Thank you very much!

  2. Hi ashan,
    Check this article:
    http://aspdotnetcodebook.blogspot.com/2008/07/how-to-export-content-of-gridview-to.html

  3. Hi, I am not able to download the pdftotext.exe file from the URL given in the forums.

    Can you please email it to me as soon as possible at poojaverma05@gmail.com or vivekgumber@yahoo.com?

  4. It's not working at my end. This code is not writing anything to the output file.

    Please email me any possible problem I might be having, or give me your ID so that I can email you the file I am using to convert.

  5. This code is a great utility. Initially it might give you a hard time, but it works well.
    Santosh was a great help in making it work for me.

    Many, many thanks to him.

  6. Hi Santosh, my problem is that I am converting the PDF file to HTM as per your example, but it only generates a 0 KB file with no content. Can you give me a clue why that is, or what mistake I am making?

    Thanks for the help.

  7. This comment has been removed by the author.
  8. Hi Rakesh,
    The reason is that, for security reasons, IE 7 and 8 and Mozilla give only the file name, not the full path, in:
    FileUpload1.PostedFile.FileName

    Place a breakpoint on that statement and check the path of the file.

  9. Nitin Majgaonkar, April 5, 2012 at 9:52 PM

    Can anyone suggest a DLL or component that can convert any file format (doc, docx, pdf, txt) to HTML?

    Thanks,
    Nitin

  10. Hi Santosh,
    I am able to convert PDF to HTML, but the content is not coming through.

    "FileUpload1.PostedFile.FileName" gives the file name.

    "FileUpload1.PostedFile" gives the length, but ReadTimeout and WriteTimeout show the following when I look at them in Quick Watch:
    ReadTimeout '(FileUpload1.PostedFile.InputStream).ReadTimeout' threw an exception of type 'System.InvalidOperationException' int {System.InvalidOperationException}

    I have already checked it with Chrome as well.

    Can you suggest a solution, or tell me where I made a mistake?

  11. Friend, it's not working!
    Please send me the detailed code.

  12. I went to http://www.foolabs.com/xpdf/download.html and can't find pdftotext.exe. Where is it?

  13. I have found a C#/.NET library that can convert PDF files to text files and vice versa, known as Aspose.PDF for .NET. Below is the link if anyone wants to try it:

    http://www.aspose.com/.net/pdf-component.aspx

  14. PDF Focus .Net is a C#/.NET library that provides an API to convert PDF to other formats (text, RTF, HTML, images). It costs much less than its competitors; the price starts at $199.00.

    To get the ball rolling, here is a small example:

    SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
    f.OpenPdf(@"c:\Invoice.pdf");
    f.ToHtml(@"c:\Web-Invoice.html");

    http://www.sautinsoft.com/products/pdf-focus/index.php
