How to convert a PDF file to Excel in C#


Solution 1

You absolutely do not have to convert the PDF to Excel. First, determine whether your PDF contains textual data or is a scanned image. A scanned image has no text layer, so text extraction will return nothing useful and you would need OCR instead. If the PDF contains textual data, you are right to use a free library. I recommend iTextSharp, as it is popular and easy to use.

Now the controversial part. If you don't need a rock-solid solution, the easiest approach is to read the whole PDF into a string and then retrieve the emails with a regular expression.
Here is an example (not perfect) of reading a PDF with iTextSharp and extracting emails:

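using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Reads the text of every page of the PDF into a single string.
public string PdfToString(string fileName)
{
    var sb = new StringBuilder();
    var reader = new PdfReader(fileName);
    try
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page so text does not carry over between pages.
            var strategy = new SimpleTextExtractionStrategy();
            // GetTextFromPage already returns a .NET string, so no re-encoding is needed;
            // a round trip through Encoding.Default would turn non-ANSI characters into '?'.
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
    }
    finally
    {
        // Release the file even if extraction throws.
        reader.Close();
    }
    return sb.ToString();
}

// Adjust the expression as needed; it assumes each email sits between these two labels.
Regex emailRegex = new Regex("Email Address (?<email>.+?) Passport No");

public IEnumerable<string> ExtractEmails(string content)
{
    foreach (Match m in emailRegex.Matches(content))
    {
        yield return m.Groups["email"].Value;
    }
}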
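
The label-based pattern above only works when every address sits between "Email Address" and "Passport No" on one line. If the extracted text is laid out differently, a looser pattern that matches anything shaped like an email address may be easier to start from. The following is a minimal sketch, not a definitive implementation: the class name EmailExtractor, the method ExtractAll, the pattern itself, and the sample text are all assumptions made for illustration, and the pattern is deliberately simple rather than a full address validator.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailExtractor
{
    // Hypothetical, deliberately simple pattern: local part, '@', domain, dot, TLD.
    // It is not a full RFC 5322 validator; tighten or loosen it for your data.
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");

    // Returns each distinct address (compared case-insensitively) in order of first appearance.
    public static List<string> ExtractAll(string content)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (Match m in EmailPattern.Matches(content))
        {
            if (seen.Add(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample standing in for the string returned by PdfToString.
        string sample =
            "Name John Email Address john@example.com Passport No 123\n" +
            "Name Ann Email Address ann.lee@example.org Passport No 456\n" +
            "Name John Email Address JOHN@example.com Passport No 123";

        foreach (string email in ExtractAll(sample))
        {
            Console.WriteLine(email);
        }
    }
}
```

Deduplicating before sending is worth the extra lines here, since the same person can appear in more than one table row and would otherwise receive the email twice.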

Solution 2

With the ByteScout PDF Extractor SDK, you can extract a whole page to CSV, as shown below.

CSVExtractor extractor = new CSVExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";

TableDetector tdetector = new TableDetector();
tdetector.RegistrationKey = "demo";
tdetector.RegistrationName = "demo";

// Load the document
extractor.LoadDocumentFromFile("C:\\sample.pdf");
tdetector.LoadDocumentFromFile("C:\\sample.pdf");

int pageCount = tdetector.GetPageCount();

for (int i = 1; i <= pageCount; i++)
{
    int j = 1;

    do
    {
        // Use the whole page rectangle as the extraction area
        extractor.SetExtractionArea(
            tdetector.GetPageRect_Left(i),
            tdetector.GetPageRect_Top(i),
            tdetector.GetPageRect_Width(i),
            tdetector.GetPageRect_Height(i)
        );

        // and finally save the table into a CSV file
        extractor.SavePageCSVToFile(i, "C:\\page-" + i + "-table-" + j + ".csv");
        j++;
    } while (tdetector.FindNextTable()); // search for the next table
}
Author by

yasmeen soubhi

Updated on August 06, 2020

Comments

  • yasmeen soubhi almost 4 years

    I want to extract some data, such as email addresses, from a table in a PDF file, and then use the extracted addresses to send email to those people.

    What I have found so far through searching the web:

    1. I have to convert the PDF file to Excel so that I can read the data easily and use it as I want.

    2. I found some free DLLs, such as iTextSharp and PDFsharp.

    But I couldn't find any code snippet that shows how to do this in C#. Is there a solution?