Is there an easy way to manually decode a FlateDecode Filter to extract text in a PDF? C#

Solution 1

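using System;
using System.IO;
using System.IO.Compression;
using System.Text;

private static string decompress(byte[] input)
{
    // Skip the 2-byte zlib (RFC 1950) header; DeflateStream expects raw deflate (RFC 1951) data.
    byte[] cutinput = new byte[input.Length - 2];
    Array.Copy(input, 2, cutinput, 0, cutinput.Length);

    using (var stream = new MemoryStream())
    using (var compressStream = new MemoryStream(cutinput))
    using (var decompressor = new DeflateStream(compressStream, CompressionMode.Decompress))
    {
        decompressor.CopyTo(stream);

        // Encoding.Default is platform dependent; use another encoding if the decoded text looks wrong.
        return Encoding.Default.GetString(stream.ToArray());
    }
}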

According to the similar question linked below, the first 2 bytes of the stream have to be cut off before decompression. The function above does this. Pass all the bytes of the stream as input, and make sure the byte count matches the /Length value specified in the stream dictionary.

C# decode (decompress) Deflate data of PDF File
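
Before dropping the first 2 bytes, it can help to confirm that they really are a zlib (RFC 1950) header. The following is a minimal sketch, not part of the original answer; the class and method names (ZlibHeader, LooksLikeZlib) are made up for illustration. It checks that the compression method is deflate and that the two bytes, read as a 16-bit value, are a multiple of 31, as RFC 1950 requires.

```csharp
using System;

static class ZlibHeader
{
    // Returns true when the first two bytes look like an RFC 1950 zlib header.
    public static bool LooksLikeZlib(byte[] data)
    {
        if (data == null || data.Length < 2)
            return false;

        int cmf = data[0]; // compression method and window size
        int flg = data[1]; // flags, including the FCHECK bits

        bool isDeflate = (cmf & 0x0F) == 8;          // method 8 means deflate
        bool checkOk = ((cmf << 8) | flg) % 31 == 0; // 16-bit value must be a multiple of 31

        return isDeflate && checkOk;
    }
}
```

For example, ZlibHeader.LooksLikeZlib(new byte[] { 0x78, 0x9C }) returns true, since 0x78 0x9C is a common zlib header. If the check fails, the data is probably not what the /Filter entry claims, or the slice does not start at the first byte after the stream keyword and its line break.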

Solution 2

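The easiest solution is to use the DeflateStream class provided by the .NET Framework. An example can be found in a similar thread. This approach might have some pitfalls, for example the 2-byte zlib header at the start of the data, which DeflateStream does not expect.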

If this doesn't work, there are libraries (like DotNetZip) capable of deflate stream decompression. Please check this link for a performance comparison.

The last option I see, without reinventing the wheel, is to use another PDF parsing library for the stream decompression, or even for the whole PDF processing.

Author by

greentea

Updated on June 13, 2022

Comments

  • greentea
    greentea almost 2 years

    I posted a question related to this a while back but got no responses. Since then, I've discovered that the PDF is encoded using FlateDecode, and I was wondering if there is a way to manually decode the PDF in C# (Windows Phone 8)? I'm getting output like the following:

    %PDF-1.5
    %????
    1 0 obj
    <<
    /Type /Catalog
    /Pages 2 0 R
    >>
    endobj
    5 0 obj
    <<
    /Filter /FlateDecode
    /Length 9
    >>
    stream x^+
    

    The PDF has been created using the SyncFusion PDF controls for Windows Phone 8. Unfortunately, they do not currently have a text extraction feature, and I couldn't find that feature in other WP PDF controls either.

    Basically, all I want is to download the PDF from OneDrive and read the PDF contents. Curious if this is easily doable?

  • Nacht
    Nacht over 4 years
    +1 for mentioning that you have to trim the first 2 bytes... It works! How would we possibly know that???
  • Pete
    Pete over 4 years
    Thanks, yes, in the other thread user1011394 explains it has something to do with RFC 1951 versus RFC 1950. I did some more research and found out these 2 bytes are the RFC 1950 (ZLIB) framing bytes. Maybe another way is to use the ZLIB library.
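
    As a sketch of the zlib route mentioned above: on .NET 6 or later (an assumption; this class is not available on Windows Phone 8 or the classic .NET Framework), System.IO.Compression.ZLibStream reads the RFC 1950 framing itself, so the 2 bytes do not need to be trimmed by hand. The class and method names (FlateDecoder, Inflate) are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

static class FlateDecoder
{
    // Assumes .NET 6 or later, where ZLibStream is available.
    // Takes the complete stream bytes, including the 2-byte zlib header.
    public static byte[] Inflate(byte[] zlibData)
    {
        using (var input = new MemoryStream(zlibData))
        using (var zlib = new ZLibStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            zlib.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

    The result is returned as bytes rather than a string, because how PDF content stream bytes map to text depends on the fonts and encodings used in the document.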