Convert PDF or FDF to CSV?

8,208

Solution 1

How to convert FDF to CSV?

Option A:

awk -F "[()]" '{ if ($1=="/V ") value[$2];} \
    END {printf( "CompanyName\tEmailAddress\t\tCountryOrRegion\n" ); \
           for (x in value)printf("%s\t", x);print "" ; \
    }' filled_form.fdf > filled_form.CSV

Option B:

grep -oP '(?<=\/T \(txt).*(?=\))' filled_form.fdf |awk '{ORS=(NR%3?",":RS)}1'; \
grep -oP '(?<=\/V \().*(?=\))' filled_form.fdf |awk '{ORS=(NR%3?",":RS)}1'; 

Short version of above command would be:

paste -sd, <(grep -oP '(?<=\/T \(txt).*(?=\))' filled_form.fdf) <(grep -oP '(?<=\/V \().*(?=\))' filled_form.fdf)

Option C:

awk 'NR%2==0{type[$0]} NR%2{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(grep -oP '(?<=\/T \(txt|\/V \().*(?=\))' filled_form.fdf)

How to convert unfriendly CSV to friendly CSV ?

Option A:

awk -F: '{ if ($1=="FieldValue") value[$2];} \
    END {printf( "CountryOrRegion\tCompanyName\tEmailAddress\n" ); \
           for (x in value)printf("%s\t", x) ;print ""; \
    }' filled_form.csv > friendly_filled_form.CSV

Option B:

grep -oP '(?<=FieldName: txt).*' filled_form.csv |awk '{ORS=(NR%3?",":RS)}1'; \
grep -oP '(?<=FieldValue: ).*' filled_form.csv |awk '{ORS=(NR%3?",":RS)}1'

*Note that this command is one linear. so for running it you have to type/copy both line.

And short version for this one would be:

paste -sd, <(grep -oP '(?<=FieldName: txt).*' filled_form.csv) <(grep -oP '(?<=FieldValue: ).*' filled_form.csv)

Option C:

awk 'NR%2{type[$0]} NR%2==0{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(grep -oP '(?<=FieldName: txt|FieldValue: ).*' filled_form.csv)

or even this one:

awk 'NR%2{type[$0]} NR%2==0{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(awk -F'FieldValue: |FieldName: txt' 'NF>1{print $2}' filled_form.csv)

How to convert PDF to CSV?

I will complete the solution with pdfgrep tomorrow but if you want to try it for yourself, here is the command:

pdfgrep 'CompanyName|CountryOrRegion|EmailAddress' filled_form-submit_format_fdf.pdf

It needs to work on output format. If you want to get only whole words, use the -C 0 option. good luck and I hope helps ;)

Solution 2

Answering especially your comment here:

I wish Base could decode a PDF file that Base created

No, it is not reasonable to expect that any program should be able to read a PDF it generated, not any more than expecting a printer to be able to churn back a sheet it had just printed and giving you back your document.

PDF is a printing/archival format. It is basically(1) a series of instructions of the kind of "put the word hello at position x,y in the page". So when you see in a PDF:

example PDF

There is no way for a computer program to say if it's two rows by three columns or two rows by two columns containing spaces in it(2). The number 3 could be a 2.95 rounded up, even. You can't know because the info is simply not there. So no, you never think that going back from PDF to your data should be possible(3).

However, you can try to convert the information to things more manageable with, for example, pdftotext(4):

[romano:~/tmp] 1 % pdftotext Untitled1.pdf; cat Untitled1.txt
Sheet1

shiny
mint

new
used

3
1

Page 1

...and then use scripting, manual editing, whatever to massage it up.


Footnotes:

(1) really, really understatement here.

(2) It's 2 by 2, really.

(3) I have even seen PDFs where the creator program went long stretches to make this almost impossible, by putting each glyph independently and randomly in the page, to make tampering with it very difficult. No, I can't find the example any more...

(4) from the package poppler-utils

Share:
8,208

Related videos on Youtube

jtd
Author by

jtd

Updated on September 18, 2022

Comments

  • jtd
    jtd over 1 year

    It seems crazy, but while Libre Base can export a form as PDF, there is no way to import the completed form data from the PDF into Libre Base.

    One idea is to extract PDF form data to FDF (or unfriendly CSV) and then convert that to a friendly CSV at which point one can copy and paste CSV data into Libre Base.

    • How can I convert a PDF, FDF, or unfriendly CSV file to a friendly CSV with:
      • Row 1: "FieldName1, FieldName2, ..."
      • Row 2: "Value1, Value2, ..."

    Useful files on Ge.tt until 2015,02,25:

    1. Super basic blank pdf with "Submit format" as FDF
      • Libre Base > Forms > Edit > Open Database Object --> File > Export as PDF
    2. Filled PDF made from (1)
      • Type data manually
    3. Extracted CSV data from (2)
      • pdftk filled_form-submit_format_fdf.pdf dump_data_fields output filled_form.csv
    4. Extracted FDF data from (2)
      • pdftk filled_form-submit_format_fdf.pdf generate_fdf output filled_form.fdf

    *Note: You can also use Libre Base to export a blank form as XML or HTML, but I don't know how to input data into those files nor import that data back into Base.

    • αғsнιη
      αғsнιη over 9 years
      Could you provide a sample PDF file?
    • Rmano
      Rmano over 9 years
      ...because PDF is thought as a terminal, archive only format - like a print. There are literally millions of way to forge a PDF file with the same exact appearance, it would be really complex to parse all. Can't Libre Base save/export in other formats? Otherwise, you can try things like pdftotext and a bit (or a lot) of scripting....
    • tohuwawohu
      tohuwawohu over 9 years
      FDF is a good starting point, since it's text-based and has the form data in a clear fieldname/value structure. Using perl, for example, it shouldn't be too difficult to transform this to csv, but this is off-topic here (since it isn't ubuntu-specific).
    • jtd
      jtd over 9 years
      @Rmano: I wish Base could decode a PDF file that Base created. Base can export forms as PDF, HTML, or XML but I don't know how to ask others to input data into the HTML or XML files. That script is too difficult for me now. Thanks.
    • muru
      muru over 9 years
      @jtd you could have just rolled back my edit. The point of that edit was so that people wouldn't have to go there to download each file.
    • jtd
      jtd over 9 years
      @muru: Apologies. I thought AU automatically changed links, and since those API links failed for me, I changed them. However, I think I reached a solution that satisfies your concern.
    • αғsнιη
      αғsнιη over 9 years
      @jtd Ok. I see in your filled_form.csv file you have another fields like FieldType, FieldFlags and FieldJustification; how did you get them from pdf file? Does they are always same (I mean fixed value?). And last question: Does filled_form.csv is your final result that you want? please edit the question and add more details with your input(Which file?) and what result (from input file as output). Thanks
    • Max Wyss
      Max Wyss over 9 years
      Note that Libre Base creates a (flat) PDF Document when it exports a PDF. But what you want to import back into Base is Data. The way PDF makes Data available would be via form fields (which you don't have when exporting).
    • jtd
      jtd over 9 years
      @KasiyA: Question now demonstrates how to generate each file. I want to import data from 2, 3, or 4 (same data, different formats) into Base as a record. It is fine to use a new CSV as an intermediary as described in question (Row 1, Row 2, ...).
    • jtd
      jtd over 9 years
      @MaxWyss: I used pdftk (thanks @tohuwawohu) to extract form fields data from 2 and generate files 3 and 4. I still cannot get form fields data from 3 and 4 into Base as a record.
  • jtd
    jtd over 9 years
    Thanks for your answer. Although I know nothing about PDF coding, Base appears to generate a pdf that carries more than printing information. It carries the form fields information, i.e., FieldName1= "CompanyName," Value1= "" (and anyone can change the value of the latter). Other programs (pdftk) can read these values (see my files on Ge.tt). That is why I hoped and assumed, incorrectly, that Base could read the form fields information and import it.
  • jtd
    jtd over 9 years
    This is genius. I am going to play with radio buttons, checklists, etc., and see if pdfgrep or other options can handle that. Thank you!