Hive load CSV with commas in quoted fields

112,468

Solution 1

The problem is that Hive doesn't handle quoted texts. You either need to pre-process the data by changing the delimiter between the fields (e.g: with a Hadoop-streaming job) or you can also give a try to use a custom CSV SerDe which uses OpenCSV to parse the files.

Solution 2

If you can re-create or parse your input data, you can specify an escape character for the CREATE TABLE:

ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';

Will accept this line as 4 fields

1,some text\, with comma in it,123,more text

Solution 3

As of Hive 0.14, the CSV SerDe is a standard part of the Hive install

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)

Share:
112,468
Martijn Lenderink
Author by

Martijn Lenderink

Updated on July 09, 2022

Comments

  • Martijn Lenderink
    Martijn Lenderink almost 2 years

    I am trying to load a CSV file into a Hive table like so:

    CREATE TABLE mytable
    (
    num1 INT,
    text1 STRING,
    num2 INT,
    text2 STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
    
    LOAD DATA LOCAL INPATH '/data.csv'
    OVERWRITE INTO TABLE mytable;    
    


    The csv is delimited by an comma (,) and looks like this:

    1, "some text, with comma in it", 123, "more text"
    

    This will return corrupt data since there is a ',' in the first string.
    Is there a way to set an text delimiter or make Hive ignore the ',' in strings?

    I can't change the delimiter of the csv since it gets pulled from an external source.

  • Ken Williams
    Ken Williams almost 11 years
    That handles embedded commas, but not embedded newlines, which are the other gotcha in CSV data. Or can the newlines be escaped too? The spec at cwiki.apache.org/confluence/display/Hive/… doesn't seem to allow escaping newlines.
  • ekta
    ekta almost 10 years
    sed -i 's/"//g' your_file_name does the pre-process inplace by removing the quoted text. However, you NEED to be certain that there is no innocous removal of other intended quoted (") characters.
  • bartektartanus
    bartektartanus about 8 years
    If your HIVE is up-to-date, this is the best answer :)
  • Kulasangar
    Kulasangar almost 7 years
    This helped me too!
  • JeffR
    JeffR over 6 years
    When you use OpenCSVSerde is there a way to specify what Null is defined with? Using "ROW FORMAT DELIMITED" I could add the option "NULL DEFINED AS ' '" to recognize null values in the data.
  • Gocht
    Gocht over 6 years
    This is not working for me, Hive shows the quoted value as NULL
  • Aditya
    Aditya over 4 years
    @wrschneider, where can i download this serde ?
  • wrschneider
    wrschneider over 4 years
    @Aditya it should come out of the box now and you should no longer have to download it
  • Aditya
    Aditya over 4 years
    @wrschneider thanks for your response. I could find it now. However this does not support multiline values in column. Example is multiline address. The only solution seem replacing the \n within quotes using scala or python. Any suggestion here ?
  • Nas Banov
    Nas Banov over 4 years
    it's working since '\;' is the same thing as ';'. There is no need to escape semicolon - but there is no need either
  • Tokci
    Tokci over 3 years
    it creates table with all strings in schema
  • John Prawyn
    John Prawyn over 3 years
    Thanks a lot, this is working for me. To re-create csv file, I used python csv writer with following dialect csv.register_dialect('for_hive', escapechar='\\', quoting=csv.QUOTE_NONE)