Hive load CSV with commas in quoted fields
Solution 1
The problem is that Hive
doesn't handle quoted texts. You either need to pre-process the data by changing the delimiter between the fields (e.g: with a Hadoop-streaming job) or you can also give a try to use a custom CSV SerDe which uses OpenCSV to parse the files.
Solution 2
If you can re-create or parse your input data, you can specify an escape character for the CREATE TABLE:
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';
Will accept this line as 4 fields
1,some text\, with comma in it,123,more text
Solution 3
As of Hive 0.14, the CSV SerDe is a standard part of the Hive install
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)
Martijn Lenderink
Updated on July 09, 2022Comments
-
Martijn Lenderink almost 2 years
I am trying to load a CSV file into a Hive table like so:
CREATE TABLE mytable ( num1 INT, text1 STRING, num2 INT, text2 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","; LOAD DATA LOCAL INPATH '/data.csv' OVERWRITE INTO TABLE mytable;
The csv is delimited by an comma (,) and looks like this:1, "some text, with comma in it", 123, "more text"
This will return corrupt data since there is a ',' in the first string.
Is there a way to set an text delimiter or make Hive ignore the ',' in strings?I can't change the delimiter of the csv since it gets pulled from an external source.
-
Ken Williams almost 11 yearsThat handles embedded commas, but not embedded newlines, which are the other gotcha in CSV data. Or can the newlines be escaped too? The spec at cwiki.apache.org/confluence/display/Hive/… doesn't seem to allow escaping newlines.
-
ekta almost 10 yearssed -i 's/"//g' your_file_name does the pre-process inplace by removing the quoted text. However, you NEED to be certain that there is no innocous removal of other intended quoted (") characters.
-
bartektartanus about 8 yearsIf your HIVE is up-to-date, this is the best answer :)
-
Kulasangar almost 7 yearsThis helped me too!
-
JeffR over 6 yearsWhen you use OpenCSVSerde is there a way to specify what Null is defined with? Using "ROW FORMAT DELIMITED" I could add the option "NULL DEFINED AS ' '" to recognize null values in the data.
-
Gocht over 6 yearsThis is not working for me, Hive shows the quoted value as NULL
-
Aditya over 4 years@wrschneider, where can i download this serde ?
-
wrschneider over 4 years@Aditya it should come out of the box now and you should no longer have to download it
-
Aditya over 4 years@wrschneider thanks for your response. I could find it now. However this does not support multiline values in column. Example is multiline address. The only solution seem replacing the \n within quotes using scala or python. Any suggestion here ?
-
Nas Banov over 4 yearsit's working since
'\;'
is the same thing as';'
. There is no need to escape semicolon - but there is no need either -
Tokci over 3 yearsit creates table with all strings in schema
-
John Prawyn over 3 yearsThanks a lot, this is working for me. To re-create csv file, I used python csv writer with following dialect csv.register_dialect('for_hive', escapechar='\\', quoting=csv.QUOTE_NONE)