Write pandas table to Impala


Solution 1

You're going to love Ibis! It has the HDFS functions you'll need (namely put) and wraps the Impala DML and DDL that make this easy.

The general approach I've used for something similar is to save your pandas table to a CSV, hdfs.put that onto the cluster, and then create a new table using that CSV as the data source.

You don't need Ibis for this, but it should make it a little easier, and it may be a nice tool for you if you're already familiar with pandas (Ibis was also created by Wes McKinney, who wrote pandas).
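
For concreteness, here is a rough sketch of that CSV -> hdfs.put -> new-table flow with Ibis. Every host, port, path, schema, database, and table name below is a placeholder, and the exact method names (hdfs_connect, delimited_file, create_table) belong to the older Ibis Impala client and can differ between versions, so treat this as an outline rather than a tested recipe:

import ibis
import pandas as pd

# pandas_df stands in for the DataFrame holding your analysis results
pandas_df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Connect to WebHDFS and to an impalad (hosts and ports are placeholders)
hdfs = ibis.hdfs_connect(host='namenode-host', port=50070)
con = ibis.impala.connect(host='impalad-host', port=21050, hdfs_client=hdfs)

# 1. Dump the DataFrame to a local CSV
pandas_df.to_csv('results.csv', index=False, header=False)

# 2. Put the CSV into a directory on HDFS
hdfs.mkdir('/user/me/results')
hdfs.put('/user/me/results/results.csv', 'results.csv')

# 3. Expose that directory as a table: delimited_file() builds a table
#    expression over CSV files given an explicit schema, and create_table()
#    materializes it as a new Impala table
schema = ibis.schema([('a', 'int64'), ('b', 'string')])
csv_table = con.delimited_file('/user/me/results', schema)
con.create_table('new_table', csv_table, database='my_db')

Ibis can also skip the intermediate CSV: con.create_table('new_table', pandas_df) accepts a pandas DataFrame directly, which is the call one of the comments below runs into trouble with.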

Solution 2

I am trying to do the same thing, and I figured out a way to do it with an example provided with impyla:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
df.to_sql(name="test_df", con=conn, flavor="mysql")  # conn: an open impyla connection

This works, and the resulting table in Impala (created through the mysql flavor as the backend) is fine.

However, I got stuck on getting text values in, as Impala tries to infer the column types and I get cast errors. (It would be really nice if impyla could implicitly cast from string to [var]char(N).)
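
If you'd rather stay with impyla, one hedged workaround for the cast problem is to declare the column types yourself and push the rows through the cursor. This is only a sketch: the table and column names are invented for illustration, df_text is a made-up DataFrame, and conn is assumed to be the open impyla connection from the question:

# df_text: a made-up DataFrame with an integer 'id' and a string 'name' column
cur = conn.cursor()

# Declare the types explicitly so Impala doesn't have to guess
cur.execute("""
    CREATE TABLE IF NOT EXISTS my_text_table (
        id   INT,
        name STRING
    )
""")

# Insert the rows with parameter binding (impyla uses pyformat-style markers)
rows = [{'id': int(r.id), 'name': str(r.name)} for r in df_text.itertuples()]
cur.executemany(
    'INSERT INTO my_text_table (id, name) VALUES (%(id)s, %(name)s)',
    rows,
)
cur.close()

Row-by-row INSERT ... VALUES statements are slow in Impala and create many small files, so for anything beyond a small result set the CSV-on-HDFS route from Solution 1 scales better.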


Comments

  • SummerEla
    SummerEla almost 2 years

    Using the impyla module, I've downloaded the results of an impala query into a pandas dataframe, done analysis, and would now like to write the results back to a table on impala, or at least to an hdfs file.

    However, I cannot find any information on how to do this, or even how to ssh into the impala shell and write the table from there.

    What I'd like to do:

    from impala.dbapi import connect
    from impala.util import as_pandas
    
    # connect to my host and port
    conn=connect(host='myhost', port=111) 
    
    # create query to save table as pandas df
    create_query = """
         SELECT * FROM {}
         """.format(my_table_name)
    
    # run query on impala
    cur = conn.cursor()
    cur.execute(create_query)
    
    # store results as pandas data frame
    pandas_df = as_pandas(cur)
    cur.close()  
    

    Once I've done whatever I need to do with pandas_df, I'd like to save those results back to Impala as a table.

    # create query to save new_df back to impala
    save_query = """
        CREATE TABLE new_table AS
            SELECT * 
            FROM pandas_df
        """
    
    # run query on impala
    cur = conn.cursor()
    cur.execute(save_query)
    cur.close()
    

    The above scenario would be ideal, but I'd be happy if I could figure out how to ssh into impala-shell and do this from python, or even just save the table to hdfs. I'm writing this as a script for other users, so it's essential to have this all done within the script. Thanks so much!

  • SummerEla
    SummerEla over 8 years
    I was actually just reading up on Ibis the other day. Sounds like I'll have to check it out.
  • elvainch
    elvainch about 6 years
    Hi im getting an error in Ibis in the db.create_table('mytablename',mydataframe). The error is File "C:\Anaconda3\lib\json\decoder.py", line 357, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0). Any idea what could be? @Isaac