Hive Data to Pandas DataFrame
Solution 1
pd.read_sql() (pandas 0.24.0) accepts a database connection, so you can pass a PyHive connection to it directly,
as follows:
from pyhive import hive
import pandas as pd
# open connection
conn = hive.Connection(host=host, port=20000, ...)
# query the table to a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)
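The DataFrame's columns are named after the Hive table's columns. You can rename them during or after DataFrame creation if needed:
- via HiveQL:
SELECT id AS new_column_name ...
- by assigning to the columns attribute of the DataFrame returned by
pd.read_sql() (the columns parameter of pd.read_sql() itself only selects columns when reading a table by name; it does not rename them)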
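Both renaming routes, in the query and after the DataFrame exists, can be tried without a Hive cluster. The sketch below is only an illustration: it uses an in-memory sqlite3 database as a stand-in for the PyHive connection, and the table contents and the name new_column_name are made up for the example.

```python
import sqlite3

import pandas as pd

# Assumption: sqlite3 stands in for the Hive connection so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],  # hypothetical rows
)

# Route 1: rename in the query itself with an alias.
aliased = pd.read_sql("SELECT id AS new_column_name, name FROM example_table", conn)

# Route 2: read with the original names, then rename the resulting DataFrame.
plain = pd.read_sql("SELECT id, name FROM example_table", conn)
renamed = plain.rename(columns={"id": "new_column_name"})

print(list(aliased.columns))
print(list(renamed.columns))
conn.close()
```

With a real Hive table, the same two calls should apply once conn is the PyHive connection opened earlier.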
Solution 2
You can try this with the cursor from the question (I'm pretty sure it will work):
res = cur.getSchema()
# get the column names of the table
description = [col['columnName'] for col in res]
# keep only the part after the last period, so names without a period are left unchanged
headers = [x.split(".")[-1] for x in description]
df = pd.DataFrame(cur.fetchall(), columns=headers)
df.head(n=20)
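The header clean-up step can be checked on its own, without a cursor. This is a minimal sketch: column_headers is a hypothetical helper, and the schema and rows are invented sample data shaped like the dicts the snippet above reads from cur.getSchema().

```python
def column_headers(schema):
    """Return bare column names from a list of schema dicts."""
    headers = []
    for col in schema:
        name = col["columnName"]
        # Names may arrive as "table.column"; keep only the part after the last period.
        headers.append(name.rsplit(".", 1)[-1])
    return headers


# Hypothetical sample data (not taken from a real Hive table).
schema = [{"columnName": "example_table.id"}, {"columnName": "name"}]
rows = [(1, "alice"), (2, "bob")]

headers = column_headers(schema)
# Pair each row with the cleaned headers to confirm the lengths line up.
records = [dict(zip(headers, row)) for row in rows]
print(headers)
print(records)
```

When building the DataFrame, pass the list by keyword, as in pd.DataFrame(rows, columns=headers). The second positional argument of pd.DataFrame is index, so a list of names passed positionally is treated as row labels rather than column names.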
Author by
ankita gupta
Updated on February 23, 2020
Comments
ankita gupta about 4 years
I'm new to Python.
How can I save data from Hive into a pandas DataFrame?
with pyhs2.connect(host, port=20000,authMechanism="PLAIN",user,password, database) as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()
        #Execute query
        cur.execute(query)
        #Return column info from query
        print cur.getSchema()
        #Fetch table results
        for i in cur.fetch():
            print i
        columnNames = [a['columnName'] for a in cur.getSchema()]
        print columnNames
        df1=pd.DataFrame(cur.fetch(),columnNames)
I tried passing the column names, but it didn't work.
Please suggest something.