How to Access Hive via Python?
Solution 1
I believe the easiest way is to use PyHive.
To install you'll need these libraries:
pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive
Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.
If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get, yum, or whatever package manager your distribution uses. For Windows there are some options on GNU.org where you can download a binary installer. On a Mac, SASL should be available if you've installed the Xcode developer tools (xcode-select --install in Terminal).
After installation, you can connect to Hive like this:
from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")
Now that you have the hive connection, you have several options for using it. You can run a straight-up query:
cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
    use_result(result)
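Rows come back from a DB-API cursor as plain tuples. If you want them keyed by column name, you can zip each row with cursor.description. This is a sketch that works with any DB-API 2.0 cursor, PyHive's included; the StubCursor below is a hypothetical stand-in so the example runs without a live Hive server:

```python
def rows_to_dicts(cursor):
    """Zip each fetched row with the column names from cursor.description."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Hypothetical stub standing in for a real connected cursor.
class StubCursor:
    description = [("cool_stuff", None), ("id", None)]

    def fetchall(self):
        return [("a", 1), ("b", 2)]

print(rows_to_dicts(StubCursor()))
# → [{'cool_stuff': 'a', 'id': 1}, {'cool_stuff': 'b', 'id': 2}]
```

With a real connection you would pass the cursor from conn.cursor() after executing a query.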
...or to use the connection to make a Pandas dataframe:
import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
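pd.read_sql accepts any DB-API connection, so the same call shape used with the Hive connection above can be tried locally against an in-memory sqlite3 database. This is purely illustrative; the table and column names mirror the example above and are made up:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for the Hive connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hive_table (cool_stuff TEXT)")
conn.executemany("INSERT INTO hive_table VALUES (?)", [("a",), ("b",)])

# Same call shape as with the PyHive connection.
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
print(df.shape)  # → (2, 1)
```

Swapping conn for the hive.Connection object is all that changes when you point this at a real cluster.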
Solution 2
I assume that you are using HiveServer2, which is why the code doesn't work.
You can use pyhs2 to access your Hive correctly; example code follows:
import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute("select * from table")

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)
Note that you may need to install python-devel.x86_64 and cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.
Hope this helps.
Solution 3
The Python program below should work for accessing Hive tables from Python:
import subprocess  # Python 3; on Python 2 this was the commands module

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "
status, output = subprocess.getstatusoutput(cmd)
if status == 0:
    print(output)
else:
    print("error")
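In silent mode, the hive CLI prints query results as tab-separated lines, so the captured output can be split into rows. A small helper along these lines (hypothetical, not part of any library) turns that text into lists of fields:

```python
def parse_hive_output(output):
    """Split hive -S -e output into a list of rows (lists of field strings)."""
    return [line.split("\t") for line in output.strip().splitlines() if line]

# Sample text shaped like hive -S output; real results depend on your table.
sample = "1\talice\t2019-01-01\n2\tbob\t2019-01-02\n"
print(parse_hive_output(sample))
# → [['1', 'alice', '2019-01-01'], ['2', 'bob', '2019-01-02']]
```

Note that everything comes back as strings; you'd need to cast numeric columns yourself, and as a comment below points out, messy data with embedded tabs or newlines will break this kind of parsing.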
Solution 4
Here's a generic approach which makes it easy for me because I keep connecting to several servers (SQL, Teradata, Hive, etc.) from Python. Hence, I use the pyodbc connector. Here are some basic steps to get going with pyodbc (in case you have never used it):
- Pre-requisite: You should have the relevant ODBC connection in your Windows setup before you follow the steps below. In case you don't have it, find the same here
Once complete:
STEP 1. pip install:
pip install pyodbc
(here's the link to download the relevant driver from Microsoft's website)
STEP 2. Now, import it in your Python script:
import pyodbc
STEP 3. Finally, go ahead and give the connection details as follows:
conn_hive = pyodbc.connect('DSN=YOUR_DSN_NAME;SERVER=YOUR_SERVER_NAME;UID=USER_ID;PWD=PSWD')
The best part of using pyodbc is that I have to import just one package to connect to almost any data source.
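pyodbc takes a single semicolon-delimited connection string, so a small helper (hypothetical, not part of pyodbc) can assemble it from keyword parts and keep credentials out of the literal:

```python
def build_conn_str(**parts):
    """Join KEY=value pairs with semicolons, the format pyodbc expects."""
    return ";".join(f"{key.upper()}={value}" for key, value in parts.items())

conn_str = build_conn_str(dsn="YOUR_DSN_NAME", uid="USER_ID", pwd="PSWD")
print(conn_str)  # → DSN=YOUR_DSN_NAME;UID=USER_ID;PWD=PSWD
```

You would then pass the result to pyodbc.connect(conn_str), pulling the password from an environment variable or secrets store rather than hard-coding it.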
Solution 5
To connect using a username/password and specifying ports, the code looks like this:
from pyhive import presto
cursor = presto.connect(host='host.example.com',
port=8081,
username='USERNAME:PASSWORD').cursor()
sql = 'select * from table limit 10'
cursor.execute(sql)
print(cursor.fetchone())
print(cursor.fetchall())
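For large result sets, fetchall() can exhaust memory, so a generator over fetchmany() streams rows in batches instead. This sketch works with any DB-API cursor; the StubCursor below is a hypothetical stand-in for a real presto or hive cursor so the example runs offline:

```python
def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time, fetching batch_size rows per round trip."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        for row in batch:
            yield row

# Hypothetical stub standing in for a connected cursor.
class StubCursor:
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, n):
        batch, self._rows = self._rows[:n], self._rows[n:]
        return batch

print(list(iter_rows(StubCursor([(1,), (2,), (3,)]), batch_size=2)))
# → [(1,), (2,), (3,)]
```

With a real connection you would write for row in iter_rows(cursor): ... after cursor.execute(sql), keeping at most one batch in memory at a time.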
Matthew Moisen
Backend engineer specializing in Python and RDBMS.
Updated on February 12, 2021

Comments
-
Matthew Moisen about 3 years
https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.
When I add this to /etc/profile:
export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py
I can then do the imports as listed in the link, with the exception of from hive import ThriftHive, which actually needs to be: from hive_service import ThriftHive
Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.
So I set it up like so:
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")
    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)
I received the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
    self.recv_execute()
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'
But inspecting the ThriftHive.py file reveals the method execute within the Client class.
How may I use Python to access Hive?
-
muro over 9 yearsThanks for this, I was having issues installing pyhs2 with pip in CentOS 6, but your suggested YUM libraries did the trick. One quick comment: in your example code above, you need to indent everything below database='default') as conn: for it to work correctly. There's some additional documentation here: github.com/BradRuderman/pyhs2
-
Liondancer over 9 yearsI am getting this error. I was wondering if you knew anything about this? github.com/BradRuderman/pyhs2/issues/32
-
Matthew Moisen about 9 yearsHi Naveen. My apologies for the year+ response. When I try this example for one of my tables, it appears that the client.fetchOne() command is returning a string, not a row/array/dict object that is indexed by columns. Is this your impression as well? I would like to be able to access the individual columns. Regards.
-
Matthew Moisen over 8 yearsAt the time I wrote this question, I was not on HiveServer2. However we just installed it yesterday, and I can confirm this answer works on HS2.
-
Tagar over 8 yearsthat's awesome.. would this approach work for Hive Server 2 servers too? would this work with kerberized clusters? thanks.
-
joefromct over 8 yearsFYI, as of now sasl doesn't work with python 3. Other info here.
-
TayTay over 8 yearsHow would you use kerberos authentication instead of plain?
-
noleto about 8 yearsI experienced some troubles when connecting to HiveServer2 on Debian. Error was: "SASL authentication failure: No worthy mechs found". I had to install libsasl2-modules package (via apt-get) to get it works.
-
chandank about 8 yearsThe project home page in github says it is no longer actively maintained
-
Matthew Moisen about 8 yearsI can confirm this works with Hive Server 2. However, like @chandank mentioned, the project is not being maintained and I've noticed a couple of bugs. It doesn't return unicode values (which, if you are for example displaying the values on Flask/Jinja, will throw an error), fetchmany() returns null values if the number provided is greater than the number of rows left, etc. -
Matthew Moisen about 8 yearsHi Tristan, would you happen to have an example with Hive Server 2? I couldn't find any reference to a password for Connection at pythonhosted.org/PyHive/pyhive.html#module-pyhive.hive
-
Tagar almost 8 years+1 might be good in some quick-and-dirty cases when you can't install external yum or pip packages on a server.
-
pele88 over 7 years@Ruslan see this post for connecting to a kerberised cluster stackoverflow.com/a/39698445/4186116
-
Tagar over 7 years@pele88 thank you. pyhs2 is no longer maintained. see my response at stackoverflow.com/a/38666630/470583
-
ML_Passion over 7 years@python-starter , your method only works when the hive resides on the same server where python is installed. If you are accessing the hive tables on a remote server, I guess something else is required.
-
Marlon Abeykoon about 7 yearsYou have said that "For Windows there are some options on GNU.org, you can download a binary installer". Can you provide the link? cos I still have an issue.. stackoverflow.com/questions/42210901/…
-
kten almost 7 yearsi am getting an error "could not connect to any of (IP Address)" I gave the host name but it resolved to IP. port=10000,username="username",database="def_db". Could anyone point to any additional params required?
-
skywalkerytx almost 7 years@ML_Passion it could be just a few more lines of bash code to deal with the remote problem. However, parsing the result will be the real hell (esp. when trying this with Asian languages and weird dirty data from another team)
-
VeLKerr almost 7 yearsWhen I tried to install sasl via pip, I caught an error: sasl/saslwrapper.h:22:23: fatal error: sasl/sasl.h: No such file or directory. After installing libsasl2-dev via apt-get, the installation works fine. -
Goutham Anush almost 6 yearsI have a strange problem. If I give "select * from <table> limit 10" it's working fine. But if I give "select * from <table> where hostname='<host>'" it is failing. It says raise NotSupportedError("Hive does not have transactions") # pragma: no cover Any idea why this is happening? I am able to execute the same query on Ambari.
-
Sandeep Singh almost 6 yearsI am using python 3 (virtual environment) though I am getting this error: AttributeError: module 'subprocess' has no attribute 'getstatusoutput'
-
goks almost 6 years@SandeepSingh Check if you have any python module with name 'subprocess' in your working directory
-
Tagar over 5 yearsIt's a good solution. The only thing that worries me with the jaydebeapi approach is that database connectivity requires a JVM being spawned, which is a huge overhead. Also, jaydebeapi is still marked as "beta" on the pip registry and hasn't been updated for the last couple of years. -
Carsten over 4 yearsI get the same kind of error. No idea where to start looking.
-
Tagar over 4 yearsImpyla's primary role was an interface for Impala, and it doesn't support transactions, hence NotImplemented errors when you try to commit/rollback etc. Filed github.com/cloudera/impyla/issues/372 improvement request
-
chris_aych over 3 yearsThis was so helpful! Thank you