How to Access Hive via Python?

python hadoop hive

227,655

Solution 1

I believe the easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution. For Windows there are some options on GNU.org, you can download a binary installer. On a Mac SASL should be available if you've installed xcode developer tools (xcode-select --install in Terminal)

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the hive connection, you have options how to use it. You can just straight-up query:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or to use the connection to make a Pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)

Solution 2

I assert that you are using HiveServer2, which is the reason that makes the code doesn't work.

You may use pyhs2 to access your Hive correctly and the example code like that:

import pyhs2

with pyhs2.connect(host='localhost',
               port=10000,
               authMechanism="PLAIN",
               user='root',
               password='test',
               database='default') as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()

        #Execute query
        cur.execute("select * from table")

        #Return column info from query
        print cur.getSchema()

        #Fetch table results
        for i in cur.fetch():
            print i

Attention that you may install python-devel.x86_64 cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.

Wish this can help you.

Reference: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

Solution 3

Below python program should work to access hive tables from python:

import commands

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

Solution 4

here's a generic approach which makes it easy for me because I keep connecting to several servers (SQL, Teradata, Hive etc.) from python. Hence, I use the pyodbc connector. Here's some basic steps to get going with pyodbc (in case you have never used it):

Pre-requisite: You should have the relevant ODBC connection in your windows setup before you follow the below steps. In case you don't have it, find the same here

Once complete: STEP 1. pip install: pip install pyodbc (here's the link to download the relevant driver from Microsoft's website)

STEP 2. now, import the same in your python script:

import pyodbc

STEP 3. Finally, go ahead and give the connection details as follows:

conn_hive = pyodbc.connect('DSN = YOUR_DSN_NAME , SERVER = YOUR_SERVER_NAME, UID = USER_ID, PWD = PSWD' )

The best part of using pyodbc is that I have to import just one package to connect to almost any data source.

Solution 5

To connect using a username/password and specifying ports, the code looks like this:

from pyhive import presto

cursor = presto.connect(host='host.example.com',
                    port=8081,
                    username='USERNAME:PASSWORD').cursor()

sql = 'select * from table limit 10'

cursor.execute(sql)

print(cursor.fetchone())
print(cursor.fetchall())

View more solutions

227,655

Author by

Matthew Moisen

Backend engineer specializing in Python and RDBMS.

Updated on February 12, 2021

Comments

Matthew Moisen about 3 years

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py

I can then do the imports as listed in the link, with the exception of from hive import ThriftHive which actually need to be:

from hive_service import ThriftHive

Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.

So I set it up like so:

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")

    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)

I received the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
self.recv_execute()
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file reveals the method execute within the Client class.

How may I use Python to access Hive?

muro over 9 years

Thanks for this, I was having issues installing pyhs2 with pip in CentOS 6, but your suggested YUM libraries did the trick. One quick comment, in your example code above, you need to indent everything below database='default') as conn: for it work correctly. There's some additional documentation here: github.com/BradRuderman/pyhs2
Liondancer over 9 years

I am getting this error. I was wondering if you knew anything about this? github.com/BradRuderman/pyhs2/issues/32
Matthew Moisen about 9 years

Hi Naveen. My apologies for the year+ response. When I try this example for one of my tables, it appears that the client.fetchOne() command is returning a string, not a row/array/dict object that is indexed by columns. Is this your impression as well? I would like to be able to access the individual columns. Regards.
Matthew Moisen over 8 years

At the time I wrote this question, I was not on HiveServer2. However we just installed it yesterday, and I can confirm this answer works on HS2.
Tagar over 8 years

that's awesome.. would this approach work for Hive Server 2 servers too? would this work with kerberized clusters? thanks.
joefromct over 8 years

FYI, as of now sasl doesn't work with python 3. Other info here.
TayTay over 8 years

How would you use kerberos authentication instead of plain?
noleto about 8 years

I experienced some troubles when connecting to HiveServer2 on Debian. Error was: "SASL authentication failure: No worthy mechs found". I had to install libsasl2-modules package (via apt-get) to get it works.
chandank about 8 years

The project home page in github says it is no longer actively maintained
Matthew Moisen about 8 years

I can confirm this works with Hive Server 2. However, like @chandank mentioned, the project is not being maintained and I've noticed a couple of bugs. It doesn't return unicode values (which if you are for example displaying the values on Flask/Jinja will through an error), fetchmany() returns null values if the number provided is greater than the number of rows left, and etc.
Matthew Moisen about 8 years

Hi Tristan, would you happen to have an example with Hive Server 2? I couldn't find any reference to a password for Connection at pythonhosted.org/PyHive/pyhive.html#module-pyhive.hive
Tagar almost 8 years

+1 might be good in some quick-and-dirty cases when you can't install external yum or pip packages on a server.
pele88 over 7 years

@Ruslan see this post for connecting to a kerberised cluster stackoverflow.com/a/39698445/4186116
Tagar over 7 years

@pele88 thank you. pyhs2 is no longer maintained. see my response at stackoverflow.com/a/38666630/470583
ML_Passion over 7 years

@python-starter , your method only works when the hive resides on the same server where python is installed. If you are accessing the hive tables on a remote server, I guess something else is required.
Marlon Abeykoon about 7 years

You have said that "For Windows there are some options on GNU.org, you can download a binary installer". Can you provide the link? cos I still have an issue.. stackoverflow.com/questions/42210901/…
kten almost 7 years

i am getting an error "could not connect to any of (IP Address)" I gave the host name but it resolved to IP. port=10000,username="username",database="def_db". Could anyone point to any additional params required?
skywalkerytx almost 7 years

@ML_Passion could be just few more lines of bash code to deal with remote problem. however parse the result will be the real hell (esp. when try this with Asian language and weird dirty data from other team)
VeLKerr almost 7 years

When I tried to install sasl via Pip, I caught an error: sasl/saslwrapper.h:22:23: fatal error: sasl/sasl.h: No such file or directory. Installation libsasl2-dev via apt-get, installation works fine.
Goutham Anush almost 6 years

I have a strange problem. If I give "select * from <table> limit 10" its working fine. But if i give "select * from <table> where hostname='<host>'" it is failing. It says raise NotSupportedError("Hive does not have transactions") # pragma: no cover Any idea why this is happening? I am able to execute the same query on Ambari.
Sandeep Singh almost 6 years

I am using python 3 (virtual environment) though I am getting this error AttributeError: module 'subprocess' has no attribute 'getstatusoutput'
goks almost 6 years

@SandeepSingh Check if you have any python module with name 'subprocess' in your working directory
Tagar over 5 years

It's a good solution. The only thing that worries me with the jaydebeapi approach is that database connectivity requires a jvm being spawned - which is a huge overhead. Also jaydebeapi is still marked as "beta" on pip registry and hasn't been updated for last couple of years.
Carsten over 4 years

I get the same kind error. No idea where to start looking.
Tagar over 4 years

Impyla's primary role was an interface for Impala, and it doesn't support transactions, hence NotImplemented errors when you try to commit/rollback etc. Filed github.com/cloudera/impyla/issues/372 improvement request
chris_aych over 3 years

This was so helpful! Thank you