How to Access Hive via Python?

227,655

Solution 1

I believe the easiest way is to use PyHive.

To install you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution. For Windows there are some options on GNU.org, you can download a binary installer. On a Mac SASL should be available if you've installed xcode developer tools (xcode-select --install in Terminal)

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the hive connection, you have options how to use it. You can just straight-up query:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or to use the connection to make a Pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)

Solution 2

I assert that you are using HiveServer2, which is the reason that makes the code doesn't work.

You may use pyhs2 to access your Hive correctly and the example code like that:

import pyhs2

with pyhs2.connect(host='localhost',
               port=10000,
               authMechanism="PLAIN",
               user='root',
               password='test',
               database='default') as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()

        #Execute query
        cur.execute("select * from table")

        #Return column info from query
        print cur.getSchema()

        #Fetch table results
        for i in cur.fetch():
            print i

Attention that you may install python-devel.x86_64 cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.

Wish this can help you.

Reference: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

Solution 3

Below python program should work to access hive tables from python:

import commands

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

Solution 4

here's a generic approach which makes it easy for me because I keep connecting to several servers (SQL, Teradata, Hive etc.) from python. Hence, I use the pyodbc connector. Here's some basic steps to get going with pyodbc (in case you have never used it):

  • Pre-requisite: You should have the relevant ODBC connection in your windows setup before you follow the below steps. In case you don't have it, find the same here

Once complete: STEP 1. pip install: pip install pyodbc (here's the link to download the relevant driver from Microsoft's website)

STEP 2. now, import the same in your python script:

import pyodbc

STEP 3. Finally, go ahead and give the connection details as follows:

conn_hive = pyodbc.connect('DSN = YOUR_DSN_NAME , SERVER = YOUR_SERVER_NAME, UID = USER_ID, PWD = PSWD' )

The best part of using pyodbc is that I have to import just one package to connect to almost any data source.

Solution 5

To connect using a username/password and specifying ports, the code looks like this:

from pyhive import presto

cursor = presto.connect(host='host.example.com',
                    port=8081,
                    username='USERNAME:PASSWORD').cursor()

sql = 'select * from table limit 10'

cursor.execute(sql)

print(cursor.fetchone())
print(cursor.fetchall())
Share:
227,655
Matthew Moisen
Author by

Matthew Moisen

Backend engineer specializing in Python and RDBMS.

Updated on February 12, 2021

Comments

  • Matthew Moisen
    Matthew Moisen about 3 years

    https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

    When I add this to /etc/profile:

    export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py
    

    I can then do the imports as listed in the link, with the exception of from hive import ThriftHive which actually need to be:

    from hive_service import ThriftHive
    

    Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.

    So I set it up like so:

    from thrift import Thrift
    from thrift.transport import TSocket
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol
    try:
        transport = TSocket.TSocket('<node-with-metastore>', 9083)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()
        client.execute("CREATE TABLE test(c1 int)")
    
        transport.close()
    except Thrift.TException, tx:
        print '%s' % (tx.message)
    

    I received the following error:

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
    self.recv_execute()
    File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
    raise x
    thrift.Thrift.TApplicationException: Invalid method name: 'execute'
    

    But inspecting the ThriftHive.py file reveals the method execute within the Client class.

    How may I use Python to access Hive?

  • muro
    muro over 9 years
    Thanks for this, I was having issues installing pyhs2 with pip in CentOS 6, but your suggested YUM libraries did the trick. One quick comment, in your example code above, you need to indent everything below database='default') as conn: for it work correctly. There's some additional documentation here: github.com/BradRuderman/pyhs2
  • Liondancer
    Liondancer over 9 years
    I am getting this error. I was wondering if you knew anything about this? github.com/BradRuderman/pyhs2/issues/32
  • Matthew Moisen
    Matthew Moisen about 9 years
    Hi Naveen. My apologies for the year+ response. When I try this example for one of my tables, it appears that the client.fetchOne() command is returning a string, not a row/array/dict object that is indexed by columns. Is this your impression as well? I would like to be able to access the individual columns. Regards.
  • Matthew Moisen
    Matthew Moisen over 8 years
    At the time I wrote this question, I was not on HiveServer2. However we just installed it yesterday, and I can confirm this answer works on HS2.
  • Tagar
    Tagar over 8 years
    that's awesome.. would this approach work for Hive Server 2 servers too? would this work with kerberized clusters? thanks.
  • joefromct
    joefromct over 8 years
    FYI, as of now sasl doesn't work with python 3. Other info here.
  • TayTay
    TayTay over 8 years
    How would you use kerberos authentication instead of plain?
  • noleto
    noleto about 8 years
    I experienced some troubles when connecting to HiveServer2 on Debian. Error was: "SASL authentication failure: No worthy mechs found". I had to install libsasl2-modules package (via apt-get) to get it works.
  • chandank
    chandank about 8 years
    The project home page in github says it is no longer actively maintained
  • Matthew Moisen
    Matthew Moisen about 8 years
    I can confirm this works with Hive Server 2. However, like @chandank mentioned, the project is not being maintained and I've noticed a couple of bugs. It doesn't return unicode values (which if you are for example displaying the values on Flask/Jinja will through an error), fetchmany() returns null values if the number provided is greater than the number of rows left, and etc.
  • Matthew Moisen
    Matthew Moisen about 8 years
    Hi Tristan, would you happen to have an example with Hive Server 2? I couldn't find any reference to a password for Connection at pythonhosted.org/PyHive/pyhive.html#module-pyhive.hive
  • Tagar
    Tagar almost 8 years
    +1 might be good in some quick-and-dirty cases when you can't install external yum or pip packages on a server.
  • pele88
    pele88 over 7 years
    @Ruslan see this post for connecting to a kerberised cluster stackoverflow.com/a/39698445/4186116
  • Tagar
    Tagar over 7 years
    @pele88 thank you. pyhs2 is no longer maintained. see my response at stackoverflow.com/a/38666630/470583
  • ML_Passion
    ML_Passion over 7 years
    @python-starter , your method only works when the hive resides on the same server where python is installed. If you are accessing the hive tables on a remote server, I guess something else is required.
  • Marlon Abeykoon
    Marlon Abeykoon about 7 years
    You have said that "For Windows there are some options on GNU.org, you can download a binary installer". Can you provide the link? cos I still have an issue.. stackoverflow.com/questions/42210901/…
  • kten
    kten almost 7 years
    i am getting an error "could not connect to any of (IP Address)" I gave the host name but it resolved to IP. port=10000,username="username",database="def_db". Could anyone point to any additional params required?
  • skywalkerytx
    skywalkerytx almost 7 years
    @ML_Passion could be just few more lines of bash code to deal with remote problem. however parse the result will be the real hell (esp. when try this with Asian language and weird dirty data from other team)
  • VeLKerr
    VeLKerr almost 7 years
    When I tried to install sasl via Pip, I caught an error: sasl/saslwrapper.h:22:23: fatal error: sasl/sasl.h: No such file or directory. Installation libsasl2-dev via apt-get, installation works fine.
  • Goutham Anush
    Goutham Anush almost 6 years
    I have a strange problem. If I give "select * from <table> limit 10" its working fine. But if i give "select * from <table> where hostname='<host>'" it is failing. It says raise NotSupportedError("Hive does not have transactions") # pragma: no cover Any idea why this is happening? I am able to execute the same query on Ambari.
  • Sandeep Singh
    Sandeep Singh almost 6 years
    I am using python 3 (virtual environment) though I am getting this error AttributeError: module 'subprocess' has no attribute 'getstatusoutput'
  • goks
    goks almost 6 years
    @SandeepSingh Check if you have any python module with name 'subprocess' in your working directory
  • Tagar
    Tagar over 5 years
    It's a good solution. The only thing that worries me with the jaydebeapi approach is that database connectivity requires a jvm being spawned - which is a huge overhead. Also jaydebeapi is still marked as "beta" on pip registry and hasn't been updated for last couple of years.
  • Carsten
    Carsten over 4 years
    I get the same kind error. No idea where to start looking.
  • Tagar
    Tagar over 4 years
    Impyla's primary role was an interface for Impala, and it doesn't support transactions, hence NotImplemented errors when you try to commit/rollback etc. Filed github.com/cloudera/impyla/issues/372 improvement request
  • chris_aych
    chris_aych over 3 years
    This was so helpful! Thank you