How can I partition a table with HIVE?


If I understand correctly, you have files in folders four levels deep under the logs directory. In that case, you define your table as external with the location 'logs' and partition it by four virtual fields: year, month, day_of_month, and hour_of_day.

The partitioning is essentially done for you by Flume.

EDIT 3/9: A lot of the details depend on how exactly Flume writes the files, but in general terms your DDL should look something like this:

CREATE EXTERNAL TABLE table_name(fields...)
PARTITIONED BY(log_year STRING, log_month STRING,
    log_day_of_month STRING, log_hour_of_day STRING)
-- row format description goes here
STORED AS TEXTFILE
LOCATION '/your user path/logs';
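
Once the partitions have been registered (see the note below), the virtual partition columns can be used in queries like ordinary columns, and Hive will prune to only the matching directories. A minimal sketch, assuming the table and partition column names from the DDL above:

SELECT COUNT(*)
FROM table_name
WHERE log_year = '2012'
  AND log_month = '02';

With partition pruning, only the files under the 2012/02/... subdirectories of the table's location would be read for this query.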

EDIT 3/15: Per zzarbi's request, I'm adding a note that after the table is created, Hive needs to be informed about the partitions that have been created. This needs to be done repeatedly as long as Flume or another process creates new partitions. See my answer to the Create external with Partition question.
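
For example, a partition corresponding to one Flume-created directory could be declared roughly like this (a sketch, assuming the table and partition column names above and the /logs layout from the question):

ALTER TABLE table_name ADD PARTITION
    (log_year = '2012', log_month = '02',
     log_day_of_month = '10', log_hour_of_day = '00')
LOCATION '/logs/2012/02/10/00';

One such ALTER TABLE ... ADD PARTITION statement is needed for each new hour-level directory that Flume writes.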


Comments

  • zzarbi almost 2 years

    I've been playing with Hive for a few days now, but I still have a hard time with partitions.

    I've been recording Apache logs (combined format) in Hadoop for a few months. They are stored in raw text format, partitioned by date (via Flume): /logs/yyyy/mm/dd/hh/*

    Example:

    /logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
    /logs/2012/02/10/00/Part02xx
    /logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)
    

    The date in the combined log files follows this format: [10/Feb/2012:00:00:00 -0800]

    How can I create an external table with partitions in Hive that uses my physical partitioning? I can't find any good documentation on Hive partitioning. I found related questions such as:

    If I load my logs into an external table with Hive, I cannot partition by time, since it's not in the right format (Feb <=> 02). Even if it were in the right format, how would I transform a string like "10/02/2012:00:00:00 -0800" into multiple directories like "/2012/02/10/00"?

    I could eventually use a Pig script to convert my raw logs into Hive tables, but at that point I might as well just use Pig instead of Hive for my reporting.

  • zzarbi about 12 years
    So how would the creation of the table look? And how would I write a query that uses those partitions?
  • zzarbi about 12 years
    I'll have to test that; I'll get back to you as soon as I can.
  • zzarbi about 12 years
    Olaf, I tried your solution (pastebin.com/TkLCzWdv). The table is created correctly; however, if I query Select count(*) from raw_datastore where year = '2012' and month = '02'; it launches a map/reduce job but there are no results.
  • Olaf about 12 years
    It is hard to diagnose this kind of problem without looking at the data and at the system setup. Plus, answering SO questions doesn't pay all that well ;-) I suggest you copy files from one of the Flume-created directories into a separate directory, define a new unpartitioned external Hive table, and make sure that select count(*) returns the right value. After that you can start troubleshooting the partitioning.
  • zzarbi about 12 years
    Actually my problem is pretty simple... I have to add my partitions one by one, which I tried and failed at until I understood the correct syntax: ALTER TABLE raw_datastore ADD PARTITION (year = '2011', month='05', day='05', hour='14') LOCATION '/logs/2011/05/05/14'; If I do that, I can now do select count(*) and it works. So I'm going to approve your answer, but it would be cool if you could edit your answer to add the fact that you need to add the partitions yourself.
  • Olaf about 12 years
    Thanks! I've edited my answer and linked to another question on declaring table partitions to Hive.