How to create an index on a Spark table?


There's no way to do this through a Spark SQL query. But the RDD API has a function called zipWithIndex: convert the DataFrame to an RDD, call zipWithIndex, and convert the resulting RDD back to a DataFrame.

See this community Wiki article for a full-blown solution.

Another approach could be to use Spark MLlib's StringIndexer.

Author: Cherry Wu (https://github.com/hanhanwu)

Updated on June 04, 2022

Comments

  • Cherry Wu
    Cherry Wu almost 2 years

I know Spark SQL is almost the same as Hive.

Now I have created a table, and when I run a Spark SQL query to create an index on it, I always get this error:

    Error in SQL statement: AnalysisException: mismatched input '' expecting AS near ')' in create index statement

(screenshot: test table)

The Spark SQL query I am using is:

    CREATE INDEX word_idx ON TABLE t (id)
    

The data type of id is bigint. Before this, I also tried to create an index on the "word" column of this table, and it gave me the same error.

So, is there any way to create an index through a Spark SQL query?

  • Cherry Wu
    Cherry Wu about 8 years
Yes, I am using zipWithIndex on some RDDs, but here I need to create an index on a specific column, and zipWithIndex is not very convenient for that: I'd need to separate the data first, use zipWithIndex, then join. I am wondering whether there is a simpler way.
  • David Griffin
    David Griffin about 8 years
Maybe take a look at the MLlib StringIndexer?
  • David Griffin
    David Griffin about 8 years
If you don't need the id to be sequential, you could look at monotonically_increasing_id().