How to connect to Amazon Redshift or other DBs in Apache Spark?


Solution 1

Although this seems to be a very old post, for anyone who is still looking for an answer, the steps below worked for me!

Start the PySpark shell with the JDBC driver jar included:

bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar

Create a DataFrame by supplying the appropriate connection details:

# Note: a jdbc:redshift:// URL needs Amazon's Redshift JDBC driver on the classpath;
# with only the PostgreSQL driver loaded above, use a jdbc:postgresql:// URL instead.
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()

Spark Version: 2.2

Solution 2

It turns out you only need a username/password to access Redshift in Spark, and it is done as follows (using the Python API):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the existing SparkContext
df = sqlContext.read.load(
    format="jdbc",  # read.load takes format=, not the older source= keyword
    url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
    dbtable="schema.table",
)

Hope this helps someone!

Solution 3

If you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel.
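
For reference, a read with spark-redshift looks roughly like the sketch below (the endpoint, table name, and S3 tempdir are placeholders; the library also expects the Redshift JDBC driver on the classpath and AWS credentials that can access the tempdir bucket):

df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/db_name?user=user_name&password=secret") \
    .option("dbtable", "schema.table_name") \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .load()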

If you still want to use JDBC, check out the new built-in JDBC data source in Spark 1.4+.
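
A minimal sketch of that built-in JDBC source from the Python API (host, port, database, and table names are placeholders):

df = sqlContext.read.jdbc(
    url="jdbc:postgresql://host:port/db_name?user=user_name&password=secret",
    table="schema.table_name",
)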

Disclosure: I'm one of the authors of spark-redshift.

Solution 4

You first need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/

You can either define the SPARK_CLASSPATH environment variable in .bashrc, conf/spark-env.sh, or a similar file, or set it in the script before you start your IPython notebook, as sketched below.
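
For example, one commonly used pattern is to set the submit arguments from Python before the SparkContext is created (the jar path below is a placeholder, and in Spark 1.4+ the value is expected to end with pyspark-shell):

import os

# Placeholder jar path; this must be set before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path /path/to/postgresql-9.4-1201.jdbc41.jar "
    "--jars /path/to/postgresql-9.4-1201.jdbc41.jar "
    "pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext("local[*]", "redshift-jdbc")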

You can also define it in your conf/spark-defaults.conf in the following way:

spark.driver.extraClassPath  /path/to/file/postgresql-9.4-1201.jdbc41.jar

Make sure it is reflected in the Environment tab of your Spark WebUI.

You will also need to set appropriate AWS credentials in the following way:

sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")

Comments

  • Evan Zamir

    I'm trying to connect to Amazon Redshift via Spark so I can join data we have on S3 with data on our Redshift cluster. I found some very spartan documentation here on connecting to JDBC sources:

    https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

    The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here, maybe in the options?).

    df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
    

    And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define that so that Spark loads it?

    Anyway, when I try running these commands, I get a bunch of undecipherable errors, so I'm kind of stuck for now. Any help or pointers to detailed tutorials are appreciated.