Reading Avro File in Spark


For a DataFrame I'd go with the Avro data source directly:

  • Include spark-avro in your packages list (see the launch command after these steps). For the latest version use:

    com.databricks:spark-avro_2.11:3.2.0
    
  • Load the file:

    val df = spark.read
      .format("com.databricks.spark.avro")
      .load(path)
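
You can pull the package in when launching the shell (the same --packages flag works for spark-submit); Spark resolves the artifact from Maven Central for you:

    spark-shell --packages com.databricks:spark-avro_2.11:3.2.0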
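
Note that spark.read assumes the Spark 2.x SparkSession entry point. On Spark 1.6 (which the question below uses) there is no spark value; a rough equivalent goes through sqlContext instead, assuming the older 2.0.1 line of spark-avro, which as far as I recall is the one built against Spark 1.x:

    // Spark 1.6: use the SQLContext provided by the shell
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load(path)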
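
If you already have the AvroWrapper RDD from hadoopFile, as in the question below, one way to get a DataFrame on 1.6 is to unwrap each GenericRecord and call toDF. This is a minimal sketch, not the only route; the column names and string types are assumptions taken from the question's sample output:

    import sqlContext.implicits._

    val df = avroRDD
      .map { case (wrapper, _) =>
        val record = wrapper.datum()        // unwrap the GenericRecord
        (record.get("column1").toString,    // assumed column names/types
         record.get("column2").toString,
         record.get("column3").toString)
      }
      .toDF("column1", "column2", "column3")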
    

Comments

  • Gayatri · almost 4 years ago

    I have read an Avro file into a Spark RDD and need to convert it into a SQL DataFrame. How do I do that?

    This is what I have done so far:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    
    val path = "hdfs://dds-nameservice/user/ghagh/"
    val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
    

    When I do:

    avroRDD.take(1)
    

    I get back

    res1: Array[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)] = Array(({"column1": "value1", "column2": "value2", "column3": value3,...
    

    How do I convert this to a Spark SQL DataFrame?

    I am using Spark 1.6.

    Can anyone tell me if there is an easy solution for this?