how to read excel data into a dataframe in spark/scala


Solution 1

Answer to your question 2: instead of ', you need to use backticks (`) around column names that contain spaces. Try the query below; it will work:

val expLevel = sc.sqlContext.sql("Select `time_spend_company (Years)` as `Years_spent_in_company`,count(1) from EMP where left_company = 1 group by `time_spend_company (Years)`")

Question 1: Loading Excel using "com.crealytics.spark.excel" is fine; I am also using it, and there are other options too. To assign different column names, you can use a StructType to define the schema and impose it while loading the data into the DataFrame, e.g.:

import org.apache.spark.sql.types._

val newSchema = StructType(
    List(StructField("a", IntegerType, nullable = true),
         StructField("b", IntegerType, nullable = true),
         StructField("c", IntegerType, nullable = true),
         StructField("d", IntegerType, nullable = true))
  )

val employeesDF = spark.read.schema(newSchema)
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()

The first four columns can now be accessed as a, b, c and d. Run the query below; it will work with the new column names.

sc.sqlContext.sql("select a,b,c,d from EMP").show()
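Note that the SQL above assumes the temp view has been refreshed after reloading with the imposed schema. A minimal self-contained sketch, where toy in-memory rows stand in for the Excel-loaded DataFrame (an assumption purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("emp").master("local[*]").getOrCreate()
import spark.implicits._

// Toy rows stand in for the Excel-loaded DataFrame with the imposed schema
val employeesDF = Seq((1, 2, 3, 4), (5, 6, 7, 8)).toDF("a", "b", "c", "d")

// Re-register the temp view so SQL resolves against the new column names
employeesDF.createOrReplaceTempView("EMP")
spark.sql("select a, b, c, d from EMP").show()
```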

Solution 2

For version 0.13.5 you will need a different set of parameters:

def readExcel(file: String): DataFrame = {
    sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'sheet_name'!A1") // Optional, default: "A1"
      .option("header", "true") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
      .load(file)
  }

maven dependency:

<dependency>
  <groupId>com.crealytics</groupId>
  <artifactId>spark-excel_2.11</artifactId>
  <version>0.13.5</version>
</dependency>
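If you build with sbt instead of Maven, the equivalent coordinate (assuming a Scala 2.11 build, matching the artifact above) would be:

```scala
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"
```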

Solution 3

  1. Spark has good support for working with CSVs. So if your Excel file has just one sheet, you can convert it to CSV by simply renaming EmpDatasets.xlsx to EmpDatasets.csv.

Once you have your file as CSV, you can read it with spark.read.csv(pathToCSV) and can supply many options, e.g. to read/skip the header, or supply the schema of the dataset via spark.read.schema(schema).csv(pathToCSV).

Here the schema can be created manually (as in Solution 1), or can be extracted from a case class using Spark SQL Encoders: Encoders.product[case_class_name].schema
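As a sketch of the Encoders approach (the Emp case class and its fields are hypothetical, loosely mirroring the question's columns):

```scala
import org.apache.spark.sql.Encoders

// Hypothetical case class standing in for the dataset's columns
case class Emp(time_spend_company: Int, left_company: Int, salary: String)

// Derive a StructType schema from the case class fields
val empSchema = Encoders.product[Emp].schema
println(empSchema.fieldNames.mkString(", "))
```

The derived schema can then be passed to spark.read.schema(empSchema).csv(pathToCSV).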

  2. You can remove spaces from the column names like:

import org.apache.spark.sql.functions.col

val employeesDFColumns = employeesDF.columns.map(x => col(x).as(x.replaceAll(" ", "")))

And apply these new column names on the dataframe.

val employeesDF2 = employeesDF.select(employeesDFColumns:_*)
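The renaming transform itself is plain string manipulation; a minimal standalone sketch on a couple of the question's headers (sample values only):

```scala
// Sample headers from the question; replaceAll(" ", "") is the same
// transform applied per column above
val headers = Seq("time_spend_company (Years)", "average_monthly_hours (hours)", "salary")
val cleaned = headers.map(_.replaceAll(" ", ""))
println(cleaned.mkString(", "))
```

Note that the parentheses remain; if they also get in the way of SQL, they can be stripped the same way.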

Author: Krishnan

Updated on June 05, 2022

Comments

  • Krishnan
    Krishnan almost 2 years

    I have a requirement wherein I need to read an Excel file (with the .xlsx extension) in Spark/Scala. I need to create a DataFrame with the data read from the Excel file and apply/write SQL queries on top of it to do some analysis. The Excel file has some column headers/titles like "time_spend_company (Years)", "average_monthly_hours (hours)", etc. which have spaces in the headers themselves; these spaces are causing me problems when applying any SQL queries on the loaded DataFrame.

    I am using the com.crealytics.spark.excel library to parse the Excel contents, and my code looks like this:

    val empFile = "C:\\EmpDatasets.xlsx"
    
    val employeesDF = sc.sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1")
      .option("useHeader", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("inferSchema", "false")
      .option("location", empFile)
      .option("addColorColumns", "False")
      .load()
    
    employeesDF.createOrReplaceTempView("EMP")
    

    I want to apply group by and other aggregate functions on these columns, and I am facing issues with them as shown below; my requirement is to apply group by on the time_spend_company column and get a count of it.

    val expLevel = sc.sqlContext.sql("Select 'time_spend_company (Years)' as 'Years_spent_in_company',count(1) from EMP where left_company = 1 group by 'time_spend_company (Years)'")
    expLevel.show
    

    I need help on:

    1. Is there any better way to load the Excel file, assign custom column names to it and create a DataFrame?
    2. How to write SQL queries for these column names which have spaces in them?

    Note: I need to read it as an Excel file only; I can't convert it into CSV or any other file format.

  • Krishnan
    Krishnan over 6 years
    I need to read it as an Excel file only; I can't convert it into CSV or any other file format. Can you please suggest anything for reading/parsing Excel files?
  • Santoshi M
    Santoshi M over 6 years
    The library you are using for Excel seems perfectly alright, and it is one of the recommended ways to read Excel. What problems are you facing in using it?
  • Krishnan
    Krishnan over 6 years
    Please refer to my question: I need to assign custom column names and create a DataFrame, and how do I write SQL queries for column names which have spaces in them?
  • Krishnan
    Krishnan over 6 years
    I tried defining the schema as per your suggestion and loaded the dataset, but I am getting an exception like below: Exception in thread "main" org.apache.spark.sql.AnalysisException: com.crealytics.spark.excel does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:313) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146) at EmpPerformanceAnalysis$.main(EmpPerformanceAnalysis.scala:58)
  • Krishnan
    Krishnan over 6 years
    val empSchema = StructType( List(StructField("satisfaction_level", FloatType, nullable = true), StructField("last_evaluation", FloatType, nullable = true), .. StructField("salary", StringType, nullable = true)) ) val employeesDF = sc.sqlContext.read.schema(empSchema) .format("com.crealytics.spark.excel") .option("sheetName", "Sheet1") .option("useHeader", "true") .option("treatEmptyValuesAsNulls", "false") .option("inferSchema", "false") .option("location", empFile) .option("addColorColumns", "False") .load()
  • Amit Kumar
    Amit Kumar over 6 years
    @Krishnan I am running the same on Spark 2.1.0 and Scala 2.11. I am not getting the exception that you have mentioned. For the code you have posted, I am getting the exception "java.lang.IllegalArgumentException: Parameter "path" is missing in options." Then I added the path option in place of the location option and it works fine.
  • Amit Kumar
    Amit Kumar over 6 years
    This is what I am running: val employeesDF2 = spark.sqlContext.read.schema(empSchema).format("com.crealytics.spark.excel").option("sheetName", "Sheet1").option("useHeader", "true").option("treatEmptyValuesAsNulls", "false").option("inferSchema", "false").option("location", "csvdatatest/Book1.xlsx").option("addColorColumns", "False").load("csvdatatest/Book1.xlsx")
  • Amit Kumar
    Amit Kumar over 6 years
    printSchema result is: scala> employeesDF2.printSchema root |-- satisfaction_level: float (nullable = true) |-- last_evaluation: float (nullable = true) |-- salary: string (nullable = true)
  • Amit Kumar
    Amit Kumar over 6 years
    This is the excel library version : <dependency> <groupId>com.crealytics</groupId> <artifactId>spark-excel_2.11</artifactId> <version>0.9.8</version> </dependency>
  • Krishnan
    Krishnan over 6 years
    The version was the problem; after adding the latest lib it worked fine, thanks