How to read Excel data into a DataFrame in Spark/Scala
Solution 1
Answer to your question 2: instead of using ',
you need to use ` (backticks) before and after column names that contain spaces. Try the query below; it will work:
val expLevel = sc.sqlContext.sql("Select `time_spend_company (Years)` as `Years_spent_in_company`,count(1) from EMP where left_company = 1 group by `time_spend_company (Years)`")
Question 1: Loading Excel using "com.crealytics.spark.excel" is fine; I am also using it, although there are other options too. To assign different column names, you can use StructType to define the schema and impose it while loading the data into the DataFrame, e.g.
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val newSchema = StructType(
  List(StructField("a", IntegerType, nullable = true),
    StructField("b", IntegerType, nullable = true),
    StructField("c", IntegerType, nullable = true),
    StructField("d", IntegerType, nullable = true))
)
val employeesDF = spark.read.schema(newSchema)
.format("com.crealytics.spark.excel")
.option("sheetName", "Sheet1")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "false")
.option("inferSchema", "false")
.option("location", empFile)
.option("addColorColumns", "False")
.load()
The first four columns can now be accessed as a, b, c, and d. Run the query below; it will work with the new column names.
sc.sqlContext.sql("select a,b,c,d from EMP").show()
Solution 2
For version 0.13.5 you will need a different set of parameters:
def readExcel(file: String): DataFrame = {
sqlContext.read
.format("com.crealytics.spark.excel")
.option("dataAddress", "'sheet_name'!A1") // Optional, default: "A1"
.option("header", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
.option("addColorColumns", "false") // Optional, default: false
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, d[#All]efault None. If set, uses a streaming reader which can help with big files
.load(file)
}
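A minimal usage sketch with the helper above, reusing the EMP view and the column names from the question (the .xlsx path is the one from the question; adjust it and the sheet name in dataAddress to your environment):
val employeesDF = readExcel("C:\\EmpDatasets.xlsx")
employeesDF.createOrReplaceTempView("EMP")
// Backtick-quote column names that contain spaces, as in Solution 1.
sqlContext.sql("select `time_spend_company (Years)`, count(1) from EMP where left_company = 1 group by `time_spend_company (Years)`").show()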
Maven dependency:
<dependency>
<groupId>com.crealytics</groupId>
<artifactId>spark-excel_2.11</artifactId>
<version>0.13.5</version>
</dependency>
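If your build uses sbt instead of Maven, the equivalent dependency should be the following (a sketch; %% picks the spark-excel_2.11 or _2.12 artifact to match your scalaVersion):
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"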
Solution 3
- Spark has good support for working with CSVs. So if your Excel file has just one sheet, you can convert it to CSV by simply renaming EmpDatasets.xlsx to EmpDatasets.csv. Use this to do it.
Once you have your file as a CSV, you can read it as spark.read.csv(pathToCSV) and can supply many options, e.g. to read/skip the header, or to supply the schema of the dataset as spark.read.schema(schema).csv(pathToCSV).
Here the schema can be created as described here, or extracted from a case class using Spark SQL Encoders: Encoders.product[case_class_name].schema (see the sketch after this list).
- You can remove the spaces from the column names like:
import org.apache.spark.sql.functions.col
val employeesDFColumns = employeesDF.columns.map(x => col(x).as(x.replaceAll(" ", "")))
And apply these new column names to the DataFrame:
val employeesDF2 = employeesDF.select(employeesDFColumns: _*)
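A rough sketch of the CSV route from point 1, assuming a hypothetical Employee case class (adjust the fields to your own columns) and the renamed C:\\EmpDatasets.csv file:
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().appName("EmpAnalysis").getOrCreate()

// Hypothetical case class mirroring the dataset; adjust the fields to your columns.
case class Employee(satisfaction_level: Float, last_evaluation: Float, left_company: Int)

// Derive the schema from the case class and read the renamed CSV with a header row.
val empSchema = Encoders.product[Employee].schema
val employeesDF = spark.read
  .option("header", "true")
  .schema(empSchema)
  .csv("C:\\EmpDatasets.csv")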
Krishnan
Updated on June 05, 2022
Comments
-
Krishnan almost 2 years
I have a requirement wherein I need to read an Excel file (with the .xlsx extension) in Spark/Scala. I need to create a DataFrame with the data read from the Excel file and apply/write SQL queries on top of it to do some analysis. The Excel file has some column headers/titles like "time_spend_company (Years)", "average_monthly_hours (hours)" etc. which have spaces in the headers themselves; these spaces are causing me problems when applying any SQL queries on the loaded DataFrame.
I am using the com.crealytics.spark.excel library to parse the Excel contents, and my code looks like below:
val empFile = "C:\\EmpDatasets.xlsx"
val employeesDF = sc.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()
employeesDF.createOrReplaceTempView("EMP")
I want to apply group by and other aggregate functions on these columns, and I am facing issues like below. My requirement is to apply a group by on the time_spend_company column and get a count of it.
val expLevel = sc.sqlContext.sql("Select 'time_spend_company (Years)' as 'Years_spent_in_company',count(1) from EMP where left_company = 1 group by 'time_spend_company (Years)'")
expLevel.show
I need help on:
- Is there any better way to load the Excel file, assign custom column names to it, and create a DataFrame?
- How to write SQL queries for these column names which have spaces in them?
Note: I need to read it as an Excel file only; I can't convert it into CSV or any other file format.
-
Krishnan over 6 years: I need to read it as an Excel file only, I can't convert it into CSV or any other file formats. Can you please suggest anything on reading/parsing Excel files?
-
Santoshi M over 6 years: The library you are using for Excel seems perfectly alright, and is one of the recommended ways to read Excel. What problems are you facing in using it?
-
Krishnan over 6 years: Please refer to my question; I need to assign custom column names to it and create a DataFrame. How do I write SQL queries for these column names which have spaces in them?
-
Krishnan over 6 years: I tried defining the schema as per your suggestion and loaded the dataset, but I am getting an exception like below: Exception in thread "main" org.apache.spark.sql.AnalysisException: com.crealytics.spark.excel does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:313) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146) at EmpPerformanceAnalysis$.main(EmpPerformanceAnalysis.scala:58)
-
Krishnan over 6 years: val empSchema = StructType( List(StructField("satisfaction_level", FloatType, nullable = true), StructField("last_evaluation", FloatType, nullable = true), .. StructField("salary", StringType, nullable = true)) ) val employeesDF = sc.sqlContext.read.schema(empSchema) .format("com.crealytics.spark.excel") .option("sheetName", "Sheet1") .option("useHeader", "true") .option("treatEmptyValuesAsNulls", "false") .option("inferSchema", "false") .option("location", empFile) .option("addColorColumns", "False") .load()
-
Amit Kumar over 6 years: @Krishnan I am running the same on Spark 2.1.0 and Scala 2.11. I am not getting the exception that you have mentioned. For the code you have posted, I am getting the exception "java.lang.IllegalArgumentException: Parameter "path" is missing in options.". Then I added the path option in place of the location option and it works fine.
-
Amit Kumar over 6 years: This is what I am running: val employeesDF2 = spark.sqlContext.read.schema(empSchema).format("com.crealytics.spark.excel").option("sheetName", "Sheet1").option("useHeader", "true").option("treatEmptyValuesAsNulls", "false").option("inferSchema", "false").option("location", "csvdatatest/Book1.xlsx").option("addColorColumns", "False").load("csvdatatest/Book1.xlsx")
-
Amit Kumar over 6 years: The printSchema result is: scala> employeesDF2.printSchema root |-- satisfaction_level: float (nullable = true) |-- last_evaluation: float (nullable = true) |-- salary: string (nullable = true)
-
Amit Kumar over 6 years: This is the Excel library version: <dependency> <groupId>com.crealytics</groupId> <artifactId>spark-excel_2.11</artifactId> <version>0.9.8</version> </dependency>
-
Krishnan over 6 years: The version was the problem; after adding the latest lib it worked fine, thanks.