How to read files with .xlsx and .xls extensions in Azure Data Factory?
Solution 1
Azure Data Factory V2 has recently released an update that supports parsing Excel (.xls and .xlsx) files on existing connectors.
Currently, the connectors that support Excel files are:
- Amazon S3
- Azure Blob
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure File Storage
- File System
- FTP
- Google Cloud Storage
- HDFS
- HTTP
- SFTP
More details can be found here: https://docs.microsoft.com/en-us/azure/data-factory/format-excel
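As a rough illustration, the sketch below builds a dataset definition that uses the Excel format instead of delimited text, for a file in Azure Blob storage. The property names (`sheetName`, `firstRowAsHeader`, `AzureBlobStorageLocation`) follow my reading of the format documentation linked above; the dataset name, linked service name, container, file name and sheet name are hypothetical placeholders, and the `excel_dataset` helper is only for illustration.

```python
import json


def excel_dataset(name, linked_service, container, file_name, sheet_name,
                  first_row_as_header=True):
    """Build an ADF dataset definition (as a dict) that uses the Excel format."""
    return {
        "name": name,
        "properties": {
            # "Excel" selects the Excel parser rather than the delimited-text one.
            "type": "Excel",
            "linkedServiceName": {
                "referenceName": linked_service,  # hypothetical linked service
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": container,
                    "fileName": file_name,
                },
                "sheetName": sheet_name,
                "firstRowAsHeader": first_row_as_header,
            },
        },
    }


# All argument values below are placeholders, not names from the question.
definition = excel_dataset(
    name="ExcelSourceDataset",
    linked_service="MyBlobStorageLinkedService",
    container="input",
    file_name="Filename.xlsx",
    sheet_name="Sheet1",
)
print(json.dumps(definition, indent=2))
```

The printed JSON could be pasted into the dataset's code view and adjusted. No column or row delimiters are involved, because the file is no longer parsed as delimited text.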
Solution 2
Excel files have a proprietary format and are not simple delimited files. As indicated here, Azure Data Factory does not have a direct option to import Excel files, e.g. you cannot create a Linked Service to an Excel file and read it easily. Your options are:
- Export or convert the data to flat files, e.g. before transfer to the cloud. Formats such as .csv, tab-delimited and pipe-delimited are easier to read than Excel files. This is your simplest option, although it obviously requires a change in process.
- Try shredding the XML: create a custom task to open the Excel file as XML and extract your data, as suggested here.
- SSIS packages are now supported in Azure Data Factory (with the Execute SSIS Package activity) and have better support for Excel files, e.g. a Connection Manager. So it may be an option to create an SSIS package to deal with the Excel file and host it in ADFv2. Warning! I have not tested this; I am only speculating that it is possible. There is also the overhead of creating an Integration Runtime (IR) for running SSIS in ADFv2.
- Try some other custom activity, e.g. there is a custom U-SQL Extractor for shredding XML on GitHub here.
- Try reading the Excel file using Databricks (some examples here), although spinning up a Spark cluster to read a few Excel files does seem somewhat overkill. This might be a good option if Spark is already in your architecture.
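The first two options (convert to a flat file, shred the XML) can be sketched together using only the Python standard library. This is a minimal sketch, not a full Excel reader. It assumes an .xlsx file, which is a zip archive of XML parts; the older binary .xls format cannot be read this way. It also assumes the first worksheet is stored at `xl/worksheets/sheet1.xml`, which is common but not guaranteed. Dates and other formatted numbers come out as their raw stored values. The function names are hypothetical.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by .xlsx worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
T_TAG = "{%s}t" % NS["m"]


def column_index(cell_ref):
    """Convert a cell reference such as 'B3' to a zero-based column index."""
    index = 0
    for ch in cell_ref:
        if ch.isalpha():
            index = index * 26 + (ord(ch.upper()) - ord("A") + 1)
    return index - 1


def read_xlsx_rows(source, sheet_path="xl/worksheets/sheet1.xml"):
    """Return the rows of one worksheet as lists of strings.

    sheet_path is an assumption; the authoritative sheet-to-file mapping
    lives in xl/workbook.xml and its relationships part.
    """
    with zipfile.ZipFile(source) as archive:
        shared = []
        if "xl/sharedStrings.xml" in archive.namelist():
            root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for si in root.findall("m:si", NS):
                # Join all text elements so multi-run strings stay whole.
                shared.append("".join(t.text or "" for t in si.iter(T_TAG)))
        sheet = ET.fromstring(archive.read(sheet_path))

    rows = []
    for row in sheet.findall("m:sheetData/m:row", NS):
        values = []
        for cell in row.findall("m:c", NS):
            ref = cell.get("r")
            if ref:
                # Pad skipped (empty) cells so columns stay aligned.
                while len(values) < column_index(ref):
                    values.append("")
            v = cell.find("m:v", NS)
            if cell.get("t") == "s" and v is not None:
                # Shared string: <v> holds an index into sharedStrings.xml.
                values.append(shared[int(v.text)])
            elif cell.get("t") == "inlineStr":
                values.append("".join(t.text or "" for t in cell.iter(T_TAG)))
            else:
                # Numbers and other raw values are stored directly in <v>.
                values.append(v.text if v is not None and v.text else "")
        rows.append(values)
    return rows


def xlsx_to_csv(source, target):
    """Write one worksheet of an .xlsx file to an open CSV text stream."""
    csv.writer(target).writerows(read_xlsx_rows(source))
```

For example, `with open("out.csv", "w", newline="") as f: xlsx_to_csv("Filename.xlsx", f)` would produce a comma-delimited file that a delimited-text dataset can read. Code like this would have to run before or outside the copy step, e.g. in a custom activity.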
Let us know how you get on.
vikas shivakumar
Updated on June 15, 2022
Comments
- vikas shivakumar almost 2 years
I am trying to read an Excel file with the .xlsx extension from Azure Blob Storage in my Azure Data Factory dataset. It throws the following error:
Error found when processing 'Csv/Tsv Format Text' source 'Filename.xlsx' with row number 3: found more columns than expected column count: 1.
What are the right column and row delimiters for Excel files to be read in Azure Data Factory?