Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column


Solution 1

The problem is with the JSON file. The file "D:/playground/input.json" looks like this, as you described:

{
  "a": {
  "b": 1
  }
}

This is not valid input for Spark's default JSON reader. Spark processes JSON data line by line and expects each line to be a complete JSON document, so parsing this file fails.

You should keep the complete JSON on a single line in compact form by removing all whitespace and newlines.

Like

{"a":{"b":1}}

If you want multiple JSON objects in a single file, keep them like this:

{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} ...

For more information, see the Spark documentation on the JSON data source (JSON Lines format).

Solution 2

You may try either of these two ways.

Option-1: JSON in a single line, as answered above by @Avishek Bhattacharya.

Option-2: Add the multiline option when reading the JSON, as shown in the code below. You can then select the nested attribute directly.

// multiline lets a single JSON document span several lines
val df = spark.read.option("multiline", "true").json("C:\\data\\nested-data.json")
// select the nested attribute with dot notation
df.select("a.b").show()

Here is the output for Option-2.

20/07/29 23:14:35 INFO DAGScheduler: Job 1 finished: show at NestedJsonReader.scala:23, took 0.181579 s
+---+
|  b|
+---+
|  1|
+---+
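
To confirm how Spark inferred the nested structure, you can also print the schema of the same DataFrame (a small optional addition, not part of the original example):

// "a" is inferred as a struct containing the nested field "b"
df.printSchema()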

Comments

  • Alon, almost 2 years ago

    I have a json file:

    {
      "a": {
        "b": 1
      }
    }
    

    I am trying to read it:

    val path = "D:/playground/input.json"
    val df = spark.read.json(path)
    df.show()
    

    But I am getting an error:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). For example: spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count() and spark.read.schema(schema).json(file).select("_corrupt_record").show(). Instead, you can cache or save the parsed results and then send the same query. For example, val df = spark.read.schema(schema).json(file).cache() and then df.filter($"_corrupt_record".isNotNull).count().;

    So I tried to cache it as they suggest:

    val path = "D:/playground/input.json"
    val df = spark.read.json(path).cache()
    df.show()
    

    But I keep getting the same error.