pyspark: ValueError: Some of types cannot be determined after inferring
Solution 1
To infer a field's type, PySpark looks at the non-None records in that field. If a field contains only None records, PySpark cannot infer the type and raises this error.
Manually defining a schema resolves the issue:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+
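For contrast, a minimal sketch (not from the original answer): inference succeeds as soon as at least one record carries a non-None value for the field.
>>> df = spark.createDataFrame([[None], ["bar"]], ["foo"])
>>> df.printSchema()
root
 |-- foo: string (nullable = true)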
Solution 2
Another way to fix this problem is to provide your own schema.
For example:
To reproduce the error:
>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
To fix the error:
>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
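If the data starts out in pandas (as in the question below), the schema can also be derived from the pandas dtypes instead of written by hand. A sketch, not from the original answers; mapping every object column to StringType is an assumption about their contents, and spark is assumed to be an active SparkSession:
>>> # Sketch: build the Spark schema from the pandas dtypes.
>>> # Assumption: object columns should become strings.
>>> from pyspark.sql.types import StructType, StructField, StringType, LongType
>>> dtype_map = {"int64": LongType(), "object": StringType()}
>>> schema = StructType([StructField(name, dtype_map[str(dtype)], True)
...                      for name, dtype in my_df.dtypes.items()])
>>> spark_my_df = spark.createDataFrame(my_df, schema=schema)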
Solution 3
If you are using the monkey-patched RDD[Row].toDF() method, you can increase the sample ratio to check more than 100 records when inferring the schema:
# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()
Assuming there are non-null rows in all fields of your RDD, it becomes more likely that they are found as you increase sampleRatio towards 1.0.
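For illustration, a made-up sketch (not part of the original answer) of a case where default inference fails but sampling succeeds:
from pyspark.sql import Row

# The first 200 rows have score=None, so default inference (which checks
# only the first rows) cannot determine the type; sampling the whole RDD
# finds the float at the end.
rows = [Row(name="a", score=None)] * 200 + [Row(name="b", score=1.0)]
rdd = spark.sparkContext.parallelize(rows)
df = rdd.toDF(sampleRatio=1.0)
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- score: double (nullable = true)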
Edamame

Updated on November 13, 2020

Comments

Edamame over 3 years:
I have a pandas data frame my_df, and my_df.dtypes gives us:

ts        int64
fieldA    object
fieldB    object
fieldC    object
fieldD    object
fieldE    object
dtype: object

Then I am trying to convert the pandas data frame my_df to a Spark data frame by doing:

spark_my_df = sc.createDataFrame(my_df)
However, I got the following error:
ValueError                                Traceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322

ValueError: Some of types cannot be determined after inferring
Does anyone know what the above error means? Thanks!
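A quick diagnostic for this case (a sketch, assuming my_df is the pandas frame above): list the columns that are entirely null, since those are the ones inference cannot resolve, then supply a schema as in the solutions above.
# Sketch: columns that contain only nulls are the ones PySpark
# cannot infer a type for.
print(my_df.columns[my_df.isnull().all()])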