pyspark: ValueError: Some of types cannot be determined after inferring

Solution 1

In order to infer a field's type, PySpark looks at the non-None records in that field. If a field contains only None records, PySpark cannot infer the type and raises this error.
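
For illustration, inference succeeds as soon as a field has at least one non-None record (a minimal sketch; the column name foo is arbitrary):

>>> df = spark.createDataFrame([[None], ["bar"]], ["foo"])  # "bar" lets PySpark infer StringType
>>> df.printSchema()
root
 |-- foo: string (nullable = true)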

Manually defining a schema resolves the issue:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+

Solution 2

As in Solution 1, you can fix the problem by providing your own schema.

To reproduce the error:

>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
Traceback (most recent call last):
  ...
ValueError: Some of types cannot be determined after inferring

To fix the error:

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
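
The same approach applies when converting a pandas data frame, as in the question below: pass an explicit schema so the all-null object columns never need to be inferred. A sketch, assuming ts is a long and the object fields should become strings:

>>> from pyspark.sql.types import StructType, StructField, LongType, StringType
>>> schema = StructType(
...     [StructField("ts", LongType(), True)] +
...     [StructField(f, StringType(), True)
...      for f in ["fieldA", "fieldB", "fieldC", "fieldD", "fieldE"]])
>>> spark_my_df = spark.createDataFrame(my_df, schema=schema)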

Solution 3

If you are using the monkey-patched RDD[Row].toDF() method, you can increase the sample ratio so that more than the first 100 records are checked when inferring types:

# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming all fields in your RDD contain some non-null rows, increasing the sampleRatio towards 1.0 makes it more likely that those rows will be found.
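
For context, a minimal end-to-end sketch; the RDD contents here are made up so that the only non-null row sits beyond the first 100 records:

>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([Row(name=None, score=None)] * 200 + [Row(name="a", score=1.0)])
>>> rdd.toDF()                         # raises ValueError: only all-None rows are checked
>>> rdd.toDF(sampleRatio=1.0).show()   # scans every record, so the non-null row is found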

Comments

  • Edamame over 3 years

    I have a pandas data frame my_df, and my_df.dtypes gives us:

    ts              int64
    fieldA         object
    fieldB         object
    fieldC         object
    fieldD         object
    fieldE         object
    dtype: object
    

    Then I am trying to convert the pandas data frame my_df to a Spark data frame by doing the following:

    spark_my_df = sc.createDataFrame(my_df)
    

    However, I got the following errors:

    ValueErrorTraceback (most recent call last)
    <ipython-input-29-d4c9bb41bb1e> in <module>()
    ----> 1 spark_my_df = sc.createDataFrame(my_df)
          2 spark_my_df.take(20)
    
    /usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
        520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
        521         else:
    --> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
        523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
        524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    
    /usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
        384 
        385         if schema is None or isinstance(schema, (list, tuple)):
    --> 386             struct = self._inferSchemaFromList(data)
        387             if isinstance(schema, (list, tuple)):
        388                 for i, name in enumerate(schema):
    
    /usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
        318         schema = reduce(_merge_type, map(_infer_schema, data))
        319         if _has_nulltype(schema):
    --> 320             raise ValueError("Some of types cannot be determined after inferring")
        321         return schema
        322 
    
    ValueError: Some of types cannot be determined after inferring
    

    Does anyone know what the above error means? Thanks!