Why does dropna() not work?

tl;dr The methods na and dropna are only available since Spark 1.3.1.
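Since the fix depends on the Spark version, a quick check can save some guessing. This is only a sketch: `supports_dropna` is a hypothetical helper (not part of PySpark), and it assumes the version string starts with dot-separated numbers, possibly followed by a vendor suffix.

```python
def supports_dropna(version):
    """Return True if a Spark version string is at least 1.3.1.

    Hypothetical helper for illustration; not part of PySpark.
    """
    parts = []
    # Look at major, minor, patch only; ignore suffixes such as "-cdh5.4.0".
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "1.3" to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= (1, 3, 1)


print(supports_dropna("1.3.0"))  # the version reported in the question
print(supports_dropna("1.3.1"))
```

In a pyspark shell you would pass in the running version, e.g. `supports_dropna(sc.version)`, assuming your build exposes `sc.version`.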

A few mistakes you made:

  1. In data = sc.parallelize([....('',75,'', 7 )]) you intended '' to represent None; however, '' is just an empty string, not null, so it is not treated as a missing value.

  2. na and dropna are both members of the DataFrame class, so you should call them on your df, not on the RDD.
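To illustrate the first point, here is one possible way to turn empty strings into real nulls before building the DataFrame. It is a minimal sketch in plain Python; `blank_to_none` is a name made up for this example, and the sample rows are shortened from the question.

```python
def blank_to_none(row):
    """Replace empty strings in a row tuple with None (hypothetical helper)."""
    return tuple(None if value == '' else value for value in row)


rows = [('Foo', 41, 'US', 3),
        ('', 75, '', 7)]

# Every '' becomes None; all other values are left untouched.
cleaned = [blank_to_none(row) for row in rows]
print(cleaned)
```

On an RDD the same function could be applied with `data.map(blank_to_none)` before calling `createDataFrame`.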

Runnable Code:

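from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# sc is the SparkContext that the pyspark shell provides
sqlContext = SQLContext(sc)

# Missing values are real None objects, not empty strings
data = sc.parallelize([('Foo',41,'US',3),
                       ('Foo',39,'UK',1),
                       ('Bar',57,'CA',2),
                       ('Bar',72,'CA',3),
                       ('Baz',22,'US',6),
                       (None, 75, None, 7)])

schema = StructType([StructField('Name', StringType(), True),
                     StructField('Age', IntegerType(), True),
                     StructField('Country', StringType(), True),
                     StructField('Score', IntegerType(), True)])

df = sqlContext.createDataFrame(data,schema)

# Both calls drop the row that contains nulls (Spark 1.3.1 or later)
df.dropna().show()
df.na.drop().show()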
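
If upgrading from Spark 1.3.0 is not an option, one possible workaround (an assumption on my part, not something tested on the Cloudera VM) is to drop the incomplete rows before the DataFrame is created. The predicate below is plain Python; `has_no_nulls` is a hypothetical name.

```python
def has_no_nulls(row):
    """Return True when no field in the row is None (hypothetical helper)."""
    return all(value is not None for value in row)


rows = [('Foo', 41, 'US', 3),
        ('Baz', 22, 'US', 6),
        (None, 75, None, 7)]

# Keep only the rows in which every field is present.
complete = [row for row in rows if has_no_nulls(row)]
print(complete)
```

With the RDD from the code above, the equivalent would be `data.filter(has_no_nulls)` followed by `sqlContext.createDataFrame(...)` on the filtered RDD.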

Author: Jason

Updated on June 07, 2022

Comments

  • Jason, almost 2 years ago

    System: Spark 1.3.0 (Anaconda Python dist.) on Cloudera Quickstart VM 5.4

    Here's a Spark DataFrame:

    from pyspark.sql import SQLContext
    from pyspark.sql.types import *
    sqlContext = SQLContext(sc)
    
    data = sc.parallelize([('Foo',41,'US',3),
                           ('Foo',39,'UK',1),
                           ('Bar',57,'CA',2),
                           ('Bar',72,'CA',3),
                           ('Baz',22,'US',6),
                           (None,75,None,7)])
    
    schema = StructType([StructField('Name', StringType(), True),
                         StructField('Age', IntegerType(), True),
                         StructField('Country', StringType(), True),
                         StructField('Score', IntegerType(), True)])
    
    df = sqlContext.createDataFrame(data,schema)
    

    df.show()

    Name Age Country Score
    Foo  41  US      3    
    Foo  39  UK      1    
    Bar  57  CA      2    
    Bar  72  CA      3    
    Baz  22  US      6    
    null 75  null    7 
    

    However, neither of these works!

    df.dropna()
    df.na.drop()
    

    I get this message:

    >>> df.show()
    Name Age Country Score
    Foo  41  US      3    
    Foo  39  UK      1    
    Bar  57  CA      2    
    Bar  72  CA      3    
    Baz  22  US      6    
    null 75  null    7    
    >>> df.dropna().show()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 580, in __getattr__
        jc = self._jdf.apply(name)
      File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o50.apply.
    : org.apache.spark.sql.AnalysisException: Cannot resolve column name "dropna" among (Name, Age, Country, Score);
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
        at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
        at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
    

    Has anybody else experienced this problem? What's the workaround? PySpark seems to think that I am looking for a column called "na". Any help would be appreciated!