How to test if a column exists and is not null in a DataFrame

python apache-spark dataframe rdd

11,038

Solution 1

if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull)

I'm pretty sure you are going to be throwing a KeyError if column6 does not exist.

You could do something like:

if 'column6' in logsDF.columns:
    if logsDF['column6'].notnull().any():
        logsDF.select("column1","column2","column3","column4","column5","column6")
    else:
        logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

Check to see if column6 exists in logsDF columns first. If so, see if any() value is not null.

Column7 is used if column6 does not exist, or if column6 exists but all values are null.

Editing my own comment: Since python will not evaluate the second condition if the first is False, you can actually do:

if 'column6' in logsDF.columns and logsDF['column6'].notnull().any():
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

as long as the 'column6' in logsDF.columns comes first, the logsDF['column6'] will never evaluate and throw the KeyError if column6 doesn't exist.

Solution 2

if set(['A','C']).issubset(df.columns):
   df['sum'] = df['A'] + df['C']

set([]) can alternatively be constructed with curly braces:

if {'A', 'C'}.issubset(df.columns):

See this question for a discussion of the curly-braces syntax.

Or, you can use a list comprehension, as in:

if all([item in df.columns for item in ['A','C']]):

11,038

Author by

vero

Updated on June 17, 2022

Comments

vero almost 2 years

I have a python RDD:

rddstats = rddstats.filter(lambda x : len(x) == NB_LINE or len(x) == NB2_LINE)

I created a dataframe based on this RDD:

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])

I would like to do a test on the two columns 6 and 7:
if column 6 exists in the dataframe and not null I should return dataframe containing this value of the column 6, else I should return a dataframe that contains the value of the column 7. This following my small code:

logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull):
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")

Is the syntax correct and have I the right to write in Python like this ?