Finding the difference of two columns in Spark dataframes and appending to a new column

11,904

There are multiple issues here.

First if you look at the exception, it basically tells you that there is no "Average Total Payments" column in the dataframe (it also helpfully gives you the columns it sees). It seems the column name read from the csv has an extra space at the end.

Second df("Average Total Payments") and df("Average Medicare Payments") are columns.

You are trying to call show on df("Average medicate payments"). Show is not a member of column (and on dataframe it returns unit so you couldn't do df("Average Total Payments") -df("Average Medicare Payments").show(5) anyway because that would be Column - Unit).

What you want to do is define a new column which is the difference between the two and add it to the dataframe as a new column. Then you want to select just that column and show it. For example:

val sim = df.withColumn("diff",df("Average Total Payments") -df("Average Medicare Payments"))
sim.select("diff").show(5)
Share:
11,904
Admin
Author by

Admin

Updated on June 30, 2022

Comments

  • Admin
    Admin almost 2 years

    Below is my code for loading csv data into dataframe and applying the difference on two columns and appending to a new one using withColumn.The two columns I am trying to find the difference is of kind Double. Please help me in figuring out the following exception:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession
    
    /**
      * Created by Guest1 on 5/10/2017.
      */
    object arith extends App {
      Logger.getLogger("org").setLevel(Level.ERROR)
      Logger.getLogger("akka").setLevel(Level.ERROR)
    
      val spark = SparkSession.builder().appName("Arithmetics").
                    config("spark.master", "local").getOrCreate()
      val df =spark.read.option("header","true")
                      .option("inferSchema",true")
                      .csv("./Input/Arith.csv").persist()
    
    //  df.printSchema()
    val sim =df("Average Total Payments") -df("Average Medicare Payments").show(5)
    }
    

    I am getting the following exception:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Average Total Payments" among (DRG Definition, Provider Id, Provider Name, Provider Street Address, Provider City, Provider State, Provider Zip Code, Hospital Referral Region Description,  Total Discharges ,  Average Covered Charges ,  Average Total Payments , Average Medicare Payments);
        at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
        at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
        at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
        at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
        at arith$.delayedEndpoint$arith$1(arith.scala:19)
        at arith$delayedInit$body.apply(arith.scala:7)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.App$class.main(App.scala:76)
        at arith$.main(arith.scala:7)
        at arith.main(arith.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
    
  • Admin
    Admin almost 7 years
    Thanks for the response. It worked as there are extra spaces in the column names