How to change a dataframe column from String type to Double type in PySpark?


Solution 1

There is no need for a UDF here. Column already provides a cast method that accepts a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations may be supported as well) correspond to the simpleString value. So, for the atomic types:

from pyspark.sql import types

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, complex types:

types.ArrayType(types.IntegerType()).simpleString()   
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'

Solution 2

Preserve the name of the column and avoid adding an extra column by using the same name as the input column:

from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

Solution 3

The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about that) and which the existing answers don't cover.

We can reference the column in a Spark statement with the col("column_name") function:

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

Solution 4

PySpark version:

df = <source data>
df.printSchema()

from pyspark.sql.types import IntegerType

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()

Solution 5

The solution was simple:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: float(x), DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

Comments

  • Abhishek Choudhary
    Abhishek Choudhary about 2 years

    I have a dataframe with column as String. I wanted to change the column type to Double type in PySpark.

    Following is the way I did it:

    toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
    changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))
    

    Just wanted to know whether this is the right way to do it, as I am getting an error while running Logistic Regression, so I wonder whether this is the reason for the trouble.

  • WestCoastProjects
    WestCoastProjects over 6 years
    Thanks, I was looking for how to retain the original column name.
  • alfredox
    alfredox about 6 years
    Is there a list somewhere of the short string data types Spark will recognize?
  • Staza
    Staza over 5 years
    Using the col function also works: from pyspark.sql.functions import col; changedTypedf = joindf.withColumn("label", col("show").cast(DoubleType()))
  • Quetzalcoatl
    Quetzalcoatl over 5 years
    This solution also works splendidly in a loop, e.g. from pyspark.sql.types import IntegerType; for ftr in ftr_list: df = df.withColumn(ftr, df[ftr].cast(IntegerType()))
  • Wirawan Purwanto
    Wirawan Purwanto almost 5 years
    What are the possible values for the cast() argument (the "string" syntax)?
  • Wirawan Purwanto
    Wirawan Purwanto almost 5 years
    I can't believe how terse the Spark docs are on the valid strings for the datatypes. The closest reference I could find was this: docs.tibco.com/pub/sfire-analyst/7.7.1/doc/html/en-US/…
  • hui chen
    hui chen almost 4 years
    How to convert multiple columns in one go?
  • pitchblack408
    pitchblack408 over 3 years
    How do I change nullable to false?
  • ZygD
    ZygD about 2 years
    Thank you! Using 'double' is more elegant than DoubleType(), which also needs to be imported.