How do I increase decimal precision in Spark?
I think the error is pretty self-explanatory: you need to be using a DecimalType, not a DoubleType.
Try this:
...
.cast(DecimalType(6)))
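For example, a minimal sketch of the cast (the DataFrame, the column name, and the precision/scale of 18 and 6 are illustrative assumptions, not values from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().master("local[*]").appName("decimal-cast").getOrCreate()
import spark.implicits._

// Hypothetical data: numeric strings that would overflow a narrow decimal.
val df = Seq("123456.789", "0.5", "98765.4321").toDF("value")

// DecimalType(precision, scale): precision is the total number of digits (max 38),
// scale is how many of those digits sit to the right of the decimal point.
// DecimalType(6) is shorthand for precision 6 with scale 0.
val casted = df.withColumn("value_dec", $"value".cast(DecimalType(18, 6)))

casted.printSchema()
casted.show()

Whatever values you choose, the precision has to be at least as large as the number of digits in the widest value in the data.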
Read on:
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/types/DecimalType.html
http://spark.apache.org/docs/2.0.2/api/python/_modules/pyspark/sql/types.html
datatype for handling big numbers in pyspark
Updated on September 15, 2022

Comments
-
Ross Lewis over 1 year
I have a large DataFrame made up of ~550 columns of doubles and two columns of longs (ids). The 550 columns are read in from a CSV, and I add the two id columns. The only other things I do with the data are to change some of the CSV values from strings to doubles ("Inf" -> "0", then cast the column to double) and to replace NaNs with 0:
df = df.withColumn(col.name + "temp",
  regexp_replace(regexp_replace(df(col.name), "Inf", "0"), "NaN", "0").cast(DoubleType))
df = df.drop(col.name).withColumnRenamed(col.name + "temp", col.name)
df = df.withColumn("timeId", monotonically_increasing_id.cast(LongType))
df = df.withColumn("patId", lit(num).cast(LongType))
df = df.na.fill(0)
When I do a count, I get the following error:
IllegalArgumentException: requirement failed: Decimal precision 6 exceeds max precision 5
There are hundreds of thousands of rows, and I'm reading in the data from multiple CSVs. How do I increase the decimal precision? Is there something else that could be going on? I am only getting this error when I read in some of the CSVs. Could they have more decimals than the others?
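A rough way to check that is to measure how many digits the widest value in a suspect column actually has before it is cast; a small sketch, where the column name "col42" is a placeholder and df is assumed to still hold the raw string values from the CSV:

import org.apache.spark.sql.functions.{length, max, regexp_replace}

// Strip everything except the digits from each value, then take the maximum
// digit count over the column: a rough upper bound on the precision needed.
val digitsOnly = regexp_replace(df("col42"), "[^0-9]", "")
df.select(max(length(digitsOnly)).alias("max_digits_col42")).show()

If the number reported for one CSV is larger than for the others, that file is the one that needs a wider type.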