Spark Dataframe: Select distinct rows

java sql dataframe apache-spark apache-spark-sql

34,898

Solution 1

The problem you face is explicitly stated in the exception message: because MapType columns are neither hashable nor orderable, they cannot be used as part of a grouping or partitioning expression.

Your SQL attempt is not logically equivalent to distinct on a Dataset. If you want to deduplicate data based on a set of compatible columns, you should use dropDuplicates:

df.dropDuplicates("timestamp")

which would be equivalent to

SELECT timestamp, first(c1) AS c1, first(c2) AS c2,  ..., first(cn) AS cn,
       first(canvasHashes) AS canvasHashes
FROM df GROUP BY timestamp
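
To make the "one row per key" semantics concrete, here is a minimal plain-Java sketch that models it without Spark. The class name, the Row shape, and the single payload column c1 are hypothetical and only for illustration. Spark's first() does not promise which row survives unless the order is fixed, so this model simply keeps the first row in input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: models the semantics of deduplicating on one key column.
final class DropDuplicatesSketch {

    // Hypothetical row shape: the key column plus one payload column.
    static final class Row {
        final long timestamp;
        final String c1;

        Row(long timestamp, String c1) {
            this.timestamp = timestamp;
            this.c1 = c1;
        }
    }

    // Keeps the first row encountered for each timestamp, preserving input order.
    static List<Row> dropDuplicatesByTimestamp(List<Row> rows) {
        Map<Long, Row> firstPerKey = new LinkedHashMap<>();
        for (Row row : rows) {
            firstPerKey.putIfAbsent(row.timestamp, row);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(1L, "a"), new Row(2L, "b"), new Row(1L, "c"));
        for (Row row : dropDuplicatesByTimestamp(rows)) {
            System.out.println(row.timestamp + " " + row.c1);
        }
    }
}
```

Rows that share a timestamp but differ in other columns are collapsed into one here, which is why this is not the same thing as a full-row DISTINCT.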

Unfortunately, if your goal is an actual DISTINCT, it won't be so easy. One possible solution is to leverage Scala* Map hashing. You could define a Scala UDF like this:

spark.udf.register("scalaHash", (x: Map[String, String]) => x.##)

and then use it in your Java code to derive a column that can be passed to dropDuplicates:

df
  .selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
  .dropDuplicates(
    // All columns excluding canvasHashes / hash_of_canvas_hashes
    "timestamp", "c1", "c2", ..., "cn",
    // Hash used as surrogate of canvasHashes
    "hash_of_canvas_hashes"
  )

with SQL equivalent

SELECT 
  timestamp, c1, c2, ..., cn,   -- All columns excluding canvasHashes
  first(canvasHashes) AS canvasHashes
FROM df GROUP BY
  timestamp, c1, c2, ..., cn,   -- All columns excluding canvasHashes
  scalaHash(canvasHashes)       -- Hash used as surrogate of canvasHashes

* Please note that java.util.Map with its hashCode won't work, as hashCode is not consistent.
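
The surrogate-column idea can also be modeled in plain Java without Spark. One caveat worth keeping in mind: Scala's ## returns a 32-bit Int, so two different maps can in principle share a hash and be merged by mistake. The sketch below therefore swaps in a different surrogate, a list of the map's entries in sorted key order, rather than the hash used above. All class and method names are hypothetical, and the Row shape is reduced to a single ordinary column plus the map column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: deduplicates rows through a comparable stand-in for a map column.
final class MapSurrogateSketch {

    // Hypothetical row: one ordinary column plus the map-valued column.
    static final class Row {
        final long timestamp;
        final Map<String, String> canvasHashes;

        Row(long timestamp, Map<String, String> canvasHashes) {
            this.timestamp = timestamp;
            this.canvasHashes = canvasHashes;
        }
    }

    // Surrogate for the map: its entries in sorted key order, so maps with
    // equal content give equal lists regardless of insertion order.
    static List<Map.Entry<String, String>> surrogate(Map<String, String> map) {
        return new ArrayList<>(new TreeMap<String, String>(map).entrySet());
    }

    // Keeps the first row for each (timestamp, surrogate) pair, in input order.
    static List<Row> distinctRows(List<Row> rows) {
        Map<List<Object>, Row> firstPerKey = new LinkedHashMap<>();
        for (Row row : rows) {
            List<Object> key =
                Arrays.<Object>asList(row.timestamp, surrogate(row.canvasHashes));
            firstPerKey.putIfAbsent(key, row);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("a", "x");
        first.put("b", "y");
        Map<String, String> sameContent = new LinkedHashMap<>();
        sameContent.put("b", "y");
        sameContent.put("a", "x");
        Map<String, String> different = new HashMap<>();
        different.put("a", "z");

        List<Row> rows = Arrays.asList(
            new Row(1L, first), new Row(1L, sameContent), new Row(1L, different));
        for (Row row : distinctRows(rows)) {
            System.out.println(row.timestamp + " " + surrogate(row.canvasHashes));
        }
    }
}
```

The design choice is the same as in the answer: the map itself never takes part in the comparison, only a derived value that can be hashed and compared does, and the original map is carried along from the first matching row.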

Solution 2

1) If you want distinct values based on specific columns, select those columns and call distinct:

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("no", "age")


scala> df.show
+---+---+
| no|age|
+---+---+
|  1|  2|
|  3|  4|
|  1|  6|
+---+---+

val distinctValuesDF = df.select(df("no")).distinct

scala> distinctValuesDF.show
+---+
| no|
+---+
|  1|
|  3|
+---+

2) If you want unique rows across all columns, use dropDuplicates:

scala> val df = sc.parallelize(Array((1, 2), (3, 4),(3, 4), (1, 6))).toDF("no", "age")



scala> df.show

+---+---+
| no|age|
+---+---+
|  1|  2|
|  3|  4|
|  3|  4|
|  1|  6|
+---+---+


scala> df.dropDuplicates().show()
+---+---+
| no|age|
+---+---+
|  1|  2|
|  3|  4|
|  1|  6|
+---+---+

Solution 3

Yes, the syntax is incorrect, it should be:

Dataset<Row> landingDF = sqlContext.sql("SELECT distinct * from df");


Author by

Himanshu Yadav

Big data and distributed systems

Updated on July 09, 2022

Comments

Himanshu Yadav almost 2 years

I tried two ways to find distinct rows from Parquet, but neither seems to work.
Attempt 1: Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();
But it throws

Cannot have map type columns in DataFrame which calls set operations
(intersect, except, etc.), 
but the type of column canvasHashes is map<string,string>;;

Attempt 2: I tried running a SQL query:

Dataset<Row> df = sqlContext.read().parquet("location.parquet");
df.createOrReplaceTempView("df");
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct on timestamp * from df");

The error I get:

== SQL ==
SELECT distinct on timestamp * from df
-----------------------------^^^

Is there a way to get distinct records while reading Parquet files? Is there any read option I can use?
