Show distinct column values in pyspark dataframe


Solution 1

Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

With a Pandas dataframe:

import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()

This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)

You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))
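On Spark 2.0+, where SparkSession supersedes SQLContext, you would build the same dataframe like this (a minimal sketch, assuming an existing session named spark):

# Spark 2.0+ equivalent, assuming a SparkSession named `spark`
s_df = spark.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))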

If you want the same result from Spark, i.e. an ndarray, use toPandas():

s_df.toPandas()['k'].unique()

Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()

Finally, you can also use a list comprehension as follows:

[i.k for i in s_df.select('k').distinct().collect()]
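Both of the last two approaches return a plain Python list, e.g. ['foo', 'bar'].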

Solution 2

This should help you get the distinct values of a column:

df.select('column1').distinct().collect()

Note that .collect() doesn't have any built-in limit on how many values it can return, so this might be slow -- use .show() instead, or add .limit(20) before .collect() to manage this.
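For example, a minimal sketch of both options (the cap of 20 here is arbitrary):

# display up to 20 distinct values without pulling them to the driver
df.select('column1').distinct().show(20)

# or cap how many rows are collected back to the driver
df.select('column1').distinct().limit(20).collect()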

Solution 3

You can use df.dropDuplicates(['col1', 'col2']) to get only the distinct rows based on the columns named in the list.
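For instance, a small sketch with the hypothetical columns col1 and col2:

# keep one row per unique (col1, col2) combination; the remaining columns
# come from whichever row survives for each combination
deduped = df.dropDuplicates(['col1', 'col2'])
deduped.show()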

Solution 4

If you want to see the distinct values of a specific column in your dataframe, just write the following code. It shows up to 100 distinct values (if that many are available) for the colname column of the df dataframe, and the False disables truncation of long values.

df.select('colname').distinct().show(100, False)

If you want to do something further with the distinct values, you can save them in a dataframe:

a = df.select('colname').distinct()
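From there you can, for example, materialize the distinct values as a plain Python list (a minimal sketch, reusing the hypothetical colname column):

# collect pulls the results to the driver, so mind the size
values = [row[0] for row in a.collect()]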

Solution 5

collect_set can help you get the unique values from a given column of a pyspark.sql.DataFrame:

from pyspark.sql import functions as F

df.select(F.collect_set("column").alias("column")).first()["column"]
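Because collect_set aggregates everything into a single array column, first() brings back one row whose ["column"] entry is already a plain Python list of the unique values.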
Author: Satya

"Trust me, I want to be a programmer and am still confused about whether I am already one or whether my status is still in-progress. Either way I like my status, preferably the 'in-progress'."

Updated on December 26, 2021

Comments

  • Satya over 2 years

    With a pyspark dataframe, how do you do the equivalent of Pandas df['col'].unique()?

    I want to list out all the unique values in a pyspark dataframe column.

    Not the SQL-type way (registerTempTable then SQL query for distinct values).

    Also, I don't need groupBy then countDistinct; instead I want to check the distinct VALUES in that column.

  • Satya over 7 years
    Hi eddies, the last code line distinct().map() didn't work for me. Error: AttributeError: 'DataFrame' object has no attribute 'map'. I am on Spark 2.0. As for the toPandas approach, I would not call it an alternative; it converts the Spark dataframe to a pandas dataframe first and then does the pandas operation on it.
  • eddies over 7 years
    Hi Satya. Just updated the answer by adding a .rdd call after distinct(). It worked without that in Spark 1.6.2, but I just confirmed that the edited answer works in Spark 2.0.0 as well.
  • Satya almost 7 years
    @seufagner yes, I can do df.dropDuplicates(['col1']) to see (mark, SEE) the unique values, but without a collect (to_rdd, or to a pandas DF and then df['col'].unique()), I can't get the unique values as a list. Thanks for the suggestion.
  • Laurens Koppenol over 6 years
    Why try to avoid Spark dataframe operations by converting to a pandas dataframe (which hurts if it's gigantic) or utilizing RDD operations, when Spark dataframes are perfectly capable of doing this? See the answer below by @Pabbati.
  • eddies over 6 years
    @Laurens There were three solutions in the answer above, depending on what the poster really wanted. In all cases, the poster wanted some form of a list/array of the distinct values (cf. the poster's response to seufagner's answer). The third solution above does use Spark's dataframe API just as Pabbati's answer does, but actually returns a list, as per the poster's requirements.
  • RNHTTR over 5 years
    The question specifically asks how to "show" unique values of a dataframe column. I think @Pabbati's answer does this effectively: df.select('column1').distinct().show()
  • eddies over 5 years
    Yes, the question title includes the word "show". But the poster specifically clarified that SEEing the results wasn't adequate and wanted a list. As mentioned above, see the poster's comment to seufagner's answer.
  • RNHTTR over 5 years
    Good point. I suggested an edit to the question to ask for a list instead of showing the results.
  • Abhi over 5 years
    This code returns data that's not iterable, i.e. I see the distinct data but am not able to iterate over it in code. Is there any other way that enables me to do it? I tried using toPandas() to convert it into a pandas df and then get the iterable with the unique values. However, I am running into a 'Pandas not found' error message.
  • Satya over 5 years
    @Abhi: in place of .show(), do a .collect(); that way you will get an iterable of all the distinct values of that particular column. But make sure your master node has enough memory to hold those unique values, because collect will push all the requested data (in this case the unique values of the column) to the master node :)
  • Utsav Jha about 4 years
    The user did not ask how to display non-duplicate values. He just wanted to get a list of all unique/distinct items, which includes the duplicated ones too!
  • ahrooran almost 4 years
    Regarding [i.k for i in s_df.select('k').distinct().collect()]: I have a doubt. I only have the column name as a string (for example 'k'). Is there any way to write i.k when I only have the column name as a string? Actually, the column name is taken as user input; that's why I'm asking.
  • MichaelChirico almost 4 years
    @Satya I've edited your comment into the answer, thanks
  • Ric S about 2 years
    To obtain a Python list of actual values and not Row objects, you need to use a list comprehension like in this answer: stackoverflow.com/a/60896261/7465462
  • Palash Mondal almost 2 years
    This is correct because df.select(<column name>).collect() is an expensive operation which may lead to a stage failure error.
  • stinodego almost 2 years
    I agree on using collect_set, but I think the following would be cleaner: df.agg(F.collect_set("column")).collect()[0][0]