PySpark DataFrames - way to enumerate without converting to Pandas?
Solution 1
It doesn't work because:
- the second argument for
withColumn
should be aColumn
not a collection.np.array
won't work here - when you pass
"index in indexes"
as a SQL expression towhere
indexes
is out of scope and it is not resolved as a valid identifier
PySpark >= 1.4.0
You can add row numbers using respective window function and query using Column.isin
method or properly formated query string:
from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))
# Using DSL
indexed.where(col("index").isin(set(indexes)))
# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))
It looks like window functions called without PARTITION BY
clause move all data to the single partition so above may be not the best solution after all.
Any faster and simpler way to deal with it?
Not really. Spark DataFrames don't support random row access.
PairedRDD
can be accessed using lookup
method which is relatively fast if data is partitioned using HashPartitioner
. There is also indexed-rdd project which supports efficient lookups.
Edit:
Independent of PySpark version you can try something like this:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
row = Row("char")
row_with_index = Row("char", "index")
df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)
## +----+
## |char|
## +----+
## | a|
## | b|
## | c|
## | d|
## | e|
## +----+
## only showing top 5 rows
# This part is not tested but should work and save some work later
schema = StructType(
df.schema.fields[:] + [StructField("index", LongType(), False)])
indexed = (df.rdd # Extract rdd
.zipWithIndex() # Add index
.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])) # Map to rows
.toDF(schema)) # It will work without schema but will be more expensive
# inSet in Spark < 1.3
indexed.where(col("index").isin(indexes))
Solution 2
If you want a number range that's guaranteed not to collide but does not require a .over(partitionBy())
then you can use monotonicallyIncreasingId()
.
from pyspark.sql.functions import monotonicallyIncreasingId
df.select(monotonicallyIncreasingId().alias("rowId"),"*")
Note though that the values are not particularly "neat". Each partition is given a value range and the output will not be contiguous. E.g. 0, 1, 2, 8589934592, 8589934593, 8589934594
.
This was added to Spark on Apr 28, 2015 here: https://github.com/apache/spark/commit/d94cd1a733d5715792e6c4eac87f0d5c81aebbe2
Solution 3
You certainly can add an array for indexing, an array of your choice indeed: In Scala, first we need to create an indexing Array:
val index_array=(1 to df.count.toInt).toArray
index_array: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
You can now append this column to your DF. First, For that, you need to open up our DF and get it as an array, then zip it with your index_array and then we convert the new array back into and RDD. The final step is to get it as a DF:
final_df = sc.parallelize((df.collect.map(
x=>(x(0),x(1))) zip index_array).map(
x=>(x._1._1.toString,x._1._2.toString,x._2))).
toDF("column_name")
The indexing would be more clear after that.
Maria Koroliuk
Updated on July 05, 2022Comments
-
Maria Koroliuk almost 2 years
I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records- thus, being able to access record with certain index. (or select group of records with indexes range)
In pandas, I could make just
indexes=[2,3,6,7] df[indexes]
Here I want something similar, (and without converting dataframe to pandas)
The closest I can get to is:
Enumerating all the objects in the original dataframe by:
indexes=np.arange(df.count()) df_indexed=df.withColumn('index', indexes)
- Searching for values I need using where() function.
QUESTIONS:
- Why it doesn't work and how to make it working? How to add a row to a dataframe?
Would it work later to make something like:
indexes=[2,3,6,7] df1.where("index in indexes").collect()
Any faster and simpler way to deal with it?
-
titipata over 7 yearsHello @zero323, I tried the snippet. Everything works except
indexed.where(col("index").inSet(indexes))
that doesn't work. It returnsTypeError: 'Column' object is not callable
for me. Do you have an update on the snippet if I want to query multiple indexes? -
CharlesG over 5 yearsThis answer is downvoted because it does not explain how to create the id column.