How to create sequential number column in pyspark dataframe?
Solution 1
You can do this using spark.range, which builds a single-column dataframe of sequential numbers from (start, end, step):
df_len = 100
freq = 1
ref = spark.range(5, df_len, freq).toDF("id")
ref.show(10)
+---+
| id|
+---+
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
+---+
only showing top 10 rows
Solution 2
Three simple steps: import Window and the functions, build a window ordered by monotonically_increasing_id, and take row_number over it:
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number
df = df.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
Author: max04
Updated on June 18, 2022
Comments
-
max04 almost 2 years
I would like to create a column with sequential numbers in a pyspark dataframe, starting from a specified number. For instance, I want to add a column A to my dataframe df that runs from 5 to the length of the dataframe, incrementing by one: 5, 6, 7, ..., length(df).
Is there a simple solution using pyspark methods?
-
pault almost 6 years
Easiest way is probably
df = df.rdd.zipWithIndex().toDF(cols + ["index"]).withColumn("index", f.col("index") + 5)
where cols = df.columns and f refers to pyspark.sql.functions. But you should ask yourself why you're doing this, because almost surely there's a better way. DataFrames are inherently unordered, so this operation is not efficient.
-
max04 almost 6 years
Thank you! In the end I want to add the final results to a Hive table. I have to take max(id) from this table and add new records with id starting from max(id) + 1.
-
pault almost 6 years
I do not think it is possible to get a serial id column in Hive like that. Hive/Spark is intended for parallel processing. Even though the code in my comment works for you and you may be able to come up with a way to achieve your desired result, this is not really a good use case for spark or hive.
-
max04 almost 6 years
I handled it by adding a new column to my df like this:
max(id) + spark_func.row_number().over(Window.orderBy(unique_field_in_my_df))
-
Remis Haroon - رامز about 3 years
The question is to add a "new" column to an existing dataframe.