What is the difference between rowsBetween and rangeBetween?

28,994

Solution 1

It is simple:

  • ROWS BETWEEN doesn't care about the exact values. It cares only about the order of rows, and takes fixed number of preceding and following rows when computing frame.
  • RANGE BETWEEN considers values when computing frame.

Let's use an example using two window definitions:

  • ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  • ORDER BY x RANGE BETWEEN 2 PRECEDING AND CURRENT ROW

and data as

+---+
|  x|
+---+
| 10|
| 20|
| 30|
| 31|
+---+

Assuming the current row is the one with value 31 for the first window following rows will be included (current one, and two preceding):

+---+----------------------------------------------------+
|  x|ORDER BY x ROWS BETWEEN 2  PRECEDING AND CURRENT ROW|
+---+----------------------------------------------------+
| 10|                                               false|
| 20|                                                true|
| 30|                                                true|
| 31|                                                true|
+---+----------------------------------------------------+

and for the second one following (current one, and all preceding where x >= 31 - 2):

+---+-----------------------------------------------------+
|  x|ORDER BY x RANGE BETWEEN 2  PRECEDING AND CURRENT ROW|
+---+-----------------------------------------------------+
| 10|                                                false|
| 20|                                                false|
| 30|                                                 true|
| 31|                                                 true|
+---+-----------------------------------------------------+

Solution 2

The Java spark docs add clarity: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html#rowsBetween-long-long-

rangeBetween

A range-based boundary is based on the actual value of the ORDER BY expression(s). An offset is used to alter the value of the ORDER BY expression, for instance if the current order by expression has a value of 10 and the lower bound offset is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts a number of constraints on the ORDER BY expressions: there can be only one expression and this expression must have a numerical data type. An exception can be made when the offset is unbounded, because no value modification is needed, in this case multiple and non-numeric ORDER BY expression are allowed.

rowBetween

A row based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row, the frame for the current row starts or ends. For instance, given a row based sliding frame with a lower bound offset of -1 and a upper bound offset of +2. The frame for row with index 5 would range from index 4 to index 6.

Solution 3

rowsBetween: - With rowsBetween you define a boundary frame of rows to calculate, which frame is calculated independently.

Frame in rowsBetween does not depend on orderBy clause.

df = spark.read.csv(r'C:\Users\akashSaini\Desktop\TT.csv',inferSchema =True, header=True).na.drop()
w =Window.partitionBy('DEPARTMENT').orderBy('SALARY').rowsBetween(Window.unboundedPreceding,Window.currentRow)
df.withColumn('RowsBetween', F.sum(df.SALARY).over(w)).show()


first_name|Department|Salary|RowsBetween|

 Sofia|     Sales| 20000| 20000|
Gordon|     Sales| 25000| 45000|
Gracie|     Sales| 25000| 70000|
Cellie|     Sales| 25000| 95000|
Jervis|     Sales| 30000|125000|
 Akash|  Analysis| 30000| 30000|
Richard|   Account| 12000| 12000|
 Joelly|   Account| 15000| 27000|
Carmiae|   Account| 15000| 42000|
    Bob|   Account| 20000| 62000|
  Gally|   Account| 28000| 90000

rangeBetween: - With rangeBetween, you define a boundary frame of rows to calculate, which may change.

Frame in rangeBetween depends on orderBy clause. rangeBetween will include all the rows which has same value in orderBy clause like Gordon, Gracie and Cellie have same salary so included with the current frame.

For more understanding see below example: -

df = spark.read.csv(r'C:\Users\asaini28.EAD\Desktop\TT.csv',inferSchema =True, header=True).na.drop()
w =Window.partitionBy('DEPARTMENT').orderBy('SALARY').rangeBetween(Window.unboundedPreceding,Window.currentRow)
df.withColumn('RangeBetween', F.sum(df.SALARY).over(w)).select('first_name','Department','Salary','Test').show()

 first_name|Department|Salary|RangeBetween|
  Sofia|     Sales| 20000| 20000|
 Gordon|     Sales| 25000| 95000|
 Gracie|     Sales| 25000| 95000|
 Cellie|     Sales| 25000| 95000|
 Jervis|     Sales| 30000|125000|
  Akash|  Analysis| 30000| 30000|
Richard|   Account| 12000| 12000|
 Joelly|   Account| 15000| 42000|
Carmiae|   Account| 15000| 42000|
    Bob|   Account| 20000| 62000|
  Gally|   Account| 28000| 90000|

Solution 4

RANGEbetween looks at the ORDER BY clause to determine if a row is included in a window.

ROWSbetween looks at the order of the rows.

RANGE between checks if the ORDER BY is within some specified range, and will include them in a window.

ROWSbetween will form your window based on what rows are positionally around your current row (irregardless of the value of the ORDER BY column for those rows)

Share:
28,994
Evan Zamir
Author by

Evan Zamir

Updated on July 13, 2022

Comments

  • Evan Zamir
    Evan Zamir over 1 year

    From the PySpark docs rangeBetween:

    rangeBetween(start, end)

    Defines the frame boundaries, from start (inclusive) to end (inclusive).

    Both start and end are relative from the current row. For example, “0” means “current row”, while “-1” means one off before the current row, and “5” means the five off after the current row.

    Parameters:

    • start – boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
    • end – boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher). New in version 1.4.

    while rowsBetween

    rowsBetween(start, end)

    Defines the frame boundaries, from start (inclusive) to end (inclusive).

    Both start and end are relative positions from the current row. For example, “0” means “current row”, while “-1” means the row before the current row, and “5” means the fifth row after the current row.

    Parameters:

    • start – boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
    • end – boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher). New in version 1.4.

    For rangeBetween how is "1 off" different from "1 row", for example?