2023-03-08

Window Functions in SQL

What is a Window Function in SQL

A window function in SQL is a type of function that performs a calculation across a set of table rows, which are somehow related to the current row. Through window functions, every input row can potentially have a unique result, and this can be used to solve complex problems with relative ease. Window functions are widely utilized in the realm of data analysis.

Purpose of Window Functions

The primary purpose of window functions is to provide data analysts with a powerful tool to solve problems that need to compute something over a group of rows, potentially partitioned by certain values, ordered by other values. For instance, you could calculate a moving average, cumulative sum, or compute a running total of certain fields.

One of the key advantages of window functions is their ability to perform calculations across rows that are related to the current row. In contrast, with other types of SQL operations, rows are treated independently.

The Concept of Partition and Frame

Partitions

A partition in SQL is a division of the total set of rows into smaller sets. The PARTITION BY clause is responsible for this operation in a window function.

Let's consider a table sales that contains sales data for a company:

Salesperson Region Sales
John West 2000
Anna East 1500
Luke West 2500
Mark East 1000
John East 2500
Anna West 2000

If we partition this table by the Region column:

sql
SELECT Region, Sales,
SUM(Sales) OVER (PARTITION BY Region) AS total_sales
FROM sales;

The result set is divided into two partitions: East and West, and the sum of sales is calculated for each partition.

Frames

A frame is a subset of the current partition and can be thought of as a "sliding window" that moves over the partition and defines the current set of rows for calculating the window function. Frames are defined by the ROWS clause in the window function.

To understand frames better, let's continue with the sales table. Suppose we want to calculate a running total of sales within each region:

sql
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;

The frame, or "window", starts at the first row of each partition (UNBOUNDED PRECEDING) and ends at the current row (CURRENT ROW), effectively creating a running total.

Structure of Window Functions

The OVER Clause

The OVER clause defines a window or set of rows where the window function will be applied. It is used in conjunction with window functions. Without the OVER clause, window functions cannot function. The window is defined by an ORDER BY clause and a PARTITION BY clause.

For instance, we may want to find the total salary of each department in a company. The SQL query with OVER clause could be:

sql
SELECT department, salary,
SUM(salary) OVER (PARTITION BY department) AS total_salary
FROM employees;

This will return a result where each row contains the department, individual salary, and the total salary for that department.

The PARTITION BY Clause

The PARTITION BY clause divides the result set into partitions (or groups). The window function is applied to each partition separately and computation restarts for each partition.

For example, if you want to calculate the highest salary in each department, you would write:

sql
SELECT department, salary,
MAX(salary) OVER (PARTITION BY department) AS highest_salary
FROM employees;

This will return a table where each row contains the department, the salary of an employee, and the highest salary in that department.

The ORDER BY Clause

The ORDER BY clause within the OVER clause defines the order of the rows within each partition for the window function. This can change the result of the window function.

For instance, to find the cumulative salary in each department, you can use:

sql
SELECT department, salary,
SUM(salary) OVER (PARTITION BY department ORDER BY salary) AS cumulative_salary
FROM employees;

This will result in a table with each row containing the department, the salary of an employee, and the cumulative salary in that department ordered by salary.

The ROWS clause

The ROWS clause in SQL defines a window frame within the partition in terms of physical offsets from the position of the current input row. Different keywords help define this window frame:

UNBOUNDED PRECEDING

This keyword specifies that the window frame starts at the first row of the partition. For example:

sql
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;

In this case, for each row, the running_total includes all sales from the start of the partition to the current row.

n PRECEDING

This specifies that the window frame starts at the nth row before the current row within the partition. For instance, if n is 2:

sql
SELECT Salesperson, Region, Sales,
AVG(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS average_sales
FROM sales;

Here, the average_sales at each row is the average of the current row and the two preceding rows within the same partition.

CURRENT ROW

This keyword represents the current row in the partition. The window frame can end or start at the current row. For instance:

sql
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total
FROM sales;

Here, the running_total at each row is the total sales from the current row to the end of the partition.

n FOLLOWING

This specifies that the window frame ends at the nth row after the current row within the partition. If n is 1:

sql
SELECT Salesperson, Region, Sales,
AVG(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS average_sales
FROM sales;

Here, the average_sales at each row is the average of the current row and the next row within the same partition.

UNBOUNDED FOLLOWING

This keyword specifies that the window frame ends at the last row of the partition. For instance:

sql
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total
FROM sales;

In this case, for each row, the running_total includes all sales from the current row to the end of the partition.

The RANGE BETWEEN Clause

The RANGE BETWEEN clause in window functions defines a frame of rows with the same ORDER BY value as the current row, and the n PRECEDING and n FOLLOWING rows within the partition. It operates on the logical, rather than the physical, representation of the data.

To use the RANGE BETWEEN clause, you will need an ORDER BY clause in the window function. For example, suppose we want to calculate the total sales for each salesperson and the two salespeople who sold just below and just above them:

sql
SELECT Salesperson, Sales,
SUM(Sales) OVER (ORDER BY Sales RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sales_range_total
FROM sales;

This will return a result where each row contains the salesperson, their sales, and the sales_range_total that includes their sales and the sales of the salesperson who sold just below and just above them.

Comparing Window Functions with GROUP BY

SQL window functions and the GROUP BY clause are both tools used for data aggregation. However, they serve different purposes and behave differently.

GROUP BY

The GROUP BY clause is used with aggregate functions like SUM, AVG, MAX, MIN, COUNT, etc., to group the result-set by one or more columns. When using GROUP BY, the database system divides the rows returned from the SELECT statement into groups. For each group, you can apply an aggregate function.

For example, you can group a sales table by the Salesperson column and calculate the total sales for each salesperson:

sql
SELECT Salesperson, SUM(Sales) as total_sales
FROM sales
GROUP BY Salesperson;

This will return a result where each row contains a salesperson and their total sales.

Differences between Window Functions and GROUP BY

The fundamental difference between GROUP BY and window functions is how they handle the rows of the dataset:

  • When you use GROUP BY, the rows are grouped into a smaller set of output rows, losing their individual identities. An aggregate function is then applied, returning a single output row for each group.
  • In contrast, with window functions, each row retains its identity and can potentially have a unique result. A window function performs calculations across a set of rows related to the current row.

For example, suppose you want to calculate the total sales and the average sales for each salesperson. Using GROUP BY, you could write:

sql
SELECT Salesperson, SUM(Sales) as total_sales, AVG(Sales) as average_sales
FROM sales
GROUP BY Salesperson;

This will return a result where each row contains a salesperson, their total sales, and their average sales.

But what if you want to include these total sales and average sales in each row of the original table? This is where window functions come into play:

sql
SELECT Salesperson, Sales,
SUM(Sales) OVER (PARTITION BY Salesperson) as total_sales,
AVG(Sales) OVER (PARTITION BY Salesperson) as average_sales
FROM sales;

This will return a result where each row contains a salesperson, their sales, their total sales, and their average sales.

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!