What is a Window Function in SQL
A window function in SQL is a type of function that performs a calculation across a set of table rows, which are somehow related to the current row. Through window functions, every input row can potentially have a unique result, and this can be used to solve complex problems with relative ease. Window functions are widely utilized in the realm of data analysis.
Purpose of Window Functions
The primary purpose of window functions is to provide data analysts with a powerful tool to solve problems that need to compute something over a group of rows, potentially partitioned by certain values, ordered by other values. For instance, you could calculate a moving average, cumulative sum, or compute a running total of certain fields.
One of the key advantages of window functions is their ability to perform calculations across rows that are related to the current row. In contrast, with other types of SQL operations, rows are treated independently.
The Concept of Partition and Frame
Partitions
A partition in SQL is a division of the total set of rows into smaller sets. The PARTITION BY
clause is responsible for this operation in a window function.
Let's consider a table sales
that contains sales data for a company:
Salesperson | Region | Sales |
---|---|---|
John | West | 2000 |
Anna | East | 1500 |
Luke | West | 2500 |
Mark | East | 1000 |
John | East | 2500 |
Anna | West | 2000 |
If we partition this table by the Region
column:
SELECT Region, Sales,
SUM(Sales) OVER (PARTITION BY Region) AS total_sales
FROM sales;
The result set is divided into two partitions: East and West, and the sum of sales is calculated for each partition.
Frames
A frame is a subset of the current partition and can be thought of as a "sliding window" that moves over the partition and defines the current set of rows for calculating the window function. Frames are defined by the ROWS clause in the window function.
To understand frames better, let's continue with the sales
table. Suppose we want to calculate a running total of sales within each region:
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;
The frame, or "window", starts at the first row of each partition (UNBOUNDED PRECEDING
) and ends at the current row (CURRENT ROW
), effectively creating a running total.
Structure of Window Functions
The OVER Clause
The OVER clause defines a window or set of rows where the window function will be applied. It is used in conjunction with window functions. Without the OVER
clause, window functions cannot function. The window is defined by an ORDER BY
clause and a PARTITION BY
clause.
For instance, we may want to find the total salary of each department in a company. The SQL query with OVER
clause could be:
SELECT department, salary,
SUM(salary) OVER (PARTITION BY department) AS total_salary
FROM employees;
This will return a result where each row contains the department, individual salary, and the total salary for that department.
The PARTITION BY Clause
The PARTITION BY
clause divides the result set into partitions (or groups). The window function is applied to each partition separately and computation restarts for each partition.
For example, if you want to calculate the highest salary in each department, you would write:
SELECT department, salary,
MAX(salary) OVER (PARTITION BY department) AS highest_salary
FROM employees;
This will return a table where each row contains the department, the salary of an employee, and the highest salary in that department.
The ORDER BY Clause
The ORDER BY
clause within the OVER clause defines the order of the rows within each partition for the window function. This can change the result of the window function.
For instance, to find the cumulative salary in each department, you can use:
SELECT department, salary,
SUM(salary) OVER (PARTITION BY department ORDER BY salary) AS cumulative_salary
FROM employees;
This will result in a table with each row containing the department, the salary of an employee, and the cumulative salary in that department ordered by salary.
The ROWS clause
The ROWS
clause in SQL defines a window frame within the partition in terms of physical offsets from the position of the current input row. Different keywords help define this window frame:
UNBOUNDED PRECEDING
This keyword specifies that the window frame starts at the first row of the partition. For example:
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;
In this case, for each row, the running_total
includes all sales from the start of the partition to the current row.
n PRECEDING
This specifies that the window frame starts at the nth row before the current row within the partition. For instance, if n is 2:
SELECT Salesperson, Region, Sales,
AVG(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS average_sales
FROM sales;
Here, the average_sales
at each row is the average of the current row and the two preceding rows within the same partition.
CURRENT ROW
This keyword represents the current row in the partition. The window frame can end or start at the current row. For instance:
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total
FROM sales;
Here, the running_total
at each row is the total sales from the current row to the end of the partition.
n FOLLOWING
This specifies that the window frame ends at the nth row after the current row within the partition. If n is 1:
SELECT Salesperson, Region, Sales,
AVG(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS average_sales
FROM sales;
Here, the average_sales
at each row is the average of the current row and the next row within the same partition.
UNBOUNDED FOLLOWING
This keyword specifies that the window frame ends at the last row of the partition. For instance:
SELECT Salesperson, Region, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Salesperson ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total
FROM sales;
In this case, for each row, the running_total
includes all sales from the current row to the end of the partition.
The RANGE BETWEEN Clause
The RANGE BETWEEN
clause in window functions defines a frame of rows with the same ORDER BY
value as the current row, and the n PRECEDING
and n FOLLOWING
rows within the partition. It operates on the logical, rather than the physical, representation of the data.
To use the RANGE BETWEEN
clause, you will need an ORDER BY
clause in the window function. For example, suppose we want to calculate the total sales for each salesperson and the two salespeople who sold just below and just above them:
SELECT Salesperson, Sales,
SUM(Sales) OVER (ORDER BY Sales RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sales_range_total
FROM sales;
This will return a result where each row contains the salesperson, their sales, and the sales_range_total
that includes their sales and the sales of the salesperson who sold just below and just above them.
Comparing Window Functions with GROUP BY
SQL window functions and the GROUP BY
clause are both tools used for data aggregation. However, they serve different purposes and behave differently.
GROUP BY
The GROUP BY
clause is used with aggregate functions like SUM
, AVG
, MAX
, MIN
, COUNT
, etc., to group the result-set by one or more columns. When using GROUP BY, the database system divides the rows returned from the SELECT
statement into groups. For each group, you can apply an aggregate function.
For example, you can group a sales
table by the Salesperson
column and calculate the total sales for each salesperson:
SELECT Salesperson, SUM(Sales) as total_sales
FROM sales
GROUP BY Salesperson;
This will return a result where each row contains a salesperson and their total sales.
Differences between Window Functions and GROUP BY
The fundamental difference between GROUP BY
and window functions is how they handle the rows of the dataset:
- When you use
GROUP BY
, the rows are grouped into a smaller set of output rows, losing their individual identities. An aggregate function is then applied, returning a single output row for each group. - In contrast, with window functions, each row retains its identity and can potentially have a unique result. A window function performs calculations across a set of rows related to the current row.
For example, suppose you want to calculate the total sales and the average sales for each salesperson. Using GROUP BY
, you could write:
SELECT Salesperson, SUM(Sales) as total_sales, AVG(Sales) as average_sales
FROM sales
GROUP BY Salesperson;
This will return a result where each row contains a salesperson, their total sales, and their average sales.
But what if you want to include these total sales and average sales in each row of the original table? This is where window functions come into play:
SELECT Salesperson, Sales,
SUM(Sales) OVER (PARTITION BY Salesperson) as total_sales,
AVG(Sales) OVER (PARTITION BY Salesperson) as average_sales
FROM sales;
This will return a result where each row contains a salesperson, their sales, their total sales, and their average sales.