SQL开窗函数(窗口函数)总结
一. 窗口函数概念
SQL开窗(Window Functions)是一种在关系数据库中执行特定计算的高级技术,它能够让用户对查询结果集中的每一行应用聚合函数,同时指定计算范围,例如指定子集合或窗口,以便在处理数据时能够更灵活和精确地控制计算。
我们在使用 group by 的时候,除了 group by 后面的字段信息可以保留外,其他字段都需要聚合函数处理,假如我不需要聚合函数处理,想保留其他字段信息怎么办,那开窗函数便是你的选择。看后面的示例
二. 准备测试数据
CREATE TABLE person (
name VARCHAR(32),
city VARCHAR(32),
age INT,
salary INT
);
INSERT INTO person(name, city, age, salary)
VALUES ('Tom', 'BeiJing', 20, 3000),
('Tim', 'ChengDu', 21, 4000),
('Jim', 'BeiJing', 22, 3500),
('Lily', 'London', 21, 2000),
('John', 'NewYork', 22, 1000),
('YaoMing', 'BeiJing', 20, 3000),
('Swing', 'London', 22, 2000),
('Guo', 'NewYork', 20, 2800),
('YuQian', 'BeiJing', 24, 8000),
('Ketty', 'London', 25, 8500),
('Kitty', 'ChengDu', 25, 3000),
('Merry', 'BeiJing', 23, 3500),
('Smith', 'ChengDu', 30, 3000),
('Bill', 'BeiJing', 25, 2000),
('Jerry', 'NewYork', 24, 3300);
- 开窗函数示例:查询 salary 小于 3000 的用户,同时每一行都展示 salary 小于3000 的人数,可以看到与聚合函数不同的是,开窗函数在聚合函数后增加了一个 OVER 关键字。
SELECT name,
city,
salary,
COUNT(1) OVER () AS less_then_3000
FROM person
WHERE salary < 3000;
name | city | salary | less_then_3000 |
---|---|---|---|
Lily | London | 2000 | 5 |
John | NewYork | 1000 | 5 |
Swing | London | 2000 | 5 |
Guo | NewYork | 2800 | 5 |
Bill | BeiJing | 2000 | 5 |
三. 开窗函数格式
开窗函数格式为:函数名(列) OVER(选项)
- OVER 关键字表示把前面的函数当成开窗函数而不是聚合函数。SQL 标准允许将所有聚合函数用做开窗函数,使用 OVER 关键字来区分这两种用法。
- 对上一节的示例来说,查询结果的每一行都返回所有符合条件的行的条数。
- OVER 关键字后的括号中还经常添加选项用以改变进行聚合运算的窗口范围。如果 OVER 关键字后的括号中的选项为空,则开窗函数会对结果集中的所有行进行聚合运算。
- 聚合函数开窗函数区别:聚合函数每组只返回一个值,开窗函数每组可返回多个值。
四. PARTITION BY 子句
- OVER 关键字后括号中的可以使用 PARTITION BY 子句来定义行的分区进行聚合计算。
- 与 GROUP BY 子句不同,PARTITION BY 子句创建的分区是独立于结果集的,创建的分区只进行聚合计算,而且不同的开窗函数所创建的分区也不互相影响。
看下面的例子
SELECT name,
age,
city,
salary,
COUNT(1) OVER (PARTITION BY city) AS city_number,
COUNT(1) OVER (PARTITION BY age) AS age_number
FROM person;
name | age | city | salary | city_number | age_number |
---|---|---|---|---|---|
Tom | 20 | BeiJing | 3000 | 6 | 3 |
YaoMing | 20 | BeiJing | 3000 | 6 | 3 |
Guo | 20 | NewYork | 2800 | 3 | 3 |
Tim | 21 | ChengDu | 4000 | 3 | 2 |
Lily | 21 | London | 2000 | 3 | 2 |
Jim | 22 | BeiJing | 3500 | 6 | 3 |
Swing | 22 | London | 2000 | 3 | 3 |
John | 22 | NewYork | 1000 | 3 | 3 |
Merry | 23 | BeiJing | 3500 | 6 | 1 |
YuQian | 24 | BeiJing | 8000 | 6 | 2 |
Jerry | 24 | NewYork | 3300 | 3 | 2 |
Bill | 25 | BeiJing | 2000 | 6 | 3 |
Kitty | 25 | ChengDu | 3000 | 3 | 3 |
Ketty | 25 | London | 8500 | 3 | 3 |
Smith | 30 | ChengDu | 3000 | 3 | 1 |
- 表示对结果集按照 city 进行分区,并且计算当前行所属的组的聚合计算结果。比如对于 name 等于 Tom 的行,它所属的城市是 BeiJing,同属于 BeiJing 的人员一共有 6 个,所以对于这一列的显示结果为6。
- 在同一个 SELECT 语句中可以同时使用多个开窗函数,而且这些开窗函数并不会相互干扰。
五. ORDER BY子句
OVER 关键字后的选项中使用 ORDER BY 子句来指定排序规则,同时可以指定聚合的范围,语法为:ORDER BY 字段名 ASC|DESC RANGE|ROWS BETWEEN 边界规则1 AND 边界规则2
- RANGE 表示按照值的范围进行范围的定义。
- ROWS 表示按照行的范围进行范围的定义。
- "RANGE|ROWS BETWEEN 边界规则1 AND 边界规则2" 部分用来定位聚合计算范围,这个子句又被称为定位框架。
边界规则的可取值见下表
可取值 | 说明 | 示例 |
---|---|---|
CURRENT ROW | 当前行 | |
N PRECEDING | 前N行 | 2 PRECEDING |
UNBOUNDED PRECEDING | 一直到第一条记录 | |
N FOLLOWING | 后N行 | 2 FOLLOWING |
UNBOUNDED FOLLOWING | 一直到最后一条记录 |
ROWS 和 RANGE 的区别
ROWS 按照物理行计算,RANGE 按照值计算,看示例,非常重要:
SELECT name,
salary,
SUM(salary) OVER (ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS a,
SUM(salary) OVER (ORDER BY salary RANGE BETWEEN 2 PRECEDING AND CURRENT ROW ) AS b,
SUM(salary) OVER (ORDER BY salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS c
FROM person;
name | salary | a | b | c |
---|---|---|---|---|
John | 1000 | 1000 | 1000 | 1000 |
Lily | 2000 | 7000 | 6000 | 3000 |
Swing | 2000 | 7000 | 6000 | 5000 |
Bill | 2000 | 7000 | 6000 | 6000 |
Guo | 2800 | 9800 | 2800 | 6800 |
Tom | 3000 | 21800 | 12000 | 7800 |
YaoMing | 3000 | 21800 | 12000 | 8800 |
Kitty | 3000 | 21800 | 12000 | 9000 |
Smith | 3000 | 21800 | 12000 | 9000 |
Jerry | 3300 | 25100 | 3300 | 9300 |
Jim | 3500 | 32100 | 7000 | 9800 |
Merry | 3500 | 32100 | 7000 | 10300 |
Tim | 4000 | 36100 | 4000 | 11000 |
YuQian | 8000 | 44100 | 8000 | 15500 |
Ketty | 8500 | 52600 | 8500 | 20500 |
- a 表示从第一行到当前行的值的这个范围内的 salary 总和。
- b 表示从当前行的值小 2 的值的行,到当前行值的范围内 salary 总和。
- c 表示从第一行到当前行的 salary 总和。
第一行到当前行的工资总和
按照工资升序后,查询从第一行到当前行的工资总和,这里 SUM(salary) 是开窗函数,不是聚集函数,表示按照 salary 排序,然后计算从第一行(UNBOUNDED PRECEDING)到当前行(CURRENT ROW)的和。
SELECT name,
age,
city,
salary,
SUM(salary) OVER (ORDER BY salary ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_before
FROM person;
name | age | city | salary | sum_before |
---|---|---|---|---|
John | 22 | NewYork | 1000 | 1000 |
Lily | 21 | London | 2000 | 3000 |
Swing | 22 | London | 2000 | 5000 |
Bill | 25 | BeiJing | 2000 | 7000 |
Guo | 20 | NewYork | 2800 | 9800 |
Tom | 20 | BeiJing | 3000 | 12800 |
YaoMing | 20 | BeiJing | 3000 | 15800 |
Kitty | 25 | ChengDu | 3000 | 18800 |
Smith | 30 | ChengDu | 3000 | 21800 |
Jerry | 24 | NewYork | 3300 | 25100 |
Jim | 22 | BeiJing | 3500 | 28600 |
Merry | 23 | BeiJing | 3500 | 32100 |
Tim | 21 | ChengDu | 4000 | 36100 |
YuQian | 24 | BeiJing | 8000 | 44100 |
Ketty | 25 | London | 8500 | 52600 |
第一行值到当前行值的工资总和
按照工资升序后,查询从第一行值到当前行值范围的工资总和。注意的是,该示例和上一个示例的唯一区别就是:ROWS 改成了 RANGE,ROWS是按照行数进行范围定位的,而 RANGE 则是按照值范围进行定位的,这两个不同的定位方式主要用来处理并列排序的情况。比如 Lily、Swing、Bill 这三个人的工资都是 2000 元,如果按照 RANGE 进行范围定位,则仍然计算从第一条到当前行的累积和,不过由于等于 2000 元的工资有三个人,所以计算的累积和为从第一条到 2000 元工资的人员间,所以对 Lily、Swing、Bill 这三个人进行开窗函数聚合计算的时候得到的都是7000( 1000+2000+2000+2000 )。
SELECT name,
age,
city,
salary,
SUM(salary) OVER (ORDER BY salary ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_before
FROM person;
name | age | city | salary | sum_before |
---|---|---|---|---|
John | 22 | NewYork | 1000 | 1000 |
Lily | 21 | London | 2000 | 7000 |
Swing | 22 | London | 2000 | 7000 |
Bill | 25 | BeiJing | 2000 | 7000 |
Guo | 20 | NewYork | 2800 | 9800 |
Tom | 20 | BeiJing | 3000 | 21800 |
YaoMing | 20 | BeiJing | 3000 | 21800 |
Kitty | 25 | ChengDu | 3000 | 21800 |
Smith | 30 | ChengDu | 3000 | 21800 |
Jerry | 24 | NewYork | 3300 | 25100 |
Jim | 22 | BeiJing | 3500 | 32100 |
Merry | 23 | BeiJing | 3500 | 32100 |
Tim | 21 | ChengDu | 4000 | 36100 |
YuQian | 24 | BeiJing | 8000 | 44100 |
Ketty | 25 | London | 8500 | 52600 |
第一行到当前行的工资的值的最大值
查询按照年龄升序后,到当前行的工资的值的最大值。
SELECT name,
age,
city,
salary,
MAX(salary) OVER (ORDER BY age RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS max_before
FROM person;
name | age | city | salary | max_before |
---|---|---|---|---|
Tom | 20 | BeiJing | 3000 | 3000 |
YaoMing | 20 | BeiJing | 3000 | 3000 |
Guo | 20 | NewYork | 2800 | 3000 |
Tim | 21 | ChengDu | 4000 | 4000 |
Lily | 21 | London | 2000 | 4000 |
Jim | 22 | BeiJing | 3500 | 4000 |
John | 22 | NewYork | 1000 | 4000 |
Swing | 22 | London | 2000 | 4000 |
Merry | 23 | BeiJing | 3500 | 4000 |
YuQian | 24 | BeiJing | 8000 | 8000 |
Jerry | 24 | NewYork | 3300 | 8000 |
Ketty | 25 | London | 8500 | 8500 |
Kitty | 25 | ChengDu | 3000 | 8500 |
Bill | 25 | BeiJing | 2000 | 8500 |
Smith | 30 | ChengDu | 3000 | 8500 |
查询同龄人的最高工资
注意:partition by 可以 和 order by 组合使用,此时表示在分区内进行排序
SELECT name,
age,
city,
salary,
MAX(salary) OVER (PARTITION BY age) AS max_same_age
FROM person;
name | age | city | salary | max_same_age |
---|---|---|---|---|
Tom | 20 | BeiJing | 3000 | 3000 |
YaoMing | 20 | BeiJing | 3000 | 3000 |
Guo | 20 | NewYork | 2800 | 3000 |
Tim | 21 | ChengDu | 4000 | 4000 |
Lily | 21 | London | 2000 | 4000 |
Jim | 22 | BeiJing | 3500 | 3500 |
John | 22 | NewYork | 1000 | 3500 |
Swing | 22 | London | 2000 | 3500 |
Merry | 23 | BeiJing | 3500 | 3500 |
YuQian | 24 | BeiJing | 8000 | 8000 |
Jerry | 24 | NewYork | 3300 | 8000 |
Ketty | 25 | London | 8500 | 8500 |
Kitty | 25 | ChengDu | 3000 | 8500 |
Bill | 25 | BeiJing | 2000 | 8500 |
Smith | 30 | ChengDu | 3000 | 3000 |
OVER () 与 OVER (ORDER BY)
总结:只要有 order by,就是累计到当前行值的计算,不是这个组内的计算。
OVER ()
表示应用于当前组(全部数据或者partition by指定的组),而 OVER (ORDER BY)
实际上是 OVER (ORDER BY xxx RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )
的省略写法。注意,不是 ROWS BETWEEN
,RANGE BETWEEN
是按照值范围进行定位的,看下面示例。
示例1
SELECT name,
salary,
COUNT(1) OVER () AS a,
COUNT(1) OVER (ORDER BY salary) AS b
FROM person;
name | salary | a | b |
---|---|---|---|
John | 1000 | 15 | 1 |
Lily | 2000 | 15 | 4 |
Swing | 2000 | 15 | 4 |
Bill | 2000 | 15 | 4 |
Guo | 2800 | 15 | 5 |
Tom | 3000 | 15 | 9 |
YaoMing | 3000 | 15 | 9 |
Kitty | 3000 | 15 | 9 |
Smith | 3000 | 15 | 9 |
Jerry | 3300 | 15 | 10 |
Jim | 3500 | 15 | 12 |
Merry | 3500 | 15 | 12 |
Tim | 4000 | 15 | 13 |
YuQian | 8000 | 15 | 14 |
Ketty | 8500 | 15 | 15 |
示例2
SELECT name,
salary,
SUM(salary) OVER () AS a,
SUM(salary) OVER (ORDER BY salary) AS b
FROM person;
SELECT name,
salary,
SUM(salary) OVER () AS a,
SUM(salary) OVER (ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS b
FROM person;
如上两种写法完全等效,结果如下:
name | salary | a | b |
---|---|---|---|
John | 1000 | 52600 | 1000 |
Lily | 2000 | 52600 | 7000 |
Swing | 2000 | 52600 | 7000 |
Bill | 2000 | 52600 | 7000 |
Guo | 2800 | 52600 | 9800 |
Tom | 3000 | 52600 | 21800 |
YaoMing | 3000 | 52600 | 21800 |
Kitty | 3000 | 52600 | 21800 |
Smith | 3000 | 52600 | 21800 |
Jerry | 3300 | 52600 | 25100 |
Jim | 3500 | 52600 | 32100 |
Merry | 3500 | 52600 | 32100 |
Tim | 4000 | 52600 | 36100 |
YuQian | 8000 | 52600 | 44100 |
Ketty | 8500 | 52600 | 52600 |
六. 排名与排序 RANK 等
ROW_NUMBER,RANK,DENSE_RANK 都可以用于计算一行的排名,区别如下:
- ROW_NUMBER:简单给一行标记一个行号
- RANK:按照 ORDER BY 指定的字段进行排名,如果排名字段值相同,则名次也相同,但是会占用下一个名次。
- DENSE_RANK:按照 ORDER BY 指定的字段进行排名,如果排名字段值相同,则名次也相同,但是不会占用下一个名次。
注意 :如果有 partition by 的话,则表示在分区内排名。
SELECT name,
age,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC ) AS v_row_number,
RANK() OVER (ORDER BY salary DESC ) AS v_rank,
DENSE_RANK() OVER (ORDER BY salary DESC ) AS v_dense_rank
FROM person;
name | age | salary | v_row_number | v_rank | v_dense_rank |
---|---|---|---|---|---|
Ketty | 25 | 8500 | 1 | 1 | 1 |
YuQian | 24 | 8000 | 2 | 2 | 2 |
Tim | 21 | 4000 | 3 | 3 | 3 |
Jim | 22 | 3500 | 4 | 4 | 4 |
Merry | 23 | 3500 | 5 | 4 | 4 |
Jerry | 24 | 3300 | 6 | 6 | 5 |
Tom | 20 | 3000 | 7 | 7 | 6 |
YaoMing | 20 | 3000 | 8 | 7 | 6 |
Kitty | 25 | 3000 | 9 | 7 | 6 |
Smith | 30 | 3000 | 10 | 7 | 6 |
Guo | 20 | 2800 | 11 | 11 | 7 |
Lily | 21 | 2000 | 12 | 12 | 8 |
Swing | 22 | 2000 | 13 | 12 | 8 |
Bill | 25 | 2000 | 14 | 12 | 8 |
John | 22 | 1000 | 15 | 15 | 9 |
同龄人的工资排序
SELECT name,
age,
salary,
ROW_NUMBER() OVER (PARTITION BY age ORDER BY salary DESC ) AS v_row_number
FROM person;
name | age | salary | v_row_number |
---|---|---|---|
Tom | 20 | 3000 | 1 |
YaoMing | 20 | 3000 | 2 |
Guo | 20 | 2800 | 3 |
Tim | 21 | 4000 | 1 |
Lily | 21 | 2000 | 2 |
Jim | 22 | 3500 | 1 |
Swing | 22 | 2000 | 2 |
John | 22 | 1000 | 3 |
Merry | 23 | 3500 | 1 |
YuQian | 24 | 8000 | 1 |
Jerry | 24 | 3300 | 2 |
Ketty | 25 | 8500 | 1 |
Kitty | 25 | 3000 | 2 |
Bill | 25 | 2000 | 3 |
Smith | 30 | 3000 | 1 |
七. first_value 与 last_value
- first_value:表示排序后的第一行的值(如果有partition by,则在分区内的第一行)
- last_value:表示排序后的最后一行的值,即当前行的值(如果有partition by,则在分区内的最后一行)
SELECT name,
age,
salary,
first_value(salary) OVER (PARTITION BY age ORDER BY salary DESC ) AS v_first_value,
last_value(salary) OVER (PARTITION BY age ORDER BY salary DESC ) AS v_last_value
FROM person;
name | age | salary | v_first_value | v_last_value |
---|---|---|---|---|
Tom | 20 | 3000 | 3000 | 3000 |
YaoMing | 20 | 3000 | 3000 | 3000 |
Guo | 20 | 2800 | 3000 | 2800 |
Tim | 21 | 4000 | 4000 | 4000 |
Lily | 21 | 2000 | 4000 | 2000 |
Jim | 22 | 3500 | 3500 | 3500 |
Swing | 22 | 2000 | 3500 | 2000 |
John | 22 | 1000 | 3500 | 1000 |
Merry | 23 | 3500 | 3500 | 3500 |
YuQian | 24 | 8000 | 8000 | 8000 |
Jerry | 24 | 3300 | 8000 | 3300 |
Ketty | 25 | 8500 | 8500 | 8500 |
Kitty | 25 | 3000 | 8500 | 3000 |
Bill | 25 | 2000 | 8500 | 2000 |
Smith | 30 | 3000 | 3000 | 3000 |
八. lag 与 lead
LAG(col, n, DEFAULT)
用于统计窗口内,不算当前行,往上数第 n 行值的 col 列值
- 第一个参数为要取的列名
- 第二个参数为往上第n行(可选,默认为 1 )
- 第三个参数为默认值(当往上第 n 行为 NULL 时候,取默认值,如不指定,则为NULL)
SELECT name,
age,
salary,
LAG(salary, 1, 0) OVER (PARTITION BY age ORDER BY salary DESC ) AS v_first_value
FROM person;
name | age | salary | v_first_value |
---|---|---|---|
Tom | 20 | 3000 | 0 |
YaoMing | 20 | 3000 | 3000 |
Guo | 20 | 2800 | 3000 |
Tim | 21 | 4000 | 0 |
Lily | 21 | 2000 | 4000 |
Jim | 22 | 3500 | 0 |
Swing | 22 | 2000 | 3500 |
John | 22 | 1000 | 2000 |
Merry | 23 | 3500 | 0 |
YuQian | 24 | 8000 | 0 |
Jerry | 24 | 3300 | 8000 |
Ketty | 25 | 8500 | 0 |
Kitty | 25 | 3000 | 8500 |
Bill | 25 | 2000 | 3000 |
Smith | 30 | 3000 | 0 |
LEAD(col, n, DEFAULT)
, 与 LAG 相反 , 用于统计窗口内往下第n行值,不算当前行。
九. ntile 将数据分成 n 等分
用于将数据分成 n 等分,如果不能将数据平分,则前面的几份会多几条。通常情况下,在分块的时候,为了使数据可控,最好排序以后进行分块。
- 示例 将同龄人分成两份
SELECT name,
age,
salary,
NTILE(2) OVER (PARTITION BY age ORDER BY salary) AS block
FROM person;
name | age | salary | block |
---|---|---|---|
Guo | 20 | 2800 | 1 |
Tom | 20 | 3000 | 1 |
YaoMing | 20 | 3000 | 2 |
Lily | 21 | 2000 | 1 |
Tim | 21 | 4000 | 2 |
John | 22 | 1000 | 1 |
Swing | 22 | 2000 | 1 |
Jim | 22 | 3500 | 2 |
Merry | 23 | 3500 | 1 |
Jerry | 24 | 3300 | 1 |
YuQian | 24 | 8000 | 2 |
Bill | 25 | 2000 | 1 |
Kitty | 25 | 3000 | 1 |
Ketty | 25 | 8500 | 2 |
Smith | 30 | 3000 | 1 |
十. cume_dist 行数占比
统计小于等于(或者大于等于)当前行值的行数占比。比如:统计小于等于当前工资的人数占比。
SELECT name,
salary,
cume_dist() OVER (ORDER BY salary ASC ) AS percent
FROM person;
name | salary | percent |
---|---|---|
John | 1000 | 0.06666666666666667 |
Lily | 2000 | 0.26666666666666666 |
Swing | 2000 | 0.26666666666666666 |
Bill | 2000 | 0.26666666666666666 |
Guo | 2800 | 0.3333333333333333 |
Tom | 3000 | 0.6 |
YaoMing | 3000 | 0.6 |
Kitty | 3000 | 0.6 |
Smith | 3000 | 0.6 |
Jerry | 3300 | 0.6666666666666666 |
Jim | 3500 | 0.8 |
Merry | 3500 | 0.8 |
Tim | 4000 | 0.8666666666666667 |
YuQian | 8000 | 0.9333333333333333 |
Ketty | 8500 | 1 |
如果需要:统计大于等于当前工资的人数占比,只需要倒序排即可。