场景:有用户表 user_info,字段 uid, city。现运要选 10w 人发调查问卷,要求所选人群的 city 分布,和全量用户的 city 分布一致。
思路:
- 开窗函数求出每个城市的占比。
- row_number 每个城市按照随机排序,也就是打乱数据。
- 用每个城市的占比乘以 10w 人,得到需要每个城市需要的人。
完整SQL
SELECT uid, city
FROM (
SELECT uid,
city,
((COUNT(uid) OVER (PARTITION BY city)) / (COUNT(uid) OVER ())) AS city_percent,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY RAND()) AS rand_rank
FROM user_info
) tab1
WHERE rand_rank <= (10000 * city_percent);