Python Regular Expressions

yuziyue

4 Jul 2023 · 5 min read

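1. Compiling Patterns in Python

Python compiles a regular expression with re.compile. Some commonly used examples appear later in this post.

import re

# Compile the pattern first, then use it:
#   some_pattern = re.compile(pattern[, flags])
#   some_pattern.match(text)
# For example:
text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
some_pattern = re.compile(r'Today')
some_pattern.match(text)

# Use a pattern directly, without compiling it first:
#   re.match(some_pattern, string[, flags])
# For example:
text = '2023-07-04'
res = re.match(r'(\d{4})-(\d{2})-(\d{2})', text)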
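As a runnable sketch of the two styles above (the names `date_pattern`, `compiled`, and `direct` are only illustrative), both calls produce the same match:

```python
import re

text = '2023-07-04'

# Style 1: compile once, then reuse the pattern object.
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
compiled = date_pattern.match(text)

# Style 2: pass the pattern string straight to the module-level function.
direct = re.match(r'(\d{4})-(\d{2})-(\d{2})', text)

# Both calls return a match object with the same groups.
print(compiled.groups())  # ('2023', '07', '04')
print(direct.groups())    # ('2023', '07', '04')
```

Compiling up front is mostly worthwhile when the same pattern is applied many times; for a one-off match the module-level functions are usually enough.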

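2. Matching Functions in Python

re.match(some_pattern, string[, flags])  # Returns a match object, or None. Matches only at the start of the string; if the start does not match, the match fails.
re.search(some_pattern, string[, flags]) # Returns a match object, or None. Scans the string and stops at the first match; later occurrences are not reported.

re.split(some_pattern, string[, maxsplit])   # Returns a list. Splits the string wherever the pattern matches, so several different delimiters can be handled at once.
re.findall(some_pattern, string[, flags])    # Returns a list of every match of the pattern.
re.finditer(some_pattern, string[, flags])   # Returns an iterator of match objects. Like findall, but the results are read by iterating.

re.sub(some_pattern, repl, string[, count])  # Returns the string after substitution. Regex search and replace.
re.subn(some_pattern, repl, string[, count]) # Returns a tuple (new_string, number_of_substitutions). Regex search and replace that also reports the count.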
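To make the return types above concrete, here is a small sketch that runs each function on one sample string. The sample text and variable names are made up for illustration.

```python
import re

text = 'a1, b22; c333'

first_at_start = re.match(r'\d+', text)    # None: the string starts with 'a', not a digit
first_anywhere = re.search(r'\d+', text)   # match object for '1'
parts = re.split(r'[,;]\s*', text)         # ['a1', 'b22', 'c333']
numbers = re.findall(r'\d+', text)         # ['1', '22', '333']
spans = [m.span() for m in re.finditer(r'\d+', text)]  # [(1, 2), (5, 7), (10, 13)]
replaced = re.sub(r'\d+', '#', text)       # 'a#, b#; c#'
replaced_n = re.subn(r'\d+', '#', text)    # ('a#, b#; c#', 3)

print(first_at_start, first_anywhere.group(0))
print(parts, numbers, spans)
print(replaced, replaced_n)
```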

3. Checking Whether a Match Succeeded

some_pattern.match(text) # Returns None when nothing matches, so test the result with if before doing anything further.
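A minimal sketch of that check, assuming a hypothetical helper named `parse_date` that is not part of the original post:

```python
import re

date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

def parse_date(text):
    """Return (year, month, day) as ints, or None if text does not start with a date."""
    m = date_pattern.match(text)
    if m is None:
        # No match: bail out instead of calling .groups() on None.
        return None
    year, month, day = m.groups()
    return int(year), int(month), int(day)

print(parse_date('2023-07-04'))  # (2023, 7, 4)
print(parse_date('not a date'))  # None
```

Calling a method such as groups() on a failed match raises AttributeError, because the result is None; the explicit check avoids that.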

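4. The Optional flags Argument

I(IGNORECASE): ignore case
L(LOCALE):     make \w, \W, \b, \B depend on the current locale (in Python 3, only allowed with bytes patterns)
M(MULTILINE):  multi-line mode; ^ matches at the start of each line and $ at the end of each line
S(DOTALL):     make . match any character, including a newline
X(VERBOSE):    ignore whitespace and comments inside the pattern
U(UNICODE):    make \w, \W, \b, \B follow Unicode rules (already the default for str patterns in Python 3)

# Find the word "life", ignoring case. The two calls are equivalent.
text = "Life is short! I use Python!"
res = re.findall(r'life', text, flags=re.I)
res = re.findall(r'life', text, re.I)
# res is ['Life']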
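The flags can also be combined. The sketch below, with made-up sample text, joins re.I and re.M using the bitwise OR operator and uses re.X to document a pattern inline:

```python
import re

text = "first line\nSecond line\nthird LINE"

# re.M lets ^ match at the start of every line; re.I ignores case.
# Several flags are combined with the bitwise OR operator.
starts = re.findall(r'^[a-z]+', text, flags=re.I | re.M)
print(starts)  # ['first', 'Second', 'third']

# re.X allows whitespace and comments inside the pattern itself.
date_pattern = re.compile(r'''
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
''', re.X)
date_groups = date_pattern.match('2023-07-04').groups()
print(date_groups)  # ('2023', '07', '04')
```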

5. Greedy Matching

By default, quantifiers are greedy and match as much text as possible:

text = ' I like "python", but you like "java" '
res = re.findall(r'\"(.*)\"', text)
['python", but you like "java']

Add ? after the quantifier to avoid greedy matching. The following is a non-greedy match:

text = ' I like "python", but you like "java" '
res = re.findall(r'\"(.*?)\"', text)
['python', 'java']
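Another common way to get the same result is a negated character class, which cannot run past a closing quote in the first place. This is an alternative technique rather than the approach shown above, sketched here for comparison:

```python
import re

text = ' I like "python", but you like "java" '

greedy = re.findall(r'"(.*)"', text)       # .* runs to the last quote
lazy = re.findall(r'"(.*?)"', text)        # .*? stops at the nearest quote
negated = re.findall(r'"([^"]*)"', text)   # [^"]* can never cross a quote

print(greedy)   # ['python", but you like "java']
print(lazy)     # ['python', 'java']
print(negated)  # ['python', 'java']
```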

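6. Common Examples

re.match(pattern, string[, flags]) matches only at the start of the string; if the start does not match, the match fails. On success it returns an re.Match object (displayed as SRE_Match in older Python versions); otherwise it returns None.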

# Successful match
text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
some_pattern = re.compile(r'Today')
some_pattern.match(text)


# Failed match: the text has a space before Today, so Today is not at the very start.
text = ' Today is 2023-07-04. Welcome to 2023-07-05.'
some_pattern = re.compile(r'Today')
some_pattern.match(text)  # returns None

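# Matching with groups: returns an re.Match object; use groups() and group() to read the group values.
text = '2023-07-04'
some_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
res = some_pattern.match(text)
res.groups()     # Returns a tuple of all groups: ('2023', '07', '04')
res.group(0)     # Returns the entire matched text: '2023-07-04'
res.group(1)     # Returns the first group: '2023'
res.group(2)     # Returns the second group: '07'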

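re.search(pattern, string[, flags]) scans the string and returns only the first match; later occurrences are ignored. It returns an re.Match object, or None if nothing matches.

text = 'Today is 2017-07-20. Welcome to 2018-07-21.'
some_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
res = some_pattern.search(text)
res.groups()     # Returns a tuple of all groups: ('2017', '07', '20')
res.group(0)     # Returns the matched text: '2017-07-20'
res.group(1)     # Returns the first group: '2017'
res.group(2)     # Returns the second group: '07'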

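Splitting a string on several different delimiters: re.split(pattern, string[, maxsplit]). With the re module, a single regular expression can split a string that mixes multiple delimiters. Pretty impressive, right?

info = " my teacher where; are , you . my,friend"
[x for x in re.split(r'\s*[;,.\s]\s*', info) if x ]
# Result (the "if x" filter drops the empty string produced by the leading space):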
['my', 'teacher', 'where', 'are', 'you', 'my', 'friend']

Find every occurrence of a pattern anywhere in the string and return the matches as a list: re.findall(pattern, string[, flags])

# If the pattern matches several times, each match is appended to the list.
text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
some_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
res = some_pattern.findall(text)
# res is the list ['2023-07-04', '2023-07-05']


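# If the pattern contains a capturing group in parentheses, the list holds the group values instead of the full matches.
text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
some_pattern = re.compile(r'(\d{4})-\d{2}-\d{2}')
res = some_pattern.findall(text)
# res is the list ['2023', '2023']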

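Search the whole string for a pattern and return an iterator: re.finditer(pattern, string[, flags]). finditer is similar to findall; the difference is that it yields match objects one at a time, so the results are read by iterating.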

re.finditer(pattern, string[, flags])
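As a minimal sketch of finditer in use, reusing the date text from the findall example (the names `date_pattern` and `found` are only illustrative), each yielded match object gives access to the groups and the position of the match:

```python
import re

text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

found = []
for m in date_pattern.finditer(text):
    # Each m is a match object, so group(), groups() and span() are all available.
    found.append((m.group(0), m.groups(), m.span()))

for entry in found:
    print(entry)
# ('2023-07-04', ('2023', '07', '04'), (9, 19))
# ('2023-07-05', ('2023', '07', '05'), (32, 42))
```

Unlike findall, nothing is collected into a list up front, which helps when the input is large or when the match positions are needed.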

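Regex search and replace: re.sub(pattern, repl, string[, count])

# Change dates from the form 2023-07-04 to 07/04/2023. A backslash followed by a number, such as \3, refers to the capturing group with that number in the pattern.
text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
res = re.sub(r'(\d+)-(\d+)-(\d+)', r'\2/\3/\1', text)
# res is 'Today is 07/04/2023. Welcome to 07/05/2023.'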


# Swap python and life
text = "python is short! i use life!"
res = re.sub(r'(python)(.*)(life)', r'\3\2\1', text)
# res is 'life is short! i use python!'

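Search and replace that returns both the new string and the number of substitutions made: re.subn(pattern, repl, string[, count])

# Change dates from the form 2023-07-04 to 07/04/2023. A backslash followed by a number, such as \3, refers to the capturing group with that number in the pattern.
text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
res, n = re.subn(r'(\d+)-(\d+)-(\d+)', r'\2/\3/\1', text)
# res is 'Today is 07/04/2023. Welcome to 07/05/2023.' and n is 2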


# Swap python and life
text = "python is short! i Use life!"
res, n = re.subn(r'(python)(.*)(life)', r'\3\2\1', text)
# res is 'life is short! i Use python!' and n is 1
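Two further options can be sketched with the same date example: the count argument limits how many matches are replaced, and repl may be a function that receives each match object. The helper name `to_us_format` is hypothetical.

```python
import re

text = 'Today is 2023-07-04. Welcome to 2023-07-05.'
date_pattern = re.compile(r'(\d+)-(\d+)-(\d+)')

# count=1 limits the substitution to the first match only.
first_only = date_pattern.sub(r'\2/\3/\1', text, count=1)
print(first_only)  # Today is 07/04/2023. Welcome to 2023-07-05.

def to_us_format(m):
    """Build the replacement text from the match object."""
    year, month, day = m.groups()
    return f'{month}/{day}/{year}'

# A function as repl is called once per match; subn also reports the count.
res, n = date_pattern.subn(to_us_format, text)
print(res, n)  # Today is 07/04/2023. Welcome to 07/05/2023. 2
```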

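All the matches above work within a single line. When the text to match spans several lines, it is easy to forget that a dot (.) does not match a newline by default. The fix is the re.DOTALL flag, not re.MULTILINE, which only changes how ^ and $ behave. For example, consider matching C-style comments: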

comment = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is a comment */'
text2 = '''/*
           this is a
           multiline comment
           */
        '''
comment.findall(text1)
[' this is a comment ']

comment.findall(text2)
[]


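# Adding the re.DOTALL flag is all it takes
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
comment.findall(text2)
# Result: a list with one match. The captured text keeps the newlines and
# indentation inside the comment; with whitespace runs collapsed it reads
# 'this is a multiline comment'.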
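To show why re.DOTALL is the relevant flag here and re.MULTILINE is not, the sketch below runs the same pattern under each flag on a short made-up comment. The last line is an alternative that needs no flag at all, using an explicit "any character or newline" group.

```python
import re

text = '/* first\n   second */'

no_flag = re.findall(r'/\*(.*?)\*/', text)
multiline = re.findall(r'/\*(.*?)\*/', text, re.MULTILINE)  # re.M only affects ^ and $
dotall = re.findall(r'/\*(.*?)\*/', text, re.DOTALL)        # . now matches '\n' too
explicit = re.findall(r'/\*((?:.|\n)*?)\*/', text)          # same effect without a flag

print(no_flag)    # []
print(multiline)  # []
print(dotall)     # [' first\n   second ']
print(explicit)   # [' first\n   second ']
```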


Read the original: https://yuchaoshui.com/e7d1786/
