Introduction to Regular Expressions (12): Miscellaneous

hj170520 · posted on 2020-5-26 22:12

Introductory code (12):

using dict	replacement string based on the matched text as dictionary key
	ex: re.sub(r'pat', lambda m: d.get(m[0], default), s)
re.subn()	gives tuple of modified string and number of substitutions
\G	regex module, restricts matching from start of string like \A
	continues matching from end of match as new anchor until it fails
	ex: regex.findall(r'\G\d+-?', '12-34 42') gives ['12-', '34']
subexpression call	regex module, helps to define recursive matching
	ex: r'\((?:[^()]++|(?0))++\)' matches nested sets of parentheses
[[:digit:]]	regex module, named character set for \d
[[:^digit:]]	to indicate \D
	See regular-expressions: POSIX Bracket for full list
(?V1)	inline flag to enable version 1 for regex module
	regex.DEFAULT_VERSION=regex.VERSION1 can also be used
	(?V0) or regex.VERSION0 to get back default version
set operations	V1 enables this feature for character classes, nested [] allowed
||	union
~~	symmetric difference
&&	intersection
--	difference
	ex: (?V1)[[:punct:]--[.!?]] punctuation except . ! and ?
pat(*SKIP)(*F)	regex module, ignore text matched by pat
	ex: "[^"]++"(*SKIP)(*F)|, will match , but not inside
	double quoted pairs


import re
import regex

'''Using dict'''
# one to one mappings
d = { '1': 'one', '2': 'two', '4': 'four' }
print(re.sub(r'[124]', lambda m: d[m[0]], '9234012'))
# returns '9two3four0onetwo'

# if the matched text doesn't exist as a key, default value will be used
print(re.sub(r'\d', lambda m: d.get(m[0], 'X'), '9234012'))
# returns 'XtwoXfourXonetwo'

# To swap two or more portions without an intermediate placeholder, using a dict is recommended.

swap = { 'cat': 'tiger', 'tiger': 'cat' }
words = 'cat tiger dog tiger cat'

# replace word if it exists as key, else leave it as is
print(re.sub(r'\w+', lambda m: swap.get(m[0], m[0]), words))
# returns 'tiger cat dog cat tiger'

# or, build the alternation list manually for simple cases
print(re.sub(r'cat|tiger', lambda m: swap[m[0]], words))
# returns 'tiger cat dog cat tiger'
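# A minimal sketch (added for illustration, not from the original post) of why a
# dict lookup is preferred for swapping: chained str.replace calls overwrite each
# other's results. The names swap_text, swap_map, chained and swapped are hypothetical.
import re

swap_text = 'cat tiger dog tiger cat'
swap_map = {'cat': 'tiger', 'tiger': 'cat'}

# the second replace also converts every 'tiger' produced by the first one
chained = swap_text.replace('cat', 'tiger').replace('tiger', 'cat')
print(chained)
# returns 'cat cat dog cat cat'

# a single pass with a dict lookup replaces each match exactly once
swapped = re.sub(r'cat|tiger', lambda m: swap_map[m[0]], swap_text)
print(swapped)
# returns 'tiger cat dog cat tiger'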


# For a dict that has many entries and is likely to change during development,
# building the alternation list manually is not a good choice.
# Also, recall that alternatives are tried from left to right, so longer strings should come before shorter ones.

d = { 'hand': 1, 'handy': 2, 'handful': 3, 'a^b': 4 }

# take care of metacharacter escaping first
words = [re.escape(k) for k in d.keys()]
# build alternation list
# add anchors and flags as needed to construct the final RE
print('|'.join(sorted(words, key=len, reverse=True)))
# returns 'handful|handy|hand|a\\^b'
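# One possible way to finish the job (a sketch, not from the original post): use the
# sorted alternation as the RE and the dict for lookups. The names num_map, demo,
# by_length, unsorted_alt, sorted_result and unsorted_result are hypothetical.
import re

num_map = {'hand': 1, 'handy': 2, 'handful': 3, 'a^b': 4}
demo = 'handful handed handy a^b'

# longest keys first, so 'hand' cannot shadow 'handy' or 'handful'
by_length = '|'.join(sorted(map(re.escape, num_map), key=len, reverse=True))
sorted_result = re.sub(by_length, lambda m: str(num_map[m[0]]), demo)
print(sorted_result)
# returns '3 1ed 2 4'

# in dict order, 'hand' is tried first and wins every time
unsorted_alt = '|'.join(map(re.escape, num_map))
unsorted_result = re.sub(unsorted_alt, lambda m: str(num_map[m[0]]), demo)
print(unsorted_result)
# returns '1ful 1ed 1y 4'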


'''re.subn
The re.subn function returns a tuple of the modified string and the number of substitutions made.
This can be used to perform conditional operations based on whether a substitution took place.
Or, the count itself may be needed to solve the given problem.
'''
word = 'coffining'
# repeatedly delete 'fin' until none is left
while True:
    word, cnt = re.subn(r'fin', r'', word)
    if cnt == 0:
        break

print(word)
# returns 'cog'
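# For reference, a sketch (not from the original post) of what a single re.subn call
# returns, plus the loop above wrapped in a helper. sub_until_stable is a hypothetical
# name, not a function of the re module. It never terminates if the replacement keeps
# creating new matches.
import re

single_pass = re.subn(r'fin', '', 'coffining')
print(single_pass)
# returns ('cofing', 1)

def sub_until_stable(pattern, repl, text):
    """Apply re.subn repeatedly until no substitution is made."""
    while True:
        text, count = re.subn(pattern, repl, text)
        if count == 0:
            return text

stable = sub_until_stable(r'fin', '', 'coffining')
print(stable)
# returns 'cog'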

# Here's an example that won't work if a greedy quantifier is used instead of a possessive quantifier.

row = '421,foo,2425,42,5,foo,6,6,42'

# lookarounds used to ensure start/end of column matching
# possessive quantifier used to ensure partial column is not captured
# if a column has same text as another column, the latter column is deleted
while True:
    row, cnt = regex.subn(r'(?<=\A|,)([^,]++).*\K,\1(?=,|\Z)', r'', row)
    if cnt == 0:
        break

print(row)
# returns '421,foo,2425,42,5,6'

r'''\G anchor
The \G anchor (provided by the regex module) restricts matching to the start of the string, like the \A anchor.
In addition, after a match is found, the end of that match becomes the new anchor location.
This process repeats until the given RE fails to match (assuming multiple matches with sub, findall, etc.).
'''
# all non-whitespace characters from start of string
print(regex.findall(r'\G\S', '123-87-593 42 foo'))
# returns ['1', '2', '3', '-', '8', '7', '-', '5', '9', '3']
print(regex.sub(r'\G\S', r'*', '123-87-593 42 foo'))
# returns '********** 42 foo'

# all digits and optional hyphen combo from start of string
print(regex.findall(r'\G\d+-?', '123-87-593 42 foo'))
# returns ['123-', '87-', '593']
print(regex.sub(r'\G(\d+)(-?)', r'(\1)\2', '123-87-593 42 foo'))
# returns '(123)-(87)-(593) 42 foo'

# all word characters from start of string
# only if it is followed by word character
print(regex.findall(r'\G\w(?=\w)', 'cat12 bat pin'))
# returns ['c', 'a', 't', '1']
print(regex.sub(r'\G\w(?=\w)', r'\g<0>:', 'cat12 bat pin'))
# returns 'c:a:t:1:2 bat pin'

# all lowercase alphabets or space from start of string
print(regex.sub(r'\G[a-z ]', r'(\g<0>)', 'par tar-den hen-food mood'))
# returns '(p)(a)(r)( )(t)(a)(r)-den hen-food mood'
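# A rough emulation of \G with the standard re module (a sketch, not from the
# original post): match at the start, then keep matching at the end of the previous
# match until the pattern fails. findall_from_start is a hypothetical helper, and
# stopping on an empty match is a simplification.
import re

def findall_from_start(pattern, text):
    compiled = re.compile(pattern)
    pos, found = 0, []
    while True:
        m = compiled.match(text, pos)   # anchored at pos, like \G
        if m is None or m.end() == pos:
            return found
        found.append(m[0])
        pos = m.end()

digits_from_start = findall_from_start(r'\d+-?', '123-87-593 42 foo')
print(digits_from_start)
# returns ['123-', '87-', '593']
chars_from_start = findall_from_start(r'\w(?=\w)', 'cat12 bat pin')
print(chars_from_start)
# returns ['c', 'a', 't', '1']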

'''Recursive matching
First up, a RE to match a set of parentheses that is not nested (termed as level-one RE for reference).
'''
# note the use of possessive quantifier
eqn0 = 'a + (b * c) - (d / e)'
print(regex.findall(r'\([^()]++\)', eqn0))
# returns ['(b * c)', '(d / e)']

eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
print(regex.findall(r'\([^()]++\)', eqn1))
# returns ['(f+x)', '(3-g)']

# Next, matching a set of parentheses which may optionally contain any number of non-nested sets of parentheses 
# (termed as level-two RE for reference). 
# See debuggex for a railroad diagram, notice the recursive nature of this RE.

eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
# note the use of non-capturing group
print(regex.findall(r'\((?:[^()]++|\([^()]++\))++\)', eqn1))
# returns ['((f+x)^y-42)', '((3-g)^z+2)']

eqn2 = 'a + (b) + ((c)) + (((d)))'
print(regex.findall(r'\((?:[^()]++|\([^()]++\))++\)', eqn2))
# returns ['(b)', '((c))', '((d))']

# That looks very cryptic. Better to use regex.X flag for clarity as well as for comparing against the recursive version.
# Breaking down the RE, you can see ( and ) have to be matched literally. 
# Inside that, valid string is made up of either non-parentheses characters or a non-nested parentheses sequence (level-one RE).

lvl2 = regex.compile(r'''
          \(              #literal (
            (?:           #start of non-capturing group
             [^()]++      #non-parentheses characters
             |            #OR
             \([^()]++\)  #level-one RE
            )++           #end of non-capturing group, 1 or more times
          \)              #literal )
          ''', flags=regex.X)

print(lvl2.findall(eqn1))
# returns ['((f+x)^y-42)', '((3-g)^z+2)']

print(lvl2.findall(eqn2))
# returns ['(b)', '((c))', '((d))']

# To recursively match any number of nested sets of parentheses, use a capture group and call it within the capture group itself. 
# Since entire RE needs to be called here, you can use the default zeroth capture group (this also helps to avoid having to use finditer). 
# Comparing with level-two RE, the only change is that (?0) is used instead of the level-one RE in the second alternation.

lvln = regex.compile(r'''
          \(           #literal (
            (?:        #start of non-capturing group
             [^()]++   #non-parentheses characters
             |         #OR
             (?0)      #recursive call
            )++        #end of non-capturing group, 1 or more times
          \)           #literal )
          ''', flags=regex.X)

print(lvln.findall(eqn0))
# returns ['(b * c)', '(d / e)']

print(lvln.findall(eqn1))
# returns ['((f+x)^y-42)', '((3-g)^z+2)']

print(lvln.findall(eqn2))
# returns ['(b)', '((c))', '(((d)))']

eqn3 = '(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))'
print(lvln.findall(eqn3))
# returns ['(3+a)', '((r-2)*(t+2)/6)', '(a(b(c(d(e)))))']
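# A plain-Python cross-check (a sketch, not from the original post) for the recursive
# RE: track the nesting depth and collect each outermost balanced group.
# top_level_groups is a hypothetical helper. Unlike the RE it also returns an empty
# '()' pair, and its results may differ from the RE on unbalanced input.
def top_level_groups(text):
    groups, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i           # an outermost group opens here
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:          # the outermost group just closed
                groups.append(text[start:i + 1])
    return groups

nested_demo = top_level_groups('a + (b) + ((c)) + (((d)))')
print(nested_demo)
# returns ['(b)', '((c))', '(((d)))']
deep_demo = top_level_groups('(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))')
print(deep_demo)
# returns ['(3+a)', '((r-2)*(t+2)/6)', '(a(b(c(d(e)))))']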


'''Named character sets
A named character set is defined by a name enclosed between [: and :] and has to be used within a character class [], along with any other characters as needed.
Using [:^ instead of [: will negate the named character set.
'''
# similar to: r'\d+' or r'[0-9]+'
print(regex.split(r'[[:digit:]]+', 'Sample123string42with777numbers'))
# returns ['Sample', 'string', 'with', 'numbers']
# similar to: r'[a-zA-Z]+'
print(regex.sub(r'[[:alpha:]]+', r':', 'Sample123string42with777numbers'))
# returns ':123:42:777:'

# similar to: r'[\w\s]+'
print(regex.findall(r'[[:word:][:space:]]+', 'tea sea-pit sit-lean\tbean'))
# returns ['tea sea', 'pit sit', 'lean\tbean']
# similar to: r'\S+'
print(regex.findall(r'[[:^space:]]+', 'tea sea-pit sit-lean\tbean'))
# returns ['tea', 'sea-pit', 'sit-lean', 'bean']

# words not surrounded by punctuation characters
print(regex.findall(r'(?<![[:punct:]])\b\w+\b(?![[:punct:]])', 'tie. ink eat;'))
# returns ['ink']
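# The standard re module has no [:punct:]. One hedged alternative (a sketch, not
# from the original post) is to build the class from string.punctuation, which is
# ASCII only and so may be narrower than the regex module's named set.
# ascii_punct and plain_words are hypothetical names.
import re
import string

ascii_punct = '[' + re.escape(string.punctuation) + ']'
plain_words = re.findall(rf'(?<!{ascii_punct})\b\w+\b(?!{ascii_punct})', 'tie. ink eat;')
print(plain_words)
# returns ['ink']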

'''Character class set operations
Set operations can be applied inside character class between sets. 
Mostly used to get intersection or difference between two sets, where one/both of them is a character range or predefined character set. 
To aid in such definitions, you can use [] in nested fashion. The four operators, in increasing order of precedence, are:

|| union
~~ symmetric difference
&& intersection
-- difference
'''
# [^aeiou] will match any non-vowel character
# which means space is also a valid character to be matched
print(re.findall(r'\b[^aeiou]+\b', 'tryst glyph pity why'))
# returns ['tryst glyph ', ' why']
# intersection or difference can be used here
# to get a positive definition of characters to match
print(regex.findall(r'(?V1)\b[a-z&&[^aeiou]]+\b', 'tryst glyph pity why'))
# returns ['tryst', 'glyph', 'why']

# [[a-l]~~[g-z]] is same as [a-fm-z]
print(regex.findall(r'(?V1)\b[[a-l]~~[g-z]]+\b', 'gets eat top sigh'))
# returns ['eat', 'top']

# remove all punctuation characters except . ! and ?
para = '"Hi", there! How *are* you? All fine here.'
print(regex.sub(r'(?V1)[[:punct:]--[.!?]]+', r'', para))
# returns 'Hi there! How are you? All fine here.'
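# Two hedged equivalents for the standard re module, which lacks set operations
# (a sketch, not from the original post): a negative lookahead stands in for the
# intersection, and ordinary string filtering builds the difference. Using the
# ASCII-only string.punctuation is an assumption; consonant_words, kept_punct,
# removable and cleaned are hypothetical names.
import re
import string

# like [a-z&&[^aeiou]]: a lowercase letter that is not a vowel
consonant_words = re.findall(r'\b(?:(?![aeiou])[a-z])+\b', 'tryst glyph pity why')
print(consonant_words)
# returns ['tryst', 'glyph', 'why']

# like [[:punct:]--[.!?]]: punctuation minus . ! and ?
kept_punct = '.!?'
removable = ''.join(c for c in string.punctuation if c not in kept_punct)
cleaned = re.sub('[' + re.escape(removable) + ']+', '', '"Hi", there! How *are* you? All fine here.')
print(cleaned)
# returns 'Hi there! How are you? All fine here.'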

'''Skipping matches
Sometimes, you want to change or extract all matches except particular ones.
Usually, the two types of matches share characteristics that make it hard or impossible to define an RE only for the required matches.
For example, changing field values unless it is a particular name, or leaving double quoted values untouched.
To use the skipping feature, define the matches to be ignored suffixed by (*SKIP)(*FAIL) and then define the required matches as part of the alternation.
(*F) can also be used instead of (*FAIL).
'''
# change lowercase words other than imp or rat
words = 'tiger imp goat eagle rat'
print(regex.sub(r'\b(?:imp|rat)\b(*SKIP)(*F)|[a-z]++', r'(\g<0>)', words))
# returns '(tiger) imp (goat) (eagle) rat'

# change all commas other than those inside double quotes
row = '1,"cat,12",nice,two,"dog,5"'
print(regex.sub(r'"[^"]++"(*SKIP)(*F)|,', r'|', row))
# returns '1|"cat,12"|nice|two|"dog,5"'
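# The standard re module has no (*SKIP)(*F). A common workaround (a sketch, not
# from the original post) matches the text to skip first and captures only the
# wanted part; the replacement function then checks whether the group took part
# in the match. csv_row and piped are hypothetical names.
import re

csv_row = '1,"cat,12",nice,two,"dog,5"'
# m[1] is None when the quoted alternative matched, so that text is put back as is
piped = re.sub(r'"[^"]+"|(,)', lambda m: '|' if m[1] else m[0], csv_row)
print(piped)
# returns '1|"cat,12"|nice|two|"dog,5"'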

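hshcompass · posted on 2020-5-27 08:00

Do we have to pass an English exam too?

hj170520 · posted on 2020-5-26 22:24

Floor reserved for exercises

haliluyadada · posted on 2020-5-27 07:30

Great, bookmarked

sxlcity · posted on 2020-5-27 07:56

Thanks for sharing, learned something

hxw0204 · posted on 2020-5-27 08:41

Replying to show support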
