This module provides regular expression matching operations over a sequence of tuples (or a sequence of sequence) data structure. It looks like the following:
seq_m_n = [[str_11, str_12, ... str_1n], [str_21, str_22, ... str_2n], ..., [str_m1, str_m2, ... str_mn]]
The sequence is a homogeneous 2D array, that is a matrix with m rows and n columns. In practice, m maybe vary from sequence to sequence, while n is usually a fixed-size.
A element in the tuple of the sequence can be considered as either a string, a word, a phrase, a char, a flag, a token or a tag, and maybe a set of tags or values (multi-values) in the future.
To match a pattern over a sequence of tuples, the SEQ RE patterns is written like one of the examples:
([;;PERSON]+) [was|has been] [an]? .{0,3} ([^painter|drawing artist|画家]) (?P<name@0,1,2>[;;PERSON]) [;VERB be;] [born] [on] (?P<birthday@0:3>([;;NUMBER|MONTH]|[-]){2,3})
A SEQ RE pattern is very similar to the ordinary regular express (RE) used in Python,
in which the delimiters [...]
is to indicate a tuple -- the second dimension of the sequence.
[
and]
is the beginning and end delimiter of the tuple, e.g.
[...]
.;
separates each element which the tuple contains, and the continuous
;
at the tail can be omitted, e.g.[A|B;X;;]
,[A|B;X]
.|
indicates the different values of one element, e.g.
A|B
. These values form a set, and any string in the set will be matched, e.g.A|B
will matchA
orB
.^
be the first character of an element, all the string that are not in the value set of this element will be matched. And
^
has no special meaning if it’s not the first character of the element. If^
comes the first character of an element but it is a part of a literal string,\^
should be used to escape it.The priority of above-mentioned operations:
[
]
<;
<^
(not literal) <|
<^
(literal) .\
is an escaping symbol before aforementioned special characters. Characters other than
]
,:
or\
lose their special meaning inside[...]
. To express]
,:
or|
in literal,\
should be added before]
,:
or|
. Meanwhile, to represent a literal backslash\
before]
,;
or|
,\\
should be used in the plain text that is to say'\\\\'
must be used in the Python code.
- The special meanings of special characters in the ordinary RE are available here,
but with the limitations discussed below.
- Not support
[
and]
as special characters to indicate a set of characters. - Not support the following escaped special characters:
\number
,\A
,\b
,\B
,\d
,\D
,\s
,\S
,\w
,\W
,\Z
,\a
,\b
,\f
,\n
,\r
,\t
,\v
,\x
. - Not support ranges of characters,
such as
[0-9A-Za-z]
,[\u4E00-\u9FBB\u3007]
(Unihan and Chinese character〇
) used in ordinary RE. - The whitespace and non-special characters are ignored.
- Not support
.
is an abbreviation of an arbitrary tuple[]
or[;]
.- The named groups in the pattern are very useful.
As an extension, a format string starting with
@
can be followed after the group name, to describe which element of the tuples belonging this group will be output as the result. For example:(?P<name@d1,d2:d3>...)
, in whichd1
,d2
andd3
are all 0-based position index number of elements in the tuple.@0,2:4
means in the matched result only the 0th and from 2nd to 3rd elements of tuples will be output.@@
means the pattern of the group itself will be output other than the matched result. one can choose whether to include the group name and parentheses or not.@
means all elements of tuples in the matched result will be output.
Given a sequence of 3-tuple [[s1, s2, s3], ... ]
,
AND
[X;;Y]
will matchs1
==X
&&s3
==Y
. Its behavior looks like the ordinary RE pattern(?:X.Y)
.OR
[X;;]|[;;Y]
will matchs1
==X
||s3
==Y
. Its behavior looks like the ordinary RE pattern(?:X..)|(?:..Y)
NOT
If
[;^P;]
will matchs2
!=P
. Its behavior looks like the ordinary RE pattern(?:.[^P].)
.We can also use a negative lookahead assertion of the ordinary RE, to give a negative covering its following. e.g.
(?![;P;][Q])[;;][;;]
<==>[;^P;][^Q;;]
, which behavior looks like the ordinary RE pattern(?!(?:.P.)(?:Q..))...
.
Not support comparing the number of figures.
Multi-values of one element is not supported now, but this feature may be improved in the future.
Although SEQ RE has sufficient ability to express a pattern over sequences of tuples, it is still not a cascaded regular expressions (see also: Stanford TokensRegex).
The usage of seq_re module:
from __future__ import print_function import seq_re n = 3 pattern = ('(?P<name@0>[;;PERSON]+) [is|was|has been] [a|an]? ' '(?P<attrib@0,1>.{0,3}) ([artist])') seq = [['Vincent van Gogh', 'NNP', 'PERSON'], ['was', 'VBD', 'O'], ['a', 'DT', 'O'], ['Dutch', 'JJ', 'O'], ['Post-Impressionist', 'NN', 'O'], ['painter', 'NN', 'OCCUPATION'], ['who', 'WP', 'O'], ['is', 'VBZ', 'O'], ['among', 'IN', 'O'], ['the', 'DT', 'O'], ['most', 'RBS', 'O'], ['famous', 'JJ', 'O'], ['and', 'CC', 'O'], ['influential', 'JJ', 'O'], ['figures', 'NNS', 'O'], ['in', 'IN', 'O'], ['the', 'DT', 'O'], ['history', 'NN', 'O'], ['of', 'IN', 'O'], ['Western art', 'NNP', 'DOMAIN'], ['.', '.', 'O']] placeholder_dict = {'artist': ['painter', 'drawing artist']} sr = seq_re.SeqRegex(n).compile(pattern, **placeholder_dict) match = sr.search(seq) if match: for g in match.group_list: print(' '.join(['`'.join(tup) for tup in g[1]])) for name in sorted(match.named_group_dict, key=lambda gn: match.named_group_dict[gn][0]): print(name, match.format_group_to_str(name, True))