EDIT
La mienne n'est pas une bonne (ou en fait, après y avoir regardé de plus près, aucune) solution au problème, qui nécessite trop de tracas pour une seule regex. Le comptage et la suppression des ensembles ne sont pas ce que vous voulez intégrer dans une seule regex. Utilisez une logique auxiliaire. Voir la solution de Tim.
J'utilise Perl, l'outil de prédilection :) pour ce genre de travail :
@blacklist = qw(four five);
$blacklist = do { local $" = "|"; qr/@blacklist/ };
@whitelist = qw(one two three);
$whitelist = do { local $" = "|"; qr/@whitelist/ };
if ($string =~ /\b$whitelist\b/ && $string !~ /\b$blacklist\b/) { ... }
Perl les compilera en structures de données trie, et les exécutera donc extrêmement rapidement, quel que soit le nombre d'alternatives que vous avez. Cette optimisation O(1) est très importante dans de nombreuses applications du monde réel.
EDIT
J'ai mal interprété le problème en disant que je voulais une liste blanche et pas de liste noire, et non pas une liste blanche et pas de liste noire. Ce n'est pas vraiment ce que j'appellerais une liste blanche. Peut-être une liste de souhaits et une liste de non-désirs, peut-être.
Quoi qu'il en soit, voici une mise à jour qui fonctionne dans le sens "tout et rien", même si, pour des raisons de performances, il faut inverser l'ordre et faire "tout et rien". (Cela suppose que les mots ne se chevauchent pas ; je ). peut faire le cas de chevauchement, mais c'est plus facile à comprendre sans cela).
#!/usr/bin/env perl
use strict;
use warnings;
my @blacklist = qw(four five);
my $blacklist = do { local $" = "|"; qr/@blacklist/ };
my @whitelist = qw(one two three);
my $whitelist = do { local $" = "|"; qr/@whitelist/ };
while (<DATA>) {
s/\s*#\s*(.*)$// && print "This $1\n\t";
if (/$blacklist/ == 0 && @whitelist == (() = /\b$whitelist\b/g)) {
print "GOOD: ";
} else {
print "EVIL: ";
}
print;
}
__END__
three one two # is a matching text (contains all whitelist words)
one three two six # is a matching text (contains all whitelist words)
two one # is not a matching text (lacks a whitelist word three)
one four two three # is not a matching text (contains a blacklist word four)
Lorsqu'il est exécuté, il rapporte :
This is a matching text (contains all whitelist words)
GOOD: three one two
This is a matching text (contains all whitelist words)
GOOD: one three two six
This is not a matching text (lacks a whitelist word three)
EVIL: two one
This is not a matching text (contains a blacklist word four)
EVIL: one four two three
Pour déboguer la regex, y compris pour voir comment les deux structures de données trie se compilent et s'exécutent, incluez simplement -Mre=debug
sur la ligne de commande ou use re "debug";
dans le code. Voici ce que cela donne :
% perl -Mre=debug /tmp/trie
Compiling REx "\s*#\s*(.*)$"
synthetic stclass "ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}]".
Final program:
1: STAR (3)
2: SPACE (0)
3: EXACT <#> (5)
5: STAR (7)
6: SPACE (0)
7: OPEN1 (9)
9: STAR (11)
10: REG_ANY (0)
11: CLOSE1 (13)
13: EOL (14)
14: END (0)
floating "#" at 0..2147483647 (checking floating) stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] minlen 1
Compiling REx "four|five"
Final program:
1: EXACT <f> (3)
3: TRIE-EXACT[io] (7)
<our>
<ive>
7: END (0)
anchored "f" at 0 (checking anchored) minlen 4
Compiling REx "one|two|three"
Final program:
1: TRIEC-EXACT[ot] (11)
<one>
<two>
<three>
11: END (0)
stclass AHOCORASICKC-EXACT[ot] minlen 3
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "three one two # is a matching text (contains all whitelist w"...
Found floating substr "#" at offset 14...
start_shift: 0 check_at: 14 s: 0 endpos: 15
By STCLASS: moving 0 --> 5
Guessed: match at offset 5
Matching REx "\s*#\s*(.*)$" against " one two # is a matching text (contains all whitelist words)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one two # is a matching text (contains all whitelist words)"... (61 bytes)
5 <three> < one two #> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
9 <e one> < two # is > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
13 <e two> < # is a ma> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
14 < two > <# is a mat> | 3: EXACT <#>(5)
15 <two #> < is a matc> | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
16 <wo # > <is a match> | 7: OPEN1(9)
16 <wo # > <is a match> | 9: STAR(11)
REG_ANY can match 49 times out of 2147483647...
65 <list words)> <%n> | 11: CLOSE1(13)
65 <list words)> <%n> | 13: EOL(14)
65 <list words)> <%n> | 14: END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "three one two%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Compiling REx "\b(?^:one|two|three)\b"
Final program:
1: BOUND (2)
2: TRIEC-EXACT[ot] (13)
<one>
<two>
<three>
13: BOUND (14)
14: END (0)
stclass BOUND minlen 3
Matching REx "\b(?^:one|two|three)\b" against "three one two%n"
Matching stclass BOUND against "three one tw" (12 bytes)
0 <> <three one > | 1:BOUND(2)
0 <> <three one > | 2:TRIEC-EXACT[ot](13)
0 <> <three one > | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
1 <t> <hree one t> | State: 5 Accepted: N Charid: 6 CP: 68 After State: 8
2 <th> <ree one tw> | State: 8 Accepted: N Charid: 7 CP: 72 After State: 9
3 <thr> <ee one two> | State: 9 Accepted: N Charid: 3 CP: 65 After State: a
4 <thre> <e one two> | State: a Accepted: N Charid: 3 CP: 65 After State: b
5 <three> < one two%n> | State: b Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #3, continuing
only one match left, short-circuiting: #3 <three>
5 <three> < one two%n> | 13:BOUND(14)
5 <three> < one two%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one two%n"
Matching stclass BOUND against " one tw" (7 bytes)
5 <three> < one two%n> | 1:BOUND(2)
5 <three> < one two%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
6 <hree > <one two%n> | 1:BOUND(2)
6 <hree > <one two%n> | 2:TRIEC-EXACT[ot](13)
6 <hree > <one two%n> | State: 1 Accepted: N Charid: 1 CP: 6f After State: 2
7 <ree o> <ne two%n> | State: 2 Accepted: N Charid: 2 CP: 6e After State: 3
8 <ree on> <e two%n> | State: 3 Accepted: N Charid: 3 CP: 65 After State: 4
9 <ree one> < two%n> | State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <one>
9 <ree one> < two%n> | 13:BOUND(14)
9 <ree one> < two%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two%n"
Matching stclass BOUND against " tw" (3 bytes)
9 <ree one> < two%n> | 1:BOUND(2)
9 <ree one> < two%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
10 <ree one > <two%n> | 1:BOUND(2)
10 <ree one > <two%n> | 2:TRIEC-EXACT[ot](13)
10 <ree one > <two%n> | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
11 <ree one t> <wo%n> | State: 5 Accepted: N Charid: 5 CP: 77 After State: 6
12 <ree one tw> <o%n> | State: 6 Accepted: N Charid: 1 CP: 6f After State: 7
13 <ree one two> <%n> | State: 7 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #2, continuing
only one match left, short-circuiting: #2 <two>
13 <ree one two> <%n> | 13:BOUND(14)
13 <ree one two> <%n> | 14:END(0)
Match successful!
GOOD: three one two
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one three two six # is a matching text (contains all whiteli"...
Found floating substr "#" at offset 18...
start_shift: 0 check_at: 18 s: 0 endpos: 19
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " three two six # is a matching text (contains all whitelist "...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " three two six # is a matching text (contains all whitelist "... (67 bytes)
3 <one> < three two> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
9 <three> < two six #> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
13 <e two> < six # is > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
17 <o six> < # is a ma> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
18 < six > <# is a mat> | 3: EXACT <#>(5)
19 <six #> < is a matc> | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
20 <ix # > <is a match> | 7: OPEN1(9)
20 <ix # > <is a match> | 9: STAR(11)
REG_ANY can match 49 times out of 2147483647...
69 <list words)> <%n> | 11: CLOSE1(13)
69 <list words)> <%n> | 13: EOL(14)
69 <list words)> <%n> | 14: END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "one three two six%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "one three two six%n"
Matching stclass BOUND against "one three two si" (16 bytes)
0 <> <one three > | 1:BOUND(2)
0 <> <one three > | 2:TRIEC-EXACT[ot](13)
0 <> <one three > | State: 1 Accepted: N Charid: 1 CP: 6f After State: 2
1 <o> <ne three t> | State: 2 Accepted: N Charid: 2 CP: 6e After State: 3
2 <on> <e three tw> | State: 3 Accepted: N Charid: 3 CP: 65 After State: 4
3 <one> < three two> | State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <one>
3 <one> < three two> | 13:BOUND(14)
3 <one> < three two> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " three two six%n"
Matching stclass BOUND against " three two si" (13 bytes)
3 <one> < three two> | 1:BOUND(2)
3 <one> < three two> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
4 <one > <three two > | 1:BOUND(2)
4 <one > <three two > | 2:TRIEC-EXACT[ot](13)
4 <one > <three two > | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
5 <one t> <hree two s> | State: 5 Accepted: N Charid: 6 CP: 68 After State: 8
6 <ne th> <ree two si> | State: 8 Accepted: N Charid: 7 CP: 72 After State: 9
7 <e thr> <ee two six> | State: 9 Accepted: N Charid: 3 CP: 65 After State: a
8 < thre> <e two six> | State: a Accepted: N Charid: 3 CP: 65 After State: b
9 <three> < two six%n> | State: b Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #3, continuing
only one match left, short-circuiting: #3 <three>
9 <three> < two six%n> | 13:BOUND(14)
9 <three> < two six%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two six%n"
Matching stclass BOUND against " two si" (7 bytes)
9 <three> < two six%n> | 1:BOUND(2)
9 <three> < two six%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
10 <hree > <two six%n> | 1:BOUND(2)
10 <hree > <two six%n> | 2:TRIEC-EXACT[ot](13)
10 <hree > <two six%n> | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
11 <ree t> <wo six%n> | State: 5 Accepted: N Charid: 5 CP: 77 After State: 6
12 <ree tw> <o six%n> | State: 6 Accepted: N Charid: 1 CP: 6f After State: 7
13 <ree two> < six%n> | State: 7 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #2, continuing
only one match left, short-circuiting: #2 <two>
13 <ree two> < six%n> | 13:BOUND(14)
13 <ree two> < six%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " six%n"
Matching stclass BOUND against " si" (3 bytes)
13 <ree two> < six%n> | 1:BOUND(2)
13 <ree two> < six%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
14 <ree two > <six%n> | 1:BOUND(2)
14 <ree two > <six%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
Contradicts stclass... [regexec_flags]
Match failed
GOOD: one three two six
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "two one # is not a matching text (lacks a whitelist word thr"...
Found floating substr "#" at offset 8...
start_shift: 0 check_at: 8 s: 0 endpos: 9
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " one # is not a matching text (lacks a whitelist word three)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one # is not a matching text (lacks a whitelist word three)"... (61 bytes)
3 <two> < one # is > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
7 <o one> < # is not > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
8 < one > <# is not a> | 3: EXACT <#>(5)
9 <one #> < is not a > | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
10 <ne # > <is not a m> | 7: OPEN1(9)
10 <ne # > <is not a m> | 9: STAR(11)
REG_ANY can match 53 times out of 2147483647...
63 <word three)> <%n> | 11: CLOSE1(13)
63 <word three)> <%n> | 13: EOL(14)
63 <word three)> <%n> | 14: END(0)
Match successful!
This is not a matching text (lacks a whitelist word three)
Guessing start of match in sv for REx "four|five" against "two one%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "two one%n"
Matching stclass BOUND against "two on" (6 bytes)
0 <> <two one%n> | 1:BOUND(2)
0 <> <two one%n> | 2:TRIEC-EXACT[ot](13)
0 <> <two one%n> | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
1 <t> <wo one%n> | State: 5 Accepted: N Charid: 5 CP: 77 After State: 6
2 <tw> <o one%n> | State: 6 Accepted: N Charid: 1 CP: 6f After State: 7
3 <two> < one%n> | State: 7 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #2, continuing
only one match left, short-circuiting: #2 <two>
3 <two> < one%n> | 13:BOUND(14)
3 <two> < one%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one%n"
Matching stclass BOUND against " on" (3 bytes)
3 <two> < one%n> | 1:BOUND(2)
3 <two> < one%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
4 <two > <one%n> | 1:BOUND(2)
4 <two > <one%n> | 2:TRIEC-EXACT[ot](13)
4 <two > <one%n> | State: 1 Accepted: N Charid: 1 CP: 6f After State: 2
5 <two o> <ne%n> | State: 2 Accepted: N Charid: 2 CP: 6e After State: 3
6 <two on> <e%n> | State: 3 Accepted: N Charid: 3 CP: 65 After State: 4
7 <two one> <%n> | State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <one>
7 <two one> <%n> | 13:BOUND(14)
7 <two one> <%n> | 14:END(0)
Match successful!
EVIL: two one
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one four two three # is not a matching text (contains a blac"...
Found floating substr "#" at offset 19...
start_shift: 0 check_at: 19 s: 0 endpos: 20
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " four two three # is not a matching text (contains a blackli"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " four two three # is not a matching text (contains a blackli"... (74 bytes)
3 <one> < four two > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
8 < four> < two three> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
12 <r two> < three # i> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
18 <three> < # is not > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
19 <hree > <# is not a> | 3: EXACT <#>(5)
20 <ree #> < is not a > | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
21 <ee # > <is not a m> | 7: OPEN1(9)
21 <ee # > <is not a m> | 9: STAR(11)
REG_ANY can match 55 times out of 2147483647...
76 < word four)> <%n> | 11: CLOSE1(13)
76 < word four)> <%n> | 13: EOL(14)
76 < word four)> <%n> | 14: END(0)
Match successful!
This is not a matching text (contains a blacklist word four)
Guessing start of match in sv for REx "four|five" against "one four two three%n"
Found anchored substr "f" at offset 4...
Starting position does not contradict /^/m...
Guessed: match at offset 4
Matching REx "four|five" against "four two three%n"
4 <one > <four two t> | 1:EXACT <f>(3)
5 <one f> <our two th> | 3:TRIE-EXACT[io](7)
5 <one f> <our two th> | State: 2 Accepted: N Charid: 2 CP: 6f After State: 3
6 <ne fo> <ur two thr> | State: 3 Accepted: N Charid: 3 CP: 75 After State: 4
7 <e fou> <r two thre> | State: 4 Accepted: N Charid: 4 CP: 72 After State: 5
8 < four> < two three> | State: 5 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <our>
8 < four> < two three> | 7:END(0)
Match successful!
EVIL: one four two three
Freeing REx: "one|two|three"
Freeing REx: "\s*#\s*(.*)$"
Je ne connais aucun autre langage qui vous permette de déboguer la compilation et l'exécution de vos regex de cette manière, ce qui, à lui seul, vaut le prix d'entrée dans mon livre.