Apache Pig - MATCHES with multiple match criteria
Solution 1
Since you're using Pig, you don't actually need an involved regular expression. You can combine the boolean operators supplied by Pig with a couple of simple regular expressions, for example:
T = load 'matches.txt' as (str:chararray);
F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
dump F;
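To check the logic outside of Pig, the same filter can be sketched in plain Java. This assumes that Pig's matches follows the full-string semantics of Java's String.matches (Pig uses Java's regular expression format); the class name BooleanFilterSketch and the method keep are made up for illustration.

```java
public class BooleanFilterSketch {
    // Mirrors the Pig filter above: (A and B) or C, each a simple full-string regex.
    // Assumption: Pig's "matches" behaves like String.matches here.
    static boolean keep(String str) {
        boolean hasFoo = str.matches(".*(Foo|Foo Bar|FooBar).*");
        boolean hasTest = str.matches(".*(test|testA|TestB).*");
        boolean hasTestZ = str.matches(".*TestZ.*");
        return (hasFoo && hasTest) || hasTestZ;
    }

    public static void main(String[] args) {
        String[] data = {
            "The quick brown Foo jumped over the lazy test",
            "the was something going on in TestZ",
            "the quick brown Foo jumped over the lazy dog"
        };
        for (String s : data) {
            System.out.println(keep(s) + " : " + s);
        }
    }
}
```

Because these patterns have no word boundaries, a line containing "testing" or "Foobar" would also pass; add \\b around the groups if that matters for your data.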
Solution 2
You can use this regex with the matches method:
^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|testB|test)\\b)).*
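- Note that "Foo" OR "Foo Bar" OR "FooBar" should be written as FooBar|Foo Bar|Foo, not Foo|Foo Bar|FooBar, to prevent matching only Foo in a string containing FooBar or Foo Bar. With the \\b anchors used here the true/false result of matches is the same either way, but putting the longest alternative first is the safer habit.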
- Also, since look-ahead is zero-width, you need to add .* at the end of the regex to let matches match the entire string.
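Both notes can be seen in a small Java sketch. The class name LookaheadNotes and the helper firstHit are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadNotes {
    // Returns the text of the first match found, or null if there is none.
    // Alternatives are tried left to right, so their order decides what is captured.
    static String firstHit(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Short alternative first: the match stops at "Foo".
        System.out.println(firstHit("Foo|Foo Bar|FooBar", "a FooBar b")); // Foo
        // Longest alternative first: the whole "FooBar" is matched.
        System.out.println(firstHit("FooBar|Foo Bar|Foo", "a FooBar b")); // FooBar

        String s = "in TestZ";
        // A look-ahead consumes nothing, so a full-string match fails without .*
        System.out.println(s.matches("^(?=.*\\bTestZ\\b)"));   // false
        System.out.println(s.matches("^(?=.*\\bTestZ\\b).*")); // true
    }
}
```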
Demo
String[] data = { "The quick brown Foo jumped over the lazy test",
"the was something going on in TestZ",
"the quick brown Foo jumped over the lazy dog" };
String regex = "^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|testB|test)\\b)).*";
for (String s : data) {
System.out.println(s.matches(regex) + " : " + s);
}
output:
true : The quick brown Foo jumped over the lazy test
true : the was something going on in TestZ
false : the quick brown Foo jumped over the lazy dog
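Since the question also asks about doing the conversion programmatically, here is a minimal sketch of one way to build such a regex from the logical expression. It is not part of the original answer: the class CriteriaRegex and the helpers terms, allOf, anyOf and whole are invented names, and the mapping (OR of words to an alternation inside one look-ahead, AND to concatenated look-aheads, OR of groups to a non-capturing alternation) simply follows the pattern shown above. The term list uses TestB as written in the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CriteriaRegex {
    // OR of literal words -> one look-ahead. Longest words go first, and each
    // word is quoted so regex metacharacters in it are treated literally.
    static String terms(String... words) {
        String alternation = Arrays.stream(words)
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return "(?=.*\\b(?:" + alternation + ")\\b)";
    }

    // AND -> zero-width look-aheads placed one after another.
    static String allOf(String... parts) {
        return String.join("", parts);
    }

    // OR of sub-expressions -> non-capturing alternation.
    static String anyOf(String... parts) {
        return "(?:" + String.join("|", parts) + ")";
    }

    // Anchor at the start and append .* so matches() can consume the whole string.
    static String whole(String expr) {
        return "^" + expr + ".*";
    }

    public static void main(String[] args) {
        // (("Foo" OR "Foo Bar" OR "FooBar") AND ("test" OR "testA" OR "TestB")) OR "TestZ"
        String regex = whole(anyOf(
                allOf(terms("Foo", "Foo Bar", "FooBar"), terms("test", "testA", "TestB")),
                terms("TestZ")));
        String[] data = {
            "The quick brown Foo jumped over the lazy test",
            "the was something going on in TestZ",
            "the quick brown Foo jumped over the lazy dog"
        };
        for (String s : data) {
            System.out.println(s.matches(regex) + " : " + s);
        }
    }
}
```

Pattern.quote wraps each word in \Q...\E, which is valid in Java; if the generated string is pasted into a Pig script as a literal, the backslashes would need to be doubled as in the regex above.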
user2495234
Updated on July 09, 2022
Comments
-
user2495234 almost 2 years
I am trying to take a logical match criteria like:
(("Foo" OR "Foo Bar" OR FooBar) AND ("test" OR "testA" OR "TestB")) OR TestZ
and apply this as a match against a file in pig using
result = filter inputfields by (text matches (some regex expression here));
The problem is I have no idea how to turn the logical expression above into a regular expression for the matches method.
I have fiddled around with various things and the closest I have come to is something like this:
((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b)
Any ideas? I also need to try to do this conversion programmatically if possible.
Some examples:
a - The quick brown Foo jumped over the lazy test (This should pass as it contains Foo and test)
b - the was something going on in TestZ (This also passes as it contains TestZ)
c - the quick brown Foo jumped over the lazy dog (This should fail as it contains Foo but not test,testA or TestB)
Thanks