How do I create a Stream of regex matches?
Solution 1
In Java 8, there is Pattern.splitAsStream, which provides a stream of items split by a delimiter pattern, but unfortunately there is no corresponding method for getting a stream of matches.
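To make that limitation concrete, here is a small sketch of what Pattern.splitAsStream returns. The class and method names (SplitAsStreamDemo, splitOn) are made up for this illustration and are not part of any library.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SplitAsStreamDemo {
    // Hypothetical helper: splits the input around every match of the delimiter regex.
    static List<String> splitOn(String regex, String input) {
        return Pattern.compile(regex)
                .splitAsStream(input)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The pattern describes the separators, so the digits themselves are
        // discarded and only the text between them is returned.
        System.out.println(splitOn("\\d+", "a1b22c")); // prints [a, b, c]
    }
}
```

The stream contains the pieces between the matches, which is the opposite of what the question needs.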
If you are going to implement such a Stream, I recommend implementing Spliterator directly rather than implementing and wrapping an Iterator. You may be more familiar with Iterator, but implementing a simple Spliterator is straightforward:
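import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;

final class MatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;

    MatchItr(Matcher m) {
        // The region length is only an upper bound for the number of matches.
        super(m.regionEnd()-m.regionStart(), ORDERED|NONNULL);
        matcher=m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        // Each call consumes exactly one match, if there is one left.
        if(!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }
}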
You may consider overriding forEachRemaining with a straightforward loop, though.
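As a sketch of that suggestion, the variant below adds a forEachRemaining override to the spliterator. BulkMatchItr and its matches helper are hypothetical names chosen for this example; everything else mirrors the MatchItr class from this answer.

```java
import java.util.List;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical variant of MatchItr with a bulk traversal method.
final class BulkMatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;

    BulkMatchItr(Matcher m) {
        super(m.regionEnd() - m.regionStart(), ORDERED | NONNULL);
        matcher = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }

    @Override
    public void forEachRemaining(Consumer<? super String> action) {
        // Plain loop over all remaining matches, without going through
        // tryAdvance for every element.
        while (matcher.find()) action.accept(matcher.group());
    }

    // Hypothetical convenience method used to try the spliterator out.
    static List<String> matches(Pattern pattern, CharSequence input) {
        return StreamSupport.stream(new BulkMatchItr(pattern.matcher(input)), false)
                .collect(Collectors.toList());
    }
}
```

Whether the override pays off depends on the pipeline; the default implementation simply calls tryAdvance repeatedly, so the results are the same either way.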
If I understand your attempt correctly, the solution should look more like:
Pattern pattern = Pattern.compile(
    "[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");
try(BufferedReader br=new BufferedReader(System.console().reader())) {
    br.lines()
      .flatMap(line -> StreamSupport.stream(new MatchItr(pattern.matcher(line)), false))
      .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
      .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}
Java 9 provides a method Stream<MatchResult> results() directly on the Matcher. But for finding matches within a stream, there's an even more convenient method on Scanner. With that, the implementation simplifies to:
try(Scanner s = new Scanner(System.console().reader())) {
    s.findAll(pattern)
     .collect(Collectors.groupingBy(MatchResult::group,TreeMap::new,Collectors.counting()))
     .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}
This answer contains a back-port of Scanner.findAll that can be used with Java 8.
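For input that is already in memory, the Matcher.results() method mentioned above can be used without a Scanner. The following is a minimal sketch that assumes Java 9 or later; ResultsDemo and countMatches are hypothetical names for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class ResultsDemo {
    // Hypothetical helper: counts how often each matched string occurs,
    // with the keys sorted alphabetically by the TreeMap.
    static Map<String, Long> countMatches(Pattern pattern, CharSequence input) {
        return pattern.matcher(input)
                .results() // Stream<MatchResult>, available since Java 9
                .collect(Collectors.groupingBy(
                        MatchResult::group, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countMatches(Pattern.compile("\\w+"), "b a b c b"));
        // prints {a=1, b=3, c=1}
    }
}
```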
Solution 2
Building on Holger's solution, we can support arbitrary operations on each match (such as getting the nth group) by returning a Stream<MatchResult> that callers can map with whatever function they need. We can also hide the Spliterator as an implementation detail, so that callers can work with the Stream directly. As a rule of thumb, StreamSupport should be used by library code rather than by users.
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class MatcherStream {
    private MatcherStream() {}

    public static Stream<String> find(Pattern pattern, CharSequence input) {
        return findMatches(pattern, input).map(MatchResult::group);
    }

    public static Stream<MatchResult> findMatches(
            Pattern pattern, CharSequence input) {
        Matcher matcher = pattern.matcher(input);

        Spliterator<MatchResult> spliterator = new Spliterators.AbstractSpliterator<MatchResult>(
                Long.MAX_VALUE, Spliterator.ORDERED|Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super MatchResult> action) {
                if(!matcher.find()) return false;
                // Snapshot the match so it stays valid after the next find().
                action.accept(matcher.toMatchResult());
                return true;
            }
        };

        return StreamSupport.stream(spliterator, false);
    }
}
You can then use it like so:
MatcherStream.find(Pattern.compile("\\w+"), "foo bar baz").forEach(System.out::println);
Or for your specific task (borrowing again from Holger):
try(BufferedReader br = new BufferedReader(System.console().reader())) {
    br.lines()
      .flatMap(line -> MatcherStream.find(pattern, line))
      .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
      .forEach((k, v) -> System.out.printf("%s\t%s\n", k, v));
}
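The "nth group" case mentioned at the start of this solution can be sketched as follows. To keep the example compilable on its own, it repeats a compact copy of findMatches; GroupExtractDemo and groups are hypothetical names introduced only for this illustration.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class GroupExtractDemo {
    // Compact copy of MatcherStream.findMatches so this block stands alone.
    static Stream<MatchResult> findMatches(Pattern pattern, CharSequence input) {
        Matcher matcher = pattern.matcher(input);
        return StreamSupport.stream(
                new Spliterators.AbstractSpliterator<MatchResult>(
                        Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
                    @Override
                    public boolean tryAdvance(Consumer<? super MatchResult> action) {
                        if (!matcher.find()) return false;
                        action.accept(matcher.toMatchResult());
                        return true;
                    }
                }, false);
    }

    // Hypothetical helper: returns the given capturing group of every match.
    static List<String> groups(Pattern pattern, CharSequence input, int groupIndex) {
        return findMatches(pattern, input)
                .map(result -> result.group(groupIndex))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Pattern keyValue = Pattern.compile("(\\w+)=(\\d+)");
        System.out.println(groups(keyValue, "a=1, b=22", 2)); // prints [1, 22]
    }
}
```

Because the stream carries MatchResult objects, the caller decides what to extract (a group, the start offset, and so on) with an ordinary map step.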
Solution 3
If you want to use a Scanner together with regular expressions, you can also turn the matches of a regular expression into a stream of strings with the findWithinHorizon method.
Here we use a Stream.Builder, which is convenient to fill from a conventional while loop.
Here is an example:
private Stream<String> extractRulesFrom(String text, Pattern pattern, int group) {
    Stream.Builder<String> builder = Stream.builder();
    try(Scanner scanner = new Scanner(text)) {
        while (scanner.findWithinHorizon(pattern, 0) != null) {
            builder.accept(scanner.match().group(group));
        }
    }
    return builder.build();
}
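A usage sketch for the method above, rewritten as a static method so that it compiles on its own. ScannerStreamDemo, extract, and the sample text are assumptions made for this illustration. Note that the builder collects every match before the stream is returned, so this approach is eager rather than lazy.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ScannerStreamDemo {
    // Static copy of extractRulesFrom from this solution.
    static Stream<String> extract(String text, Pattern pattern, int group) {
        Stream.Builder<String> builder = Stream.builder();
        try (Scanner scanner = new Scanner(text)) {
            // A horizon of 0 means the search is not limited to a fixed
            // number of characters.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                builder.accept(scanner.match().group(group));
            }
        }
        return builder.build();
    }

    public static void main(String[] args) {
        List<String> names = extract(
                "rule: alpha; rule: beta", Pattern.compile("rule: (\\w+)"), 1)
                .collect(Collectors.toList());
        System.out.println(names); // prints [alpha, beta]
    }
}
```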
Alfredo Diaz
Updated on July 26, 2020
I am trying to parse standard input, extract every string that matches a specific pattern, count the number of occurrences of each match, and print the results alphabetically. This problem seems like a good fit for the Streams API, but I can't find a concise way to create a stream of matches from a Matcher.
I worked around this problem by implementing an iterator over the matches and wrapping it into a Stream, but the result is not very readable. How can I create a stream of regex matches without introducing additional classes?
public class PatternCounter {
    static private class MatcherIterator implements Iterator<String> {
        private final Matcher matcher;
        public MatcherIterator(Matcher matcher) {
            this.matcher = matcher;
        }
        public boolean hasNext() {
            return matcher.find();
        }
        public String next() {
            return matcher.group(0);
        }
    }

    static public void main(String[] args) throws Throwable {
        Pattern pattern = Pattern.compile("[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");
        new TreeMap<String, Long>(new BufferedReader(new InputStreamReader(System.in))
            .lines().map(line -> {
                Matcher matcher = pattern.matcher(line);
                return StreamSupport.stream(
                    Spliterators.spliteratorUnknownSize(new MatcherIterator(matcher), Spliterator.ORDERED), false);
            }).reduce(Stream.empty(), Stream::concat).collect(groupingBy(o -> o, counting()))
        ).forEach((k, v) -> {
            System.out.printf("%s\t%s\n",k,v);
        });
    }
}
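A side note on the iterator in the question: its hasNext() calls matcher.find(), so every call advances the matcher. That works as long as the consumer calls hasNext() exactly once before each next(), but the Iterator contract does not promise that, and a second hasNext() call would silently skip a match. A minimal sketch of a variant that caches the pending match (CachingMatchIterator is a hypothetical name, not part of the original post):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;

final class CachingMatchIterator implements Iterator<String> {
    private final Matcher matcher;
    private String pending;    // match found by hasNext() but not yet returned
    private boolean exhausted; // set once find() has reported no more matches

    CachingMatchIterator(Matcher matcher) {
        this.matcher = matcher;
    }

    @Override
    public boolean hasNext() {
        // Only advance the matcher when no match is waiting to be returned.
        if (pending == null && !exhausted) {
            if (matcher.find()) {
                pending = matcher.group();
            } else {
                exhausted = true;
            }
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String result = pending;
        pending = null;
        return result;
    }
}
```

With this version, repeated hasNext() calls return the same answer until next() is called.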