How do I create a Stream of regex matches?

Solution 1

Well, in Java 8, there is Pattern.splitAsStream, which provides a stream of the items separated by a delimiter pattern, but unfortunately there is no corresponding method for getting a stream of matches.
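To make the difference concrete, here is a minimal sketch of what Pattern.splitAsStream yields: the text between the delimiter matches, not the matches themselves. The class name SplitDemo, the helper split and the sample input are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitDemo {
    // Hypothetical helper: returns the pieces of the input that lie
    // between matches of the delimiter pattern.
    static List<String> split(String regex, String input) {
        return Pattern.compile(regex)
                      .splitAsStream(input)
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The comma (plus optional whitespace) is consumed as a delimiter.
        System.out.println(split(",\\s*", "foo, bar,baz")); // prints [foo, bar, baz]
    }
}
```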

If you are going to implement such a Stream, I recommend implementing Spliterator directly rather than implementing and wrapping an Iterator. You may be more familiar with Iterator, but implementing a simple Spliterator is straightforward:

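import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;

final class MatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;
    MatchItr(Matcher m) {
        // The region length serves only as a rough size estimate.
        super(m.regionEnd()-m.regionStart(), ORDERED|NONNULL);
        matcher=m;
    }
    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        // Advance to the next match, if any, and hand it to the consumer.
        if(!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }
}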

You may consider overriding forEachRemaining with a straightforward loop, though.
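A minimal sketch of that suggestion, assuming the same structure as MatchItr: the variant below adds a forEachRemaining override that drains the matcher in a plain loop. The class name BulkMatchItr and the helper allMatches are hypothetical and exist only to keep the example self-contained.

```java
import java.util.List;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

final class BulkMatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;

    BulkMatchItr(Matcher m) {
        // The region length is only a rough size estimate.
        super(m.regionEnd() - m.regionStart(), ORDERED | NONNULL);
        matcher = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }

    @Override
    public void forEachRemaining(Consumer<? super String> action) {
        // Bulk traversal: consume every remaining match in one loop.
        while (matcher.find()) action.accept(matcher.group());
    }

    // Hypothetical helper for illustration: collects all matches into a list.
    static List<String> allMatches(Pattern p, CharSequence input) {
        return StreamSupport.stream(new BulkMatchItr(p.matcher(input)), false)
                            .collect(Collectors.toList());
    }
}
```

Whether the override pays off in practice depends on the workload; the results are the same either way.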


If I understand your attempt correctly, the solution should look more like:

Pattern pattern = Pattern.compile(
                 "[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");

try(BufferedReader br=new BufferedReader(System.console().reader())) {

    br.lines()
      .flatMap(line -> StreamSupport.stream(new MatchItr(pattern.matcher(line)), false))
      .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
      .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}

Java 9 provides a method Stream<MatchResult> results() directly on the Matcher. But for finding matches within a stream, there’s an even more convenient method on Scanner. With that, the implementation simplifies to

try(Scanner s = new Scanner(System.console().reader())) {
    s.findAll(pattern)
     .collect(Collectors.groupingBy(MatchResult::group,TreeMap::new,Collectors.counting()))
     .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}

This answer contains a back-port of Scanner.findAll that can be used with Java 8.
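For completeness, the Matcher.results() variant mentioned above could look roughly like this when applied line by line. This is a sketch that requires Java 9 or later; the class name ResultsDemo, the helper countMatches and the sample lines are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ResultsDemo {
    // Hypothetical helper: counts each distinct match across all lines,
    // sorted by key. Matcher.results() is available from Java 9 on.
    static Map<String, Long> countMatches(Pattern pattern, Stream<String> lines) {
        return lines
            .flatMap(line -> pattern.matcher(line).results())
            .collect(Collectors.groupingBy(
                MatchResult::group, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Pattern word = Pattern.compile("\\w+");
        countMatches(word, Stream.of("foo bar", "bar baz bar"))
            .forEach((k, v) -> System.out.printf("%s\t%s%n", k, v));
    }
}
```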

Solution 2

Going off of Holger's solution, we can support arbitrary match operations (such as getting the nth group) by returning a Stream<MatchResult> that callers can map with any function they like. We can also hide the Spliterator as an implementation detail, so that callers can just work with the Stream directly. As a rule of thumb, StreamSupport should be used by library code rather than by users.

public class MatcherStream {
  private MatcherStream() {}

  public static Stream<String> find(Pattern pattern, CharSequence input) {
    return findMatches(pattern, input).map(MatchResult::group);
  }

  public static Stream<MatchResult> findMatches(
      Pattern pattern, CharSequence input) {
    Matcher matcher = pattern.matcher(input);

    Spliterator<MatchResult> spliterator = new Spliterators.AbstractSpliterator<MatchResult>(
        Long.MAX_VALUE, Spliterator.ORDERED|Spliterator.NONNULL) {
      @Override
      public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if(!matcher.find()) return false;
        action.accept(matcher.toMatchResult());
        return true;
      }};

    return StreamSupport.stream(spliterator, false);
  }
}

You can then use it like so:

MatcherStream.find(Pattern.compile("\\w+"), "foo bar baz").forEach(System.out::println);

Or for your specific task (borrowing again from Holger):

try(BufferedReader br = new BufferedReader(System.console().reader())) {
  br.lines()
    .flatMap(line -> MatcherStream.find(pattern, line))
    .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
    .forEach((k, v) -> System.out.printf("%s\t%s\n", k, v));
}
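As a sketch of the "nth group" case mentioned at the start of this solution, the example below maps each MatchResult to one of its capture groups. It carries a condensed copy of findMatches so that it compiles on its own; the class name GroupDemo, the helper groups and the key=value pattern are hypothetical.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class GroupDemo {
    // Condensed copy of MatcherStream.findMatches, repeated here only
    // to keep the example self-contained.
    static Stream<MatchResult> findMatches(Pattern pattern, CharSequence input) {
        Matcher matcher = pattern.matcher(input);
        return StreamSupport.stream(
            new Spliterators.AbstractSpliterator<MatchResult>(
                    Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
                @Override
                public boolean tryAdvance(Consumer<? super MatchResult> action) {
                    if (!matcher.find()) return false;
                    // toMatchResult() takes a snapshot that stays valid
                    // after the matcher moves on.
                    action.accept(matcher.toMatchResult());
                    return true;
                }
            }, false);
    }

    // Hypothetical helper: collects capture group n of every match.
    static List<String> groups(Pattern pattern, CharSequence input, int n) {
        return findMatches(pattern, input)
            .map(m -> m.group(n))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Pattern keyValue = Pattern.compile("(\\w+)=(\\w+)");
        System.out.println(groups(keyValue, "a=1 b=2 c=3", 2)); // prints [1, 2, 3]
    }
}
```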

Solution 3

If you want to use a Scanner with regular expressions, you can also turn the results of its findWithinHorizon method into a stream of strings. Here we use a Stream.Builder, which is convenient to fill from a conventional while loop. Note that the builder collects all matches before the stream is returned, so the result is not lazy.

Here is an example:

private Stream<String> extractRulesFrom(String text, Pattern pattern, int group) {
    Stream.Builder<String> builder = Stream.builder();
    try(Scanner scanner = new Scanner(text)) {
        while (scanner.findWithinHorizon(pattern, 0) != null) {
            builder.accept(scanner.match().group(group));
        }
    }
    return builder.build();
} 
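A possible way to call this method is sketched below. The method is repeated as a static copy so the example runs on its own; the class name ScannerDemo, the "rule" pattern and the sample text are assumptions made for illustration.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ScannerDemo {
    // Static copy of extractRulesFrom from the answer above.
    static Stream<String> extractRulesFrom(String text, Pattern pattern, int group) {
        Stream.Builder<String> builder = Stream.builder();
        try (Scanner scanner = new Scanner(text)) {
            // A horizon of 0 means the search is not bounded.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                builder.accept(scanner.match().group(group));
            }
        }
        return builder.build();
    }

    public static void main(String[] args) {
        Pattern rule = Pattern.compile("rule (\\w+)");
        List<String> names = extractRulesFrom("rule A; rule B", rule, 1)
            .collect(Collectors.toList());
        System.out.println(names); // prints [A, B]
    }
}
```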
Author by

Alfredo Diaz

Updated on July 26, 2020

Comments

  • Alfredo Diaz, almost 4 years ago

    I am trying to parse standard input and extract every string that matches with a specific pattern, count the number of occurrences of each match, and print the results alphabetically. This problem seems like a good match for the Streams API, but I can't find a concise way to create a stream of matches from a Matcher.

    I worked around this problem by implementing an iterator over the matches and wrapping it into a Stream, but the result is not very readable. How can I create a stream of regex matches without introducing additional classes?

    public class PatternCounter
    {
        static private class MatcherIterator implements Iterator<String> {
            private final Matcher matcher;
            public MatcherIterator(Matcher matcher) {
                this.matcher = matcher;
            }
            public boolean hasNext() {
                return matcher.find();
            }
            public String next() {
                return matcher.group(0);
            }
        }
    
        static public void main(String[] args) throws Throwable {
            Pattern pattern = Pattern.compile("[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");
    
            new TreeMap<String, Long>(new BufferedReader(new InputStreamReader(System.in))
                .lines().map(line -> {
                    Matcher matcher = pattern.matcher(line);
                    return StreamSupport.stream(
                            Spliterators.spliteratorUnknownSize(new MatcherIterator(matcher), Spliterator.ORDERED), false);
                }).reduce(Stream.empty(), Stream::concat).collect(groupingBy(o -> o, counting()))
            ).forEach((k, v) -> {
                System.out.printf("%s\t%s\n",k,v);
            });
        }
    }