Efficient way to search a stream for a string

35,789

Solution 1

There are three good solutions here:

  1. If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterminstic finite-state machine. Your state will be a list of indices into the string you are searching, and your logic looks something like this (pseudocode):

    String needle;
    n = needle.length();
    
    for every input character c do
      add index 0 to the list
      for every index i in the list do
        if c == needle[i] then
          if i + 1 == n then
            return true
          else
            replace i in the list with i + 1
          end
        else
          remove i from the list
        end
      end
    end
    

    This will find the string if it exists and you will never need a buffer.

  2. Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.

  3. If you want something faster, create a buffer whose size is at least twice n, and user Boyer-Moore to compile a state machine from needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.

Solution 2

The Knuth-Morris-Pratt search algorithm never backs up; this is just the property you want for your stream search. I've used it before for this problem, though there may be easier ways using available Java libraries. (When this came up for me I was working in C in the 90s.)

KMP in essence is a fast way to build a string-matching DFA, like Norman Ramsey's suggestion #2.

Solution 3

This answer applied to the initial version of the question where the key was to read the stream only as far as necessary to match on a String, if that String was present. This solution would not meet the requirement to guarantee fixed memory utilisation, but may be worth considering if you have found this question and are not bound by that constraint.

If you are bound by the constant memory usage constraint, Java stores arrays of any type on the heap, and as such nulling the reference does not deallocate memory in any way; I think any solution involving arrays in a loop will consume memory on the heap and require GC.


For simple implementation, maybe Java 5's Scanner which can accept an InputStream and use a java.util.regex.Pattern to search the input for might save you worrying about the implementation details.

Here's an example of a potential implementation:

public boolean streamContainsString(Reader reader, String searchString)
            throws IOException {
      Scanner streamScanner = new Scanner(reader);
      if (streamScanner.findWithinHorizon(searchString, 0) != null) {
        return true;
      } else {
        return false;
      }
}

I'm thinking regex because it sounds like a job for a Finite State Automaton, something that starts in an initial state, changing state character by character until it either rejects the string (no match) or gets to an accept state.

I think this is probably the most efficient matching logic you could use, and how you organize the reading of the information can be divorced from the matching logic for performance tuning.

It's also how regexes work.

Solution 4

Instead of having your buffer be an array, use an abstraction that implements a circular buffer. Your index calculation will be buf[(next+i) % sizeof(buf)], and you'll have to be careful to full the buffer one-half at a time. But as long as the search string fits in half the buffer, you'll find it.

Solution 5

If you're not tied to using a Reader, then you can use Java's NIO API to efficiently load the file. For example (untested, but should be close to working):

public boolean streamContainsString(File input, String searchString) throws IOException {
    Pattern pattern = Pattern.compile(Pattern.quote(searchString));

    FileInputStream fis = new FileInputStream(input);
    FileChannel fc = fis.getChannel();

    int sz = (int) fc.size();
    MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);

    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    CharBuffer cb = decoder.decode(bb);

    Matcher matcher = pattern.matcher(cb);

    return matcher.matches();
}

This basically mmap()'s the file to search and relies on the operating system to do the right thing regarding cache and memory usage. Note however that map() is more expensive the just reading the file in to a large buffer for files less than around 10 KiB.

Share:
35,789
Alex Spurling
Author by

Alex Spurling

Updated on July 09, 2022

Comments

  • Alex Spurling
    Alex Spurling almost 2 years

    Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.

    Naively, I might try to do something like this (in Java):

    public boolean streamContainsString(Reader reader, String searchString) throws IOException {
        char[] buffer = new char[1024];
        int numCharsRead;
        while((numCharsRead = reader.read(buffer)) > 0) {
            if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
                return true;
        }
        return false;
    }
    

    Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:

    Search text: "stackoverflow"
    Stream buffer 1: "abc.........stack"
    Stream buffer 2: "overflow.......xyz"

    How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?

    Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.

  • Jeremy Huiskamp
    Jeremy Huiskamp about 15 years
    java.util.regex (and most mainstream regex engines) use a backtracking scheme that would actually be extremely inefficient for something like this. Imagine a pattern that starts with ".*" (and remember that '*' is greedy). Of course this can be done much more efficiently, but for this specific question where it looks like no metacharacters are to be supported, regular expressions are probably not the way to go.
  • Alan Moore
    Alan Moore about 15 years
    If you use a regex that starts with .* it's your own fault if it's too slow. But you can avoid any regex-related complications escaping the search string, i.e.: "\\Q"+searchString+"\\E"
  • Jeremy Huiskamp
    Jeremy Huiskamp about 15 years
    @AlanM, that's a bit dangerous in the case that the search string has a "\\E" in it. And the ".*" example was just that, an example. There are lots of ways that regexes don't fit this problem well. @safetydan, I don't see Pattern.escape() in the javadocs. Am I missing something?
  • MaxG
    MaxG about 15 years
    Those don't really apply since he has a stream of char's coming in. He doesn't have access to the entire String. So he's only going to be able to scan sequentially. And he specifically says he doesn't want to load the entire string into memory.
  • Alex
    Alex about 15 years
    some algorithms does not require loading of full text at one time. For example Boyer-Moor algorithm applies indexing to search string but not to full text
  • alphazero
    alphazero about 15 years
    Boyer Moore seems to require a complete search space given that it works backwards, starting from the last character of the pattern string. In this case, half of the problem is that the buffer may be incomplete, and yes, using a buffer that is twice the size of pattern string reduces the probability of a partial read, but the case where a portion of the pattern is read is not eliminated.
  • alphazero
    alphazero about 15 years
    I didn't pay close attention and missed the circular buffer bit. I think Norman's option 3 is the most efficient solution. +1.
  • Alex Spurling
    Alex Spurling about 15 years
    The Scanner is a nice solution but unfortunately, it doesn't meet the requirement of being memory efficient. The javadocs for findWithinHorizon state: "[with a horizon of 0] In this case it may buffer all of the input searching for the pattern." and in fact debugging the class, you can see this is exactly what it does. So close, but no cigar... :p
  • Cecil Has a Name
    Cecil Has a Name almost 15 years
    Why not use a buffered stream in-between to minimize reads from the external stream; then the penalty for reading a single character at a time is at a minimum?
  • George Hawkins
    George Hawkins almost 11 years
    This question is specifically about searching a stream so I agree with @DariusBacon that KMP is a very good match for this (left to right behavior, i.e. never backs up). And it's easy to implement which is always good :)
  • George Hawkins
    George Hawkins almost 11 years
    I feel bad down voting this answer as I like solution 1 - super simple and good enough performance wise for most people (I guess it's obvious, but maybe it's worth noting that the length of the list used will grow to at most the length of the pattern). However I'm down voting as I'm unconvinced Boyer-Moore (with its right-left behavior), even with a sliding circular buffer, makes sense for stream searching. I'm voting for KMP for this.
  • George Hawkins
    George Hawkins almost 11 years
    Also for solution 1 when you write that your "state will be a list of indices into the string you are searching" shouldn't you write "searching for", i.e. the indices are into needle not into the text being searched.