C: searching for a string in a file

13,068

Solution 1

If you can't fit the whole file into memory, and you have access to the GNU memmem() extension, then:

  • Read as much as you can into a buffer;
  • Search the buffer with memmem(buffer, len, mystr, strlen(mystr) + 1);
  • Discard all but the last strlen(mystr) characters of the buffer, and move those to the start;
  • Repeat until end of file reached.

If you don't have memmem, then you can implement it in plain C using memchr and memcmp, like so:

/*
 * The memmem() function finds the start of the first occurrence of the
 * substring 'needle' of length 'nlen' in the memory area 'haystack' of
 * length 'hlen'.
 *
 * The return value is a pointer to the beginning of the sub-string, or
 * NULL if the substring is not found.
 */
void *memmem(const void *haystack, size_t hlen, const void *needle, size_t nlen)
{
    int needle_first;
    const void *p = haystack;
    size_t plen = hlen;

    if (!nlen)
        return NULL;

    needle_first = *(unsigned char *)needle;

    while (plen >= nlen && (p = memchr(p, needle_first, plen - nlen + 1)))
    {
        if (!memcmp(p, needle, nlen))
            return (void *)p;

        p++;
        plen = hlen - (p - haystack);
    }

    return NULL;
}

Solution 2

Because there is no memmem or memstr to find a string in a binary array (others suggested to read it into memory and use strstr - no this doesn't work) you have to read it byte by byte with "fgetch" and write a small state machine to match it while reading.

Solution 3

See here:

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

for a Boyer-Moore implementation in C99. This is a very common string searching algorithm and runs in O(n).

Share:
13,068
igul222
Author by

igul222

Updated on June 04, 2022

Comments

  • igul222
    igul222 almost 2 years

    If I have:

    const char *mystr = "cheesecakes";
    FILE *myfile = fopen("path/to/file.exe","r");
    

    I need to write a function to determine whether myfile contains any occurrences of mystr. Could anyone help me? Thanks!

    UPDATE: So it turns out the platform I need to deploy to doesn't have memstr. Does anyone know of a free implementation I can use in my code?

  • R Samuel Klatchko
    R Samuel Klatchko about 14 years
    If you are on Linux and don't care about portability, glibc does have memmem.
  • R Samuel Klatchko
    R Samuel Klatchko about 14 years
    Good try. Unfortunately this has an edge case where it will fail when you have a search string like "112" and in the file you have "1112"
  • pm100
    pm100 about 14 years
    i dont think so, i start at the beginning of the match string when i hit a mismatch
  • igul222
    igul222 about 14 years
    Thanks; I ended up loading the whole thing into memory, then using your memmem to find the needle.
  • EvilTeach
    EvilTeach about 14 years
    Loading it byte by byte will hurt performance. Load it into memory, then you search for it in memory.
  • Matthew Flaschen
    Matthew Flaschen about 14 years
    pm100, yes you reset your needle index, but you've already lost the part of the haystack you need.
  • R Samuel Klatchko
    R Samuel Klatchko about 14 years
    you think wrong. Yes, you reset depth and start at the beginning of the match string, but you don't seek back in the data stream. Try testing it (of course, you'll actually have to make it work for the non-edge case first).
  • Zibri
    Zibri about 6 years
    this finds only the FIRST occurrence.. what about the others?
  • caf
    caf about 6 years
    @Zibri: You call it in a loop. The return value, if non-NULL, is a pointer to the found substring within the search buffer, so you can use that to work out where to start searching for the next occurrence (depending on whether you want to count occurrences that overlap with previously-found occurrences or not).
  • Zibri
    Zibri about 6 years
    @cat yep, thanks I figured that out already. by the way (off topic) do you have any idea why using fwrite on an mmapped input file (to write it to disk) fails unless you do a malloc buf, memcpy and write the file in blocks?
  • caf
    caf about 6 years
    @Zibri: That should work just fine. I guess post a new question.