C: searching for a string in a file
Solution 1
If you can't fit the whole file into memory, and you have access to the GNU memmem()
extension, then:
- Read as much as you can into a buffer;
- Search the buffer with
memmem(buffer, len, mystr, strlen(mystr) + 1)
; - Discard all but the last
strlen(mystr)
characters of the buffer, and move those to the start; - Repeat until end of file reached.
If you don't have memmem
, then you can implement it in plain C using memchr
and memcmp
, like so:
/*
* The memmem() function finds the start of the first occurrence of the
* substring 'needle' of length 'nlen' in the memory area 'haystack' of
* length 'hlen'.
*
* The return value is a pointer to the beginning of the sub-string, or
* NULL if the substring is not found.
*/
void *memmem(const void *haystack, size_t hlen, const void *needle, size_t nlen)
{
int needle_first;
const void *p = haystack;
size_t plen = hlen;
if (!nlen)
return NULL;
needle_first = *(unsigned char *)needle;
while (plen >= nlen && (p = memchr(p, needle_first, plen - nlen + 1)))
{
if (!memcmp(p, needle, nlen))
return (void *)p;
p++;
plen = hlen - (p - haystack);
}
return NULL;
}
Solution 2
Because there is no memmem or memstr to find a string in a binary array (others suggested to read it into memory and use strstr - no this doesn't work) you have to read it byte by byte with "fgetch" and write a small state machine to match it while reading.
Solution 3
See here:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
for a Boyer-Moore implementation in C99. This is a very common string searching algorithm and runs in O(n).
igul222
Updated on June 04, 2022Comments
-
igul222 almost 2 years
If I have:
const char *mystr = "cheesecakes"; FILE *myfile = fopen("path/to/file.exe","r");
I need to write a function to determine whether
myfile
contains any occurrences ofmystr
. Could anyone help me? Thanks!UPDATE: So it turns out the platform I need to deploy to doesn't have
memstr
. Does anyone know of a free implementation I can use in my code? -
R Samuel Klatchko about 14 yearsIf you are on Linux and don't care about portability, glibc does have memmem.
-
R Samuel Klatchko about 14 yearsGood try. Unfortunately this has an edge case where it will fail when you have a search string like "112" and in the file you have "1112"
-
pm100 about 14 yearsi dont think so, i start at the beginning of the match string when i hit a mismatch
-
igul222 about 14 yearsThanks; I ended up loading the whole thing into memory, then using your memmem to find the needle.
-
EvilTeach about 14 yearsLoading it byte by byte will hurt performance. Load it into memory, then you search for it in memory.
-
Matthew Flaschen about 14 yearspm100, yes you reset your needle index, but you've already lost the part of the haystack you need.
-
R Samuel Klatchko about 14 yearsyou think wrong. Yes, you reset depth and start at the beginning of the match string, but you don't seek back in the data stream. Try testing it (of course, you'll actually have to make it work for the non-edge case first).
-
Zibri about 6 yearsthis finds only the FIRST occurrence.. what about the others?
-
caf about 6 years@Zibri: You call it in a loop. The return value, if non-NULL, is a pointer to the found substring within the search buffer, so you can use that to work out where to start searching for the next occurrence (depending on whether you want to count occurrences that overlap with previously-found occurrences or not).
-
Zibri about 6 years@cat yep, thanks I figured that out already. by the way (off topic) do you have any idea why using fwrite on an mmapped input file (to write it to disk) fails unless you do a malloc buf, memcpy and write the file in blocks?
-
caf about 6 years@Zibri: That should work just fine. I guess post a new question.