Search for a string in a text file in C

Solution 1

I am assuming this is a learning exercise and you are simply looking for a place to start. Otherwise, you should not reinvent the wheel.

The code below should give you an idea of what is involved. The program takes the name of the file to search and a single string to search for in that file. You should be able to modify it to put the phrases to search for in an array of strings and check whether any of them appears in each line read.

The key function you are looking for is strstr.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#ifdef DEBUG
#define INITIAL_ALLOC 2
#else
#define INITIAL_ALLOC 512
#endif

char *
read_line(FILE *fin) {
    char *tmp;
    size_t read_chars = 0;
    size_t bufsize = INITIAL_ALLOC;
    char *line = malloc(bufsize);

    if ( !line ) {
        return NULL;
    }

    /* Each fgets call appends to what has been read so far. */
    while ( fgets(line + read_chars, (int)(bufsize - read_chars), fin) ) {
        read_chars += strlen(line + read_chars);

        if ( read_chars > 0 && line[read_chars - 1] == '\n' ) {
            line[read_chars - 1] = '\0';
            return line;
        }

        /* No newline yet: double the buffer and keep reading. */
        bufsize = 2 * bufsize;
        tmp = realloc(line, bufsize);
        if ( !tmp ) {
            free(line);
            return NULL;
        }
        line = tmp;
    }

    /* End of file: return a final line that has no trailing newline. */
    if ( read_chars > 0 ) {
        return line;
    }

    free(line);
    return NULL;
}

int
main(int argc, char *argv[]) {
    FILE *fin;
    char *line;

    if ( argc != 3 ) {
        fprintf(stderr, "Usage: %s file string\n", argv[0]);
        return EXIT_FAILURE;
    }

    fin = fopen(argv[1], "r");

    if ( !fin ) {
        perror(argv[1]);
        return EXIT_FAILURE;
    }

    while ( (line = read_line(fin)) != NULL ) {
        if ( strstr(line, argv[2]) ) {
            fprintf(stdout, "%s\n", line);
        }
        free(line);
    }

    fclose(fin);
    return EXIT_SUCCESS;
}

Sample output:

E:\Temp> searcher.exe searcher.c char
char *
    char *tmp;
    size_t read_chars = 0;
    char *line = malloc(bufsize);
    while ( fgets(line + read_chars, (int)(bufsize - read_chars), fin) ) {
        read_chars += strlen(line + read_chars);
        if ( read_chars > 0 && line[read_chars - 1] == '\n' ) {
            line[read_chars - 1] = '\0';
    if ( read_chars > 0 ) {
main(int argc, char *argv[]) {
    char *line;
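
The modification suggested above, checking each line against an array of phrases rather than a single argument, might look roughly like this. The helper name matches_any and the example phrases are illustrative assumptions, not part of the original program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any of the n phrases occurs in line,
   0 otherwise. It stops at the first hit, so a line is reported once. */
int
matches_any(const char *line, const char *const phrases[], size_t n) {
    size_t i;

    assert(line != NULL && phrases != NULL);

    for ( i = 0; i < n; i++ ) {
        if ( strstr(line, phrases[i]) ) {
            return 1;
        }
    }
    return 0;
}
```

In main, the call strstr(line, argv[2]) would then become something like matches_any(line, phrases, count), with phrases declared as, say, const char *const phrases[] = { "Nehalem", "AMD Athlon", "Pentium" }.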

Solution 2

Remember: fgetc(), getc(), and getchar() all return an int, not a char. The value is either EOF or a valid character, so the return type has to represent one more value than the char type can hold.
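
As a minimal sketch of that point, the loop below stores the result of fgetc() in an int before comparing it with EOF. The function name copy_stream is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copies in to out one character at a time and
   returns the number of characters copied. ch is an int so that EOF
   stays distinct from every valid character value. */
long
copy_stream(FILE *in, FILE *out) {
    int ch;
    long count = 0;

    assert(in != NULL && out != NULL);

    while ( (ch = fgetc(in)) != EOF ) {
        fputc(ch, out);
        count++;
    }
    return count;
}
```

With a plain char, a byte such as 0xFF can compare equal to EOF on platforms where char is signed, which ends the loop early. Where char is unsigned, the comparison never succeeds and the loop never ends.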

You're writing a surrogate for the 'fgrep' command:

fgrep -f strings.txt text_file.txt > out.txt

Instead of reading characters, you are going to need to read lines - using fgets(). (Forget that the gets() function exists!)

I indented your code and inserted a return 0; at the end for you (though C99 does an implicit 'return 0;' if you fall off the end of main()). However, C99 also demands an explicit return type for every function - and I added the 'int' to 'int main()' for you (but you can't use the C99-compliant excuse for not returning 0 at the end). Error messages should be written to standard error rather than standard output.

You'll probably need to use dynamic allocation for the list of strings. A simple-minded search will simply apply 'strstr()' searching for each of the required strings in each line of input (making sure to break the loop once you've found a match so a line is not repeated if there are multiple matches on a single line).
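
One way to build that dynamically allocated list is sketched below. The name load_strings, the 4096-byte line buffer, and the initial capacity of 8 are assumptions chosen for illustration. Empty lines are skipped because strstr() matches an empty string everywhere.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one search string per line from fp into a
   dynamically grown array and stores the number read in *count.
   Returns NULL on allocation failure or if no strings were read.
   Lines longer than the local buffer are split, which is acceptable
   for a sketch. */
char **
load_strings(FILE *fp, size_t *count) {
    char buf[4096];
    char **list = NULL;
    size_t used = 0, cap = 0;

    assert(fp != NULL && count != NULL);

    while ( fgets(buf, sizeof buf, fp) ) {
        char *copy;

        buf[strcspn(buf, "\r\n")] = '\0';   /* drop the line ending */
        if ( buf[0] == '\0' ) {
            continue;                        /* skip empty lines */
        }
        if ( used == cap ) {
            size_t newcap = cap ? 2 * cap : 8;
            char **tmp = realloc(list, newcap * sizeof *tmp);
            if ( !tmp ) {
                goto fail;
            }
            list = tmp;
            cap = newcap;
        }
        copy = malloc(strlen(buf) + 1);
        if ( !copy ) {
            goto fail;
        }
        strcpy(copy, buf);
        list[used++] = copy;
    }
    *count = used;
    return list;

fail:
    while ( used > 0 ) {
        free(list[--used]);
    }
    free(list);
    *count = 0;
    return NULL;
}
```

Each input line can then be tested against list[0] through list[count - 1] with strstr(), leaving the loop at the first match.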

A more sophisticated search would precompute which characters can be ignored so that you can search for all the strings in parallel, skipping through the text faster than the loop-in-a-loop. This might be a modification of a search algorithm such as Boyer-Moore or Knuth-Morris-Pratt (added: or Rabin-Karp which is designed for parallel searching for multiple strings).
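
To give a feel for the "precompute which characters can be skipped" idea, here is a sketch of the Horspool simplification of Boyer-Moore. It handles a single pattern only; searching for all the strings in parallel, as described above, would need further work that is not shown. The function name horspool_search is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Sketch of Boyer-Moore-Horspool: returns a pointer to the first
   occurrence of needle in text, or NULL if there is none. */
const char *
horspool_search(const char *text, const char *needle) {
    size_t shift[UCHAR_MAX + 1];
    size_t n, m, i, pos;

    assert(text != NULL && needle != NULL);

    n = strlen(text);
    m = strlen(needle);
    if ( m == 0 ) {
        return text;
    }
    if ( m > n ) {
        return NULL;
    }

    /* Precompute how far the window may jump for each byte value seen
       under its last position. Bytes that do not occur in the needle
       allow a jump of the full needle length. */
    for ( i = 0; i <= UCHAR_MAX; i++ ) {
        shift[i] = m;
    }
    for ( i = 0; i + 1 < m; i++ ) {
        shift[(unsigned char)needle[i]] = m - 1 - i;
    }

    pos = 0;
    while ( pos + m <= n ) {
        if ( memcmp(text + pos, needle, m) == 0 ) {
            return text + pos;
        }
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return NULL;
}
```

It has the same contract as strstr() on these inputs, so it could be swapped in to compare timings on real data.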

Solution 3

while IFS= read -r x; do grep -F -- "$x" text_file.txt; done < strings.txt

Solution 4

Reading in blocks is generally more efficient, because that is how the underlying file system works.

So read a block, check whether any of your words appear in the buffer, then read the next block. Just be careful to copy the last few characters of the previous buffer to the start of the new one, so that you do not miss a search word that straddles a buffer boundary.

If this simple algorithm is not enough (in your case it probably is), there are much more sophisticated algorithms for searching for several substrings simultaneously in one buffer, such as Rabin-Karp.
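
A rough sketch of the block-reading approach with the boundary carry-over follows. It only reports whether one string occurs anywhere in the file and does not print matching lines. The name stream_contains and its block parameter are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if needle occurs anywhere in fp, 0 if
   not, and -1 on allocation failure. The file is read in blocks of
   block bytes; the last strlen(needle) - 1 bytes of each buffer are
   carried over so a match that straddles two blocks is still found. */
int
stream_contains(FILE *fp, const char *needle, size_t block) {
    size_t m, keep = 0, got, have, i;
    char *buf;
    int found = 0;

    assert(fp != NULL && needle != NULL && block > 0);

    m = strlen(needle);
    if ( m == 0 ) {
        return 1;
    }
    buf = malloc(m - 1 + block);
    if ( !buf ) {
        return -1;
    }

    while ( !found && (got = fread(buf + keep, 1, block, fp)) > 0 ) {
        have = keep + got;

        /* Compare byte-wise: the data may contain '\0', so strstr is
           not safe on a raw block. */
        for ( i = 0; i + m <= have; i++ ) {
            if ( memcmp(buf + i, needle, m) == 0 ) {
                found = 1;
                break;
            }
        }

        /* Move the tail to the front, ahead of the next block. */
        keep = have < m - 1 ? have : m - 1;
        memmove(buf, buf + have - keep, keep);
    }

    free(buf);
    return found;
}
```

The carried tail is strlen(needle) - 1 bytes because a match that crosses the boundary can have at most that many bytes in the earlier block.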

Author by

CHR_1980

Updated on July 09, 2022

Comments

  • CHR_1980
    CHR_1980 almost 2 years

    The following code reads a text file one character at a time and prints it to stdout:

    #include <stdio.h>
    
    int main()
    {
        char file_to_open[] = "text_file.txt", ch;
        FILE *file_ptr;
    
        if((file_ptr = fopen(file_to_open, "r")) != NULL)
        {
            while((ch = fgetc(file_ptr)) != EOF)
            {
                putchar(ch);
            }
        }
        else
        {
            printf("Could not open %s\n", file_to_open);
            return 1;
        }
        return(0);
    }
    

    But instead of printing to stdout [putchar(ch)], I want to search the file for specific strings provided in another text file, i.e. strings.txt, and output each line with a match to out.txt.

    text_file.txt:

    1993 - 1999 Pentium
    1997 - 1999 Pentium II
    1999 - 2003 Pentium III
    1998 - 2009 Xeon
    2006 - 2009 Intel Core 2
    

    strings.txt:

    Nehalem
    AMD Athlon
    Pentium
    

    In this case the first three lines of text_file.txt would match. I have done some research on file operations in C, and it seems that I can read one character at a time with fgetc [as I do in my code], one line with fgets, and one block with fread, but there is no function to read one word, which I guess would be perfect in my situation?

  • asveikau
    asveikau over 14 years
    when you use fgetc(), I'm fairly certain stdio will read by blocks and buffer characters...
  • kriss
    kriss over 14 years
    True, but calling fgetc has its own cost, and if you want to compare input with a string (or several strings) you will have to copy it somewhere. That has a much greater cost than reading a full buffer and working with it. Reading a full line, as Jonathan proposes, is also a good alternative to reading a full buffer if you don't want to manage the gory details of reading buffers directly yourself.
  • Jonathan Leffler
    Jonathan Leffler over 14 years
    You meant fgrep -f strings.txt text_file.txt > out.txt?
  • asveikau
    asveikau over 14 years
    personally I prefer writing a function to buffer characters... using fgets alone gives you arbitrary limits on line length.
  • kriss
    kriss over 14 years
    @asveikau: I don't see the difference ? When using fgets we provide the buffer, we can set it any size we want. And if lines in strings.txt are longer than buffer we are in trouble anyway... Do you mean we should manage the buffer overflow case even when using fgets ? yes indeed and it's less obvious than with an untyped buffer.
  • Ewan Todd
    Ewan Todd over 14 years
    Yes, yes, fgrep -f strings.txt text_file.txt. I guess more exposure means more options.
  • Jonathan Leffler
    Jonathan Leffler over 14 years
    fgets() reads up to the given buffer length; if it has not encountered a newline by the time it runs out of space, it stops and returns. So, if the last character is not newline and the buffer is full, then you can find some more space (reallocate?) to put the extra characters in, call fgets() again (carefully - starting where it finished, only telling it about the extra space) and get more of the line. So yes, you can write your own reader to get data into a dynamically allocated buffer that grows - or use fgets() to do the reading while handling the buffer.
  • Jonathan Leffler
    Jonathan Leffler over 14 years
    You could also decide that if the line is longer than the POSIX line length (_POSIX2_LINE_MAX, the minimum value of which is 2048), then it doesn't matter if it is split or truncated. I tend to use 4096 as 'long line buffer'.
  • Jerry Coffin
    Jerry Coffin over 14 years
    At least unless strings.txt is a lot different from what he's showing, he can seek to the end of it, get the position, and use that as the size of the buffer -- since that contains all the strings he's searching for, it's at least as long as any one string he's searching for. His only real requirement is that the single longest string he's looking for fits into the buffer at once. Anything beyond that just doesn't matter -- a single word in the input that doesn't fit in that size of buffer can't match any of the words he cares about.
  • sean riley
    sean riley over 14 years
    thank you. writing a C program to do this is a complete waste of time.
  • CHR_1980
    CHR_1980 over 14 years
    This looks very interesting. You are correct in assuming that this is a learning exercise for me, and I can see that the source consists of elements that I have previously worked with, so I should be able to fully understand this code.
  • CHR_1980
    CHR_1980 over 14 years
    Thank you for the information about C99. I have never added int to main, because I remember reading that if no type was defined, int would be the default. Regarding messages to stderr, I remember reading about putc; with this function I was able to choose the file stream.
  • CHR_1980
    CHR_1980 over 14 years
    Anything related to learning is not a waste of time, at least in my opinion. But if this was not to learn something new, you are probably right.
  • Jonathan Leffler
    Jonathan Leffler over 14 years
    In C89 and earlier (pre-standard) versions, 'default int' was correct. That allowed you to write: 'somefunc(a,b,c,d,e) char *b; double d,e; int *c; { ... }'; as you can clearly see, the function returned an int, and the parameter a was an int. C99 does not allow 'implicit int' for function return types (and prototype functions never did allow implicit int for parameters; only K&R style function definitions could do that).
  • fIwJlxSzApHEZIl
    fIwJlxSzApHEZIl almost 12 years
    I'm fairly new to C code but I just replaced the entire read_line function call with the fgets function call and allocated char* line in the main function to an arbitrarily large number since fgets stops on the '\n' character. Can you perhaps explain the intended purpose of the read_line function? Seems like a lot of superfluous code is in there.
  • silbana
    silbana almost 12 years
    @advocate How large is large enough? I start with a reasonably sized buffer and keep expanding it as necessary. There should actually be another check for the buffer becoming too large to prevent your computer from running out of memory if someone is feeding a stream with no line endings to it, but this was a simple learning exercise.
  • zangetsKid
    zangetsKid almost 12 years
    To Sinan Unur, could you explain your code for me please? why do you specify your buffer size for the fgets() with bufsize-readchars? Also, when I play around with your code and printed the read_chars for the first time(first loop) it prints 19? shouldn't the strlen(line) yield to 0 since it hasn't been initialized? (I'm using the code by feeding the searcher.c file) Thanks Edward
  • silbana
    silbana almost 12 years
    Uninitialized memory contains garbage, so you should not have any expectations. The buffer gets filled until a complete line is read.