C: Reading a text file (with variable-length lines) line-by-line using fread()/fgets() instead of fgetc() (block I/O vs. character I/O)

Solution 1

Don't use fread; use fgets. I take it this is a homework/class-project problem, so I'm not providing a complete answer, but if you say it's not, I'll give more advice. It is definitely possible to provide 100% of the semantics of GNU-style getline, including embedded null bytes, using purely fgets, but it requires some clever thinking.

OK, update since this isn't homework:

  • memset your buffer to '\n'.
  • Use fgets.
  • Use memchr to find the first '\n'.
  • If no '\n' is found, the line is longer than your buffer. Enlarge the buffer, fill the new portion with '\n', and fgets into the new portion, repeating as necessary.
  • If the '\n' is not in the last position of the buffer and the character following it is '\0', then fgets terminated due to reaching the end of a line.
  • Otherwise, fgets terminated due to reaching EOF: the '\n' is left over from your memset, the previous character is the terminating null that fgets wrote, and the character before that is the last character of actual data read.
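
The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's own code: the name read_line, its return convention, the 128-byte starting size and the doubling growth policy are all assumptions made for the example. It also checks that the '\n' it finds is not the last byte of the buffer before looking at the byte after it.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line (newline included, embedded '\0'
 * bytes preserved) into *buf, growing it with realloc as needed.
 * Returns 1 and sets *len on success, 0 at EOF with no data, and -1 if
 * an allocation fails. */
int read_line(FILE *f, char **buf, size_t *cap, size_t *len)
{
    size_t start = 0;                 /* bytes of this line already stored */

    assert(f != NULL && buf != NULL && cap != NULL && len != NULL);
    if (*buf == NULL || *cap < 2) {
        char *p = realloc(*buf, 128); /* assumed starting size */
        if (p == NULL)
            return -1;
        *buf = p;
        *cap = 128;
    }
    for (;;) {
        char *chunk = *buf + start;
        size_t room = *cap - start;
        char *nl;

        if (room > INT_MAX)           /* fgets takes an int size */
            room = INT_MAX;
        memset(chunk, '\n', room);    /* pre-fill with the sentinel */
        if (fgets(chunk, (int)room, f) == NULL) {
            if (start == 0)
                return 0;             /* EOF before any byte of a new line */
            (*buf)[start] = '\0';     /* EOF fell exactly on a chunk boundary */
            *len = start;
            return 1;
        }
        nl = memchr(chunk, '\n', room);
        if (nl == NULL) {
            /* Chunk is full: room-1 data bytes plus the '\0' from fgets.
             * Grow, then keep reading on top of that '\0'. */
            size_t newcap = *cap * 2;
            char *p = realloc(*buf, newcap);
            if (p == NULL)
                return -1;
            start += room - 1;
            *buf = p;
            *cap = newcap;
            continue;
        }
        if (nl + 1 < chunk + room && nl[1] == '\0') {
            *len = (size_t)(nl - *buf) + 1;   /* real newline from the file */
            return 1;
        }
        /* Leftover fill byte: fgets stopped at EOF and its '\0' is nl[-1]. */
        *len = (size_t)(nl - *buf) - 1;
        return 1;
    }
}
```

A caller would start with `char *buf = NULL; size_t cap = 0, len;`, call read_line in a loop until it returns 0, and free(buf) afterwards. The buffer is reused across calls, so it only grows when a longer line is seen.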

You can eliminate the memset and use strlen in place of memchr if you don't care about supporting lines with embedded nulls (either way, the null will not terminate reading; it will just be part of your read-in line).
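
A minimal sketch of that simpler strlen-based variant, assuming the input never contains '\0' bytes. The name read_line_simple, the 128-byte starting size and the doubling policy are illustrative choices, not part of the original answer.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper; assumes lines contain no '\0' bytes.
 * Returns 1 and sets *len on success, 0 at EOF, -1 on allocation failure. */
int read_line_simple(FILE *f, char **buf, size_t *cap, size_t *len)
{
    size_t used = 0;                       /* bytes of this line so far */

    assert(f != NULL && buf != NULL && cap != NULL && len != NULL);
    if (*buf == NULL || *cap < 2) {
        char *p = realloc(*buf, 128);      /* assumed starting size */
        if (p == NULL)
            return -1;
        *buf = p;
        *cap = 128;
    }
    for (;;) {
        size_t room = *cap - used;

        if (room > INT_MAX)                /* fgets takes an int size */
            room = INT_MAX;
        if (fgets(*buf + used, (int)room, f) == NULL) {
            if (used == 0)
                return 0;                  /* nothing left to read */
            break;                         /* EOF after a partial line */
        }
        used += strlen(*buf + used);
        if (used > 0 && (*buf)[used - 1] == '\n')
            break;                         /* complete line */
        if (*cap - used < 2) {             /* buffer full: enlarge it */
            size_t newcap = *cap * 2;
            char *p = realloc(*buf, newcap);
            if (p == NULL)
                return -1;
            *buf = p;
            *cap = newcap;
        }
    }
    (*buf)[used] = '\0';
    *len = used;
    return 1;
}
```

Each fgets call appends at offset used, overwriting the previous terminating null, so the pieces of a long line end up contiguous in the buffer.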

There's also a way to do the same thing with fscanf and the "%123[^\n]" specifier (where 123 is your buffer limit), which gives you the flexibility to stop at non-newline characters (à la GNU getdelim). However, it's probably slow unless your system has a very fancy scanf implementation.
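
As a rough illustration of the fscanf approach, the sketch below reads up to 123 bytes per call into a 124-byte buffer. The function name scan_line and the complete flag are made up for this example. One detail the specifier does not handle by itself is an empty line: "%[^\n]" matches nothing there and fscanf returns 0, so the sketch treats that case explicitly.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper. buf must hold at least 124 bytes (123 + '\0').
 * Returns 1 if a (possibly empty or partial) line was stored, 0 at EOF.
 * *complete is 1 when the whole line was consumed, 0 when the line was
 * longer than 123 bytes and the rest is still waiting in the stream. */
int scan_line(FILE *f, char *buf, int *complete)
{
    int n, c;

    assert(f != NULL && buf != NULL && complete != NULL);
    n = fscanf(f, "%123[^\n]", buf);
    if (n == EOF)
        return 0;                 /* no input left */
    if (n == 0)
        buf[0] = '\0';            /* empty line: the scanset matched nothing */
    c = getc(f);                  /* consume the delimiter, if present */
    *complete = (c == '\n' || c == EOF);
    if (!*complete)
        ungetc(c, f);             /* line continues: put the byte back */
    return 1;
}
```

To stop at a different delimiter, both '\n' occurrences (in the scanset and in the comparison) would be changed to that character.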

Solution 2

There isn't a big performance difference between fgets and fgetc/setvbuf. Try:

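#include <stdio.h>

int main(void)
{
  int c;
  FILE *f = fopen("blah.txt","r");
  if( f==NULL ) return 1; /* could not open the file */
  /* setvbuf must precede any other I/O on f; tune the size (last argument) for your OS */
  setvbuf(f,NULL,_IOLBF,4096);
  while( (c=fgetc(f))!=EOF )
  {
    if( c=='\n' )
      ; /* ... end of a line ... */
    else
      ; /* ... ordinary character ... */
  }
  fclose(f);
  return 0;
}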
Author: Julienne Goldberg

Updated on June 09, 2022

Comments

  • Julienne Goldberg (almost 2 years ago)

    Is there a getline function that uses fread (block I/O) instead of fgetc (character I/O)?

    There's a performance penalty to reading a file character by character via fgetc. We think that to improve performance, we can use block reads via fread in the inner loop of getline. However, this introduces the potentially undesirable effect of reading past the end of a line. At the least, this would require the implementation of getline to keep track of the "unread" part of the file, which requires an abstraction beyond the ANSI C FILE semantics. This isn't something we want to implement ourselves!

    We've profiled our application, and the slow performance is isolated to the fact that we are consuming large files character by character via fgetc. The rest of the overhead actually has a trivial cost by comparison. We're always sequentially reading every line of the file, from start to finish, and we can lock the entire file for the duration of the read. This probably makes an fread-based getline easier to implement.

    So, does a getline function that uses fread (block I/O) instead of fgetc (character I/O) exist? We're pretty sure it does, but if not, how should we implement it?

    Update: Found a useful article, Handling User Input in C, by Paul Hsieh. It's an fgetc-based approach, but it has an interesting discussion of the alternatives (starting with how bad gets is, then discussing fgets):

    On the other hand the common retort from C programmers (even those considered experienced) is to say that fgets() should be used as an alternative. Of course, by itself, fgets() doesn't really handle user input per se. Besides having a bizarre string termination condition (upon encountering \n or EOF, but not \0) the mechanism chosen for termination when the buffer has reached capacity is to simply abruptly halt the fgets() operation and \0 terminate it. So if user input exceeds the length of the preallocated buffer, fgets() returns a partial result. To deal with this programmers have a couple choices; 1) simply deal with truncated user input (there is no way to feed back to the user that the input has been truncated, while they are providing input) 2) Simulate a growable character array and fill it in with successive calls to fgets(). The first solution, is almost always a very poor solution for variable length user input because the buffer will inevitably be too large most of the time because its trying to capture too many ordinary cases, and too small for unusual cases. The second solution is fine except that it can be complicated to implement correctly. Neither deals with fgets' odd behavior with respect to '\0'.

    Exercise left to the reader: In order to determine how many bytes was really read by a call to fgets(), one might try by scanning, just as it does, for a '\n' and skip over any '\0' while not exceeding the size passed to fgets(). Explain why this is insufficient for the very last line of a stream. What weakness of ftell() prevents it from addressing this problem completely?

    Exercise left to the reader: Solve the problem determining the length of the data consumed by fgets() by overwriting the entire buffer with a non-zero value between each call to fgets().

    So with fgets() we are left with the choice of writing a lot of code and living with a line termination condition which is inconsistent with the rest of the C library, or having an arbitrary cut-off. If this is not good enough, then what are we left with? scanf() mixes parsing with reading in a way that cannot be separated, and fread() will read past the end of the string. In short, the C library leaves us with nothing. We are forced to roll our own based on top of fgetc() directly. So lets give it a shot.

    So, does a getline function that's based on fgets (and doesn't truncate the input) exist?

    • R.. GitHub STOP HELPING ICE (over 13 years ago)
      To your new question at the end, yes, it exists. I outlined it in my answer. The article you've cited mentions a problem with a final non-newline-terminated line; I've made this a non-issue by pre-filling the buffer with '\n' and providing a way to detect the condition.
    • R.. GitHub STOP HELPING ICE (over 13 years ago)
      Also note that Paul Hsieh's solution to use fgetc is very bad. On modern implementations, due to the requirement to support locking in case multiple threads access the same FILE object, using fgetc will be very slow. You can use getc_unlocked (but this is a POSIX function, not a standard C function), but even with an optimal macro expansion of getc_unlocked, the way fgets searches the buffer for '\n' (i.e. using memchr) will be many times faster than anything you can do without access to the internal buffer. Also note that if you have POSIX (2008), you have getline already.
  • Julienne Goldberg (over 13 years ago)
    This isn't homework... :) How would you suggest using fgets? Using a grow-able character array and filling it in with successive calls to fgets seems complicated to implement correctly. Also, I understand that fgets terminates upon encountering '\n' or EOF, but not '\0'. This isn't an issue for our files, though.
  • chux - Reinstate Monica (about 10 years ago)
    @R.. A minor hole: using char s[5]; memset(s, '\n', sizeof s); fgets(s, sizeof s, ...); on a file containing the 3 bytes "xyz" leaves "xyz\0\n" in s. Finding the first '\n' is OK, but checking the following character is UB. Suggest adding "If '\n' is in the last place, then fgets terminated due to reaching the last line in the file." and then going on to "If the character following ..."
  • supercat (about 9 years ago)
    I wonder why so many string-related functions have comparatively-useless return values? Code which calls strcat and fgets will often need to find the last character written--something the code for those functions will already have known. I can't think of any usefulness for the return value of those functions as implemented.
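
For readers on a POSIX.1-2008 system, the getline mentioned in R..'s comment above already handles buffer growth and returns the number of bytes read. A small usage sketch follows; the function longest_line is a made-up example and assumes a POSIX environment.

```c
#define _POSIX_C_SOURCE 200809L   /* request the POSIX.1-2008 getline */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical example: returns the length in bytes (newline included)
 * of the longest line in f, or 0 if f is empty. */
long longest_line(FILE *f)
{
    char *line = NULL;            /* getline allocates and grows this */
    size_t cap = 0;
    ssize_t n;
    long longest = 0;

    assert(f != NULL);
    while ((n = getline(&line, &cap, f)) != -1) {
        if (n > longest)
            longest = n;
    }
    free(line);
    return longest;
}
```

Because getline returns the byte count, the caller does not need a separate strlen or memchr pass to find the end of the data.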