Reading text file of unknown size


Solution 1

The standard way to do this is to use malloc to allocate an array of some size, and start reading into it, and if you run out of array before you run out of characters (that is, if you don't reach EOF before filling up the array), pick a bigger size for the array and use realloc to make it bigger.

Here's how the read-and-allocate loop might look. I've chosen to read input a character at a time using getchar (rather than a line at a time using fgets).

int c;
int nch = 0;
int size = 10;
char *buf = malloc(size);
if(buf == NULL)
    {
    fprintf(stderr, "out of memory\n");
    exit(1);
    }

while((c = getchar()) != EOF)
    {
    if(nch >= size-1)
        {
        /* time to make it bigger */
        size += 10;
        buf = realloc(buf, size);
        if(buf == NULL)
            {
            fprintf(stderr, "out of memory\n");
            exit(1);
            }
        }

    buf[nch++] = c;
    }

buf[nch++] = '\0';

printf("\"%s\"", buf);

Two notes about this code:

  1. The numbers 10 for the initial size and the increment are much too small; in real code you'd want to use something considerably bigger.
  2. It's easy to forget to ensure that there's room for the trailing '\0'; in this code I've tried to do that with the -1 in if(nch >= size-1).
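Both notes can be folded into a reusable function. The following is only a sketch, not part of the original answer: the name read_all, the 4096-byte starting size, and the doubling growth policy are all assumptions chosen for illustration. It reads from any FILE * instead of stdin and reports failure to the caller instead of calling exit.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (name and sizes are assumptions): read everything
   from fp into a '\0'-terminated heap buffer. Returns NULL if memory
   runs out; the caller must free the result. */
char *read_all(FILE *fp, size_t *len_out)
{
    size_t size = 4096;              /* assumed starting size */
    size_t nch = 0;                  /* characters stored so far */
    char *buf = malloc(size);
    int c;

    if (buf == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF) {
        if (nch >= size - 1) {       /* keep one byte free for '\0' */
            size_t newsize = size * 2;
            char *tmp = realloc(buf, newsize);
            if (tmp == NULL) {       /* old block is still valid: free it */
                free(buf);
                return NULL;
            }
            buf = tmp;
            size = newsize;
        }
        buf[nch++] = (char)c;
    }

    buf[nch] = '\0';
    if (len_out != NULL)
        *len_out = nch;              /* length without the terminator */
    return buf;
}
```

A call such as char *text = read_all(stdin, NULL); would then replace the loop above. Assigning the result of realloc to a temporary first means the original block can still be freed if the call fails.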

Solution 2

I would be remiss if I didn't add one of the most standard ways of reading an unknown number of lines of unknown length from a text file. In C you have two primary methods of text input: (1) character-oriented input (e.g. getchar, getc) and (2) line-oriented input (e.g. fgets, getline).

From that mix of functions, the POSIX function getline by default will allocate sufficient space to read a line of any length (up to the exhaustion of system memory). Further, when reading lines of input, line-oriented input is generally the proper choice.
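To see that behavior in isolation, here is a small sketch. The function name longest_line is made up for this illustration, and the feature-test macro is only an assumption about what a strict compiler mode needs. It passes getline a NULL pointer and a zero size, lets getline allocate and grow the buffer, and frees that single buffer once at the end.

```c
#define _POSIX_C_SOURCE 200809L   /* expose getline on strict compilers */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return the length of the longest line in fp
   (newline not counted), or -1 if a read error occurred. */
long longest_line(FILE *fp)
{
    char *ln = NULL;      /* NULL with n == 0 asks getline to allocate */
    size_t n = 0;         /* capacity of ln, maintained by getline */
    ssize_t nchr;         /* characters read, including the newline */
    long longest = 0;

    while ((nchr = getline(&ln, &n, fp)) != -1) {
        if (nchr > 0 && ln[nchr - 1] == '\n')
            nchr--;                   /* do not count the newline */
        if (nchr > longest)
            longest = (long)nchr;
    }

    int failed = ferror(fp);          /* -1 means EOF or error: tell them apart */
    free(ln);                         /* one buffer, reused for every line */
    return failed ? -1 : longest;
}
```

Because getline reuses and enlarges the same buffer, each line has to be copied somewhere else if it must outlive the next call.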

To read an unknown number of lines, the general approach is to allocate an anticipated number of pointers (in an array of pointers-to-char) and then reallocate as necessary if you end up needing more. If you want to work with the complexities of stringing pointers-to-struct together in a linked-list, that's fine, but it is far simpler to handle an array of strings. (a linked-list is more appropriate when you have a struct with multiple members, rather than a single line)

The process is straightforward: (1) allocate memory for some initial number of pointers (LMAX below, set to 255), then, as each line is read, (2) allocate memory to hold the line and copy the line into it. strdup is used below; it both (a) allocates memory to hold the string and (b) copies the string to the new block, returning a pointer to it. You assign the returned pointer to your array of strings as array[x].

As with any dynamic allocation of memory, you are responsible for keeping track of the memory allocated, preserving a pointer to the start of each allocated block of memory (so you can free it later), and then freeing the memory when it is no longer needed. (Use valgrind or some similar memory checker to confirm you have no memory errors and have freed all memory you have created)

Below is an example of the approach which simply reads any text file and prints its lines back to stdout before freeing the memory allocated to hold the file. Once you have read all lines (or while you are reading all lines), you can easily parse your csv input into individual values.

Note: below, when LMAX lines have been read, the array is reallocated to hold twice as many as before and the read continues. (You can set LMAX to 1 if you want to allocate a new pointer for each line, but that is a very inefficient way to handle memory allocation) Choosing some reasonable anticipated starting value, and then reallocating 2X the current is a standard reallocation approach, but you are free to allocate additional blocks in any size you choose.

Look over the code and let me know if you have any questions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LMAX 255

int main (int argc, char **argv) {

    if (argc < 2 ) {
        fprintf (stderr, "error: insufficient input, usage: %s <filename>\n",
                 argv[0]);
        return 1;
    }

    char **array = NULL;        /* array of pointers to char        */ 
    char *ln = NULL;            /* NULL forces getline to allocate  */
    size_t n = 0;               /* buf size, 0 use getline default  */
    ssize_t nchr = 0;           /* number of chars actually read    */
    size_t idx = 0;             /* array index for number of lines  */
    size_t it = 0;              /* general iterator variable        */
    size_t lmax = LMAX;         /* current array pointer allocation */
    FILE *fp = NULL;            /* file pointer                     */

    if (!(fp = fopen (argv[1], "r"))) { /* open file for reading    */
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    /* allocate LMAX pointers and set to NULL. Each of the 255 pointers will
       point to (hold the address of) the beginning of each string read from
       the file below. This will allow access to each string with array[x].
    */
    if (!(array = calloc (LMAX, sizeof *array))) {
        fprintf (stderr, "error: memory allocation failed.\n");
        return 1;
    }

    /* prototype - ssize_t getline (char **ln, size_t *n, FILE *fp)
       above we declared: char *ln and size_t n. Why don't they match? Simple,
       we will be passing the address of each to getline, so we simply precede
       the variable with the unary '&' which adds a level of indirection,
       making char* char** and size_t size_t *. Now the arguments match the
       prototype.
    */
    while ((nchr = getline (&ln, &n, fp)) != -1)    /* read line    */
    {
        while (nchr > 0 && (ln[nchr-1] == '\n' || ln[nchr-1] == '\r'))
            ln[--nchr] = 0;     /* strip newline or carriage rtn    */

        /* allocate & copy ln to array - this will create a block of memory
           to hold each character in ln and copy the characters in ln to that
           memory address. The address will then be stored in array[idx].
           (idx++ just increases idx by 1 so it is ready for the next address) 
           There is a lot going on in that simple: array[idx++] = strdup (ln);
        */
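        array[idx++] = strdup (ln);
        if (!array[idx-1]) {    /* strdup returns NULL if out of memory */
            fprintf (stderr, "error: memory allocation failed.\n");
            return 1;
        }

        if (idx == lmax) {      /* if lmax lines reached, realloc   */
            char **tmp = realloc (array, lmax * 2 * sizeof *array);
            if (!tmp) {
                fprintf (stderr, "error: memory reallocation failed.\n");
                return 1;
            }
            array = tmp;
            lmax *= 2;
        }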
    }

    if (fp) fclose (fp);        /* close file */
    if (ln) free (ln);          /* free memory allocated to ln  */

    /* 
        process/use lines in array as needed
        (simple print all lines example below)
    */

    printf ("\nLines in file:\n\n");    /* print lines in file  */
    for (it = 0; it < idx; it++)    
        printf ("  array [%3zu]  %s\n", it, array[it]);
    printf ("\n");

    for (it = 0; it < idx; it++)        /* free array memory    */
        free (array[it]);
    free (array);

    return 0;
}

Use/Output

$ ./bin/getline_rdfile dat/damages.txt

Lines in file:

  array [  0]  Personal injury damage awards are unliquidated
  array [  1]  and are not capable of certain measurement; thus, the
  array [  2]  jury has broad discretion in assessing the amount of
  array [  3]  damages in a personal injury case. Yet, at the same
  array [  4]  time, a factual sufficiency review insures that the
  array [  5]  evidence supports the jury's award; and, although
  array [  6]  difficult, the law requires appellate courts to conduct
  array [  7]  factual sufficiency reviews on damage awards in
  array [  8]  personal injury cases. Thus, while a jury has latitude in
  array [  9]  assessing intangible damages in personal injury cases,
  array [ 10]  a jury's damage award does not escape the scrutiny of
  array [ 11]  appellate review.
  array [ 12]
  array [ 13]  Because Texas law applies no physical manifestation
  array [ 14]  rule to restrict wrongful death recoveries, a
  array [ 15]  trial court in a death case is prudent when it chooses
  array [ 16]  to submit the issues of mental anguish and loss of
  array [ 17]  society and companionship. While there is a
  array [ 18]  presumption of mental anguish for the wrongful death
  array [ 19]  beneficiary, the Texas Supreme Court has not indicated
  array [ 20]  that reviewing courts should presume that the mental
  array [ 21]  anguish is sufficient to support a large award. Testimony
  array [ 22]  that proves the beneficiary suffered severe mental
  array [ 23]  anguish or severe grief should be a significant and
  array [ 24]  sometimes determining factor in a factual sufficiency
  array [ 25]  analysis of large non-pecuniary damage awards.

Memory Check

$ valgrind ./bin/getline_rdfile dat/damages.txt
==14321== Memcheck, a memory error detector
==14321== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==14321== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==14321== Command: ./bin/getline_rdfile dat/damages.txt
==14321==

Lines in file:

  array [  0]  Personal injury damage awards are unliquidated
  <snip>
  ...
  array [ 25]  analysis of large non-pecuniary damage awards.

==14321==
==14321== HEAP SUMMARY:
==14321==     in use at exit: 0 bytes in 0 blocks
==14321==   total heap usage: 29 allocs, 29 frees, 3,997 bytes allocated
==14321==
==14321== All heap blocks were freed -- no leaks are possible
==14321==
==14321== For counts of detected and suppressed errors, rerun with: -v
==14321== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
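
The answer mentions that once the lines are stored, parsing the csv input into individual values is easy. As a rough sketch of that step (the function name parse_csv_ints and the starting capacity of 16 are assumptions, and the input is assumed to hold only comma-separated integers like the sample in the question), each stored line could be converted with strtol:

```c
#include <stdlib.h>

/* Hypothetical helper: parse a line such as "0,0,1,0,1" into a
   heap-allocated int array. Stores the number of values in *count and
   returns the array (caller frees). Returns NULL if memory runs out or
   if the line contains no number. Empty fields are silently skipped. */
int *parse_csv_ints(const char *line, size_t *count)
{
    size_t cap = 16;                  /* assumed starting capacity */
    size_t n = 0;
    int *vals = malloc(cap * sizeof *vals);

    *count = 0;
    if (vals == NULL)
        return NULL;

    const char *p = line;
    while (*p != '\0') {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p) {               /* no number here (e.g. a comma) */
            p++;
            continue;
        }
        if (n == cap) {               /* array full: double it */
            int *tmp = realloc(vals, cap * 2 * sizeof *vals);
            if (tmp == NULL) {
                free(vals);
                return NULL;
            }
            vals = tmp;
            cap *= 2;
        }
        vals[n++] = (int)v;
        p = end;                      /* continue after the number */
    }

    if (n == 0) {
        free(vals);
        return NULL;
    }
    *count = n;
    return vals;
}
```

In the program above this would be called once per line, for example as parse_csv_ints (array[it], &count) inside the print loop.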

Solution 3

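#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
   FILE* fpInputFile = NULL;
   long lSize = 0;            // Input file size (ftell returns long, -1 on error)
   size_t ulRead = 0;         // Number of bytes actually read
   unsigned char* ucBuffer;   // Buffer data

  if(argc != 2)
  {
   printf("Enter the file name \n");
   return -1;
  }
  fpInputFile = fopen(argv[1],"r"); // file open

  if(!fpInputFile){
    fprintf(stderr,"File opening failed\n");
    return -1;
  }
  fseek(fpInputFile,0,SEEK_END);
  lSize = ftell(fpInputFile); // current file position = file size
  fseek(fpInputFile,0,SEEK_SET);
  if(lSize < 0){
    fprintf(stderr,"Could not determine file size\n");
    fclose(fpInputFile);
    return -1;
  }
  ucBuffer = malloc((size_t)lSize + 1); // file contents plus a terminating '\0'
  if(!ucBuffer){
    fprintf(stderr,"Out of memory\n");
    fclose(fpInputFile);
    return -1;
  }
  ulRead = fread(ucBuffer,1,(size_t)lSize,fpInputFile); // Read file
  ucBuffer[ulRead] = '\0'; // terminate so the buffer can be used as a string
  fclose(fpInputFile); // close the file

  /* use ucBuffer here */

  free(ucBuffer);
  return 0;
 }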

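Use fseek and ftell to get the size of the text file, allocate a buffer of that size, and read the whole file with a single fread. Note that this only works on seekable files, not on pipes or terminals.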

Author: Amir

I study undergraduate Physics in the UK. Do programming in my spare time to aid with my electronics hobby but also for solving problems/challenges I see on the internet.

Updated on July 18, 2022

Comments

  • Amir
    Amir almost 2 years

    I am trying to read in a text file of unknown size into an array of characters. This is what I have so far.

    #include<stdio.h>
    #include<string.h>
    
        int main()
        {
                FILE *ptr_file;
                char buf[1000];
            char output[];
                ptr_file =fopen("CodeSV.txt","r");
                if (!ptr_file)
                    return 1;   
    
            while (fgets(buf,1000, ptr_file)!=NULL)
                strcat(output, buf);
            printf("%s",output);
    
        fclose(ptr_file);
    
        printf("%s",output);
            return 0;
    }
    

    But I do not know how to allocate a size for the output array when I am reading in a file of unknown size. Also, when I give output a fixed size, say n=1000, I get a segmentation fault. I am a very inexperienced programmer, so any guidance is appreciated :)

    The text file itself is technically a .csv file, so the contents look like the following: "0,0,0,1,0,1,0,1,1,0,1..."
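
    For reference, the fgets loop in the question can be kept if output becomes a heap buffer that grows with each chunk. The segmentation fault with a fixed size most likely comes from two things: output is never initialized to an empty string before strcat appends to it, and any file longer than the array overflows it. The sketch below is one possible fix, not the only one; the function name read_file_fgets is made up for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append each fgets chunk to a heap buffer that
   grows as needed. Returns NULL if the file cannot be opened or memory
   runs out; the caller must free the result. */
char *read_file_fgets(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return NULL;

    char buf[1000];
    size_t len = 0;                   /* characters stored so far */
    char *output = malloc(1);         /* start as an empty string */
    if (output == NULL) {
        fclose(fp);
        return NULL;
    }
    output[0] = '\0';

    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t chunk = strlen(buf);
        char *tmp = realloc(output, len + chunk + 1);
        if (tmp == NULL) {
            free(output);
            fclose(fp);
            return NULL;
        }
        output = tmp;
        memcpy(output + len, buf, chunk + 1);   /* copies the '\0' too */
        len += chunk;
    }

    fclose(fp);
    return output;
}
```

    Tracking len and copying with memcpy avoids rescanning the whole buffer on every append, which is what strcat would do.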