splitting a CSV and keeping the header without intermediate files

An entirely different approach would be to use GNU parallel, and use its --header and --pipe options:

cat input.csv | parallel --header : --pipe -N 10 'cat > output{#}.csv'

This will get you 11 lines in each of the files (the header line plus the ten records from -N 10), except possibly in the last file, which holds the header plus whatever records remain.
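
To make the line arithmetic concrete, here is a minimal sketch that reproduces the same chunking with awk instead of GNU parallel. It is a swapped-in technique for illustration, not what the command above runs internally. The function name `split_with_header`, its arguments, and the output naming are hypothetical, and it assumes one record per line (no quoted fields containing newlines).

```shell
#!/bin/sh
# Sketch only: split_with_header, its arguments, and the output naming are
# illustrative assumptions, not part of GNU parallel.
# Usage: split_with_header INPUT.csv RECORDS_PER_CHUNK OUTPUT_PREFIX
split_with_header() {
    awk -v n="$2" -v prefix="$3" '
        NR == 1 { header = $0; next }        # remember the header line
        (NR - 2) % n == 0 {                  # first record of a new chunk
            if (out != "") close(out)        # keep only one output file open
            out = sprintf("%s%d.csv", prefix, ++chunk)
            print header > out               # every chunk starts with the header
        }
        { print > out }                      # write the record to the current chunk
    ' "$1"
}
```

Calling `split_with_header input.csv 10 output` on a file with a header and 25 records should produce output1.csv and output2.csv with 11 lines each and output3.csv with 6 lines, which matches the behaviour described above.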

Author by

Michael Finco

Updated on September 18, 2022

Comments

  • Michael Finco
    Michael Finco over 1 year

    I am trying to split a dozen 100MB+ CSV files into manageable smaller files for a curl POST.

    I have managed to do it but with a lot of temporary files and IO. It's taking an eternity.

    I am hoping someone can show me a way to do this much more effectively; preferably with little to no disk IO

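    #!/bin/bash
    
    lines=1000   # records per chunk (assumed value)
    
    # Drop each file's header and split the remaining rows into chunks.
    # $RANDOM is a bash feature, hence the bash shebang.
    for csv in $(ls *.csv); do
        tail -n +2 "$csv" | split -a 5 -l "$lines" - "$RANDOM.split."
    done
    
    # choose a file randomly to fetch the header from
    
    header=$(head -n 1 "$(ls *.csv | sort -R | tail -n 1)")
    
    mkdir -p split
    
    for x in $(/usr/bin/find . -maxdepth 1 -type f -name '*.split.*'); do
        echo "Processing $x"
        # Prepend the header line to the chunk.
        { printf '%s\n' "$header"; cat "$x"; } >> "split/$x"
        rm -f "$x"
    done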
    

    The above script may not entirely work. I basically got it working through a combination of these commands.

    I decided to make the curl POST a separate step in case an upload fails; I didn't want to lose the data if it were all posted in one go. But if, say, the data could be moved into a redo folder whenever curl reports an error, that could work too.

    #!/bin/sh
    
    # working on a progress indicator as a percentage. Never finished.
    count=$(ls -1 | wc -l)
    
    # $1 is the URL of the web service to POST to
    for file in $(/usr/bin/find . -maxdepth 1 -type f); do
        echo "Processing $file"
        curl -XPOST --data-binary "@$file" -H "Content-Type: text/cms+csv" "$1"
    done
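
The redo-folder idea mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `post_or_redo` and the `redo/` directory are hypothetical, and it relies on curl's `--fail` option so that HTTP error responses produce a non-zero exit status.

```shell
#!/bin/sh
# Sketch only: post_or_redo and the redo/ directory name are assumptions.
# Usage: post_or_redo URL FILE
post_or_redo() {
    url=$1
    file=$2
    # --fail makes curl exit non-zero on HTTP errors (status 400 and above)
    # as well as on transport failures, so both cases reach the redo step.
    if curl --fail --silent --show-error -XPOST --data-binary "@$file" \
            -H "Content-Type: text/cms+csv" "$url"; then
        return 0
    fi
    # Upload failed: keep the data by moving it aside for a later retry.
    mkdir -p redo
    mv "$file" redo/
    return 1
}
```

Inside the loop this would replace the bare curl call, for example `post_or_redo "$1" "$file"`, and a later pass could re-run the same function over the files in redo/.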
    
    • John1024
      John1024 over 9 years
      This question has been cross-posted at SO: stackoverflow.com/questions/26708081/…
    • Olivier Dulac
      Olivier Dulac over 9 years
      This really looks like an XY problem: what do you need to do? Are you sure you need to do it via multiple HTTP POSTs? If you have access to the machine, you can use scp, rsync, etc.
    • chepner
      chepner over 9 years
      $(ls *.csv) is redundant; just use for csv in *.csv.
    • Michael Finco
      Michael Finco over 9 years
      @OlivierDulac -- I actually have to POST it because it has to be processed by a webservice. It's not just a straight upload.
  • Michael Finco
    Michael Finco over 9 years
    This looks interesting. I will have to give this a try.