How to split a file and keep the first line in each of the pieces?

Solution 1

This is robhruska's script cleaned up a bit:

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of a hard-coded one.
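A minimal sketch of the mktemp variant, in case it helps. The scratch directory and the sample file.txt contents are made up for the demo; the rest follows the script above, except that the header is read once before the loop.

```shell
#!/bin/sh
# Demo setup (hypothetical sample data in a scratch directory).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,name' 1,a 2,b 3,c 4,d 5,e 6,f > file.txt

header=$(head -n 1 file.txt)       # read the header once, not once per piece
tmp_file=$(mktemp) || exit 1       # unique temporary file instead of "tmp_file"
trap 'rm -f "$tmp_file"' EXIT      # remove it even if the script stops early

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    { printf '%s\n' "$header"; cat "$file"; } > "$tmp_file"
    # Copy back rather than mv: mktemp creates the file with restrictive
    # permissions, and this way each piece keeps its own.
    cat "$tmp_file" > "$file"
done
```

With the six sample records this leaves split_aa (header plus 4 records) and split_ab (header plus 2 records).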

Edit

Using GNU split it's possible to do this:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

Broken out for readability:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.

A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
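Here is a rough sketch of that "fixed filename in a variable directory" idea. It assumes GNU split, and the sample data is made up. I wrote the filter inline instead of as an exported function so that it does not depend on bash being the shell that split starts; the directory has to be created first, since split itself only supplies the name in $FILE.

```shell
#!/bin/sh
# Demo setup (hypothetical sample data in a scratch directory).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,name' 1,a 2,b 3,c 4,d 5,e 6,f > file.txt

# For each piece, split runs the filter with FILE set to the piece name
# (split_aa, split_ab, ...). Here that name is used as a directory and the
# content goes to a fixed filename inside it.
tail -n +2 file.txt |
    split --lines=4 \
          --filter='mkdir -p "$FILE" && { head -n 1 file.txt; cat; } > "$FILE/data.dat"' \
          - split_
```

The result is split_aa/data.dat and split_ab/data.dat, each starting with the header line.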

Solution 2

This one-liner splits the big CSV into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows).

cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'

Based on Ole Tange's answer. (Regarding Ole's answer: you can't split by line count with --pipepart.)

See the comments for some tips on installing parallel.

Solution 3

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):

tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
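A variant sketch that folds in two suggestions from the comments: numeric suffixes (-d) and --additional-suffix to give the pieces a file extension. It assumes a reasonably recent GNU coreutils; the part_ prefix and the sample data are hypothetical. Since split already hands the filter to a shell, the extra sh -c wrapper can be dropped, which also makes it easy to quote "$FILE".

```shell
#!/bin/sh
# Demo setup (hypothetical sample data: a header plus 120 records).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
{ echo 'id,value'; seq 1 120; } > FILE.in

# 50 records per piece; pieces are named part_00.csv, part_01.csv, ...
tail -n +2 FILE.in |
    split -l 50 -d --additional-suffix=.csv \
          --filter='{ head -n 1 FILE.in; cat; } > "$FILE"' \
          - part_
```

With 120 records this gives three pieces: two with 50 records and one with 20, each preceded by the header.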

Solution 4

You can use [mg]awk:

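awk 'NR==1 {
        header = $0;           # remember the header line
        next;
     }

     (NR-2) % 100 == 0 {       # first data line of a new slice
        if (out) close(out);   # close the finished slice
        count++;
        out = "x_" count;
        print header > out;
     }
     {
        print > out;
     }' file

100 is the number of data lines in each slice; the header line comes on top of that. It doesn't require temp files and, minus the comments, can be put on a single line.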
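If you would rather not edit the script each time, the slice size and output prefix can be passed in with -v. This is only a sketch: n and prefix are names I picked, and the sample input is made up. This version counts n data lines per piece (header extra) and closes each piece before starting the next, so it should not run into open-file limits on large inputs.

```shell
#!/bin/sh
# Demo setup (hypothetical sample data: a header plus 250 records).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
{ echo 'id'; seq 1 250; } > file

awk -v n=100 -v prefix=x_ '
    NR == 1 { header = $0; next }      # keep the header, do not print it yet
    (NR - 2) % n == 0 {                # first data line of a new piece
        if (out != "") close(out)      # finish the previous piece
        out = prefix (++count)         # x_1, x_2, ...
        print header > out
    }
    { print > out }
' file
```

For the 250 sample records this writes x_1 and x_2 with 100 records each and x_3 with the remaining 50, all starting with the header.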

Solution 5

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.

$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done

This assumes your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files matching xa*, the start of split's default output names. Note that xa* only matches the first 26 pieces (xaa through xaz), so widen the pattern if you expect more. Also, replace the '4' with your desired split line size.

Author: Arkady

Updated on April 30, 2021

Comments

  • Arkady, about 3 years ago

    Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

    Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.

    I am guessing some concoction of split and head will do the trick?

  • Arkady, over 14 years ago
    This will certainly work. I was just hoping for some slick one-liner like for $part in (split -l 1000 myfile); cat <(head -n1 myfile) $part > myfile.$part; done
  • SourceSeeker, over 14 years ago
    That can't work because split, of necessity, doesn't output on stdout.
  • Arkady, over 14 years ago
    split could output the names of the files to stdout, though (as long as we are discussing what split ought to do :-)
  • SourceSeeker, over 14 years ago
    You're right. That could be handy. Sorry I misread your one-liner.
  • Johnathan Elmore, about 9 years ago
    On Mac OS X 10.10.4 the original snippet worked, but the one-liner GNU split version did not.
  • SourceSeeker, about 9 years ago
    @JohnathanElmore: Note that GNU utilities are available for OS X. Using Homebrew, for example.
  • Johnathan Elmore, about 9 years ago
    stackoverflow.com/a/30005262/1014710 has instructions for installing GNU coreutils with Homebrew.
  • Bas, about 8 years ago
    I like this solution, however it's limited to only two split files.
  • DreamFlasher, about 8 years ago
    If you like it, there is the upvote feature for it ;) It can easily be adjusted to more files, but yes, it's not as flexible as split -l.
  • Pandem1c, over 7 years ago
    "one liner" ...pshh
  • KullDox, about 7 years ago
    I like the one-liner version. Just to make it more generic for bash, I did: tail -n +2 FILE.in | split -d --lines 50 - --filter='bash -c "{ head -n1 ${FILE%.*}; cat; } > $FILE"' FILE.in.x
  • Peiti Li, almost 5 years ago
    Please note that if we count the header row in each file, each smaller file will have 1000 rows in this solution.
  • Tim Richardson, almost 5 years ago
    Which is why I use 999 :)
  • Asimov4, about 4 years ago
    I had to brew install parallel on macOS. Works like a charm!
  • Henrik Høyer, almost 4 years ago
    You may want to add --additional-suffix=.txt to the split command to keep the file extension.
  • Ram RS, over 3 years ago
    This was perfect. Thank you so much!
  • Tracy Logan, over 3 years ago
    Like macOS, Ubuntu 20.04 also needs to have parallel installed for this to work. Note that Ubuntu suggests either sudo apt install moreutils # version 0.63-1, or sudo apt install parallel # version 20161222-1.1. Go with the latter suggestion. The first suggestion, moreutils, sounds extra useful, but the version of parallel included in that package errored out (parallel: invalid option -- '-'). The second suggestion worked as expected.
  • runrig, about 3 years ago
    Suggestion: change the awk script to simply 'NR > 1', as print is the default action.
  • runrig, about 3 years ago
    That said, I doubt awk is any faster (or at least significantly faster) than tail in this case.
  • runrig, about 3 years ago
    I also might put the header in a variable before the loop, and then use 'echo "$header" | ....' in the loop.
  • Arman, almost 3 years ago
    "--block 10M" - a day saver!!