Split a text file into multiple smaller text files using the command line

300,971

Solution 1

I know the question was asked a long time ago, but I am surprised that nobody has given the most straightforward Unix answer:

split -l 5000 -d --additional-suffix=.txt $FileName file
  • -l 5000: split file into files of 5,000 lines each.
  • -d: numerical suffix. This will make the suffix go from 00 to 99 by default instead of aa to zz.
  • --additional-suffix: lets you specify the suffix, here the extension
  • $FileName: name of the file to be split.
  • file: prefix to add to the resulting files.

As always, check out man split for more details.
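As a small worked example of the options above, here is a minimal sketch that assumes a reasonably recent GNU split. The directory name split_demo and the 12-line sample file are made up for illustration. It uses the long form --numeric-suffixes=1 in place of -d so that numbering starts at 01, which matches the file01.txt naming asked for in the question.

```shell
# Scratch directory (hypothetical name) so existing files are left alone.
mkdir -p split_demo

# Sample input: 12 numbered lines.
seq 1 12 > split_demo/sample.txt

# 5 lines per piece; suffixes start at 01; ".txt" is appended to each name.
split -l 5 --numeric-suffixes=1 --additional-suffix=.txt split_demo/sample.txt split_demo/file

# List each piece with its line count.
wc -l split_demo/file*.txt
```

With 12 input lines and 5 lines per piece, this should leave file01.txt and file02.txt with 5 lines each and file03.txt with the remaining 2.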

On macOS, the default split is the BSD version, which may lack the -d and --additional-suffix options (they were reported missing on OS X 10.12.6). You can install the GNU version using the following command. (see this question for more GNU utils)

brew install coreutils

and then you can run the above command by replacing split with gsplit. Check out man gsplit for details.

Solution 2

Here's an example in C# (cause that's what I was searching for). I needed to split a 23 GB csv-file with around 175 million lines to be able to look at the files. I split it into files of one million rows each. This code did it in about 5 minutes on my machine:

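using System.Collections.Generic;
using System.IO;

var list = new List<string>();
var fileSuffix = 0;
using (var file = File.OpenRead(@"D:\Temp\file.csv"))
using (var reader = new StreamReader(file))
{
    while (!reader.EndOfStream)
    {
        list.Add(reader.ReadLine());
        if (list.Count >= 1000000)
        {
            // Buffer is full: write it out as the next numbered file.
            File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);
            list = new List<string>();
        }
    }
}
// Write the remaining lines, if any (avoids an empty trailing file
// when the line count is an exact multiple of one million).
if (list.Count > 0)
{
    File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);
}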

Solution 3

@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=100
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
 CALL :select
 FOR /f "tokens=1*delims==" %%b IN ('set dfile') DO IF /i "%%b"=="dfile" >>"%%c" ECHO(%%a
)
GOTO :EOF
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
SET "dfile=%sourcedir%\file%fcount:~-2%.txt"
GOTO :EOF

Here's a native Windows batch script that should accomplish the task.

I won't claim that it is fast (less than 2 minutes for each 5,000-line output file) or that it is immune to batch character sensitivities. It really depends on the characteristics of your target data.

I used a file named q25249516.txt containing 100,000 lines of data for my testing.


Revised quicker version

@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=199
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
 CALL :select
 >>"%sourcedir%\file$$.txt" ECHO(%%a
)
SET /a lcount=%llimit%
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
MOVE /y "%sourcedir%\file$$.txt" "%sourcedir%\file%fcount:~-2%.txt" >NUL 2>nul
GOTO :EOF

Note that I used an llimit of 50000 for testing. The script will overwrite the early file numbers if the file contains more than llimit*100 lines (cure this by setting fcount to 1999 and using ~-3 in place of ~-2 in the file-renaming line).

Solution 4

You can do something like this with awk:

awk '{outfile=sprintf("file%02d.txt",(NR-1)/5000+1);print > outfile}' yourfile

This calculates the name of the output file by taking the record number (NR), subtracting 1, dividing by 5000 and adding 1. The %02d format then truncates the result to an integer and zero-pads it to 2 places. Subtracting 1 first keeps line 5000 in the first file, so every file except possibly the last gets a full 5000 lines.

By default, awk prints the entire input record when you don't specify anything else. So, print > outfile writes the entire input record to the output file.

As you are running on Windows, you can't use single quotes, because the Windows command prompt does not treat them as quoting characters. I think you have to put the script in a file and then tell awk to use the file, something like this:

awk -f script.awk yourfile

and script.awk will contain the script like this:

{outfile=sprintf("file%02d.txt",(NR-1)/5000+1);print > outfile}

Or, it may work if you do this:

awk "{outfile=sprintf(\"file%02d.txt\",(NR-1)/5000+1);print > outfile}" yourfile
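For a quick check of the arithmetic on a small input, here is a hedged sketch run from a Unix-style shell. The variable names chunk, prefix and current, and the directory awk_demo, are assumptions made for illustration and are not part of the answer. It computes the piece index as int((NR-1)/chunk)+1 and closes each output file once the name changes, which may help with awk implementations that limit the number of open files.

```shell
# Scratch directory and 12-line sample input (both hypothetical).
mkdir -p awk_demo
seq 1 12 > awk_demo/sample.txt

# chunk  = lines per output file (5 here; the answer uses 5000).
# int((NR-1)/chunk)+1 is the 1-based index of the piece a line belongs to.
# When the target name changes, the previous file is closed first.
awk -v chunk=5 -v prefix=awk_demo/file '
{
    outfile = sprintf("%s%02d.txt", prefix, int((NR - 1) / chunk) + 1)
    if (outfile != current) {
        if (current != "") close(current)
        current = outfile
    }
    print > outfile
}' awk_demo/sample.txt

# List each piece with its line count.
wc -l awk_demo/file*.txt
```

With 12 lines and chunk=5, lines 1 to 5 should land in file01.txt, lines 6 to 10 in file02.txt, and lines 11 and 12 in file03.txt.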

Solution 5

Syntax looks like:

$ split [OPTION] [INPUT [PREFIX]] 

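where the output files are named PREFIXaa, PREFIXab, ...

Just pick a suitable prefix and you're done, or use mv to rename the files afterwards. Note that a plain $ mv * *.txt will not add the extension, because mv does not rename by pattern. A loop such as $ for f in x??; do mv "$f" "$f.txt"; done should work, but test it first on a smaller scale.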
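The split-then-rename approach can be sketched as follows, using only the basic split -l form with an explicit prefix. The directory rename_demo and the prefix part are made-up names for illustration, and the sample file has 12 lines so the result is easy to inspect.

```shell
# Scratch directory and 12-line sample input (both hypothetical).
mkdir -p rename_demo
seq 1 12 > rename_demo/sample.txt

# Plain split with an explicit prefix: creates partaa, partab, partac.
split -l 5 rename_demo/sample.txt rename_demo/part

# Append ".txt" to each piece. The pattern part?? matches only names with
# a two-character suffix, so sample.txt and renamed files are skipped.
for f in rename_demo/part??; do
    mv "$f" "$f.txt"
done

ls rename_demo
```

Matching on the exact prefix and suffix length, rather than on *, keeps the loop from touching the source file or anything else in the directory.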

:)

Author by

ashleybee97

Updated on July 08, 2022

Comments

  • ashleybee97
    ashleybee97 5 months

    I have multiple text files with about 100,000 lines each and I want to split them into smaller text files of 5000 lines each.

    I used:

    split -l 5000 filename.txt
    

    That creates files:

    xaa
    xab
    xac
    xad
    xae
    xaf
    

    files with no extensions. I just want to call them something like:

    file01.txt
    file02.txt
    file03.txt
    file04.txt
    

    or if that is not possible, I just want them to have the ".txt" extension.

  • shareef
    shareef over 6 years
    1 MB takes 5 minutes; too long.
  • Magoo
    Magoo over 6 years
    @shareef: The time taken should depend on the number of lines in the file, not the file size. Not sure whether you mean 1 MB or 1M lines. My test on the latest version was 1M lines and 11 MB long.
  • David Balažic
    David Balažic almost 6 years
    This makes the first file one line shorter than the others. The correct formula is (NR-1)/5000+1
  • Zachary Dow
    Zachary Dow almost 6 years
    And you can basically just throw it in LINQPad and just tweak to your heart's content. No need to compile anything. Good Solution.
  • Arya
    Arya almost 6 years
    This is good but it leaves one blank line at the end of each line. Any way to prevent that?
  • Magoo
    Magoo almost 6 years
    @arya : I do not understand "one blank line at the end of each line". The line-endings are windows-standard CRLF. There are no empty lines in the output. Perhaps you are using a utility that counts both CR and LF as new-lines?
  • bakoyaro
    bakoyaro almost 5 years
    If I could +100 I would! With the syntax you posted I was able to split a >380M file into 10M files in roughly .3 second.
  • Stefano Munarini almost 5 years
    It seems like -d and --additional-suffix are no longer supported options (OSX 10.12.6)
  • ursan
    ursan almost 5 years
    @StefanoMunarini for mac, you can install the gnu version of split with brew install coreutils, and then you would replace split with gsplit in the command above.
  • AGrush
    AGrush over 2 years
    and how would you use a delimiter instead of number of lines?
  • ursan
    ursan over 2 years
    @AGrush I'm not sure exactly what your use case is, but I think you could use the -t flag which splits on a user-specified delimiter instead of a newline. You can then use the -l flag to specify how many splits you want to group together in the output file.
  • Michał Stochmal
    Michał Stochmal 7 months
    This tool would be ideal, but it replaces non-ASCII characters with bushes :). Just saying to be aware of this problem.
  • Michał Stochmal
    Michał Stochmal 7 months
    On Windows 10 if you have WSL installed you can mount Windows directory and use this split command. Use this to get to your windows directory: cd /mnt/c/.
  • user2590805 6 months
    @Michał Stochmal: The documentation of this tool mentions: ... Split by size produces binary files... So you have to split by line numbers.