How to split a text file into multiple text files

windows linux text-editing awk parsing

32,652

Solution 1

Here's a nice, simple gawk one-liner:

$ gawk 'match($0, /^\[ (.+) \]/, k) { close(name ".txt"); name = k[1] } name != "" { print > (name ".txt") }' entry.txt

This will work for any file size, irrespective of the number of lines in each entry, as long as each entry header looks like [ blahblah blah blah ]. Notice the space just after the opening [ and just before the closing ].

EXPLANATION:

awk and gawk read an input file line by line. As each line is read, its contents are saved in the $0 variable. Here, we are telling awk to match anything within square brackets, and save its match into the array k.

So, every time that regular expression is matched, that is, for every header in your file, k[1] will have the matched region of the line. Namely, "entry1", "entry2" or "entry3" or "entryN". name=k[1] just saves the value of k[1] (the match) into a new variable name.

Finally, we print each line into a file called <current value of name>.txt, i.e. entry1.txt, entry2.txt ... entryN.txt.

This method should be much faster than perl for larger files.

I can't vouch for this as I have never used the Windows shell, but I am willing to bet it will be far faster than that as well. gawk and awk are FAST.
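Where only a plain POSIX awk is available (one without the three-argument match()), the same idea can be sketched without gawk extensions. This is a minimal sketch rather than the command above: sample_entry.txt and its contents are made-up illustration data, and it assumes each header fills the whole line in the form [ name ], with no trailing spaces or carriage returns, and that no header name occurs twice.

```shell
# Build a tiny sample input (hypothetical data, far shorter than the real file).
printf '%s\n' '[ entry1 ]' '1239 1240' '1242' '[ entry2 ]' '1391 1392' > sample_entry.txt

# POSIX awk only: no gawk-specific three-argument match().
awk '
/^\[ .+ \]$/ {                                    # header line such as "[ entry1 ]"
    if (out != "") close(out)                     # release the previous output file
    out = substr($0, 3, length($0) - 4) ".txt"    # strip the leading "[ " and trailing " ]"
}
out != "" { print > out }                         # copy every line, header included
' sample_entry.txt
```

Closing each file as soon as the next header appears keeps only one output file open at a time, which may matter for an input with roughly 200,000 entries.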

Solution 2

For a Windows solution, try this PowerShell script:

$Path = "D:\Scripts\PS\test"
$InputFile = (Join-Path $Path "log.txt")
$Reader = New-Object System.IO.StreamReader($InputFile)

While (($Line = $Reader.ReadLine()) -ne $null) {
    If ($Line -match "^\[ (.+?) \]") {
        $OutputFile = $matches[1] + ".txt"
    }

    # Skip any lines that appear before the first header
    If ($OutputFile) {
        Add-Content (Join-Path $Path $OutputFile) $Line
    }
}
$Reader.Close()

Edit the $Path and $InputFile variables accordingly. With some minor modifications it could also accept that information as command-line parameters, or you could turn it into a function.

Solution 3

Yet another awk solution:

BEGIN { 
  RS="\\[ entry[0-9]+ \\]\n"  # Record separator
  ORS=""                      # Reduce whitespace on output
}
NR == 1 { f=RT }              # Entries are off-by-one relative to matched RS
NR  > 1 {
  split(f, a, " ")            # Assuming entries do not have spaces 
  print f  > a[2] ".txt"      # a[2] now holds the bare entry name
  print   >> a[2] ".txt"
  f = RT                      # Remember next entry name
}

Solution 4

The following perl script does the job:

#!/usr/bin/perl

while (<STDIN>) {
    if ($_ =~ m/^\[ (.+?) \]/) {
        $f = $1;
        close FH if tell(FH) != -1;
        open FH, ">", "$f.txt" or die "couldn't open file $f: $!\n";
    }
    print FH $_;
}
close FH;

Run the script like this:

script.pl < entry.txt

The script works no matter how many entry sections there are or how long they are, as long as only the section headers look like [ some text ].

If you prefer unreadable code or just don't want to store a script somewhere, you can use this single command:

perl -e 'while(<STDIN>){if($_=~/^\[ (.+?) \]/){close FH if tell FH!=-1;open FH,">","$1.txt"or die"$1.txt: $!";}print FH $_;}close FH;' < entry.txt

Solution 5

Is it not simpler to use existing commands? Not everything needs a new program.

csplit /\[/ file
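A hedged sketch of how csplit might be invoked for this kind of input, assuming GNU csplit (the -z option and the '{*}' repeat count are GNU features). Because csplit only numbers its output pieces, a rename pass takes each file name from the piece's own header line. The file sample_sections.txt, the piece_ prefix, and the section names are made up for illustration.

```shell
# Build a tiny sample input (hypothetical data).
printf '%s\n' '[ alpha ]' '1 2 3' '[ beta ]' '4 5' '6' > sample_sections.txt

# Split before every line starting with "[ ":
#   -s  do not print the size of each piece
#   -z  drop the empty piece that precedes the first header
#   -f  prefix for the pieces, -b  numeric suffix format
csplit -s -z -f piece_ -b '%06d' sample_sections.txt '/^\[ /' '{*}'

# Rename each piece after the name found in its first line.
for piece in piece_*; do
    name=$(sed -n '1s/^\[ \(.*\) ]$/\1/p' "$piece")
    if [ -n "$name" ]; then
        mv "$piece" "$name.txt"
    fi
done
```

The rename loop starts one sed and one mv per piece, so for a very large number of entries a single awk or perl pass is likely to be quicker.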

Andrew

Updated on September 18, 2022

Comments

Andrew almost 2 years

I have a text file called entry.txt that contains the following:

[ entry1 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3633 3634 3636 3690 3691 3693 3766
3767 3769 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5628 5629 5631
[ entry2 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4526
4527 4529 4583 4584 4586 4773 4774 4776 5153 5154
5156 5628 5629 5631
[ entry3 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4241
4242 4244 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5495 5496 5498 5628 5629 5631

I would like to split it into three text files: entry1.txt, entry2.txt, entry3.txt. Their contents are as follows.

entry1.txt:

[ entry1 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3633 3634 3636 3690 3691 3693 3766
3767 3769 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5628 5629 5631

entry2.txt:

[ entry2 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4526
4527 4529 4583 4584 4586 4773 4774 4776 5153 5154
5156 5628 5629 5631

entry3.txt:

[ entry3 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4241
4242 4244 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5495 5496 5498 5628 5629 5631

In other words, the [ character indicates a new file should begin.

Is there any way I can split the text file automatically? My actual input entry.txt contains 200,001 entries.

Doing the text split in either Windows or Linux would be great. I do not have access to a Mac machine. Thanks!

Hamed almost 12 years

Do all the entries have 7 lines?
Andrew almost 12 years

@hamed Oops, I forgot to mention that, unfortunately, the entries do not all have 7 lines.
Behrouz.M about 8 years

Check this app: softpedia.com/get/System/File-Management/…

terdon almost 12 years

You don't need cat, you can just run script.pl test.txt.
speakr almost 12 years

This works with gawk but not with awk (at least the awk on a default Debian system). awk's match function only allows two parameters, so your example gives a syntax error with awk.
terdon almost 12 years

You're quite right, sorry. I am too used to using while(<>) which takes the input file as a first argument.
Thor almost 12 years

You're right, csplit is the right tool for the job. I had to add a repeat count and swap the arguments to make it work. The following command-line comes close to what the OP asked for: csplit -f entry -b '%d.txt' -z entry.txt '/^\[/' '{*}'.
Suncatcher over 7 years

However, csplit will only work if the record names in the file follow the entryXX pattern, because it doesn't support variable prefixes.