How to split a text file into multiple text files
Solution 1
Here's a nice, simple gawk one-liner:
$ gawk '{if(match($0, /^\[ (.+?) \]/, k)){name=k[1]}} {print >name".txt" }' entry.txt
This will work for any file size, irrespective of the number of lines in each entry, as long as each entry header looks like [ blahblah blah blah ]. Notice the space just after the opening [ and just before the closing ].
EXPLANATION:
awk and gawk read an input file line by line. As each line is read, its contents are saved in the $0 variable. Here, we are telling gawk to match anything within square brackets at the start of the line, and to save the match into the array k.
So, every time that regular expression is matched, that is, for every header in your file, k[1] will hold the text between the brackets: "entry1", "entry2", "entry3" ... "entryN". name=k[1] just saves the value of k[1] (the match) into a new variable, name.
Finally, we print each line into a file called <whatever value name currently has>.txt, i.e. entry1.txt, entry2.txt ... entryN.txt.
This method should be much faster than Perl for larger files.
I can't vouch for this as I have never used the Windows shell, but I am willing to bet it will be far faster than that as well. gawk and awk are fast.
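For an awk without gawk's three-argument match(), the same line-by-line idea can be sketched with sub() instead. This is only an illustration under a few assumptions: header names are unique (a repeated name would overwrite its file), the variable names name and out are made up here, and the sample data is a tiny invented stand-in for entry.txt. It also closes each output file when the next header appears, so only one file is open at a time. Run it in an empty scratch directory, since it writes entry.txt there.

```shell
#!/bin/sh
# Demo data: a tiny made-up stand-in for the real entry.txt.
printf '%s\n' '[ entry1 ]' '1239 1240 1242' '1391 1392 1394' \
              '[ entry2 ]' '1486 1487 1489' > entry.txt

awk '
/^\[ .+ \]/ {
    if (out != "") close(out)   # release the previous output file
    name = $0
    sub(/^\[ /, "", name)       # drop the leading "[ "
    sub(/ \].*$/, "", name)     # drop " ]" and anything after it
    out = name ".txt"
}
out != "" { print > out }       # lines before the first header are skipped
' entry.txt
```

The header line itself is written to the new file, matching the output the question asks for.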
Solution 2
For a Windows solution, try this PowerShell script:
$Path = "D:\Scripts\PS\test"
$InputFile = (Join-Path $Path "log.txt")
$Reader = New-Object System.IO.StreamReader($InputFile)
# Read line by line; a header such as "[ entry1 ]" switches the output file
While (($Line = $Reader.ReadLine()) -ne $null) {
    If ($Line -match "\[ (.+?) \]") {
        $OutputFile = $matches[1] + ".txt"
    }
    Add-Content (Join-Path $Path $OutputFile) $Line
}
# Release the handle on the input file
$Reader.Close()
Edit the $Path and $InputFile variables accordingly. With some minor modifications it could also accept that information as command-line parameters, or you could turn it into a function.
Solution 3
Yet another awk solution (it relies on RT, which is a gawk extension):
BEGIN {
    RS="\\[ entry[0-9]+ \\]\n"  # Record separator
    ORS=""                      # Reduce whitespace on output
}
NR == 1 { f=RT }                # Entries are off-by-one relative to matched RS
NR > 1 {
    split(f, a, " ")            # Assuming entries do not have spaces
    print f > (a[2] ".txt")     # a[2] now holds the bare entry name
    print >> (a[2] ".txt")
    f = RT                      # Remember next entry name
}
Solution 4
The following perl script does the job:
#!/usr/bin/perl
while (<STDIN>) {
    if ($_ =~ m/^\[ (.+?) \]/) {
        $f = $1;
        close FH if tell(FH) != -1;
        open FH, ">", "$f.txt" or die "couldn't open file $f: $!\n";
    }
    print FH $_;
}
close FH;
Run the script like this:
script.pl < entry.txt
The script works no matter how many entry sections are included and how long the sections are, as long as only the entry section headers look like [ some text ].
If you prefer unreadable code or just don't want to store a script somewhere, you can use this single command:
perl -e 'while(<STDIN>){if($_=~/^\[ (.+?) \]/){close FH if tell FH!=-1;open FH,">","$1.txt"or die"$1.txt: $!";}print FH $_;}close FH;' < entry.txt
Solution 5
Is it not simpler to use existing commands? Not everything needs a new program.
csplit /\[/ file
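As a rough sketch of how csplit could still produce files named after each header, the pieces can be renamed afterwards from their first line. The assumptions here: GNU csplit (the -z option and the '{*}' repeat count are GNU extensions), headers of the form [ name ] on their own line, unique names, and a made-up temporary prefix piece. that does not clash with real files. The sample data is invented; run this in an empty scratch directory.

```shell
#!/bin/sh
# Demo data: a tiny made-up stand-in for the real entry.txt.
printf '%s\n' '[ alpha ]' '1239 1240 1242' \
              '[ beta ]' '1391 1392 1394' '1486 1487 1489' > entry.txt

# Cut before every line that starts with "[ ".
# -s: no byte counts, -z: drop the empty piece before the first header,
# -f: temporary file prefix (hypothetical name).
csplit -s -z -f piece. entry.txt '/^\[ /' '{*}'

# Rename each piece after the name found in its first line.
for f in piece.*; do
    name=$(sed -n '1s/^\[ \(.*\) ].*$/\1/p' "$f")
    if [ -n "$name" ]; then
        mv "$f" "$name.txt"
    fi
done
```

Pieces whose first line is not a header keep their temporary name, so nothing is silently lost.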
Andrew
Updated on September 18, 2022
Comments
-
Andrew almost 2 years
I have a text file called entry.txt that contains the following:
[ entry1 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3633 3634 3636 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631
[ entry2 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631
[ entry3 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4241 4242 4244 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5495 5496 5498 5628 5629 5631
I would like to split it into three text files: entry1.txt, entry2.txt, entry3.txt. Their contents are as follows.
entry1.txt:
[ entry1 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3633 3634 3636 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631
entry2.txt:
[ entry2 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631
entry3.txt:
[ entry3 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4241 4242 4244 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5495 5496 5498 5628 5629 5631
In other words, the [ character indicates a new file should begin.
Is there any way I can accomplish automatic text file splitting? My eventual, actual input entry.txt actually contains 200,001 entries.
Doing the text split in either Windows or Linux would be great. I do not have access to a Mac machine. Thanks!
-
Hamed almost 12 years
Do all the entries have 7 lines?
-
Andrew almost 12 years
@hamed Oops, I forgot to mention that, unfortunately, the entries do not all have 7 lines.
-
Behrouz.M about 8 years
Check this app: softpedia.com/get/System/File-Management/…
-
terdon almost 12 years
You don't need cat, you can just run script.pl test.txt.
-
speakr almost 12 years
This works with gawk but not with awk (at least the awk on a default Debian system). awk's match function only allows two parameters, so your example gives a syntax error with awk.
-
terdon almost 12 years
You're quite right, sorry. I am too used to using while(<>), which takes the input file as a first argument.
-
Thor almost 12 years
You're right, csplit is the right tool for the job. I had to add a repeat count and swap the arguments to make it work. The following command line comes close to what the OP asked for: csplit -f entry -b '%d.txt' -z entry.txt '/^\[/' '{*}'.
-
Suncatcher over 7 years
However, csplit will work only if the record names in the file follow the entryXX pattern, because it doesn't support setting variable prefixes.