Splitting a file in Linux based on content

linux file bash sed awk

69,324

Solution 1

Suppose you have a file mail.txt:

$ cat mail.txt
<html>
    mail A
</html>

<html>
    mail B
</html>

<html>
    mail C
</html>

Run csplit to split the file at every <html> line:

$ csplit mail.txt '/^<html>$/' '{*}'

 - mail.txt    => input file
 - /^<html>$/  => pattern match every `<html>` line
 - {*}         => repeat the previous pattern as many times as possible

Check the output:

$ ls
mail.txt  xx00  xx01  xx02  xx03
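
In the listing above, xx00 is empty because the first <html> sits on line 1, so nothing precedes the first match. Here is a minimal sketch, assuming GNU csplit (coreutils), that drops empty pieces and gives the outputs friendlier names. The `mail-` prefix, the `%02d.txt` suffix format, and the temporary directory are illustrative choices, not part of the original answer:

```shell
#!/bin/bash
# Sketch only: assumes GNU csplit; all names below are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a sample input similar to the mail.txt shown above.
printf '<html>\n    mail %s\n</html>\n\n' A B C > mail.txt

# -s: do not print the size of each piece
# -z: do not keep empty pieces (the chunk before the first <html>)
# -f: output file prefix, -b: printf-style suffix format
csplit -s -z -f 'mail-' -b '%02d.txt' mail.txt '/^<html>$/' '{*}'

pieces=$(ls mail-*.txt | wc -l)
echo "pieces: $pieces"
```

Other csplit implementations may not accept `-z`, `-b`, or `{*}`, so check the man page first.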

If you want to do it in awk:

$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt  5.txt  9.txt  mail.txt
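
The one-liner assumes the very first line is a header, and it keeps every output file open. Here is a hedged variant that follows the `BEGIN` suggestion from the comments and also closes each file before starting the next one. The fallback name `0.txt` and the temporary directory are assumptions for illustration:

```shell
#!/bin/bash
# Sketch only: file names and the temporary directory are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a sample input similar to the mail.txt shown above.
printf '<html>\n    mail %s\n</html>\n\n' A B C > mail.txt

awk '
    BEGIN { filename = "0.txt" }   # used only if text precedes the first <html>
    /^<html>$/ {
        close(filename)            # release the previous output file
        filename = NR ".txt"       # name the new file after the header line number
    }
    { print > filename }
' mail.txt

ls
```

Closing each file matters mostly for large dumps, where some awk implementations can run out of open file descriptors.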

Solution 2

csplit is the best solution to this problem. I just thought I'd post a bash solution to show that there is no need to resort to perl for this task:

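#!/usr/bin/env bash

MAIL='mail'        # path to huge mail-file

# get line numbers of all header lines (the pattern must match only the
# first line of each mail, so match '<html>' rather than plain 'html')
mapfile -t LINES < <(grep -n '<html>' "$MAIL" | cut -d: -f1)

file=0
# step 1: every header starts a mail that ends just before the next header
for i in $(seq 0 $(( ${#LINES[@]} - 1 ))); do
    start=${LINES[i]}
    if (( i + 1 < ${#LINES[@]} )); then
        end=$(( LINES[i+1] - 1 ))
    else
        end='$'    # last mail runs to the end of the file
    fi
    echo "$start, $end"
    sed -n "${start},${end}p" "$MAIL" > "${MAIL}${file}.txt"
    file=$((file+1))
done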

Solution 3

The csplit program solves your problem elegantly:

csplit "$FILE" '/<!DOCTYPE.*/' '{*}'
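
Some csplit builds do not accept `{*}`. Here is a sketch of a fallback, based on the fixed repeat count suggested in the comments: ask for more repetitions than there are mails and pass `-k` so the pieces already written survive when the pattern stops matching. The count of 5000, the `mail-` prefix, and the sample file `dump.txt` are assumptions for illustration:

```shell
#!/bin/bash
# Sketch only: sample data and names are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three tiny "mails", each starting with a DOCTYPE header line.
printf '<!DOCTYPE html>\n<html>mail %s</html>\n' A B C > dump.txt

# csplit reports an error once the pattern no longer matches;
# -k keeps the pieces written so far, -s silences the size report.
csplit -s -k -f 'mail-' dump.txt '/^<!DOCTYPE/' '{5000}' 2>/dev/null || true

# Count non-empty pieces (the piece before the first header is empty).
nonempty=$(find . -maxdepth 1 -type f -name 'mail-*' ! -empty | wc -l)
echo "non-empty pieces: $nonempty"
```

The repeat count only has to exceed the number of mails in the dump; the error exit is expected and is ignored here.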

Solution 4

I agree with fge. With perl it would be a lot simpler. You can try something like this:

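#!/usr/bin/perl
use strict;
use warnings;

undef $/;          # slurp mode: read the whole input at once
$_ = <>;
my $n = 0;

# split before each header; the lookahead keeps the header in its chunk
for my $match (split(/(?=HEADER_FORMAT)/)) {
    open(my $out, '>', 'mail' . ++$n) or die "cannot write mail$n: $!";
    print $out $match;
    close($out);
}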

Replace HEADER_FORMAT with your header type.

Solution 5

It is doable with some perl "magic"... Many people would call this ugly but here goes.

The trick is to replace $/ with what you want and read your input, as such:

#!/usr/bin/perl -W
use strict;
my $i = 1;

$/ = <<EOF;
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head> <xmeta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
EOF

open INPUT, "<", "/path/to/inputfile" or die "cannot open input: $!";

while (my $mail = <INPUT>) {
    chomp($mail);    # strip the trailing separator ($/); safe on the last record too
    open OUTPUT, ">", "/path/to/emailfile." . $i . ".txt" or die "cannot open output: $!";
    $i++;
    print OUTPUT $mail;
    close OUTPUT;
}

Edit: fixed. I always forget that $/ is included in the input. Also, the first file will always be empty, but that is easy to handle.
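
One way to handle the empty first file is to delete zero-length outputs afterwards. This is a small sketch, assuming the output names follow the `emailfile.N.txt` pattern used in the script; the temporary directory and the stand-in files are hypothetical:

```shell
#!/bin/bash
# Sketch only: stand-in files in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-ins for the script's output: an empty first file and a real mail.
: > emailfile.1.txt
printf 'mail A\n' > emailfile.2.txt

# Remove only zero-length regular files that match the output name pattern.
find . -maxdepth 1 -type f -name 'emailfile.*.txt' -empty -delete

remaining=$(ls emailfile.*.txt)
echo "$remaining"
```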

Authored by

Greenhorn

Avid reader! Ardent programmer!

Updated on July 09, 2022

Comments

Greenhorn almost 2 years

I have an email dump of around 400 MB. I want to split it into .txt files, with one mail in each file. Every e-mail starts with the standard HTML header specifying the doctype.

This means I will have to split the file on that header. How do I go about it in Linux?

fge over 12 years

Yep, a positive lookahead would work nicely, especially since here the header does not contain any metacharacter. You could even use qr// to build the split regex.

Greenhorn over 12 years

I'm afraid that didn't work. I did the same, and ls showed only mail.txt and xx00; mail.txt was identical to xx00. Any fixes?

kev over 12 years

@Ramprakash My csplit version is 8.5. Maybe yours doesn't support {*}, which repeats the pattern. Please check the man page. I just added an awk solution; you can try it.

Daniel Gasienica about 8 years

@Greenhorn My version of csplit also didn’t support {*}, but this worked: csplit -n 6 -f 'mail-' -k mail.txt '/^<html>$/' '{5000}'

qwertzguy about 7 years

The arguments are in the wrong order, and the repetition needed to do what is intended is missing.

mwfearnley almost 5 years

To prevent an awk error if the first line doesn't match the pattern (for gawk at least), do: awk 'BEGIN {filename="0.txt"} /...'

boutta about 2 years

In the seq command, I don't know why a step-width of 2 was chosen. I changed it to 1 to make it work for me.