How to append Line to previous Line?

text-processing sed awk


Solution 1

A version in perl, using negative lookaheads:

$ perl -0pe 's/\n(?!([0-9]{8}|$))//g' test.txt
20141101 server contain dump
20141101 server contain nothing    {uekdmsam ikdas jwdjamc ksadkek} ssfjddkc * kdlsdlsddsfd jfkdfk
20141101 server contain dump

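-0 sets the input record separator to NUL, so a file with no NUL bytes is read as a single record and the regex can match across the entire file (-0777 is the explicit way to slurp a whole file). \n(?!([0-9]{8}|$)) is a newline followed by a negative lookahead: it matches a newline that is not followed by 8 digits or by the end of the record (which, with the whole file read at once, is the end of the file).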

Solution 2

This may be a little easier with sed:

sed -e ':1 ; N ; $!b1' -e 's/\n\+\( *[^0-9]\)/\1/g'

The first part, :1;N;$!b1, collects all lines of the file into one long line, with the original lines separated by \n.
The second part strips a newline if it is followed by a non-digit character, possibly with spaces in between.
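
As a hedged sketch, the same idea can be spelled in a way that should be more portable across sed implementations: the label gets its own -e, and \{1,\} replaces the GNU-only \+. The helper name join_sed and the sample input are assumptions for illustration:

```shell
# join_sed is a hypothetical helper name used only for this sketch.
join_sed() {
    # :1 / N / $!b1 -> loop until the whole input is in the pattern space
    # s/.../\1/g    -> drop newlines followed by optional spaces and a non-digit
    sed -e :1 -e 'N;$!b1' -e 's/\n\{1,\}\( *[^0-9]\)/\1/g' "$@"
}

# Made-up sample input, modelled on the log in the question.
printf '%s\n' \
    '20141101 server contain dump' \
    '20141101 server contain nothing' \
    '    {uekdmsam ikdas' \
    'jwdjamc ksadkek}' \
    '20141101 server contain dump' | join_sed
```

Note that this only checks the first non-space character of a line, so a continuation line that happens to start with a digit is not joined.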

To avoid memory limitations (especially for big files), you can use:

sed -e '1{h;d}' -e '1!{/^[0-9]/!{H;d};/^[0-9]/x;$G}' -e 's/\n\+\( *[^0-9]\)/\1/g'

Or forget the difficult sed scripts and remember that the year starts with 2:

tr '\n2' ' \n' | sed -e '1!s/^/2/' -e 1{/^$/d} -e $a

Solution 3

One way would be:

 $ perl -lne 's/^/\n/ if $.>1 && /^\d+/; printf "%s",$_' file
 20141101 server contain dump
 20141101 server contain nothing    {uekdmsam ikdas jwdjamc ksadkek} ssfjddkc * kdlsdlsddsfd jfkdfk 
 20141101 server contain dump

However, that also removes the final newline. To add it again, use:

$ { perl -lne 's/^/\n/ if $.>1 && /^\d+/; printf "%s",$_' file; echo; } > new

Explanation

The -l will remove trailing newlines (and also add one to each print call, which is why I use printf instead). Then, if the current line starts with numbers (/^\d+/) and the current line number is greater than one ($.>1; this is needed to avoid adding an extra empty line at the beginning), add a \n to the beginning of the line. The printf prints each line.

Alternatively, you can change all \n characters to \0, then change those \0 that are right before a string of numbers to \n again:

$ tr '\n' '\0' < file | perl -pe 's/\0\d+ |$/\n$&/g' | tr -d '\0'
20141101 server contain dump
20141101 server contain nothing    {uekdmsam ikdas jwdjamc ksadkek} ssfjddkc * kdlsdlsddsfd jfkdfk 
20141101 server contain dump

To make it match only strings of 8 digits, use this instead:

$ tr '\n' '\0' < file | perl -pe 's/\0\d{8} |$/\n$&/g' | tr -d '\0'

Solution 4

Try doing this using awk:

#!/usr/bin/awk -f

{
    # if the current line begins with 8 digits followed by
    # 'nothing' OR the current line doesn't start with 8 digits
    if (/^[0-9]{8}.*nothing/ || !/^[0-9]{8}/) {
        # print current line without newline
        printf "%s", $0
        # feeding a 'state' variable
        weird=1
    }
    else {
        # if last line was treated in the 'if' statement
        if (weird==1) {
            printf "\n%s", $0
            weird=0
        }
        else {
            print # print the current line
        }
    }
}
END{
    print "" # add a newline when there's no more line to treat
}

To use it:

chmod +x script.awk
./script.awk file.txt

Solution 5

Another, simpler way (than my other answer) using awk and terdon's algorithm:

awk 'NR>1 && /^[0-9]{8}/{printf "%s","\n"$0;next}{printf "%s",$0}END{print ""}' file
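
A hedged variant of the same algorithm for awk implementations that lack interval expressions such as {8} (some older ones do): the eight digits are written out explicitly. The helper name join_awk and the sample input are assumptions for illustration:

```shell
# join_awk is a hypothetical helper name used only for this sketch.
join_awk() {
    awk '
        # A line starting with 8 digits begins a new record: end the previous
        # output line first (except before the very first input line).
        NR > 1 && /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/ { printf "\n" }
        # Print every line without a trailing newline.
        { printf "%s", $0 }
        # Terminate the last output line, if there was any input at all.
        END { if (NR > 0) printf "\n" }
    ' "$@"
}

# Made-up sample input, modelled on the log in the question.
printf '%s\n' \
    '20141101 server contain dump' \
    '20141101 server contain nothing' \
    '    {uekdmsam ikdas' \
    'jwdjamc ksadkek}' \
    '20141101 server contain dump' | join_awk
```

Because it works line by line, this does not need to hold the whole file in memory.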


Author by

William R

Updated on September 18, 2022

Comments

William R over 1 year

I have a log file which needs to be parsed and analysed. The file contains something similar to the below:

File:

20141101 server contain dump
20141101 server contain nothing
    {uekdmsam ikdas 

jwdjamc ksadkek} ssfjddkc * kdlsdl
sddsfd jfkdfk 
20141101 server contain dump

Based on the above scenario, I have to check each line: if it doesn't start with a date or number, I have to append it to the previous line.

Output file:

20141101 server contain dump
20141101 server contain nothing {uekdmsam ikdas jwdjamc ksadkek} ssfjddkc * kdlsdl sddsfd jfkdfk 
20141101 server contain dump
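
The requirement can also be sketched in plain POSIX shell, without perl, sed or awk. This is only an illustration under stated assumptions: the helper name join_sh is made up, a "date" is taken to mean eight leading digits, and lines are joined verbatim with no separator inserted between them:

```shell
# join_sh is a hypothetical helper name used only for this sketch.
join_sh() {
    started=0
    # The "|| [ -n ... ]" part also handles a last line with no final newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]*)
                # A dated line starts a new record: close the previous one.
                if [ "$started" -eq 1 ]; then printf '\n'; fi
                ;;
        esac
        printf '%s' "$line"
        started=1
    done
    # Terminate the last output line if anything was read.
    if [ "$started" -eq 1 ]; then printf '\n'; fi
}

# Made-up sample input, modelled on the log above.
printf '%s\n' \
    '20141101 server contain dump' \
    '20141101 server contain nothing' \
    '    {uekdmsam ikdas' \
    'jwdjamc ksadkek}' \
    '20141101 server contain dump' | join_sh
```

A shell read loop is usually much slower than awk or perl on large files, so this is mainly useful where those tools are unavailable.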

muru over 9 years

@terdon, updated to save last newline.
terdon over 9 years

Nice one! I'd upvote you but I'm afraid I already had :)
terdon over 9 years

Nice, +1. Could you add an explanation of how it works please?
mirabilos over 9 years

Aw. Nice. I always do tr '\n' $'\a' | sed $'s/\a\a*\( *[^0-9]\)/\1/g' | tr $'\a' '\n' myself.
mirabilos over 9 years

Sorry, have to downvote though for using things that are not POSIX BASIC REGULAR EXPRESSIONS in sed(1), which is a GNUism.
Costas over 9 years

@mirabilos Could you kindly point out the non-POSIX expression in my script?
mirabilos over 9 years

There is no + or \+ in POSIX basic regular expressions.
Costas over 9 years

@mirabilos From man grep, "Basic vs Extended Regular Expressions": In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Stéphane Chazelas over 9 years

No, -0 is for NUL-delimited records. Use -0777 to slurp the entire file in memory (which you don't need to here).
Stéphane Chazelas over 9 years

@Costas, that's GNU grep's man page. POSIX BRE spec are there. BRE equivalent of ERE + is \{1,\}. [\n] is not portable either. \n\{1,\} would be POSIX.
muru over 9 years

@StéphaneChazelas So what's the best way to make Perl match the newline, other than reading the whole file in?
Costas over 9 years

@StéphaneChazelas OK, if you'd like to be so old-school you are welcome to change \+ to \{1,\}
Stéphane Chazelas over 9 years

Also, you can't have another command after a label. : 1;x is to define the 1;x label in POSIX seds. So you need: sed -e :1 -e 'N;$!b1' -e 's/\n\{1,\}\( *[^0-9]\)/\1/g'. Also note that many sed implementations have a small limit on the size of their pattern space (POSIX only guarantees 10 x LINE_MAX IIRC).
Stéphane Chazelas over 9 years

The first argument to printf is the format. Use printf "%s", $_
Costas over 9 years

@StéphaneChazelas Yes, I am worried about the space limitation too. I even tried to play with P and D, but I couldn't find an acceptable solution.
terdon over 9 years

@StéphaneChazelas why? I mean, I know it's cleaner and perhaps easier to understand but is there any danger that that would protect from?
Stéphane Chazelas over 9 years

Yes, it's wrong and potentially dangerous if the input may contain % characters. Try with an input with %10000000000s for instance.
Stéphane Chazelas over 9 years

In C, that's a very well known very bad practice and vulnerability source. With perl, echo %.10000000000f | perl -ne printf brings my machine to its knees.
Stéphane Chazelas over 9 years

See the other answers that process the file line by line.
terdon over 9 years

@StéphaneChazelas wow, yes. Mine too. Fair enough then, answer edited and thanks.
Stéphane Chazelas over 9 years

ITYM END{print ""}. Alternative: awk -v ORS= 'NR>1 && /^[0-9]{8}/{print "\n"};1;END{print "\n"}'
muru over 9 years

@AvinashRaj Yes, it should be more efficient, but produces wrong results if non-log lines include blank ones?
muru over 9 years

@StéphaneChazelas so there's no middle ground between "matching a newline" and "reading the whole file and the library next to it"?
mirabilos over 9 years

This will break if the line contains, say, a backslash and an n. It also strips whitespace. But you can use mksh to do this: while IFS= read -r L; do [[ $L = [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]* ]] && print; print -nr -- "$L"; done; print
rook over 9 years

Of course it is not an algorithm for everything, but a solution for the requirements given by the task. Of course the final solution will be more complex and less readable at a glance, as usually happens in Real Life :)
mirabilos over 9 years

I agree, but I’ve learned the hard way to not assume too much about the OP ☺ especially if they replace the actual text by dummy text.