Replace any number of tabs and spaces with single new line in Linux?

Solution 1

etopylight was almost right:

tr -s ' \t' '\n'

because the question asks to replace tabs, too.
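As a quick check, here is a minimal sketch that runs this command on a cut-down version of the two-column sample from the question. The helper name words_per_line and the sample text are illustrative only, and it assumes a tr (such as GNU tr) that pads the shorter second set.

```shell
# Illustrative helper (hypothetical name): turn each run of spaces/tabs into one newline.
words_per_line() {
    tr -s ' \t' '\n'
}

# Cut-down two-column sample, with a tab mixed into the first gap.
sample=$(printf 'Cat   \t   Dog\nSoup      Rat\n')
result=$(printf '%s\n' "$sample" | words_per_line)
printf '%s\n' "$result"
```

Because -s squeezes after translating, the original line endings merge with the new newlines, so no blank lines appear between words here.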

Solution 2

Basically, you could do it in GNU sed:

sed 's/\s\+/\n/g'

There you go...

Solution 3

You should be able to use

sed -e 's/[[:space:]]\{1,\}/\n/g'

to replace any sequence of one or more whitespace characters (including oddities like formfeed and vertical tabs) with a single newline.

Solution 4

If GNU grep is available:

grep -Po '\S+'

Solution 5

It could be done in a long list of ways:

tr -s ' \t' '\n' <file                              # for tabs and spaces only
tr -s ' \t,."' '\n' <file                           # for your strings
tr -s '[:blank:]' '\n' <file                        # for tabs and spaces only
tr -s '[:space:]' '\n' <file                        # for space and \t\n\v\f\r
sed -e 's/[ \t]/\n/g' -e 's/\n\n*/\n/g' file        # needs GNU sed for \n
sed 's/[ \t".,]\+/\n/g' file | tr -s '\n'           # needs GNU sed for \n


Description

tr

The most basic tool is tr, but it only maps characters and knows nothing about the structure of a line. So, a basic

tr -s ' ' '\n' <file

would convert each run of spaces to one newline. That still leaves one empty line at the top if the file starts with spaces, because the leading run is squeezed to a single newline rather than removed. tr cannot correct that by itself. It can be done by adding a filter that removes empty lines, such as sed '/^$/d':

tr -s ' ' '\n' <file | sed '/^$/d'

Additional characters (like tabs) could be added:

tr -s ' \t' '\n' <file | sed '/^$/d'

Other characters that turn up in running text (like this paragraph), such as commas and periods, could also be added. That changes the definition of what a word is.

tr -s ' \t,.()' '\n' <file | sed '/^$/d'
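The effect of the empty-line filter can be seen on input that begins with blanks, like the second sample in the question. This is only a sketch: split_words is a hypothetical helper name and the input is a shortened, made-up variant of that sample.

```shell
# Hypothetical helper: split on spaces/tabs, then drop empty lines.
split_words() {
    tr -s ' \t' '\n' | sed '/^$/d'
}

# Input that starts with spaces and contains tabs.
poem=$(printf '      Once    upon\n  a   midnight\n\t\tdreary\n')

# Without the filter the leading blanks become one empty first line.
raw=$(printf '%s\n' "$poem" | tr -s ' \t' '\n')
# With the filter only the words remain.
clean=$(printf '%s\n' "$poem" | split_words)

printf '%s\n' "$clean"
```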

sed

sed is more capable than tr (and can be slower). It can replace a run of chosen characters with one newline. The basic idea would be (in GNU sed, for the \n in the replacement):

sed 's/[ \t]\{1,\}/\n/g' 

In seds that do not accept \n on the right-hand side of a substitution, use a backslash followed by an actual newline (and, in some seds, an actual tab inside the brackets instead of \t):

sed 's/[   ]\{1,\}/\
/g'

So, sed helps with some issues but makes others unnecessarily complex.
Read 4.1. How do I insert a newline into the RHS of a substitution?
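One way to avoid typing the literal newline inside the sed script is to keep it in a shell variable. The following is a sketch under that assumption; the names nl and sed_split are made up for illustration, and [[:blank:]] is used so that no literal tab is needed.

```shell
# Hold a literal newline in a variable so the sed script needs no \n escape.
nl='
'

# Hypothetical helper: inside double quotes "\\$nl" expands to a backslash
# followed by a newline, which sed accepts in a replacement.
# The second sed drops the empty line left by leading blanks.
sed_split() {
    sed "s/[[:blank:]]\{1,\}/\\$nl/g" | sed '/^$/d'
}

sed_result=$(printf '  AC/DC \t add-on\nAT&T   3-D\n' | sed_split)
printf '%s\n' "$sed_result"
```

Words with punctuation such as AC/DC and AT&T pass through unchanged, since only blanks are treated as separators.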

grep

It could be done in grep as well. Just match sequences of non-whitespace characters:

grep -o '[^         ]\{1,\}'     ## explicit space-tab.

In GNU grep, the equivalent \S+ could be used:

grep -Eo '\S+'     ## or grep -o '\S\+'
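A variant that avoids both the literal tab and the GNU-only \S is to negate the [:space:] class. This is a sketch, not part of the original answer; grep_words is a hypothetical name, and -o itself is an extension found in GNU and BSD grep rather than in POSIX.

```shell
# Hypothetical helper: print every run of non-whitespace characters on its own line.
grep_words() {
    grep -o '[^[:space:]]\{1,\}'
}

# Leading blanks, an empty line and a whitespace-only line produce no output lines.
grep_result=$(printf '   carbon-14 \t RS-232\n\n   \nmother-in-law  3-D\n' | grep_words)
printf '%s\n' "$grep_result"
```

Since grep prints only what matches, there is nothing to clean up afterwards.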

awk

In awk we get still more power from the tool. It gets quite simple:

awk '{for(i=1;i<=NF;i++) {print $i}}' file

That just prints every field of each line, one per output line. If a line has no fields (NF is 0), nothing is printed for it.

That is using the default FS which delimits fields on repeated spaces, tabs or newlines.

A similar solution could be done with RS (for awks that allow regex separators):

awk -v RS="[ \t\n]+" 'NF'

This tells awk to split records on runs of spaces, tabs or newlines and to print a record only if it has a non-empty field (the NF condition).
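The field-loop version can be checked against input with leading blanks, an empty line and a whitespace-only line. This is a small sketch with a hypothetical helper name (awk_words) and made-up input in the style of the question's second sample.

```shell
# Hypothetical helper wrapping the field loop. The default FS treats runs of
# spaces and tabs as one separator and ignores leading and trailing blanks.
awk_words() {
    awk '{ for (i = 1; i <= NF; i++) print $i }'
}

awk_result=$(printf '     of forgotten lore\n\n \t \nwhile\t\tI pondered\n' | awk_words)
printf '%s\n' "$awk_result"
```

Lines without fields are skipped by the loop itself, so no separate empty-line filter is needed.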

Author by

user2925489

Updated on September 18, 2022

Comments

  • user2925489
    user2925489 over 1 year

    Suppose I have a (potentially very large) text file that contains a word list with whitespace interjected.  For example, it might look like this:

    Cat                           Dog
    Soup                          Rat
    Cass                          Audrey
    

    I want each word on a separate line (with no whitespace), like this:

    Cat
    Dog
    Soup
    Rat
    Cass
    Audrey
    

    I can do a simple tr -d " " to make that into:

    CatDog
    SoupRat
    CassAudrey
    

    (but that is not what I want).

    I do not know what type of blank space separates those words, so assume that it's some combination of ordinary ASCII spaces and tabs.  (We can assume that there are no invisible Unicode characters like em spaces and zero-width thingies.)  Naturally, the words do not contain whitespace, so "à la", "alma mater", "apple pie", "at large" and "ice cream" are not valid words.

    Assume that words may contain (non-blank) non-alphabetic characters, such as "AC/DC", "add-on", "AT&T", "audio-visual", "can't", "carbon-14", "jack-o'-lantern", "mother-in-law", "o'clock", "O'Reilly", "RS-232" and "3-D".  Ideally the solution should tolerate non-ASCII characters, as in "Ångström", "Gödel", "naïve", "résumé" and "smörgåsbord".

    How do I get rid of all those spaces while preserving (and isolating) the indented words using common Unix/Linux tools like tr, sed or awk?

    It would be great if the solution would also work for more general cases of the stated problem; i.e., not just two-column text, but also random arrangements like:

              Once    upon
        a   midnight
                        dreary
    while                     I pondered
           weak    and weary
               Over                many
    a   quaint  and     curious     volume
     of forgotten lore
    
    • Admin
      Admin over 6 years
      set -f; printf '%s\n' $(<file); set +f. This is halfway a joke, because there are other types of expansion in the shell besides globs, but in some hackish cases it might be a very simple solution.
    • Admin
      Admin about 3 years
      This is not a question "describing a problem that can't be reproduced and seemingly went away on its own (or went away when a typo was fixed)". This question describes a reproducible problem, whose solution(s) are likely to help future readers.  The fact that the OP didn't actually have the problem they described does not invalidate the question, per se.
    • Admin
      Admin about 3 years
      @G-Man it looks to me like the OP said in version 3 that "So it was looking like some words were appearing right-justified. I went through the same file more slowly with vim and there were no right-justified words." which sounds to me like they realized that there wasn't actually a problem to solve. If we want to reopen this question for the existing answers, I'd suggest editing the Q down to focus on the problem that they solve.
    • Admin
      Admin about 3 years
      @G-ManSays'ReinstateMonica' I personally think we should keep this question closed, since the OP is in no position to accept an answer. I'll abstain from voting in the reopen queue, though. If we think that this is the best question we have on removing spaces, then I would say to edit the Q to focus on that, removing the backstory and "I didn't actually have this problem" parts.
  • Philippos
    Philippos over 6 years
    Almost portable, but most sed versions will insert a backslash and an n, because \n in the replacement is undefined by the standard. Use a literal newline instead (typically by typing backslash, Ctrl-V, Ctrl-J).
  • Stéphane Chazelas
    Stéphane Chazelas over 6 years
    The POSIX equivalent would be tr -s ' \t' '[\n*]'. See also tr -s '[:space:]' '[\n*]' or tr -s '[:blank:]' '[\n*]'
  • done
    done about 3 years
    This fails on "AC/DC", "add-on", "AT&T". You need something like tr -s ' \t,."' '\n' <file
  • G-Man Says 'Reinstate Monica'
    G-Man Says 'Reinstate Monica' about 3 years
    @Isaac Well, that’s debatable. The question says, “Assume that words may contain (non-blank) non-alphabetic characters”. (Disclosure: I edited the question to say that, with the intent of keeping my answer correct.) If words may contain non-alphabetic characters, then "AC/DC", "add-on" and  "AT&T" are all words. And the OP didn’t give us any clue how they want “Mr.”, “Mrs.”, “Ph.D” or “Q.E.D.” to be handled. While comma (and semicolon) maybe should always be separators, people with 20th century technology sometimes used " to denote umlaut / dieresis; e.g., na"ive for naïve.
  • done
    done about 3 years
    Fair enough. @G-ManSays'ReinstateMonica'
  • Ed Morton
    Ed Morton over 2 years
    That would output a blank line at the start of the OP's 2nd set of input (the one that starts with spaces).
  • Ed Morton
    Ed Morton over 2 years
    That would output multiple blank lines given the OP's 2nd set of input (the one that starts with spaces).
  • Ed Morton
    Ed Morton over 2 years
    That would output multiple blank lines and lines containing spaces given the OP's 2nd set of input (the one that starts with spaces).
  • Ed Morton
    Ed Morton over 2 years
    That wouldn't handle tabs in the input and would output a blank line at the start of the OP's 2nd set of input (the one that starts with spaces).
  • Ed Morton
    Ed Morton over 2 years
    I believe all of those would output a blank line at the start of the OP's 2nd set of input (the one that starts with spaces).
  • Ed Morton
    Ed Morton over 2 years
    That would work, though you don't need -P (which even in new versions of GNU grep is still considered "experimental" in combination with other grep options so I personally avoid), it'd work the same with -E.
  • done
    done over 2 years
    Should awk -v RS='[[:space:]]+' 'NF' file2 not work ?
  • done
    done over 2 years
    @EdMorton Yes, technically, that is correct for the initial ways I posted. It is quite simple to remove empty lines, anyway, so, I am not very worried for this issue. However, I extended the description to solve that from different points of view. I believe that I have clarified that issue.
  • Ed Morton
    Ed Morton over 2 years
    @ImHere yes, I think that would work too.