Replace any number of tabs and spaces with single new line in Linux?
Solution 1
etopylight was almost right:
tr -s ' \t' '\n'
because the question asks to replace tabs, too.
Solution 2
Basically, you could do it in GNU sed:
sed 's/\s\+/\n/g'
There you go...
Solution 3
You should be able to use
sed -e 's/[[:space:]]\{1,\}/\n/g'
to replace any sequence of one or more whitespace characters (including oddities like formfeed and vertical tabs) with a single newline.
Solution 4
If GNU grep is available:
grep -Po '\S+'
Solution 5
It could be done in a long list of ways:
tr -s ' \t' '\n' <file
for tabs and spaces only.
tr -s ' \t,."' '\n' <file
for your strings (this also splits on commas, periods and double quotes).
tr -s '[:blank:]' '\n' <file
for tabs and spaces only.
tr -s '[:space:]' '\n' <file
for all whitespace: space, \t, \n, \v, \f and \r.
sed -e 's/[ \t]/\n/g' -e 's/\n\n*/\n/g' file
(requires GNU sed for \n.)
sed 's/[ \t".,]\+/\n/g' file | tr -s '\n'
(requires GNU sed for \n.)
Description
tr
The most basic tool is tr, but it only maps characters one at a time and has no notion of the text's format.
So, a basic
tr -s ' ' '\n' <file
would convert every run of spaces to one newline. Because -s also squeezes the resulting newlines, blank and space-only lines collapse as well, but an empty first line still appears if the file starts with spaces (or with a newline). tr cannot correct that on its own. Adding a filter that removes empty lines, such as sed '/^$/d', does:
tr -s ' ' '\n' <file | sed '/^$/d'
Additional characters (like tabs) could be added:
tr -s ' \t' '\n' <file | sed '/^$/d'
Other characters that show up in a text paragraph (like this one), such as commas and periods, could also be added. That changes the definition of what a word is.
tr -s ' \t,.()' '\n' <file | sed '/^$/d'
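As a rough sketch, the tr-plus-filter pipeline above can be wrapped in a small shell function. The name split_words_tr and the sample input are made up for illustration; only the tr and sed commands come from the answer.

```shell
# split_words_tr: hypothetical helper name, not part of the original answer.
# Reads text on stdin and writes one word per line.
split_words_tr() {
    # Turn every run of spaces and tabs into a single newline,
    # then drop the empty first line that leading blanks can leave behind.
    tr -s ' \t' '\n' | sed '/^$/d'
}

# Invented sample: leading spaces, a tab, and repeated spaces.
printf '   Cat \t Dog\n\tSoup   Rat\n' | split_words_tr
```

With this sample the function prints Cat, Dog, Soup and Rat on four lines, with no empty line at the top.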
sed
sed is more capable than tr (and may be slower). It can change runs of chosen characters to one newline. The basic idea (GNU sed is needed for \n in the replacement) would be:
sed 's/[ \t]\{1,\}/\n/g'
In other seds that cannot use \n on the right side of a replacement, we need to use an actual newline (and, sometimes, an actual tab):
sed 's/[ ]\{1,\}/\
/g'
So sed helps with some issues but makes others unnecessarily complex.
Read 4.1. How do I insert a newline into the RHS of a substitution?
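A minimal sketch of the portable form, assuming a POSIX shell and sed. The function name split_words_sed and the tab variable are hypothetical; the literal tab is produced with printf so that it stays visible in the source, and a second sed drops the empty lines that leading or trailing blanks leave behind.

```shell
# split_words_sed: hypothetical helper name, not part of the original answer.
split_words_sed() {
    # Build a literal tab, since \t inside a bracket expression is not portable.
    tab=$(printf '\t')
    # The backslash followed by a real line break is the portable way
    # to put a newline in the replacement.
    sed 's/[ '"$tab"']\{1,\}/\
/g' | sed '/^$/d'
}

# Invented sample: leading, trailing and mixed blanks.
printf '  Cat\t Dog  \nSoup\n' | split_words_sed
```

With this sample the function prints Cat, Dog and Soup, one per line.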
grep
It could be done in grep as well. Just match sequences of non-blank characters:
grep -o '[^ ]\{1,\}' ## explicit space-tab.
In GNU grep, the equivalent \S+ could be used:
grep -Eo '\S+' ## or grep -o '\S\+'
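For a variant that avoids both the literal tab and the GNU-only \S, a POSIX bracket class can be used. This is only a sketch: split_words_grep is a made-up name, and -o is an extension found in GNU and BSD grep, so it may be missing elsewhere.

```shell
# split_words_grep: hypothetical helper name, not part of the original answer.
split_words_grep() {
    # -o prints each match on its own line; the pattern is a run of
    # one or more non-whitespace characters, so punctuation stays in the word.
    grep -o '[^[:space:]]\{1,\}'
}

# Invented sample using words from the question.
printf '  AC/DC\tadd-on  AT&T\n' | split_words_grep
```

Note that grep exits with status 1 when the input contains no words at all, which matters in scripts that run under set -e.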
awk
In awk we get still more power from the tool. It gets quite simple:
awk '{for(i=1;i<=NF;i++) {print $i}}' file
Which is just: print every field of each line. If a line has no fields (NF is 0), nothing is printed for it.
That is using the default FS which delimits fields on repeated spaces, tabs or newlines.
A similar solution could be done with RS (for awks that allow regex separators):
awk -v RS="[ \t\n]+" 'NF'
Which tells awk to split records on runs of spaces, tabs or newlines and to print a record only if it has a non-empty field (the NF condition).
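The field loop needs no regex record separator, so it is a reasonable default when portability matters. A small sketch follows; split_words_awk is a hypothetical wrapper name, and the sample text is invented.

```shell
# split_words_awk: hypothetical helper name, not part of the original answer.
# Reads the files given as arguments, or stdin when there are none.
split_words_awk() {
    # The default FS splits on runs of blanks, so leading and trailing
    # whitespace and empty lines produce no fields and print nothing.
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$@"
}

# Invented sample: leading spaces, a tab, an empty line, trailing space.
printf '  Once upon\ta\n\n   midnight \n' | split_words_awk
```

With this sample the function prints Once, upon, a and midnight, one per line, with no empty lines.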
user2925489
Updated on September 18, 2022
Comments
-
user2925489 over 1 year
Suppose I have a (potentially very large) text file that contains a word list with whitespace interjected. For example, it might look like this:
Cat Dog Soup Rat Cass Audrey
I want each word on a separate line (with no whitespace), like this:
Cat
Dog
Soup
Rat
Cass
Audrey
I can do a simple
tr -d " "
to make that into:
CatDog SoupRat CassAudrey
(but that is not what I want).
I do not know what type of blank space separates those words, so assume that it's some combination of ordinary ASCII spaces and tabs. (We can assume that there are no invisible Unicode characters like em spaces and zero-width thingies.) Naturally, the words do not contain whitespace, so "à la", "alma mater", "apple pie", "at large" and "ice cream" are not valid words.
Assume that words may contain (non-blank) non-alphabetic characters, such as "AC/DC", "add-on", "AT&T", "audio-visual", "can't", "carbon-14", "jack-o'-lantern", "mother-in-law", "o'clock", "O'Reilly", "RS-232" and "3-D". Ideally the solution should tolerate non-ASCII characters, as in "Ångström", "Gödel", "naïve", "résumé" and "smörgåsbord".
How do I get rid of all those spaces while preserving (and isolating) the indented words using common Unix/Linux tools like tr, sed or awk?
It would be great if the solution would also work for more general cases of the stated problem; i.e., not just two-column text, but also random arrangements like:
Once upon a midnight dreary while I pondered weak and weary Over many a quaint and curious volume of forgotten lore
-
Admin over 6 years
set -f; printf '%s\n' $(<file); set +f
This is halfway a joke, because there are other types of expansion in the shell besides globs, but in some hackish cases it might be a very simple solution.
-
Admin about 3 years
This is not a question "describing a problem that can't be reproduced and seemingly went away on its own (or went away when a typo was fixed)". This question describes a reproducible problem, whose solution(s) are likely to help future readers. The fact that the OP didn't actually have the problem they described does not invalidate the question, per se.
-
Admin about 3 years
@G-Man it looks to me like the OP said in version 3 that "So it was looking like some words were appearing right-justified. I went through the same file more slowly with vim and there were no right-justified words." which sounds to me like they realized that there wasn't actually a problem to solve. If we want to reopen this question for the existing answers, I'd suggest editing the Q down to focus on the problem that they solve.
-
Admin about 3 years
@G-ManSays'ReinstateMonica' I personally think we should keep this question closed, since the OP is in no position to accept an answer. I'll abstain from voting in the reopen queue, though. If we think that this is the best question we have on removing spaces, then I would say to edit the Q to focus on that, removing the backstory and "I didn't actually have this problem" parts.
-
-
Philippos over 6 years
Almost portable, but most sed versions will insert a backslash and an n, because \n in the replacement is undefined by the standard. Use a literal newline instead (typically by typing backslash, Ctrl-V, Ctrl-J).
-
Stéphane Chazelas over 6 years
The POSIX equivalent would be tr -s ' \t' '[\n*]'. See also tr -s '[:space:]' '[\n*]' or tr -s '[:blank:]' '[\n*]'
-
done about 3 years
This fails on "AC/DC", "add-on", "AT&T". You need something like tr -s ' \t,."' '\n' <file
-
G-Man Says 'Reinstate Monica' about 3 years
@Isaac Well, that’s debatable. The question says, “Assume that words may contain (non-blank) non-alphabetic characters”. (Disclosure: I edited the question to say that, with the intent of keeping my answer correct.) If words may contain non-alphabetic characters, then "AC/DC", "add-on" and "AT&T" are all words. And the OP didn’t give us any clue how they want “Mr.”, “Mrs.”, “Ph.D” or “Q.E.D.” to be handled. While comma (and semicolon) maybe should always be separators, people with 20th century technology sometimes used " to denote umlaut / dieresis; e.g., na"ive for naïve.
-
done about 3 years
Fair enough. @G-ManSays'ReinstateMonica'
-
Ed Morton over 2 years
That would output a blank line at the start of the OP's 2nd set of input (the one that starts with spaces).
-
Ed Morton over 2 years
That would output multiple blank lines given the OP's 2nd set of input (the one that starts with spaces).
-
Ed Morton over 2 years
That would output multiple blank lines and lines containing spaces given the OP's 2nd set of input (the one that starts with spaces).
-
Ed Morton over 2 years
That wouldn't handle tabs in the input and would output a blank line at the start of the OP's 2nd set of input (the one that starts with spaces).
-
Ed Morton over 2 years
I believe all of those would output a blank line at the start of the OP's 2nd set of input (the one that starts with spaces).
-
Ed Morton over 2 years
That would work, though you don't need -P (which even in new versions of GNU grep is still considered "experimental" in combination with other grep options so I personally avoid), it'd work the same with -E.
-
done over 2 years
Should awk -v RS='[[:space:]]+' 'NF' file2 not work?
-
done over 2 years
@EdMorton Yes, technically, that is correct for the initial ways I posted. It is quite simple to remove empty lines anyway, so I am not very worried about this issue. However, I extended the description to solve that from different points of view. I believe that I have clarified that issue.
-
Ed Morton over 2 years
@ImHere yes, I think that would work too.