Bash script: split word on each letter

Solution 1

I would use grep:

$ grep -o . <<<"StackOver"
S
t
a
c
k
O
v
e
r

or sed:

$ sed 's/./&\n/g' <<<"StackOver"
S
t
a
c
k
O
v
e
r

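And if the empty line at the end is an issue (this variant only suits a single word, because \B matches only where there is no word boundary):

$ sed 's/\B/&\n/g' <<<"StackOver"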

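All of this assumes GNU grep and GNU sed (-o and \n in the replacement are GNU extensions) and a shell that supports <<<, such as bash, zsh, or a recent ksh93.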
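
For comparison, here is a minimal sketch that avoids external tools and relies only on the shell's own parameter expansion. It is not part of the answer above, and the function name split_word is made up for illustration. It prints no empty line at the end and keeps a space on a line of its own. It assumes the shell's ? pattern matches one whole character, which holds for bash in a matching locale; some shells, such as dash, match single bytes instead.

```shell
# split_word is a hypothetical helper name, not from the answer above.
# The body runs in a subshell ( ... ) so its variables stay private.
split_word() (
    rest=$1
    while [ -n "$rest" ]; do
        tail=${rest#?}                    # everything after the first character
        printf '%s\n' "${rest%"$tail"}"   # the first character on its own
        rest=$tail
    done
)

split_word 'StackOver'
```

Called as split_word 'Stack Over', it prints the space on a separate line instead of leaving k and O together.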

Solution 2

You may want to break on grapheme clusters instead of characters if the intent is to print text vertically. For instance, with an e followed by a combining acute accent:

  • With grapheme clusters (e with its acute accent would be one grapheme cluster):

    $ perl -CLAS -le 'for (@ARGV) {print for /\X/g}' $'Ste\u301phane'
    S
    t
    é
    p
    h
    a
    n
    e
    

    (or grep -Po '\X' with GNU grep built with PCRE support)

  • With characters (here with GNU grep):

    $ printf '%s\n' $'Ste\u301phane' | grep -o .
    S
    t
    e
    
    p
    h
    a
    n
    e
    
  • fold is meant to break on characters, but GNU fold doesn't support multi-byte characters, so it breaks on bytes instead:

    $ printf '%s\n' $'Ste\u301phane' | fold -w 1
    S
    t
    e
    �
    �
    p
    h
    a
    n
    e
    

On StackOver, which consists only of ASCII characters (so one byte per character, one character per grapheme cluster), all three would give the same result.
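
The two � lines in the fold output are the two bytes that encode the combining accent (U+0301) in UTF-8. A minimal sketch to check the byte count from the shell follows; byte_length is a made-up helper name, and the approach assumes a shell such as bash where assigning LC_ALL takes effect immediately.

```shell
# byte_length is a hypothetical helper: it counts bytes by switching to the
# C locale inside a subshell, so the caller's locale is left untouched.
byte_length() (
    LC_ALL=C
    printf '%s\n' "${#1}"
)

# "Ste" + U+0301 (bytes 0xCC 0x81, written in octal for printf) + "phane"
name=$(printf 'Ste\314\201phane')

byte_length "$name"        # bytes: 3 + 2 + 5 = 10
printf '%s\n' "${#name}"   # characters as counted in the current locale
```

In a UTF-8 locale the second count would normally be smaller than the first, since the accent is one character but two bytes.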

Solution 3

If you have perl6 on your box:

$ perl6 -e 'for @*ARGS -> $w { .say for $w.comb }' 'cường'       
c
ư
ờ
n
g

This works regardless of your locale.

Solution 4

With many awk versions:

awk -F '' -v OFS='\n' '{$1=$1};1' <<<'StackOver'
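
POSIX leaves the behaviour of an empty field separator unspecified, which is why this works with many awk versions but not all. A more portable sketch walks over the line with substr instead. The name split_chars_awk is made up for illustration, and awk implementations without multi-byte support would still split non-ASCII text into bytes.

```shell
# split_chars_awk is a hypothetical helper name.
# It reads lines on standard input and prints one character per line,
# using only length() and substr(), which every POSIX awk provides.
split_chars_awk() {
    awk '{
        n = length($0)
        for (i = 1; i <= n; i++)
            print substr($0, i, 1)
    }'
}

printf '%s\n' 'StackOver' | split_chars_awk
```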

Solution 5

You can use the fold(1) command. It is more efficient than grep and sed; for example, compared with grep:

$ time grep -o . <bigfile >/dev/null

real    0m3.868s
user    0m3.784s
sys     0m0.056s
$ time fold -b1 <bigfile >/dev/null

real    0m0.555s
user    0m0.528s
sys     0m0.016s
$

One significant difference is that fold will reproduce empty lines in the output:

$ grep -o . <(printf "A\nB\n\nC\n\n\nD\n")
A
B
C
D
$ fold -b1 <(printf "A\nB\n\nC\n\n\nD\n")
A
B

C


D
$ 
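
If those empty lines are unwanted, one option is to filter them out after fold. This is a minimal sketch that assumes ASCII input, since fold -b splits on bytes; fold_letters is a made-up name.

```shell
# fold_letters is a hypothetical helper name.
# fold -b -w 1 puts each byte on its own line and passes empty input
# lines through; sed then deletes those empty lines.
fold_letters() {
    fold -b -w 1 | sed '/^$/d'
}

printf 'A\nB\n\nC\n\n\nD\n' | fold_letters
```

The extra sed process gives back some of the speed advantage, so this is only worth it when the output has to match that of grep -o.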

Author: Sijaan Hallak

Updated on September 18, 2022

Comments

  • Sijaan Hallak
    Sijaan Hallak over 1 year

    How can I split a word's letters, with each letter in a separate line?

    For example, given "StackOver" I would like to see

    S
    t
    a
    c
    k
    O
    v
    e
    r
    

    I'm new to bash so I have no clue where to start.

  • Sijaan Hallak
    Sijaan Hallak over 8 years
    grep -o . <<< ¿¿¿ ... -o searches for the PATTERN provided, right? And what does it do here in your command?
  • jimmij
    jimmij over 8 years
    @SijaanHallak grep searches for a pattern, and in this example it searches for every character (.) and prints each one on a separate line. See also the sed solution.
  • Sijaan Hallak
    Sijaan Hallak over 8 years
    Thanks! So this "." dot means every character. Can you please give me a link where I can read about things such as this dot? Or what are these things called?
  • jimmij
    jimmij over 8 years
    I'm surprised grep -Po doesn't do what one would expect (like grep -P does).
  • Stéphane Chazelas
    Stéphane Chazelas over 8 years
    Note that both -o and \n are a GNU extension. <<< is a zsh extension (also available in recent versions of ksh93 and the GNU shell (bash)).
  • Stéphane Chazelas
    Stéphane Chazelas over 8 years
    @jimmij, what do you mean? grep -Po . finds characters (and a combining acute accent following a newline character is invalid), and grep -Po '\X' finds grapheme clusters for me. You may need a recent version of grep and/or PCRE for it to work properly (or try grep -Po '(*UTF8)\X')
  • jimmij
    jimmij over 8 years
    @SijaanHallak The best manual is already on your computer: just run man grep and look for the section "REGULAR EXPRESSIONS" (if that is what you are interested in).
  • Avinash Raj
    Avinash Raj over 8 years
    Second answer would produce a new line after last...
  • cuonglm
    cuonglm over 8 years
    NP, should we add a note about the locale?
  • Sijaan Hallak
    Sijaan Hallak over 8 years
    @jimmij I cant find any help on what <<< really does! any help?
  • Sijaan Hallak
    Sijaan Hallak over 8 years
    This won't help as it prints a new line at the end
  • jimmij
    jimmij over 8 years
    @SijaanHallak This is a so-called here string, roughly equivalent to echo foo | ..., just with less typing. See tldp.org/LDP/abs/html/x17837.html
  • kay
    kay over 8 years
    Does not work for combining characters like Stéphane Chazelas answer, but with proper normalization this should not matter.
  • mikeserv
    mikeserv over 8 years
    @Kay - it's works for combining characters if you want it to - that's what sed scripts are for. i'm not likely to write one right about now - im pretty sleepy. it's really useful, though, when reading a terminal.
  • mikeserv
    mikeserv over 8 years
    @cuonglm - if you like. it should just work for the locale, given a sane libc, though.
  • Sijaan Hallak
    Sijaan Hallak over 8 years
    @jimmij the second solution here seems to have a problem: it prints a new line at the end! I changed it to sed -e 's/./\n&/g' <<< "$1", but this prints a new line at the beginning. Any suggestion how to overcome this?
  • jimmij
    jimmij over 8 years
    @SijaanHallak change . to \B (doesn't match on word boundary).
  • Sijaan Hallak
    Sijaan Hallak over 8 years
    @jimmij \B will not work as it prints "Stack Over" -> the "O" will be printed near the letter "k" at the same line and then it does \n
  • jpmc26
    jpmc26 over 8 years
  • Stéphane Chazelas
    Stéphane Chazelas over 8 years
    Note that dd will break multibyte characters, so the output will not be text anymore so the behaviour of sed will be unspecified as per POSIX.
  • mikeserv
    mikeserv over 8 years
    @StéphaneChazelas - do you have a link to reference that statement? a NUL can't occur in a multibyte character, and a dot can only match a whole character which is not NUL, and it has worked with every sed i've tried. how could it not work?
  • mikeserv
    mikeserv over 8 years
    oh wait - you mean because input isn't a text file. possibly, but sed is spec'd to handle conditions which exceed/break text file specs, too, such as 4k pattern spaces scripts which is well beyond line max. its also spec'd to evaluate chars bytewise w/ l - even when a single char is multiple bytes. i think the text file restriction for sed is probably based on the NUL prohibition - many seds replace delimiter in their scripts w/ NULs, and ive never managed to seek past a NUL in pattern space with heirloom sed except with D and G.
  • mikeserv
    mikeserv over 8 years
    @SijaanHallak - you can drop the second sed like: sed -et -e's/./\n&/g;//D'
  • Yunus
    Yunus almost 8 years
    since each byte has a width of 1, the result will be the same!
  • VocalFan
    VocalFan almost 8 years
    So how is this not a duplicate of the earlier answer?
  • Yunus
    Yunus almost 8 years
    because it shows the same command with a different argument, and that is nice to know.
  • eruve
    eruve about 5 years
    Great! But on my version of nAWK ("One True AWK") that doesn't work. However this does the trick: awk -v FS='' -v OFS='\n' '{$1=$1};1' (wondering if that's more portable since -F '' might yield the ERE: //)
  • done
    done almost 3 years
    This removes white space from the original string.
  • done
    done almost 3 years
    An eval could be a big risk; a double eval is even riskier, especially with arbitrary input from $s. Just saying!
  • done
    done almost 3 years
    Are you claiming that $'e\u301' is equivalent/equal to é ?
  • Stéphane Chazelas
    Stéphane Chazelas almost 3 years
    @Isaac, no, I'm not claiming any such thing though there are some definitions of "equivalent" for which that would be true.
  • done
    done almost 3 years
    Your description seems to imply that because Perl is able to join characters and accents together (much like a text editor joins them to select a specific glyph), other software should be able to as well. But no: not all programs are text editors, nor do all utilities understand the complex (especially in Hangul) set of rules for joining individual Unicode code points (unicode.org/reports/tr29 and search for Devanagari kshi). So no, neither grep, sed nor fold understands any of this (yet).