extract every nth character from a string

shell shell-script string

3,426

Solution 1

Two lines

Here is a pure-bash solution that produces a bash array:

s="100000011100"
array=($(
    for ((i=0; i<${#s}-6; i++))
    do
        echo "${s:$i:1}${s:$((i+6)):1}"
    done
    ))
echo "${array[@]}"

This produces the same output as shown in the question:

10 01 01 01 00 00

The key element here is the use of bash's substring expansion. Bash allows the extraction substrings from a variable, say parameter, via ${parameter:offset:length}. In our case, the offset is determined by the loop variable i and the length is always 1.

General Solution For Any Number of Lines

Suppose, for example, that our original string has 18 characters and we want to extract the i-th, the i+6-th, and the i+12-th characters for i from 0 to 5. Then:

s="100000011100234567"
array=($(
    for ((i=0; i<6; i++))
    do
        new=${s:$i:1}
        for ((j=i+6; j<${#s}; j=j+6))
        do 
            new="$new${s:$j:1}"
        done
        echo "$new"
    done
    ))

echo "${array[@]}"

This produces the output:

102 013 014 015 006 007

This same code extends to an arbitrary number of 6-character lines. For example, if s has three lines (18 characters):

s="100000011100234567abcdef"

Then, the output becomes:

102a 013b 014c 015d 006e 007f

Solution 2

Using perl:

$ echo 100000011100 | perl -nle '
    for ($i = 0; $i < length()/2; $i++) {
        print substr($_,$i,1), substr($_,$i+6,1);
    }
'
10
01
01
01
00
00

It works for two lines. If you want to work with arbitrary of lines, you should process lines directly, instead of building big string. With this input:

1   0   0   0   0   0                                                           
0   1   1   1   0   0                                                           
0   0   0   0   0   0

Try:

$ perl -anle '
    for ($i = 0; $i <= $#F; $i++) {
      push @{$h{$i}}, $F[$i];
    }
    END {
        print @{$h{$_}} for keys %h;
    }
' file
000
010
000
100
010
010

Solution 3

As a shell solution, getopts is probably easiest. The thing about getopts is that it is POSIX-specified to do exactly what you're asking - process a byte-stream in a shell-loop. I know that sounds weird, because, if you're like me before I learned this, you're probably thinking, well, gee, I thought it was supposed to handle command-line switches. Which is true, but so is the first thing. Consider:

-thisisonelongstringconsistingofseparatecommandlineswitches

Yes, getopts has to handle that. It has to split that char by char in a loop and return to you each character in either the shell variable $OPTARG or in another that you specify by name all depending on how specific you get when you call it. What's more, it has to return errors in shell variables and save its progress when it does in the shell variable $OPTIND so it can resume right where it left off if you can somehow address it. And it has to do the whole job without invoking a single subshell.

So let's say we have:

arg=$(seq -s '' 1000); set --
while getopts :0123456789 v -"${arg}"
do [ "$((i=$i+1<6?$i+1:0))" -gt 0 ] ||
set "$@" "$v"
done

Hmmm.... I wonder if it worked?

echo "$((${#arg}/6))" "$#"
482 482

That's nice...

eval '
printf %.1s\\n "${arg#'"$(printf %0$((124*6-1))d | tr 0 \?)"'}" "${124}"'
4
4

So, as you can see, the getopts command completely set the array for every sixth byte in the string. And it doesn't have to be numbers like this - nor must it even be shell safe characters - and you needn't even specify the target chars as I did above with 01234565789 either. I've tested this repeatedly in a lot of shells and they all just work. There are some quirks - bash will throw away the first character if it is a whitespace character - dash accepts the : colon as a specified parameter even though it is just about the only POSIX specifically forbids. But none of that matters because getopts still deposits the current opt char's value in $OPTARG even when it returns you an error (represented by a ? assigned to your specified opt var) and otherwise explictly unsets $OPTARG unless you've declared an option should have an argument. And the whitespace thing is kind of a good thing - it only discards a leading space, which is excellent, because, when working with unknown values, you can do:

getopts : o -" $unknown_value"

...to kick off the loop without any danger of the first character actually being in your accepted args string - which would result in getopts sticking the whole thing in $OPTARG at once - as an argument.

Here's another example:

OPTIND=1
while getopts : o -" $(dd if=/dev/urandom bs=16 count=1 2>/dev/null)"                         
do printf '\\%04o' "'$OPTARG"; done  

\0040\0150\0071\0365\0320\0070\0161\0064\0274\0115\0012\0215\0222\0271\0146\0057\0166

I set $OPTIND=1 in the first line because I just used getopts and, until you reset it, it expects its next call to continue where it left off - it wants "${arg2}" in other words. But I don't feel like giving and I'm doing a different thing now, so I let it know by resetting $OPTIND at which point it's good to go.

In this one I used zsh - which doesn't quibble about a leading space - and so the first character is octal 40 - the space character. I don't usually use getopts in that way, though - I usually use it to avoid doing a write() for each byte and instead assign its output - which comes in a variable - to another shell variable - as I did above with set after a fashion. Then, when I'm ready I can take the whole string and when I do usually strip the first byte.

Solution 4

sed is the first thing that pops into my mind.

$ echo 1234567890abcdefghijklmnopqrstuvwxyz | sed 's/.\{5\}\(.\)/\1/g'
6bhntz

Match 5 characters, capture the 6th, and replace them all with that captured character.

This will however have an issue if the length of the string isn't an exact multiple of 6:

$ echo 1234567890abcdefghijklmnopqrstuvwxy | sed 's/.\{5\}\(.\)/\1/g' 
6bhntuvwxy

But we can fix this by altering the sed a little:

$ echo 1234567890abcdefghijklmnopqrstuvwxy | sed 's/.\{1,5\}\(.\{0,1\}\)/\1/g'
6bhnt

Due to the greedy nature of regex, the variable length matches will match as much as they can, and if there's nothing left for the capture, then it doesn't capture, and the characters are just deleted.

View more solutions

3,426

Laura

Updated on September 18, 2022

Comments

Laura almost 2 years

I want to classify a data set (which has four classes) using the SVM method. I've done it using the coding below (using a 1 against all). It isn't terribly accurate but I'm thankful for anything at this stage.

http://www.mathworks.co.uk/matlabcentral/fileexchange/39352-multi-class-svm

I was wondering if there is a way to plot the support vectors and training points. I've managed this for a 2 class SVM classification but can't find a way of doing it with >2 classes.

Any help/advice re. how to achieve a semi-pretty graph would be very much appreciated!
Ramesh over 9 years

Thanks. But if I have 3 lines, I want to get the substrings as 101, 001 and so on. But I am getting only 2 characters with this approach.
Ramesh over 9 years

Thanks. So I will still need a loop to store the sed's output to an array variable?
phemmer over 9 years

Not necessarily. read -a foo < <(echo 1234567890ab | sed 's/.\{1,5\}$.\{0,1\}$/\1 /g')
Ramesh over 9 years

thanks. I have the same problem as I have mentioned in Gnouc's answer. If I have 3 lines, it seems to break.
Ramesh over 9 years

your solution seems to work even if I have more than 2 lines. But I still get the last line as the array value. Instead I want all the values to be stored in my array. This is the command am trying. read -a array_elements < <(column -s '\t' inputfile | tr -d '[:space:]' | sed 's/.\{5\}$.$/\1/g')
Ramesh over 9 years

cool. This one works. You should probably add this one as the answer to the other question.
John1024 over 9 years

@Ramesh See update.
Ramesh over 9 years

cool. This one also works as expected. Thanks again for providing a bash solution.
Ramesh over 9 years

I have used your solution to address the original problem of the OP in the question I linked. Thanks again.
John1024 over 9 years

@Ramesh So, this is combined with wc -w. Nice. +1 for your answer to the other question.
mikeserv over 5 years

you gotta give a little to get the little. for shame, me.