How to split a string into substrings of a given length?

Solution 1

Here is one way

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"

or more generally

text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"

Edit: This is much, much faster (roughly 80 times in the timings below):

sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

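It first splits the string into single characters. Then it pastes each odd-positioned character (1st, 3rd, 5th, ...) to the even-positioned character that follows it; the two-element logical index is recycled along the whole vector.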
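
To see why this works, it can help to look at the intermediate vectors. The sketch below uses a shorter input, "abcdef", chosen purely for illustration; the names odd and even are not from the original answer.

```r
sst <- strsplit("abcdef", "")[[1]]   # one element per character

# The two-element logical masks are recycled along sst.
odd  <- sst[c(TRUE, FALSE)]          # characters at positions 1, 3, 5
even <- sst[c(FALSE, TRUE)]          # characters at positions 2, 4, 6

out <- paste0(odd, even)             # element-wise concatenation
out
# [1] "ab" "cd" "ef"
```

Because paste0 works element-wise, the two vectors must have the same length; with an odd number of characters the shorter one is recycled.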

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
    substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0
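
Both g1 and g2 assume an even number of characters. As a minimal sketch (the name split_fixed and its n argument are hypothetical, not part of the original answer), the substring approach can be parameterised by chunk length. It relies on substring truncating quietly when the stop position runs past the end of the string.

```r
# Hypothetical helper: split text into chunks of n characters,
# leaving the last chunk short when nchar(text) is not a multiple of n.
split_fixed <- function(text, n) {
  len <- nchar(text)
  if (len == 0) return(character(0))   # seq(1, 0, by = n) would be an error
  starts <- seq(1, len, by = n)        # first position of every chunk
  substring(text, starts, starts + n - 1)
}

split_fixed("aabbccccdd", 2)
# [1] "aa" "bb" "cc" "cc" "dd"
split_fixed("abc", 2)
# [1] "ab" "c"
```

This keeps the simplicity of the first version; it makes no claim about speed relative to g2.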

Solution 2

There are two easy possibilities:

s <- "aabbccccdd"
  1. gregexpr and regmatches:

    regmatches(s, gregexpr(".{2}", s))[[1]]
    # [1] "aa" "bb" "cc" "cc" "dd"
    
  2. strsplit:

    strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
    # [1] "aa" "bb" "cc" "cc" "dd"
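
If the string length may not be a multiple of the chunk size, a bounded quantifier in the gregexpr pattern keeps the leftover characters. This is only a sketch: split_regex and n are illustrative names, and the pattern is assembled with sprintf.

```r
# Hypothetical helper: match runs of 1 to n characters, as long as possible.
split_regex <- function(s, n) {
  pattern <- sprintf(".{1,%d}", n)           # e.g. ".{1,2}" for n = 2
  regmatches(s, gregexpr(pattern, s))[[1]]
}

split_regex("aabbccccdde", 2)
# [1] "aa" "bb" "cc" "cc" "dd" "e"
```

The quantifier is greedy, so every chunk except possibly the last has exactly n characters.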
    

Solution 3

string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)

# the indices where each substr will start
starts <- seq(1,num.chars, by=2)

# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii+1)
})

Which gives

[1] "aa" "bb" "cc" "cc" "dd"

Solution 4

One can use a matrix to group the characters:

s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}

s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"

Unfortunately, this breaks for an input of odd string length, giving a warning:

s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
##   data length [3] is not a sub-multiple or multiple of the number of rows [2]

More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:

g1('abc')
## [1] "ab"

g2('abc')
## [1] "ab" "cb"

Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:

s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}

s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d" 
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld" 

(It is indeed slower than g2, but faster than g1 by about a factor of 7)
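
The padding idea can also be combined with the vectorised paste0 trick from g2, avoiding the apply loop. The following is a sketch under that assumption; split_fast is a hypothetical name, and no timings are claimed for it.

```r
# Hypothetical helper: pad the character vector to a multiple of n,
# then paste together the i-th character of every chunk for i = 1..n.
split_fast <- function(x, n) {
  sst <- strsplit(x, "")[[1]]
  if (length(sst) == 0) return(character(0))
  pad <- (n - length(sst) %% n) %% n         # 0 when already a multiple of n
  sst <- c(sst, rep("", pad))
  parts <- lapply(seq_len(n), function(i) sst[seq(i, length(sst), by = n)])
  do.call(paste0, parts)
}

split_fast("hello world", 2)
# [1] "he" "ll" "o " "wo" "rl" "d"
split_fast("hello world", 3)
# [1] "hel" "lo " "wor" "ld"
```

Padding with empty strings means the last chunk simply comes out short, so no recycling takes place.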

Solution 5

Ugly, but it works:

sequenceString <- "ATGAATAAAG"

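J <- 3  # maximum sequence length in file
# full-length chunks of J characters
sequenceSmallVecStart <-
  substring(sequenceString, seq(1, nchar(sequenceString) - J + 1, J),
    seq(J, nchar(sequenceString), J))
# leftover characters after the last full chunk ("" if there are none)
sequenceSmallVecEnd <-
    substring(sequenceString, max(seq(J, nchar(sequenceString), J)) + 1)
# keep the leftover only when it is non-empty
sequenceSmallVec <-
    c(sequenceSmallVecStart, sequenceSmallVecEnd[nzchar(sequenceSmallVecEnd)])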
cat(sequenceSmallVec,sep = "\n")

This prints ATG, AAT, AAA and G, one per line.

Author: MadSeb

Updated on April 24, 2020

Comments

  • MadSeb
    MadSeb about 4 years

    I have a string such as:

    "aabbccccdd"

    I want to break this string into a vector of substrings of length 2 :

    "aa" "bb" "cc" "cc" "dd"

  • mindless.panda
    mindless.panda almost 12 years
    Interesting, didn't know about substring. Much nicer since substr doesn't take vector args for start/end.
  • MadSeb
    MadSeb almost 12 years
    Brilliant! The second version is really, really fast!
  • jackStinger
    jackStinger over 11 years
    I was wondering if there was something like this that would split "aabbbcccccdd" into aa bbb ccccc dd. I use gregexpr at the moment.
  • GSee
    GSee about 11 years
    If it's possible to have an odd number of characters, then it seems to me it would be faster to handle that after the fact than to introduce an apply loop. I bet this is faster: out <- g2(x); if (nchar(x) %% 2 == 1L) out[length(out)] <- substring(out[length(out)], 1, 1); out
  • Joe
    Joe almost 10 years
    @GSee You might want to re-post the g2 portion of this answer on the question this is a duplicate of: stackoverflow.com/questions/2247045/…,
  • mathematical.coffee
    mathematical.coffee almost 9 years
    Got any tricks to extend the fast version to arbitrary chunk length n?
  • GSee
    GSee almost 9 years
    @mathematical.coffee maybe something like this: do.call(paste0, lapply(seq_len(n), function(i) { idx <- rep(FALSE, n); idx[i] <- TRUE; sst[idx] })) but see my comment on Matthew's post about paying attention to whether your input is divisible by n
  • rjss
    rjss over 4 years
    These possibilities are equivalent for the proposed s, but what if s <- "aabbccccdde"? I like the second option better.
  • vwvan
    vwvan over 3 years
    Double check that result:

      test replications elapsed relative user.self sys.self user.child sys.child
      g1            100   0.262    1.000     0.216    0.044          0         0
      g2            100   0.562    2.145     0.530    0.031          0         0
  • Øystein S
    Øystein S over 2 years
    The second option works for any number, e.g., strsplit(s, "(?<=.{11})", perl = TRUE)[[1]], while the first only works for single digits.