How to split a string into substrings of a given length?

string r split

59,979

Solution 1

Here is one way

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"

or more generally

text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"

Edit: This is much, much faster

sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

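It first splits the string into single characters. Then it pastes each character in an odd position (1, 3, 5, ...) together with the character in the even position that follows it (2, 4, 6, ...). The length-2 logical indices are recycled along the whole vector.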
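
To make the indexing step concrete, here is a small sketch on a shorter string. The variable names odd_chars and even_chars are illustrative only and do not appear in the answer above.

```r
sst <- strsplit("abcdef", "")[[1]]

# A length-2 logical index is recycled along the whole vector
odd_chars  <- sst[c(TRUE, FALSE)]   # characters at positions 1, 3, 5
even_chars <- sst[c(FALSE, TRUE)]   # characters at positions 2, 4, 6

odd_chars
#[1] "a" "c" "e"
even_chars
#[1] "b" "d" "f"

# paste0 is vectorised, so the two vectors are combined element by element
pairs <- paste0(odd_chars, even_chars)
pairs
#[1] "ab" "cd" "ef"
```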

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
    substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0

Solution 2

There are two easy possibilities:

s <- "aabbccccdd"

gregexpr and regmatches:

regmatches(s, gregexpr(".{2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"

strsplit:

strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
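
Both patterns above hard-code a chunk length of 2, and ".{2}" only matches complete pairs, so a trailing unpaired character is dropped by the regmatches version. A minimal sketch of a parameterised variant follows, assuming a helper named split_fixed (a hypothetical name, not a base R function) and the pattern ".{1,n}" so that a shorter final chunk is kept.

```r
# split_fixed is a hypothetical helper name, not part of base R
split_fixed <- function(x, n) {
  # greedy quantifier: each match takes up to n characters, at least 1
  pattern <- sprintf(".{1,%d}", as.integer(n))
  regmatches(x, gregexpr(pattern, x))[[1]]
}

split_fixed("aabbccccdd", 2)
# [1] "aa" "bb" "cc" "cc" "dd"
split_fixed("aabbccccdde", 2)
# [1] "aa" "bb" "cc" "cc" "dd" "e"
split_fixed("aabbccccdde", 3)
# [1] "aab" "bcc" "ccd" "de"
```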

Solution 3

string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)

# the indices where each substr will start
starts <- seq(1,num.chars, by=2)

# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii+1)
})

This gives:

[1] "aa" "bb" "cc" "cc" "dd"
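
Because substring() is vectorised over its start and stop arguments, the sapply loop is probably not needed. The sketch below assumes a hypothetical helper name, chunk_substring, and a single non-empty input string; it also takes the chunk length as a parameter, and a stop position past the end of the string is simply truncated.

```r
# chunk_substring is a hypothetical helper name; assumes x is one non-empty string
chunk_substring <- function(x, n) {
  starts <- seq(1, nchar(x), by = n)       # where each chunk begins
  substring(x, starts, starts + n - 1)     # vectorised over start and stop
}

chunk_substring("aabbccccdd", 2)
# [1] "aa" "bb" "cc" "cc" "dd"
chunk_substring("hello world", 3)
# [1] "hel" "lo " "wor" "ld"
```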

Solution 4

One can use a matrix to group the characters:

s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}

s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"

Unfortunately, this breaks for an input of odd string length, giving a warning:

s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
##   data length [3] is not a sub-multiple or multiple of the number of rows [2]

More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:

g1('abc')
## [1] "ab"

g2('abc')
## [1] "ab" "cb"
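
One possible repair for g2, following the idea GSee gives in the comments, is to trim the last element after the fact when the input has an odd number of characters. The wrapper name g2_odd_ok is hypothetical; g2 is repeated here only so that the sketch runs on its own.

```r
g2 <- function(text) {
  sst <- strsplit(text, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

# g2_odd_ok is a hypothetical wrapper name
g2_odd_ok <- function(x) {
  out <- g2(x)
  if (nchar(x) %% 2 == 1L) {
    # for odd lengths the last element is the final character pasted onto a
    # recycled one, so keep only its first character
    out[length(out)] <- substring(out[length(out)], 1, 1)
  }
  out
}

g2_odd_ok("abc")
## [1] "ab" "c"
g2_odd_ok("abcd")
## [1] "ab" "cd"
```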

Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:

s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}

s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d"
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld"

(It is indeed slower than g2, but faster than g1 by about a factor of 7.)

Solution 5

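Ugly, but it works:

sequenceString <- "ATGAATAAAG"

J <- 3  # maximum sequence length in file
n <- nchar(sequenceString)  # assumes n >= J
# full-length chunks of J characters
sequenceSmallVecStart <-
    substring(sequenceString, seq(1, n - J + 1, J), seq(J, n, J))
# leftover characters after the last full chunk ("" when n is a multiple of J)
sequenceSmallVecEnd <-
    substring(sequenceString, max(seq(J, n, J)) + 1)
# drop the empty remainder so that no blank entry is appended
sequenceSmallVec <-
    c(sequenceSmallVecStart, sequenceSmallVecEnd[nzchar(sequenceSmallVecEnd)])
cat(sequenceSmallVec, sep = "\n")

This prints ATG, AAT, AAA and G, one per line.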


Author by

MadSeb

Updated on April 24, 2020

Comments

MadSeb about 4 years

I have a string such as:

"aabbccccdd"

I want to break this string into a vector of substrings of length 2 :

"aa" "bb" "cc" "cc" "dd"

mindless.panda almost 12 years

Interesting, didn't know about substring. Much nicer since substr doesn't take vector args for start/end.

MadSeb almost 12 years

Brilliant! The second version is really, really fast!

jackStinger over 11 years

I was wondering if there was something like this that would split "aabbbcccccdd" into aa bbb ccccc dd. I use gregexpr at the moment.

GSee about 11 years

If it's possible to have an odd number of characters, then it seems to me it would be faster to handle that after the fact than to introduce an apply loop. I bet this is faster: out <- g2(x); if (nchar(x) %% 2 == 1L) out[length(out)] <- substring(out[length(out)], 1, 1); out

Joe almost 10 years

@GSee You might want to re-post the g2 portion of this answer on the question this is a duplicate of: stackoverflow.com/questions/2247045/…

mathematical.coffee almost 9 years

Got any tricks to extend the fast version to arbitrary chunk length n?

GSee almost 9 years

@mathematical.coffee Maybe something like this: do.call(paste0, lapply(seq_len(n), function(i) { idx <- rep(FALSE, n); idx[i] <- TRUE; sst[idx] })) but see my comment on Matthew's post about paying attention to whether your input is divisible by n.

rjss over 4 years

These possibilities are equivalent for the proposed s, but what if s <- "aabbccccdde"? I like the second option better.

vwvan over 3 years

Double check that result:

  test replications elapsed relative user.self sys.self user.child sys.child
  g1          100   0.262    1.000     0.216    0.044          0         0
  g2          100   0.562    2.145     0.530    0.031          0         0

Øystein S over 2 years

The second option works for any number, e.g., strsplit(s, "(?<=.{11})", perl = TRUE)[[1]], while the first only works for single digits.