How to split a string into substrings of a given length?
Solution 1
Here is one way
substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"
or more generally
text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"
Edit: This is much, much faster
sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
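It first splits the string into characters. Then it pastes each odd-positioned character together with the even-positioned character that follows it; the logical index vectors c(TRUE, FALSE) and c(FALSE, TRUE) are recycled along the whole character vector.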
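The same idea can be stretched to an arbitrary chunk length. What follows is only a sketch under stated assumptions: split_every is a hypothetical helper name that does not appear in the answer, and padding with empty strings is one assumed way to keep a short final piece instead of letting recycling produce a wrong one.

```r
# Hypothetical helper: split `text` into pieces of `n` characters,
# leaving the last piece short when nchar(text) is not a multiple of n.
split_every <- function(text, n) {
  sst <- strsplit(text, "")[[1]]
  if (length(sst) == 0) return(character(0))
  # Pad with empty strings so the length becomes a multiple of n
  pad <- (-length(sst)) %% n
  sst <- c(sst, rep("", pad))
  # Column i holds the i-th character of every chunk
  cols <- lapply(seq_len(n), function(i) sst[seq(i, length(sst), by = n)])
  # Paste the columns together element-wise to rebuild the chunks
  do.call(paste0, cols)
}

split_every("aabbccccdd", 2)
split_every("hello world", 3)
```

The padding step is what avoids the silent recycling that affects odd-length input in the two-character version.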
Timings
text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
  substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
  sst <- strsplit(text, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0
Solution 2
There are two easy possibilities:
s <- "aabbccccdd"
gregexpr and regmatches:
regmatches(s, gregexpr(".{2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
strsplit:
strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
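When the string length is not a multiple of the chunk size, the pattern ".{2}" matches only complete pairs, so a leftover character is dropped. A minimal sketch of one way to keep it, assuming a bounded quantifier is acceptable; chunk_regex is a hypothetical name introduced here for illustration:

```r
# Hypothetical helper: regex-based chunking that keeps a short final piece.
chunk_regex <- function(s, n) {
  # ".{1,n}" is greedy: it takes n characters while it can,
  # then whatever remains (between 1 and n - 1 characters) at the end
  pattern <- sprintf(".{1,%d}", n)
  regmatches(s, gregexpr(pattern, s))[[1]]
}

chunk_regex("aabbccccdde", 2)
# [1] "aa" "bb" "cc" "cc" "dd" "e"
```

Building the pattern with sprintf also makes the chunk length a parameter rather than a literal in the regex.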
Solution 3
string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)
# the indices where each substr will start
starts <- seq(1, num.chars, by=2)
# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii+1)
})
Which gives
[1] "aa" "bb" "cc" "cc" "dd"
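Because substring accepts vectors for its start and end arguments, the sapply loop is not strictly needed. A sketch of the same start-index idea with the chunk length as a parameter; chunk_substring is a hypothetical name, not something from the answer:

```r
# Hypothetical helper: vectorised variant of the start-index approach.
chunk_substring <- function(string, n) {
  if (!nzchar(string)) return(character(0))
  # Indices where each piece starts
  starts <- seq(1, nchar(string), by = n)
  # An end index past the end of the string is simply truncated,
  # so the last piece comes out short when needed
  substring(string, starts, starts + n - 1)
}

chunk_substring("aabbccccdd", 2)
chunk_substring("hello world", 3)
```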
Solution 4
One can use a matrix to group the characters:
s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}
s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"
Unfortunately, this breaks for an input of odd string length, giving a warning:
s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
## data length [3] is not a sub-multiple or multiple of the number of rows [2]
More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:
g1('abc')
## [1] "ab"
g2('abc')
## [1] "ab" "cb"
Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:
s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}
s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d"
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld"
(It is indeed slower than g2, but faster than g1 by about a factor of 7.)
Solution 5
Ugly but works
sequenceString <- "ATGAATAAAG"
J <- 3  # maximum sequence length in file
sequenceSmallVecStart <-
  substring(sequenceString, seq(1, nchar(sequenceString) - J + 1, J),
            seq(J, nchar(sequenceString), J))
sequenceSmallVecEnd <-
  substring(sequenceString, max(seq(J, nchar(sequenceString), J)) + 1)
sequenceSmallVec <-
  c(sequenceSmallVecStart, sequenceSmallVecEnd)
cat(sequenceSmallVec, sep = "\n")
Gives ATG AAT AAA G
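One edge case worth noting: when the string length is an exact multiple of J, the trailing piece is an empty string, and when the string is shorter than J the first seq() call fails. A sketch of the same full-pieces-plus-remainder approach with both cases guarded; split_tail and its argument names are hypothetical, chosen only for this illustration:

```r
# Hypothetical helper: full pieces plus remainder, with edge cases guarded.
split_tail <- function(x, J) {
  n <- nchar(x)
  # Nothing to split if the string fits in a single piece
  if (n <= J) return(x)
  # End positions of all complete pieces of length J
  full_ends <- seq(J, n, by = J)
  full <- substring(x, full_ends - J + 1, full_ends)
  # Whatever is left after the last complete piece (possibly "")
  rest <- substring(x, max(full_ends) + 1)
  # Keep the remainder only when it is non-empty
  c(full, rest[nzchar(rest)])
}

cat(split_tail("ATGAATAAAG", 3), sep = "\n")
```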
MadSeb
Updated on April 24, 2020
Comments
-
MadSeb about 4 years
I have a string such as:
"aabbccccdd"
I want to break this string into a vector of substrings of length 2:
"aa" "bb" "cc" "cc" "dd"
-
mindless.panda almost 12 years
Interesting, didn't know about substring. Much nicer since substr doesn't take vector args for start/end.
-
MadSeb almost 12 years
Brilliant! The second version is really, really fast!
-
jackStinger over 11 years
I was wondering if there was something like this that would split "aabbbcccccdd" into aa bbb ccccc dd. I use gregexpr at the moment.
-
GSee about 11 years
If it's possible to have an odd number of characters, then it seems to me it would be faster to handle that after the fact than to introduce an apply loop. I bet this is faster:
out <- g2(x); if (nchar(x) %% 2 == 1L) out[length(out)] <- substring(out[length(out)], 1, 1); out
-
Joe almost 10 years
@GSee You might want to re-post the g2 portion of this answer on the question this is a duplicate of: stackoverflow.com/questions/2247045/…
-
mathematical.coffee almost 9 years
Got any tricks to extend the fast version to arbitrary chunk length n?
-
GSee almost 9 years
@mathematical.coffee maybe something like this:
do.call(paste0, lapply(seq_len(n), function(i) { idx <- rep(FALSE, n); idx[i] <- TRUE; sst[idx] }))
but see my comment on Matthew's post about paying attention to whether your input is divisible by n
-
rjss over 4 years
These possibilities are equivalent for the proposed s, but what if s <- "aabbccccdde"? I like the second option better.
-
vwvan over 3 years
Double check that result:
  test replications elapsed relative user.self sys.self user.child sys.child
    g1          100   0.262    1.000     0.216    0.044          0         0
    g2          100   0.562    2.145     0.530    0.031          0         0
-
Øystein S over 2 years
The second option works for any number, e.g., strsplit(s, "(?<=.{11})", perl = TRUE)[[1]], while the first only works for single digits.