Split string by length in Golang

Solution 1

Make sure to convert your string into a slice of rune: see "Slice string into letters".

A for ... range loop over a string automatically decodes it into runes, so no additional code is needed in this case to convert the string to rune first.

for i, r := range s {
    fmt.Printf("i%d r %c\n", i, r)
    // every 3 i, do something
}

a[n:n+3] will work best with a being a slice of rune.

In a slice of rune the index increases by one for every rune, whereas when ranging over a string it can increase by more than one: for "世界", i would be 0 and 3, because a character (rune) can be formed of multiple bytes.
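
The difference between byte offsets and rune indexes can be checked with a small program. This is only a sketch: the helper name byteOffsets and the sample string are assumptions made for illustration, not part of the original answer.

```go
package main

import "fmt"

// byteOffsets (hypothetical helper) returns the index that range reports
// for each rune of s. These are byte offsets into the UTF-8 encoding,
// not rune positions.
func byteOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "世界abc"

	fmt.Println(byteOffsets(s)) // [0 3 6 7 8]

	// After converting to []rune, index n refers to the n-th character,
	// so a[n:n+3] selects exactly three characters.
	a := []rune(s)
	fmt.Println(len(s), len(a)) // 9 5
	fmt.Println(string(a[0:3])) // 世界a
}
```

Each of the two CJK characters takes three bytes, which is why the offsets jump from 0 to 3 to 6 before advancing one byte at a time.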


For instance, consider s := "世a界世bcd界efg世": 12 runes. (see play.golang.org)

If you range over the string itself, the index is a byte offset, so a naive "split every 3 chars" implementation never sees some of the indexes where (i+1)%3 == 0 (here 2, 5, 8 and 14), because the index jumps past those values:

for i, r := range s {
    res = res + string(r)
    fmt.Printf("i %d r %c\n", i, r)
    if i > 0 && (i+1)%3 == 0 {
        fmt.Printf("=>(%d) '%v'\n", i, res)
        res = ""
    }
}

The output:

i  0 r 世
i  3 r a   <== miss i==2
i  4 r 界
i  7 r 世  <== miss i==5
i 10 r b  <== miss i==8
i 11 r c  ===============> would print '世a界世bc', not exactly '3 chars'!
i 12 r d
i 13 r 界
i 16 r e  <== miss i==14
i 17 r f  ===============> would print 'd界ef'
i 18 r g
i 19 r 世 <== miss the rest of the string

But if you were to iterate on runes (a := []rune(s)), you would get what you expect, as the index would increase one rune at a time, making it easy to aggregate exactly 3 characters:

for i, r := range a {
    res = res + string(r)
    fmt.Printf("i%d r %c\n", i, r)
    if i > 0 && (i+1)%3 == 0 {
        fmt.Printf("=>(%d) '%v'\n", i, res)
        res = ""
    }
}

Output:

i 0 r 世
i 1 r a
i 2 r 界 ===============> would print '世a界'
i 3 r 世
i 4 r b
i 5 r c ===============> would print '世bc'
i 6 r d
i 7 r 界
i 8 r e ===============> would print 'd界e'
i 9 r f
i10 r g
i11 r 世 ===============> would print 'fg世'

Solution 2

Here is another variant (see the playground link). It is far more efficient in both speed and memory than the other answers. The benchmarks are linked as well if you want to run them. In general it is about 5 times faster than the previous version, which was already the fastest answer.

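// Chunks splits s into substrings of at most chunkSize runes each.
// It returns nil for an empty string or a non-positive chunkSize.
func Chunks(s string, chunkSize int) []string {
    if len(s) == 0 || chunkSize <= 0 {
        return nil
    }
    // len(s) counts bytes, which is never less than the rune count.
    if chunkSize >= len(s) {
        return []string{s}
    }
    // Capacity is an upper bound: it assumes one byte per rune.
    chunks := make([]string, 0, (len(s)-1)/chunkSize+1)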
    currentLen := 0
    currentStart := 0
    for i := range s {
        if currentLen == chunkSize {
            chunks = append(chunks, s[currentStart:i])
            currentLen = 0
            currentStart = i
        }
        currentLen++
    }
    chunks = append(chunks, s[currentStart:])
    return chunks
}

Note that when iterating over a string, the index points to the first byte of each rune. A rune takes from 1 to 4 bytes. Slicing also treats the string as a byte array, so slicing at an index reported by range always cuts at a rune boundary.

PREVIOUS SLOWER ALGORITHM

The code is in the playground. Converting from bytes to runes and then back to bytes actually takes a lot of time, so it is better to use the fast algorithm at the top of this answer.

func ChunksSlower(s string, chunkSize int) []string {
    if chunkSize >= len(s) {
        return []string{s}
    }
    var chunks []string
    chunk := make([]rune, chunkSize)
    n := 0 // number of runes currently buffered in chunk
    for _, r := range s {
        chunk[n] = r
        n++
        if n == chunkSize {
            chunks = append(chunks, string(chunk))
            n = 0
        }
    }
    if n > 0 {
        chunks = append(chunks, string(chunk[:n]))
    }
    return chunks
}

Note that these two algorithms treat invalid UTF-8 sequences differently. The first one keeps them as is, whereas the second one replaces them with the utf8.RuneError symbol ('\uFFFD'), which has the following hexadecimal representation in UTF-8: efbfbd.
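
The difference can be observed with a string that contains a single invalid byte. The following is a sketch only: the helper name viaRunes and the input "a\xffb" are assumptions chosen for the demonstration.

```go
package main

import "fmt"

// viaRunes (hypothetical helper) converts s to []rune and back, which is
// what an approach based on a rune buffer effectively does to its input.
func viaRunes(s string) string {
	return string([]rune(s))
}

func main() {
	s := "a\xffb" // \xff is not valid UTF-8

	// Ranging over the string decodes the invalid byte as U+FFFD.
	for i, r := range s {
		fmt.Printf("%d: %U\n", i, r)
	}

	// Slicing keeps the original byte untouched.
	fmt.Printf("% x\n", s[1:2]) // ff

	// The round trip through runes re-encodes U+FFFD as three bytes.
	fmt.Printf("% x\n", viaRunes(s)) // 61 ef bf bd 62
}
```

So a chunk produced by slicing can be written back byte for byte, while a chunk rebuilt from runes can be longer than the input it came from.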

Solution 3

I also needed a function to do this recently (see the linked example usage):

func SplitSubN(s string, n int) []string {
    sub := ""
    subs := []string{}

    runes := []rune(s) // index by character, not by byte
    l := len(runes)
    for i, r := range runes {
        sub = sub + string(r)
        if (i + 1) % n == 0 {
            subs = append(subs, sub)
            sub = ""
        } else if (i + 1) == l {
            subs = append(subs, sub)
        }
    }

    return subs
}
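
The loop above builds each chunk by repeated string concatenation, which creates a new string on every step. A possible variant that collects the runes in a strings.Builder is sketched below. The name splitSubNBuilder and the choice to return nil for a non-positive n are assumptions of this sketch, not part of the original answer.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSubNBuilder (hypothetical variant) splits s into pieces of n runes,
// collecting each piece in a strings.Builder instead of concatenating
// strings. It returns nil when n is not positive.
func splitSubNBuilder(s string, n int) []string {
	if n <= 0 {
		return nil
	}
	var subs []string
	var b strings.Builder
	count := 0 // runes written to b since the last reset
	for _, r := range s {
		b.WriteRune(r)
		count++
		if count == n {
			subs = append(subs, b.String())
			b.Reset()
			count = 0
		}
	}
	if count > 0 {
		subs = append(subs, b.String()) // trailing partial piece
	}
	return subs
}

func main() {
	fmt.Println(splitSubNBuilder("helloworld", 3)) // [hel low orl d]
}
```

Ranging over the string directly also avoids building the intermediate slice of runes.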

Solution 4

Here is another example (you can try it in the playground):

package main

import (
    "fmt"
    "strings"
)

func ChunkString(s string, chunkSize int) []string {
    var chunks []string
    runes := []rune(s)

    if len(runes) == 0 {
        return []string{s}
    }

    for i := 0; i < len(runes); i += chunkSize {
        nn := i + chunkSize
        if nn > len(runes) {
            nn = len(runes)
        }
        chunks = append(chunks, string(runes[i:nn]))
    }
    return chunks
}

func main() {
    fmt.Println(ChunkString("helloworld", 3))
    fmt.Println(strings.Join(ChunkString("helloworld", 3), "\n"))
}

Solution 5

An easy solution using regex

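// \S{3} matches only full groups of three non-space characters,
// so a shorter trailing group (the final "d" here) is not returned.
re := regexp.MustCompile(`(\S{3})`)
x := re.FindAllStringSubmatch("HelloWorld", -1)
fmt.Println(x)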

https://play.golang.org/p/mfmaQlSRkHe
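
A pattern that requires exactly three characters cannot match a shorter group at the end of the input. One way to keep that remainder is a bounded repetition, sketched here with FindAllString. The variable name chunkPattern and the pattern `.{1,3}` are assumptions of this sketch rather than the original answer's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// chunkPattern (hypothetical) matches one to three characters, so a shorter
// final group is kept. By default "." does not match a newline.
var chunkPattern = regexp.MustCompile(`.{1,3}`)

func main() {
	fmt.Println(chunkPattern.FindAllString("helloworld", -1)) // [hel low orl d]
	fmt.Println(chunkPattern.FindAllString("世a界世b", -1))      // [世a界 世b]
}
```

Go's regexp package matches on UTF-8 code points, so the groups are counted in characters rather than bytes.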

Author: Fernando Parra

Updated on October 01, 2021

Comments

  • Fernando Parra over 2 years

    Does anyone know how to split a string in Golang by length?

    For example, to split "helloworld" after every 3 characters, so that it ideally returns an array of "hel" "low" "orl" "d"?

    Alternatively, a possible solution would be to append a newline after every 3 characters.

    All ideas are greatly appreciated!

    • Volker over 9 years
      Well, some programming might help here? Like s[n:n+3]+"\n"?
  • rahul over 6 years
    Did you run it in the Go Playground? Could you please let me know what you didn't understand?
  • PLG almost 6 years
    Should be FindAllString instead of FindAllStringSubmatch, no?
  • Igor Mikushkin about 4 years
    This is by far the best answer here. However, the len(runes) check looks unnecessary. You can check len(s) and return nil or an empty array. This way you can define runes and chunks after this check.
  • VonC about 4 years
    Interesting. Upvoted. Certainly more efficient than my 6-year-old answer!
  • Igor Mikushkin about 4 years
    It is very inefficient because of string concatenations. There are answers that don't use them.
  • Igor Mikushkin about 4 years
    Actually, I have now written my own answer, which is more efficient.
  • TomOnTime almost 4 years
    This should be added to the standard library.
  • ardnew almost 3 years
    You can improve performance over this by using strings.Builder (see the linked playground and benchmarks).
  • Igor Mikushkin almost 3 years
    @ardnew Thanks a lot for pointing it out! It shows that copying is a performance hit. However, this builder does some copying by itself, and it is not needed here. Instead I came up with an even faster version that I will post here soon.
  • Mark Robson almost 3 years
    This is by far the best example. Judging by the number of upvotes it has, it seems Go programmers like to write lots of code!
  • leonardo over 2 years
    Over-engineering; this is a simple problem.
  • Junru Zhu over 2 years
    This loses the last item. DON'T USE.