How do I iterate over the words of a string?

2,330,209

Solution 1

For what it's worth, here's another way to extract tokens from an input string, relying only on standard library facilities. It's an example of the power and elegance behind the design of the STL.

#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string sentence = "And I feel fine...";
    istringstream iss(sentence);
    copy(istream_iterator<string>(iss),
         istream_iterator<string>(),
         ostream_iterator<string>(cout, "\n"));
}

Instead of copying the extracted tokens to an output stream, one could insert them into a container, using the same generic copy algorithm.

vector<string> tokens;
copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter(tokens));

... or create the vector directly:

vector<string> tokens{istream_iterator<string>{iss},
                      istream_iterator<string>{}};
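
Put together as a reusable function, the container variant above might look like the sketch below. The name split_words is hypothetical (it does not appear in the answer), and the vector is built with parentheses rather than braces purely as a matter of taste.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (name not from the answer): collects the
// whitespace-separated words of `sentence` into a vector.
std::vector<std::string> split_words(const std::string& sentence) {
    std::istringstream iss(sentence);
    // istream_iterator<string> reads with operator>>, which skips any run
    // of whitespace; the default-constructed iterator marks end of stream.
    return std::vector<std::string>(std::istream_iterator<std::string>(iss),
                                    std::istream_iterator<std::string>());
}
```

Calling split_words("And I feel fine...") yields four tokens, and an empty or all-whitespace input yields an empty vector.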

Solution 2

I use this to split a string by a delimiter. The first overload writes the results to an output iterator supplied by the caller (for example, a back_inserter on an existing vector); the second returns a new vector.

#include <string>
#include <sstream>
#include <vector>
#include <iterator>

template <typename Out>
void split(const std::string &s, char delim, Out result) {
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        *result++ = item;
    }
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, std::back_inserter(elems));
    return elems;
}

Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:

std::vector<std::string> x = split("one:two::three", ':');
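
To make the empty-token behaviour concrete, here is a minimal sketch of the same getline loop with an optional switch for dropping empty tokens. Both the name split_on and the skip_empty parameter are hypothetical additions for illustration; they are not part of the answer's code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of the split above; `skip_empty` is an
// illustrative extra parameter, not part of the original answer.
std::vector<std::string> split_on(const std::string& s, char delim,
                                  bool skip_empty = false) {
    std::vector<std::string> elems;
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        if (skip_empty && item.empty()) {
            continue;  // drop the "" found between adjacent delimiters
        }
        elems.push_back(item);
    }
    return elems;
}
```

With the default argument, split_on("one:two::three", ':') returns four elements, the third of which is empty; passing true as the last argument returns three. A trailing delimiter does not produce a trailing empty token, because getline reports failure once nothing is left to read.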

Solution 3

A possible solution using Boost might be:

#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));

This approach might be even faster than the stringstream approach. And since this is a generic template function it can be used to split other types of strings (wchar, etc. or UTF-8) using all kinds of delimiters.

See the documentation for details.

Solution 4

#include <vector>
#include <string>
#include <sstream>

int main()
{
    std::string str("Split me by whitespaces");
    std::string buf;                 // Have a buffer string
    std::stringstream ss(str);       // Insert the string into a stream

    std::vector<std::string> tokens; // Create vector to hold our words

    while (ss >> buf)
        tokens.push_back(buf);

    return 0;
}

Solution 5

For those who are uncomfortable sacrificing all efficiency for code size, and who see "efficient" as a kind of elegance, the following should hit a sweet spot (and I think the templated container parameter is an awesomely elegant addition):

template < class ContainerT >
void tokenize(const std::string& str, ContainerT& tokens,
              const std::string& delimiters = " ", bool trimEmpty = false)
{
   std::string::size_type pos, lastPos = 0, length = str.length();

   using value_type = typename ContainerT::value_type;
   using size_type  = typename ContainerT::size_type;

   while(lastPos < length + 1)
   {
      pos = str.find_first_of(delimiters, lastPos);
      if(pos == std::string::npos)
      {
         pos = length;
      }

      if(pos != lastPos || !trimEmpty)
         tokens.push_back(value_type(str.data()+lastPos,
               (size_type)pos-lastPos ));

      lastPos = pos + 1;
   }
}

I usually choose to use std::vector<std::string> types as my second parameter (ContainerT)... but list<> is way faster than vector<> for when direct access is not needed, and you can even create your own string class and use something like std::list<subString> where subString does not do any copies for incredible speed increases.

It's more than twice as fast as the fastest tokenizer on this page and almost 5 times faster than some others. Also, with the right parameter types you can eliminate all string and list copies for additional speed increases.

Additionally it does not do the (extremely inefficient) return of result, but rather it passes the tokens as a reference, thus also allowing you to build up tokens using multiple calls if you so wished.

Lastly it allows you to specify whether to trim empty tokens from the results via a last optional parameter.

All it needs is std::string... the rest are optional. It does not use streams or the boost library, but is flexible enough to be able to accept some of these foreign types naturally.
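
For anyone wanting a usage example, here is a sketch. It repeats tokenize() from above so that it compiles on its own, and adds a small wrapper, tokens_of, which is a hypothetical convenience (it returns by value, which the answer itself deliberately avoids).

```cpp
#include <list>
#include <string>
#include <vector>

// tokenize() as given in the answer, repeated so the sketch is self-contained.
template <class ContainerT>
void tokenize(const std::string& str, ContainerT& tokens,
              const std::string& delimiters = " ", bool trimEmpty = false)
{
    std::string::size_type pos, lastPos = 0, length = str.length();

    using value_type = typename ContainerT::value_type;
    using size_type  = typename ContainerT::size_type;

    while (lastPos < length + 1) {
        // Any character in `delimiters` ends the current token.
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string::npos) {
            pos = length;
        }
        if (pos != lastPos || !trimEmpty) {
            tokens.push_back(value_type(str.data() + lastPos,
                                        (size_type)pos - lastPos));
        }
        lastPos = pos + 1;
    }
}

// Hypothetical wrapper, for illustration only (not part of the answer).
std::vector<std::string> tokens_of(const std::string& s,
                                   const std::string& delimiters = " ",
                                   bool trimEmpty = false)
{
    std::vector<std::string> out;
    tokenize(s, out, delimiters, trimEmpty);
    return out;
}
```

tokens_of("a,b;;c", ",;") gives four tokens including one empty string, and tokens_of("a,b;;c", ",;", true) gives three. Because tokenize() appends to whatever container it is handed, calling it twice on the same std::list<std::string> accumulates the tokens of both inputs.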

Author by

Markus Joschko

I work with GPUs on deep learning and computer vision.

Updated on July 08, 2022

Comments

  • Markus Joschko
    Markus Joschko almost 2 years

    I'm trying to iterate over the words of a string.

    The string can be assumed to be composed of words separated by whitespace.

    Note that I'm not interested in C string functions or that kind of character manipulation/access. Also, please give precedence to elegance over efficiency in your answer.

    The best solution I have right now is:

    #include <iostream>
    #include <sstream>
    #include <string>
    
    using namespace std;
    
    int main()
    {
        string s = "Somewhere down the road";
        istringstream iss(s);
    
        do
        {
            string subs;
            iss >> subs;
            cout << "Substring: " << subs << endl;
        } while (iss);
    }
    

    Is there a more elegant way to do this?

    • Admin
      Admin over 15 years
      Dude... Elegance is just a fancy way to say "efficiency-that-looks-pretty" in my book. Don't shy away from using C functions and quick methods to accomplish anything just because it is not contained within a template ;)
    • pyon
      pyon over 14 years
      while (iss) { string subs; iss >> subs; cout << "Substring: " << subs << endl; }
    • Ofer
      Ofer over 13 years
      @nlaq, Except that you'd have to convert your string object using c_str(), and back to a string again if you still needed it to be a string, no?
    • Tony Delroy
      Tony Delroy about 12 years
      @Eduardo: that's wrong too... you need to test iss between trying to stream another value and using that value, i.e. string sub; while (iss >> sub) cout << "Substring: " << sub << '\n';
    • James Oravec
      James Oravec about 11 years
      How about a string tokenizer: cplusplus.com/reference/cstring/strtok
    • hB0
      hB0 over 10 years
      Various options in C++ to do this by default: cplusplus.com/faq/sequences/strings/split
    • Matt
      Matt about 7 years
      There's more to elegance than just pretty efficiency. Elegant attributes include low line count and high legibility. IMHO Elegance is not a proxy for efficiency but maintainability.
    • Konchog
      Konchog over 5 years
      Most of the answers here are notably latin-centric. Many of the answers assume a single character can be used as 'whitespace' even though the question defines the delimiter to be whitespace. Unicode has at least 25 whitespace characters. But word-delimiting is not merely a whitespace issue. For instance, in syllabic writing, such as Tibetan, word delimitation is a semantic, rather than syntactic, problem. Therefore, using whitespace to extract words is not a suitable approach for many languages.
    • Martin York
      Martin York about 5 years
      Small addition to the above. You can add a locale facet that treats punctuation as space so you don't need to handle that separately. codereview.stackexchange.com/a/57467/507
    • ttulinsky
      ttulinsky over 3 years
      Your original code is more elegant than the answers.
  • Markus Joschko
    Markus Joschko over 15 years
    I'm aware of the C string functions and I'm aware of the performance issues too (both of which I've noted in my question). However, for this specific question, I'm looking for an elegant C++ solution.
  • Admin
    Admin over 15 years
    ... and you don't want to just build an OO wrapper over the C functions why?
  • paercebal
    paercebal over 15 years
    @Nelson LaQuet: Let me guess: Because strtok is not reentrant?
  • Jason
    Jason over 15 years
    @Nelson don't ever pass string.c_str() to strtok! strtok trashes the input string (inserts '\0' chars to replace each found delimiter) and c_str() returns a non-modifiable string.
  • Admin
    Admin over 15 years
    char* ch = new char[str.size()]; strcpy(ch, str.c_str()); ... delete[] ch; // problem solved.
  • Tom
    Tom about 15 years
    Speed is irrelevant here, as both of these cases are much slower than a strtok-like function.
  • l3dx
    l3dx almost 15 years
    Is it possible to specify a delimiter for this? Like for instance splitting on commas?
  • Ankit Roy
    Ankit Roy over 14 years
    This is practical and quick enough if you know the line will contain just a few tokens, but if it contains many then you will burn a ton of memory (and time) growing the vector. So no, it's not faster than the stringstream solution -- at least not for large n, which is the only case where speed matters.
  • Ankit Roy
    Ankit Roy over 14 years
    @Nelson: That array needs to be of size str.size() + 1 in your last comment. But I agree with your thesis that it's silly to avoid C functions for "aesthetic" reasons.
  • Ankit Roy
    Ankit Roy over 14 years
    I was tempted to +1 this answer for its simple, readable code (which I presume rubbed an elegantophile the wrong way, hence the -1), but then I saw that you allocated a fixed-size array of strings to hold the tokens. Come on, you know that's gonna break at the worst possible moment! :)
  • Jonathan
    Jonathan over 14 years
    @l3dx: it seems that the parameter "\n" is the delimiter. This code is very nice, but I would like to know better about it. Maybe somebody could explain each line of that snippet?
  • huy
    huy over 14 years
    @Jonathan: \n is not the delimiter in this case, it's the delimiter for outputting to cout.
  • boskom
    boskom almost 14 years
    Elegant solution, I always forget about this particular "getline", though I do not believe it is aware of quotes and escape sequences.
  • Roman Starkov
    Roman Starkov almost 14 years
    And for those who don't already have boost... bcp copies over 1,000 files for this :)
  • littlebroccoli
    littlebroccoli almost 14 years
    strtok is from the C standard library, not C++. It is not safe to use in multithreaded programs. It modifies the input string.
  • littlebroccoli
    littlebroccoli almost 14 years
    Because it stores the char pointer from the first call in a static variable, so that on the subsequent calls when NULL is passed, it remembers what pointer should be used. If a second thread calls strtok when another thread is still processing, this char pointer will be overwritten, and both threads will then have incorrect results. mkssoftware.com/docs/man3/strtok.3.asp
  • abatishchev
    abatishchev almost 14 years
    Is it possible to declare word as a char?
  • gnomed
    gnomed almost 14 years
    Sorry abatishchev, C++ is not my strong point. But I imagine it would not be difficult to add an inner loop to loop through every character in each word. But right now I believe the current loop depends on spaces for word separation. Unless you know that there is only a single character between every space, in which case you can just cast "word" to a char... sorry I can't be of more help, I've been meaning to brush up on my C++
  • systemsfault
    systemsfault almost 14 years
    as mentioned before strtok is unsafe and even in C strtok_r is recommended for use
  • Wayne Werner
    Wayne Werner almost 14 years
    based on this: cplusplus.com/reference/algorithm/copy no. The whitespace behavior is a function of the istream_iterator. It would be more elegant to roll your own.
  • Wayne Werner
    Wayne Werner almost 14 years
    if you declare word as a char it will iterate over every non-whitespace character. It's simple enough to try: stringstream ss("Hello World, this is*@#&$(@ a string"); char c; while(ss >> c) cout << c;
  • user276641
    user276641 over 13 years
    @graham.reeds, @l3dx: Please don't write another CSV parser which can't handle quoted fields: en.wikipedia.org/wiki/Comma-separated_values
  • Jason
    Jason over 13 years
    @stijn: are you saying that split("one two three", ' '); returns a vector with 4 elements? I'm not sure that is the case, but I'll test it.
  • stijn
    stijn over 13 years
    wait, it seems the formatting removed some spaces (or I forgot them): I'm talking about the string "one two three" with 2 spaces between "two" and "three"
  • SmallChess
    SmallChess over 13 years
    This is a poor solution as it doesn't take any other delimiter, therefore not scalable and not maintable.
  • user470379
    user470379 over 13 years
    To people asking how this works: equivalent code using less of the STL would look like string token; istringstream iss(sentence); while (iss >> token) { cout << token; } or { tokens.push_back(token); }
  • szx
    szx about 13 years
    Why do I get "error C2664: 'std::back_inserter' : cannot convert parameter 1 from 'std::vector<_Ty> (__cdecl *)(void)' to 'std::vector<_Ty> &'" in VS2008?
  • Erik Aronesty
    Erik Aronesty over 12 years
    strtok_r can be used if you are in a section of code that may be accessed. this is the only solution of all of the above that isn't "line noise", and is a testament to what, exactly, is wrong with c++
  • Offirmo
    Offirmo over 12 years
    Warning: when given an empty string (""), this method returns a vector containing the "" string. So add an "if (!string_to_split.empty())" before the split.
  • Admin
    Admin over 12 years
    I'm quite a fan of this, but for g++ (and probably good practice) anyone using this will want typedefs and typenames: typedef ContainerT Base; typedef typename Base::value_type ValueType; typedef typename ValueType::size_type SizeType; Then to substitute out the value_type and size_types accordingly.
  • gregschlom
    gregschlom over 12 years
    The first version is simple and gets the job done perfectly. The only change I would made would be to return the result directly, instead of passing it as a parameter.
  • ACK_stoverflow
    ACK_stoverflow over 12 years
    @Ian Embedded developers aren't all using boost.
  • Drake
    Drake over 12 years
    The output is passed as a parameter for efficiency. If the result were returned it would require either a copy of the vector, or a heap allocation which would then have to be freed.
  • gregschlom
    gregschlom over 12 years
    My bad, I was wrongly assuming that the STL would use lazy copy, as Qt containers do. Too bad they don't.
  • WDRust
    WDRust about 12 years
    @ACK_stoverflow are embedded developers using C++ anyway?
  • PlasmaHH
    PlasmaHH about 12 years
    That doesn't split by arbitrary whitespace and produces empty strings for subsequent whitespace.
  • Xander Tulip
    Xander Tulip about 12 years
    Inefficient and you're deriving from an STL container - possibly one of the worst things you could do.
  • Xander Tulip
    Xander Tulip about 12 years
    You forgot to add to use list: "extremely inefficient"
  • Luis Machuca
    Luis Machuca about 12 years
    bcp'ing this brings forth libraries such as the MPL, which I think is really hardly needed to split text. Man it is a PITA...
  • Marco M.
    Marco M. about 12 years
    @XanderTulip, can you be more constructive and explain how or why?
  • Tony Delroy
    Tony Delroy about 12 years
    @j_random_hacker: "at least not for large n, which is the only case where speed matters" - also for smallish n in a large-n loop...
  • Tony Delroy
    Tony Delroy about 12 years
    @tuxSlayer: various POSIX/XOPEN/UNIX standards also specify strtok_r
  • tuxSlayer
    tuxSlayer about 12 years
    @TonyDelroy: Yeah, and it looks like in msvc it is called strtok_s (meaning safe?:)). Not too portable...
  • Joseph Garvin
    Joseph Garvin about 12 years
    @XanderTulip: I assume you are referring to it returning the vector by value. The Return-Value-Optimization (RVO, google it) should take care of this. Also in C++11 you could return by move reference.
  • Nawaz
    Nawaz almost 12 years
    The template argument to back_inserter should be string, not vector<string>. That is, it should be back_inserter<string>(tokens), not back_inserter<vector<string>>(tokens).
  • Tony Delroy
    Tony Delroy almost 12 years
    @tuxSlayer: if you'd prefer to write your own implementation instead of have a five line #if/#else/#endif then knock yourself out....
  • Nils
    Nils almost 12 years
    Use std::string::find(..) and std::string::substr(..) no need to use boost.
  • Kit10
    Kit10 almost 12 years
    I liked this solution, however, I wrapped the function in a template, changing the vectors std::string template parameter into a parameter. For me, I also used boost::lexical_cast on said template parameter in the push_back.
  • Wes Miller
    Wes Miller almost 12 years
    For those of us for whom the template stuff and the first comment are completely foreign, a usage example complete with required includes would be lovely.
  • Wes Miller
    Wes Miller almost 12 years
    Ahh well, I figured it out. I put the C++ lines from aws' comment inside the function body of tokenize(), then edited the tokens.push_back() lines to change the ContainerT::value_type to just ValueType and changed (ContainerT::value_type::size_type) to (SizeType). Fixed the bits g++ had been whining about. Just invoke it as tokenize( some_string, some_vector );
  • AndersK
    AndersK over 11 years
    actually in our company we are not allowed to use boost due to security, yeah i know but suits have decided.
  • Mihai Bişog
    Mihai Bişog over 11 years
    This can actually be optimized further: instead of .push_back(str.substr(...)) one can use .emplace_back(str, start, pos - start). This way the string object is constructed in the container and thus we avoid a move operation + other shenanigans done by the .substr function.
  • Marco M.
    Marco M. over 11 years
    @zoopp yes. Good idea. VS10 didn't have emplace_back support when I wrote this. I will update my answer. Thanks
  • user997112
    user997112 over 11 years
    Could someone be so kind as to provide a "summary" as to why this code has much greater performance?
  • Alexei Sholik
    Alexei Sholik over 11 years
    Take a look at ranges if you care about elegance in practical terms (i.e. do more with less code): slideshare.net/rawwell/iteratorsmustgo
  • thecoshman
    thecoshman over 11 years
    This is just a great big ugly wall of code. You should explain the logic behind it.
  • キキジキ
    キキジキ over 11 years
    How can I modify it to work with std::wstring, std::getline won't work right?
  • Jason
    Jason over 11 years
    std::getline is templated, so it may "just work", if not see en.cppreference.com/w/cpp/string/basic_string/getline to figure out how to tweak it. Passing a wchar_t character as the delim may be enough to trigger the right template.
  • Marius
    Marius over 11 years
    Apart from running a few performance tests on sample data, primarily I've reduced it to as few as possible instructions and also as little as possible memory copies enabled by the use of a substring class that only references offsets/lengths in other strings. (I rolled my own, but there are some other implementations). Unfortunately there is not too much else one can do to improve on this, but incremental increases were possible.
  • marko
    marko over 11 years
    Welcome to StackOverflow. Your answer would be improved if you described the code a bit further. What differentiates it from the other (very high scoring) answers on this old question?
  • Jerry Coffin
    Jerry Coffin over 11 years
    Actually, this can work just fine with other delimiters (though doing some is somewhat ugly). You create a ctype facet that classifies the desired delimiters as whitespace, create a locale containing that facet, then imbue the stringstream with that locale before extracting strings.
  • Oktalist
    Oktalist over 11 years
    This is the best answer here, if you only want to split on a single delimiter character. The original question wanted to split on whitespace though, meaning any combination of one or more consecutive spaces or tabs. You have actually answered stackoverflow.com/questions/53849
  • Clay
    Clay over 11 years
    The main purpose of istream_iterator is it can parse int, float, double, etc from an istream: istream_iterator<double> does a decent job reading doubles separated by space. With a front or especially back inserter it's a great combo! :)
  • legends2k
    legends2k over 11 years
    vector has a ctor that takes a begin and end iterator, so no need for the copy call to insert them into a container.
  • Christian Rau
    Christian Rau over 11 years
    @Kinderchocolate "The string can be assumed to be composed of words separated by whitespace" - Hmm, doesn't sound like a poor solution to the question's problem. "not scalable and not maintable" - Hah, nice one.
  • Christian Rau
    Christian Rau over 11 years
    @Nawaz Why should it? You're inserting into a std::vector<std::string> and not into a std::string. But then again, there shouldn't be an explicit template argument, anyway (well, there shouldn't even be a back_inserter or copy, but ok).
  • Nawaz
    Nawaz over 11 years
    @ChristianRau: Oh you're right; the first code-snippet probably confused me. Actually I should have said you don't need to mention the template argument in std::back_inserter; in fact, mentioning template argument defies the very purpose of back_inserter.
  • Peter M
    Peter M about 11 years
    I like this because it requires the minimum amount of extra headers. I might recommend an edit to make it follow best practice usage of namespaces (IE std:: in front of everything).
  • GMasucci
    GMasucci almost 11 years
    As an addendum: I use boost only when I must; normally I prefer to add to my own library of code, which is standalone and portable, so that I can achieve small, precise, specific code which accomplishes a given aim. That way the code is non-public, performant, trivial and portable. Boost has its place, but I would suggest that it's a bit of overkill for tokenising strings: you wouldn't have your whole house transported to an engineering firm to get a new nail hammered into the wall to hang a picture... they may do it extremely well, but the pros are by far outweighed by the cons.
  • Drake
    Drake almost 11 years
    A slight addendum to my comment above: this function could return the vector without penalty if using C++11 move semantics.
  • Rozuur
    Rozuur almost 11 years
    if you are enabling return value optimization, can't you make the function to return void?
  • Alex S
    Alex S almost 11 years
    @AlecThomas: Even before C++11, wouldn't most compilers optimise away the return copy via NRVO? (+1 anyway; very succinct)
  • user2083364
    user2083364 over 10 years
    nice it even works for calling of boost framework in xcode (iOS project) in cpp class
  • David G
    David G over 10 years
    In order to make it skip empty tokens, do an empty() check: if (!item.empty()) elems.push_back(item)
  • Alex Spencer
    Alex Spencer over 10 years
    @Peter M I would rather have it be passed in by reference, just in case the vector<string> got large.
  • herohuyongtao
    herohuyongtao over 10 years
    How about the delim contains two chars as ->?
  • Jason
    Jason over 10 years
    @herohuyongtao, this solution only works for single char delimiters.
  • stewart99
    stewart99 over 10 years
    why do you need to use curly brackets in vector<string> tokens{istream_iterator<string>{iss}, istream_iterator<string>{}}; is it because otherwise it looks like function call?
  • loop
    loop over 10 years
    @Copperpot How did you do it in a template?
  • duslabo
    duslabo over 10 years
    @EvanTeran This may not be about splitting the string but a general doubt about your code: you are passing elems as a reference argument and then returning the reference again. I just wanted to know, is there any reason for that?
  • Jason
    Jason over 10 years
    @JeshwanthKumarNK, it's not necessary, but it lets you do things like pass the result directly to a function like this: f(split(s, d, v)) while still having the benefit of a pre-allocated vector if you like.
  • Drake
    Drake about 10 years
    @Veritas In what way does it not work if the delimiter is the last character? Also, outputting empty tokens is intentional, though it could obviously be easily modified to not do that if required.
  • Erik Aronesty
    Erik Aronesty about 10 years
    Updated so there can be no objections on the grounds of thread safety from C++ wonks.
  • paulm
    paulm about 10 years
    "For example, printf and scanf both are faster then cin and cout" only because synchronization is enabled by default
  • EvilTeach
    EvilTeach almost 10 years
    strtok is evil. It treats two delimiters as a single delimiter if there is nothing between them.
  • mchiasson
    mchiasson over 9 years
    This would have been my favourite answer, but std::regex is broken in GCC 4.8. They said that they implemented it correctly in GCC 4.9. I am still giving you my +1
  • jww
    jww over 9 years
    "The STL does not have such a method available already" - what's wrong with string's find_first_of and using iterators to remember positions? Then, use substr to extract.
  • Brent Bradburn
    Brent Bradburn over 9 years
    Similar responses with maybe better regex approach: here, and here.
  • Ben Voigt
    Ben Voigt about 9 years
    @paulm: No, the slowness of C++ streams is caused by facets. They're still slower than stdio.h functions even when synchronization is disabled (and on stringstreams, which can't synchronize).
  • Ziyuan
    Ziyuan about 9 years
    Questions: 1. why would istream_iterator stop at white spaces? For me spaces are also part of the string; 2. why is it very inefficient?
  • Michael Trouw
    Michael Trouw about 9 years
    The elegance in needing 5 includes, 3 lines (not counting using <namespace>) and quite cryptic code to... split a string? dear god.
  • Andreas Spindler
    Andreas Spindler about 9 years
    There have been 2 revs. That's nice. Seems as if my English had too much "German" in it. However, the reviser did not fix two minor bugs, maybe because they were obvious anyway: first, std::isupper could be passed as the argument, not std::upper. Second, put a typename before the String::const_iterator.
  • Tiago
    Tiago about 9 years
    My personal opinion is that C and C++ are languages not meant to be agile or to provide fast-to-market solutions. Using Boost is almost the same as choosing a higher-level language that offers more abstraction; for those we choose Java, C#, etc., because with those we don't care exactly what it's doing beneath the hood. Using Boost would also mean that I would have to tell my client that I'm including a third-party library. Thanks anyway. :)
  • Andrew
    Andrew almost 9 years
    Using Boost is like using a Booster seat: no thanks.
  • Yetti99
    Yetti99 almost 9 years
    A for() loop looks better. Like this davekb.com/browse_programming_tips:strtok_r_example:txt
  • Spacen Jasset
    Spacen Jasset almost 9 years
    Out of all the answers this appears to be one of the most appealing and flexible, together with the getline with a delimiter, although that's a less obvious solution. Does the C++11 standard not have anything for this? Does C++11 support punch cards these days?
  • Guosheng
    Guosheng over 8 years
    Maybe there's a bug. Given "xxxabcyyyabczzzabc" and "abo", the split result is "xxx|cyyy|czzz|c".
  • Marius
    Marius over 8 years
    That's the correct output for when trimEmpty = true. Keep in mind that "abo" is not a delimiter in this answer, but the list of delimiter characters. It would be simple to modify it to take a single delimiter string of characters (I think str.find_first_of should change to str.find_first, but I could be wrong... can't test)
  • Moiz Sajid
    Moiz Sajid over 8 years
    We could also have used STL to split a string.
  • dshin
    dshin over 8 years
    Caveat: split("one:two::three", ':') and split("one:two::three:", ':') return the same value.
  • fmuecke
    fmuecke over 8 years
    please use std::vector instead of list
  • fmuecke
    fmuecke over 8 years
    that's actually quite nice. although I don't think erasing is the most efficient way and (2) what about keeping empty tokens?
  • fmuecke
    fmuecke over 8 years
    almost perfect: split(":abc:def:", ':'); returns only 3 instead of 4 elements!
  • fmuecke
    fmuecke over 8 years
    Finally a solution that is handling empty tokens correctly at both sides of the string
  • Kaz
    Kaz over 8 years
    @fmuecke There is no requirement in the question for a specific representation to use for the pieces of the string, hence there is no need to incorporate your suggestion into the answer.
  • LearnCocos2D
    LearnCocos2D over 8 years
    If you pass in an empty string, it returns a vector with 1 element (empty string). If you pass in a string that's the same as sep, then it returns a vector with 2 elements (both empty strings). Should have "if (end > 0) {" before the push_back in while loop and "if (start > 0) {" before push_back below while loop to fix this.
  • Slava
    Slava over 8 years
    Many questionable implementation details aside, this answer is the only one which does it lazily. I am really disappointed in the C++ world here. Well, streamiterator kind of does it too, but then everyone puts the result into vector<string>, killing all the benefits...
  • Drake
    Drake over 8 years
    @LearnCocos2D Please don't alter the meaning of a post with an edit. This behaviour is by design. It is identical behaviour to Python's split operator. I'll add a note to make this clear.
  • QuantumKarl
    QuantumKarl over 8 years
    This is my favorite with minor changes: vector returned as reference as you said, and the arguments "str" and "regex" passed by references also. thx.
  • Jonny
    Jonny over 8 years
    Being able to set max number of returned elements is crucial to me.
  • Jonny
    Jonny over 8 years
    Can someone please make it return up to a max N elements? Any remaining characters should end up in the last element.
  • Jason
    Jason over 8 years
    @Jonny, should be trivial, just add an extra condition to the while loop comparing the vector's size to the max. Something like this: while (elems.size() < max_count && std::getline(ss, item, delim)) {
  • Jason
    Jason over 8 years
    @Jonny, I see. Your answer looks a bit more complex than necessary. If you make the max default to something like size_t(-1), that will effectively be "infinity" (it's the biggest size your system can represent, so you'll run out of RAM before you hit this). Then you can make the condition as simple as my comment above. No more need to double check the stream state and do a second read and such. Just a suggestion :-).
  • Jonny
    Jonny over 8 years
    Might be wrong but you might lose the end of the string with that. Well basically I mimic the explode function of php, or so I believe.
  • Jason
    Jason over 8 years
    Gotcha. My solution will stop at max_count, skipping the rest of the string (since it found the amount it wanted). I guess you are looking for something that will always make the last one the rest of the string. I have some functions like that too here: github.com/eteran/cpp-utilities/blob/master/string.h Some are specifically designed to match php's string manipulation functions as closely as possible :-)
  • Gabriel
    Gabriel over 8 years
    Why not return split(s, delim, std::vector<std::string>()); ?
  • Jason
    Jason over 8 years
    @Gabriel, you could. But I think when it was written (a few years ago), having a named variable encouraged NRVO more reliably. With C++11 move semantics, it may make a lot less of a difference.
  • Pascal Kesseli
    Pascal Kesseli over 8 years
    Suggest using std::string::size_type instead of int, as some compilers might spit out signed/unsigned warnings otherwise.
  • David Doria
    David Doria over 8 years
    This only works for single character delimiters. A simple change lets it work with multicharacter: prev_pos = pos += delimiter.length();
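The answer this comment refers to is not shown here, so the following is only a sketch of the idea: advance past the whole delimiter, not a single character, after each match. The function name `split_on_string` is hypothetical; the variable names `pos` and `prev_pos` follow the comment.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: split `s` on a multi-character `delimiter`.
std::vector<std::string> split_on_string(const std::string& s,
                                         const std::string& delimiter) {
    std::vector<std::string> tokens;
    if (delimiter.empty()) {
        // An empty delimiter would match everywhere; return the input as-is.
        tokens.push_back(s);
        return tokens;
    }
    std::string::size_type prev_pos = 0;
    std::string::size_type pos = 0;
    while ((pos = s.find(delimiter, prev_pos)) != std::string::npos) {
        tokens.push_back(s.substr(prev_pos, pos - prev_pos));
        // Skip the full delimiter, not just one character.
        prev_pos = pos + delimiter.length();
    }
    tokens.push_back(s.substr(prev_pos));  // text after the last delimiter
    return tokens;
}
```

Note that this version keeps empty tokens, including leading and trailing ones.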
  • sajas
    sajas about 8 years
    Can the boost::split really work on the utf-8 string? Can you share any documentation for that? I am trying to split a utf-8 string at newlines. Will the boost::split work correctly if the string that I pass is using utf-8 encoding?
  • noɥʇʎԀʎzɐɹƆ
    noɥʇʎԀʎzɐɹƆ almost 8 years
    This is much faster than Evan Teran's answer if you only need to split on whitespace.
  • exilit
    exilit almost 8 years
    While the missing-delimiter concern is correct, one should take into account that the OP's solution couldn't handle that either. So this does not seem to be a requirement.
  • Alessandro Teruzzi
    Alessandro Teruzzi almost 8 years
    I like this solution because it allows the separator to be a string and not a char. However, it modifies the string in place, which forces the creation of a copy of the original string.
  • Marius
    Marius over 7 years
    Thanks @thomas-perl for the revision, it does indeed make it more readable and compact. My original implementation avoided the additional comparison per loop as I was optimizing for a very low latency application. Your edit will be more applicable to most users visiting here however.
  • Seppo Enarvi
    Seppo Enarvi about 7 years
    @doorfly The only place where curly brackets are needed is istream_iterator<string>{}, because that would otherwise be regarded as a function.
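A small sketch of the point about braces, using the vector construction from Solution 1. The wrapper function `words_of` is a hypothetical name added only so the snippet stands on its own.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper around the "create the vector directly" form.
std::vector<std::string> words_of(const std::string& sentence) {
    std::istringstream iss(sentence);

    // Pitfall (left commented out): with parentheses everywhere, the line
    // below is parsed as the declaration of a function named `tokens`,
    // not as a vector object.
    // std::vector<std::string> tokens(std::istream_iterator<std::string>(iss),
    //                                 std::istream_iterator<std::string>());

    // Braces on the end iterator cannot be read as a parameter declaration,
    // so the whole statement is an object definition.
    std::vector<std::string> tokens(std::istream_iterator<std::string>(iss),
                                    std::istream_iterator<std::string>{});
    return tokens;
}
```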
  • Timmmm
    Timmmm about 7 years
    This is quite nice. I feel like the code could be clearer though, e.g. end unexpectedly isn't s.end().
  • Galik
    Galik about 7 years
    @Timmmm Out of curiosity what would you suggest for pos, end and done?
  • Timmmm
    Timmmm about 7 years
    Also you can make it a bit simpler with find_first_of and find_first_not_of.
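As an illustration of the `find_first_of` / `find_first_not_of` suggestion, here is a minimal sketch. It is not Galik's or Timmmm's actual code (neither is shown in this thread), and the name `split_any` and its default delimiter set are assumptions.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: treat every character in `delims` as a separator
// and skip runs of separators, so no empty tokens are produced.
std::vector<std::string> split_any(const std::string& s,
                                   const std::string& delims = " \t\n") {
    std::vector<std::string> tokens;
    // Start of the first token: first character that is not a separator.
    std::string::size_type token_start = s.find_first_not_of(delims);
    while (token_start != std::string::npos) {
        // One past the end of the token: next separator, or npos at the end.
        const std::string::size_type token_end = s.find_first_of(delims, token_start);
        // When token_end is npos, substr clamps the count to the rest of s.
        tokens.push_back(s.substr(token_start, token_end - token_start));
        token_start = s.find_first_not_of(delims, token_end);
    }
    return tokens;
}
```

Because `delims` is a set of characters and not a substring, `split_any("hih1ihi", "hi")` yields the single token `1`.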
  • Galik
    Galik about 7 years
    @Timmmm Well, I shouldn't be using ptr_fun, but using std::isspace makes the code more easily modifiable to accommodate different locales. Having said that, my current working version uses find_first_of. That makes it more efficient and able to split on any character, not just whitespace. In fact, I also have a version that splits on a supplied string, which uses std::search (the possibilities for this function are multifold, it seems).
  • Timmmm
    Timmmm about 7 years
    Yeah, I rewrote it like this. Thanks for the code!
  • Galik
    Galik about 7 years
    @Timmmm looks good. I'm not going to post my current version(s) here because they are monstrously templated to accommodate different string and container types and look horrendous (I'm overdue revisiting that code). But I will get rid of that std::ptr_fun hehe
  • Daniel Ryan
    Daniel Ryan almost 7 years
    Anyone else getting the error "missing 'typename' prior to dependent type name 'T::size_type'"?
  • andrewmu
    andrewmu almost 7 years
    @Andrew: any_of has been part of the standard library since 2011: en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • Diedre
    Diedre almost 7 years
    Be aware that if you are using OpenCV, split can be confused with OpenCV's split, which splits images.
  • Oomph Fortuity
    Oomph Fortuity over 6 years
    Adding some explanation would be helpful.
  • Roman Shestakov
    Roman Shestakov over 6 years
    the first function in this answer is the best solution - works perfectly with a reverse join function - std::string strJoin(const std::vector<std::string> v, const char& delimiter) { if(!v.empty()) { std::stringstream ss; std::string str(1, delimiter); auto it = v.cbegin(); while(true) { ss << *it++; if(it != v.cend()) ss << delimiter; else return ss.str(); } } return ""; }
  • doctorram
    doctorram over 6 years
    I really wish they'd add a standard method with this signature: vector<string> std::string::split(char delimiter = ' ');
  • Sam
    Sam about 6 years
    Raw strings are pretty useful while dealing with regex patterns. That way, you don't have to use the escape sequences... You can just use R"([\s,]+)".
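To make the raw-string tip concrete, here is a minimal sketch that splits on runs of whitespace or commas using the pattern from the comment. The function name `split_regex` is hypothetical, and the answer being commented on may be structured differently.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch: split on one or more whitespace characters or commas.
std::vector<std::string> split_regex(const std::string& s) {
    // Raw string literal: no need to write "[\\s,]+".
    static const std::regex separators(R"([\s,]+)");
    // Submatch index -1 selects the text between matches, i.e. the tokens.
    std::sregex_token_iterator first(s.begin(), s.end(), separators, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

If the input begins with a separator, the first token is an empty string, so filter it out if that matters for your use.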
  • einpoklum
    einpoklum about 6 years
    Does this materialize a copy of all of the tokens, or does it only keep the start and end position of the current token?
  • kayleeFrye_onDeck
    kayleeFrye_onDeck almost 6 years
    I had some issues initially, but this does in fact work with wstring / unicode if you update the template accordingly. Be careful though; I ran into some easy-to-cause runtime errors that the compiler didn't catch in a couple of different places.
  • kayleeFrye_onDeck
    kayleeFrye_onDeck almost 6 years
    If using wstring and your code breaks, check this answer for fixing the istream_iterator usage with wchar_t: stackoverflow.com/a/20959347/3543437
  • Marius
    Marius almost 6 years
    Thanks @kayleeFrye_onDeck , I've not been using C++ at this level for a few years now and may be a bit rusty on the new specs, but if there is anything I should fix on this post, let me know and I'll check it out.
  • Ali
    Ali over 5 years
    You can also split on other delimiters if you use getline in the while condition e.g. to split by commas, use while(getline(ss, buff, ',')).
  • Martin York
    Martin York about 5 years
    @l3dx Yes. You can add a specialized locale to the stream that makes a comma a space (and all other characters not a space). Then the code will work just the same. codereview.stackexchange.com/a/57467/507
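A rough sketch of the locale technique mentioned above, written from the description in the comment and not copied from the linked answer. The names `CommaIsSpace` and `split_on_commas_via_locale` are hypothetical.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: a character classification table in which only ','
// counts as whitespace. Every other character, including ' ', does not.
struct CommaIsSpace : std::ctype<char> {
    CommaIsSpace() : std::ctype<char>(make_table()) {}

    static const mask* make_table() {
        // All entries start as "no classification at all".
        static std::vector<mask> table(table_size, mask());
        table[','] = space;
        return table.data();
    }
};

std::vector<std::string> split_on_commas_via_locale(const std::string& s) {
    std::istringstream iss(s);
    // The locale takes ownership of the facet and deletes it when done.
    iss.imbue(std::locale(iss.getloc(), new CommaIsSpace));
    std::vector<std::string> tokens;
    std::string word;
    while (iss >> word) {  // operator>> now stops at commas only
        tokens.push_back(word);
    }
    return tokens;
}
```

As with ordinary whitespace extraction, consecutive commas are skipped, so this approach never yields empty tokens.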
  • Porsche9II
    Porsche9II about 5 years
    Yep, the range-based for looks better - I agree
  • Admin
    Admin over 4 years
    This code could really use some comments to explain what the purpose of every item is. A typical person asking this question is only going to end up with more questions after reading this, e.g. what the purpose of the empty istream_iterator is, or why the "create the vector directly" solution has so many brackets.
  • tjysdsg
    tjysdsg about 4 years
    I don't think there is any power or elegance in this, compared to just std::string::split(). Of course, there is no such split in the STL.
  • Pryftan
    Pryftan almost 4 years
    Or you could use strsep() (though not as portable). If you don't care about more than one char as the delimiter another answer gives an idea (getdelim()) but you could also iterate over the string with strchr(). Or...there are many ways depending on what you are after and need.
  • Mellester
    Mellester almost 4 years
    You can set the delimiter of istringstream stackoverflow.com/a/21814768/1943599
  • UserX
    UserX over 3 years
    This looks WAY more complex than the original proposed solution. You shouldn't have to do this much work just to split a string!
  • Teodor Maxim
    Teodor Maxim about 3 years
    I believe this could be optimized a bit by using word.clear() instead of word = "". Calling the clear method will empty the string but keep the already allocated buffer, which will be reused upon further concatenations. Right now a new buffer is created for every word, resulting in extra allocations.
  • nyarlathotep108
    nyarlathotep108 over 2 years
    @UserX this might be more complex than the original proposed solution, but it is also more efficient.
  • Pablo H
    Pablo H over 2 years
    This is mostly the same as stackoverflow.com/a/54134243/6655648.
  • Johannes Overmann
    Johannes Overmann over 2 years
    As others noted, this does not correctly handle empty strings at the end. (This is not a matter of definition, since "a,b," and "a,b" both give the same result.) This can be fixed by initializing iss with s + delim and explicitly handling the special case that an empty string should return an empty list.
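The fix described in this comment could look roughly like the sketch below, which adapts the `getline` loop from Solution 2. The function name `split_keep_trailing` is an assumption.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: like the getline-based split, but a trailing
// delimiter produces a trailing empty token, so "a,b," and "a,b" differ.
std::vector<std::string> split_keep_trailing(const std::string& s, char delim) {
    std::vector<std::string> elems;
    if (s.empty()) {
        return elems;  // special case: empty input gives an empty list
    }
    // Appending one delimiter makes getline see the final token even when
    // that token is empty.
    std::istringstream iss(s + delim);
    std::string item;
    while (std::getline(iss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}
```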
  • underscore_d
    underscore_d over 2 years
    it's only a two-liner because one of those two lines is huge and cryptic... no one who actually has to read code ever, wants to read something like this, or would write it. contrived brevity is worse than tasteful verbosity.
  • Optimus1
    Optimus1 over 2 years
    Your code is not working! Try string = "hih1ihi", substring = "hi". Your code is not giving the correct result. minus.
  • Optimus1
    Optimus1 over 2 years
    Your code is as slow as a turtle after drinking.
  • Marius
    Marius about 2 years
    @Optimus1 I think you assumed the delimiters parameter is not a character list of delimiters but rather a substring. Therein lies the rub.