How do I iterate over the words of a string?

c++ string split

2,330,209

Solution 1

For what it's worth, here's another way to extract tokens from an input string, relying only on standard library facilities. It's an example of the power and elegance behind the design of the STL.

#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string sentence = "And I feel fine...";
    istringstream iss(sentence);
    copy(istream_iterator<string>(iss),
         istream_iterator<string>(),
         ostream_iterator<string>(cout, "\n"));
}

Instead of copying the extracted tokens to an output stream, one could insert them into a container, using the same generic copy algorithm.

vector<string> tokens;
copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter(tokens));

... or create the vector directly:

vector<string> tokens{istream_iterator<string>{iss},
                      istream_iterator<string>{}};

Solution 2

I use this to split a string by a delimiter. The first function writes the results to an output iterator; the second returns a new vector.

#include <string>
#include <sstream>
#include <vector>
#include <iterator>

template <typename Out>
void split(const std::string &s, char delim, Out result) {
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        *result++ = item;
    }
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, std::back_inserter(elems));
    return elems;
}

Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:

std::vector<std::string> x = split("one:two::three", ':');
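To make the empty-token behaviour concrete, here is a minimal sketch using the same std::getline technique. The name split_tokens and the skip_empty flag are hypothetical additions for illustration, not part of the answer above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of the getline-based split shown above.
// skip_empty is an illustrative extra flag; by default empty tokens are kept.
std::vector<std::string> split_tokens(const std::string& s, char delim,
                                      bool skip_empty = false) {
    std::vector<std::string> elems;
    std::istringstream iss(s);
    std::string item;
    // getline extracts up to (and consumes) the next delimiter each time.
    while (std::getline(iss, item, delim)) {
        if (skip_empty && item.empty()) {
            continue;  // drop the empty token between two adjacent delimiters
        }
        elems.push_back(item);
    }
    return elems;
}
```

With this sketch, split_tokens("one:two::three", ':') yields four elements (the third is empty), while passing true for skip_empty yields three.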

Solution 3

A possible solution using Boost might be:

#include <boost/algorithm/string.hpp>
#include <string>
#include <vector>

std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));

This approach might be even faster than the stringstream approach. And since this is a generic template function it can be used to split other types of strings (wchar, etc. or UTF-8) using all kinds of delimiters.

See the documentation for details.

Solution 4

#include <vector>
#include <string>
#include <sstream>

int main()
{
    std::string str("Split me by whitespaces");
    std::string buf;                 // Have a buffer string
    std::stringstream ss(str);       // Insert the string into a stream

    std::vector<std::string> tokens; // Create vector to hold our words

    while (ss >> buf)
        tokens.push_back(buf);

    return 0;
}

Solution 5

For those who are uneasy about sacrificing all efficiency for code size, and who see efficiency as a kind of elegance, the following should hit a sweet spot (and I think the templated container parameter is an awesomely elegant addition):

template < class ContainerT >
void tokenize(const std::string& str, ContainerT& tokens,
              const std::string& delimiters = " ", bool trimEmpty = false)
{
   std::string::size_type pos, lastPos = 0, length = str.length();

   using value_type = typename ContainerT::value_type;
   using size_type  = typename ContainerT::size_type;

   while(lastPos < length + 1)
   {
      pos = str.find_first_of(delimiters, lastPos);
      if(pos == std::string::npos)
      {
         pos = length;
      }

      if(pos != lastPos || !trimEmpty)
         tokens.push_back(value_type(str.data()+lastPos,
               (size_type)pos-lastPos ));

      lastPos = pos + 1;
   }
}

I usually use std::vector<std::string> as my second parameter (ContainerT), but list<> is way faster than vector<> when direct access is not needed, and you can even create your own string class and use something like std::list<subString>, where subString does not make any copies, for incredible speed increases.

It's more than twice as fast as the fastest tokenize on this page and almost 5 times faster than some others. Also, with the perfect parameter types you can eliminate all string and list copies for additional speed increases.

Additionally it does not do the (extremely inefficient) return of result, but rather it passes the tokens as a reference, thus also allowing you to build up tokens using multiple calls if you so wished.

Lastly it allows you to specify whether to trim empty tokens from the results via a last optional parameter.

All it needs is std::string... the rest are optional. It does not use streams or the boost library, but is flexible enough to be able to accept some of these foreign types naturally.
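Since a usage example with the required includes may help, here is a small sketch. It repeats the tokenize template from this answer so the block is self-contained; the wrapper tokenize_to_vector is a hypothetical convenience added only for illustration.

```cpp
#include <string>
#include <vector>

// The tokenize template from the answer above, repeated for completeness.
template <class ContainerT>
void tokenize(const std::string& str, ContainerT& tokens,
              const std::string& delimiters = " ", bool trimEmpty = false)
{
    std::string::size_type pos, lastPos = 0, length = str.length();

    using value_type = typename ContainerT::value_type;
    using size_type  = typename ContainerT::size_type;

    while (lastPos < length + 1) {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string::npos) {
            pos = length;
        }
        if (pos != lastPos || !trimEmpty) {
            tokens.push_back(value_type(str.data() + lastPos,
                                        (size_type)pos - lastPos));
        }
        lastPos = pos + 1;
    }
}

// Hypothetical helper (not part of the answer): returns the tokens by value.
std::vector<std::string> tokenize_to_vector(const std::string& str,
                                            const std::string& delimiters = " ",
                                            bool trimEmpty = false)
{
    std::vector<std::string> tokens;
    tokenize(str, tokens, delimiters, trimEmpty);
    return tokens;
}
```

Calling tokenize twice with the same container appends to it, so tokens can be built up over several calls. Each character of the delimiters argument is treated as a separate delimiter.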


Author by

Markus Joschko

I work with GPUs on deep learning and computer vision.

Updated on July 08, 2022

Comments

Markus Joschko almost 2 years
I'm trying to iterate over the words of a string.

The string can be assumed to be composed of words separated by whitespace.

Note that I'm not interested in C string functions or that kind of character manipulation/access. Also, please give precedence to elegance over efficiency in your answer.

The best solution I have right now is:
```
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    string s = "Somewhere down the road";
    istringstream iss(s);

    do
    {
        string subs;
        iss >> subs;
        cout << "Substring: " << subs << endl;
    } while (iss);
}
```
Is there a more elegant way to do this?
- Admin over 15 years
  
  Dude... Elegance is just a fancy way to say "efficiency-that-looks-pretty" in my book. Don't shy away from using C functions and quick methods to accomplish anything just because it is not contained within a template ;)
- pyon over 14 years
  
  while (iss) { string subs; iss >> subs; cout << "Substring: " << subs << endl; }
- Ofer over 13 years
  
  @nlaq, Except that you'd have to convert your string object using c_str(), and back to a string again if you still needed it to be a string, no?
- Tony Delroy about 12 years
  
  @Eduardo: that's wrong too... you need to test iss between trying to stream another value and using that value, i.e. string sub; while (iss >> sub) cout << "Substring: " << sub << '\n';
- James Oravec about 11 years
  
  How about a string tokenizer: cplusplus.com/reference/cstring/strtok
- hB0 over 10 years
  
  Various options in C++ to do this by default: cplusplus.com/faq/sequences/strings/split
- Matt about 7 years
  
  There's more to elegance than just pretty efficiency. Elegant attributes include low line count and high legibility. IMHO Elegance is not a proxy for efficiency but maintainability.
- Konchog over 5 years
  
  Most of the answers here are notably latin-centric. Many of the answers assume a single character can be used as 'whitespace' even though the question defines the delimiter to be whitespace. Unicode has at least 25 whitespace characters. But word-delimiting is not merely a whitespace issue. For instance, in syllabic writing, such as Tibetan, word delimitation is a semantic, rather than syntactic, problem. Therefore, using whitespace to extract words is not a suitable approach for many languages.
- Martin York about 5 years
  
  Small addition to the above. You can add a locale facet that treats punctuation as space so you don't need to handle that separately. codereview.stackexchange.com/a/57467/507
- ttulinsky over 3 years
  
  Your original code is more elegant than the answers.
Markus Joschko over 15 years

I'm aware of the C string functions and I'm aware of the performance issues too (both of which I've noted in my question). However, for this specific question, I'm looking for an elegant C++ solution.
Admin over 15 years

... and you don't want to just build an OO wrapper over the C functions why?
paercebal over 15 years

@Nelson LaQuet: Let me guess: Because strtok is not reentrant?
Jason over 15 years

@Nelson don't ever pass string.c_str() to strtok! strtok trashes the input string (inserts '\0' chars to replace each found delimiter) and c_str() returns a non-modifiable string.
Admin over 15 years

char* ch = new char[str.size()]; strcpy(ch, str.c_str()); ... delete[] ch; // problem solved.
Tom about 15 years

Speed is irrelevant here, as both of these cases are much slower than a strtok-like function.
l3dx almost 15 years

Is it possible to specify a delimiter for this? Like for instance splitting on commas?
Ankit Roy over 14 years

This is practical and quick enough if you know the line will contain just a few tokens, but if it contains many then you will burn a ton of memory (and time) growing the vector. So no, it's not faster than the stringstream solution -- at least not for large n, which is the only case where speed matters.
Ankit Roy over 14 years

@Nelson: That array needs to be of size str.size() + 1 in your last comment. But I agree with your thesis that it's silly to avoid C functions for "aesthetic" reasons.
Ankit Roy over 14 years

I was tempted to +1 this answer for its simple, readable code (which I presume rubbed an elegantophile the wrong way, hence the -1), but then I saw that you allocated a fixed-size array of strings to hold the tokens. Come on, you know that's gonna break at the worst possible moment! :)
Jonathan over 14 years

@l3dx: it seems that the parameter "\n" is the delimiter. This code is very nice, but I would like to know better about it. Maybe somebody could explain each line of that snippet?
huy over 14 years

@Jonathan: \n is not the delimiter in this case, it's the delimiter for outputting to cout.
boskom almost 14 years

elegant solution, I always forget about this particular "getline", though I do not believe it is aware of quotes and escape sequences.
Roman Starkov almost 14 years

And for those who don't already have boost... bcp copies over 1,000 files for this :)
littlebroccoli almost 14 years

strtok is from the C standard library, not C++. It is not safe to use in multithreaded programs. It modifies the input string.
littlebroccoli almost 14 years

Because it stores the char pointer from the first call in a static variable, so that on the subsequent calls when NULL is passed, it remembers what pointer should be used. If a second thread calls strtok when another thread is still processing, this char pointer will be overwritten, and both threads will then have incorrect results. mkssoftware.com/docs/man3/strtok.3.asp
abatishchev almost 14 years

Is it possible to declare word as a char?
gnomed almost 14 years

Sorry abatishchev, C++ is not my strong point. But I imagine it would not be difficult to add an inner loop to loop through every character in each word. But right now I believe the current loop depends on spaces for word separation. Unless you know that there is only a single character between every space, in which case you can just cast "word" to a char... sorry I can't be of more help, I've been meaning to brush up on my C++
systemsfault almost 14 years

as mentioned before strtok is unsafe and even in C strtok_r is recommended for use
Wayne Werner almost 14 years

based on this: cplusplus.com/reference/algorithm/copy no. The whitespace behavior is a function of the istream_iterator. It would be more elegant to roll your own.
Wayne Werner almost 14 years

if you declare word as a char it will iterate over every non-whitespace character. It's simple enough to try: stringstream ss("Hello World, this is*@#&$(@ a string"); char c; while(ss >> c) cout << c;
user276641 over 13 years

@graham.reeds, @l3dx: Please don't write another CSV parser which can't handle quoted fields: en.wikipedia.org/wiki/Comma-separated_values
Jason over 13 years

@stijn: are you saying that split("one two three", ' '); returns a vector with 4 elements? I'm not sure that is the case, but I'll test it.
stijn over 13 years

wait, it seems the formatting removed some spaces (or I forgot them): I'm talking about the string "one two three" with 2 spaces between "two" and "three"
SmallChess over 13 years

This is a poor solution as it doesn't take any other delimiter, therefore not scalable and not maintable.
user470379 over 13 years

To people asking how this works: equivalent code using less of the STL would look like string token; istringstream iss(sentence); while (iss >> token) { cout << token; } or { tokens.push_back(token); }
szx about 13 years

Why do I get "error C2664: 'std::back_inserter' : cannot convert parameter 1 from 'std::vector<_Ty> (__cdecl *)(void)' to 'std::vector<_Ty> &'" in VS2008?
Erik Aronesty over 12 years

strtok_r can be used if you are in a section of code that may be accessed. this is the only solution of all of the above that isn't "line noise", and is a testament to what, exactly, is wrong with c++
Offirmo over 12 years

Warning, when given an empty string (""), this method returns a vector containing the "" string. So add an "if (!string_to_split.empty())" before the split.
Admin over 12 years

I'm quite a fan of this, but for g++ (and probably good practice) anyone using this will want typedefs and typenames: typedef ContainerT Base; typedef typename Base::value_type ValueType; typedef typename ValueType::size_type SizeType; Then to substitute out the value_type and size_types accordingly.
gregschlom over 12 years

The first version is simple and gets the job done perfectly. The only change I would make would be to return the result directly, instead of passing it as a parameter.
ACK_stoverflow over 12 years

@Ian Embedded developers aren't all using boost.
Drake over 12 years

The output is passed as a parameter for efficiency. If the result were returned it would require either a copy of the vector, or a heap allocation which would then have to be freed.
gregschlom over 12 years

My bad, I was wrongly assuming that the STL would use lazy copy, as Qt containers do. Too bad they don't.
WDRust about 12 years

@ACK_stoverflow are embedded developers using C++ anyway?
PlasmaHH about 12 years

That doesn't split by arbitrary whitespace and produces empty strings for subsequent whitespace.
Xander Tulip about 12 years

Inefficient and you're deriving from an STL container - possibly one of the worst things you could do.
Xander Tulip about 12 years

You forgot to add to use list: "extremely inefficient"
Luis Machuca about 12 years

bcp'ing this brings forth libraries such as the MPL, which I think is really hardly needed to split text. Man it is a PITA...
Marco M. about 12 years

@XanderTulip, can you be more constructive and explain how or why?
Tony Delroy about 12 years

@j_random_hacker: "at least not for large n, which is the only case where speed matters" - also for smallish n in a large-n loop...
Tony Delroy about 12 years

@tuxSlayer: various POSIX/XOPEN/UNIX standards also specify strtok_r
tuxSlayer about 12 years

@TonyDelroy: Yeah, and it looks like in msvc it is called strtok_s (meaning safe?:)). Not too portable...
Joseph Garvin about 12 years

@XanderTulip: I assume you are referring to it returning the vector by value. The Return-Value-Optimization (RVO, google it) should take care of this. Also in C++11 you could return by move reference.
Nawaz almost 12 years

The template argument to back_inserter should be string, not vector<string>. That is, it should be back_inserter<string>(tokens), not back_inserter<vector<string>>(tokens).
Tony Delroy almost 12 years

@tuxSlayer: if you'd prefer to write your own implementation instead of have a five line #if/#else/#endif then knock yourself out....
Nils almost 12 years

Use std::string::find(..) and std::string::substr(..) no need to use boost.
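A minimal sketch of the find/substr approach this comment suggests, assuming the delimiter is itself a string (so multi-character delimiters such as "->" work). The name split_on is hypothetical, and empty tokens are kept.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits s on every occurrence of the string delim,
// using only std::string::find and std::string::substr.
std::vector<std::string> split_on(const std::string& s, const std::string& delim) {
    std::vector<std::string> out;
    if (delim.empty()) {
        out.push_back(s);  // assumption: an empty delimiter means "no split"
        return out;
    }
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = s.find(delim, start)) != std::string::npos) {
        out.push_back(s.substr(start, pos - start));
        start = pos + delim.length();  // continue after the whole delimiter
    }
    out.push_back(s.substr(start));    // the remainder after the last delimiter
    return out;
}
```

Unlike find_first_of, find matches the delimiter as a whole, so split_on("a->b->c", "->") gives the three tokens a, b and c.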
Kit10 almost 12 years

I liked this solution, however, I wrapped the function in a template, changing the vectors std::string template parameter into a parameter. For me, I also used boost::lexical_cast on said template parameter in the push_back.
Wes Miller almost 12 years

For those of us for whom the template stuff and the first comment are completely foreign, a usage example complete with required includes would be lovely.
Wes Miller almost 12 years

Ahh well, I figured it out. I put the C++ lines from aws' comment inside the function body of tokenize(), then edited the tokens.push_back() lines to change the ContainerT::value_type to just ValueType and changed (ContainerT::value_type::size_type) to (SizeType). Fixed the bits g++ had been whining about. Just invoke it as tokenize( some_string, some_vector );
AndersK over 11 years

actually in our company we are not allowed to use boost due to security, yeah i know but suits have decided.
Mihai Bişog over 11 years

This can actually be optimized further: instead of .push_back(str.substr(...)) one can use .emplace_back(str, start, pos - start). This way the string object is constructed in the container and thus we avoid a move operation + other shenanigans done by the .substr function.
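As a rough illustration of this suggestion (C++11 or later), the sketch below constructs each token in place through the std::string (str, pos, len) constructor instead of building a temporary with substr. The function name split_emplace and the single-character delimiter are assumptions for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds each token directly inside the vector.
std::vector<std::string> split_emplace(const std::string& str, char delim) {
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = str.find(delim, start)) != std::string::npos) {
        // Forwards (str, start, length) to std::string's substring constructor.
        tokens.emplace_back(str, start, pos - start);
        start = pos + 1;
    }
    // Last token: everything from start to the end of the string.
    tokens.emplace_back(str, start, std::string::npos);
    return tokens;
}
```

Whether this is measurably faster than push_back(str.substr(...)) depends on the compiler and standard library, so it is worth benchmarking before relying on it.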
Marco M. over 11 years

@zoopp yes. Good idea. VS10 didn't have emplace_back support when I wrote this. I will update my answer. Thanks
user997112 over 11 years

Could someone be so kind as to provide a "summary" as to why this code has much greater performance?
Alexei Sholik over 11 years

Take a look at ranges if you care about elegance in practical terms (i.e. do more with less code): slideshare.net/rawwell/iteratorsmustgo
thecoshman over 11 years

This is just a great big ugly wall of code. You should explain the logic behind it.
キキジキ over 11 years

How can I modify it to work with std::wstring? std::getline won't work, right?
Jason over 11 years

std::getline is templated, so it may "just work", if not see en.cppreference.com/w/cpp/string/basic_string/getline to figure out how to tweak it. Passing a wchar_t character as the delim may be enough to trigger the right template.
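One way this could look, as a sketch: templating on the character type lets the same getline loop serve both std::string and std::wstring. The name split_basic is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical generic version of the getline-based split.
// CharT is deduced from the arguments (char, wchar_t, ...).
template <typename CharT>
std::vector<std::basic_string<CharT>>
split_basic(const std::basic_string<CharT>& s, CharT delim) {
    std::vector<std::basic_string<CharT>> elems;
    std::basic_istringstream<CharT> iss(s);
    std::basic_string<CharT> item;
    while (std::getline(iss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}
```

For wide strings, both arguments need to be wide, for example split_basic(std::wstring(L"a,b,c"), L','); a plain string literal would not deduce CharT.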
Marius over 11 years

Apart from running a few performance tests on sample data, primarily I've reduced it to as few instructions as possible and as few memory copies as possible, enabled by the use of a substring class that only references offsets/lengths in other strings. (I rolled my own, but there are some other implementations). Unfortunately there is not too much else one can do to improve on this, but incremental increases were possible.
marko over 11 years

Welcome to StackOverflow. Your answer would be improved if you described the code a bit further. What differentiates it from the other (very high-scoring) answers on this old question?
Jerry Coffin over 11 years

Actually, this can work just fine with other delimiters (though doing some is somewhat ugly). You create a ctype facet that classifies the desired delimiters as whitespace, create a locale containing that facet, then imbue the stringstream with that locale before extracting strings.
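The facet technique described in this comment might be sketched as follows. The struct delimiter_ctype and the function split_with_facet are hypothetical names, and the choice of ',' and ';' as extra delimiters is only an example.

```cpp
#include <cstddef>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical ctype facet: additionally classifies ',' and ';' as whitespace.
struct delimiter_ctype : std::ctype<char> {
    static const mask* make_table() {
        // Start from the classic "C" table and mark the extra delimiters.
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[','] |= space;
        table[';'] |= space;
        return table.data();
    }
    explicit delimiter_ctype(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}
};

// Splits on whitespace, ',' and ';' using formatted extraction.
std::vector<std::string> split_with_facet(const std::string& s) {
    std::istringstream iss(s);
    // The locale takes ownership of the facet allocated here.
    iss.imbue(std::locale(iss.getloc(), new delimiter_ctype));
    return std::vector<std::string>(std::istream_iterator<std::string>(iss),
                                    std::istream_iterator<std::string>());
}
```

Because the delimiters count as whitespace, runs of them collapse and no empty tokens are produced, which differs from the getline-based splits.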
Oktalist over 11 years

This is the best answer here, if you only want to split on a single delimiter character. The original question wanted to split on whitespace though, meaning any combination of one or more consecutive spaces or tabs. You have actually answered stackoverflow.com/questions/53849
Clay over 11 years

The main purpose of istream_iterator is it can parse int, float, double, etc from an istream: istream_iterator<double> does a decent job reading doubles separated by space. With a front or especially back inserter it's a great combo! :)
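As a small illustration of this point, the following sketch reads whitespace-separated doubles with istream_iterator<double> and a back_inserter. The function name parse_doubles is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses whitespace-separated numbers from a string.
std::vector<double> parse_doubles(const std::string& s) {
    std::istringstream iss(s);
    std::vector<double> values;
    // Each dereference of the iterator performs one "iss >> double".
    std::copy(std::istream_iterator<double>(iss),
              std::istream_iterator<double>(),
              std::back_inserter(values));
    return values;
}
```

Reading stops at the first token that cannot be parsed as a double, so malformed input silently truncates the result.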
legends2k over 11 years

vector has a ctor that takes a begin and end iterator, so no need for the copy call to insert them into a container.
Christian Rau over 11 years

@Kinderchocolate "The string can be assumed to be composed of words separated by whitespace" - Hmm, doesn't sound like a poor solution to the question's problem. "not scalable and not maintable" - Hah, nice one.
Christian Rau over 11 years

@Nawaz Why should it? You're inserting into a std::vector<std::string> and not into a std::string. But then again, there shouldn't be an explicit template argument, anyway (well, there shouldn't even be a back_inserter or copy, but ok).
Nawaz over 11 years

@ChristianRau: Oh you're right; the first code-snippet probably confused me. Actually I should have said you don't need to mention the template argument in std::back_inserter; in fact, mentioning template argument defies the very purpose of back_inserter.
Peter M about 11 years

I like this because it requires the minimum amount of extra headers. I might recommend an edit to make it follow best practice usage of namespaces (IE std:: in front of everything).
GMasucci almost 11 years

as an addendum: I use boost only when I must, normally I prefer to add to my own library of code which is standalone and portable so that I can achieve small precise specific code, which accomplishes a given aim. That way the code is non-public, performant, trivial and portable. Boost has its place but I would suggest that it's a bit of overkill for tokenising strings: you wouldn't have your whole house transported to an engineering firm to get a new nail hammered into the wall to hang a picture.... they may do it extremely well, but the pros are by far outweighed by the cons.
Drake almost 11 years

A slight addendum to my comment above: this function could return the vector without penalty if using C++11 move semantics.
Rozuur almost 11 years

if you are enabling return value optimization, can't you make the function to return void?
Alex S almost 11 years

@AlecThomas: Even before C++11, wouldn't most compilers optimise away the return copy via NRVO? (+1 anyway; very succinct)
user2083364 over 10 years

nice it even works for calling of boost framework in xcode (iOS project) in cpp class
Andreas Spindler over 10 years

Nice. Regarding Appender note "Why shouldn't we inherit a class from STL classes?"
David G over 10 years

To make it skip empty tokens, do an empty() check: if (!item.empty()) elems.push_back(item)
Alex Spencer over 10 years

@Peter M I would rather have it be passed in by reference, just in case the vector<string> got large.
herohuyongtao over 10 years

What if the delimiter contains two characters, such as ->?
Jason over 10 years

@herohuyongtao, this solution only works for single char delimiters.
stewart99 over 10 years

why do you need to use curly brackets in vector<string> tokens{istream_iterator<string>{iss}, istream_iterator<string>{}}; is it because otherwise it looks like function call?
loop over 10 years

@Copperpot How did you do it in a template?
duslabo over 10 years

@EvanTeran This may not be about splitting the string, but a general question about your code: you are passing elems as a reference argument and returning the reference again. I just wanted to know, is there any reason for that?
Jason over 10 years

@JeshwanthKumarNK, it's not necessary, but it lets you do things like pass the result directly to a function like this: f(split(s, d, v)) while still having the benefit of a pre-allocated vector if you like.
Drake about 10 years

@Veritas In what way does it not work if the delimiter is the last character? Also, outputting empty tokens is intentional, though it could obviously be easily modified to not do that if required.
Erik Aronesty about 10 years

Updated so there can be no objections on the grounds of thread safety from C++ wonks.
paulm about 10 years

"For example, printf and scanf both are faster then cin and cout" only because synchronization is enabled by default
EvilTeach almost 10 years

strtok is evil. It treats two delimiters as a single delimiter if there is nothing between them.
mchiasson over 9 years

This would have been my favourite answer, but std::regex is broken in GCC 4.8. They said that they implemented it correctly in GCC 4.9. I am still giving you my +1
jww over 9 years

"The STL does not have such a method available already" - what's wrong with string's find_first_of and using iterators to remember positions? Then, use substr to extract.
Brent Bradburn over 9 years

Similar responses with maybe better regex approach: here, and here.
Ben Voigt about 9 years

@paulm: No, the slowness of C++ streams is caused by facets. They're still slower than stdio.h functions even when synchronization is disabled (and on stringstreams, which can't synchronize).
Ziyuan about 9 years

Questions: 1. why would istream_iterator stop at white spaces? For me spaces are also part of the string; 2. why is it very inefficient?
Michael Trouw about 9 years

The elegance in needing 5 includes, 3 lines (not counting using <namespace>) and quite cryptic code to... split a string? Dear god.
Andreas Spindler about 9 years

There have been 2 revs. That's nice. It seems my English had too much "German" in it. However, the reviser did not fix two minor bugs, maybe because they were obvious anyway: first, std::isupper could be passed as the argument, not std::upper. Second, put a typename before the String::const_iterator.
Tiago about 9 years

My personal opinion is that C and C++ are languages not meant to be agile or to provide fast-to-market solutions. Using Boost is almost the same as choosing a higher-level language that offers more abstraction; for those we choose Java, C#, etc., because with those we don't care exactly what is happening under the hood. Using Boost would also mean that I would have to tell my client that I'm including a third-party library. Thanks anyway. :)
Andrew almost 9 years

Using Boost is like using a Booster seat: no thanks.
Yetti99 almost 9 years

A for() loop looks better. Like this davekb.com/browse_programming_tips:strtok_r_example:txt
Spacen Jasset almost 9 years

Out of all the answers this appears to be one of the most appealing and flexible. Together with the getline with a delimiter, although it's a less obvious solution. Does the c++11 standard not have anything for this? Does c++11 support punch cards these days?
Guosheng over 8 years

Maybe there's a bug. Given "xxxabcyyyabczzzabc" and "abo", the split result is "xxx|cyyy|czzz|c".
Marius over 8 years

That's the correct output for when trimEmpty = true. Keep in mind that "abo" is not a delimiter in this answer, but the list of delimiter characters. It would be simple to modify it to take a single delimiter string of characters (I think str.find_first_of should change to str.find_first, but I could be wrong... can't test)
Moiz Sajid over 8 years

We could also have used STL to split a string.
dshin over 8 years

Caveat: split("one:two::three", ':') and split("one:two::three:", ':') return the same value.
fmuecke over 8 years

please use std::vector instead of list
fmuecke over 8 years

that's actually quite nice, although (1) I don't think erasing is the most efficient way and (2) what about keeping empty tokens?
fmuecke over 8 years

almost perfect: split(":abc:def:", ':'); returns only 3 instead of 4 elements!
fmuecke over 8 years

Finally a solution that is handling empty tokens correctly at both sides of the string
Kaz over 8 years

@fmuecke There is no requirement in the question for a specific representation to use for the pieces of the string, hence there is no need to incorporate your suggestion into the answer.
LearnCocos2D over 8 years

If you pass in an empty string, it returns a vector with 1 element (empty string). If you pass in a string that's the same as sep, then it returns a vector with 2 elements (both empty strings). Should have "if (end > 0) {" before the push_back in while loop and "if (start > 0) {" before push_back below while loop to fix this.
Slava over 8 years

Many questionable implementation details aside, this answer is the only one which does it lazily. I am really disappointed in C++ world here. Well, streamiterator kind of does it too, but then everyone puts result into vector<string> killing all the benefits...
Drake over 8 years

@LearnCocos2D Please don't alter the meaning of a post with an edit. This behaviour is by design. It is identical behaviour to Python's split operator. I'll add a note to make this clear.
QuantumKarl over 8 years

This is my favorite with minor changes: vector returned as reference as you said, and the arguments "str" and "regex" passed by references also. thx.
Jonny over 8 years

Being able to set max number of returned elements is crucial to me.
Jonny over 8 years

Can someone please make it return up to a max N elements? Any remaining characters should end up in the last element.
Jason over 8 years

@Jonny, should be trivial, just add an extra condition to the while loop comparing the vector's size to the max. Something like this: while (elems.size() < max_count && std::getline(ss, item, delim)) {
Jason over 8 years

@Jonny, I see. Your answer looks a bit more complex than necessary. If you make the max default to something like size_t(-1), that will effectively be "infinity" (it's the biggest size your system can represent, so you'll run out of RAM before you hit this). Then you can make the condition as simple as my comment above. No more need to double check the stream state and do a second read and such. Just a suggestion :-).
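Putting these two comments together, a sketch of the suggested loop condition with a default of size_t(-1) could look like this. The name split_max is hypothetical, and note that this version simply stops after max_count tokens and drops the rest of the input.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: getline-based split that returns at most max_count
// tokens. The default, size_t(-1), is the largest size_t and acts as "no limit".
std::vector<std::string> split_max(const std::string& s, char delim,
                                   std::size_t max_count = static_cast<std::size_t>(-1)) {
    std::vector<std::string> elems;
    std::istringstream iss(s);
    std::string item;
    // Check the size first so no extra token is read once the limit is hit.
    while (elems.size() < max_count && std::getline(iss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}
```

For example, split_max("a:b:c:d", ':', 2) returns only a and b; the remaining "c:d" is not included anywhere in the result.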
Jonny over 8 years

Might be wrong but you might lose the end of the string with that. Well basically I mimic the explode function of php, or so I believe.
Jason over 8 years

Gotcha. My solution will stop at max_count, skipping the rest of the string (since it found the amount it wanted). I guess you are looking for something that will always make the last one the rest of the string. I have some functions like that too here: github.com/eteran/cpp-utilities/blob/master/string.h Some are specifically designed to match php's string manipulation functions as closely as possible :-)
Gabriel over 8 years

Why not return split(s, delim, std::vector<std::string>()); ?
Jason over 8 years

@Gabriel, you could. But I think when it was written (a few years ago), having a named variable encouraged NRVO more reliably. With C++11 move semantics, it may be a lot less of a difference.
Pascal Kesseli over 8 years

Suggest using std::string::size_type instead of int, as some compilers might spit out signed/unsigned warnings otherwise.
David Doria over 8 years

This only works for single character delimiters. A simple change lets it work with multicharacter: prev_pos = pos += delimiter.length();
sajas about 8 years

Can the boost::split really work on the utf-8 string? Can you share any documentation for that? I am trying to split a utf-8 string at newlines. Will the boost::split work correctly if the string that I pass is using utf-8 encoding?
noɥʇʎԀʎzɐɹƆ almost 8 years

This is much faster than Evan Teran's answer if you only need to split on whitespace.
exilit almost 8 years

While the missing delimiter concern is correct one should take into account that the OPs solution couldn't handle that either. So this seems to be not a requirement.
Alessandro Teruzzi almost 8 years

I like this solution because it allows the separator to be a string and not a char; however, it modifies the string in place, so it forces the creation of a copy of the original string.
Marius over 7 years

Thanks @thomas-perl for the revision, it does indeed make it more readable and compact. My original implementation avoided the additional comparison per loop as I was optimizing for a very low latency application. Your edit will be more applicable to most users visiting here however.
Seppo Enarvi about 7 years

@doorfly The only place where curly brackets are needed is istream_iterator<string>{}, because that would otherwise be regarded as a function.
Timmmm about 7 years

This is quite nice. I feel like the code could be clearer though, e.g. end unexpectedly isn't s.end().
Galik about 7 years

@Timmmm Out of curiosity what would you suggest for pos, end and done?
Timmmm about 7 years

Also you can make it a bit simpler with find_first_of and find_first_not_of.
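A sketch of what a find_first_of / find_first_not_of version might look like, assuming the goal is to split on runs of delimiter characters without producing empty tokens. The name split_ws and the default delimiter set are assumptions.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits on any run of characters from delims.
std::vector<std::string> split_ws(const std::string& s,
                                  const std::string& delims = " \t\n") {
    std::vector<std::string> tokens;
    // Skip leading delimiters to find the start of the first token.
    std::string::size_type start = s.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token ends at the next delimiter (or at the end of the string).
        std::string::size_type end = s.find_first_of(delims, start);
        tokens.push_back(s.substr(start, end - start));
        // Skip the delimiter run; yields npos when nothing is left.
        start = s.find_first_not_of(delims, end);
    }
    return tokens;
}
```

When end is npos, substr clamps the length to the rest of the string, so the final token needs no special case.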
Galik about 7 years

@Timmmm Well I shouldn't be using ptr_fun but using std::isspace makes the code more easily modifiable to accommodate different locales. Having said that my current working version uses find_first_of. That makes it more efficient and able to split on any character not just whitespace. In fact I also have a version that splits on a supplied string too , that uses std::search (the possibilities for this function are multifold it seems).
Timmmm about 7 years

Yeah, I rewrote it like this. Thanks for the code!
Galik about 7 years

@Timmmm looks good. I'm not going to post my current version(s) here because they are monstrously templated to accommodate different string and container types and look horrendous (I'm overdue revisiting that code). But I will get rid of that std::ptr_fun hehe
Daniel Ryan almost 7 years

Anyone else getting the error "missing 'typename' prior to dependent type name 'T::size_type'"?
andrewmu almost 7 years

@Andrew: any_of has been part of the standard library since 2011: en.cppreference.com/w/cpp/algorithm/all_any_none_of
Diedre almost 7 years

Be aware that if you are using OpenCV, split can be confused with OpenCV's split, which splits images.
Oomph Fortuity over 6 years

Adding some explanation would be helpful.
Roman Shestakov over 6 years

The first function in this answer is the best solution. It works perfectly with a reverse join function: std::string strJoin(const std::vector<std::string>& v, const char& delimiter) { if(!v.empty()) { std::stringstream ss; auto it = v.cbegin(); while(true) { ss << *it++; if(it != v.cend()) ss << delimiter; else return ss.str(); } } return ""; }
doctorram over 6 years

I really wish they'd add a standard method with this signature: vector<string> std::string::split(char delimiter = ' ');
Sam about 6 years

Raw strings are pretty useful while dealing with regex patterns. That way, you don't have to use the escape sequences... You can just use R"([\s,]+)".
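
A small sketch of how that raw-string pattern could be used for splitting, assuming the standard <regex> library; the function name split_regex is hypothetical and not from the answer being discussed.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits s on runs of whitespace and/or commas.
std::vector<std::string> split_regex(const std::string& s) {
    // Raw string literal: no need to write "[\\s,]+".
    static const std::regex sep(R"([\s,]+)");
    // Submatch index -1 selects the text between matches, i.e. the tokens.
    std::sregex_token_iterator first(s.begin(), s.end(), sep, -1);
    std::sregex_token_iterator last;  // default-constructed: end of sequence
    return std::vector<std::string>(first, last);
}
```

Note that if the input starts with a separator, the first token produced this way is an empty string.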
einpoklum about 6 years

Does this materialize a copy of all of the tokens, or does it only keep the start and end position of the current token?
kayleeFrye_onDeck almost 6 years

I had some issues initially, but this does in fact work with wstring / unicode if you update the template accordingly. Be careful though; I ran into some easy-to-cause runtime errors that the compiler didn't catch in a couple of different places.
kayleeFrye_onDeck almost 6 years

If using wstring and your code breaks, check this answer for fixing the istream_iterator usage with wchar_t: stackoverflow.com/a/20959347/3543437
Marius almost 6 years

Thanks @kayleeFrye_onDeck, I've not been using C++ at this level for a few years now and may be a bit rusty on the new specs, but if there is anything I should fix on this post, let me know and I'll check it out.
Ali over 5 years

You can also split on other delimiters if you use getline in the while condition e.g. to split by commas, use while(getline(ss, buff, ',')).
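
A minimal sketch of that getline loop wrapped in a function; the name split_commas is an assumption, while ss and buff follow the comment above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper around the getline loop described above.
std::vector<std::string> split_commas(const std::string& line) {
    std::istringstream ss(line);
    std::string buff;
    std::vector<std::string> fields;
    // getline with a third argument stops at ',' instead of '\n'.
    while (std::getline(ss, buff, ',')) {
        fields.push_back(buff);
    }
    return fields;
}
```

Empty fields between two commas are kept, but an empty field after a trailing comma is not: "a,b," gives two fields, not three.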
Martin York about 5 years

@l3dx Yes. You can add a specialized locale to the stream that treats , as a space (and all other characters as not a space). Then the code will work just the same. codereview.stackexchange.com/a/57467/507
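
Roughly, the locale trick could look like the sketch below. This is not the code from the linked answer; the names comma_is_space and split_on_commas are made up for illustration. It relies on the std::ctype<char> constructor that accepts a custom classification table.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: only ',' is classified as whitespace.
struct comma_is_space : std::ctype<char> {
    comma_is_space() : std::ctype<char>(make_table()) {}

    static const mask* make_table() {
        // All entries start with an empty mask, so nothing else counts as space.
        static std::vector<mask> table(table_size, mask());
        table[','] = space;
        return table.data();
    }
};

// Hypothetical helper: tokenizes s with operator>>, using ',' as "whitespace".
std::vector<std::string> split_on_commas(const std::string& s) {
    std::istringstream iss(s);
    // The locale takes ownership of the facet and deletes it when done.
    iss.imbue(std::locale(iss.getloc(), new comma_is_space));
    std::istream_iterator<std::string> first(iss);
    std::istream_iterator<std::string> last;
    return std::vector<std::string>(first, last);
}
```

Because operator>> skips runs of whitespace, consecutive commas are collapsed, and ordinary spaces stay inside the tokens.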
Porsche9II about 5 years

Yep, the range-based for looks better. I agree.
tbeu almost 5 years

@loop See gitlab.com/tbeu/wcx_setfolderdate/blob/master/src/splitstring.h for a templated implementation.
Admin over 4 years

This code could really use some comments to explain what the purpose of every item is. A typical person asking this question is only going to end up with more questions after reading this, e.g. what the purpose of the empty istream_iterator is, or why the "create the vector directly" solution has so many brackets.
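
In that spirit, here is a commented sketch of the istream_iterator approach from the answer, wrapped in a function whose name (words_of) is an assumption made for illustration.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper: returns the whitespace-separated words of sentence.
std::vector<std::string> words_of(const std::string& sentence) {
    std::istringstream iss(sentence);

    // Bound to the stream: each increment reads one word with operator>>.
    std::istream_iterator<std::string> first(iss);

    // Default-constructed: the end-of-stream sentinel. `first` compares
    // equal to it once the stream runs out of words.
    std::istream_iterator<std::string> last;

    // Named iterators avoid the "most vexing parse": writing
    //   std::vector<std::string> v(std::istream_iterator<std::string>(iss),
    //                              std::istream_iterator<std::string>());
    // declares a function named v instead of a vector. Using braces for
    // the empty iterator (or for the vector itself) is another way around it.
    return std::vector<std::string>(first, last);
}
```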
luizfls about 4 years

@tbeu fixing your link: gitlab.com/tbeu/wcx_setfolderdate/-/blob/master/src/…
tjysdsg about 4 years

I don't think there is any power or elegance in this, compared to just std::string::split(). Of course, there is no such split in the STL.
Pryftan almost 4 years

Or you could use strsep() (though not as portable). If you don't care about more than one char as the delimiter another answer gives an idea (getdelim()) but you could also iterate over the string with strchr(). Or...there are many ways depending on what you are after and need.
Mellester almost 4 years

You can set the delimiter of istringstream: stackoverflow.com/a/21814768/1943599
UserX over 3 years

This looks WAY more complex than the original proposed solution. You shouldn't have to do this much work just to split a string!
Teodor Maxim about 3 years

I believe this could be optimized a bit by using word.clear() instead of word = "". Calling the clear method will empty the string but keep the already allocated buffer, which will be reused upon further concatenations. Right now a new buffer is created for every word, resulting in extra allocations.
nyarlathotep108 over 2 years

@UserX this might be more complex than the original proposed solution, but it is also more efficient.
Pablo H over 2 years

This is mostly the same as stackoverflow.com/a/54134243/6655648.
Johannes Overmann over 2 years

As others noted, this does not correctly handle empty strings at the end. (This is not a matter of definition, since "a,b," and "a,b" both give the same result.) This can be fixed by initializing iss with s + delim and explicitly handling the special case that an empty string should return an empty list.
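
A sketch of that fix, assuming the getline-based split from the answer; the name split_keep_trailing is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of the getline-based split that keeps a trailing
// empty token, so "a,b," and "a,b" give different results.
std::vector<std::string> split_keep_trailing(const std::string& s, char delim) {
    std::vector<std::string> elems;
    // Special case: an empty input yields an empty list, not one empty token.
    if (s.empty()) {
        return elems;
    }
    // Appending one delimiter turns the final token (possibly empty)
    // into a delimiter-terminated field that getline will report.
    std::istringstream iss(s + delim);
    std::string item;
    while (std::getline(iss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}
```

With this variant, "a,b," produces three tokens (the last one empty) while "a,b" produces two.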
underscore_d over 2 years

It's only a two-liner because one of those two lines is huge and cryptic... No one who actually has to read code wants to read something like this, or would write it. Contrived brevity is worse than tasteful verbosity.
Optimus1 over 2 years

Your code is not working! Try string = "hih1ihi", substring = "hi". Your code is not giving the correct result. minus.
Optimus1 over 2 years

Your code is as slow as a turtle after drinking.
Marius about 2 years

@Optimus1 I think you assumed the delimiters parameter is not a character list of delimiters but rather a substring. Therein lies the rub.