Split string into key-value pairs using C++
Solution 1
Well I have two methods here. The first one is the easy, obvious method that I use all the time (performance is rarely an issue). The second method is likely more efficient but I have not done any formal timings.
In my tests the second method is about 3 times faster.
#include <map>
#include <string>
#include <sstream>
#include <iostream>
std::map<std::string, std::string> mappify1(std::string const& s)
{
std::map<std::string, std::string> m;
std::string key, val;
std::istringstream iss(s);
while(std::getline(std::getline(iss, key, ':') >> std::ws, val))
m[key] = val;
return m;
}
std::map<std::string, std::string> mappify2(std::string const& s)
{
std::map<std::string, std::string> m;
std::string::size_type key_pos = 0;
std::string::size_type key_end;
std::string::size_type val_pos;
std::string::size_type val_end;
while((key_end = s.find(':', key_pos)) != std::string::npos)
{
if((val_pos = s.find_first_not_of(": ", key_end)) == std::string::npos)
break;
val_end = s.find('\n', val_pos);
m.emplace(s.substr(key_pos, key_end - key_pos), s.substr(val_pos, val_end - val_pos));
key_pos = val_end;
if(key_pos != std::string::npos)
++key_pos;
}
return m;
}
int main()
{
std::string s = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";
std::cout << "mappify1: " << '\n';
auto m = mappify1(s);
for(auto const& p: m)
std::cout << '{' << p.first << " => " << p.second << '}' << '\n';
std::cout << "mappify2: " << '\n';
m = mappify2(s);
for(auto const& p: m)
std::cout << '{' << p.first << " => " << p.second << '}' << '\n';
}
Output:
mappify1:
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
mappify2:
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
Solution 2
This format is called "Tag-Value".
The most performance critical place where such encoding is used in the industry is probably financial FIX Protocol (=
for key-value separator, and '\001'
as entries delimiter). So if you are on x86 hardware then your best bet would be to google 'SSE4 FIX protocol parser github' and reuse the open sourced findings of HFT shops.
If you still want to delegate the vectorization part to the compiler and can spare few nanoseconds for readability then the most elegant solution is to store the result in a std::string
(data) + boost::flat_map<boost::string_ref, boost::string_ref>
(view). Parsing is a matter of taste, while-loop or strtok would be easiest for the compiler to parse. Boost-spirit based parser would be easiest for a human (familiar with boost-spirit) to read.
C++ for-loop based solution
#include <boost/container/flat_map.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/range/iterator_range_io.hpp>
#include <iostream>
// g++ -std=c++1z ~/aaa.cc
int main()
{
using range_t = boost::iterator_range<std::string::const_iterator>;
using map_t = boost::container::flat_map<range_t, range_t>;
char const sep = ':';
char const dlm = '\n';
// this part can be reused for parsing multiple records
map_t result;
result.reserve(1024);
std::string const input {"hello:world\n bye: world"};
// this part is per-line/per-record
result.clear();
for (auto _beg = begin(input), _end = end(input), it = _beg; it != _end;)
{
auto sep_it = std::find(it, _end, sep);
if (sep_it != _end)
{
auto dlm_it = std::find(sep_it + 1, _end, dlm);
result.emplace(range_t {it, sep_it}, range_t {sep_it + 1, dlm_it});
it = dlm_it + (dlm_it != _end);
}
else throw std::runtime_error("cannot parse");
}
for (auto& x: result)
std::cout << x.first << " => " << x.second << '\n';
return 0;
}
Solution 3
The format is simple enough that doing the parsing "by hand" IMO is the best option, overall remains quite readable.
This should also be reasonably efficient (the key
and value
strings are always the same - albeit clear
ed, so the reallocations inside the main loop should just stop after a few iterations); ret
also should qualify for NRVO, OTOH in case of problems with that you can always change to an output parameter.
Of course std::map
may not be the fastest gun in the west, but it's a request in the problem text.
std::map<std::string, std::string> parseKV(const std::string &sz) {
std::map<std::string, std::string> ret;
std::string key;
std::string value;
const char *s=sz.c_str();
while(*s) {
// parse the key
while(*s && *s!=':' && s[1]!=' ') {
key.push_back(*s);
++s;
}
// if we quit due to the end of the string exit now
if(!*s) break;
// skip the ": "
s+=2;
// parse the value
while(*s && *s!='\n') {
value.push_back(*s);
++s;
}
ret[key]=value;
key.clear(); value.clear();
// skip the newline
++s;
}
return ret;
}
Solution 4
If worried about performance, you should probably rethink the need for the end result to be a map. That could end up being a lot of char buffers in memory. Ideally keeping track of just the char* and length of each sub string will be faster/smaller.
Viking
Updated on June 14, 2022Comments
-
Viking about 2 years
I have a string like this:
"CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567"
Now
": "
splits key from value while\n
separates the pairs. I want to add the key-value pairs to a map in C++.Is there any efficient way of doing this considering optimization in mind?
-
Ed Heal almost 8 yearsHave you read the manual page for
std::string
-
Some programmer dude almost 8 yearsUsing e.g.
std::istringstream
andstd::getline
could be a good start. Note thatstd::getline
can be used for arbitrary separators, not only newlines. -
Some programmer dude almost 8 yearsAlso don't worry about optimizations at this stage. First make sure your program works, then benchmark, measure and profile to find the bottlenecks, and optimize those. Premature optimization is only going to lead you astray.
-
AnatolyS almost 8 yearsyou can implement it "as hoc", after that you can profile your solution and find slow places which you can optimize if you need.
-
Karoly Horvath almost 8 yearsHere on SO there's a strong correlation between the inability to do the task and asking for (the most) efficient way of doing it.
-
Geula Vainappel almost 8 yearsI think you can find some tips here: stackoverflow.com/questions/7621727/…
-
mpromonet almost 8 yearsYou can find some information from stackoverflow C++ documentation
-
-
Matteo Italia almost 8 yearsUsing a parser generator (and specifically the
boost::spirit
monstrosity) to parse a tag-value string is definitely overkill... -
bobah almost 8 years@MatteoItalia - totally, a while loop will be the most natural way of doing it, and this is how it would be done in most github FIX protocol parsers, which I suggested to look at.
-
bobah almost 8 years
std::map
without custom allocator is the best tool to slow down the code (memory allocations) and fragment the heap on the way. -
Jojje about 4 yearsThanks for sharing. I think your solution2 has an issue in that the delimiter might end up in the value,