Split string into key-value pairs using C++

18,791

Solution 1

Well I have two methods here. The first one is the easy, obvious method that I use all the time (performance is rarely an issue). The second method is likely more efficient but I have not done any formal timings.

In my tests the second method is about 3 times faster.

#include <map>
#include <string>
#include <sstream>
#include <iostream>

std::map<std::string, std::string> mappify1(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string key, val;
    std::istringstream iss(s);

    while(std::getline(std::getline(iss, key, ':') >> std::ws, val))
        m[key] = val;

    return m;
}

std::map<std::string, std::string> mappify2(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string::size_type key_pos = 0;
    std::string::size_type key_end;
    std::string::size_type val_pos;
    std::string::size_type val_end;

    while((key_end = s.find(':', key_pos)) != std::string::npos)
    {
        if((val_pos = s.find_first_not_of(": ", key_end)) == std::string::npos)
            break;

        val_end = s.find('\n', val_pos);
        m.emplace(s.substr(key_pos, key_end - key_pos), s.substr(val_pos, val_end - val_pos));

        key_pos = val_end;
        if(key_pos != std::string::npos)
            ++key_pos;
    }

    return m;
}

int main()
{
    std::string s = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";

    std::cout << "mappify1: " << '\n';

    auto m = mappify1(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';

    std::cout << "mappify2: " << '\n';

    m = mappify2(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';
}

Output:

mappify1: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
mappify2: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}

Solution 2

This format is called "Tag-Value".

The most performance critical place where such encoding is used in the industry is probably financial FIX Protocol (= for key-value separator, and '\001' as entries delimiter). So if you are on x86 hardware then your best bet would be to google 'SSE4 FIX protocol parser github' and reuse the open sourced findings of HFT shops.

If you still want to delegate the vectorization part to the compiler and can spare few nanoseconds for readability then the most elegant solution is to store the result in a std::string (data) + boost::flat_map<boost::string_ref, boost::string_ref> (view). Parsing is a matter of taste, while-loop or strtok would be easiest for the compiler to parse. Boost-spirit based parser would be easiest for a human (familiar with boost-spirit) to read.

C++ for-loop based solution

#include <boost/container/flat_map.hpp> 
#include <boost/range/iterator_range.hpp>

#include <boost/range/iterator_range_io.hpp> 
#include <iostream>

// g++ -std=c++1z ~/aaa.cc
int main()
{
    using range_t = boost::iterator_range<std::string::const_iterator>;
    using map_t = boost::container::flat_map<range_t, range_t>;

    char const sep = ':';
    char const dlm = '\n';

    // this part can be reused for parsing multiple records
    map_t result;
    result.reserve(1024);

    std::string const input {"hello:world\n bye: world"};

    // this part is per-line/per-record
    result.clear();
    for (auto _beg = begin(input), _end = end(input), it = _beg; it != _end;)
    {
        auto sep_it = std::find(it, _end, sep);
        if (sep_it != _end)
        {
            auto dlm_it = std::find(sep_it + 1, _end, dlm);
            result.emplace(range_t {it, sep_it}, range_t {sep_it + 1, dlm_it});
            it = dlm_it + (dlm_it != _end);
        }
        else throw std::runtime_error("cannot parse");
    }

    for (auto& x: result)
        std::cout << x.first << " => " << x.second << '\n';

    return 0;
}

Solution 3

The format is simple enough that doing the parsing "by hand" IMO is the best option, overall remains quite readable.

This should also be reasonably efficient (the key and value strings are always the same - albeit cleared, so the reallocations inside the main loop should just stop after a few iterations); ret also should qualify for NRVO, OTOH in case of problems with that you can always change to an output parameter.

Of course std::map may not be the fastest gun in the west, but it's a request in the problem text.

std::map<std::string, std::string> parseKV(const std::string &sz) {
    std::map<std::string, std::string> ret;
    std::string key;
    std::string value;
    const char *s=sz.c_str();
    while(*s) {
        // parse the key
        while(*s && *s!=':' && s[1]!=' ') {
            key.push_back(*s);
            ++s;
        }
        // if we quit due to the end of the string exit now
        if(!*s) break;
        // skip the ": "
        s+=2;
        // parse the value
        while(*s && *s!='\n') {
            value.push_back(*s);
            ++s;
        }
        ret[key]=value;
        key.clear(); value.clear();
        // skip the newline
        ++s;
    }
    return ret;
}

Solution 4

If worried about performance, you should probably rethink the need for the end result to be a map. That could end up being a lot of char buffers in memory. Ideally keeping track of just the char* and length of each sub string will be faster/smaller.

Share:
18,791
Viking
Author by

Viking

Updated on June 14, 2022

Comments

  • Viking
    Viking about 2 years

    I have a string like this:

    "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567"
    

    Now ": " splits key from value while \n separates the pairs. I want to add the key-value pairs to a map in C++.

    Is there any efficient way of doing this considering optimization in mind?

    • Ed Heal
      Ed Heal almost 8 years
      Have you read the manual page for std::string
    • Some programmer dude
      Some programmer dude almost 8 years
      Using e.g. std::istringstream and std::getline could be a good start. Note that std::getline can be used for arbitrary separators, not only newlines.
    • Some programmer dude
      Some programmer dude almost 8 years
      Also don't worry about optimizations at this stage. First make sure your program works, then benchmark, measure and profile to find the bottlenecks, and optimize those. Premature optimization is only going to lead you astray.
    • AnatolyS
      AnatolyS almost 8 years
      you can implement it "as hoc", after that you can profile your solution and find slow places which you can optimize if you need.
    • Karoly Horvath
      Karoly Horvath almost 8 years
      Here on SO there's a strong correlation between the inability to do the task and asking for (the most) efficient way of doing it.
    • Geula Vainappel
      Geula Vainappel almost 8 years
      I think you can find some tips here: stackoverflow.com/questions/7621727/…
    • mpromonet
      mpromonet almost 8 years
      You can find some information from stackoverflow C++ documentation
  • Matteo Italia
    Matteo Italia almost 8 years
    Using a parser generator (and specifically the boost::spirit monstrosity) to parse a tag-value string is definitely overkill...
  • bobah
    bobah almost 8 years
    @MatteoItalia - totally, a while loop will be the most natural way of doing it, and this is how it would be done in most github FIX protocol parsers, which I suggested to look at.
  • bobah
    bobah almost 8 years
    std::map without custom allocator is the best tool to slow down the code (memory allocations) and fragment the heap on the way.
  • Jojje
    Jojje about 4 years
    Thanks for sharing. I think your solution2 has an issue in that the delimiter might end up in the value,