initializing std::string from char* without copy

29,710

Solution 1

Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.

See this link for more information on the reserve function.

Solution 2

You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.

A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.

Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hit the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That you never had to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments/how large of fragments you have. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.


UPDATE (since I still see occasional upvotes on this answer):

C++17 introduces another choice: std::string_view, which replaced std::string in many function signatures, is a non-owning reference to a character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.

Solution 3

To help with really big strings SGI has the class Rope in its STL.
Non standard but may be usefull.

http://www.sgi.com/tech/stl/Rope.html

Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)

Solution 4

This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...

Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:

class lightweight_string { };

Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.

Solution 5

Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.

Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be a able to override it.

Share:
29,710
cutiko
Author by

cutiko

Updated on December 01, 2020

Comments

  • cutiko
    cutiko over 3 years

    I have a situation where I need to process large (many GB's) amounts of data as such:

    1. build a large string by appending many smaller (C char*) strings
    2. trim the string
    3. convert the string into a C++ const std::string for processing (read only)
    4. repeat

    The data in each iteration are independent.

    My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.

    Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?

    Alternatively, could I use stringstreams or something similar to re-use a large buffer?

    Edit: Thanks for the answers, for clarity, I think a revised question would be:

    How can I build (via multiple appends) a stl C++ string efficiently. And if performing this action in a loop, where each loop is totally independant, how can I re-use thisallocated space.

  • cutiko
    cutiko over 15 years
    Yes it would, although the inputs (C char*'s) and output (std::string) are not under my control.
  • cutiko
    cutiko over 15 years
    Your solution sounds like what im after, although, when you say "you can just use that same C++ string" do you mean using clear(), and then continuing to build the next string?
  • cutiko
    cutiko over 15 years
    I think your solution is the best approach (chaning the processing code) unfortunatly in this situation its not an option.
  • cutiko
    cutiko over 15 years
    This solution is the simplest for me to adopt, currently, although I'm unsure if the re-use is actually specified by the standard (although it does seem to work on my implementation)
  • Mathias
    Mathias over 15 years
    Using clear() should work. As far as I know, clear() does not free the memory used by the string, and so does not affect the space allocated by reserve().
  • Nicholas Knight
    Nicholas Knight almost 13 years
    @Akusete @e.James: (Yes, I know it's 2.5 years later) As I read the relevant parts of the standard, clear() is defined in terms of erase(), and erase() specifically does not dictate whether the block of memory in use after the call to erase() is the same as that in use before the call to erase(), nor does it dictate that capacity() remain the same. See e.g. 21.3.3 and 21.3.5.5. Reasonable implementations certainly may function this way, but be sure to treat it only as a helpful optimization -- not a requisite for proper program functioning.
  • Mathias
    Mathias almost 13 years
    @Nicholas Knight: Fair point. As always, the standard procedure is to make optimizations only after taking measurements and making sure that it is necessary. I wonder if I even knew that back in '08? :)
  • Bart
    Bart over 10 years
    @jons34yp FYI, your edits seem to have attracted some attention. Nothing to worry about, but they are now discussed on meta meta.stackexchange.com/questions/194788/…
  • SasQ
    SasQ almost 9 years
    Unless that processing code is a library function which doesn't use iterators nor strings, just a plain old char* + size.
  • TheArchitect
    TheArchitect almost 7 years
    @Akusete The only way to change the string buffer size directly by a function member is std::string::shrink_to_fit. std::string::clear just set the size (not the buffer) of string to 0.
  • codesniffer
    codesniffer about 6 years
    std::string::clear implementations are free to do anything with the underlying buffer, including freeing it.