Remove extra white spaces in C++

30,436

Solution 1

Here's a simple, non-C++11 solution, using the same remove_extra_whitespaces() signature as in the question:

#include <cstdio>

void remove_extra_whitespaces(char* input, char* output)
{
    int inputIndex = 0;
    int outputIndex = 0;
    while(input[inputIndex] != '\0')
    {
        output[outputIndex] = input[inputIndex];

        if(input[inputIndex] == ' ')
        {
            while(input[inputIndex + 1] == ' ')
            {
                // skip over any extra spaces
                inputIndex++;
            }
        }

        outputIndex++;
        inputIndex++;
    }

    // null-terminate output
    output[outputIndex] = '\0';
}

int main(int argc, char **argv)
{
    char input[0x255] = "asfa sas    f f dgdgd  dg   ggg";
    char output[0x255] = "NO_OUTPUT_YET";
    remove_extra_whitespaces(input,output);

    printf("input: %s\noutput: %s\n", input, output);

    return 0; // 0 signals success to the calling environment
}

Output:

input: asfa sas    f f dgdgd  dg   ggg
output: asfa sas f f dgdgd dg ggg

Solution 2

There are already plenty of nice solutions. Here is an alternative based on an algorithm from <algorithm> that is designed to drop consecutive duplicates: unique_copy():

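#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <string>

void remove_extra_whitespaces(const std::string &input, std::string &output)
{
    output.clear();  // unless you want to append to the end of the existing string...
    // Two adjacent characters count as "equal" when both are whitespace.
    // unsigned char avoids undefined behaviour in isspace() for negative char values.
    std::unique_copy(input.begin(), input.end(), std::back_inserter(output),
                     [](unsigned char a, unsigned char b){ return std::isspace(a) && std::isspace(b); });
    std::cout << output << std::endl;
}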

Here is a live demo. Note that I changed from C-style strings to the safer and more powerful C++ strings.

Edit: if keeping C-style strings is required in your code, you could use almost the same code, but with pointers instead of iterators. That's the magic of C++. Here is another live demo.
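
The linked demo is not reproduced here, so the following is only a sketch of what a pointer-based version might look like. It assumes that output points to a buffer of at least strlen(input) + 1 characters and that the two buffers do not overlap. The name remove_extra_whitespaces_cstr is made up for this illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstring>

// Sketch: std::unique_copy driven by raw pointers instead of string iterators.
// Assumes output has room for strlen(input) + 1 characters.
void remove_extra_whitespaces_cstr(const char* input, char* output)
{
    // The range includes the terminating '\0'. It is not whitespace, so it is
    // always copied and the output ends up null-terminated.
    const char* end = input + std::strlen(input) + 1;

    std::unique_copy(input, end, output,
                     [](unsigned char a, unsigned char b) {
                         return std::isspace(a) && std::isspace(b);
                     });
}
```

As with the std::string version, any run of whitespace (spaces, tabs, newlines) collapses to its first character.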

Solution 3

Since you use C++, you can take advantage of standard-library features designed for that sort of work. You could use std::string (instead of char[0x255]) and std::istringstream, which will replace most of the pointer arithmetic.

First, make a string stream:

std::istringstream stream(input);

Then, read strings from it. It will remove the whitespace delimiters automatically:

std::string word;
while (stream >> word)
{
    ...
}

Inside the loop, build your output string:

    if (!output.empty()) // special case: no space before first word
        output += ' ';
    output += word;

A disadvantage of this method is that it allocates memory dynamically (including several reallocations, performed when the output string grows).
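
Put together, the pieces above could be assembled as in the sketch below. The function name collapse_whitespace and the decision to return the result by value are assumptions made for the illustration.

```cpp
#include <sstream>
#include <string>

// Sketch: split the input into whitespace-delimited words and
// rejoin them with exactly one space between consecutive words.
std::string collapse_whitespace(const std::string& input)
{
    std::istringstream stream(input);
    std::string output;
    std::string word;

    while (stream >> word) // operator>> skips any run of whitespace
    {
        if (!output.empty()) // special case: no space before first word
            output += ' ';
        output += word;
    }
    return output;
}
```

Note that this approach also drops leading and trailing whitespace and turns tabs and newlines into single spaces, which may or may not be what you want.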

Solution 4

You can use std::unique, which reduces adjacent duplicates to a single instance according to your own definition of what makes two elements equal.

Here I have defined elements as equal if they are both whitespace characters:

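#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Collapses whitespace runs in place and returns the same string.
inline std::string& remove_extra_ws_mute(std::string& s)
{
    s.erase(std::unique(std::begin(s), std::end(s), [](unsigned char a, unsigned char b){
        return std::isspace(a) && std::isspace(b);
    }), std::end(s));

    return s;
}

// Works on a copy (taken by value) and leaves the caller's string untouched.
inline std::string remove_extra_ws_copy(std::string s)
{
    return remove_extra_ws_mute(s);
}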

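std::unique shifts the retained characters to the front of the string and returns an iterator to the new logical end. The characters between that iterator and the old end are leftovers with unspecified values, which is why they are erased.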
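
To see why the erase() call is needed, here is a small sketch that runs std::unique on its own and reports how many characters it kept. The helper name kept_length is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Sketch: runs std::unique without erasing anything and returns the
// length of the retained prefix. The string's size() is left unchanged.
std::size_t kept_length(std::string& s)
{
    auto new_end = std::unique(s.begin(), s.end(),
        [](unsigned char a, unsigned char b) {
            return std::isspace(a) && std::isspace(b);
        });

    return static_cast<std::size_t>(new_end - s.begin());
}
```

For the 8-character string "a  b   c" this returns 5: the first five characters are then "a b c", while size() is still 8 until the tail is erased.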

Additionally, if you must work with low-level (C-style) strings, you can still use std::unique on the pointers:

#include <algorithm>
#include <cctype>
#include <cstring>

// Returns a newly allocated copy of s with whitespace runs collapsed.
// The caller owns the result and must release it with delete[].
char* remove_extra_ws(char const* s)
{
    std::size_t len = std::strlen(s);

    char* buf = new char[len + 1];
    std::strcpy(buf, s);

    // Note that std::unique will also retain the null terminator
    // in its correct position at the end of the valid portion
    // of the string
    std::unique(buf, buf + len + 1, [](unsigned char a, unsigned char b){
        return (a && std::isspace(a)) && (b && std::isspace(b));
    });

    return buf;
}

Solution 5

There are plenty of ways of doing this (e.g., using regular expressions), but one way you could do this is using std::copy_if with a stateful functor remembering whether the last character was a space:

#include <algorithm>
#include <string>
#include <iostream>

struct if_not_prev_space
{
    // Is last encountered character space.
    bool m_is = false;

    bool operator()(const char c)
    {                                      
        // Copy if last was not space, or current is not space.                                                                                                                                                              
        const bool ret = !m_is || c != ' ';
        m_is = c == ' ';
        return ret;
    }
};


int main()
{
    const std::string s("abc  sssd g g sdg    gg  gf into abc sssd g g sdg gg gf");
    std::string o;
    std::copy_if(std::begin(s), std::end(s), std::back_inserter(o), if_not_prev_space());
    std::cout << o << std::endl;
}
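
For the regular-expression route mentioned at the start of this answer, a minimal sketch could look like the following. It assumes C++11's <regex> is available and only treats the space character as whitespace, like the functor above. The name collapse_spaces_regex is made up for the example.

```cpp
#include <regex>
#include <string>

// Sketch: replace every run of two or more spaces with a single space.
std::string collapse_spaces_regex(const std::string& input)
{
    // Built once and reused, since constructing a std::regex is comparatively costly.
    static const std::regex runs_of_spaces(" {2,}");
    return std::regex_replace(input, runs_of_spaces, " ");
}
```

This is shorter to write, but std::regex is usually slower than a single pass with std::copy_if, so it is mostly worth it when the pattern is likely to grow more complex.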
Author by

Damian

Updated on July 12, 2022

Comments

  • Damian
    Damian almost 2 years

    I tried to write a script that removes extra white spaces but I didn't manage to finish it.

    Basically I want to transform "abc  sssd g g sdg    gg  gf" into "abc sssd g g sdg gg gf".

    In languages like PHP or C#, it would be very easy, but not in C++, I see. This is my code:

    #include <iostream>
    #include <stdio.h>
    #include <stdlib.h>
    #include <cstring>
    #include <unistd.h>
    #include <string.h>
    
    char* trim3(char* s) {
        int l = strlen(s);
    
        while(isspace(s[l - 1])) --l;
        while(* s && isspace(* s)) ++s, --l;
    
        return strndup(s, l);
    }
    
    char *str_replace(char * t1, char * t2, char * t6)
    {
        char*t4;
        char*t5=(char *)malloc(10);
        memset(t5, 0, 10);
        while(strstr(t6,t1))
        {
            t4=strstr(t6,t1);
            strncpy(t5+strlen(t5),t6,t4-t6);
            strcat(t5,t2);
            t4+=strlen(t1);
            t6=t4;
        }
    
        return strcat(t5,t4);
    }
    
    void remove_extra_whitespaces(char* input,char* output)
    {
        char* inputPtr = input; // init inputPtr always at the last moment.
        int spacecount = 0;
        while(*inputPtr != '\0')
        {
            char* substr;
            strncpy(substr, inputPtr+0, 1);
    
            if(substr == " ")
            {
                spacecount++;
            }
            else
            {
                spacecount = 0;
            }
    
            printf("[%p] -> %d\n",*substr,spacecount);
    
            // Assume the string last with \0
            // some code
            inputPtr++; // After "some code" (instead of what you wrote).
        }   
    }
    
    int main(int argc, char **argv)
    {
        printf("testing 2 ..\n");
    
        char input[0x255] = "asfa sas    f f dgdgd  dg   ggg";
        char output[0x255] = "NO_OUTPUT_YET";
        remove_extra_whitespaces(input,output);
    
        return 1;
    }
    

    It doesn't work. I tried several methods. What I am trying to do is to iterate the string letter by letter and dump it in another string as long as there is only one space in a row; if there are two spaces, don't write the second character to the new string.

    How can I solve this?

  • Damian
    Damian over 8 years
    yes, string > char[0x255] , i agree, but i want to stick with char* because all the code is in char* ...
  • Damian
    Damian over 8 years
    yes, string > char[0x255] , i agree, but i want to stick with char* because all the code is in char* ... , can it be done?
  • anatolyg
    anatolyg over 8 years
    You can convert back and forth - from char* to string by a constructor, and back by c_str() and strcpy. Lots of unnecessary work for the CPU, but less headache for you.
  • Ami Tavory
    Ami Tavory over 8 years
    Not sure you meant to address the comment to me, but see string::c_str.
  • jaggedSpire
    jaggedSpire over 8 years
    this leaves one extra space at the end of the string if it ends in whitespace. Not sure if OP's shifting requirements need that to be taken care of...
  • Deduplicator
    Deduplicator over 8 years
    @anatolyg: If it's done at the right places at the right times, there's probably at most a little amount of extra-work for the optimizer.
  • Christophe
    Christophe over 8 years
    Nice, but the static prev_is_space would not be reset if you executed this block several times (in a loop, in a function, or in several threads). For this to work safely you'd need to capture a local bool that you can reset when needed.
  • Ami Tavory
    Ami Tavory over 8 years
    @jaggedSpire Good point. I must say I thought of that, and decided (perhaps wishfull-thinkingly) that it fits the problem requirements. If not, though, it can be solved with a single line after the application of copy_if.
  • Deduplicator
    Deduplicator over 8 years
    That's a nice one. Though it should have the original's signature, probably.
  • Lol4t0
    Lol4t0 over 8 years
    @Christophe, I see. Thanks.
  • Christophe
    Christophe over 8 years
    @Deduplicator yes, I edited to recommend switching to std::string
  • Damian
    Damian over 8 years
    yes, i agree as well, string is the best, but all the script is written (2000 lines) using char* ... and this script must run on centos 4, 5.1 , debian 4, unix based systems ... and so on, and it is better to use the simplest functions possible, to not get segmentation fault ...
  • Damian
    Damian over 8 years
    hmm, very interesting, so basically your int remove_whitesaces(char *p) function does not have to take two parameters, it just modifies the string "on the fly" with the power of pointers, right?
  • Jts
    Jts over 8 years
    Yeah, because the output length will always be equal or lower than the input length, so there's no need to create another object. I also overloaded it to support std::strings (and again no memory allocation takes place). I thought you would accept my answer since it's actually customizable (and doesn't accept tabs ('\t') which are considered spaces by almost everyone. And it can ignore line breaks if needed.
  • Jts
    Jts over 8 years
    Your function doesn't work properly. If there are spaces at the beginning or the end, it keeps them. Not what the OP wants.
  • villapx
    villapx over 8 years
    No problem. Note also that remove_extra_whitespaces() assumes that the final string won't overflow the memory allocated for output; if it does, you'd likely get a segmentation fault.
  • Christophe
    Christophe over 8 years
    @José my function removes redundant spaces as requested by the OP. I couldn't find any evidence in the question that the starting space or the ending space should be removed. If this would be a requirement, you'd just replace input.begin() with a find_if() and add a conditional erase before returning.
  • Christophe
    Christophe over 8 years
    @Damian the nice thing with the algorithm library is that many algorithms also work with pointers instead of iterators. Here the online demo using the same algorithm , yet keeping c-style strings as you like them ;-)
  • Deduplicator
    Deduplicator over 8 years
    Just two general comments: 1. using namespace is a scourge, only acceptable when the namespace is guaranteed to only contain the symbols you want to import. 2. std::endl does a manual flush, which is generally simply wasteful.
  • Deduplicator
    Deduplicator over 8 years
    BTW: You might want to add the cstring-solution to your answer.
  • Peter Cordes
    Peter Cordes over 8 years
    @Damian: using simpler functions is no guarantee of avoiding bugs. The more code you have to write yourself, instead of using library tools, the more chance there is of having a bug. Obviously you have to understand the library functions you use, and C++ has way more than C.
  • Peter - Reinstate Monica
    Peter - Reinstate Monica over 8 years
    This is an elegant solution (stateful predicate).
  • Damian
    Damian about 8 years
    sscanf is a function that can be used in ANSI C (plain C) as well?
  • Peter - Reinstate Monica
    Peter - Reinstate Monica about 8 years
    @Damian Oh yes, it is. It's part of the C standard (and with it, part of the POSIX standard for Unix-like systems).
  • Damian
    Damian about 8 years
    thank you, you know, C is a very old programming language, it gives me headaches all the time ... look at this : stackoverflow.com/questions/35873677/…
  • Damian
    Damian about 8 years
    C is a very old programming language, it gives me headaches all the time ... look at this : stackoverflow.com/questions/35873677/…