regex with only numbers in a string c++

Solution 1

Actually, the C++ std::regex library does support lookaheads.

Here is my suggestion:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    std::string buffer = " li 12.12 si 43,23 45 31 uf 889 uf31 3.12345";
    std::regex rx(R"((?:^|\s)([+-]?[[:digit:]]+(?:\.[[:digit:]]+)?)(?=$|\s))"); // Declare the regex with a raw string literal
    std::smatch m;
    std::string str = buffer;
    while (regex_search(str, m, rx)) {
        std::cout << "Number found: " << m[1] << std::endl; // Get Captured Group 1 text
        str = m.suffix().str(); // Proceed to the next match
    }  
    return 0;
}

See IDEONE demo

Because the pattern is declared as a raw string literal, there is no need to double the backslash in \s.
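To make the equivalence concrete, here is a small sketch that spells the same pattern text both ways. The helper names escaped_pattern and raw_pattern are made up for this illustration, and the pattern is a shortened stand-in, not the one from the answer.

```cpp
#include <string>

// Ordinary literal: every backslash meant for the regex engine must be doubled.
std::string escaped_pattern() {
    return "(?:^|\\s)\\d+(?=$|\\s)";
}

// Raw literal: the text between R"( and )" is taken verbatim.
std::string raw_pattern() {
    return R"((?:^|\s)\d+(?=$|\s))";
}
```

Both functions return the same characters, so either string can be handed to the std::regex constructor.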

The lookahead (?=$|\s) checks that the number is followed by whitespace or the end of the string without consuming that whitespace, so consecutive numbers can still be extracted.
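The effect of not consuming the delimiter can be seen by comparing the pattern with a variant that does consume it. This is only a sketch: capture_all, with_lookahead and consuming are names invented here, the consuming variant is my own contrast case and not part of the answer, and std::sregex_iterator is used instead of the suffix() loop.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect capture group 1 of every match of `pattern` in `text`.
std::vector<std::string> capture_all(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it) {
        found.push_back((*it)[1].str());
    }
    return found;
}

// The pattern from the answer: the trailing delimiter is only inspected.
const std::string with_lookahead =
    R"((?:^|\s)([+-]?[[:digit:]]+(?:\.[[:digit:]]+)?)(?=$|\s))";

// Contrast variant (an assumption of this sketch): the trailing delimiter is consumed.
const std::string consuming =
    R"((?:^|\s)([+-]?[[:digit:]]+(?:\.[[:digit:]]+)?)(?:$|\s))";
```

For the input " 41 31 ", the lookahead pattern yields 41 and 31, while the consuming variant yields only 41, because the space after 41 is already used up and can no longer serve as the leading delimiter of 31.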

Note that if you need to extract decimal values like .5, you need:

R"((?:^|\s)([+-]?[[:digit:]]*\.?[[:digit:]]+)(?=$|\s))"
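To see which token shapes the number part of this extended pattern accepts, it can be tried on its own with std::regex_match. This is a rough sketch: is_decimal_token is a hypothetical helper, and it tests only the inner group without the surrounding whitespace checks.

```cpp
#include <regex>
#include <string>

// True if the whole token has the shape accepted by the extended pattern:
// optional sign, optional integer digits, optional dot, at least one final digit.
bool is_decimal_token(const std::string& token) {
    static const std::regex rx(R"([+-]?[[:digit:]]*\.?[[:digit:]]+)");
    return std::regex_match(token, rx);
}
```

With this shape, .5 and -.75 are accepted alongside 12.12, while a trailing dot as in 5. is still rejected because the pattern must end in a digit.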

Solution 2

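You need this regex (it relies on a lookbehind, which std::regex does not support, so it needs an engine such as Boost.Regex):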

(?<!,)\b([\d\.]+)\b(?!,)
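One thing worth checking before using this with the standard library: as far as I know, the ECMAScript grammar that std::regex uses by default has lookahead but no lookbehind, so a pattern containing (?<!...) should be rejected when the std::regex object is constructed. A minimal sketch to probe this on your own toolchain (compiles_with_std_regex is a name made up for this example):

```cpp
#include <regex>
#include <string>

// Returns true if std::regex (default ECMAScript grammar) accepts the pattern.
bool compiles_with_std_regex(const std::string& pattern) {
    try {
        std::regex rx(pattern);
        return true;
    } catch (const std::regex_error&) {
        // Thrown by the constructor when the pattern is not valid in this grammar.
        return false;
    }
}
```

Calling it with R"((?<!,)\b([\d\.]+)\b(?!,))" tells you whether your implementation accepts the lookbehind. Dropping the (?<!,) part leaves a pattern that uses only constructs std::regex is expected to handle.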

Solution 3

As stribizhev states, this can only be accomplished with lookarounds, since a single whitespace character separating two numbers would otherwise have to be consumed both by the match before it and by the match after it.

user2079303 poses a viable alternative to regexes, which can be simplified to the point where it rivals the simplicity of a regex:

// istream_iterator takes the stream by non-const reference, so it needs a named object
istringstream input(" li 12.12 si 43,23 45 31 uf 889 uf31 3.12345");

for_each(istream_iterator<string>(input),
         istream_iterator<string>(),
         [](const string& i) {
            char* it;
            double num = strtod(i.c_str(), &it);
            // Print the token only if strtod consumed all of it
            if (it == i.c_str() + i.size()) cout << num << endl; });

However, it is possible to accomplish this without the weight of an istringstream or a regex, by simply using strtok:

char buffer[] = " li 12.12 si 43,23 45 31 uf 889 uf31 3.12345";

for (auto i = strtok(buffer, " \f\n\r\t\v"); i != nullptr; i = strtok(nullptr, " \f\n\r\t\v")) {
    char* it;
    double num = strtod(i, &it);

    if (*it == '\0') cout << num << endl;
}

Note that for the delimiter argument I'm simply using the characters for which isspace returns true in the default locale.
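If the results are needed as values and not just printed, the same loop can be wrapped in a function. The following is a sketch under my own assumptions: whole_token_numbers is an invented name, and it copies the input first because strtok overwrites delimiters in the buffer it scans (it also keeps internal state, so it is not suited to nested or concurrent tokenizing).

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Return the value of every whitespace-delimited token that strtod consumes completely.
std::vector<double> whole_token_numbers(const std::string& text) {
    // strtok writes '\0' into its input, so work on a private, null-terminated copy.
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');

    const char* const delimiters = " \f\n\r\t\v";
    std::vector<double> numbers;
    for (char* token = std::strtok(buffer.data(), delimiters); token != nullptr;
         token = std::strtok(nullptr, delimiters)) {
        char* end = nullptr;
        const double value = std::strtod(token, &end);
        if (*end == '\0') numbers.push_back(value);  // the entire token was numeric
    }
    return numbers;
}
```

For the string from the question this collects 12.12, 45, 31, 889 and 3.12345, and skips 43,23 and uf31 because strtod stops before the end of those tokens.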

Solution 4

Regexes are usually unreadable and hard to prove correct. Regexes matching only valid rational numbers need to be intricate and are easy to mess up. Therefore, I propose an alternative approach. Instead of regexes, tokenize your string with C++ and use std::strtod to test whether each token is a valid number. Here is some example code:

std::vector<std::string> split(const std::string& str) {
    std::istringstream iss(str);
    return {
        std::istream_iterator<std::string>{iss},
        std::istream_iterator<std::string>{}
    };
}

bool isValidNumber(const std::string& str) {
    char* end;
    std::strtod(str.data(), &end);
    return *end == '\0';
}

// ...
auto tokens = split(" li 12.12 si 43,23 45 31 uf 889 uf31 3.12345");
std::vector<std::string> matches;
std::copy_if(tokens.begin(), tokens.end(), std::back_inserter(matches), isValidNumber);
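
One design consequence to be aware of: std::strtod accepts more spellings than the regexes in the other answers, for example exponents, hexadecimal notation and inf/nan. Whether that is wanted depends on the input. A small sketch of the whole-token test, with a guard for the empty string added as my own assumption (parses_fully is a hypothetical name):

```cpp
#include <cstdlib>
#include <string>

// True if strtod consumes the entire, non-empty token.
bool parses_fully(const std::string& token) {
    if (token.empty()) return false;  // nothing to convert
    char* end = nullptr;
    std::strtod(token.c_str(), &end);
    return *end == '\0';
}
```

In the default "C" locale this accepts 12.12 but also 1e3, 0x10 and inf, while 43,23 and uf31 are rejected. If only plain decimals should count, an additional check on the characters of the token would be needed.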
Author by

Mau

Updated on June 04, 2022

Comments

  • Mau
    Mau almost 2 years

    I'm looking for a regex to find numbers in a string; if I have a string like:

    li 12.12 si 43,23 45 31 uf 889 uf31 3.12345

    I want to find only the numbers:

    12.12 45 31 889 3.12345

    I tried with the following pattern:

    ((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?

    but the output included uf31 and 43,23.

    I tried with:

    (?!([a-z]*((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?[a-z]*))?((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?

    but this gave the same result.

    What is the solution?

    SOLUTION leave to posterity the solution:

  • Mau
    Mau over 8 years
    thank! but with your regex i print token . token , token .
  • Jonathan Mee
    Jonathan Mee over 8 years
    @user3641602 This will match 1.2.3... Do you want to enforce correct numbering on your number?
  • Jonathan Mee
    Jonathan Mee over 8 years
    C++ doesn't support look aheads or look behinds
  • Karoly Horvath
    Karoly Horvath over 8 years
    ATM I don't really see another way of doing it.
  • Mike P
    Mike P over 8 years
    This regex also retrieves 43,23 and the 31 from uf31. Don't think that's what the OP wanted.
  • Mayur Koshti
    Mayur Koshti over 8 years
    do you need 43 and 23 separately and don't want 31 from the uf31? Am I right?
  • Mayur Koshti
    Mayur Koshti over 8 years
    Then you just need a modification to the current regex: \b([\d\.]+)\b
  • Jonathan Mee
    Jonathan Mee over 8 years
    From the question it seems that the OP does not want "43,23" to be captured, that seems to be one of the reasons that he asked the question in the first place. But he also doesn't seem to want numbers that are not whitespace delimited from letters.
  • Wiktor Stribiżew
    Wiktor Stribiżew over 8 years
    No need escaping if raw string literal is used. 31 is not matched, BTW.
  • Jonathan Mee
    Jonathan Mee over 8 years
    @MayurKoshti No, that's wrong. That will still pick up the 43 and 23 from "43,23", which the OP expressly stated that he did not want.
  • Jonathan Mee
    Jonathan Mee over 8 years
    @KarolyHorvath Wrong, notice those are non-capturing parentheses.
  • Mayur Koshti
    Mayur Koshti over 8 years
    Sorry Jonathan Mee! I modified regex : (?<!,)\b([\d\.]+)\b(?!,)
  • Jonathan Mee
    Jonathan Mee over 8 years
    This has the bugs of your original regex: it captures 1.2.3... but now it's also picked up the need for Boost in the regex by @KarolyHorvath
  • Mayur Koshti
    Mayur Koshti over 8 years
    I modified my original regex.
  • eerorika
    eerorika over 8 years
    @KarolyHorvath the current, modified answer gives the output that OP wants with the given string. Since OP doesn't specify what kind of numbers they want and what they don't want, I'd say this answer is a simple solution.
  • Jonathan Mee
    Jonathan Mee over 8 years
    I stand corrected, I was using Visual Studio 2013 last I tested look arounds. It appears C++ now fully supports ECMAScript! However I'd still make the case that look arounds are the most expensive regex operation. They should be avoided unless absolutely necessary, which they aren't here.
  • Jonathan Mee
    Jonathan Mee over 8 years
    @user2079303 No it does not give what the OP wants, he doesn't want to capture the 43 or 23 from "43,23" this will capture that.
  • eerorika
    eerorika over 8 years
    @JonathanMee no, it does not. The look-ahead and look-behind checks for comma prevent that. Unless the engine doesn't support those - like you asserted in a comment to another answer - this does not capture 43,23.
  • Wiktor Stribiżew
    Wiktor Stribiżew over 8 years
    In this case, following this logic, the look-ahead is a must. You can't match the numbers in <SPACE>41<SPACE>31<SPACE> without a look-ahead.
  • Jonathan Mee
    Jonathan Mee over 8 years
    Impressive way to think about it, but this will capture: "123abc" and "12#3" do you have a way to work around that?
  • Jonathan Mee
    Jonathan Mee over 8 years
    @user2079303 Correction, I meant to type that this will pull from any symbol other than a comma: "12#3" for example will capture 12 and 3.
  • Jonathan Mee
    Jonathan Mee over 8 years
    Is your "<SPACE>" a literal whitespace? If not from my understanding the OP wouldn't want those numbers to match. If it is a literal whitespace I don't see why you need look arounds?
  • bobble bubble
    bobble bubble over 8 years
    @JonathanMee This approach only makes sense, if cases that could occur are known. For your samples have to add those cases like this.
  • Jonathan Mee
    Jonathan Mee over 8 years
    Just a note: [^\\s] is looking for characters that are not '\\' or 's'. What you actually meant was \S
  • Karoly Horvath
    Karoly Horvath over 8 years
    @JonathanMee: I escaped all the backslashes, assuming it's in a string. I probably should have used quotes.
  • Wiktor Stribiżew
    Wiktor Stribiżew over 8 years
    @JonathanMee: Please have a look at your results - your regex does not match the expected 31.
  • Jonathan Mee
    Jonathan Mee over 8 years
    You beat me to the use of strtod +1
  • eerorika
    eerorika over 8 years
    +1 Thanks for the simplified use of second parameter of strtod. Took me a while to understand the documentation.
  • Mau
    Mau over 8 years
    yes, it's a possible way. But i have the solution of the problem. I would like to reduce my code via regex, because if you use regex then you have a powerful tool by hands!! :) But, such as you are mentioned before, "Regexes are usually unreadable and hard to prove correct." :)
  • Jonathan Mee
    Jonathan Mee over 8 years
    @user3641602 His solution is I believe a simpler one than the regex solution in the first place. I've streamlined his code in one of the options I provide in my answer: stackoverflow.com/a/33521413/2642059
  • Mau
    Mau over 8 years
    first is correct, but ignore 31 and .5 second ignore always
  • Simon Kraemer
    Simon Kraemer over 8 years
    31 is not ignored - I just tested both variants. You're right about .5 - I'll update my answer
  • Jonathan Mee
    Jonathan Mee over 8 years
    Cancel that, my regex without lookarounds did work fine but I believe that the best solution is the use of strtod, which I have changed my answer to use.
  • Mau
    Mau over 8 years
    @JonathanMee cplusplus.com/reference/regex/ECMAScript c++ support lookahead
  • Mau
    Mau over 8 years
    I refuse and have always refused to use Boost. I prefer to use standard, for compatibility in team.
  • Karoly Horvath
    Karoly Horvath over 8 years
    @user3641602: That's your own personal preference and choice, but I don't think it's something that should be advertised in comments.
  • Jonathan Mee
    Jonathan Mee over 8 years
    @user3641602 You don't need Boost, I was wrong, this solution works without it. However the look ahead and behinds are expensive. Prefer a solution that doesn't use them: ideone.com/JSuULo or better yet a solution that uses strtod.
  • Wiktor Stribiżew
    Wiktor Stribiżew over 8 years
    @user3641602: Glad it works for you, please consider accepting the answer.
  • Mau
    Mau over 8 years
    @stribizhev just once: why do you write R before regex??
  • Wiktor Stribiżew
    Wiktor Stribiżew over 8 years
    It is a raw string literal. The notation is R"()". Inside the parentheses, \ symbol means a literal \ symbol, not a C-escaping symbol.
  • Jonathan Mee
    Jonathan Mee over 8 years
    @user2079303 It seems we're the only ones on board with the strtod :( Ah well, if you're looking for a better explanation of how to use it you might want to check out: stackoverflow.com/q/32991193/2642059