How do I write a simple regular expression pattern matching function in C or C++?

33,037

Solution 1

Brian Kernighan provided a short article on A Regular Expression Matcher that Rob Pike wrote as a demonstration program for a book they were working on. The article is a very nice read explaining a bit about the code and regular expressions in general.

I have played with this code, making a few changes to experiment with some extensions such as to also return where in the string the pattern matches so that the substring matching the pattern can be copied from the original text.

From the article:

I suggested to Rob that we needed to find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and non-trivial class of patterns. Ideally, the code would fit on a single page.

Rob disappeared into his office, and at least as I remember it now, appeared again in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of TPOP. That code implements a regular expression matcher that handles these constructs:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

This is quite a useful class; in my own experience of using regular expressions on a day-to-day basis, it easily accounts for 95 percent of all instances. In many situations, solving the right problem is a big step on the road to a beautiful program. Rob deserves great credit for choosing so wisely, from among a wide set of options, a very small yet important, well-defined and extensible set of features.

Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of a good notation in making a program easier to use and perhaps easier to write as well, the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics.

The actual C source code from the article is very very nice.

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

Solution 2

See This Question for a solution you can not submit. See this paper for a description of how to implement a more readable one.

Solution 3

Here is recursive extendable implementation. Tested for first order of pattern complexity.

#include <string.h>
#include <string>
#include <vector>
#include <iostream>

struct Match {
  Match():_next(0) {}
  virtual bool match(const char * pattern, const char * input) const {
    return !std::strcmp(pattern, input);
  }
  bool next(const char * pattern, const char * input) const {
    if (!_next) return false;
    return _next->match(pattern, input);
  }
  const Match * _next;
};

class MatchSet: public Match {
  typedef std::vector<Match *> Set;
  Set toTry;
public:
  virtual bool match(const char * pattern, const char * input) const {
    for (Set::const_iterator i = toTry.begin(); i !=toTry.end(); ++i) {
      if ((*i)->match(pattern, input)) return true;
    }
    return false;
  }
  void add(Match * m) {
    toTry.push_back(m);
    m->_next = this;
  }
  ~MatchSet() {
    for (Set::const_iterator i = toTry.begin(); i !=toTry.end(); ++i)
      if ((*i)->_next==this) (*i)->_next = 0;
  }
};

struct MatchQuestion: public Match  {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0] != '?')
      return false;
    if (next(pattern+1, input))
      return true;
    if (next(pattern+1, input+1))
      return true;
    return false;
  }
};

struct MatchEmpty: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0]==0 && input[0]==0)
      return true;
    return false;
  }
};

struct MatchAsterisk: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0] != '*')
      return false;
    if (pattern[1] == 0) {
      return true;
    }
    for (int i = 0; input[i] != 0; ++i) {
      if (next(pattern+1, input+i))
        return true;
    }
    return false;
  }
};

struct MatchSymbol: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    // TODO: consider cycle here to prevent unnecessary recursion
    // Cycle should detect special characters and call next on them
    // Current implementation abstracts from that
    if (pattern[0] != input[0])
      return false;
    return next(pattern+1, input+1);
  }
};

class DefaultMatch: public MatchSet {
  MatchEmpty empty;
  MatchQuestion question;
  MatchAsterisk asterisk;
  MatchSymbol symbol;
public:
  DefaultMatch() {
    add(&empty);
    add(&question);
    add(&asterisk);
    add(&symbol);
  }
  void test(const char * p, const char * input) const {
    testOneWay(p, input);
    if (!std::strcmp(p, input)) return;
    testOneWay(input, p);
  }
  bool testOneWay(const char * p, const char * input) const {
    const char * eqStr = " == ";
    bool rv = match(p, input);
    if (!rv) eqStr = " != ";
    std::cout << p << eqStr << input << std::endl;
    return rv;
  }

};


int _tmain(int argc, _TCHAR* argv[])
{
  using namespace std;

  typedef vector<string> Strings;
  Strings patterns;

  patterns.push_back("*");
  patterns.push_back("*hw");
  patterns.push_back("h*w");
  patterns.push_back("hw*");

  patterns.push_back("?");
  patterns.push_back("?ab");
  patterns.push_back("a?b");
  patterns.push_back("ab?");

  patterns.push_back("c");
  patterns.push_back("cab");
  patterns.push_back("acb");
  patterns.push_back("abc");

  patterns.push_back("*this homework?");
  patterns.push_back("Is this homework?");
  patterns.push_back("This is homework!");
  patterns.push_back("How is this homework?");

  patterns.push_back("hw");
  patterns.push_back("homework");
  patterns.push_back("howork");

  DefaultMatch d;
  for (unsigned i = 0; i < patterns.size(); ++i)
    for (unsigned j =i; j < patterns.size(); ++j)
      d.test(patterns[i].c_str(), patterns[j].c_str());

    return 0;
}

If something is unclear, ask.

Solution 4

Cheat. Use #include <boost/regex/regex.hpp>.

Solution 5

try to make a list of interesting test cases:

is_match("dummy","dummy") should return true;

is_match("dumm?y","dummy") should return true;

is_match("dum?y","dummy") should return false;

is_match("dum*y","dummy") should return true;

and so on ...

then see how to make the easier test pass, then the next one ...

Share:
33,037
Tracy
Author by

Tracy

Those who wish to succeed must ask the right preliminary questions You could email me: [email protected]

Updated on July 09, 2022

Comments

  • Tracy
    Tracy almost 2 years

    This is a question in my paper test today, the function signature is

    int is_match(char* pattern,char* string)
    

    The pattern is limited to only ASCII chars and the quantification * and ?, so it is relatively simple. is_match should return 1 if matched, otherwise 0.

    How do I do this?