Chapter 06 Regular Expression

    A regular expression describes a pattern for matching strings. Regular expressions are mainly used for the following three tasks:

    1. Check whether a string contains a substring of a certain form;
    2. Replace the matching substrings;
    3. Extract the matching substring from a string.
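
    The three tasks above can be sketched with the standard library as follows. This is only a minimal sketch: the helper names contains_digits, mask_digits, and first_digits and the digit pattern are illustrative assumptions, and std::regex_search and std::regex_replace are standard functions that this chapter does not otherwise cover.

```cpp
#include <regex>
#include <string>

// 1. Check: does the text contain a run of digits anywhere?
bool contains_digits(const std::string& text) {
    static const std::regex digits("[0-9]+");
    return std::regex_search(text, digits);
}

// 2. Replace: substitute every run of digits with '#'.
std::string mask_digits(const std::string& text) {
    static const std::regex digits("[0-9]+");
    return std::regex_replace(text, digits, "#");
}

// 3. Extract: return the first run of digits, or "" if there is none.
std::string first_digits(const std::string& text) {
    static const std::regex digits("[0-9]+");
    std::smatch m;
    return std::regex_search(text, m, digits) ? m[0].str() : std::string();
}
```

    For instance, mask_digits("a1b22c") yields "a#b#c", and first_digits("room 42, floor 7") yields "42".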

    Regular expressions are text patterns consisting of ordinary characters (such as a to z) and special characters. A pattern describes one or more strings to match when searching for text. Regular expressions act as a template to match a character pattern to the string being searched.

    Normal characters include all printable and unprintable characters that are not explicitly specified as metacharacters. This includes all uppercase and lowercase letters, all numbers, all punctuation, and some other symbols.

    A special character is a character with special meaning in a regular expression, and is also the core matching syntax of a regular expression. See the table below:

    A quantifier specifies how many times a given component of a regular expression must appear for the match to succeed. See the table below:

    Character   Description
    *           Matches the preceding subexpression zero or more times. For example, foo* matches fo and foooo. * is equivalent to {0,}.
    +           Matches the preceding subexpression one or more times. For example, foo+ matches foo and foooo but does not match fo. + is equivalent to {1,}.
    ?           Matches the preceding subexpression zero or one time. For example, Your(s)? matches both Your and Yours. ? is equivalent to {0,1}.
    {n}         n is a non-negative integer. Matches exactly n times. For example, o{2} cannot match the single o in for, but matches two consecutive o characters.
    {n,}        n is a non-negative integer. Matches at least n times. For example, o{2,} cannot match the o in for, but matches all the o characters in foooooo. o{1,} is equivalent to o+. o{0,} is equivalent to o*.
    {n,m}       m and n are non-negative integers, where n is less than or equal to m. Matches at least n and at most m times. For example, o{1,3} matches the first three o characters in foooooo. o{0,1} is equivalent to o?. Note that there can be no spaces between the comma and the two numbers.
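
    The quantifiers in the table can be tried out with a small helper. The function full_match below is a hypothetical convenience wrapper (not part of the standard library) around std::regex_match, which succeeds only when the whole string matches the pattern; it assumes the default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Returns true if the whole of `text` matches `pattern`.
// Hypothetical helper for experimenting with quantifiers.
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

    Under these assumptions, full_match("fo", "foo*") and full_match("foooo", "foo*") return true, full_match("fo", "foo+") returns false, and full_match("foooo", "fo{1,3}") returns false because the string contains four o characters.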

    With these two tables, we can usually read almost all regular expressions.

    6.2 std::regex and Related Facilities

    Before C++11, the usual solution was to use the regular expression library from Boost. C++11 officially incorporates regular expression processing into the standard library, so standard support is available without relying on third parties.

    The regular expression library provided by C++11 operates on std::string objects. A pattern is stored in a std::regex (an alias for std::basic_regex<char>), and matching it against a string with std::regex_match produces a std::smatch (an alias for std::match_results<std::string::const_iterator>).
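
    The alias relationships described above can be checked at compile time. The following sketch only restates them as static_assert declarations and introduces no names of its own.

```cpp
#include <regex>
#include <string>
#include <type_traits>

// std::regex is the char specialization of std::basic_regex.
static_assert(std::is_same<std::regex, std::basic_regex<char>>::value,
              "std::regex is std::basic_regex<char>");

// std::smatch holds results expressed as iterators into a std::string.
static_assert(std::is_same<std::smatch,
                           std::match_results<std::string::const_iterator>>::value,
              "std::smatch is std::match_results over string iterators");
```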

    We use a simple example to briefly introduce the use of this library. Consider the following regular expression:

    • [a-z]+\.txt: In this regular expression, [a-z] matches one lowercase letter, and + repeats the preceding expression one or more times, so [a-z]+ matches a string of one or more lowercase letters. In a regular expression, . matches any single character other than a line terminator, whereas \. matches a literal . character; the final txt matches exactly the three letters txt. The expression therefore matches the name of a text file consisting purely of lowercase letters.

    std::regex_match tests whether an entire string matches a regular expression, and it has several overloads. The simplest form takes a std::string and a std::regex; it returns true when the match succeeds and false otherwise.
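
    A minimal sketch of this two-argument form, using the pattern discussed above. The function name is_lowercase_txt is a hypothetical wrapper introduced only for illustration.

```cpp
#include <regex>
#include <string>

// Returns true if `fname` is one or more lowercase letters followed by ".txt".
// The function name is hypothetical.
bool is_lowercase_txt(const std::string& fname) {
    // In a C++ string literal the backslash itself must be escaped, hence "\\.".
    // The raw string literal R"([a-z]+\.txt)" would express the same pattern.
    static const std::regex txt_regex("[a-z]+\\.txt");
    return std::regex_match(fname, txt_regex);
}
```

    With this helper, "foo.txt" is accepted, while "test" (no extension), "a0.txt" (contains a digit), and "AAA.txt" (uppercase letters) are rejected.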

    Another common form takes three arguments: a std::string, a std::smatch, and a std::regex. In the standard library, std::smatch is defined as std::match_results<std::string::const_iterator>, that is, a match_results over string iterators. With std::smatch the matching results are easy to retrieve, for example:

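    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    int main() {
        // sample file names to test (illustrative values)
        std::vector<std::string> fnames = {"foo.txt", "bar.txt", "test", "a0.txt", "AAA.txt"};
        std::regex base_regex("([a-z]+)\\.txt");
        std::smatch base_match;
        for (const auto &fname : fnames) {
            if (std::regex_match(fname, base_match, base_regex)) {
                // the first element of std::smatch is the match of the entire string
                // the second element is the match of the first parenthesized subexpression
                if (base_match.size() == 2) {
                    std::string base = base_match[1].str();
                    std::cout << "sub-match[0]: " << base_match[0].str() << std::endl;
                    std::cout << fname << " sub-match[1]: " << base << std::endl;
                }
            }
        }
    }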

    The output of the above two code snippets is:

    Exercise

    In web server development, we usually want to serve some routes that satisfy a certain condition. Regular expressions are one of the tools to accomplish this. Given the following request structure:

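    #include <istream>
    #include <memory>
    #include <regex>
    #include <string>
    #include <unordered_map>

    struct Request {
        // request method (POST, GET), path, and HTTP version
        std::string method, path, http_version;
        // use a smart pointer for reference counting of the content
        std::shared_ptr<std::istream> content;
        // hash container, key-value dictionary of header fields
        std::unordered_map<std::string, std::string> header;
        // result of matching the path against a route's regular expression
        std::smatch path_match;
    };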

    Requested resource type:

    And server template:

    template <typename socket_type>
    class ServerBase {
    public:
        resource_type resource;
        resource_type default_resource;
        void start() {
            // TODO
        }
    protected:
        Request parse_request(std::istream& stream) const {
            // TODO
        }
    };

    Please implement the member functions start() and parse_request. Enable server template users to specify routes as follows:
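
    As a rough illustration of regex-based routing, the sketch below registers one route and dispatches a request to it. Everything in it is an assumption made for illustration and not the book's definition: the shape of resource_type (pattern to method to handler), the reduced Request stand-in, the helper names dispatch and make_routes, and the example route itself.

```cpp
#include <functional>
#include <map>
#include <ostream>
#include <regex>
#include <string>
#include <unordered_map>

// Reduced stand-in for the Request struct in the text (only the fields used here).
struct Request {
    std::string method, path;
    std::smatch path_match;
};

// Assumed shape: path pattern -> (HTTP method -> handler).
using handler_type = std::function<void(std::ostream&, Request&)>;
using resource_type =
    std::map<std::string, std::unordered_map<std::string, handler_type>>;

// Hypothetical dispatcher: run the first handler whose pattern matches the
// whole request path and whose method matches; report whether one was found.
bool dispatch(const resource_type& resource, Request& request, std::ostream& response) {
    for (const auto& entry : resource) {
        const std::regex pattern(entry.first);
        if (!std::regex_match(request.path, request.path_match, pattern)) continue;
        const auto handler = entry.second.find(request.method);
        if (handler == entry.second.end()) continue;
        handler->second(response, request);
        return true;
    }
    return false;
}

// Hypothetical route table with a single GET route that captures a numeric id.
resource_type make_routes() {
    resource_type resource;
    resource["^/match/([0-9]+)/?$"]["GET"] = [](std::ostream& response, Request& request) {
        response << "id=" << request.path_match[1].str();
    };
    return resource;
}
```

    Storing the std::smatch in the request lets a handler read captured groups, such as the numeric id here, without matching the path a second time. The match results refer to request.path, so the path must not be modified while they are in use.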

    A suggested solution is available.
