Chapter 10. Boost.Tokenizer

    Example 10.1. Iterating over partial expressions in a string with boost::tokenizer

    Boost.Tokenizer defines a class template called boost::tokenizer in boost/tokenizer.hpp. It expects as a template parameter a class that identifies coherent expressions. Example 10.1 uses the class boost::char_separator, which interprets spaces and punctuation marks as separators.

    A tokenizer must be initialized with a string of type std::string. Using the member functions begin() and end(), the tokenizer can be accessed like a container. Partial expressions of the string used to initialize the tokenizer are available via iterators. How partial expressions are evaluated depends on the kind of class passed as the template parameter.

    Because boost::char_separator interprets spaces and punctuation marks as separators by default, Example 10.1 displays Boost, C, +, +, and Libraries. boost::char_separator uses std::isspace() and std::ispunct() to identify separator characters. Boost.Tokenizer distinguishes between separators that should be displayed and separators that should be suppressed. By default, spaces are suppressed and punctuation marks are displayed.

    Example 10.2. Initializing boost::char_separator to adapt the iteration

    #include <boost/tokenizer.hpp>
    #include <string>
    #include <iostream>

    int main()
    {
      typedef boost::tokenizer<boost::char_separator<char>> tokenizer;
      std::string s = "Boost C++ Libraries";
      // Only the space is a separator; it is suppressed in the output.
      boost::char_separator<char> sep{" "};
      tokenizer tok{s, sep};
      for (const auto &t : tok)
        std::cout << t << '\n';
    }

    To keep punctuation marks from being interpreted as separators, initialize the boost::char_separator object before passing it to the tokenizer.

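    The constructor of boost::char_separator accepts up to three parameters. The first, " " in Example 10.2, lists the separators that are suppressed. The second parameter specifies the separators that should be displayed. If it is omitted, as in Example 10.2, no separators are displayed, and the program displays Boost, C++, and Libraries.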

    Example 10.3. Simulating the default behavior with boost::char_separator

    If a plus sign is passed as the second parameter, Example 10.3 behaves like Example 10.1.

    The third parameter determines whether or not empty partial expressions are displayed. If two separators are found back-to-back, the corresponding partial expression is empty. By default, these empty expressions are not displayed. Using the third parameter, the default behavior can be changed.

    Example 10.4. Initializing boost::char_separator to display empty partial expressions

    #include <boost/tokenizer.hpp>
    #include <string>
    #include <iostream>

    int main()
    {
      typedef boost::tokenizer<boost::char_separator<char>> tokenizer;
      std::string s = "Boost C++ Libraries";
      // Third parameter: also report empty partial expressions.
      boost::char_separator<char> sep{" ", "+", boost::keep_empty_tokens};
      tokenizer tok{s, sep};
      for (const auto &t : tok)
        std::cout << t << '\n';
    }

    Example 10.4 displays two additional empty partial expressions. The first one is found between the two plus signs, while the second one is found between the second plus sign and the following space.

    Example 10.5 iterates over a string of type std::wstring. To support this string type, the tokenizer must be instantiated with additional template parameters. The class boost::char_separator must also be instantiated with wchar_t.

    Besides boost::char_separator, Boost.Tokenizer provides two additional classes to identify partial expressions.

    Example 10.6. Parsing CSV files with boost::escaped_list_separator

    #include <boost/tokenizer.hpp>
    #include <string>
    #include <iostream>

    int main()
    {
      typedef boost::tokenizer<boost::escaped_list_separator<char>> tokenizer;
      std::string s = "Boost,\"C++ Libraries\"";
      tokenizer tok{s};
      for (const auto &t : tok)
        std::cout << t << '\n';
    }

    boost::escaped_list_separator is used to read multiple values separated by commas. This format is commonly known as CSV (Comma Separated Values). boost::escaped_list_separator also handles double quotes and escape sequences. Therefore, the output of Example 10.6 is Boost and C++ Libraries.

    The second class provided is boost::offset_separator, which must be instantiated. The corresponding object must be passed to the constructor of boost::tokenizer as a second parameter.

    Example 10.7. Iterating over partial expressions with boost::offset_separator