手写 parser
- 预测分析器
- 完整示例代码

手写 parser

在龙书 2.5 节，有关于语法制导翻译的完整示例代码，即实现一个翻译器将表达式从中缀表达式转换成后缀表达式，之后我又想实现 2.3 节中语法制导定义来实现这个转换。在这个过程中对于龙书 2.4 Parsing 节又有了更多的理解，于是整理记录如下。

通常手写 parser 时，会使用 top-down parsing 技术。递归下降分析是 top-down parsing 的一种技术，而预测分析是递归下降分析的一种简单形式，比如编程语言 lua 的解析器就是使用预测分析技术，见 lparser.c 。

龙书 2.4 Parsing 小节中关于 top-down parsing 的描述如下，

In top-down parsers, construction starts at the root and proceeds towards the leaves, while in bottom-up parsers, construction starts at the leaves and proceeds towards the root. The popularity of top-down parsers is due to the fact that efficient parsers can be constructed more easily by hand using top-down methods.

Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. One procedure is associated with each nonterminal of a grammar.

Here, we consider a simple form of recursive-descent parsing, called predictive parsing, in which the lookahead symbol unambiguously determines the fow of control through the procedure body for each nonterminal.

预测分析器

在继续后续的内容之前，先整理书中关于设计一个预测分析器的描述，

预测分析器从 start symbol 开始解析

The top-down construction of a parse tree like the one in Fig. 2.17, is done by starting with the root, labeled with the starting nonterminal stmt. (2.4.1 Top-Down Parsing)

Parsing begins with a call of the procedure for the starting nonterminal stmt. (2.4.2 Predictive Parsing)

每个非终结符号对应的 procedure 做 2 件事情

Recall that a predictive parser is a program consisting of a procedure for every nonterminal. The procedure for nonterminal A does two things.

It decides which A-production to use by examining the lookahead symbol. The production with body α (where α is not ε, the empty string) is used if the lookahead symbol is in FIRST(α). If there is a conflict between two nonempty bodies for any lookahead symbol, then we cannot use this parsing method on this grammar. In addition, the ε-production for A, if it exists, is used if the lookahead symbol is not in the FIRST set for any other production body for A.

The procedure then mimics the body of the chosen production. That is, the symbols of the body are “executed” in turn, from the left. A nonterminal is “executed” by a call to the procedure for that nonterminal, and a terminal matching the lookahead symbol is “executed” by reading the next input symbol. If at some point the terminal in the body does not match the lookahead symbol, a syntax error is reported.

可以将翻译方案的 action 代码嵌入到非终结符号对应的 procedure 中执行，这样就通过扩展预测分析器实现了语法制导翻译器

We shall also see that when we have a translation scheme that is, a grammar with embedded actions — it is possible to execute those actions as part of the procedures designed for the parser.
Just as a translation scheme is formed by extending a grammar, a syntax-directed translator can be formed by extending a predictive parser. An algorithm for this purpose is given in Section 5.4. The following limited construction suffices for the present:

Construct a predictive parser, ignoring the actions in productions.

Copy the actions from the translation scheme into the parser. If an action appears after grammar symbol X in production p, then it is copied after the implementation of X in the code for p. Otherwise, if it appears at the beginning of the production, then it is copied just before the code for the production body.

至此，通过 2.5 Translator for Simple Expressions 中的完整的示例就实现了一个简单的基于预测分析器的翻译器。那么如何用语法制导定义来实现 2.3.2 Synthesized Attributes 中示例来达到相同的效果呢？

实现 SDD

在我想通过语法制导定义 SDD 来实现 2.3.2 Synthesized Attributes 中的示例时，遇见了一些问题，下面来一一记录。示例就是将表达式从中缀形式转换成后缀形式，表达式是由 + 和 - 分隔的数位序列，比如 9-5+2 。书中给的产生式和 semantic rules 如下，

expr -> expr1 + term | expr1 - term               expr.t = expr1.t || term.t || '+' | expr1.t || term.t || '-'
expr -> term                                      expr.t = term.t
term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9     term.t = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

根据龙书 2.4.5 Left Recursion 小节可知 expr -> expr1 + term | expr1 - term 是左递归产生式，无法使用预测分析法。但是在 Learn LLVM 17 chapter2 calc 示例以及 lua parser 实现代码中，发现也就是在一个循环中去解析算术运算符。就让我觉得好像是在预测分析代码中进行一些微调，就可以解析上述左递归文法。

于是，我就先微调 expr 产生式，将其改成 expr -> term + expr1 | term - expr1 。似乎与 2.5 节中转换后的产生式类似，如下所示，

expr -> term rest
rest -> + term rest | - term rest | ε

编码完成后运行发现，虽然能解析表达式，但是无法将其转换成后缀表达式。通过构造语法分析树发现了问题所在，以 9-5+2 为例，该文法构造的语法分析树的根节点的子节点从左到右分别是 9 和 - 以及 5+2 。这与表达式的实际语义不符合，应该先计算 9-5 ，所以构造符合表达式语义的语法分析树根节点的子节点应该是 9-5 和 + 以及 2 。这样计算表达式值时才能得到正确的结果。

通过构造语法分析树发现了问题所在，这也与龙书 2.4 Parsing 中下面文字相呼应，

Parsing is the process of determining how a string of terminals can be generated by a grammar. In discussing this problem, it is helpful to think of a parse tree being constructed, even though a compiler may not construct one, in practice. However, a parser must be capable of constructing the tree in principle, or else the translation cannot be guaranteed correct.

在回顾产生式 expr -> expr1 + term | expr1 - term 时，虽然这是左递归文法，该文法在解析输入的字符串时，每扫描到一个运算符，就先构造该运算符表示的表达式子树，接着继续解析，此表达式子树的根节点作为下一个运算符的表达式子树的子节点。最后出现的运算符构造的表达式子树的根节点是整个表达式树的根节点，也即是语法分析树的根节点。见 2.3.2 Synthesized Attributes 中 annotated parse tree 示例。这与上述转换后的产生式构造的语法分析树是一致的。因此未必一定要把产生式 expr -> expr1 + term | expr1 - term 写成转换后的产生式（如下所示），因为转换前的产生式更直观易懂，只要通过产生式描述的文法是明确的，不要出现构造的语法分析树是含糊不清的就行。

expr -> term rest
rest -> + term rest | - term rest | ε

继续看在 Learn LLVM 17 chapter2 calc 示例和 lua parser 实现代码中都是在循环中解析算术运算符，二者虽然都是采用预测分析法，但并未出现上述转换后的产生式中 rest 非终结符对应的 procedure 。这又是为什么呢？其实龙书在 2.5.4 Simplifying the Translator 和 2.5.5 The Complete Program 中有说明。

The simplifications will fold procedure rest into procedure expr. When expressions with multiple levels of precedence are translated, such simplifications reduce the number of procedures needed.

简化前如下所示，

void expr() {
    term(); rest();
}
void rest() {
    if (lookahead == '+') {
        match('+'); term(); print('+'); rest();
    }
    else if (lookahead == '-') {
        match('-'); term(); print('+'); rest();
    }
    else {}
}
void term() {
    if (lookahead is digit) {
        t = lookahead; match(lookahead); print(t);
    }
    else report("syntax error")
}

简化后如下所示，

void expr() {
    term();
    while (true) {
        if (lookahead == '+') {
            match('+'); term(); print('+'); rest();
        }
        else if (lookahead == '-') {
            match('-'); term(); print('+'); rest();
        }
        else {}
    }
}
void term() {
    if (lookahead is digit) {
        t = lookahead; match(lookahead); print(t);
    }
    else report("syntax error")
}

简化后对比简化前，

rest 函数中的递归调用被替换成了 while 循环。
合并 rest 函数中的代码到 expr 函数中。

虽然代码实现层面有简化，但本质还是用预测分析法解析。

编译器技术应该是计算机科学中的典型之一，即书中的理论知识与实际的代码实现会有一些不同，原因就是实际的代码实现可能会有各种各样的优化，无论是简化实现还是性能优化。但回归到本质时，还是书中的理论知识。所以学习了书中的知识点后，也要看看实际项目中的实现是怎样的，才能加深理解，得以运用。

我在一开始看龙书第二章时没有动手实现，所以很多地方不理解。之后再去看 Learn LLVM 17 chapter2 calc 示例和 lua parser 代码实现时，也没有注意到这就是预测分析法的应用。直到我重新再去看龙书第二章并且去实现其中的示例，特别是 2.3.2 Synthesized Attributes 小节中的 SDD 示例才有了总体的更深的理解。

如果我自己手写 parser 时，会如何定义其对应的 CFG 文法。在过程中模拟构造 parse tree 语法分析树，来辅助 CFG 定义。未必需要 CFG 就是能让 predictive parsing 预测分析法解析的格式，但通过该 CFG 构造的语法分析树要是明确的，且与预测分析过程中隐式或显式构造的语法分析树一致。
如何解析表达式，特别是不同优先级包括带有括号的表达式。
到底什么是预测分析，什么是递归下降分析，如何理解 CFG 产生式并且模拟构造语法分析树来验证文法和解析的正确性。

特别要多实践，结合现实中的项目理解知识点的运用，才能真正学会知识点。否则，只是一味的去看，很难理解并加以运用。
知识点是用来解决现实中的问题，应该以实际中的知识点为基准，灵活的理解知识点。这句话的意思是，当知识点应用到实践中去时，发现与现实情况有出入，以实践为基准，以实际解决问题为基准，然后再反思总结知识点，不要死板硬套。

完整示例代码

龙书 2.3.2 Synthesized Attributes 小节中完成示例代码如下所示。其中 parse_expr2 是 parse_expr 的精简实现，而 parse_expr3 是错误实现。

/*
** 本代码是实现龙书 2.3.2 Synthesized Attributes 中的语法制导定义示例，
** 示例是将中缀表达式转换成后缀表达式，
** 表达式是由 `+` 和 `-` 分隔的数位序列，如 1+2-3-4+5 等，
** 
** 书中给出的 CFG 如下所示，
**   expr -> expr + term    expr.t = expr.t || term.t || "+"
**   expr -> expr - term    expr.t = expr.t || term.t || "-"
**   expr -> term
**   term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
**
** 但无法使用 predictive parser 预测分析器解析上述文法，于是对 expr 前 2 个产生式稍作修改，而 term 保持不变，
**   expr -> term + expr
**   expr -> term - expr
** 这样，就把文法从左递归文法变成了右递归文法，二者的区别如下，以 1+2+3 为例，
**   左递归文法解析时根节点的 3 个子节点分别是 1+2 和 + 以及 3
**   右递归文法解析时根节点的 3 个子节点分别是 1 和 + 以及 2+3
** 最终实现时，发现右递归文法无法使用上述语义规则将表达式转换成后缀表达式，见 parse_expr3 函数。
*/
#include <iostream>
#include <stdlib.h>
#include <string.h>

using std::cout;
using std::endl;
using std::string;

class Lexer {
public:
    Lexer(const char* i, const char* e)
        : input_(i), end_(e), ptr_(i)
    {}

    int next_token()
    {
        if (ptr_ != end_)
            return *ptr_++;
        else
            return 0;
    }

private:
    const char* input_;
    const char* end_;
    const char* ptr_;
};

class Parser {
public:
    Parser(const char* i, const char* e)
        : lex_(i, e)
    {}

    string parse()
    {
        read_token();
        return parse_expr();
    }

    string parse2()
    {
        read_token();
        return parse_expr2();
    }

    string parse3()
    {
        read_token();
        return parse_expr3();
    }

private:
    string parse_expr()
    {
        string left = parse_term();
        return parse_rest(left);
    }

    string parse_rest(string left)
    {
        int v = lookahead_;
        string right;
        switch (v) {
            case '+':
            case '-':
                match(v);
                right = parse_term();
                left = left + right + string(1, (char)v);
                return parse_rest(left);
            case 0:
                break;
            default:
                error("invalid lookahead in rest");
                break;
        }
        return left;
    }

    string parse_expr2()
    {
        string left = parse_term();
        string right;
        for (;;) {
            int v = lookahead_;
            switch (v) {
                case '+':
                case '-':
                    match(v);
                    right = parse_term();
                    left = left + right + string(1, (char)v);
                    break;
                case 0:
                    return left;
                default:
                    error("invalid lookahead in expr2");
                    return left;
            }
        }
        return left;
    }

    string parse_expr3()
    {
        string left = parse_term();
        string right;

        int v = lookahead_;
        switch (v) {
            case '+':
            case '-':
                match(v);
                right = parse_expr3();
                right += string(1, (char)v);
                break;
            case 0:
                break;
            default:
                error("invalid lookahead in expr");
                break;
        }
        
        return left + right;
    }

    string parse_term()
    {
        char v = (char)lookahead_;
        if (!(v >= '0' && v <= '9'))
            error("invalid term value");
        read_token();
        return string(1, v);
    }

    void read_token()
    {
        int t = lex_.next_token();
        if (t < 0)
            error("invalid input");
        lookahead_ = t;
    }

    void match(int t)
    {
       if (lookahead_ == t)
           read_token();
    }

    void error(const char *msg)
    {
        cout << "error msg: " << msg << endl;
        exit(1);
    }

private:
    Lexer lex_;
    int lookahead_;
};

static void test1()
{
    cout << "=== === === test1" << endl;
    const char* text = "9-5+2"; // 95-2+
    const char* end = text + strlen(text);
    Parser p(text, end);
    cout << p.parse() << endl;
    cout << "=== === === test1" << endl;
}

static void test2()
{
    cout << "=== === === test2" << endl;
    const char* text = "9-5+2"; // 95-2+
    const char* end = text + strlen(text);
    Parser p(text, end);
    cout << p.parse2() << endl;
    cout << "=== === === test2" << endl;
}

static void test3()
{
    cout << "=== === === test3" << endl;
    const char* text = "9-5+2";
    const char* end = text + strlen(text);
    Parser p(text, end);
    cout << p.parse3() << endl;
    cout << "=== === === test3" << endl;
}

int main()
{
    test1();
    test2();
    test3();
    return 0;
}