I'm trying to write a yacc (menhir) grammar for parsing a very reduced subset of C (no templates, headers only with no function bodies allowed...) and am already running into ambiguities.
typedef int B;
class A {
A(); // (*)
B(c)(); // (**)
};
Case * is a constructor and case ** is a parenthesized declarator. How does the parser tell the difference? I can imagine some ways I could tell, but I want to know how a compliant c parser does it. I also understand that I likely won’t be able to parse even a practical subset of C with yacc, but I just want to understand a bit better what’s going on. In the end maybe I’ll switch to a parser combinator or something. Also, I should note that I have zero interest in linking against clang as I intend to add some custom syntax.
CodePudding user response:
Parsing C is frustrating exercise, because C is essentially not context-free. You need to know whether an identifier refers to a template, a type, or something else, and name resolution in C is not a simple task either. You might, for example, have to instantiate a template in order to know whether a member of templated class is a variable or a typename. (Programs only have to use typename for names in dependent types.)
So you certainly cannot expect to parse a translation unit without considering included header files. (Anyway, you couldn't do that because a header file could define a macro whose expansion changes the syntax in unexpected ways. So you need a preprocessor.)
Having said all that, all is not lost. You could use bison to generate a GLR parser, and then try to write custom disambiguation functions. If the disambiguation function needs to know whether a name is a typedef or not, it can try to look it up in the symbol table. Unlike the classic "lexer hack" approach, GLR disambiguation happens in the parser, so it doesn't require building unwieldy backchannels.
This can be an iterative process, because you don't need to figure out how to resolve ambiguities which are not present in the particular code you are trying to compile. Eventually, you'll probably want to handle all of them, but doing them as needed might prove to be a more fulfilling experience.
CodePudding user response:
Generally, recognizing constructors (and destructors) in C requires recognizing the name of the containing class specially, distinct from any other identifier.
This can be done with a contextual lookup in the lexer returning distinct tokens. Inside the {...} composing the class A body, the lexer could recognize the A token and return a different token type for it (distinct from the TYPENAME token it would return for B)
