Extending C++ using LLVM
Introduction
LLVM is an open-source, modular compiler infrastructure, and Clang is its front end for C, C++, and Objective-C.
In this project, I use LLVM to build a custom Clang compiler that introduces a new C++ builtin,
which counts the number of fields in a C++ struct / class / union.
This toy example walks you through Clang’s most important internals:
the parser (AST generation), semantic analysis, and finally code generation.
The new keyword
In this example, we introduce a new keyword - __builtin_struct_field_count(type or variable)
An example usage of this keyword is shown below -
#include <iostream>
class A {
  int a;
  int b;
  double c;
};
int main() {
  A x;
  // Our new builtin, evaluated at compile time.
  std::cerr << __builtin_struct_field_count(A) << std::endl; // prints 3
  std::cerr << __builtin_struct_field_count(x) << std::endl; // prints 3
  return 0;
}
Code changes
The code changes can be found under this commit -
https://github.com/llvm/llvm-project/commit/5df0838ba2e6b9fff4d1c702f663a337a4fa9d58
Let’s go through the changes step by step -
Step 1 : Defining the keyword
All the keywords are defined inside the clang/include/clang/Basic/ folder.
You first need to figure out what kind of expression your keyword is. Simple expressions can be added directly to the Builtins.td file by defining the name, prototype, and any other properties (as defined inside Builtins.def). Example: clang/include/clang/Basic/Builtins.td (defines a simple keyword for an add(int, int) function).
But __builtin_struct_field_count() does not fit any of the prototypes defined under Builtins.def. So, after running into a couple of compilation errors, I decided to check the implementation of keywords that have a similar prototype, like sizeof() or alignof(). After some keyword searches, I figured out that __builtin_struct_field_count() is a “Unary Expression or Type Trait” (UETT for short) expression.
A Unary Expression or Type Trait (UETT) is an expression whose operand can be either a type or an expression, such as sizeof(T) or alignof(expr). I defined the new keyword inside the file clang/include/clang/Basic/TokenKinds.def.
If you read through TokenKinds.def and TypeTraits.h, you’ll find that the macro UNARY_EXPR_OR_TYPE_TRAIT(Spelling, Name, Key) defines your keyword as a token of kind tok::kw_<Spelling> and also defines an enumerator UnaryExprOrTypeTrait::UETT_<Name> (which is used in the next steps).
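The way one macro entry produces both a token kind and an enumerator is an instance of the X-macro pattern: a single list is expanded several times under different macro definitions. Below is a minimal, self-contained sketch of that idea. It is a toy model, not Clang's actual headers; the toy namespace, the TOY_UETT_LIST macro, and the KEYALL key on the new entry are assumptions made purely for illustration.

```cpp
// Toy model of the X-macro pattern. NOT Clang's real TokenKinds.def or
// TypeTraits.h: every name below is hypothetical, and the Key column
// (KEYALL) on the new entry is an assumption.
#define TOY_UETT_LIST(X)                                        \
  X(sizeof, SizeOf, KEYALL)                                     \
  X(__builtin_struct_field_count, StructFieldCount, KEYALL)

namespace toy {
namespace tok {
// First expansion: one token kind per entry, named kw_<Spelling>.
enum TokenKind {
#define X(Spelling, Name, Key) kw_##Spelling,
  TOY_UETT_LIST(X)
#undef X
  NUM_TOKENS
};
} // namespace tok

// Second expansion: one trait enumerator per entry, named UETT_<Name>.
enum UnaryExprOrTypeTrait {
#define X(Spelling, Name, Key) UETT_##Name,
  TOY_UETT_LIST(X)
#undef X
  UETT_Count
};

// Third expansion: the keyword spellings as strings.
constexpr const char *Spellings[] = {
#define X(Spelling, Name, Key) #Spelling,
  TOY_UETT_LIST(X)
#undef X
};

// The single new line in the list added an entry to both enums.
static_assert(tok::kw___builtin_struct_field_count == 1, "token kind exists");
static_assert(UETT_StructFieldCount == 1, "trait kind exists");
static_assert(static_cast<int>(tok::NUM_TOKENS) ==
                  static_cast<int>(UETT_Count),
              "one token kind per trait kind");
} // namespace toy
```

The design benefit is that the keyword list lives in one place, so the token enum and the trait enum cannot drift apart.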
Step 2 : Parsing Logic
Now that we have defined our new keyword, we want to make sure the parsing logic recognizes it and sets its ExprKind (type of expression) correctly. Again following the implementations of sizeof() and alignof(), I found that these keywords are handled inside the ParseCastExpression() function in ParseExpr.cpp.
At first glance this seems like the wrong place, since our expression is not a cast expression. However, the comment at the start of the function explains that it is used to parse
"all of cast-expression, unary-expression, postfix-expression, and primary-expression. We handle them together like this for efficiency and to simplify handling of an expression starting with a '(' token"
ParseCastExpression() then calls ParseUnaryExprOrTypeTraitExpression(), where another switch statement sets our ExprKind to UETT_StructFieldCount. I also found a couple of asserts that needed to accept the new keyword.
At the end of this step, the parser produces a UnaryExprOrTypeTraitExpr AST node with our custom UETT_StructFieldCount kind.
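To illustrate the shape of this step, here is a small, self-contained sketch of a parser that maps a keyword token to a trait kind and builds a node from it. This is a simplified model under my own assumptions, not the code in ParseExpr.cpp: the Token, TraitExpr, traitKindFor, and parseTraitExpr names are hypothetical, and the toy grammar only accepts a parenthesized identifier.

```cpp
// Toy parser for `<trait-keyword> ( identifier )`. A simplified model, not
// Clang's ParseExpr.cpp; all types and functions here are hypothetical.
namespace toy {
enum class TokenKind {
  kw_sizeof, kw_struct_field_count, l_paren, r_paren, identifier, eof
};
enum class TraitKind { SizeOf, StructFieldCount, Invalid };

struct Token { TokenKind Kind; const char *Text; };

// Stand-in for the UnaryExprOrTypeTraitExpr AST node.
struct TraitExpr { bool Valid; TraitKind Kind; const char *Operand; };

// The "switch" step: map the keyword token to its trait kind.
constexpr TraitKind traitKindFor(TokenKind K) {
  return K == TokenKind::kw_sizeof ? TraitKind::SizeOf
         : K == TokenKind::kw_struct_field_count
             ? TraitKind::StructFieldCount
             : TraitKind::Invalid;
}

// Expects an eof-terminated token array. Short-circuit evaluation stops at
// the first mismatch, so tokens past the eof marker are never read.
constexpr TraitExpr parseTraitExpr(const Token *T) {
  return (traitKindFor(T[0].Kind) != TraitKind::Invalid &&
          T[1].Kind == TokenKind::l_paren &&
          T[2].Kind == TokenKind::identifier &&
          T[3].Kind == TokenKind::r_paren)
             ? TraitExpr{true, traitKindFor(T[0].Kind), T[2].Text}
             : TraitExpr{false, TraitKind::Invalid, nullptr};
}

// Token stream for: __builtin_struct_field_count(A)
constexpr Token FieldCountOfA[] = {
    {TokenKind::kw_struct_field_count, "__builtin_struct_field_count"},
    {TokenKind::l_paren, "("},
    {TokenKind::identifier, "A"},
    {TokenKind::r_paren, ")"},
    {TokenKind::eof, ""}};

// Token stream for: __builtin_struct_field_count A   (rejected by the toy)
constexpr Token MissingParen[] = {
    {TokenKind::kw_struct_field_count, "__builtin_struct_field_count"},
    {TokenKind::identifier, "A"},
    {TokenKind::eof, ""}};

static_assert(parseTraitExpr(FieldCountOfA).Valid, "well-formed input");
static_assert(parseTraitExpr(FieldCountOfA).Kind ==
                  TraitKind::StructFieldCount,
              "node carries the new trait kind");
static_assert(!parseTraitExpr(MissingParen).Valid, "malformed input");
} // namespace toy
```

Note that the node only records which trait was requested and what the operand is; nothing is counted yet at this stage.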
Relevant changes - clang/lib/Parse/ParseExpr.cpp
Step 3 : Semantic Analysis
The changes inside clang/lib/Sema/SemaExpr.cpp add semantic checks on the parsed expression. These are the checks that generate compilation errors when we write code that does not conform to the rules defined here. I have skipped any specific handling for my new keyword for now and only added the changes needed for it to go through the generic UETT expression checks.
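As a rough idea of what a keyword-specific check could look like, here is a self-contained sketch that rejects operands whose type is not a complete record type. This check is not part of the commit and does not use Clang's Sema API; the Type model, the Diag values, and checkStructFieldCountOperand are hypothetical names, and the choice to also reject incomplete (forward-declared) records is my assumption.

```cpp
// Hypothetical operand check for the new builtin. A toy model only: it is
// not code from the commit and does not use Clang's real Sema classes.
namespace toy {
enum class TypeClass { Builtin, Pointer, Record };

// Minimal stand-in for a type: its class and whether it is fully defined.
struct Type { TypeClass Class; bool IsComplete; };

// Stand-ins for the diagnostics a real check would report.
enum class Diag { None, OperandNotRecord, IncompleteRecord };

constexpr Diag checkStructFieldCountOperand(Type T) {
  return T.Class != TypeClass::Record ? Diag::OperandNotRecord
         : !T.IsComplete              ? Diag::IncompleteRecord
                                      : Diag::None;
}

constexpr Type IntTy{TypeClass::Builtin, true};           // type of `1`
constexpr Type ClassA{TypeClass::Record, true};           // class A { ... };
constexpr Type ForwardDeclared{TypeClass::Record, false}; // class B;

static_assert(checkStructFieldCountOperand(ClassA) == Diag::None,
              "a complete class is accepted");
static_assert(checkStructFieldCountOperand(IntTy) == Diag::OperandNotRecord,
              "an int operand is rejected");
static_assert(checkStructFieldCountOperand(ForwardDeclared) ==
                  Diag::IncompleteRecord,
              "a forward-declared class is rejected");
} // namespace toy
```

With a check of this kind, an operand such as the integer literal 1 would produce a diagnostic instead of reaching code generation.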
Step 4 : Code Generation
This is the place where we add the final logic for our function. Code changes - clang/lib/CodeGen/CGExprScalar.cpp
The VisitUnaryExprOrTypeTraitExpr() function contains the handling for all UETT expression kinds. Since the result of our expression is a simple scalar (an integer), we add the handling inside the ScalarExprEmitter class. The handling is quite straightforward: we cast the operand's type to a RecordType, which stores information about classes / structs / unions. We then get the number of fields from the RecordType and return it as a ConstantInt, so there is no runtime overhead.
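The logic of this step can be sketched with a small self-contained model: look up the record behind the operand's type, walk its fields, and fold the count into a constant. This is a toy under my own assumptions, not the code in CGExprScalar.cpp; the FieldDecl, RecordDecl, Type, and ConstantInt structs below only borrow Clang's and LLVM's names and are hypothetical stand-ins. Unlike the branch described here, the toy returns an invalid result for a non-record type instead of crashing.

```cpp
// Toy model of the code generation step; not Clang's CGExprScalar.cpp.
// The structs reuse Clang/LLVM names but are hypothetical stand-ins.
namespace toy {
// Fields form a singly linked list hanging off their record.
struct FieldDecl { const char *Name; const FieldDecl *Next; };
struct RecordDecl { const char *Name; const FieldDecl *FirstField; };

// A type is either a record (Record != nullptr) or something else.
struct Type { const RecordDecl *Record; };

// Stand-in for an emitted integer constant.
struct ConstantInt { bool Valid; unsigned long long Value; };

// Walks the field list recursively and counts its entries.
constexpr unsigned long long countFields(const FieldDecl *F) {
  return F ? 1 + countFields(F->Next) : 0;
}

// The whole computation runs at compile time, so nothing is left to do
// at run time except use the resulting constant.
constexpr ConstantInt emitStructFieldCount(Type T) {
  return T.Record ? ConstantInt{true, countFields(T.Record->FirstField)}
                  : ConstantInt{false, 0};
}

// Model of: class A { int a; int b; double c; };
constexpr FieldDecl A_c{"c", nullptr};
constexpr FieldDecl A_b{"b", &A_c};
constexpr FieldDecl A_a{"a", &A_b};
constexpr RecordDecl DeclA{"A", &A_a};
// Model of: struct Empty {};
constexpr RecordDecl DeclEmpty{"Empty", nullptr};

constexpr Type TypeA{&DeclA};
constexpr Type TypeEmpty{&DeclEmpty};
constexpr Type TypeInt{nullptr}; // not a record

static_assert(emitStructFieldCount(TypeA).Value == 3, "A has three fields");
static_assert(emitStructFieldCount(TypeEmpty).Value == 0, "no fields");
static_assert(!emitStructFieldCount(TypeInt).Valid, "int is not a record");
} // namespace toy
```

Because the operand's type is all that matters, passing the type A or a variable of type A leads to the same constant.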
Compiling clang
You can compile this custom instance of clang from my llvm-project fork -
https://github.com/ayushbansal07/llvm-project
Branch - feature/struct-field-count
Caveats
I have skipped the keyword-specific semantic analysis in this branch, so the current compiler crashes when the builtin is used incorrectly. For example, compiling the following snippet crashes the compiler: __builtin_struct_field_count(1)