About three years ago, the LLVM framework started to pique my interest for a number of reasons. This "collection of industrial strength compiler technology", as Chris Lattner described it back in 2008, was designed in a very modular way. It also looked like it had a lot of interesting features usable in a lot of different domains: code optimization (think deobfuscation), architecture-independent code obfuscation, static code instrumentation (think sanitizers), static analysis, runtime software exploitation mitigations (think CFI, SafeStack), powering a fuzzing framework (think libFuzzer).. you name it.
A lot of the power of this giant library comes from the fact that it operates in three main stages (front-end, mid-end, back-end), and you are free to hook your code into any of them. Other strengths include: the high number of back-ends, the documentation, the C/C++ APIs, the community, the ease of use compared to gcc (see below from kcc's presentation), etc.
The front-end takes source code as input and generates LLVM IR, the mid-end operates on LLVM IR, and the back-end receives LLVM IR in order to output assembly code and/or an executable file.
In this post we will walk through a simple LLVM pass that performs neither optimization nor obfuscation; instead, it acts as a token finder for fuzzing purposes.
Source of inspiration
If you haven't heard of lcamtuf's coverage-guided fuzzer, AFL (american fuzzy lop), it's most likely because you have lived in a cave for the past year or two, as it has been mentioned basically everywhere (now on this blog too!). The sources, the documentation and the afl-users group are really awesome resources if you'd like to know a little bit more and follow its development.
What you have to know for this post, though, is that the fuzzer generates test cases and keeps the interesting ones based on the code coverage they exercise. You end up with a set of test cases covering different parts of the code, and can spend more time hammering and mutating a small number of files instead of a zillion. It is also packed with clever hacks that make it one of the most used and easiest-to-use fuzzers today (don't ask me for proof to back this claim).
In order to measure code coverage, the first version of AFL hooks into the compiler toolchain and instruments the basic blocks in the .S files generated by gcc. The instrumentation flips a bit in a bitmap as a way of saying "I've executed this part of the code". This tiny per-block static instrumentation (as opposed to DBI-based approaches) makes it hella fast, and it can actually be used while fuzzing without too much overhead. After a little while, an LLVM-based version was designed (by László Szekeres and lcamtuf) to be less hacky, architecture independent (a bonus you get for free when writing a pass), and more elegant (no more reading/modifying raw .S files). It is implemented by hooking into the mid-end in order to statically add the extra instrumentation afl-fuzz needs for its code-coverage feedback. This is now known as afl-clang-fast.
A little later, some discussions on the Google group led readers to believe that knowing the "magics" used by a library would make the fuzzing more efficient: if I know all the magics and have a way to detect where they are located in a test case, then I can use them instead of blind bit-flipping, and hope it leads to "better" fuzzing. Such a list of "magics", or "tokens", is called a dictionary, and you can provide one to afl via the -x option. In order to semi-automate the process of generating a dictionary file, lcamtuf developed a runtime solution based on LD_PRELOAD that instruments calls to memory-comparison-like routines: memcmp, strcmp, etc. If one of the arguments comes from a read-only section, then it is most likely a token, and a good candidate for the dictionary. This is called libtokencap.
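For illustration, a dictionary file looks along these lines (afl's `token_name="value"` syntax with C-style \xNN escapes; the entry names below are made up for a hypothetical PNG parser):

```
# Hypothetical dictionary entries for a PNG parser;
# pass the file to afl-fuzz with -x.
header_png="\x89PNG\x0d\x0a\x1a\x0a"
section_IHDR="IHDR"
section_IEND="IEND"
```

afl-fuzz will then try splicing those values into the mutated test cases instead of having to discover them bit-flip by bit-flip.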
What if, instead of relying on a runtime solution that requires you to:
- Have built a corpus complete enough to exercise the code that exposes the tokens,
- Recompile your target with a set of extra options telling your compiler not to use the built-in versions of these routines,
- Run every test case through the new binary with libtokencap preloaded,
..we built the dictionary at compile time? The idea is to have another pass hook into the build process, look for tokens at compile time, and build a dictionary ready to use for your first fuzz run. Thanks to LLVM, this can be written in less than 400 lines of code that are easy to read and easy to write; and the pass is architecture independent, as it runs before the back-end.
This is the problem I will walk you through in this post, AKA yet-another-example-of-an-llvm-pass. Here we are anyway; an occasion to get back to blogging, one might even say!
Before diving in, here is what we actually want the pass to do:
- Walk through every instruction compiled and find all the function calls,
- When the call target is one of the functions of interest (memcmp, etc.), extract the arguments,
- If one of the arguments is a hard-coded string, save it as a token in the dictionary being built at compile time.
In case you are already very familiar with LLVM and its pass mechanism, here are afl-llvm-tokencap-pass.so.cc and the afl.patch – it is about 300 lines of C++ and pretty straightforward to understand.
Now, for all the others who would like a walk-through of the source code, let's do it.
The most important part of this file is the AFLTokenCap class, which walks through the LLVM IR instructions looking for tokens. LLVM gives you the possibility to work at different granularity levels when writing a pass (from the most granular to the least granular): BasicBlockPass, FunctionPass, ModulePass, etc. Note that those are not the only ones; there are quite a few others that work slightly differently: MachineFunctionPass, RegionPass, LoopPass, etc.
When you are writing a pass, you write a class that subclasses one of the *Pass parent classes. Doing so means you are expected to implement different virtual methods that will be called under specific circumstances; basically, you have three functions: doInitialization, runOn* and doFinalization. The first one and the last one are rarely used, but they give you a way to execute code before any unit has been run through, or once all of them have. The runOn* function is the important one though: this is the function that gets called with an LLVM object you are free to walk through (Analysis passes, in the LLVM nomenclature) or modify (Transformation passes). Depending on the granularity mentioned above, those LLVM objects are basically Module, Function or BasicBlock instances. In case it is not that obvious: a Module (think of a .c file) is made of Functions, a Function is made of BasicBlocks, and a BasicBlock is a set of Instructions. I also suggest you take a look at the Hello World pass from the LLVM documentation; it should give you another simple example to wrap your head around the concept of a pass.
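To make that structure concrete, here is a minimal sketch of such a subclass, using the (legacy) pass interface that was current at the time; everything except the LLVM API names is made up:

```cpp
#include "llvm/Pass.h"
#include "llvm/IR/BasicBlock.h"

using namespace llvm;

namespace {

// Skeleton of an analysis pass; "SketchPass" is a made-up name.
struct SketchPass : public BasicBlockPass {
  static char ID;  // Address used by LLVM to identify the pass.
  SketchPass() : BasicBlockPass(ID) {}

  // Called once per BasicBlock. Returning false tells the pass manager
  // the IR was left untouched (we are an Analysis pass).
  bool runOnBasicBlock(BasicBlock &BB) override {
    // doInitialization / doFinalization can also be overridden to run
    // code before the first block / after the last one.
    return false;
  }
};

} // namespace

char SketchPass::ID = 0;
static RegisterPass<SketchPass> X("sketch", "Token finder example", false, false);
```

The RegisterPass idiom is the same one the LLVM "Writing an LLVM Pass" tutorial uses to make the pass loadable by opt.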
For today's use case I have chosen to subclass BasicBlockPass, because our analysis doesn't need anything more than a BasicBlock to work: we are mainly interested in capturing certain arguments passed to certain function calls. In the LLVM IR world, such a call boils down to a call instruction whose operands are the callee and its arguments.
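The original IR listing is not reproduced here, but as a rough reconstruction (exact types and attributes vary across clang versions), a C call such as `memcmp(buf, "IHDR", 4)` lowers to something along these lines:

```llvm
; Reconstruction, not the original listing:
;   if (!memcmp(buf, "IHDR", 4)) { ... }
@.str = private unnamed_addr constant [5 x i8] c"IHDR\00"

%call = call i32 @memcmp(i8* %buf,
    i8* getelementptr inbounds ([5 x i8], [5 x i8]* @.str, i64 0, i64 0),
    i64 4)
```

Note how the hard-coded string ends up as a global constant array, referenced through a constant getelementptr expression; this is exactly what the pass will pattern-match on later.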
When AFLTokenCap::runOnBasicBlock is called, the LLVM mid-end calls into our analysis pass (either statically linked into clang/opt, or dynamically loaded) with a BasicBlock passed by reference. From there, we can iterate through the set of instructions contained in the basic block and find the call instructions. Every instruction subclasses the top-level llvm::Instruction class; in order to filter them you can use the dyn_cast<T> template function, which works like the dynamic_cast<T> operator but does not rely on RTTI (and is more efficient, according to the LLVM coding standards). Used in conjunction with a range-based for loop on the BasicBlock object, you can iterate through all the instructions you want.
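As a sketch of that loop (not the pass's exact code; the target function name is just an example):

```cpp
bool runOnBasicBlock(BasicBlock &BB) override {
  // Range-based for loop over the instructions of the block..
  for (Instruction &Inst : BB) {
    // ..keeping only the call instructions, via dyn_cast<> (returns
    // nullptr when Inst is not a CallInst; no RTTI involved).
    if (CallInst *Call = dyn_cast<CallInst>(&Inst)) {
      // Indirect calls have no statically known target.
      Function *Callee = Call->getCalledFunction();
      if (Callee == nullptr)
        continue;
      if (Callee->getName() == "memcmp") {
        // Inspect Call->getArgOperand(0..n) here.
      }
    }
  }
  return false;  // Analysis only: the IR is left untouched.
}
```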
Once we have found an llvm::CallInst instance, we need to:
- Get the name of the called function, assuming it is not an indirect call: llvm::CallInst::getCalledFunction,
- Continue the analysis only if it is a function of interest (memcmp, etc.),
- Extract the arguments passed to the function: llvm::CallInst::getNumArgOperands, llvm::CallInst::getArgOperand,
- Detect hard-coded strings (we will consider a subset of them as tokens).
Not sure if you have noticed yet, but the objects we are playing with are not all subclasses of llvm::Instruction. You also have to deal with llvm::Value, which is an even more top-level class (llvm::Instruction is a child of llvm::Value). llvm::Value is also used to represent constants: think of hard-coded strings, integers, etc.
Detecting hard-coded strings
In order to detect hard-coded strings in the arguments passed to function calls, I decided to filter for llvm::ConstantExpr operands. As its name suggests, this class handles "a constant value that is initialized with an expression using other constant values". The end goal is to find llvm::ConstantDataArrays and retrieve their raw values; those will be the hard-coded strings we are looking for.
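A hedged sketch of that chain of casts (the helper name is mine, not the pass's): the argument is typically a constant getelementptr expression into a global variable whose initializer is the ConstantDataArray we want.

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Sketch: given a call argument, try to recover a hard-coded string.
static bool getHardcodedString(Value *Arg, StringRef &Out) {
  ConstantExpr *CE = dyn_cast<ConstantExpr>(Arg);
  if (CE == nullptr || CE->getOpcode() != Instruction::GetElementPtr)
    return false;

  // First operand of the GEP expression: the global holding the bytes.
  GlobalVariable *GV = dyn_cast<GlobalVariable>(CE->getOperand(0));
  if (GV == nullptr || !GV->hasInitializer())
    return false;

  ConstantDataArray *CDA = dyn_cast<ConstantDataArray>(GV->getInitializer());
  if (CDA == nullptr || !CDA->isCString())
    return false;

  Out = CDA->getAsCString();  // Raw bytes, minus the trailing NUL.
  return true;
}
```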
At this point, the pass basically does what the token capture library is able to do.
Harvesting integer immediates
After playing around with it on libpng though, I quickly wondered why the pass would not extract all the constants I could find in one of the dictionaries already generated and shipped with afl (the libpng one).
In order to also grab those, I have decided to add support for compare instructions with an integer immediate in one of the operands. Again, thanks to LLVM it is really easy to pull this off: we just need to find the llvm::ICmpInst instructions. The only thing to keep in mind is false positives; in order to lower the false-positive rate, I have chosen to consider an integer immediate as a token only if it is fully ASCII (like the libpng tokens mentioned above).
We can even push it a bit further and handle switch statements via the same strategy. The only additional step is to retrieve every case from the switch statement: llvm::SwitchInst::cases.
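A sketch of what that looks like (the exact iteration style over cases varies across LLVM versions):

```cpp
// Harvest the integer immediates of a switch statement's cases,
// reusing the same fully-ASCII filter as for the comparisons.
if (SwitchInst *SI = dyn_cast<SwitchInst>(&Inst)) {
  for (auto Case : SI->cases()) {
    ConstantInt *CV = Case.getCaseValue();
    uint64_t Imm = CV->getZExtValue();
    // ..keep Imm as a token only when every byte is printable ASCII..
    (void)Imm;
  }
}
```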
The main limitation is that, as you are supposed to run the pass as part of the compilation process, it will most likely end up compiling the tests or utilities that the library ships with. This is annoying, as it may add some noise to your tokens; utility programs usually parse input arguments, and some use strcmp-like functions with hard-coded strings to do their parsing.
A partial solution I have implemented (as in, it reduces the noise, but does not remove it entirely) is simply to not process any function called main. In most of the cases I have seen (the set of samples is pretty small, I won't lie >:]), this argument parsing is done in the main function, and it is very easy to skip it by blacklisting the function name.
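A sketch of the blacklisting (not the pass's exact code):

```cpp
// Skip blocks that belong to a function named "main", where bundled
// utilities tend to do their argument parsing.
bool runOnBasicBlock(BasicBlock &BB) override {
  Function *F = BB.getParent();
  if (F != nullptr && F->getName() == "main")
    return false;  // Blacklisted: no token harvesting in main().

  // ..normal token harvesting goes here..
  return false;
}
```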
Another thing I wanted to experiment with, but did not, was to accept a regular-expression-like string (think "test/*") and not process any file/path matching it. You could easily blacklist a whole directory of tests this way.
I have not spent much time trying it out on a lot of code bases (feel free to send me your feedback if you run it on yours though!), but here are some example results, with various degrees of success.. or not.
On libxml2, a library with a lot of test cases and utilities, the noise ratio in the extracted tokens goes up (cf. the xmlShell* tokens, for example).
Performance wise, keep in mind that the pass runs at compile time only, so it adds zero runtime overhead to the resulting binary; the cost is a single walk over every basic block during the build.
I am very interested in hearing from you if you give this analysis pass a shot on your code base and/or in your fuzzing sessions, so feel free to hit me up! Also, note that libFuzzer supports the same feature and is compatible with afl's dictionary syntax, so you get it for free!
Here is a list of interesting articles talking about transformation/analysis passes that I recommend you read if you want to know more:
- Turning Regular Code Into Atrocities With LLVM
Go hax clang and/or LLVM!