Binary instrumentation and analysis have been subjects that I have always found fascinating. At compile time via clang, or at runtime with dynamic binary instrumentation frameworks like Pin or DynamoRIO. One thing I have always looked for though, is a framework able to statically instrument a PE image. A framework designed a bit like clang where you can write 'passes' doing various things: transformation of the image, analysis of code blocks, etc. Until a couple of months ago, I wasn't aware of any public and robust projects providing this capability (as in, able to instrument real-world scale programs like Chrome or similar).
In this post (it's been a while I know!), I'll introduce the syzygy transformation tool chain with a focus on its instrumenter, and give an overview of the framework, its capabilities, its limitations, and how you can write transformations yourself. As examples, I'll walk through two simple examples: an analysis pass generating a call-graph, and a transformation pass rewriting the function
__report_gsfailure in /GS protected binaries.
About three years ago, the LLVM framework started to pique my interest for a lot of different reasons. This collection of industrial strength compiler technology, as Latner said in 2008, was designed in a very modular way. It also looked like it had a lot of interesting features that could be used in a lot of (different) domains: code-optimization (think deobfuscation), (architecture independent) code obfuscation, static code instrumentation (think sanitizers), static analysis, for runtime software exploitation mitigations (think cfi, safestack), power a fuzzing framework (think libFuzzer), ..you name it.
A lot of the power that came with this giant library was partly because it would operate in mainly three stages, and you were free to hook your code in any of those: front-end, mid-end, back-end. Other strengths included: the high number of back-ends, the documentation, the C/C++ APIs, the community, ease of use compared to gcc (see below from kcc's presentation), etc.
Sometimes, when you are reverse-engineering binaries you need somehow to measure, or just to have an idea about how much "that" execution is covering the code of your target. It can be for fuzzing purpose, maybe you have a huge set of inputs (it can be files, network traffic, anything) and you want to have the same coverage with only a subset of them. Or maybe, you are not really interested in the measure, but only with the coverage differences between two executions of your target: to locate where your program is handling a specific feature for example.
But it's not a trivial problem, usually you don't have the source-code of the target, and you want it to be quick. The other thing, is that you don't have an input that covers the whole code base, you don't even know if it's possible ; so you can't compare your analysis to that "ideal one". Long story short, you can't say to the user "OK, this input covers 10% of your binary". But you can clearly register what your program is doing with input A, what it is doing with input B and then analyzing the differences. With that way you can have a (more precise?) idea about which input seems to have better coverage than another.
Note also, this is a perfect occasion to play with Pin :-)).
In this post, I will explain briefly how you can build that kind of tool using Pin, and how it can be used for reverse-engineer purposes.more ...