Updated: Oct 17, 2022
The problem of analyzing obfuscated code has long been known. Obfuscation significantly increases code redundancy, so even simple obfuscation methods can greatly complicate the analysis of even simple algorithms.
In this article, we consider a deobfuscation method based on the LLVM optimizer. The approach is versatile: optimization is performed on LLVM-IR, so the same optimization methods can be applied to code for different architectures.
LLVM optimizer specifics
LLVM is a project that simplifies the creation of compilers. It is built around a platform-independent intermediate representation, LLVM-IR, whose instruction set resembles that of a RISC architecture. All transformations (optimizations) take place on LLVM-IR.
In theory, once native code has been lifted to LLVM-IR by one of the available methods, very aggressive optimizations become possible even for already compiled code.
This deobfuscation method is relatively universal, within the limits of what the native->LLVM “lifter” can handle.
Among the free products that can perform lifting, there are several interesting projects:
1. RetDec https://github.com/avast/retdec is a decompiler based on lifting native code to LLVM-IR. This is what we will use as an example.
2. Llvm-mctoll https://github.com/microsoft/llvm-mctoll was not considered as an option because it lacks 32-bit x86 support (although x86-64 is available).
3. Mcsema https://github.com/lifting-bits/mcsema is quite an interesting project for “transferring” code to other architectures. Its main disadvantage is a rather complicated code transfer scheme that models a “virtual” processor of the original architecture, which further complicates the structure of the lifted code. This greatly reduces the possibility of subsequent optimization and deobfuscation.
4. Dagger https://github.com/repzret/dagger is a good project overall. However, it has problems with lifting global variables and does not lift stack variables at all, which severely limits its use for code optimization purposes.
LLVM optimizer practical implementation
For a practical demonstration, let’s take an illustrative example of hand-crafted code compiled with optimization disabled.
After compilation it will look like this
After the optimizer we get the following result
Let’s see how everything works. First of all, we need to select the block of code we want to optimize. In this case, it’s just one function located at 0x140001000-0x14000109B (assuming ASLR is disabled).
The next task is to lift this function to LLVM-IR, which the RetDec decompiler can do successfully. For the best results, its configuration needs to be adjusted.
The RetDec command line looks like this
python retdec.py [filename].exe --select-ranges 0x140001000-0x14000109B --select-decode-only --stop-after bin2llvmir
The output is a [filename].ll file. It needs to be translated into assembly, which is done with llc, the LLVM static compiler.
llc -o [filename].asm -x86-asm-syntax=intel -filetype=asm -march=x86-64 -mtriple x86_64-win32 [filename].ll
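The two commands above can be scripted so that the address range is passed in one place. The following is a minimal sketch, not a tested wrapper: the function names are hypothetical, and it assumes that `retdec.py` and `llc` are reachable from the current directory and that RetDec writes its output next to the input as [filename].ll, as described above.

```python
import subprocess
from pathlib import Path


def retdec_command(exe, start, end):
    """Build the RetDec command line shown above for one address range."""
    return [
        "python", "retdec.py", str(exe),
        "--select-ranges", f"0x{start:X}-0x{end:X}",
        "--select-decode-only",
        "--stop-after", "bin2llvmir",
    ]


def llc_command(ll_file, asm_file):
    """Build the llc command line shown above (Intel syntax, x86-64 Windows)."""
    return [
        "llc", "-o", str(asm_file),
        "-x86-asm-syntax=intel", "-filetype=asm",
        "-march=x86-64", "-mtriple", "x86_64-win32",
        str(ll_file),
    ]


def lift_and_compile(exe, start, end):
    """Run both steps in order and return the path of the assembly file.

    Assumption: RetDec produces <name>.ll for <name>.exe.
    """
    exe = Path(exe)
    ll_file = exe.with_suffix(".ll")
    asm_file = exe.with_suffix(".asm")
    subprocess.run(retdec_command(exe, start, end), check=True)
    subprocess.run(llc_command(ll_file, asm_file), check=True)
    return asm_file
```

For the function from this article the call would be `lift_and_compile("sample.exe", 0x140001000, 0x14000109B)`, where `sample.exe` is a placeholder file name.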
The output assembly file has to be cleaned up before it can be assembled with fasm. This requires a set of filters (mainly syntax-related) and, more importantly, replacing the names of library calls with their addresses (a side effect of lifting with RetDec).
The example does not use this, and the above implementation only works under x86. If desired, you can simply fix the regular expressions.
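To give an idea of what such filters can look like, here is a rough sketch of a regular-expression pass. It is an illustration under assumptions, not the filter set used for this article: the function name and the symbol map are hypothetical, and only three rules are shown (dropping assembler directives and `#` comments, removing the `ptr` keyword that fasm does not use, and replacing symbol names with addresses). Real llc output will need more rules.

```python
import re

# Hypothetical map from symbol names in the lifted code to their addresses.
EXAMPLE_SYMBOLS = {"puts": 0x140002000}


def filter_for_fasm(asm_text, symbols):
    """Apply a few illustrative syntax filters to llc Intel-syntax output."""
    out = []
    for line in asm_text.splitlines():
        # Drop llc '#' comments and trailing whitespace.
        line = line.split("#", 1)[0].rstrip()
        stripped = line.strip()
        if not stripped:
            continue
        # Drop assembler directives (.text, .globl, ...) but keep labels (.LBB0_1:).
        if stripped.startswith(".") and not stripped.endswith(":"):
            continue
        # fasm writes 'dword [rcx]' rather than 'dword ptr [rcx]'.
        line = re.sub(r"\b(byte|word|dword|qword)\s+ptr\b", r"\1", line)
        # Replace symbol names with the addresses they stand for.
        for name, addr in symbols.items():
            line = re.sub(rf"\b{re.escape(name)}\b", f"0x{addr:X}", line)
        out.append(line)
    return "\n".join(out)
```

The symbol map is passed in explicitly because the mapping from names to addresses depends on the binary being analyzed.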
After that, we can assemble the filtered file into raw bytes (a bin file).
The final step is to take the resulting bin file and write its contents to the same place inside the executable file, filling the rest of the original range with 0xCC bytes for good measure. In Python with the help of lief it looks like this
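A minimal sketch of this patching step is shown below. The function names are hypothetical, the end address of the range is assumed to be inclusive, and the code assumes that lief's `patch_address` accepts the absolute virtual address of the function; treat it as an outline to adapt rather than a finished tool.

```python
def build_patch(code, region_size, filler=0xCC):
    """Return `code` padded with `filler` bytes up to `region_size` bytes."""
    if len(code) > region_size:
        raise ValueError("optimized code does not fit into the original region")
    return bytes(code) + bytes([filler]) * (region_size - len(code))


def patch_executable(exe_path, bin_path, out_path, start, end):
    """Overwrite the address range [start, end] of exe_path with the bin file."""
    import lief  # third-party; imported here so build_patch works without it

    with open(bin_path, "rb") as f:
        code = f.read()
    # Assumption: `end` is the last byte of the function (inclusive range).
    patch = build_patch(code, end - start + 1)
    binary = lief.parse(exe_path)
    binary.patch_address(start, list(patch))
    binary.write(out_path)
```

Writing to a separate output path keeps the original executable intact, so the patched and unpatched versions can be compared side by side.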
LLVM optimizer - how to turn a proposed concept into a real product
The described example of obfuscation is quite simple. Decompilers such as IDA, Ghidra, or RetDec can already handle it.
At the same time, the proposed concept can be used for more convenient dynamic analysis, because optimization greatly simplifies the control-flow graph and removes redundant code.
Potentially, the described mechanism can remove OLLVM obfuscation (with certain modifications) or even virtualization: the approach shows promise against VMProtect, although further improvements are required.
In addition, it is possible to use this approach in legacy software to optimize already compiled code blocks. Let’s take this code as an example
After optimization, we see the following picture
Thus, the optimizer did a good job of “optimizing the loop”: it replaced the loop with a constant.
The biggest problem in the practical use of this method can be flawed lifting of native code to LLVM-IR, since RetDec is primarily a decompiler, not a lifter.
Therefore, this method is more of a concept than a production-ready tool, although it has already been used successfully to solve tasks that IDA could not cope with. Each case is unique, however, and requires its own “polishing”.
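As a hypothetical illustration of this kind of transformation (it is not the code from this article), the two Python functions below compute the same value. An optimizer that can evaluate a loop with known bounds at compile time effectively turns the first form into the second.

```python
def loop_version():
    """Sum the integers 0..99 with an explicit loop."""
    total = 0
    for i in range(100):
        total += i
    return total


def constant_version():
    """What the loop reduces to once it is evaluated ahead of time."""
    return 4950
```

Both functions return the same result, but the second one has no loop left to execute or to analyze.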
Conclusions - how LLVM optimizer could be used
In this article, we have reviewed a deobfuscation method based on the LLVM optimizer, which allows the same optimization methods to be used on different architectures. We proposed a simple practical implementation of this method and discussed the key issues related to it.