In reverse engineering, one often tries to identify well known functions in a stripped and statically linked binary. It is often possible to generate fingerprints for functions and match them against a debug build of the same libraries. Current approaches such as FLIRT signatures typically require a debug build with very similar compiler options. Additionally most current approaches require a one to one comparision between all pairs of functions. This prevents large databases of known libraries to be used to easily identify known functions.
To overcome this problem Indika generates robust hashes based on the semnatic of a function, that allow O(1) lookups in very large databases of known functions.
Even if a function is compiled with completely different compilers, for different platforms, using different optimization levels, the function fundamentally still needs to perform the same task. Using a Blanket Execution engine build on top of Unicorn, Indika gathers a set of traces covering all branches in the function. It observes all globally visible side effects. This set of side effects are a pretty good fingerprint for the actual function, independent of the actual compiler implementation. To map this set of features to a unique hash, we use min hashing. Assigning a unique hash has one very large advantage over current pairwise function similarity approaches. It becomes trivial to store the hashes of a very large number of applications in a database, and lookup unknown functions very fast.
A presentation on Indika can be found here.
Side Effect Filtering
Unfortunately, compiler optimizations often do change globally visible behavior such as memory accesses. Particularly, stack accesses are often optimized out. To achieve robust results, we need to filter events, if they are not stable across different compilers, versions, platforms or optimization levels. We manually identified a set of heuristics used to exclude events.
To name a few of the problems:
- Stack access
- Redundant reads/writes
- Inlined functions
- Leaf functions
A student (and now colleague) of mine improved upon this project in his (unpublished) master thesis, using machine learning to predict which effects are stable across targets and which are not, significantly raising the number of perfect matches. After hashing and inserting functions from roughly 2000 libraries present on a common linux system (compiled using GCC, 64 bit, no optimization), the database contains more than one million hashes. Afterwards, some of the libraries were recompiled using the highest optimization level of Visual Studio for 32 bit. Even though both compilation processes differ in nearly every parameter, in all but one case, more than 50% of the functions were uniquely identified by the hash. In fact, typically around 80% of the functions are identified perfectly. Common tricks for min hasing based nearest neighbor search can be used to search for near matches to further improve the performance.