LLVM and Its Intermediate Representation
You do not know what LLVM is? “Intermediate representation” sounds peculiar, and static analysis seems hard?
LLVM is a compiler infrastructure project that allows you to build your own compiler(s) and associated tools. If you do not know what that means, think of it as a large C++ library that comes with a number of useful command-line tools. For a brief tour of LLVM, check out the talk “Introduction to LLVM”.
In static program analysis, we usually do not analyze the actual source code. If you are familiar with C and C++, you know that these languages use an archaic syntax and allow for nightmarish language constructs. That is why the target program is first compiled to an intermediate language, which is then analyzed instead; the intermediate language is usually much easier to analyze. LLVM has its own intermediate representation—LLVM IR—whose language reference can be found at https://llvm.org/docs/LangRef.html. It looks more complicated than it actually is. You can produce LLVM IR for an individual C or C++ compilation unit using the Clang compiler (which is part of the LLVM project):
$ clang++ -std=c++17 -Wall -Wextra -emit-llvm -S -fno-discard-value-names code.cpp
If you wish to install LLVM (and Clang) on your machine, follow the descriptions that can be found at https://llvm.org/docs/GettingStarted.html and https://llvm.org/docs/CMake.html.
The above command generates human-readable LLVM IR for the file code.cpp. A small program such as
int increment(int i) {
  return ++i;
}

int main() {
  int i = 42;
  int j = 13;
  int k = i + j;
  k = increment(k);
  return k;
}
becomes
; ModuleID = 'code.cpp'
source_filename = "code.cpp"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @_Z9incrementi(i32 %i) #0 {
entry:
  %i.addr = alloca i32, align 4
  store i32 %i, i32* %i.addr, align 4
  %0 = load i32, i32* %i.addr, align 4
  %inc = add nsw i32 %0, 1
  store i32 %inc, i32* %i.addr, align 4
  ret i32 %inc
}

; Function Attrs: noinline norecurse nounwind optnone uwtable
define dso_local i32 @main() #1 {
entry:
  %retval = alloca i32, align 4
  %i = alloca i32, align 4
  %j = alloca i32, align 4
  %k = alloca i32, align 4
  store i32 0, i32* %retval, align 4
  store i32 42, i32* %i, align 4
  store i32 13, i32* %j, align 4
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %j, align 4
  %add = add nsw i32 %0, %1
  store i32 %add, i32* %k, align 4
  %2 = load i32, i32* %k, align 4
  %call = call i32 @_Z9incrementi(i32 %2)
  store i32 %call, i32* %k, align 4
  %3 = load i32, i32* %k, align 4
  ret i32 %3
}

attributes #0 = { noinline nounwind optnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { noinline norecurse nounwind optnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project.git d32170dbd5b0d54436537b6b75beaf44324e0c28)"}
The LLVM IR version of this program comprises only basic instructions. You can think of it as a kind of machine-independent assembly, but more readable and with functions. LLVM IR is generally much easier to analyze than the original source code.
“Hello, LLVM!” Analyzing LLVM IR
A “Hello, LLVM!” program that reads an LLVM IR file can be found at https://github.com/GaZAR-UG/hello-llvm. Our small program reads the IR, iterates its functions, and inspects its individual instructions. This is pretty much static analysis: you are inspecting the code without actually executing it. Using LLVM’s powerful APIs for inspecting its IR makes implementing your own data-flow analysis relatively easy.
// Iterate the IR and analyze the functions, basic blocks, and individual instructions.
for (const auto &F : *M) {
  llvm::outs() << "found function " << F.getName() << '\n';
  for (const auto &BB : F) {
    for (const auto &I : BB) {
      if (const auto *Alloca = llvm::dyn_cast<llvm::AllocaInst>(&I)) {
        llvm::outs() << "found alloca instruction, allocating: ";
        Alloca->getAllocatedType()->print(llvm::outs());
        llvm::outs() << '\n';
      }
      if (const auto *Load = llvm::dyn_cast<llvm::LoadInst>(&I)) {
        llvm::outs() << "found load instruction, loading from: ";
        Load->getPointerOperand()->print(llvm::outs());
        llvm::outs() << '\n';
      }
      if (llvm::isa<llvm::CallInst>(&I) || llvm::isa<llvm::InvokeInst>(&I)) {
        llvm::ImmutableCallSite CS(&I);
        if (!CS.isIndirectCall()) {
          if (const auto *Callee = CS.getCalledFunction()) {
            llvm::outs() << "found direct call to " << Callee->getName() << '\n';
          }
        } else {
          llvm::outs() << "found indirect call to ";
          CS.getCalledValue()->print(llvm::outs());
          llvm::outs() << '\n';
        }
      }
    }
  }
}
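The snippet above assumes that M already points to a loaded llvm::Module. A minimal driver that parses a textual IR file into such a module might look as follows. This is a sketch against the LLVM 10 C++ API (llvm::parseIRFile from the IRReader component); it must be compiled and linked against the LLVM libraries and is not runnable stand-alone:

```cpp
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/IRReader/IRReader.h>
#include <llvm/Support/SourceMgr.h>
#include <llvm/Support/raw_ostream.h>

#include <memory>

int main(int Argc, char **Argv) {
  if (Argc < 2) {
    llvm::errs() << "usage: " << Argv[0] << " <IR file>\n";
    return 1;
  }
  llvm::LLVMContext Ctx;
  llvm::SMDiagnostic Err;
  // Parse the (textual or bitcode) IR file into an in-memory module.
  std::unique_ptr<llvm::Module> M = llvm::parseIRFile(Argv[1], Err, Ctx);
  if (!M) {
    Err.print(Argv[0], llvm::errs());
    return 1;
  }
  // ... iterate the module's functions and instructions here ...
  return 0;
}
```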
Static Analysis Using the Monotone Framework
How do you implement your own intra-procedural data-flow solver? Construct a map that maps each intra-procedural control-flow edge to an (initially empty) set of data-flow facts. Find a suitable abstraction and model the program property you are interested in as a data-flow fact. You are interested in constant variables? Use pairs of variables and their associated constant values. Put all control-flow edges into a worklist. Process the worklist in a loop: remove the first control-flow edge, inspect the corresponding source instruction, and model its effects in terms of data-flow facts. The instruction declares a local variable? Add this variable, with an undefined value, to the set associated with that control-flow edge. A variable is initialized with an integer literal? Look up the variable in the set of flow facts and adjust its value accordingly. Whenever a set changes, put the control-flow edges that depend on it back onto the worklist. Iterate until the worklist is empty and the solution has stabilized. Voilà, this is your solution. This approach is known as the monotone framework.
By the way, a static analysis considers all possible execution paths of the given target program. Conditionals are usually not evaluated, as they typically depend on dynamic runtime values. If two control-flow edges lead to a common successor instruction, you first need to merge the sets obtained along the different paths. This is usually done using either set union for may analyses (e.g., taint analysis) or set intersection for must analyses (e.g., constant propagation).
That’s all!
OK, depending on the properties you wish to show for a given target program, static analysis can of course get arbitrarily complex. But that does not have to be your concern. We will help you design, implement, and deploy the static (and dynamic) program analyses you need to make your software more performant, more secure, and more robust.