In recent years, code obfuscation has attracted both researchers and software developers as a useful technique for protecting secret properties of proprietary programs. The idea of code obfuscation is to modify a program, while preserving its functionality, in order to make it more difficult to analyze. Thus, the aim of code obfuscation is to conceal certain properties to an attacker, while revealing its intended behavior. However, a general methodology for deriving an obfuscating transformation from the properties to conceal and reveal is still missing. In this work, we start to address this problem by studying the existence and the characterization of function transformers that minimally or maximally modify a program in order to reveal or conceal a certain property. Based on this general formal framework, we are able to provide a characterization of the maximal obfuscating strategy for transformations concealing a given property while revealing the desired observational behavior. To conclude, we discuss the applicability of the proposed characterization by showing how some common obfuscation techniques can be interpreted in this framework. Moreover, we show how this approach allows us to deeply understand what are the behavioral properties that these transformations conceal, and therefore protect, and which are the ones that they reveal, and therefore disclose.
The last years have seen a considerable growth in the amount of software that is distributed over the Internet and in the number of wireless devices that dominate our society. Common classes of web applications that are part of our daily lives include e-mail clients, e-banking, e-commerce, social shopping, social networks and e-voting. In this complex scenario, users need to protect their devices against malicious software attacks (e.g., software viruses and Internet worms), while software developers need to protect their products against malicious host attacks that usually aim at stealing, modifying or tampering the code in order to obtain (economic) advantages over it.
A key challenge, in defending code running on an untrusted host, is that there is no limit on the techniques that a human attacker can use to extract or modify sensitive data from the code and/or to violate its intellectual property. Indeed, software developers lose the control of their applications once they are distributed to a client machine. The most common malicious host attacks against proprietary programs are malicious reverse-engineering, software piracy and software tampering [5]. Malicious reverse engineering refers to those techniques inspecting the inner workings of software applications for unlawful purposes. Both software tampering and software piracy need a preliminary reverse-engineering phase in order to understand the inner working of programs they want to tamper with or to use, without authorization. Thus, the first defense against malicious host attacks consists in impeding reverse engineering as much as possible.
Of course there are legal measures to protect software against malicious host attacks, such as copyrights, patents and licenses. However, legal countermeasures are expensive and it may be hard, for a software developer, to enforce the law against a larger and more powerful competitor. For this reasons, software developers often resort to hardware and software solutions for protecting their products. Hardware solutions, for example by means of dongles, cryptographic processors or smart cards, are less flexible than software solutions, have higher costs and require more effort in maintenance. It is therefore important to develop general software security techniques, which are hardware and platform independent, and which are able to ensure the desired security requirements. Two of the most important software defense techniques against reverse engineering are code obfuscation and software watermarking. Code obfuscation [7] aims at transforming programs in order to make them more difficult to analyze, but anyway preserving their functionality, while software watermarking [6] inserts a signature in the code in order to link the software to the legitimate owner. Software watermarking is useful in identifying a violation once it has happened, while code obfuscation is considered a promising technique to prevent malicious attacks. In recent years, code obfuscation has attracted the attention of many researchers and this has led to the design and implementation of many obfuscation strategies (e.g., [7,18,33,34,40]) together with the study of theoretical models and frameworks for understanding the potentialities and limits of obfuscation (e.g., [2,4,15,21,25,39]).
A well-known negative theoretical result on code obfuscation is the one of Barak et al. [2] showing that obfuscation is impossible. Note that, this result states the impossibility of an “ideal” obfuscator that obfuscates every program by revealing only the properties that can be derived from the I/O semantics. Indeed, the obfuscator of Barak et al. is a program transformer that: (1) preserves program’s functionality, (2) generates obfuscated programs whose slowdown with respect to the original program is polynomial both in time and space, (3) and satisfies the virtual black box (VBB) property, namely anything that one can compute from the obfuscated program can also be computed having access to the input/output behavior of the original program. Given the impossibility of such a VBB-obfuscator, Barak et al. suggested another notion of obfuscation called indistinguishability obfuscator, that weakens the VBB property. Indeed, an indistinguishability obfuscator has to (1) preserve program’s functionality, (2) cause a polynomial slowdown and (3) be such that it is not possible to distinguish the obfuscated version of two equivalent programs. Recently, Garg et al. [19] have provided the first construction for indistinguishability obfuscation, thus proving the existence of such an obfuscation. However, the proposed construction has a very high complexity that makes it impractical to use. Observe that, the results of Barak et al. and of Garg et al. aim at studying the existence of a VBB, or indistinguishability obfuscator, working for each program (more specifically for each polynomial size circuit) and guaranteeing a polynomial (in the worst case) loss of performance in time and space. Besides the negative result of Barak et al. and the positive, but still impractical, result of Garg et al., in recent decades, we have seen a big effort in developing and implementing new and efficient obfuscation strategies. Of course, these obfuscating techniques introduce a kind of practical obfuscators weakening the VBB-obfuscators in different ways, and which can be effectively used in real application protection in the markets. For example, these obfuscators may work only for a certain class of programs, or may be able to hide only certain properties of programs (e.g., control flow), or may cause a loss of performance in time and space that is more than polynomial. Indeed, the attention on code obfuscation poses the need to deeply understand what it is possible to obfuscate of a particular program, when it is possible to obfuscate it, and how it is possible to obfuscate it. Surely, an important task in this field, related to what we can obfuscate, consists in understanding obfuscations by characterizing which kind of attacks they are able to defeat. We believe that the development of a systematic strategy for the design of an obfuscator parameterized with respect to program properties both to protect (conceal) and to preserve (reveal), would be an important advance in the understanding of property-driven obfuscation techniques. In particular, it would provide a better insight into the relation between the property preserved by an obfuscator, which usually is the I/O program behavior but which can be any (more precise) observable property of the program, and the property protected, e.g., the program dependencies, the control structure, and so on. The first step toward the development of a systematic design of property-driven obfuscation strategy is the ability to formally, and therefore semantically, characterize the obfuscation of a particular program with respect to the property to be protected and preserved. The idea is that a program property is obfuscated, namely protected by obfuscation, when analyzing the obfuscated program it is not possible to observe precisely the property of the original program. This means that obfuscation has modified the program in order to add noise and make imprecise the observation of the property to protect. In this work we address the problem at the semantic level: we provide a methodology for designing semantic code transformations that act as obfuscation with respect to the properties to be protected and preserved. The necessity of fixing and considering both these constraints comes from the fact that, while transforming a code for protecting some of its properties, we could transform other code properties that we would like to preserve, such as the I/O functionality. For this reason, we need to formally fix also the property we aim at preserving, in order to provide a bound to the noise we can add for obfuscating the program semantics.
By considering, for instance, data-type obfuscation [18], the obfuscation is achieved by transforming data manipulations in order to deceive particular data observations. Without entering in details, consider the following code and a possible data-type obfuscation of variable s concealing the parity property of the values assumed by variable s during computation and at the end of the program, while preserving its I/O value:
In this case, all the definitions of x are doubled, while in all the uses of x its value is divided by 2. In this way, the final value of s (computed in terms of x) does not change, but some analyses are confused. In particular, we can observe that if we statically analyze the parity of s, then in we precisely conclude that s at statement 6 is odd, while in the obfuscated program we cannot conclude anything on the parity of s at statement 6, due to the computation in terms of the integer division , whose parity is undetermined when we only know the parity of x. Indeed, we can observe that, we do confuse other program properties, such as the range of values for x (interval analysis [11]), but this is ignored since the two required constraints are satisfied (preserve the I/O of s and protect the parity of s). Note that, we could conceal the parity analysis of s also by changing only statement 4 (without changing statements 1 and 5), but in this way we would also change the I/O behavior, making this transformation not acceptable as obfuscation.
This example shows us that, in order to formally reason on what is protected, namely what is concealed, and what is preserved, namely what is revealed, by an obfuscating transformation, we need to focus on the effects that obfuscation has on the program semantics. In [14,15] it is shown that the semantic level allows us to understand the protection capabilities and the limits of an obfuscating strategy with respect to a particular program. We start from the approach of Dalla Preda and Giacobazzi in [14,15] (then developed also in [20,21]), where a formal framework for code obfuscation is provided by taking into account the effects that an obfuscating transformation has on trace semantics, and by modeling attackers as approximations of trace semantics. In order to reason on the semantic aspects of obfuscation, we refer to the formal framework introduced by Cousot and Cousot [12], where the relation between syntactic and semantic transformations is formalized in terms of abstract interpretation by considering programs as abstractions of their semantics. Indeed, the above mentioned semantics-based formalization of code obfuscation, has allowed us to qualitatively compare the efficiency of different obfuscating transformations, and in certain cases [13,16], the semantic understanding of the obfuscation strategies has allowed us to design an efficient attacker who may be able to de-obfuscate the program, thus proving the potentialities and limits of the considered transformations.
In this work, we propose a general methodology for characterizing (when it exists) a property-driven obfuscating transformation that protects a given semantic property (concealment transformer), and preserves another semantic property (revelation transformer). In particular, we observe that a concrete example of revelation transformer is the program slicing transformation, which transforms a program looking for the maximal subprogram preserving the I/O behavior on the criterion variables [41]. On the other hand, the concealment transformer adds any computation that may conceal the property to protect, potentially concealing, in this way, also the original program behavior. Interestingly, the combination of these transformers provides a systematic strategy for the characterization of obfuscating transformers parametric on the program properties to conceal and to reveal. More precisely, the proposed strategy, given a program and the specification of the properties to protect and preserve, provides a characterization of the maximal obfuscation strategy for that program, namely of the pool of all the computations (execution traces) that could be added to the program in order to obtain the desired semantic obfuscation. Hence, maximal here means that, any other computation (that we could add to the original program semantics) would change also the program semantics property that we aim at preserving. For instance, in the previous example, the I/O semantics of P, that we want to preserve, is (the only output is the variable s), while we aim at concealing the parity property of s. The intuition beyond the strategy is that, the abstract execution on the parity domain of any obfuscated program must be different from the abstract execution of P on the same domain, while preserving the I/O semantics. Formally, the fix-point of the abstract execution of any obfuscated program must be different from , while its I/O must be . Indeed, in the example, the program precisely satisfies this requirement, since it has the same I/O semantics of P, while the abstract execution is . Hence, any program transformation, whose semantics is obtained by adding some of the computation given by the strategy to the original program semantics, is an obfuscation method that protects and preserves precisely the considered semantic program properties.
Next, we discuss how the proposed characterization of maximal obfuscation strategy can be used for driving the design of property-based obfuscators. To this end, we consider a recent obfuscating technique [21], where obfuscators are implemented as abstract distorted interpreters. The idea of this obfuscating technique is that, whenever the property to reveal is precisely the I/O semantics, then we can use a suitable distorted interpreter, that by construction, must preserve the I/O semantics, and which is built in order to transform the syntax in a way that a specific property results concealed in the transformed code. It turns out that, in our framework, these distorted interpreters are precisely a way of adding, to the original program, some of the computations contained in the maximal obfuscation strategy for the program.
We validate our results by showing how common obfuscating transformations can be interpreted in the proposed formal framework. For example, we consider the obfuscation technique known as opaque predicate insertion, which adds branches to the control flow of the program that are never executed at run-time. This means that the control flow graph of the obfuscated program corresponds to the control flow graph of the original program, augmented with some spurious paths. In this case, as expected, we are able to show that the computations added by opaque predicate insertion belong to the pool characterized by the maximal obfuscation strategy concealing the original program control flow, and preserving the fact that the original control flow is a sub-graph of the obfuscated program control flow. All the examples proposed show that the obfuscating transformations add computations that obfuscate the property to protect. Indeed, the interpretation of existing code obfuscations in terms of behavioral properties to reveal and conceal allows us to deeply understand the obfuscating behavior of such transformations, namely the relation between what is protected/concealed and what is preserved/revealed.
To summarize, the contributions of the paper are the following:
Definition and extension of the semantic framework for code obfuscation developed in [14,15] in order to cope with the semantic specification of the properties, respectively, to protect and preserve (Section 3).
Definition, study and characterization of two function transformers: Revelation, which minimally (in the point-wise order) modifies a function in order to preserve a certain property of the input (Section 4), and concealment, which maximally modifies a function in order to hide a certain property of the input (Section 5).
Development of a general framework based on revelation and concealment transformers for characterizing the obfuscation strategy of a particular program with respect to the semantic property to protect and to preserve (Section 6).
Discussion of the applicability (by means of some examples) of the proposed framework by showing how existing obfuscators based on abstract distorted interpreters are indeed a particular way of instantiating property-driven obfuscation strategies (Section 7).
Preliminaries
Mathematical notation
If C and D are sets, then denotes the powerset of C, denotes the Cartesian product of C and D, denotes the set difference between C and D, denotes strict inclusion, and for a function , and , and . We will often denote as and use lambda notation for functions. Function composition is denoted .
A binary relation is a partial order if it is reflexive (), antisymmetric () and transitive (). A set C equipped with a partial order relation ⩽ is called a poset and it is denoted with . We define the point-wise ordering between functions as follows: given two poset and and two functions , if and only if we have that . Let be a poset and let . The least upper bound (lub) of X, denoted , is the smallest element such that . The greatest lower bound (glb) of X, denoted is the biggest element such that . We use and to denote respectively and . A poset is a lattice if we have that and . A lattice C is a complete lattice when each subset of elements has glb and sup, namely when and . As usual, denotes a complete lattice C with ordering ⩽, lub ∨, glb ∧, greatest element (top) ⊤, and least element (bottom) ⊥. Often, , , , and will be used to denote the underlying ordering, basic operations and elements of the complete lattice C. Consider a poset with ⊥. We say that , is an atom of C if with we have that implies . The notion of co-atom is dually defined. A poset C is a direct set if each non-empty finite subset of C has lub in C. Given a complete lattice , we define the identity and top functions on C as follows: and . Let C and A be complete lattices, function is (completely) additive if f preserves lub of all subsets of C (the empty set included), namely if we have that . Function f is continuous when it is monotone and it preserves lubs’s of direct sets, namely when for every direct set we have that . Co-additivity and co-continuity are dually defined. Let C be a poset with ⊥ and ⊤. Given an element we say that is the complement of x if and . A lattice is complemented if each one of its elements has complement in the lattice. A lattice is distributive if the glb distributes on the lub, i.e., . A complete Boolean algebra is a complete, complemented and distributive lattice [28]. Given an operator f on a poset we say that is a fix-point of f if . The least fix-point of f on starting from the lattice element s, when it exists, is denoted by .
Abstract interpretation
Abstract interpretation is a general theory for specifying and designing approximate semantics of programming languages [10]. Approximation can be equivalently formulated either in terms of Galois connections or closure operators [11]. An upper closure operator ( for short) on a poset is an operator that is monotone (), idempotent (), and extensive (). By duality, a lower closure operator ( for short) on a poset is an operator that is monotone, idempotent, and reductive (). Let and denote respectively the set of upper and lower operators on C. Consider a complete lattice C where elements are ordered according to their relative precision. An approximation of the elements of C can be defined as an operator that maps each element of in a less precise element that represents its approximation in C (recall that by definition of upper closure operator). Consider for example the complete lattice given by the power set of integers ordered according to set inclusion: . The operator , operating on the power set of integers, associates each set of integers with its sign: In the following picture we write for , 0 for and for .
Analogously, the operator associates each set of integers with its parity. In this case, in the picture, we write for and for .
From these examples,we can clearly see that the abstract elements, in general, correspond to the set of values satisfying the property they represent.
Let us recall the notion of Moore family of a complete lattice C: is a Moore family of C if , where . Observe that every closure operator where C is a complete lattice is uniquely determined by the set of its fix-points and that this set is a Moore family of C. For instance . Indeed, for a complete lattice C it holds that: if then is a Moore family of C, and for every Moore family we can define an operator as , and it is such that and . Given a closure we say that φ is meet-uniform on x if and only if , φ is meet-uniform if we have φ meet-uniform on x [26]. The notion of join-uniformity is dually defined.
If is a complete lattice then ordered point-wise is also a complete lattice: where for every , and : if and only if ; ; and , and . The complete lattice of upper closures of C is called the lattice of abstract interpretations of C [11]. As argued above, each defines a possible abstraction of C where each element is approximated by the best (more precise) element of that represents it. Hence, the complete lattice is precisely the domain of possible abstractions of C ordered according to their degree of precision. In the following, we will find particularly convenient to identify closure operators (and therefore abstract domains) with their sets of fix-points that represent the property of concrete elements that the closure is able to express.
Programming language: Syntax, semantics and control flow graphs
Let us partially follow the notation used in [21]. We use a toy imperative language for introducing the semantics on which we define our framework, and that we will use in some of the examples. Note that, the choice of the language is not important since our framework is semantic, and therefore language, independent. For abstract interpretation, we need a fine-grained small-step semantics containing program points or similar syntactic information to which abstract values can be bound. Consider a simple imperative language parameterized by an expression sub-language e that contains the usual arithmetic and boolean expressions and functions:
The single commands of the language are in the following denoted with . In Table 1 we omit the formal semantics of which corresponds to a dynamic assignment to x, and of which corresponds to the visualization of the output result. We consider a quite standard operational semantics of the language. Let be a set of programs in the language (in the following simply denoted when the language can be left implicit), the set of all the variables in , and be a set of program lines of containing a special notation ϵ for the empty program line, be the set of values, and be a set of possible program memories. When a command belongs to a program we write , and we define the auxiliary functions be such that if is the command in at program line l (denoted ) and such that if command is stored at location l, with the simple extension to sequences of instructions . Then, letting , we define the semantics of in Table 1, where x are variables, e are (potentially boolean) expressions, is the evaluation of expressions, and where we write for the execution of in the memory σ with continuation , and for termination in state σ. We can formally characterize the small-step operational semantics of programs. Let be the set of (nonterminating) states, containing the actual memory and the code to execute, and . The function is such that:
It is worth noting that for deterministic languages like , this set contains only one state or one memory.
Small-step semantics of
We use the transfer function to define the small-step operational semantics, denoted by , as the set of traces of terminating and infinite executions. Let denote the set of all the finite traces on Σ, and the set of all infinite traces, we denote by the set of all possible traces. Let and ,
where denotes the ith state in the trace τ, denotes the length of trace τ, , . See [9] for details on the fix-point construction of operational semantics. In [9] several kind of semantics are defined as abstractions of small-step semantics, let us recall denotational (I/O) semantics :
In the following, we abuse notation by denoting with and also the corresponding additive lift to set of states, i.e., given we have that and .
Programs are often analyzed by means of their control flow graph (CFG). A CFG models the flow of control between program instructions. We consider the CFG as given by a directed graph , where each node corresponds to an instruction, and each edge corresponds to a possible transfer of control from instruction to instruction [8]. Note that, the execution of a CFG corresponds to the set of state traces due to the execution of the commands on the considered CFG.
Consider the following program and corresponding single-command CFG [24]
In this case, since the program is deterministic, we have only one execution of the CFG for each input. For instance, the execution (denoted by the sequence of executed program points) starting from the state where is: .
Code obfuscation
In this section, we recall the standard definition of code obfuscation. Code obfuscation was first defined by Collberg et al. [7] as follows.
A syntactic program transformation is an obfuscator if:
the transformation is potent, namely it makes programs more difficult to analyse;
and have the same observational behavior, i.e., if fails to terminate or it terminates with an error condition then may or may not terminate; otherwise must terminate and produce the same result as .
The second point of the above (informal) definition requires the original and obfuscated program to behave equivalently whenever terminates, whereas no constraints are specified when diverges. This means that in order to classify a program transformation as an obfuscation, we have to analyze only the finite behaviors of the original and of the transformed program. This definition of code obfuscation makes it clear that the aim of code obfuscation is to obstruct program analysis, by making it more complex, thus concealing some information while preserving the observational behavior of programs (i.e., program denotational semantics). For instance, consider the following slight abstraction of the I/O semantics defined in Section 2.3: Let us consider with denoting the set of states denoting error conditions:
Namely, we precisely observe the final output only of terminating traces which do not terminate with an error condition. The definition of obfuscation of Collberg et al. requires that . In other words Definition 1 says that obfuscation has to reveal, which is an abstraction (i.e., a property) of concrete semantics [9].
A typical example of code obfuscation is the insertion of fake branches through opaque predicates [7]. A true (resp. false) opaque predicate is a predicate that always evaluates to true (resp. false). Program functionality is preserved by inserting the intended behavior in the always taken branch and buggy code in the never executed branch. This clearly confuses a static attacker that is not aware of the constant value of the opaque predicate and has to analyze both branches as possible program executions. In both cases the constant value of the predicate has to be difficult to deduce for an external observer that sees both branches as possible. However, the notions of transformation potency (or program complexity), of observational behavior and of attacker on which this definition relies are still informal. It is clear that in order to formally and deeply understand the potentiality and limits of code obfuscation it is necessary to formally specify the above mentioned notions.
In [14 ,15] the informal definition of code obfuscation of Collberg et al. has been generalized and placed in a theoretical framework based on program semantics and abstract interpretation. The idea is to introduce a formal model of malicious host attacks and of code transformations that allows a rigorous specification of the amount of “obscurity” added by a transformation to program semantics in the abstract interpretation framework.
The semantic framework for code obfuscation
When speaking of code obfuscation, we refer to program transformations that aim at obstructing reverse engineering by making it more difficult for attackers to obtain precise information regarding the original program. In general, we can say that any obfuscation technique is build for preserving some program features (in particular the I/O functionality) and for deceiving some other program features, namely for deceiving any feature that an attacker may exploit to perform reverse engineering. Let us explain our idea by means of examples.
An example, data-type obfuscation, has been given in the introduction, where the deceived feature is a semantic property, the parity of the computed values, that can be analyzed by static analysis, and in particular which is modeled by abstract interpretation [10,11]. But, what we are proposing here is not limited only to this kind of program features, since data-type obfuscation is not the only obfuscation technique where we can semantically characterize what is revealed and what is concealed. Consider, for instance, slicing obfuscation where we suppose that the attacker performs reverse engineering by extracting slices of the program. Hence, since standard algorithms for program slicing [29] are based on the computation of data dependencies, slicing obfuscation aims at deceiving precisely data dependencies, by adding dependencies, i.e., syntactic dependencies which do not corresponds to semantic ones [25]. In this way, any slicing algorithm based on data dependencies will compute imprecise slices.
Consider the word count program in Fig. 1 [35]. Note that the example is taken verbatim from [35] and it is written in C language.1
We decided not to translate this code in our toy imperative language for readability, for instance our language does not have an equivalent for getchar returning the next char in a file, or out as output instruction. However, this is not a problem since our framework is language independent.
It takes in a block of text and outputs the number of lines (), words () and characters (). The syntactic program transformation modifies line 7 by adding a true opaque predicate and adds lines 8 and 9 with false opaque predicates [35], i.e., . Let us consider both explicit and implicit data dependencies of the two programs. We say that there is a data dependence of variable from variable , denoted , when is defined in terms of or when there is a definition of in a branch of the program that is controlled by a condition in which appears . The table on the right of each program reports the dependencies among variables in the program. Suppose the slicing criterion looks for at the end of the program, then the slice of the original program is on the left in Fig. 2, while the slice of the obfuscated program is on the right. Hence, the above transformation conceals the real data dependencies of the program by adding fake dependencies between program variables, hence the obfuscation, while preserving the I/O semantics, has concealed the slicing computation on with respect to , due to the dependencies added by .
In this section, we describe the semantics-based approach to code obfuscation [14,15,21] in terms of the properties, respectively, to protect and to preserve.
Attacker model. In this context, the typical attacker aims at performing reverse engineering of programs for stealing or copying ideas. Reverse engineering attempts to understand the inner workings of programs by performing some kind of analysis. Reverse-engineering techniques include both static program analysis (e.g., data flow analysis, control flow analysis, alias analysis, program slicing) and dynamic program analysis which may also interact with the program execution (e.g., dynamic testing, profiling, program tracing). Moreover, reverse engineering techniques may be performed either automatically or manually, and may take advantage from the possibility of attacking a program several times on different environments. Due to the complexity and the diversity of the techniques that an attacker may perform, it is clear that when providing a formal framework we have necessarily to make some assumptions, obviously admitting some limitation in the model. In the particular context of the framework that we propose in this paper, we fix the model of the attacker at the semantic level, in order to allow a sufficient degree of generality, even if in this way we are not able to capture any possible attack. For instance, we cannot model manual attacks interacting with the execution. Nevertheless, we model any attack that extracts features through a specific observation of the program execution, i.e., of the program operational semantics (see Section 2.3).
In the following, we follow the ideas presented in [14,15,21,25]. Given a feature of interest, we model the corresponding semantic property as a function φ mapping any set of executions X to the sets of all the executions having the same feature of interest as the executions in X. This implies that φ is extensive (i.e., ), namely it approximates by adding noise, it is idempotent since the whole approximation is added in one shot, and finally it is monotone, preserving the approximation order. Namely it is an upper closure operator and the framework beneath is abstract interpretation [10,11]. As we have seen in the previous section, the set can be used to represent the set of possible program semantics, where Σ denotes the set of possible program states. Thus, we view attackers as the properties of program behaviors that they can observe/analyze, i.e., as over the domain of program operational semantics. In other words, the attacker is the map (uco) associating with each semantics S (set of executions) the smallest set of executions containing S and satisfying a corresponding invariant (the S observable property).
Hence, in the following of this paper, we only focus on program features, both preserved/revealed and protected/concealed, that are properties of the operational semantics, where by property we extensionally mean the set of all the executions having the desired particular feature. Hence, for instance, if we observe the parity of values, then a trace is approximated in the set of traces where only the parity of values is observable, i.e., . Analogously, if we observe data dependencies, the property of a trace is the set of all the computations inducing the same set of dependencies, that in this case can be identified simply with the information observed, i.e., the set of data dependencies. More formally, we can define the data dependency approximation which approximates any set of computations in the set of all the data dependencies among variables generated during the computation (e.g., by means of program dependence graphs), e.g., it approximates the operational semantics of programs in Fig. 1 in the set of dependencies represented in the tables on their right. In this case, we observe that , meaning that an external observer is not able to derive, from the analysis of the obfuscated program, the precise data dependencies holding in the original program, overestimating this information.
Note that, in this framework, the fact that we model attackers as the set of executions, or as an approximation of these executions, that they can observe, completely ignores the way in which the attacker performs the observation. In other words, what we propose depends on what the attacker observes, while it ignores how the attacker collects the observations of programs.
Obfuscation model. As argued above, an obfuscation is a program transformation that aims to confuse some properties of programs while preserving program behavior to some extent. This means that there are properties of the program behavior that are preserved by the obfuscation (typically the I/O semantics) and properties that are not preserved (the ones that the obfuscation aims at confusing). In [21], the authors specify when a program transformation is an obfuscator, in the semantic setting. Following this view, we obtain the following characterisation of an obfuscator:
An obfuscation transformer has to preserve at least the I/O functionality of programs, namely the I/O semantics of a program and of its obfuscated version have to be the same. More formally, let be the observation of the only I/O relation of the operational semantics of a program [9]
This is the closure operator corresponding to the semantics defined in Section 2.3. Then .
An obfuscator has to add confusion with respect to some properties that are revealed by the non-obfuscated program , thus generating an obfuscated program from which the confused properties cannot be precisely extracted. Namely, . In this case, we can say that the obfuscator is φ-potent for [14,15].
In the present framework, we say that the obfuscator preserves/reveals , while it conceals/protects φ, since φ cannot be extracted from the obfuscated program as it was on the original one. It is worth noting that in this way any property more abstract than is automatically preserved.
In the slicing obfuscation example we can prove that . Namely the I/O semantics of the original program can be precisely derived by the observation of its obfuscated version. Hence, the obfuscation conceals , while it preserves/reveals .
Note that this semantic approach to obfuscation allows us to characterize the obfuscating behavior of an obfuscation in terms of the most concrete property it preserves of program behavior, namely of program semantics. Let us denote with the most concrete property preserved by obfuscation on the semantics of programs [15]. The idea is that the most concrete semantic property can be used to characterize the semantic properties, namely the attackers, that are obstructed by the obfuscation and the ones that are not. Indeed, any attacker that is modeled as a property implied by the most concrete preserved property, i.e., , is not obstructed by obfuscation ; while any attacker interested in a property φ not implied by , i.e., , is defeated. Moreover, the mapping of syntactic program transformations to the lattice of abstract interpretations allows us to measure, reason and compare the potency and efficiency of different obfuscating transformations. For example, in [13,15,16] the proposed model of attackers and obfuscation has been used for understanding the class of attackers that are confused by opaque predicate insertion. In particular, the cited works consider the insertion of numerical opaque predicates such as . The attackers defeated by the opaque predicates are those that do not understand the always true value of these predicates. In fact, these attackers have to consider both the possible branches leading to a loss of precision of the static program analysis. These attackers are modeled by numerical abstract domains that do not allow precisely computing the invariant of the opaque predicate. For this reason, these abstract domains evaluate the predicate to ⊤, modeling that it can be either or . In this way, the attackers see both branches as possible. Consider, for example, an attacker interested in the Sign property, namely an attacker modeled by the abstract domain . This attacker would not be able to capture the opacity of the above mentioned opaque predicate . Indeed, consider such that . Computing the truth value of this opaque predicate for on means to compute , where , 2 and are the approximation of the module, square and addition operators on the domain of signs. Thus, since the square of any integer number returns a positive number. Next, because the sum of a positive number with a negative number can return either a positive or a negative number. Thus, because an integer number may or may not be a multiple of 2. Consider for example the following program and its obfuscation obtained through the insertion of the considered opaque predicate:
The attacker modeled by precisely infers that the output value of x in program is always positive, while it is not able to extract this information from the obfuscated program . Indeed, in the abstract semantics of the attacker has to consider both branches as possible and therefore the analysis of the sign of the output on returns meaning that the attacker is not able to understand that the program always outputs a positive number. This means that, the property of is not preserved on the transformed program by the considered obfuscation. In [16] these kind of observations have led to the design of an efficient opaque predicate detector for a class of numerical opaque predicates.
Syntactic vs semantic transformations. While code obfuscation is a syntactic program transformation and it transforms a program into another program, we have decided to model the obfuscating behavior of on the semantics of programs. This means that we are interested in the effects that obfuscation has on the semantics of programs. To this end, we find very useful the formal framework introduced by Cousot and Cousot [12], in which syntactic program transformations are related to their semantic counterpart and vice-versa.
The formal definition of the relation between syntactic and semantic program transformations given by Cousot and Cousot [12] allows us to reason on the effects that code transformations have on semantics. We consider to be the domain of programs up to semantic equivalence, where two programs and are semantically equivalent if , namely if they have the same semantics. In [12] programs are seen as abstractions of their semantics and this is formalized in the abstract interpretation framework. In particular, the semantic domain is abstracted in the syntactic domain , where ⊴ is the order induced on programs, namely , the abstraction of is the semantics of the simplest program (smallest number of instructions) that upper-approximates X. This means that, it is possible to associate to every syntactic program transformation its semantic counterpart , such that [12] and vice versa: and .
Observe that the equation expresses a syntactic transformation as an abstraction of the semantic transformation. Interestingly, starting from this formalization it is possible to derive a systematic methodology for designing syntactic transformations from semantic ones [12]. Of course, when the semantic transformation T relies on results of undecidable problems, any effective algorithm that tries to implement T would be an approximation of the ideal transformation . Thus, in general , or equivalently .
In the context of code obfuscation, the formal framework of Cousot and Cousot could be used to:
model obfuscation potency: reason on the effects that an obfuscation has on program semantics in order to deeply understand the semantic properties that are protected, i.e., concealed by the obfuscation, with respect to attackers modeled as semantic program properties;
specify property-driven obfuscation: given a semantic property to protect φ and a semantic property to preserve δ, develop a semantic transformation that conceals φ and reveals δ and uses this semantic characterization as a “measure” of optimality for any syntactic transformation implementing the corresponding semantic code obfuscation.
In other words, it provides a common ground, where syntactic obfuscations can be semantically analysed and compared with respect to the properties they are able to protect. Based on the investigation of point (1), presented in [14,15], we address here point (2).
From now on, we consider the semantic counterpart of code obfuscation, since it is at the semantic level that we can formally understand what is concealed and what is revealed. Indeed, studying obfuscation at the semantic level means studying its ideal behavior that will be necessarily approximated during the implementation process. In the following, an obfuscation of a property and a program is a semantic program transformation concealing φ on the semantics of , i.e., such that . We would say that a semantic program transformation is an obfuscation for φ if there exists a program for which T is an obfuscation, namely if .
The challenge: Property-driven obfuscations
The formal framework described above [14,15] models code obfuscation in terms of the most concrete semantic property preserved by the program transformation and this allows us to compare the potency of different obfuscating transformations and sometimes also their resilience [13,16], namely how well a transformation holds up under attack from a de-obfuscator [7]. However, this theoretical investigation does not provide any insight in the design of an efficient obfuscation. Indeed, what is still missing is a general strategy for designing and evaluating obfuscations given the specification of the property φ to protect, i.e., to conceal, and of the property δ to preserve, i.e., to reveal. This is exactly the high level goal of our investigation. More specifically, we investigate a general framework of function transformers that aim at minimally or maximally transform a function in order to reveal or conceal a given semantic property of a program. To this end, we first model and characterize the minimal transformations that preserve a certain property, later called revelation. To preserve means to leave unchanged and, therefore, to reveal the property of the original program in the transformed/obfuscated program. Moreover, we model and characterize the maximal transformation that protect a given property, later called concealment transformers. To protect a property of the original program means to change the property and, in this way, to conceal it. Next, we show how the combination of revelation and concealment can be used for characterizing an obfuscating transformation from the specification of the property φ to be concealed and the property δ to be revealed. What we obtain, in this way, is the characterization of the semantic transformations that exhibit the intended obfuscating behavior. This characterization could then be used to drive the design strategy for syntactic code obfuscations that implement the desired semantic behavior.
Modeling revelation
In this section, we investigate the semantic transformers minimally modifying a program semantics, in order to reveal a given property. We first provide a formal definition of the revelation transformers as a generic function transformer, and then we prove some of their interesting properties. Afterwards, we provide a characterization of the revelation transformers making more explicit how the construction of the revelation is based on the preservation of the property to be revealed. Then, we show how existing program transformations are indeed revelations. To conclude, we discuss how the revelation transformers relate to the semantics-based notion of code obfuscation.
Consider the complete lattice of functions over a complete lattice L, where functions are ordered point-wise. Given a function and a property δ modeled as abstraction of L, i.e., , we define the two function transformers and that aim at computing the two functions closest to f in the domain , respectively from above and from below, revealing the property δ. We consider both forms of revelation because there are two possible ways to minimally transform a function in order to make it reveal a property (from above and from below).
(Minimal revelation).
Let L be a complete lattice and . The operators are the minimal revelation from above and from below for δ:
Observe that, it is interesting to study the minimal revelation from above and from below for f when: (*) and do not trivially transform f, i.e., they do not transform f in the top () or the bottom () of the functional domain; (**) and reveal the property δ. In order to guarantee that the transformer always characterizes the minimal revelation, has to be monotone and extensive (approximating from above), and it has to be idempotent. Hence, has to be an upper closure operator on the lattice , and dually has to be a lower closure operator (). In the following, we consider the case when the revelation transformer do not trivially transform a function with respect to a property . For this reason we find it convenient to introduce the notion of not--trivial pair .
(Not--trivial).
Given a function on a complete lattice L and a property , we say that the pair is not--trivial when: and .
The following result proves that and are precisely the minimal transformers inducing the revelation of the property δ. Namely, and are respectively an upper and lower closure operator and they satisfy (**).
Let L be a complete lattice,and.
If δ is meet-uniform, then;
.
In this theorem, we use the notion of uniformity, in particular of meet-uniformity which means that the greatest lower bound (glb) operation preserves the property δ, namely the glb of elements with same property δ has the same property δ. This precisely models the fact that we can find the best approximation of f from below sharing the same property δ of f. Thus, given a function , we have that returns the closest function that is greater than f and that reveals the property δ. For this reason, we refer to as the minimal revelation (from above) of f w.r.t. δ. Dually, we have that given a function , then returns the closest function that is smaller than f and that reveals the property δ. Analogously, we refer to as the minimal revelation (from below) of f w.r.t. δ.
Characterizing revelation transformers
The following result provides an explicit characterization of , when the pair is not--trivial. In the following, given a property we define the kernel of δ with respect to an element as . Then we use the shorthand , and . Note that , since δ is an upper closure operator.
Let L be a complete lattice,andsuch that the pairis not--trivial. We have that:
If δ is meet-uniform then;
.
The above characterization says that the minimal revelation from above w.r.t. δ of f is the function that associates with each x the least upper bound between and the smallest element that preserves δ on x. Indeed, this corresponds to add the minimal amount of information to in order to make it preserve the property δ on x. An analogous reasoning holds for the minimal revelation from below.
Finally, we observe that by definition of and of , when we have that and therefore , and analogously which means that .
Examples of revelations
In this section, we discuss two examples of existing transformations that are indeed revelations. In particular, a property is a revelation, namely it is the most abstract transformation preserving the property itself, and slicing is a revelation, namely it is the most abstract subprogram preserving the criterion. We can show that both are instances of the minimal revelation from below of the function .
Property as revelation. First of all, let us observe that the minimal revelation from below of the function w.r.t. δ, i.e., , is δ itself.
.
Slicing as revelation. In this section, we show that, by changing the interpretation of the underlying domain and by computing the minimal revelation from below of the function , we obtain slicing. Slicing is a program transformation technique that extracts from programs sets of commands that are relevant for a particular behavior, called the criterion [41]. A slicing criterion specifies the variables of interest and the program points where these variables have to be observed. Several different kinds of slicing have been collected in a unique framework [3] which has been recently enriched with the semantic (abstract) form of slicing [37,38]. In this framework, slicing criteria are seen as projections on the trace semantics, and therefore are abstractions of the semantics. In this sense, we can observe that slicing produces the most abstract subprogram preserving the slicing criterion. Let be the abstraction corresponding to the slicing criterion, for instance in standard backward static slicing, it is the abstraction observing only the output values of the criterion variables. By Proposition 1 we have that . This means that, the abstraction corresponding to the slicing criterion is the semantics of the desired slice. At this point, we can use the Cousot and Cousot program transformer framework for characterizing the corresponding program slicing (see Section 3 and [12]). In particular, the idea is to look for the simplest program, i.e., containing the smallest number of instructions, with the given semantics, in the set of all the subprogram of . Namely, we do not consider the whole set of all programs , but only the subset of all the subprograms of , i.e., . Next example shows a simplification of the problem in order to provide an intuition of the relation between revelation and slicing. Namely, we show that the semantics of the slice of for a given criterion is precisely the revelation preserving the criterion, computed on the lattice of subprograms of .
Consider the program original in Fig. 1. Suppose that the slicing criterion is the variable nl at the end of the program execution, which can be modeled by the abstraction of traces taking all the traces providing the same output value for nl, , where denotes the output value for variable x of a trace τ. Hence, the semantics corresponding to this criterion, which is the semantics of the slice, is precisely the set of all the traces computing the same final value for nl, namely . At this point, the simplest subprogram of original that has this as semantics is the computable slice w.r.t. nl, namely it is:
Revelation vs semantic program obfuscation
It is interesting to observe how the minimal revelation from below is related to the notion of semantic obfuscation in [14,15]. The semantic notion of obfuscation of [14,15] characterizes the obfuscating behavior of a semantic transformation in terms of the most concrete property that it preserves on program semantics. Thus, it is not surprising that this characterization relates to the revelation transformer that minimally modifies transformation T in order to preserve property . In particular, since by hypothesis is preserved by T, we have that , which means that the minimal revelation of the code obfuscation T for property is exactly the obfuscation T itself.
Moreover, if we focus only on the property that an obfuscation T preserves we can maximize the obfuscating behavior by computing the revelation w.r.t. the property from the top semantic function . Indeed, consider a semantic transformation such that the most concrete property that it preserves is , namely such that . Since G preserves and since is monotone, we have that . Observe that all the transformations such that and actually hide the same set of semantic program properties . Hence, we can say that is the maximal transformation strategy, with respect to point-wise ordering of functions having the considered obfuscating behavior.
Modeling concealment
In this section, we define and investigate the existence of the semantics transformers that maximally modify a program semantics (in the function point-wise order), in order to conceal a given property. It is maximal since it adds all the possible noise to the program semantics in order to confuse the property to protect, while keeping enough, of the original functionality, to recover the original preserved program semantics by revelation. This constraint is fundamental for modeling obfuscation when combined with revelation.
Formally, we first provide the definition of the concealment transformers on generic functions, and then we prove some of their interesting properties. Afterwards, we provide a characterization of the concealment transformers that shows how their construction is based on the property to be concealed. Finally, we discuss how the concealment transformers relate to the semantics-based notion of code obfuscation. The idea is to find the farthest functions from f, on the lattice , that conceals a certain property φ, and these are characterized as the farthest functions among all the functions having the same revelation transformer of f w.r.t. φ. These transformers are precisely the adjoints [30] of the revelation ones, which while transforming, in order not to reach the top, keep the strong bind with the original function consisting in having the same revelation transformer. The following definition formalizes these transformers maximally concealing a certain given property.
(Maximal concealment).
Let L be a complete lattice, and . We define as:
This means that is the smallest among all the functions sharing the same revelation, from above w.r.t. φ of f, while is the biggest among all the functions having the same revelation from below w.r.t. φ of f. These transformers are interesting if they provide, as result, a function that has the same revelation, and this clearly is not always true. In particular, the maximal concealments are defined as the adjoints of the revelation transformers and these adjoint transformers do not always exists. We know that an upper closure admits adjoint, which is a lower closure [30], if it is meet-uniform [23,36], while, dually, a lower closure admits adjoint, which is an upper closure, if it is join-uniform. It is worth noting that uniformity is a local property, namely a function g may be uniform on a particular input x, i.e., , while failing uniformity on other inputs. In this case, we say that g is (meet)-uniform on x and w.r.t. x we can find the adjoint, i.e., . Hence, we have that if , on a function f, is meet-uniform, then is the minimum, meaning that it is an upper closure operator and behaves as adjoint of , and dually if is join-uniform on f, then is the maximum. For this reason, we have studied the properties of meet and join uniformity, respectively, of and .
Let L be a complete Boolean algebra,andsuch that the pairis not--trivial. We have that:
If φ is meet-uniform thenis meet-uniform on f;
is join-uniform on f.
Characterizing concealment transformers
As done for the revelation transformers in Theorem 2, the following results provide a characterization of the maximal concealment transformers in terms of the kernel of the property that we want to conceal.
Let L be a complete Boolean algebra,andsuch that the pairis not--trivial. We have that:
If φ is meet-uniform then;
.
This means that, the maximal concealment from below w.r.t. property φ of f, is the function that associates, with each element x, the greatest element, smaller than , which provides no information about the property φ on x, namely the greatest element that is the complement, w.r.t. ⊥, of the smallest element that preserves φ on x. Analogously, we have the dual characterization of the maximal concealment form above.
These characterizations turn out to be particularly meaningful when interpreted on a powerset domain ordered by set inclusion, as done in the following corollary.
Let D be a complete lattice,and. For eachsuch that the pairis not--trivial. We have that:
If φ is meet-uniform then;
.
Indeed, on the powerset domain, we have that, for any given set , the maximal concealment from below, namely , can be obtained by erasing from the smallest set in . Hence, we erase the minimal information that leads to the property . On the other hand, the maximal concealment from above, namely , can be obtained by adding to the biggest element in D that does not share the same property . It is clear that in this way we add the maximal noise to w.r.t. φ without any constraint on what we want to preserve/reveal.
Concealment vs code obfuscation
Let us understand the meaning of the concealment transformer, in the context of code obfuscation. To this end, we interpret the characterization given in Corollary 1 of the maximal concealment from above, in the code obfuscation scenario. Corollary 1 states that, given a semantic transformation , its maximal concealment from above for property φ is given by . This means that given a program semantics and a semantic program transformation T, if we want to characterize the maximal confusion added by T, with respect to property φ, we have to consider all the traces that do not belong to property . Observe that, in this way, there is no guarantee of preserving any semantic property of the original semantics X. Thus, the maximal concealment from above corresponds to the maximal noise that can be added, w.r.t. a certain property φ, but it is not an obfuscation since this transformation does not consider the need to preserve the semantics of the original program to some extent.
Consider, for instance, CFGs whose definition is recalled in Section 2. In [21] the authors showed that CFGs can be constructed as abstractions of the program semantics, hence let be the abstraction such that is the CFG of the program . According to Corollary 1, the concealment transformation , starting from the identity, takes the semantics , namely the maximal semantics (containing as much noise as possible) whose corresponding CFG is different from P’s CFG, and then adds the semantics of for being extensive. It is clear that this semantics corresponds to a program that still may behave like , but that can also show many other different behaviors. This is because the concealment simply characterizes the action of adding confusion and not the binding, in terms of the observational property to preserve, that we aim to keep with the original semantics of .
The concealment provides an important understanding of what we have to add in order to obfuscate a given property: it characterizes the pool of all the possible computations that we may add in order to gain confusion on the observation of φ, hence concealing φ. At this point, it is worth noting that the revelation transformer for δ would allow us to refine this information by avoiding all the computations that do not preserve a property δ.
Characterizing property-driven obfuscations
In this section, we combine revelation and concealment transformers for modeling maximal obfuscation strategies in the semantics-based code obfuscation framework. In this way, we aim at improving the global understanding of code obfuscation. In our formalization, we clearly separate the property to conceal from the property to reveal, and this leads to a better comprehension of the interplay of these two properties on the semantics of the programs we want to obfuscate.
As discussed in Section 3, in the code obfuscation scenario, by combining revelation and concealment we can characterize an obfuscation starting from the specification of what we want to reveal, i.e., , and of what we want to conceal, i.e., . The idea is to start from the identity function and maximally transform it in order to conceal the property that we want to protect and then to minimally transform the result in order to preserve the property that we want to reveal.
(Maximal property-driven obfuscation strategy).
Given . The maximal property-driven obfuscation strategy for revealing δ and concealing φ is defined as follows:
where is the identity function over the semantics of programs.
The following result provides a characterization of the maximal obfuscation strategy where we make more explicit its dependency on the properties, to conceal and to reveal. This is a direct consequence of the characterizations of concealment and revelation presented in the previous sections.
.
The above result shows that, in order to conceal φ and to reveal δ on X, we have to start from the identity, and first add noise to the information concerning by adding those traces with a different φ property w.r.t. X, and then we guarantee the preservation of by selecting only those traces having the same δ property of X. Finally, we add all the original traces in order to guarantee that the original semantics is preserved.
Coherently with what is proved in [14,15], we can have a property-driven obfuscator for a given semantics X that reveals δ and conceals φ, only if does not imply (namely reveal) . Observe that when the obfuscating transformation is acting as the identity function and therefore it is not protecting property φ of the semantics X. By definition we have that if and only if , namely if . This means that we cannot hide, and therefore protect, anything that is implied by what we reveal, as formally stated in the following corollary.
if and only if.
We can observe that the characterization of provided in Proposition 2 is quite strong in the context of code obfuscation because it adds the whole semantics to the semantics of the obfuscated program. This implies that the original semantics has to be contained in the obfuscated program, which is a strong requirement. Indeed, since we have to reveal only δ, it is sufficient that the obfuscated program contains, i.e., preserves, only the abstract semantics . This observation is important also to partially fill the gap between the proposed strategy and existing code obfuscations which transform also traces of , namely obfuscations where the semantics of the original programs are not contained in the semantics of the obfuscated programs, i.e., such that . We observe that, the obfuscating component of is the set we add to X, namely , while the preservation condition forces any obfuscated version of X to stay inside . Hence, to obfuscate for revealing δ and concealing φ, means to transform X into a set of traces that φ sees different from X, namely that does not belong to , while δ sees equivalent to X, namely that belongs to . This observation leads to the following weakened characterization of property-driven code obfuscation strategies.
(Weakened property-driven obfuscation strategy).
Let . is a weakened property-driven obfuscation strategy of X that conceals φ and reveals δ if and only if
Hence, in this case the maximal property-driven obfuscation strategy is the maximal element of contained in . Finally, the next result shows the relation between the existence condition of also this weakened version of property-driven obfuscation strategies and the semantic code obfuscation characterization provided in [14,15] (see Section 3).
A weakened property-driven obfuscation strategyexists if and only if
The next example shows a simple program where we want to conceal the property of signs and we want to preserve the I/O semantics. Moreover, in this example, we show how we can model a real code transformation by means of the weakened property-driven obfuscation strategy.
Where the abstract element + represents the set of positive numbers without 0, and − the set of negative numbers without 0.
be the property of signs that can be lifted on as follows: given we define , where we recall that is the ith state of the trace τ. Let be the I/O property. Let us consider, , and , where is given in Fig. 3. For simplicity, suppose . Then we have that
Then, an obfuscation , following the weakened property-driven strategy , has to be such that and . Hence, the semantics of the obfuscation must be a set of traces sharing the same I/O behavior with , but whose sign property changes. At this point any program whose semantics satisfies these conditions may be considered as an obfuscation of . For instance, consider on the right in Fig. 3, we have and
In this section, we provide an example of how our approach can be instantiated. It is worth noting that what we provide is not a particular obfuscation technique, but a framework for understanding existing obfuscations and for designing new ones depending on two properties: what we want to conceal and what we want to leave unchanged, namely to reveal of the original program semantics. In this way, as we will also explain later in Section 7.2, several existing techniques may be characterized precisely in terms of what they preserve and what they protect, but moreover we can even understand better how they work. For instance, the use of opaque predicates, which are used for adding fake branching in the CFG, confuses (and therefore protects) the control structure of the CFG but it releases (namely preserves) more than the I/O semantics, since the whole concrete executions are preserved.
As far as the design of new techniques is concerned, our framework can be used to understand, once we fix the two properties (φ to protect and δ to preserve), what kind of behaviors we have to add to the semantics of the program, and therefore how we have to transform the program, in order to guarantee that we change φ and we leave unchanged δ.
In the next section, we present recent work on obfuscation based on language interpreters. This is a particular technique that can be used whenever we aim at revealing the program semantics, in the sense that by construction the interpreter-based obfuscation always preserves the I/O program behavior. At the same time, this technique transforms the program code by including some syntactic fragments that are known for making imprecise the analysis we aim at concealing. The reason why this approach easily fits in our framework is that it is based on the same model of attacker. It is clear that, as we underlined more than once, this is an obfuscation technique working precisely for the intended model of attacker, namely parametric on the property we aim at concealing. In particular, this means that if we face an attacker able to perform less precise analyses, then surely our program is protected, while if we face an attacker performing more precise analyses, then it may be able to disclose (partial) information about the property we aim at concealing.
Property-driven obfuscation and distorted interpreters
Standard obfuscation aims at revealing precisely the observable program behavior while concealing other properties, such as the data dependencies or the control flow structure. In these cases, namely when we do not want to reveal anything different, more precise or more approximated than the observable semantics, we may use the obfuscation technique based on the design of distorted interpreters [21]. The aim, of this approach, is to obfuscate either by making its observation hard and imprecise, or by making protected information imperceptible.
In this section, we propose a systematic technique for building an obfuscator protecting/concealing a given property, when the property to reveal is simply the I/O behavior. The idea, is that of transforming the original program into an obfuscated program by specializing a distorted (but still correct) interpreter as in [31,32]. Let us explain the idea on an example [21,25]. Let be the simple command, , multiplying and , and storing the result in , as considered in [20]. An automated program sign analysis approximating values in the simple domain of signs is clearly able to catch, with no loss of precision, the intended behavior of with respect to sign, as the sign abstraction , is complete for integer multiplication. However, if we obtain the obfuscated program by replacing with:
The sign analysis is now unable to extract any information concerning the computed sign, because the rule of signs is incomplete for integer addition. In particular, we observe that when and , then (being ) and is the sum of a negative number with a positive one, and therefore we lose the sign information. is therefore an obfuscation of for the attack performed by sign analysis.
The main idea of this work is to obfuscate by specializing an obfuscating self-interpreter. An interpreter executes programs written in (potentially) another programming language. The interpreter is self if it is written precisely in the interpreted language. This choice is connected with the process of specialization. Indeed, specialization (better known as partial evaluation) means to partially evaluate a program on some known inputs, namely the specialized program [32] is precisely the same but with some instantiated computations. Hence, an interpreter specialized on a program is precisely the interpreter where all the computations concerning the input program are instantiated. Now, suppose that the program to obfuscate is written in the language , then if we want the obfuscated program (i.e., the specialized interpreter) to be written in the same language , then the interpreter has to be written precisely in , namely it has to be a self-interpreter.3
It is worth noting, anyway, that this is not mandatory in general, if we admit as part of the obfuscation process the possibility of changing also the programming language.
Hence, can be obtained by modifying a self-interpreter in order to force abstract interpretation to deal with operations that induce incompleteness in the attacker. Let us explain the technique on the sign example.
Suppose the distorted interpreter recursively implements multiplication by a sequence of additions in the obvious way (as above). This yields, by specialization, a modified program which is precisely . Moreover, in this work it has been easily observed that, in general, if is the result of specializing a self-interpreter to program , then it is immediate that , by simple equational reasoning [21]. Therefore function is a semantics-preserving program transformer [31]. However, even though and are semantically equivalent, they may be syntactically quite different (far more different than just by renaming). In particular [31]:
program inherits the programming style of ; but
program inherits the algorithm of program .
Simple self-interpreter.
The reason for Point (1) is that program is specialized code taken from the interpreter : consists of the operations of that depend on its dynamic input (and not only on , else they would have been “specialized away”). A general-purpose program transformer can thus be built by programming self-interpreter in a style appropriate to the desired transformation. In this style we can embed the property we aim to conceal. In the example above, the specialized interpreter can transform multiplication to sequences of additions because we aim to conceal the sign property for which the addition become imprecise. The design of the distorted interpreter starting from the property to conceal is still made ad hoc, but it is an ongoing work based on the analysis of the strong relation between obfuscation and incompleteness [24].
The reason for Point (2) is that, even though program may be a disguised form of , a correct interpreter must faithfully execute the operations that specifies. In usual practice, the transformed (by the specialized interpreter) program will perform the same computations in the same order as those performed by , guaranteeing the sound revelation of the program semantics.
The overall structure of the interpreter is traditional, with a “dispatch on syntax” loop: find the form of the current instruction at (program counter), and then execute it. The memory of program is held in variable [21]. In Fig. 4 is provided the algorithm for of a simple self-interpreter (implemented in scheme for experiments [21]). Assume input program has variables .
An example of specialization of this self interpreter is provided in Fig. 5 explained in Section 7.2.
Validating examples
In the following, we model three well-known examples of code obfuscation in our framework of revelation and concealment. In particular, we show how the framework proposed characterizes a pool of possible transformations that satisfy the fixed constraints of revelation and concealment. The following examples show that, whenever we want to preserve the behavior of programs and we are able to associate to an obfuscation strategy a family of syntactic elements that implement the obfuscation, then we can use the distorted interpreters introduced in the previous section [21], for choosing one precise element from the characterized pool.
Self-interpreter for data-type obfuscation.
Data-type obfuscation. Let us consider, in a more detailed way, data-type obfuscation cited in the introduction, namely the obfuscation techniques based on the encoding of data [18]. In this case, obfuscation is achieved by data refinement, namely by exploiting the complexity of more complex data structures or values in such a way that actual computations can be viewed as abstractions of the refined (obfuscated) ones. The idea is to choose a pair of commands and such that is equivalent to . This means that both and are commands of the form: and , for some function F and G. A program transformation is a data-type obfuscation for data-type x if and have the same denotational behavior on x, where adjusts the data-type computation for x on the refined type (see [18]). It is known that data-type obfuscations can be modeled as adjoint functions (Galois connections), where represents the program concretizing , viz. refining, the datum x and represents the program abstracting the refined datum x back to the original data-type. As proved in [20], this is precisely modeled as a pair of adjoint functions: and relating the standard data-type for x with its refined version . Consider the following program [21]: the small input program to the left is obfuscated into the equivalent program on the right (where constant expressions have been evaluated):
In this case, corresponds to the commands at the new lines 2 and 3 that multiply by 2 the values of x and y, then and are responsible for the adjustment of the other computations inside the program (for example in the guard of the while, the check is done on the value of x divided by 2), and corresponds to the command at line 7 where the value of y is divided by 2 before giving it in output. Let be the property of parity that can be lifted on as previously seen in Example 4. Let be the I/O property seen in Section 3.1. Let us consider, , and . On input , where the first value is the value of x and the second value the value of y, we have that (in sake of simplicity we ignore repetition of states, i.e., ):
Then a property-driven obfuscation has to be such that and that . Hence, the semantics of the obfuscation must be a set of traces such that the I/O behavior is precisely that of , but whose property changes. At this point, any program whose semantics satisfies these conditions may be considered as an obfuscation of . For instance, consider above on the right, we have that the I/O property is preserved, since , while the property is not:
Hence data-type obfuscation is a possible way to formalize a systematic technique for generating transformed code satisfying our obfuscation framework. Data-type obfuscation can be easily implemented by specializing an interpreter. In particular, for instance the interpreter automatically generating the obfuscation above [21] is provided in Fig. 5.
Opaque predicates. Let us consider the opaque predicates obfuscation. Opacity is an obfuscation technique based on the idea of confusing the control structure of a program by inserting predicates that are always true (false) independently of the memory, or (unknown) which may lead only to identical paths [7]. It is clear that our approach allows us to characterize which property each code obfuscation aims at preserving.
In the case, namely by using opaque predicates, we observe that the transformation preserves something more than the simple I/O behavior. Let us consider the domain of control flow graphs (CFG) (see Section 5.2), for each let us denote by and respectively its nodes and its edges. Let us define the following relation: Let , then we say that is sub-graph of , written , iff and . At this point it is clear that the obfuscation based on the insertion of opaque predicates consists in adding nodes and edges without changing in other ways the original CFG. Hence, let us define the following properties:
In particular, we have that is the set of all graphs of which is a sub-graph, and are the set of graphs that have the same set of executions of . Hence, we can say that the property that we want to preserve is , which is additive by construction. Indeed, precisely collects all the graphs containing and having the same set of executions of .
At this point, we aim at confusing the control flow graph, which is the domain we are starting from, hence the property φ we want to conceal is simply the identity. The following result is a slight rewriting of a theorem in [21].
Given, then we have thatiffdiffers fromonly for the insertion of opaque predicates (true, false or unknown).
At this point note that, in order to obtain a weakened property-driven obfuscation of a control flow graph we observe that
Hence, following Definition 6, a weakened property-driven obfuscation of a control flow graph G based on the insertion of opaque predicates is any obfuscating transformation in
Namely, an obfuscation is precisely any CFG preserving the structure and the program semantics as determined by , but which is simply different from . This means that any strategy building such a graph starting from is an obfuscation of and it must contain opaque predicates by Proposition 3. For instance, the obfuscation technique proposed in [21] could be considered as a way for systematically building such an obfuscation. Namely, also in this case we could implement a syntactic transformation which can be implemented by means of an obfuscating interpreter that adds opaque predicates in specific program points.4
Slicing obfuscation. Consider the slicing obfuscation example with programs in Fig. 1. We already observed that while . This means that there exists and such that .
As explained before, a technique has been proposed [21] which is based on the specialization of self-interpreters in order to obtain an obfuscator preserving , while protecting the dependency abstraction. This is obtained by transforming programs by adding fake dependencies. The semantics is instead preserved by construction, since interpretation and specialization cannot change the program semantics.
Discussion and future work
In this paper, we provide a formal framework where functions can be maximally transformed in order to change a given property φ (concealment transformer), or can be minimally transformed preserving a given property δ (revelation transformer). We motivate this formal framework in the context of semantic code obfuscation, where indeed we have to make impractical the observation of information to protect while preserving the I/O functionality of a program. In particular, we combine the revelation and concealment transformers into a characterization of an obfuscating transformations based, precisely, on the properties φ and δ. We obtain, so far, a general characterization of a semantic code obfuscations which allows us to further reason about the existing relations between the different actors occurring in the obfuscation scenario, and therefore to deeply understand their interplay. In particular, it may seem that, since we treat concealment and revelation properties separately, we are supposing that they are independent one from the other. Indeed, this is not the case, in fact, it is precisely the possible existing interaction between the property φ to protect and other program properties that induces us to combine a concealment transformation, transforming a program for protecting φ, with a revelation transformation recovering the functionality that could have been transformed as side effect of the concealment. In other words, a concealment transformer is a transformer that conceals at least the given property φ but which, being maximal, surely transforms many other program properties. Then it is revelation that allows us to recover at least the property we want to preserve. In this way, the noise that remain in the maximal obfuscation strategy is the pool of all computations that, while changing φ, preserves the desired functionality.
In the following, we discuss some further research directions arising form this work and that we plan to address in the future. By instantiating this characterization to a specific program , we obtain a specification of the obfuscated version of the considered program. Namely, represents the semantics of the obfuscation of program . As future work, we plan to investigate the existence of this obfuscated program in terms of the interplay between the two considered properties φ and δ in the semantics of . Our intuition is that, given a program , it is possible to obfuscate it in order to reveal φ while concealing δ only if these two properties are somehow independent in the semantics of . More specifically, we aim at formalizing the notion of independence of those two properties in terms of abstract non-interference [17,22].
In Section 6, we have defined the maximal property-driven obfuscation strategy as , namely starting from the identity function on program semantics. It would be interesting to study what happens when we apply revelation and concealment to existing obfuscations. Consider an obfuscation and its semantic counterpart . Then, minimally transforms obfuscation T in order to make it reveal property δ. Whereas, somehow optimizes obfuscation T to make it reveal δ and conceal φ. Moreover, it may be worth studying how different concealments can be composed, namely understanding the composition of abstractions distributes on the concealments. We plan to validate these observations on existing obfuscation techniques.
Another future research direction regards the investigation of the potency and resilience of the obfuscation transformers based on the proposed strategy, with respect to an attacker. In this work, we have implicitly assumed that the attacker is modeled by the property φ that we want to protect. In this way the transformation corresponding to provides a characterization of the obfuscation that is able to defeat such an attacker. In general, we can think of an attacker as modeled by a further abstraction of program semantics and then study the relations between the attacker β, the protected property φ and the released property δ in order to understand the efficiency of the proposed obfuscation with respect to β.
In the literature, there exists another formalization of the notion of obfuscation where the potency of the transformation is characterized on the abstract program semantics, by means of completeness,5
Completeness is a well-known notion in abstract interpretation that characterizes program analysis precision [10].
instead of on the abstraction of the concrete semantics, as we have done in this work. In particular, [21,24,25] provide a description of obfuscations as transformers preserving and protecting fixpoint abstract semantics and the potency of a program transformation as the properties for which the transformation is incomplete, namely imprecise. In this case, a property is revealed if and only if it is revealed by each step of the fixpoint semantic computation. Hence, the idea is to generalize revelation transformers, and therefore concealments, by considering this more general notion of revelation instead of the simple abstraction of the whole semantics.
From the theoretical point of view, this can be related with other property-driven function transformers, such as the complete shells and cores [27], the incomplete transformers [24], the transformers towards additivity [1], obtaining so far a framework for property-driven transformers parametric on the property to guarantee.
Finally, we are clearly interested in designing and developing obfuscation techniques based on the proposed strategy and therefore parametric on the properties to reveal and to conceal.
Footnotes
Acknowledgments
We would like to thank Roberto Giacobazzi for the inspiring discussions at the beginning of this work. We are also grateful to the anonymous reviewers for their useful comments and suggestions. This work has been supported by the MIUR FIRB project FACE (Formal Avenue for Chasing malwarE) RBFR13AJFT.
Technical results
The following result identifies the conditions on the relation between f and δ that guarantees that and do not trivially transform f, i.e., they do not transform f in the top () of the bottom () of the functional domain.
This means that we can find a non trivial simplification of function f revealing property δ, if and only if function f “loses” something of the property δ of the original element. Analogous for the refinement of f.
In the following we report the proofs of the results stated in the paper.
References
1.
H.Andéka, R.J.Greechie and G.E.Strecker, On residuated approximations, in: Categorical Methods in Computer Science, Lecture Notes in Computer Science, Vol. 393, 1989, pp. 333–339.
2.
B.Barak, O.Goldreich, R.Impagliazzo, S.Rudich, A.Sahai, S.P.Vadhan and K.Yang, On the (im)possibility of obfuscating programs, in: CRYPTO ’01: Proceedings of the 21st Annual International Cryptology Conference on Advances in Cryptology, Springer-Verlag, 2001, pp. 1–18.
3.
D.Binkley, S.Danicic, T.Gyimóthy, M.Harman, Á.Kiss and B.Korel, A formalisation of the relationship between forms of program slicing, Sci. Comput. Program.62(3) (2006), 228–252. doi:10.1016/j.scico.2006.04.007.
4.
S.Chow, Y.Gu, H.Johnson and V.A.Zakharov, An approach to the obfuscation of control-flow of sequential computer programs, in: ISC ’01: Proceedings of the 4th International Conference on Information Security, Springer-Verlag, 2001, pp. 144–155.
5.
C.Collberg and C.Thomborson, Watermarking, tamper-proofing, and obfuscation-tools for software protection, IEEE Trans. Software Eng.28(8) (2002), 735–746. doi:10.1109/TSE.2002.1027797.
6.
C.Collberg and C.D.Thomborson, Software watermarking: Models and dynamic embeddings, in: Proceedings of the 26th ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages (POPL ’99), ACM, 1999, pp. 311–324. doi:10.1145/292540.292569.
7.
C.Collberg, C.D.Thomborson and D.Low, Manufactoring cheap, resilient, and stealthy opaque constructs, in: Proceedings of Conference Record of the 25st ACM Symposium on Principles of Programming Languages (POPL ’98), ACM, 1998, pp. 184–196.
8.
K.D.Cooper and L.Torczon, Engineering a Compiler, 2nd edn, Elsevier, 2012.
9.
P.Cousot, Constructive design of a hierarchy of semantics of a transition system by abstract interpretation, Theor. Comput. Sci.277(1–2) (2002), 47–103. doi:10.1016/S0304-3975(00)00313-3.
10.
P.Cousot and R.Cousot, Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints, in: Conference Record of the 4th ACM Symposium on Principles of Programming Languages (POPL ’77), ACM, 1977, pp. 238–252. doi:10.1145/512950.512973.
11.
P.Cousot and R.Cousot, Systematic design of program analysis frameworks, in: Conference Record of the 6th ACM Symposium on Principles of Programming Languages (POPL ’79), ACM, 1979, pp. 269–282. doi:10.1145/567752.567778.
12.
P.Cousot and R.Cousot, Systematic design of program transformation frameworks by abstract interpretation, in: Conference Record of the Twentyninth Annual ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages, ACM, 2002, pp. 178–190.
13.
M.Dalla Preda and R.Giacobazzi, Control code obfuscation by abstract interpretation, in: Proceedings of the Third IEEE International Conference on Software Engineering and Formal Methods (SEFM ’05), IEEE Computer Society, 2005, pp. 301–310. doi:10.1109/SEFM.2005.13.
14.
M.Dalla Preda and R.Giacobazzi, Semantic-based code obfuscation by abstract interpretation, in: Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP ’05), Lecture Notes in Computer Science, Vol. 3580, Springer-Verlag, 2005, pp. 1325–1336.
15.
M.Dalla Preda and R.Giacobazzi, Semantics-based code obfuscation by abstract interpretation, Journal of Computer Security17(6) (2009), 855–908. doi:10.3233/JCS-2009-0345.
16.
M.Dalla Preda, M.Madou, K.De Bosschere and R.Giacobazzi, Opaque predicates detection by abstract interpretation, in: Proceedings of the 11th International Conference on Algebraic Methodology and Software Technology (AMAST ’06), Lecture Notes in Computer Science, Vol. 4019, Springer-Verlag, 2006, pp. 81–95. doi:10.1007/11784180_9.
17.
M.Dalla Preda, I.Mastroeni and R.Giacobazzi, Analyzing program dependencies for malware detection, in: 3rd ACM SIGPLAN Program Protection and Reverse Engineering Workshop (PPREW 2014), ACM, 2014, Article No. 6.
18.
S.Drape, C.Thomborson and A.Majumdar, Specifying imperative data obfuscations, in: ISC – Information Security, J.A.Garay, A.K.Lenstra, M.Mambo and R.Peralta, eds, Lecture Notes in Computer Science, Vol. 4779, Springer-Verlag, 2007, pp. 299–314. doi:10.1007/978-3-540-75496-1_20.
19.
S.Garg, C.Gentry, S.Halevi, M.Raykova, A.Sahai and B.Waters, Candidate indistinguishability obfuscation and functional encryption for all circuits, in: 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26–29 October 2013, Berkeley, CA, USA, IEEE Computer Society, 2013, pp. 40–49.
20.
R.Giacobazzi, Hiding information in completeness holes – New perspectives in code obfuscation and watermarking, in: Proceedings of the 6th IEEE International Conferences on Software Engineering and Formal Methods (SEFM’08), IEEE Press, 2008, pp. 7–20.
21.
R.Giacobazzi, N.D.Jones and I.Mastroeni, Obfuscation by partial evaluation of distorted interpreters, in: Proceedings of the ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation (PEPM’12), O.Kiselyov and S.Thompson, eds, ACM, 2012, pp. 63–72.
22.
R.Giacobazzi and I.Mastroeni, Abstract non-interference: Parameterizing non-interference by abstract interpretation, in: Proceedings of the 31st Annual ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages (POPL ’04), ACM, 2004, pp. 186–197. doi:10.1145/964001.964017.
23.
R.Giacobazzi and I.Mastroeni, Transforming abstract interpretations by abstract interpretation, in: Proceedings of the 15th International Static Analysis Symposium (SAS’08), M.Alpuente, ed., Lecture Notes in Computer Science, Vol. 5079, Springer-Verlag, 2008, pp. 1–17.
24.
R.Giacobazzi and I.Mastroeni, Making abstract interpretation incomplete – Modeling the potency of obfuscation, in: 19th International Static Analysis Symposium (SAS ’12), A.Miné and D.Schmidt, eds, Lecture Notes in Computer Science, Vol. 7460, 2012, pp. 129–145.
25.
R.Giacobazzi, I.Mastroeni and M.Dalla Preda, Maximal incompleteness as obfuscation potency, Formal Aspects of Computing29(1) (2017), 3–31. doi:10.1007/s00165-016-0374-2.
26.
R.Giacobazzi and F.Ranzato, Uniform closures: Order-theoretically reconstructing logic program semantics and abstract domain refinements, Inform. Comput.145(2) (1998), 153–190. doi:10.1006/inco.1998.2724.
27.
R.Giacobazzi, F.Ranzato and F.Scozzari, Making abstract interpretation complete, Journal of the ACM47(2) (2000), 361–416. doi:10.1145/333979.333989.
28.
G.Gierz, K.H.Hofmann, K.Keimel, J.D.Lawson, M.Mislove and D.S.Scott, A Compendium of Continuous Lattices, Springer-Verlag, 1980.
29.
S.Horwitz, T.W.Reps and D.Binkley, Interprocedural slicing using dependence graphs, ACM Trans. Program. Lang. Syst.12(1) (1990), 26–60. doi:10.1145/77606.77608.
N.D.Jones, C.K.Gomard and P.Sestoft, Partial Evaluation and Automatic Program Generation, Prentice-Hall, Inc., 1993.
33.
C.Linn and S.Debray, Obfuscation of executable code to improve resistance to static disassembly, in: CCS ’03: Proceedings of the 10th ACM Conference on Computer and Communications Security, ACM, 2003, pp. 290–299. doi:10.1145/948109.948149.
34.
M.Madou, B.Anckaert, B.De Bus, K.De Bosschere, J.Cappert and B.Preneel, On the effectiveness of source code transformations for binary obfuscation, in: Proceedings of the International Conference on Software Engineering Research and Practice (SERP06), CSREA Press, 2006, pp. 527–533.
35.
A.Majumdar, S.J.Drape and C.D.Thomborson, Slicing obfuscations: Design, correctness, and evaluation, in: DRM ’07: Proceedings of the 2007 ACM Workshop on Digital Rights Management, ACM, 2007, pp. 70–81. doi:10.1145/1314276.1314290.
36.
I.Mastroeni and R.Giacobazzi, Weakening additivity in adjoining closures, Order33(3) (2016), 503–516. doi:10.1007/s11083-015-9381-9.
37.
I.Mastroeni and D.Nikolic, Abstract program slicing: From theory towards an implementation, in: 12th International Conference on Formal Engineering Methods (ICFEM’10), Lecture Notes in Computer Science, Vol. 6447, Springer, 2010, pp. 452–467.
38.
I.Mastroeni and D.Zanardini, Abstract program slicing: An abstract interpretation-based approach to program slicing, arXiv:1605.05104, 2016.
39.
T.Ogiso, Y.Sakabe, M.Soshi and A.Miyaji, Software obfuscation on a theoretical basis and its implementation, IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer SciencesE86-A(1) (2003), 176–186.
40.
I.V.Popov, S.K.Debray and G.R.Andrews, Binary obfuscation using signals, in: Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium, SS’07, USENIX Association, 2007, pp. 19:1–19:16.
41.
M.Weiser, Program slicing, in: ICSE ’81: Proceedings of the 5th International Conference on Software Engineering, IEEE Press, 1981, pp. 439–449.