Abstract
Dynamic taint analysis is often used as a defense against low-integrity data in applications with untrusted user interfaces. An important example is defense against XSS and injection attacks in programs with web interfaces. Data sanitization is commonly used in this context, and can be treated as a precondition for endorsement in a dynamic integrity taint analysis. However, sanitization is often incomplete in practice. We develop a model of dynamic integrity taint analysis for Java that addresses imperfect sanitization with an in-depth approach. To avoid false positives, results of sanitization are endorsed for access control (aka prospective security), but are tracked and logged for auditing and accountability (aka retrospective security).
We show how this heterogeneous prospective/retrospective mechanism can be specified as a uniform policy, separate from code. We then use this policy to establish correctness conditions for a program rewriting algorithm that instruments code for the analysis. These conditions synergize our previous work on the semantics of audit logging with explicit integrity which is an analogue of noninterference for taint analysis. A technical contribution of our work is the extension of explicit integrity to a high-level functional language setting with structured data, vs. previous systems that only address low level languages with unstructured data. Our approach considers endorsement which is crucial to address sanitization. An implementation of our rewriting algorithm is presented that hardens the OpenMRS medical records software system with in-depth taint analysis, along with an empirical evaluation of the overhead imposed by instrumentation. Our results show that this instrumentation is practical.
Introduction
Dynamic taint analysis implements a “direct” or “explicit” information flow analysis to support a variety of security mechanisms [41]. Similar to information flow, taint analysis can be used to support either confidentiality or integrity properties. An important application of integrity taint analysis is to prevent the execution of security sensitive operations on untrusted data, in particular to combat cross-site scripting (XSS) and SQL injection attacks in web applications [31]. Any untrusted user input is marked as tainted, and then taint is tracked and propagated through data flow to ensure that tainted data is not used by security sensitive operations.
Of course, since web applications aim to be interactive, user input is needed for certain security sensitive operations such as database calls. To combat this, sanitization is commonly applied in practice to analyze and possibly modify data. From a taint analysis perspective, sanitization is a precondition for integrity endorsement, i.e. subsequently viewing sanitization results as high integrity data. However, while sanitization is usually endorsed as “perfect” by taint analysis, in fact it is not. Indeed, previous work has identified a number of flaws in existing sanitizers in a variety of applications [31,48], including the exploits to inject commands in web application, and the work here is in fact inspired by discovery of an XSS attack vector in the OpenMRS medical records software systems due to incomplete sanitization, discussed below in Section 1.1. We have demonstrated the practicality of exploiting the discovered XSS vulnerability in OpenMRS. We call such incomplete sanitizers partially trusted or imperfect throughout the paper.
Thus, a main challenge we address is how to mitigate imperfect sanitization in taint analysis. An important feature of our approach is an in-depth [22] security policy, that combines the typical blocking (prospective) behavior of taint-based access control with audit logging (retrospective) features. In the presence of imperfect sanitization, this allows false positives to be avoided, while still providing retrospective security measures via audit logs in case of attacks that leverage this imperfection. We are concerned with both efficiency and correctness – we develop a language model intended to capture the essence of Phosphor [11,12], an existing Java taint analysis system with empirically demonstrated efficiency. To be applicable to legacy systems such as OpenMRS, we propose a program rewriting approach. Our program rewriting algorithm takes as input a heterogeneous (prospective and retrospective) taint analysis policy specification and input code, and instruments the code to support the policy. The policy allows user specification of taint sources, secure sinks, and sanitizers. A distinct feature of our system is that results of sanitization are considered “maybe tainted” data, which are allowed to flow into security sensitive operations but in such cases are entered in a log to support auditing and accountability.
A contribution of our approach is a uniform expression of an in-depth security policy that combines prospective (taint analysis) and retrospective (audit logging) policy features, and proof that our rewriting algorithm enforces this policy. To characterize retrospective correctness we leverage our previous work on the semantics of retrospective security [2]. To characterize prospective correctness, we aim to go deeper than operational definitions [29,41] and characterize correctness as a higher level semantic property, in particular as a hyperproperty [19]. To this end we propose a semantic framework called explicit integrity, that is an extension of explicit secrecy [39] to a high-level (Java) language model with structured data. Explicit integrity is analogous to (integrity) noninterference, but applies to taint analysis that is concerned only with direct aka explicit information flows. Intuitively, in programs that enjoy explicit integrity, low-integrity (tainted) data does not flow directly into high-integrity sinks during program execution. Both explicit secrecy and explicit integrity are defined independently of language-level instrumentation, and (like noninterference) are hyperproperties as formulated in [39] and in this work. Furthermore, we consider the variant explicit integrity modulo endorsement, since endorsement is necessary in the taint analysis to accurately reflect the results of sanitization.
Practical motivations
While our work is based on formal foundations it is inspired by practical concerns, in particular a security flaw we discovered in a previous version (2.4) of OpenMRS originally reported in [3].1
We responsibly disclosed the vulnerabilities we found to the OpenMRS development community, and they have been corrected in current versions.
OpenMRS uses a set of validators to enforce expected data formats by implementation of the
Integrity taint tracking is a well-recognized solution against these sorts of attacks that deals direct information flow analysis (and hence explicit integrity). In our example, using taint analysis the tainted
The security problem we consider is about the integrity of data being passed to security sensitive operations (SSOs) in a direct manner. An important example is a string entered by an untrusted user that is passed to a database method for parsing and execution as a SQL command. The security mechanism should guarantee that low-integrity data cannot be passed to SSOs without previous sanitization.
In contrast to standard information flow which is concerned with both direct (aka explicit) and indirect (aka implicit) flows, taint analysis is only concerned with direct flow. Direct flows transfer data directly between variables, e.g.,
More precisely, we posit that top-level programs
We assume that our program rewriting algorithm is trusted. Input code is trusted to be not malicious, though it may contain errors. We note that this assumption is important for application of taint analysis that disregards indirect flows, since there is confidence that the latter will not be actively exploited by non-malicious code. We assume that untrusted data sources provide low integrity data, though in this work we only consider tainted “static” values, e.g., strings, not tainted code that may be run as part of the main program execution. However, the latter does not preclude hardening against XSS or injection attacks in practice, if we consider an evaluation method to be an SSO.
Overview by example
As a running example for the paper, consider the following code, where we imagine a tainted input string
In order to define the logging policy, we will define an operational (trace) semantics of programs where direct information flow is defined as a property of traces. This property correlates taint labels with values in traces – in the above expression the tainted label ∙ is correlated with
In all cases, correlation of labels with values in structured data is accomplished via “shadows” of expressions, which are shape-conformant with source language expressions and carry taint labels. For example, in the trace of the above expression, the result of sanitization is the following:
Subsequently, we develop a rewriting algorithm called
Technical overview
The technical development of the paper proceeds as follows. In Section 2 we describe a formal semantics of auditing, and the conditions for correctness of audit rewriting algorithms. That is, we define what it means for a program instrumentation to correctly log information. In Section 2.1, we introduce information algebra [27] as the basis of our model for correct audit log generation. We characterize logging specifications and correctness conditions for audit logs in a high-level manner using information algebra, and show how information elements and operations can be instantiated using first order logic.
In Section 3 we develop a source language model based on featherweight Java (Section 3.1), called FJ. We show how to logically specify an in-depth taint analysis policy separately from code in Section 4 via safety property and logging specifications. In Section 3.2 we develop a target language model
While Theorem 4.1 establishes correctness conditions for information in audit logs in an operational sense, Section 5 focuses on the high level security property of dynamic integrity taint analysis that is tailored for direct information flow. In Section 5.4, we show that our enforcement mechanism satisfies the hyperproperty of explicit integrity modulo endorsement (Theorem 5.1). In Section 6 we discuss our implementation of the in-depth taint analysis specification presented in Section 4 for the OpenMRS medical records system. We also describe experiments for empirical evaluation of this implementation and discuss results. In Section 7 we discuss related work and conclude the paper.
For the sake of brevity, we have omitted the proofs of all Lemmas and Theorems. Readers are referred to our accompanying Technical Report [43] for these details in full.
Foundations for in-depth policy specification
In this section we establish formal foundations for a semantics of prospective and retrospective policy features. and the correctness of audit instrumentation. An appeal of our approach is that both safety properties [38] and logging correctness can be formulated, so we are able to uniformly characterize operational correctness conditions for in-depth integrity taint analysis in our framework. This framework was initially developed in previous work [2] where we studied so-called “break-the-glass” policies for medical records software. In that work we justified the generality of our framework and discuss its details at length. Here we reiterate the main technical points of the framework to allow a standalone formal presentation.
We leverage ideas from the theory of information algebra [27,28], which is an abstract mathematical framework for information systems. In short, we interpret program traces as information, and logging specifications as functions from traces to information. This separates logging specifications from their implementation in code, and defines exactly the information that should be in an audit log. This in turn establishes correctness conditions for audit logging implementations.
Introduction to information algebra
Information algebra is an algebraic theory of information where information is seen as a collection of information elements with fundamental aggregation and refinement operations. The algebra consists of two domains, an information domain and a query domain. The information domain Φ is the set of information elements that can be aggregated in order to build more inclusive information elements. The query domain E is a lattice of querying sublanguages in which the partial order relation among these sublanguages represents the granularity of the queries. The information and query domains are left abstract in the general theory – instantiation examples include relational algebra and first order logic as we discuss below. By definition any instantiation must include basic operations for combining information and for focusing on components of information.
Any information algebra Combination Focusing
Using the combination operator we can define a partial order relation on Φ to compare the information contained in the elements of Φ. A partial ordering is induced on Φ by the so-called information ordering relation ⩽, where intuitively for
X is contained in Y, denoted as
We say that X and Y are information equivalent, and write
For a more detailed account of information algebra, the reader is referred to a definitive survey paper [28].
Relational algebra is a well-recognized instance of information algebra. The formulation of relational algebra as an information algebra by Kohlas [28] is an illustrative example of this information theoretic framework. Here we reproduce this example.
Let
Function
Instantiation. Let
A general model for logging specifications
Following [38], an execution trace
We assume a given function
We let
We assume that auditing is implementable, requiring at least that all conditions for logging any piece of information must be met in a finite amount of time. As we will show, this restriction implies that correct logging instrumentation is a safety property [38].
We require of any logging specification
It is crucial to observe that some logging specifications may add information not contained in traces to the auditing process. Security information not relevant to program execution (such as ACLs), interpretation of event data (statistical or otherwise), etc., may be added by the logging specification. For example, in the OpenMRS system [46], logging of sensitive operations includes a human-understandable “type” designation which is not used by any other code. Thus, given a trace τ and logging specification
A logging specification defines what information should be contained in an audit log. In this section we develop formal notions of soundness and completeness as audit log correctness conditions. We use metavariable
An audit log Audit log Audit log
In case program executions generate audit logs, we write
For all logging specifications
We observe that the restriction imposed on logging specifications by Definition 2.4, implies that ideal instrumentation of any logging specification is a safety property in the sense defined by Schneider [38].
For all logging specifications
This result implies that e.g. edit automata can be used to enforce instrumentation of logging specifications [1]. However, theory related to safety properties and their enforcement by execution monitors [9,38] does not provide an adequate semantic foundation for audit log generation, nor an account of soundness and completeness of audit logs.
The above-defined correctness conditions for audit logs provide a foundation on which to establish correctness of logging implementations. Here we consider program rewriting approaches. Since rewriting concerns specific languages, we introduce an abstract notion of programs
A rewriting algorithm
We use metavariable
Ideally, a rewriting algorithm should preserve the semantics of the program it instruments. That is,
Let T be a set of program traces. Rewriting algorithm For all traces For all traces τ such that
In addition to preserving program semantics, a correctly rewritten program constructs a log in accordance with the given logging specification. More precisely, if
The following definitions then establish correctness conditions for rewriting algorithms in conjunction with semantics preservation. Like semantics preservation, we define soundness and completeness with respect to a given set of traces.
Let T be a set of traces. Rewriting algorithm
Let T be a set of traces. Rewriting algorithm
Note also that without semantics preservation, soundness and completeness could be satisfied trivially in case
Logics have been used in several well-developed auditing systems [16,23], for the encoding of both audit logs and queries. FOL in particular is attractive due to readily available implementation support, e.g. Datalog and Prolog. We have shown in previous work that FOL is an information algebra, and useful for e.g. break the glass policy specification [2]. Here we summarize important definitions for the remainder of this paper.
Let Greek letters ϕ and ψ range over FOL formulas and let capital letters X, Y, Z range over sets of formulas. We posit a sound and complete proof theory supporting judgements of the form
Elements of our algebra are sets of formulas closed under logical entailment. Intuitively, given a set of formulas X, the closure of X is the set of formulas that are logically entailed by X, and thus represents all the information contained in X. In spirit, we follow the treatment of sentential logic as an information algebra explored in related foundational work [27], however our definition of closure is syntactic, not semantic.
We define a closure operation C, and a set
Let
Now we can define the focusing and combination operators, which are the fundamental operators of an information algebra. Focusing isolates the component of a closed set of formulas that is in a given sublanguage. Combination closes the union of closed sets of formulas. Intuitively, the focus of a closed set of formulas X to sublanguage L is the refinement of the information in X to the formulas in L. The combination of closed sets of formulas X and Y combines the information of each set.
Define:
Focusing: Combination:
Properties of the algebra ensure that ⩽ is a partial ordering by defining
The following Theorem establishes that the construction is an information algebra – for a complete proof the reader is directed to [1].
Structure
In addition, to interpret traces and logs as elements of this algebra, i.e. to define the function
Formally, we define logging specifications in a logic programming style by using combination and focusing. Any logging specification is parameterized by a sublanguage S that identifies the predicate(s) to be resolved and Horn clauses X that define it/them, and can be defined via the functional
The function
We will formulate a particular example of a logging specification for maybe tainted data in Definition 3.3.
In this section we present a basic object-oriented calculus as the foundation of our language model. We also show how the in-depth integrity taint analysis model described in Section 1.2 can be specified as a logical property of program traces in this model, independent of program instrumentation. This allows us to define retrospective taint analysis as a logging specification in the style introduced in Section 2. Subsequently in Section 4 we will show how this specification can be correctly instrumented via program rewriting into a target language, hence we refer to the language introduced in this section as our source language.
Source language
Our source language model is essentially Featherweight Java (FJ) [26] with minor extensions including base types and an abstract notion of library methods for base types. The latter is important for an adequate consideration of taint propagation (e.g. on strings) in our model. FJ is a functional core calculus that includes class hierarchy definitions, subtyping, dynamic dispatch, and other basic features of Java. An FJ program is an expression
Syntax
The syntax of FJ is defined in Fig. 1. We let

FJ syntax.
Conditional expressions are an important feature of the language for this presentation, since they are a control flow operation that should not be considered in a direct flow analysis. We assume that in any program setting true and false values, denote
For brevity in this syntax, we use vector notations. Specifically we write
The class table
Method type lookup
Just as we’ve defined a function for looking up method bodies in the class table, we also define a function

Object field, method body, and method type lookup.
Now, we can define the operational semantics of FJ. The reduction relation is binary, of the form

Operational semantics for FJ.
The definition of → assumes given a class table
We define top-level programs
Library methods
In order to study dynamic integrity taint analysis in FJ, we extend the semantics for library methods that allow specification of operations on base values (such as strings and integers). Consideration of these features is important for a thorough modeling of Phosphor-style taint analysis, and important related issues such as string- vs. character-based taint [18] which have not been considered in previous formal work on taint analysis [41]. Since static analysis is not a topic of this paper, for brevity we omit the standard FJ type analysis which is described in [26].
The abstract calculus described above is not particularly interesting with respect to direct information flow and integrity propagation, especially since method dispatch and conditional expressions are control flows that are discounted in direct data flow. More interesting is the manner in which taint propagates through base values and library operations, since direct flows propagate through some of these methods. Also, for run-time efficiency and ease of coding some Java taint analysis tools treat even complex library methods as “black boxes” that are instrumented at the top level for efficiency [25], rather than relying on instrumentation of lower-level operations.
Note that treating library methods as “black boxes” introduces a potential for over- and under-tainting – for example in some systems all string library methods that return strings are instrumented to return tainted results if any of the arguments are tainted, regardless of any direct flow from the argument to result [25]. Clearly this strategy introduces a potential for over-taint. Other systems do not propagate taint from strings to their component characters when decomposed [18], which is an example of under-taint. Part of our goal here is to develop an adequate language model to consider these approaches.
We therefore extend our basic definitions to accommodate base values and their manipulation. Let a primitive field be a field containing a base value. We call a base type any class with primitive fields only, and a library method is any method that operates on base type objects, defined in a primitive class. We expect primitive objects to be object wrappers for primitive values (e.g.,
We posit a special set of field names
We define the meaning of operations
A boolean class
In-depth integrity analysis specified logically
In this section, we demonstrate how in-depth integrity taint analysis for FJ can be expressed as a single uniform policy separate from code. To accomplish this we interpret program traces as information represented by a logical fact base in the style of Datalog. We then define a predicate called
Java-based taint analyses naturally tend to be object-based, i.e. low-integrity values are objects conceptually, and objects have an assigned taint level in the implementation. The types of tainted objects vary depending on the analysis, but most emphasize taint of base values. We will likewise focus on taint of base values, though we will support taint labeling of all objects. This is partly to generalize the representation, but also for formal convenience – In our logical specification of taint analysis, a shadow expression has a syntactic structure that matches the configuration expression, and associates integrity levels (including “high” ∘ and “low” ∙) with particular objects via shape conformance.
Suppose a method
On the basis of shadow expressions that correctly track integrity, we can logically specify prospective taint analysis as a property of shadowed trace information, and retrospective taint analysis as a function of shadowed trace information. An extended example of a shadowed trace is presented in Section 3.2.4.
In order to specify taint tracking, we define the mapping

Interpreting expressions as formulas via
We define
Integrity identifiers. We introduce an integrity identifier t that denotes the integrity level associated with objects. To support a notion of “partial endorsement” for partially trusted sanitizers, we define three taint labels, to denote high integrity (∘), low integrity (∙), and uncertain integrity (⊙). We refer to these levels as tainted, untainted, and maybe tainted, respectively.
We posit the usual meet ∧ and join ∨ operations on taint lattice elements, and introduce logical predicates
Shadow traces reflect taint information of objects as they are passed around programs. Shadow traces are comprised of shadow expressions and contexts which are terms in the logic with the following syntax. Note the structural conformance with closed
Shadow method bodies are defined by the function
The predicate


Next, in Fig. 6, we define a predicate
Some notational liberties are taken in Fig. 6 regarding expression and context substitutions, which are defined using predicates elided for brevity.
With respect to control flow, the most notable rules of
Taint propagation and endorsement. The propagation of taint in the model described in Section 1.2 is embedded in the definition of
Now we can logically specify an in-depth policy for integrity taint analysis, as proposed originally in Section 1.2. In particular we assume a set
The in-depth policy has both prospective and retrospective component – the former is defined as a safety property [38], while the latter is defined as a logging specification. The prospective component of the policy must identify traces where a tainted value is passed to a secure method. To this end, in Fig. 7 we define the predicate
Let X be the set of rules in Figs 5, 6, and 7 and the set of user-defined rules for
This latter condition is not necessary for the specification, and may seem extraneous, but it is in place to allow a clean proof correspondence with the implementation as detailed in Section 4.2. In the implementation, taint checks occur one execution step after an SSO is called, which in the case of an unsafe call will thus occur one step after the
We immediately observe that
Finally we define a program as being safe iff it does not produce a bad trace.
We call a program

Predicates for specifying prospective and retrospective properties.
To illustrate the major points of our construction for source program traces and their shadows, we consider an example of program that contains an
Assume that sanitizer and SSO methods

Example 3.2: source trace.

Example 3.2: shadow expressions.
Now we define an object-based dynamic integrity taint analysis in a more familiar operational style. Taint analysis instrumentation is added automatically by a program rewriting algorithm
In Section 4.2 we follow the methods developed in Section 2 and show that
In-depth taint analysis instrumentation
The target language
The
algorithm
Now we define the program rewriting algorithm
For any expression
As discussed in Section 1, sanitization is typically taken to be “ideal” for integrity flow analyses, however in practice sanitization is imperfect, which creates an attack vector. To support retrospective measures specified in Definition 3.3, we define
Another important element of taint analysis is instrumentation of library methods that propagate taint – the propagation must be made explicit to reflect the interference of arguments with results. The approach to this in taint analysis systems is often motivated by efficiency as much as correctness [25]. We assume that library methods are instrumented to propagate taint as intended (i.e. in accordance with the user defined predicate
Here is how addition, string concatenation, and equality test, can be modified to propagate taint. Note the taint of arguments will be propagated to results by taking the meet of argument taint, thus reflecting the degree of integrity corruption:

Axioms for rewriting algorithm.
The operational semantics of
Audit logs

Operational semantics of
We write

Example 3.2: target trace.
Revisiting the example introduced in Section 3.2.4, we show execution of the rewritten program
Now we can leverage machinery developed previously to demonstrate in-depth operational correctness of
Recalling our definitions of semantics preservation, soundness, and completeness from Section 2, we state our main results as follows. These results tie together our relevant logging specification
Regarding our main result for retrospective security, we note that our definition is general with respect to
For all
Since the safety property
An
Operational correctness of the prospective component of
The FJ program
The semantics of information flow has been well studied and is typically characterized via noninterference properties, but surprisingly little work has been done to develop similar properties for taint analysis. In recent years it has been shown that direct flow of data confidentiality is not comparable with noninterference [39], i.e., there are both noninterfering programs with direct leakage of secret data to public domain, and programs without such direct leakages, but interfering. For instance consider the following two statements in a core imperative language, in which s and p are respectively secret and public variables:
Formal definitions of taint analysis implementations do exist, but they are usually operational in nature. For example, in Section 4.2, we have established an operational correctness result for the prospective enforcement of direct integrity flow. In this section, we propose a hyperproperty to characterize the security property enforced by integrity taint analysis techniques. This hyperproperty is defined in a general, language-agnostic way, though in this section we also show that the instrumentation of
Direct integrity flow semantics: Explicit integrity
We define explicit integrity as a semantic hyperproperty that builds on (dualizes) the notions of explicit secrecy [39] and attacker power [5]. Similar to explicit secrecy, explicit integrity is language-agnostic. In later sections, we discuss instantiation of this model for
Intuitively, a program enjoys explicit secrecy if execution of its state transformation components does not affect the knowledge of a low confidentiality user. By formally specifying state transformation components, control flow operations (such as conditional expressions) can be omitted to only consider direct aka explicit program flows. Knowledge [6] is defined as the set of initial states configurable by a low confidentiality user that generate a particular sequence of observables – the smaller the set, the greater the knowledge. Explicit knowledge [39] restricts this concept to direct program data flow. In this section, we demonstrate how explicit knowledge can be “dualized” for direct integrity flow analysis and applied as a semantic framework for dynamic integrity taint analysis tools, particularly in functional languages with hierarchical data structures (
Attacker power [5] is introduced as a counterpart to attacker knowledge in the context of integrity, as the set of low integrity inputs that generate the same sequence of high integrity events. Each high integrity event could be a simple assignment to a predefined high integrity variable, a method that manipulates trusted data (secure sinks), etc. according to the language model. The more refined the attacker power is, the more powerful the low integrity attacker becomes, as she becomes more capable to distinguish between the effects of different attacks on high integrity data.
We define explicit attacker power as the attacker power constrained on direct integrity flows. Then, explicit integrity is defined as the property of preserving explicit attacker power during program execution. In order to limit flows to direct ones, we have followed the techniques introduced in [39] to define state transformers. State transformers extract direct flows semantically by specifying the ways in which program state is modified in each step of execution, along with direct-flow events that are generated.
Model specification
We formulate our explicit integrity semantics following [39]. We first define the interface for our framework. Let
Each configuration is considered to include two segments: control (code) and state (data). These segments are not necessarily disjoint and could overlap in some language models. In this regard, let mappings
We assume the existence of an entry point
We define extracted state transformers as follows. A consideration of state transformation, rather than complete program execution, allows us to focus only on direct program flow, rather than indirect control flow e.g. via conditionals. State transformers play the same role that explicit flow statements do in Weak Secrecy [47]. We note that this definition is a slight refinement of the analogous definition in [39] – in their work, a command is assumed to be compatible with all states, whereas we require compatibility of commands and states. This refinement is necessary due to structured expressions in HLLs such as Java, vs. lower level languages. However, we add a completeness condition expressed in Definition 5.3 that ensures we can compare all trust equivalent states via state transformation functions.
Let
We now define the power an attacker obtains by observing high integrity events. We capture this by defining a set of high integrity equivalent states that generate the same sequence of high integrity events. We posit the binary relation
We define explicit attacker power with respect to state
All state transformers must be complete in the following sense for this definition to be coherent:
A state transformer f is complete iff for all
A control then satisfies explicit integrity for some state iff no state can be excluded from observing the high integrity events generated by the extracted state transformer.
A control
We can now consider explicit integrity in the presence of endorsement in the style of gradual release [6]. We assume that there exists a set of integrity events A control
In this section, we instantiate explicit integrity for
First, we define the required interface specified in Section 5.1, beginning with the definition of extracted state transformers for all features. These are extracted from the definition of → – notably, the extracted state transformers for conditional expressions inline conditional branching, disregarding the actual
The state transformers for
Our treatment of library methods,

Fundamental state transformers extracted from

Definition of
Next, we define
We define
These definitions clearly satisfy the model requirements.
For any
Trust equivalence and state transformation. Now we define trust equivalence
Also, since endorsement may allow trust equivalent states to transform into non-structural equivalence, to satisfy the completeness requirement of Definition 5.3 we need to show that transformation preserves a weaker structural conformance relation
The trust equivalence

Definition of

Definition of trust equivalence and shape conformance relations on expressions.
We define two sanity conditions for library methods: not undertainting and not overtainting. The former condition is required in the implementation in order to meet explicit integrity modulo endorsement, whereas the latter is a good practice in the implementation of taint analysis tools. Hereafter we will assume that library methods are not undertainting.
We say
For example,
Not overtainting refines the precision of taint tracking with respect to a given state. Intuitively, a library method that only directly depends on its high integrity inputs is not overtainting if its results are untainted. We say
Assume given a class table
In contrast, the state transformer associated with the actual execution of
Since f and
Enforcement of explicit integrity modulo endorsement by
Our general strategy is to show that non-endorsement events do not change attacker power as required by the definition of explicit integrity modulo endorsement. The complete proof details are given in our Technical Report [43].
If
(Sketch) Proof by contradiction. If it does not enjoy explicit integrity modulo endorsement, then explicit attacker power is refined, i.e., different integrity events can be generated starting from trust equivalent states. This contradicts with an intermediary result reflecting on the preservation of integrity events by state transformers being applied on trust equivalent states. □
In Section 1.1 we discussed an XSS vulnerability in the OpenMRS system (corrected in the current version) that inspired our interest in an in-depth taint analysis to better track data flow into secure operations and to enforce some level of sanitization. To explore and evaluate our proposed methods in practice, we have developed an automated analysis for OpenMRS by direct modification of the Phosphor system [11]. Our modification supports dynamic integrity taint analysis both prospectively and retrospectively. Our implementation is based on the formal model developed in previous sections, which enjoys a correctness guarantee. In this section we describe our implementation and our evaluation of it.
Modifications to phosphor
Out of the box, Phosphor provides a binary taint labeling scheme, with no support for endorsement. Users specify their security policy by identifying high integrity sinks, which are then automatically instrumented at the bytecode level with checks for low integrity inputs, by a combination of program rewriting and runtime mechanisms. Thus, to implement our in-depth taint analysis specification we needed to generalize the taint labeling scheme, add an endorsement mechanism, and add support for audit logging to the existing Phosphor codebase. This yielded our
Phosphor distinguishes only between two types of data – tainted and untainted. To support a generalized labeling scheme, in
As for ordinary Phosphor, in
We also allow specification of a set of sanitizers in the same manner as secure sink specification, i.e. specific methods are identified in an initial configuration file provided when rewriting a program. These methods will have return values

To apply
The lists of sources, sinks, and sanitizers we identified in OpenMRS are provided in our implementation on GitHub [44]. Our method for identifying sinks and sanitizers was to leverage our knowledge of the Hibernate API. Specifically, to identify sinks, we searched the OpenMRS codebase for methods that employ Hibernate database write functionality. To identify sanitizers, we searched the OpenMRS codebase for methods that employ Hibernate sanitization functionality. The list of sources was determined by searching for methods that use
Another subtle but important detail of our integration of
Implementation evaluation
To evaluate our instrumentation of OpenMRS with
Phosphor initialization overhead. When instrumenting Java classes with Phosphor, one can either run the software manually, specifying all of the source files to the program, or Phosphor can automatically detect and instrument uninstrumented code as the program runs. The latter option was chosen for this project due to the nature of OpenMRS and the onerous overhead of manual instrumentation.
As a consequence of instrumenting Java classes dynamically, an instrumentation overhead occurs as new uninstrumented source code is discovered. Thus, an initial run through a Phosphor-modified OpenMRS would be slower than consecutive runs, which was indeed observed by testing – the initial run of a particular method typically took about twice as long as subsequent runs.

Phosphor instrumentation memory overhead. The x-axis denotes the fraction completion of a test run, and the y-axis denotes memory used in MB.
Since our main concern in evaluation was to compare the overhead of instrumented vs. unmodified OpenMRS, and initialization overhead is arguably amortized to insignificance over long use sessions, the results we report here are only for pre-initialized testing runs. However we do note this additional overhead with the use of Phosphor’s dynamic instrumentation feature.
Actions and page loads. The OpenMRS system has a web-based user interface, so we can partition its functionality into two major categories. The first is called action functionality, which results from submitting a form. Since form submission introduces tainted data that is potentially sanitized and destined for a secure sink, we can expect actions to incur overhead in instrumented OpenMRS and so are clearly important to consider.
OpenMRS also offers the potential for page loads, where users navigate between pages in the system by clicking links (but not submitting data). Since the underlying code is also instrumented in these situations, and continues to track taint, some overhead can also be observed. Thus we also evaluated overhead associated with page loads.
To evaluate
An initial concern of our evaluation was determining whether the system worked correctly, and whether data reaching sinks was maybe tainted, indicating sanitization, as well as being logged properly. We confirmed this, and did not discover instances of unsanitized data reaching sinks. Subsequently, we considered timing and memory consumption.
Timing. Our timing results are summarized in Table 1. Here we show the average time to complete each action and page load for the unmodified OpenMRS baseline, as well as OpenMRS instrumented with our implementation of
Average timing and overhead for unmodified OpenMRS (baseline), versus instrumented with Phosphor and with
Average timing and overhead for unmodified OpenMRS (baseline), versus instrumented with Phosphor and with
Figure 17 shows more detailed results, comparing times for the OpenMRS baseline and the
Memory. Figure 18 shows baseline memory consumption during test runs of unmodified OpenMRS, versus pre-initialized OpenMRS instrumented with
Some of the results in this paper were discussed in a preliminary manuscript [3], but the current work provides a fully developed metatheory, a formulation of the high-level security policy enforced by our system (explicit integrity modulo endorsement), and a complete implementation and empirical evaluation.
Taint analysis is an established solution to enforce confidentiality and integrity policies through direct data flow control. Various systems have been proposed for both low and high level languages. Our policy language and semantics are based on a well-developed formal foundation, where we interpret Horn clause logic as an instance of information algebra [28] in order to specify and interpret retrospective policies. The work presented in this paper supersedes a previous presentation [3] – in the current paper we extend our language model, provide more rigorous proofs of correctness of policy enforcement, consider the hyperproperty of taint analysis in a model of Java, and report on a prototype implementation.
Schwartz et al. [41] define a general model for runtime enforcement of policies using taint tracking for an intermediate language. In Livshits et al. [29], taint analysis is expressed as part of operational semantics, similar to Schwartz et al., and a taxonomy of taint tracking is defined. Livshits et al. [31] propose a solution for a range of vulnerabilities regarding Java-based web applications, including SQL injections, XSS attacks and parameter tampering, and formalize taint propagation including sanitization. The work uses PQL [32] to specify vulnerabilities. However, these works are focused on operational definitions of taint analysis for imperative languages. In contrast we have developed a logical specification of taint analysis for a functional OO language model that is separate from code, and is used to establish correctness of an implementation. Our work also comprises a unique retrospective component to protect against incomplete input sanitization. According to earlier studies [31,48], incomplete input sanitization makes a variety of applications susceptible to injection attacks.
Another related line of work is focused on the optimization of integrity taint tracking deployment in web-based applications. Sekar [42] proposes a taint tracking mechanism to mitigate injection attacks in web applications. The work focuses on input/output behavior of the application, and proposes a lower-overhead, language-independent and non-intrusive technique that can be deployed to track taint information for web applications by blackbox taint analysis with syntax-aware policies. In our work, however, we propose a deep instrumentation technique to enforce taint propagation in a layered in-depth fashion. Wei et al. [49] attempt to lower the memory overhead of TaintDroid taint tracker [21] for Android applications. The granularity of taint tracking places a significant role in the memory overhead. To this end, TaintDroid trades taint precision for better overhead, e.g., by having a single taint label for an array of elements. Our work reflects a more straightforward object-level taint approach in line with existing Java approaches.
Saxena et al. [37] employ static techniques to optimize dynamic taint tracking done by binary instrumentation, through the analysis of registers and stack frames. They observe that it is common for multiple local memory locations and registers to have the same taint value. A single taint tag is used for all such locations. A shadow stack is employed to retain the taint of objects in the stack. Cheng et al. [17] also study the solutions for taint tracking overhead for binary instrumentation. They propose a byte to byte mapping between the main and shadow memory that keeps taint information. Bosman et al. [15] propose a new emulator architecture for the x86 architecture from scratch with the sole purpose of minimizing the instructions needed to propagate taint. Similar to Cheng et al. [17], they use shadow memory to keep taint information, with a fixed offset from user memory space. Zhu et al. [50] track taint for confidentiality and privacy purposes. In case a sensitive input is leaked, the event is either logged, prohibited or replaced by some random value. We have modeled a similar technique for an OO language, through high level logical specification of shadow objects, so that each step of computation is simulated for the corresponding shadow expressions.
Particularly for Java, Chin et al. [18] propose taint tracking of Java web applications in order to prohibit injection attacks. To this end, they focus on strings as user inputs, and analyze the taint in character level. For each string, a separate taint tag is associated with each character of the string, indicating whether that character was derived from untrusted input. The instrumentation is only done on the string-related library classes to record taint information, and methods are modified in order to propagate taint information. Haldar et al. [25] propose an object-level tainting mechanism for Java strings. They study the same classes as the ones in Chin et al. [18], and instrument all methods in these classes that have some string parameters and return a string. Then, the returned value of an instrumented method is tainted if at least one of the argument strings is tainted. However, in contrast to our work, only strings are endowed with integrity information, whereas all values are assigned integrity labels in our approach. Recently Bodei et al. [14] have proposed a static enforcement mechanism for taint analysis in IoT devices which predicts the propagation of taint in the system according to the flow of control. These previous works lack retrospective features.
Recent work has also considered static analysis for ensuring proper context-based sanitization of user input data to defend against XSS attacks, in the JSPChecker system [45]. While this work refines what is meant by “correct” sanitization, it relies on static analysis and thus introduces false positives. In contrast, we propose a runtime tool that marks data generated by imperfect sanitizers for postfacto analysis. Our work is more general in the sense that it can be used for any category of integrity data flow vulnerabilities including XSS.
Phosphor [11 ,12] is an attempt to apply taint tracking more generally in Java, to any primitive type and object class. Phosphor instruments the application and libraries at bytecode level based on a given list of taint source and sink methods. Input sanitizers with endorsement are not directly supported, however. As Phosphor avoids any modifications to the JVM, the instrumented code is still portable. Our work is an attempt to formalize Phosphor in FJ extended with input sanitization and in-depth enforcement. Our larger goal is to develop an implementation of in-depth dynamic integrity analysis for Java by leveraging the existing Phosphor system.
Secure information flow [20] and its interpretation as the well-known hyperproperty [19] of noninterference [24] is challenging to implement in practical settings [36] due to implicit flows. Taint analysis is thus an established solution to enforce confidentiality and integrity policies since it tracks only direct data flow control. Various systems have been proposed for both low and high level languages. The majority of previous work, however, has been focused on taint analysis policy specification and enforcement (e.g., [29,30,41,49]), rather than capturing the essence of direct information flow which could provide an underlying framework to study numerous taint analysis tools.
Knowledge-based semantics has been introduced by Askarov et al. [6] as a general model for information flow of confidential data, concentrated on cryptographic computations and key release (declassification [35]) and later employed in other data secrecy analyses [4,5,7]. Schoepe et al. [39] have proposed the semantic notion of correctness for taint tracking that enforces confidentiality policies of direct information flow, called explicit secrecy. To this end, they propose a knowledge-based semantics, influenced by Volpano’s weak secrecy [47] and gradual release [6]. Explicit secrecy is defined as a property of a program, where the program execution does not change the explicit knowledge of public user. The authors show that noninterference is not comparable to explicit secrecy. However, rather than restricting the discussion to direct information flow in a low level language, we model a high level OO language with a functional flavor to represent generality of our framework.
Schoepe et al. [40] have recently employed explicit secrecy to study correctness results for dynamic confidentiality taint analysis in a core imperative setting with pointers and I/O, and deployed a Java-based tool, called DroidFace. A recent framework by Balliu et al. [8] attempts to bring together the general information flow and direct flow analyses using a security condition that models indirect flows which are observable by a low confidentiality user.
A counterpart for attacker knowledge in the realm of general flow of information integrity, called attacker power [5], is introduced as the set of low integrity inputs that generate the same observables. In this regard, Askarov et al. [5] use holes in the syntax of program code for injection points, influenced by [33]. However, their attack model is different as the low integrity and low confidentiality user is able to inject program code in the main program, by which she could gain more knowledge. We have tailored attacker power for explicit flows using state transformers, in order to interpret integrity taint analysis.
Birgisson et al. [13] give a unified framework to capture different flavors of integrity, in particular integrity via information flow and via different types of invariance. Similar to other works in this line, they give a simple imperative language with labeled operational semantics in order to enforce integrity policies through communication with a monitor. In contrast, we use program rewriting techniques to enforce policies regarding flow of data integrity, which are applicable to legacy systems.
In addition to formal properties of direct information flow, our formulation of correctness conditions also considers a formalization of audit logging based on our previous work [2], which considered a safety property unrelated to taint analysis. Other authors have recently considered formal characterizations of auditing based on logics of justification [10,34]. In contrast, we consider a specific security application of auditing in combination with taint analysis where audit logs are “extralinguistic” vestiges of program computation, whereas these related works consider programs that are able to reflect on their own audit trails, which is a distinct theoretical problem.
Conclusion
In this paper we considered integrity taint analysis in a pure object-oriented language model. Our security model accounts for sanitization methods that may be incomplete, a known problem in practice and one inspired by our study of the OpenMRS medical records software system. We proposed an in-depth security mechanism based on combining prospective measures (to support access control) and retrospective measures (to support auditing and accountability) that address incomplete sanitization. More precisely, we propose treating the results of sanitization as “partially” endorsed, or “maybe tainted”, and allow maybe tainted values to be used in security sensitive operations but record such events in the audit log.
We developed a uniform security policy of dynamic integrity taint analysis that specifies both prospective and retrospective measures, separate from code. The specification is defined in terms of a logical interpretation of program traces and leverages techniques from information algebra, allowing prospective and retrospective measures to be characterized in a uniform and integrated manner. Since the specification is defined separate from code, we use it to establish provable correctness conditions for a rewriting algorithm that instruments in-depth integrity taint analysis. A rewriting approach supports development of tools that can be applied to legacy code without modifying language implementations.
Although our specification of dynamic integrity taint analysis with endorsement establishes correctness conditions for implementations, it is still operational in nature. We therefore developed the hyperproperty of explicit integrity modulo endorsement to characterize the security property of integrity taint analysis in a non-operational manner. It is important to note that this formulation was not simply the dualization of previous formulations of explicit secrecy [39], since these formulations address only low-level code with unstructured data. We subsequently demonstrated that the image of our rewriting algorithm enjoys this security property.
Since our broader goal is to support well-founded practical tools for hardening software, we developed an instrumented version of OpenMRS that integrates our in-depth taint analysis formally specified in our model. Results from our evaluation of this implementation suggest that it is correct and practically feasible. We have made the implementation available on a public GitHub repository [44].
Footnotes
Acknowledgments
This work was supported by the National Science Foundation under Grant No. 1408801. Thanks to Ramy Koudsi and Adam Barson for their work on the implementation of
