Maybe tainted data: Theory and a case study

Abstract

Dynamic taint analysis is often used as a defense against low-integrity data in applications with untrusted user interfaces. An important example is defense against XSS and injection attacks in programs with web interfaces. Data sanitization is commonly used in this context, and can be treated as a precondition for endorsement in a dynamic integrity taint analysis. However, sanitization is often incomplete in practice. We develop a model of dynamic integrity taint analysis for Java that addresses imperfect sanitization with an in-depth approach. To avoid false positives, results of sanitization are endorsed for access control (aka prospective security), but are tracked and logged for auditing and accountability (aka retrospective security).

We show how this heterogeneous prospective/retrospective mechanism can be specified as a uniform policy, separate from code. We then use this policy to establish correctness conditions for a program rewriting algorithm that instruments code for the analysis. These conditions synergize our previous work on the semantics of audit logging with explicit integrity which is an analogue of noninterference for taint analysis. A technical contribution of our work is the extension of explicit integrity to a high-level functional language setting with structured data, vs. previous systems that only address low level languages with unstructured data. Our approach considers endorsement which is crucial to address sanitization. An implementation of our rewriting algorithm is presented that hardens the OpenMRS medical records software system with in-depth taint analysis, along with an empirical evaluation of the overhead imposed by instrumentation. Our results show that this instrumentation is practical.

Keywords

Auditing, dynamic taint analysis, program rewriting

1. Introduction

Dynamic taint analysis implements a “direct” or “explicit” information flow analysis to support a variety of security mechanisms [41]. Similar to information flow, taint analysis can be used to support either confidentiality or integrity properties. An important application of integrity taint analysis is to prevent the execution of security sensitive operations on untrusted data, in particular to combat cross-site scripting (XSS) and SQL injection attacks in web applications [31]. Any untrusted user input is marked as tainted, and then taint is tracked and propagated through data flow to ensure that tainted data is not used by security sensitive operations.

Of course, since web applications aim to be interactive, user input is needed for certain security sensitive operations such as database calls. To accommodate this, sanitization is commonly applied in practice to analyze and possibly modify data. From a taint analysis perspective, sanitization is a precondition for integrity endorsement, i.e. subsequently viewing sanitization results as high integrity data. However, while sanitization is usually endorsed as “perfect” by taint analysis, in fact it is not. Indeed, previous work has identified a number of flaws in existing sanitizers in a variety of applications [31,48], including exploits that inject commands into web applications. The work here is in fact inspired by our discovery of an XSS attack vector in the OpenMRS medical records software system due to incomplete sanitization, discussed below in Section 1.1, and we have demonstrated the practicality of exploiting this vulnerability. We call such incomplete sanitizers partially trusted or imperfect throughout the paper.

Thus, a main challenge we address is how to mitigate imperfect sanitization in taint analysis. An important feature of our approach is an in-depth [22] security policy, that combines the typical blocking (prospective) behavior of taint-based access control with audit logging (retrospective) features. In the presence of imperfect sanitization, this allows false positives to be avoided, while still providing retrospective security measures via audit logs in case of attacks that leverage this imperfection. We are concerned with both efficiency and correctness – we develop a language model intended to capture the essence of Phosphor [11,12], an existing Java taint analysis system with empirically demonstrated efficiency. To be applicable to legacy systems such as OpenMRS, we propose a program rewriting approach. Our program rewriting algorithm takes as input a heterogeneous (prospective and retrospective) taint analysis policy specification and input code, and instruments the code to support the policy. The policy allows user specification of taint sources, secure sinks, and sanitizers. A distinct feature of our system is that results of sanitization are considered “maybe tainted” data, which are allowed to flow into security sensitive operations but in such cases are entered in a log to support auditing and accountability.

A contribution of our approach is a uniform expression of an in-depth security policy that combines prospective (taint analysis) and retrospective (audit logging) policy features, and proof that our rewriting algorithm enforces this policy. To characterize retrospective correctness we leverage our previous work on the semantics of retrospective security [2]. To characterize prospective correctness, we aim to go deeper than operational definitions [29,41] and characterize correctness as a higher level semantic property, in particular as a hyperproperty [19]. To this end we propose a semantic framework called explicit integrity, that is an extension of explicit secrecy [39] to a high-level (Java) language model with structured data. Explicit integrity is analogous to (integrity) noninterference, but applies to taint analysis that is concerned only with direct aka explicit information flows. Intuitively, in programs that enjoy explicit integrity, low-integrity (tainted) data does not flow directly into high-integrity sinks during program execution. Both explicit secrecy and explicit integrity are defined independently of language-level instrumentation, and (like noninterference) are hyperproperties as formulated in [39] and in this work. Furthermore, we consider the variant explicit integrity modulo endorsement, since endorsement is necessary in the taint analysis to accurately reflect the results of sanitization.

1.1. Practical motivations

While our work is based on formal foundations it is inspired by practical concerns, in particular a security flaw we discovered in a previous version (2.4) of OpenMRS, originally reported in [3].¹

¹ We responsibly disclosed the vulnerabilities we found to the OpenMRS development community, and they have been corrected in current versions.

This flaw allows an attacker to launch persistent XSS attacks. Such attacks become possible when web-based software receives and stores user input without proper sanitization, and later retrieves this information for (other) users.

OpenMRS uses a set of validators to enforce expected data formats by implementation of the $Validator$ interface (e.g., $PersonNameValidator$ , $VisitTypeValidator$ , etc.). For some of these classes the implementation is strict enough to reject script tags by requiring data to match a particular regular expression, e.g., $PersonNameValidator$ . However, $VisitTypeValidator$ lacks such a restriction: it only checks that object fields are not null, empty or whitespace, and that their lengths are correct. Thus the corresponding webpage that receives user inputs to construct $VisitType$ objects (named $VisitTypeForm . jsp$ ) is generally not able to perform proper sanitization through the invocation of the validator implemented by $VisitTypeValidator$ . A $VisitType$ object is then stored in the MySQL database, and could be retrieved later based on user request. For instance, $VisitTypeList . jsp$ queries the database for all defined $VisitType$ objects, and sends $VisitType$ names and descriptions to the client side. Therefore, an attacker can easily inject scripts as part of a $VisitType$ name and/or description; the constructed object would be stored in the database and possibly retrieved at a later stage and executed in the victim’s client environment.
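
The gap between the two kinds of validator can be illustrated with a minimal Java sketch. The classes below are hypothetical stand-ins written for this illustration, not OpenMRS code: `StrictNameCheck` mimics a regular-expression check in the spirit of $PersonNameValidator$, while `LenientFieldCheck` mimics the null, blank and length checks described for $VisitTypeValidator$. The pattern and the length bound are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a strict, regex-based validator.
final class StrictNameCheck {
    // Assumed pattern: letters, spaces, apostrophes and hyphens only.
    private static final Pattern ALLOWED = Pattern.compile("[\\p{L} '\\-]{1,50}");

    static boolean accepts(String s) {
        return s != null && ALLOWED.matcher(s).matches();
    }
}

// Hypothetical stand-in for a validator that only checks presence and length.
final class LenientFieldCheck {
    private static final int MAX_LEN = 255; // assumed bound

    static boolean accepts(String s) {
        return s != null && !s.trim().isEmpty() && s.length() <= MAX_LEN;
    }
}

final class ValidatorGapDemo {
    public static void main(String[] args) {
        String payload = "<script>alert(1)</script>";
        System.out.println(StrictNameCheck.accepts(payload));   // false: '<' is not allowed
        System.out.println(LenientFieldCheck.accepts(payload)); // true: non-blank and short enough
    }
}
```

Under these assumptions the script payload is rejected by the strict check but passes the lenient one, which is the shape of the flaw described above.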

Integrity taint tracking is a well-recognized solution against these sorts of attacks that deals with direct information flow analysis (and hence explicit integrity). In our example, using taint analysis the tainted $VisitType$ object would be prevented from retrieval and execution. The addition of sanitization methods would also be an obvious step, and commensurate with an integrity taint analysis approach – sanitized objects would be endorsed for the purposes of prospective security. However, many attack scenarios demonstrate degradation of taint tracking effectiveness due to unsound or incomplete input sanitization [31,48]. Hence our introduction of “maybe tainted” data, which is allowed to flow into security sensitive operations but in such cases is entered in a log to support auditing and accountability.

1.2. The security and threat model

The security problem we consider is about the integrity of data being passed to security sensitive operations (SSOs) in a direct manner. An important example is a string entered by an untrusted user that is passed to a database method for parsing and execution as a SQL command. The security mechanism should guarantee that low-integrity data cannot be passed to SSOs without previous sanitization.

In contrast to standard information flow which is concerned with both direct (aka explicit) and indirect (aka implicit) flows, taint analysis is only concerned with direct flow. Direct flows transfer data directly between variables, e.g., $n_{1}$ and $n_{2}$ directly affect the result of $n_{1} + n_{2}$ . Indirect flows are realized when data can affect the result of code dispatch – the standard example is a conditional expression $if v then e_{1} else e_{2}$ where the data v indirectly affects the valuation of the expression by guarding dispatch.
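
The distinction can be made concrete with a small Java sketch. `LabeledInt` and `FlowKindsDemo` are hypothetical names introduced only for this illustration; they are not part of the formal model developed later. The sketch assumes a single boolean taint bit per value.

```java
// Hypothetical labeled integer: a value paired with a taint bit.
final class LabeledInt {
    final int value;
    final boolean tainted;

    LabeledInt(int value, boolean tainted) {
        this.value = value;
        this.tainted = tainted;
    }

    // Direct flow: both operands feed the result, so their labels are joined.
    LabeledInt plus(LabeledInt other) {
        return new LabeledInt(value + other.value, tainted || other.tainted);
    }
}

final class FlowKindsDemo {
    // Indirect flow: v only guards which branch is taken. A tracker that
    // follows direct flows alone returns the chosen branch with its own label.
    static LabeledInt choose(LabeledInt v, LabeledInt e1, LabeledInt e2) {
        return v.value != 0 ? e1 : e2;
    }

    public static void main(String[] args) {
        LabeledInt n1 = new LabeledInt(1, true);   // tainted input
        LabeledInt n2 = new LabeledInt(2, false);  // trusted constant
        System.out.println(n1.plus(n2).tainted);   // true: the direct flow is tracked
        LabeledInt r = choose(n1, new LabeledInt(10, false), new LabeledInt(20, false));
        System.out.println(r.tainted);             // false: the guard's taint is not propagated
    }
}
```

The second result is untainted even though its value depends on the tainted guard, which is exactly the class of flows that taint analysis deliberately ignores.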

More precisely, we posit that top-level programs $p$ in this security setting are parameterized by a low integrity data source a, and an arbitrary number of secure sinks, aka security sensitive operations (SSOs), and sanitizers which are specified externally to the program by a security administrator. For simplicity we assume that SSOs are unary operations over primitive objects, so there is no question about which argument may be tainted. Since we define a Java based model, each SSO or sanitizer is identified as a specific method $m$ in a class $C$ . That is, there exists a set of $Sanitizers$ containing class, method pairs $C . m$ which are assumed to return high-integrity data, though they may be passed low-integrity data. Likewise, there exists a set $SSOs$ of the same form. As a sanity condition we require $SSOs \cap Sanitizers = \emptyset$ . For simplicity of our formal presentation we assume that only one tainted source will exist. Explicit integrity, as a high-level property, is instantiated for this model.
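
As a rough illustration of how such an externally specified policy might be represented, the following Java sketch stores $Sanitizers$ and $SSOs$ as sets of `Class.method` strings and enforces the disjointness sanity condition at construction time. The class name `TaintPolicySketch` and the string encoding of $C . m$ pairs are assumptions made for this example only.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical policy object; entries are "Class.method" strings.
final class TaintPolicySketch {
    private final Set<String> sanitizers;
    private final Set<String> ssos;

    TaintPolicySketch(Set<String> sanitizers, Set<String> ssos) {
        // Sanity condition from the text: SSOs and Sanitizers must not overlap.
        if (!Collections.disjoint(sanitizers, ssos)) {
            throw new IllegalArgumentException("SSOs and Sanitizers must be disjoint");
        }
        this.sanitizers = Set.copyOf(sanitizers);
        this.ssos = Set.copyOf(ssos);
    }

    boolean isSanitizer(String cls, String method) {
        return sanitizers.contains(cls + "." + method);
    }

    boolean isSso(String cls, String method) {
        return ssos.contains(cls + "." + method);
    }

    public static void main(String[] args) {
        TaintPolicySketch policy =
            new TaintPolicySketch(Set.of("Sec.sanitize"), Set.of("Sec.secureMeth"));
        System.out.println(policy.isSso("Sec", "secureMeth"));       // true
        System.out.println(policy.isSanitizer("Sec", "secureMeth")); // false
    }
}
```

Keeping the policy in a separate object of this kind reflects the requirement that sinks and sanitizers are specified by an administrator rather than hard-coded in the program.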

We assume that our program rewriting algorithm is trusted. Input code is trusted to be not malicious, though it may contain errors. We note that this assumption is important for application of taint analysis that disregards indirect flows, since there is confidence that the latter will not be actively exploited by non-malicious code. We assume that untrusted data sources provide low integrity data, though in this work we only consider tainted “static” values, e.g., strings, not tainted code that may be run as part of the main program execution. However, the latter does not preclude hardening against XSS or injection attacks in practice, if we consider an evaluation method to be an SSO.

1.3. Overview by example

As a running example for the paper, consider the following code, where we imagine a tainted input string ” $hello$ ” is concatenated with an untainted string ” $world$ ”, and then sanitized by a method $sanitize$ of a $Sec$ object that implements security functionality. The sanitized result is then passed to an $sso$ called $secureMeth$ , also in the $Sec$ class. $\begin{array}{l} new Sec () . secureMeth (new Sec () . sanitize \\ (” hello ” . concat (new String (” world ”)))) \end{array}$

In order to define the logging policy, we will define an operational (trace) semantics of programs where direct information flow is defined as a property of traces. This property correlates taint labels with values in traces – in the above expression the tainted label ∙ is correlated with ” $hello$ ”, and the untainted label ∘ is correlated with ” $world$ ”. In the trace of this program's execution, the tainted label ∙ is correlated with the concatenated string ” $hello world$ ”, due to the propagation of taint. Sanitization is typically associated with endorsement in taint tracking systems [39]. However, since sanitization can often only be partially trusted, we propose to consider such sanitization results to be “maybe tainted”, which is indicated by correlation with a maybe tainted label ⊙. The logging policy should then specify that maybe tainted data entering any $sso$ should be allowed, but logged.

In all cases, correlation of labels with values in structured data is accomplished via “shadows” of expressions, which are shape-conformant with source language expressions and carry taint labels. For example, in the trace of the above expression, the result of sanitization is the following: $\begin{matrix} TopLevel . main (new Sec () . secureMeth (new String (” hello world ”))) \end{matrix}$ This expression has the following shadow: $\begin{matrix} TopLevel . main (shadow Sec (\circ) . secureMeth (shadow String (⊙, δ))) \end{matrix}$ Note that the shadow replaces distinct lexical values with a dummy value δ, but can be “overlaid” on the expression to obtain the taint correlation. Thus, our logging policy would specify that the maybe tainted string ” $hello world$ ” would be allowed to enter the $sso$ $secureMeth$ , but logged. This example is revisited later in Example 3.2, where we go into detail about its evaluation, and its shadow expressions that capture the operational semantics of taint analysis.
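
The intended behavior on the running example can be approximated by the following hand-written Java sketch. It is not the output of our rewriting algorithm and does not use the Phosphor API; `Taint`, `LabeledString` and `SecSketch` are hypothetical names. The sketch assumes that the three labels are ordered ∘ below ⊙ below ∙ and that concatenation takes the larger label, that sanitizing already untainted data leaves it untainted, and that the sanitizer simply escapes angle brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Labels in assumed order: untainted (∘), maybe tainted (⊙), tainted (∙).
enum Taint {
    UNTAINTED, MAYBE_TAINTED, TAINTED;

    Taint join(Taint other) {
        return ordinal() >= other.ordinal() ? this : other;
    }
}

// A string value overlaid with its taint label.
final class LabeledString {
    final String value;
    final Taint label;

    LabeledString(String value, Taint label) {
        this.value = value;
        this.label = label;
    }

    // Taint propagates through concatenation.
    LabeledString concat(LabeledString other) {
        return new LabeledString(value.concat(other.value), label.join(other.label));
    }
}

final class SecSketch {
    final List<String> auditLog = new ArrayList<>();

    // Sanitization is only partially trusted: results are "maybe tainted".
    LabeledString sanitize(LabeledString in) {
        String cleaned = in.value.replace("<", "&lt;").replace(">", "&gt;");
        Taint out = in.label == Taint.UNTAINTED ? Taint.UNTAINTED : Taint.MAYBE_TAINTED;
        return new LabeledString(cleaned, out);
    }

    // SSO: block tainted data, allow but log maybe tainted data.
    String secureMeth(LabeledString arg) {
        if (arg.label == Taint.TAINTED) {
            throw new SecurityException("tainted data reached an SSO");
        }
        if (arg.label == Taint.MAYBE_TAINTED) {
            auditLog.add("secureMeth: " + arg.value);
        }
        return arg.value;
    }

    public static void main(String[] args) {
        SecSketch sec = new SecSketch();
        LabeledString hello = new LabeledString("hello ", Taint.TAINTED);
        LabeledString world = new LabeledString("world", Taint.UNTAINTED);
        sec.secureMeth(sec.sanitize(hello.concat(world)));
        System.out.println(sec.auditLog); // prints [secureMeth: hello world]
    }
}
```

Passing the unsanitized concatenation directly to `secureMeth` would instead raise the exception, which corresponds to the prospective (blocking) half of the policy.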

Subsequently, we develop a rewriting algorithm called $Phos$ that instruments programs to implement taint analysis, as well as to implement logging of maybe tainted data, and that is correct with respect to the shadow specification. We revisit this example again in Section 4.1.4 to show how taint labels, taint propagation, and logging of maybe tainted data are made explicit by $Phos$ .

1.4. Technical overview

The technical development of the paper proceeds as follows. In Section 2 we describe a formal semantics of auditing, and the conditions for correctness of audit rewriting algorithms. That is, we define what it means for a program instrumentation to correctly log information. In Section 2.1, we introduce information algebra [27] as the basis of our model for correct audit log generation. We characterize logging specifications and correctness conditions for audit logs in a high-level manner using information algebra, and show how information elements and operations can be instantiated using first order logic.

In Section 3 we develop a source language model based on Featherweight Java (Section 3.1), called FJ. We show how to logically specify an in-depth taint analysis policy separately from code in Section 4 via safety property and logging specifications. In Section 3.2 we develop a target language model ${FJ}_{taint}$ with instrumentation for operationally enforcing an in-depth taint analysis, which we show to be correct according to our formal condition in Section 4.2, with our main result being Theorem 4.1.

While Theorem 4.1 establishes correctness conditions for information in audit logs in an operational sense, Section 5 focuses on the high level security property of dynamic integrity taint analysis that is tailored for direct information flow. In Section 5.4, we show that our enforcement mechanism satisfies the hyperproperty of explicit integrity modulo endorsement (Theorem 5.1). In Section 6 we discuss our implementation of the in-depth taint analysis specification presented in Section 4 for the OpenMRS medical records system. We also describe experiments for empirical evaluation of this implementation and discuss results. In Section 7 we discuss related work and conclude the paper.

For the sake of brevity, we have omitted the proofs of all Lemmas and Theorems. Readers are referred to our accompanying Technical Report [43] for these details in full.

2. Foundations for in-depth policy specification

In this section we establish formal foundations for a semantics of prospective and retrospective policy features, and for the correctness of audit instrumentation. An appeal of our approach is that both safety properties [38] and logging correctness can be formulated, so we are able to uniformly characterize operational correctness conditions for in-depth integrity taint analysis in our framework. This framework was initially developed in previous work [2] where we studied so-called “break-the-glass” policies for medical records software. In that work we justified the generality of our framework and discussed its details at length. Here we reiterate the main technical points of the framework to allow a standalone formal presentation.

We leverage ideas from the theory of information algebra [27,28], which is an abstract mathematical framework for information systems. In short, we interpret program traces as information, and logging specifications as functions from traces to information. This separates logging specifications from their implementation in code, and defines exactly the information that should be in an audit log. This in turn establishes correctness conditions for audit logging implementations.

2.1. Introduction to information algebra

Information algebra is an algebraic theory of information where information is seen as a collection of information elements with fundamental aggregation and refinement operations. The algebra consists of two domains, an information domain and a query domain. The information domain Φ is the set of information elements that can be aggregated in order to build more inclusive information elements. The query domain E is a lattice of querying sublanguages in which the partial order relation among these sublanguages represents the granularity of the queries. The information and query domains are left abstract in the general theory – instantiation examples include relational algebra and first order logic as we discuss below. By definition any instantiation must include basic operations for combining information and for focusing on components of information.

Definition 2.1.
Any information algebra $(Φ, E)$ includes two basic operators:
Combination $\otimes : Φ \times Φ \to Φ$ : The operation $X \otimes Y$ combines (or, aggregates) the information in elements $X, Y \in Φ$ .

Focusing $\Rightarrow : Φ \times E \to Φ$ : The operation $X^{\Rightarrow S}$ isolates the elements of $X \in Φ$ that are relevant to a sublanguage $S \in E$ , i.e. the subpart of X specified by S.

Using the combination operator we can define a partial order relation on Φ to compare the information contained in the elements of Φ. A partial ordering is induced on Φ by the so-called information ordering relation ⩽, where intuitively for $X, Y \in Φ$ we have $X ⩽ Y$ iff Y contains at least as much information as X, though its precise meaning depends on the particular algebra.

Definition 2.2.
X is contained in Y, denoted as $X ⩽ Y$ , for all $X, Y \in Φ$ iff $X \otimes Y = Y$ .

Definition 2.3.
We say that X and Y are information equivalent, and write $X = Y$ , iff $X ⩽ Y$ and $Y ⩽ X$ .

For a more detailed account of information algebra, the reader is referred to a definitive survey paper [28].

2.1.1. Illustrative example: Relational algebras

Relational algebra is a well-recognized instance of information algebra. The formulation of relational algebra as an information algebra by Kohlas [28] is an illustrative example of this information theoretic framework. Here we reproduce this example.

Let $A$ denote the set of attributes, $A_{i} \subseteq A$ for $i \in \{1, 2, 3\}$ , $A_{2} \subseteq A_{1}$ , and assume that $A_{1} = \{a_{1}, \dots, a_{n}\}$ . Each tuple $((a_{1} : x_{1}), \dots, (a_{n} : x_{n}))$ can be formulated as a function $f : A_{1} \to \{x_{1}, \dots, x_{n}\}$ , where $f (a_{i}) = x_{i}$ . The values $x_{i}$ may come from different domains.

Function $f [A_{2}] : A_{2} \to \{x_{1}, \dots, x_{n}\}$ is the restriction of f to $A_{2}$ , defined as $f [A_{2}] (a) = f (a)$ , for all $a \in A_{2}$ . A relation R over $A_{1}$ is a set of functions f defined on a specific set of attributes $A_{1}$ . Then, the projection of R on $A_{2}$ is defined as $π_{A_{2}} (R) = \{f [A_{2}] ∣ f \in R\}$ . The natural join of relation R over $A_{1}$ and $R^{'}$ over $A_{3}$ is defined as $R ⋈ R^{'} = \{f ∣ dom (f) = A_{1} \cup A_{3}, f [A_{1}] \in R, f [A_{3}] \in R^{'}\}$ .

Instantiation. Let $R$ be the universe of all relations R. Then, $(R, P (A))$ is an information algebra with the following definitions for combination and focusing: $\begin{matrix} R \otimes R^{'} ≜ R ⋈ R^{'} \\ R^{\Rightarrow A_{1}} ≜ π_{A_{1}} (R) \end{matrix}$ Note that in this formulation, the restriction operator is defined partially on the set of attributes [28]. And, according to Definition 2.2, for all relations R and $R^{'}$ , $R ⩽ R^{'}$ iff $R ⋈ R^{'} = R^{'}$ . For example, $π_{A_{1}} (R) ⩽ R$ . Moreover, the set of querying sublanguages $P (A)$ is a lattice induced by subset containment relation ⊆.
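
The relational instance can be exercised with a short Java sketch. It is an illustration under simplifying assumptions, not part of the formal development: tuples are modeled as maps from attribute names to non-null string values, all tuples of a relation share the same attributes, and the class name `RelationalInfoSketch` and the sample attributes are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A tuple is a function from attribute names to values; a relation is a set of tuples.
final class RelationalInfoSketch {
    // Focusing: project every tuple onto the given attributes.
    static Set<Map<String, String>> project(Set<Map<String, String>> r, Set<String> attrs) {
        Set<Map<String, String>> out = new HashSet<>();
        for (Map<String, String> f : r) {
            Map<String, String> g = new HashMap<>(f);
            g.keySet().retainAll(attrs);
            out.add(g);
        }
        return out;
    }

    // Combination: natural join, merging tuples that agree on shared attributes.
    static Set<Map<String, String>> join(Set<Map<String, String>> r, Set<Map<String, String>> s) {
        Set<Map<String, String>> out = new HashSet<>();
        for (Map<String, String> f : r) {
            for (Map<String, String> g : s) {
                if (agreeOnShared(f, g)) {
                    Map<String, String> h = new HashMap<>(f);
                    h.putAll(g);
                    out.add(h);
                }
            }
        }
        return out;
    }

    private static boolean agreeOnShared(Map<String, String> f, Map<String, String> g) {
        for (Map.Entry<String, String> e : f.entrySet()) {
            if (g.containsKey(e.getKey()) && !e.getValue().equals(g.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Definition 2.2: X is contained in Y iff combining X with Y yields Y.
    static boolean leq(Set<Map<String, String>> x, Set<Map<String, String>> y) {
        return join(x, y).equals(y);
    }

    public static void main(String[] args) {
        Set<Map<String, String>> r = Set.of(
            Map.of("name", "ann", "ward", "a"),
            Map.of("name", "bob", "ward", "b"));
        Set<Map<String, String>> names = project(r, Set.of("name"));
        System.out.println(leq(names, r)); // true: the projection carries no more information than r
        System.out.println(leq(r, names)); // false
    }
}
```

Joining a projection back onto the relation it came from returns that relation unchanged, which is why a projection is always below its source in the information ordering.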

2.2. A general model for logging specifications

Following [38], an execution trace $τ = κ_{0} κ_{1} κ_{2} \dots$ is a possibly infinite sequence of configurations κ that describe the state of an executing program. We deliberately leave configurations abstract, but examples abound and we explore a specific instantiation for FJ-based calculus in Section 3. Note that an execution trace τ may represent the partial execution of a program, i.e. the trace τ may be extended with additional configurations as the program continues execution. We use metavariables τ and σ to range over traces, and use ∅ to denote an empty trace.

We assume a given function $⌊ \cdot ⌋$ that is an injective mapping from traces to Φ. This mapping interprets a given trace as information, where the injective requirement ensures that information is not lost in the interpretation. For example, if σ is a proper prefix of τ and thus contains strictly less information, then formally $⌊ σ ⌋ ⩽ ⌊ τ ⌋$ . We intentionally leave both Φ and $⌊ \cdot ⌋$ underspecified for generality, though application of our formalism to a particular logging implementation requires instantiation of them.

We let $LS$ range over logging specifications, which are functions from traces to Φ. As for Φ and $⌊ \cdot ⌋$ , we intentionally leave the language of specifications abstract, but consider a particular instantiation in Section 2.6. Intuitively, $LS (τ)$ denotes the information that should be recorded in an audit log during the execution of τ given specification $LS$ , regardless of whether τ actually records any log information, correctly or incorrectly. We call this the semantics of the logging specification $LS$ .

We assume that auditing is implementable, requiring at least that all conditions for logging any piece of information must be met in a finite amount of time. As we will show, this restriction implies that correct logging instrumentation is a safety property [38].

Definition 2.4.
We require of any logging specification $LS$ that for all traces τ and information $X ⩽ LS (τ)$ , there exists a finite prefix σ of τ such that $X ⩽ LS (σ)$ .

It is crucial to observe that some logging specifications may add information not contained in traces to the auditing process. Security information not relevant to program execution (such as ACLs), interpretation of event data (statistical or otherwise), etc., may be added by the logging specification. For example, in the OpenMRS system [46], logging of sensitive operations includes a human-understandable “type” designation which is not used by any other code. Thus, given a trace τ and logging specification $LS$ , it is not necessarily the case that $LS (τ) ⩽ ⌊ τ ⌋$ . Audit logging is not just a filtering of program events.

2.3. Correctness conditions for audit logs

A logging specification defines what information should be contained in an audit log. In this section we develop formal notions of soundness and completeness as audit log correctness conditions. We use metavariable $L$ to range over audit logs. Again, we intentionally leave the language of audit logs unspecified, but assume that the function $⌊ \cdot ⌋$ is extended to audit logs, i.e. $⌊ \cdot ⌋$ is an injective mapping from audit logs to Φ. Intuitively, $⌊ L ⌋$ denotes the information in $L$ , interpreted as an element of Φ.

An audit log $L$ is sound with respect to a logging specification $LS$ and trace τ if the log information is contained in $LS (τ)$ . Similarly, an audit log is complete with respect to a logging specification if it contains all of the information in the logging specification’s semantics. Crucially, both definitions are independent of the implementation details that generate $L$ .

Definition 2.5.
Audit log $L$ is sound with respect to logging specification $LS$ and execution trace τ iff $⌊ L ⌋ ⩽ LS (τ)$ .
Definition 2.6.
Audit log $L$ is complete with respect to logging specification $LS$ and execution trace τ iff $LS (τ) ⩽ ⌊ L ⌋$ .
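
To make Definitions 2.5 and 2.6 concrete, the following Java sketch instantiates the information domain in the simplest way we can think of: information elements are finite sets of facts (strings), combination is set union, and containment is therefore subset inclusion. This instantiation, the class name `AuditLogChecks`, and the sample fact are assumptions for illustration; the instantiation used in the rest of the paper is based on first order logic.

```java
import java.util.HashSet;
import java.util.Set;

// Assumed instantiation: information elements are finite sets of facts.
final class AuditLogChecks {
    // Combination is set union.
    static Set<String> combine(Set<String> x, Set<String> y) {
        Set<String> out = new HashSet<>(x);
        out.addAll(y);
        return out;
    }

    // Definition 2.2: X is contained in Y iff X combined with Y equals Y.
    static boolean leq(Set<String> x, Set<String> y) {
        return combine(x, y).equals(y);
    }

    // Definition 2.5: the log holds nothing beyond what the specification requires.
    static boolean isSound(Set<String> log, Set<String> specInfo) {
        return leq(log, specInfo);
    }

    // Definition 2.6: the log holds everything the specification requires.
    static boolean isComplete(Set<String> log, Set<String> specInfo) {
        return leq(specInfo, log);
    }

    public static void main(String[] args) {
        Set<String> specInfo = Set.of("sso(secureMeth, hello world)");
        Set<String> emptyLog = Set.of();
        System.out.println(isSound(emptyLog, specInfo));    // true
        System.out.println(isComplete(emptyLog, specInfo)); // false
    }
}
```

An empty log is trivially sound but not complete, while a log with spurious entries may be complete but not sound; only a log that is both carries exactly the specified information.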

2.4. Correct logging instrumentation is a safety property

In case program executions generate audit logs, we write $τ ⇝ L$ to mean that trace τ generates $L$ , i.e. $τ = κ_{0} \dots κ_{n}$ and $logof (κ_{n}) = L$ where $logof (κ)$ denotes the audit log in configuration κ, i.e. the residual log after execution of the full trace. Ideally, information that should be added to an audit log is added immediately, as soon as it becomes available. Using the term “instrumentation” to refer to program elements for audit log generation, this idea is formalized as follows.

Definition 2.7.
For all logging specifications $LS$ , the trace τ is ideally instrumented for $LS$ iff for all finite prefixes σ of τ we have $σ ⇝ L$ where $L$ is sound and complete with respect to $LS$ and σ.

We observe that the restriction imposed on logging specifications by Definition 2.4 implies that ideal instrumentation of any logging specification is a safety property in the sense defined by Schneider [38].
Theorem 2.1.
For all logging specifications $LS$ , the set of ideally instrumented traces is a safety property.

This result implies that e.g. edit automata can be used to enforce instrumentation of logging specifications [1]. However, theory related to safety properties and their enforcement by execution monitors [9,38] does not provide an adequate semantic foundation for audit log generation, nor an account of soundness and completeness of audit logs.
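
The intuition behind Definition 2.7 and Theorem 2.1 can be sketched in Java under the same simplified assumption that information elements are finite sets of facts, so that "sound and complete" reduces to set equality. `TraceConfig`, `IdealInstrumentationSketch` and the example specification `logSsoEvents` are hypothetical names, and only finite, nonempty prefixes are checked.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// A configuration, reduced to the event it performs and the log it carries.
final class TraceConfig {
    final String event;
    final Set<String> log;

    TraceConfig(String event, Set<String> log) {
        this.event = event;
        this.log = log;
    }
}

final class IdealInstrumentationSketch {
    // Example specification (assumed): every event tagged "sso:" must be logged.
    static Set<String> logSsoEvents(List<TraceConfig> prefix) {
        Set<String> out = new HashSet<>();
        for (TraceConfig c : prefix) {
            if (c.event.startsWith("sso:")) {
                out.add(c.event);
            }
        }
        return out;
    }

    // Every nonempty finite prefix must end with a log equal to the specified information.
    static boolean ideallyInstrumented(List<TraceConfig> trace,
                                       Function<List<TraceConfig>, Set<String>> spec) {
        for (int n = 1; n <= trace.size(); n++) {
            List<TraceConfig> prefix = trace.subList(0, n);
            Set<String> log = prefix.get(n - 1).log;
            if (!log.equals(spec.apply(prefix))) {
                return false; // a finite prefix witnesses the violation
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<TraceConfig> lagging = List.of(
            new TraceConfig("sso:secureMeth", Set.of()),
            new TraceConfig("done", Set.of("sso:secureMeth")));
        System.out.println(ideallyInstrumented(lagging, IdealInstrumentationSketch::logSsoEvents)); // false
    }
}
```

In the example the final log is correct, but the entry is written one step late, so the trace is not ideally instrumented. The violation is detected at a finite prefix, which matches the shape of a safety property.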

2.5. Implementing logging specifications with program rewriting

The above-defined correctness conditions for audit logs provide a foundation on which to establish correctness of logging implementations. Here we consider program rewriting approaches. Since rewriting concerns specific languages, we introduce an abstract notion of programs p with an operational semantics that can produce a trace. We write $p ⇓ σ$ iff program p can produce execution trace τ, either deterministically or non-deterministically, and σ is a finite prefix of τ.

A rewriting algorithm $R$ is a (partial) function that takes a program p in a source language and a logging specification $LS$ and produces a new program, $R (p, LS)$ , in a target language.² The intent is that the target program is the result of instrumenting p to produce an audit log appropriate for the logging specification $LS$ . A rewriting algorithm may be partial, in particular because it may only be intended to work for a specific set of logging specifications.

² We use metavariable p to range over programs in either the source or target language; it will be clear from context which language is used.

Ideally, a rewriting algorithm should preserve the semantics of the program it instruments. That is, $R$ is semantics-preserving if the rewritten program simulates the semantics of the source code, modulo logging steps. We assume given a correspondence relation ≅ on execution traces. A coherent definition of correspondence should be similar to a bisimulation, but is not necessarily a bisimulation, since the instrumented target program may be in a different language than the source program. We deliberately leave the correspondence relation underspecified, as its definition will depend on the instantiation of the model. Possible definitions are that traces produce the same final value, or that traces when restricted to a set of memory locations are equivalent up to stuttering (i.e., different numbers of “internal” execution steps that do not affect memory). Furthermore, because rewriting will often add blocking checks for unsafe behaviors (as in the case we will study), semantics preservation is defined up to simulation of sets of program traces that will typically be defined as a safety property. We provide a definition of correspondence for FJ-calculus source and target languages in Section 4.2, that illustrates these concepts.

Definition 2.8.

Let T be a set of program traces. Rewriting algorithm $R$ is semantics preserving up to T iff for all programs p and logging specifications $LS$ such that $R (p, LS)$ is defined, all of the following hold:

For all traces $τ \in T$ such that $p ⇓ τ$ there exists $τ^{'}$ with $τ ≅ τ^{'}$ and $R (p, LS) ⇓ τ^{'}$ .

For all traces τ such that $R (p, LS) ⇓ τ$ there exists a trace $τ^{'} \in T$ such that $τ^{'} ≅ τ$ and $p ⇓ τ^{'}$ .

In addition to preserving program semantics, a correctly rewritten program constructs a log in accordance with the given logging specification. More precisely, if $LS$ is a given logging specification and a trace τ describes execution of a source program, rewriting should produce a program with a trace $τ^{'}$ that corresponds to τ (i.e., $τ ≅ τ^{'}$ ), where the log $L$ generated by $τ^{'}$ ideally contains the same information as $LS (τ)$ . Some definitions of ≅ may allow several target-language traces to correspond to source-language traces. Hence we write $simlogs (p, τ)$ to denote a nonempty set of logs $L$ such that, given source language trace τ and target program p, there exists some trace $τ^{'}$ where $p ⇓ τ^{'}$ and $τ ≅ τ^{'}$ and $τ^{'} ⇝ L$ . The name $simlogs$ evokes the relation to logs resulting from simulating executions in the target language.

The following definitions then establish correctness conditions for rewriting algorithms in conjunction with semantics preservation. Like semantics preservation, we define soundness and completeness with respect to a given set of traces.

Definition 2.9.

Let T be a set of traces. Rewriting algorithm $R$ is sound up to T iff for all programs p, logging specifications $LS$ , and finite traces $τ \in T$ where $p ⇓ τ$ , for all $L \in simlogs (R (p, LS), τ)$ it is the case that $L$ is sound with respect to $LS$ and τ.

Definition 2.10.

Let T be a set of traces. Rewriting algorithm $R$ is complete up to T iff for all programs p, logging specifications $LS$ , and finite traces $τ \in T$ where $p ⇓ τ$ , for all $L \in simlogs (R (p, LS), τ)$ it is the case that $L$ is complete with respect to $LS$ and τ.

Note also that without semantics preservation, soundness and completeness could be satisfied trivially in case $simlogs (R (p, LS), τ)$ is empty.

2.6. A first order logic (FOL) specification language

Logics have been used in several well-developed auditing systems [16,23], for the encoding of both audit logs and queries. FOL in particular is attractive due to readily available implementation support, e.g. Datalog and Prolog. We have shown in previous work that FOL is an information algebra, and useful for e.g. break the glass policy specification [2]. Here we summarize important definitions for the remainder of this paper.

Let Greek letters ϕ and ψ range over FOL formulas and let capital letters X, Y, Z range over sets of formulas. We posit a sound and complete proof theory supporting judgements of the form $X ⊢ ϕ$ . In this text we assume without loss of generality a natural deduction proof theory.

Elements of our algebra are sets of formulas closed under logical entailment. Intuitively, given a set of formulas X, the closure of X is the set of formulas that are logically entailed by X, and thus represents all the information contained in X. In spirit, we follow the treatment of sentential logic as an information algebra explored in related foundational work [27]; however, our definition of closure is syntactic, not semantic.

Definition 2.11.
We define a closure operation C, and a set $Φ_{FOL}$ of closed sets of formulas: $\begin{array}{l} C (X) = \{ϕ ∣ X ⊢ ϕ\} \\ Φ_{FOL} = \{X ∣ C (X) = X\} \end{array}$ Note in particular that $C (\emptyset)$ is the set of logical tautologies.

Let $Preds$ be the set of all predicate symbols, and let $S \subseteq Preds$ be a set of predicate symbols. We define sublanguage $L_{S}$ to be the set of well-formed formulas over predicate symbols in S (including boolean atoms $true$ and $false$ , and closed under the usual first-order connectives and binders). We will use sublanguages to define refinement operations in our information algebra. Subset containment induces a lattice structure, denoted $S$ , on the set of all sublanguages, with $F = L_{Preds}$ as the top element.

Now we can define the focusing and combination operators, which are the fundamental operators of an information algebra. Focusing isolates the component of a closed set of formulas that is in a given sublanguage. Combination closes the union of closed sets of formulas. Intuitively, the focus of a closed set of formulas X to sublanguage L is the refinement of the information in X to the formulas in L. The combination of closed sets of formulas X and Y combines the information of each set.
Definition 2.12.
Define:
Focusing: $X^{\Rightarrow S} = C (X \cap L_{S})$ where $X \in Φ_{FOL}$ , $S \subseteq Preds$

Combination: $X \otimes Y = C (X \cup Y)$ where $X, Y \in Φ_{FOL}$

We define $X ⩽ Y$ iff $X \otimes Y = Y$ ; properties of the algebra ensure that ⩽ is a partial ordering. In our logical formulation, this means that for all $X, Y \in Φ_{FOL}$ we have $X ⩽ Y$ iff $X \subseteq Y$ , i.e. ⩽ is subset inclusion over closed sets of formulas.

The following theorem establishes that the construction is an information algebra; for a complete proof the reader is directed to [1].
Theorem 2.2.
Structure $(Φ_{FOL}, S)$ with focus operation $X^{\Rightarrow S}$ and combination operation $X \otimes Y$ forms an information algebra.

In addition, to interpret traces and logs as elements of this algebra, i.e. to define the function $⌊ \cdot ⌋$ , we assume existence of a function $toFOL (\cdot)$ that injectively maps traces and logs to sets of FOL formulas, and then take $⌊ \cdot ⌋ = C (toFOL (\cdot))$ . To define the range of $toFOL (\cdot)$ , that is, to specify how trace information will be represented in FOL, we assume the existence of configuration description predicates P which are each at least unary. Each configuration description predicate fully describes some element of a configuration κ, and the first argument is always a natural number n, indicating the time at which the configuration occurred. A set of configuration description predicates with the same timestamp describes a configuration, and traces are described by the union of sets describing each configuration in the trace. We will fully define $toFOL (\cdot)$ when we discuss particular source and target languages for program rewriting.

Formally, we define logging specifications in a logic programming style by using combination and focusing. Any logging specification is parameterized by a sublanguage S that identifies the predicates to be resolved and a set X of Horn clauses that defines them. It can be defined via the functional $spec$ from pairs $(X, S)$ to specifications $LS$ , where we use λ as a binder for function definitions in the usual manner:
Definition 2.13.
The function $spec$ is given a pair $(X, S)$ and returns a FOL logging specification, i.e. a function from traces to elements of $Φ_{FOL}$ : $\begin{matrix} spec (X, S) = λ τ . {(⌊ τ ⌋ \otimes C (X))}^{\Rightarrow S} . \end{matrix}$

We will formulate a particular example of a logging specification for maybe tainted data in Definition 3.3.
3. Direct information flow: Dynamic integrity taint analysis

In this section we present a basic object-oriented calculus as the foundation of our language model. We also show how the in-depth integrity taint analysis model described in Section 1.2 can be specified as a logical property of program traces in this model, independent of program instrumentation. This allows us to define retrospective taint analysis as a logging specification in the style introduced in Section 2. Subsequently in Section 4 we will show how this specification can be correctly instrumented via program rewriting into a target language, hence we refer to the language introduced in this section as our source language.

3.1. Source language

Our source language model is essentially Featherweight Java (FJ) [26] with minor extensions including base types and an abstract notion of library methods for base types. The latter is important for an adequate consideration of taint propagation (e.g. on strings) in our model. FJ is a functional core calculus that includes class hierarchy definitions, subtyping, dynamic dispatch, and other basic features of Java. An FJ program is an expression $e$ which is executed given a static class table $CT$ which maintains class definitions. To describe program execution we will define a small step operational semantics relation on expressions $e$ which we will take as synonymous with configurations as defined previously.

3.1.1. Syntax

The syntax of FJ is defined in Fig. 1. We let $A$ , $B$ , $C$ , $D$ range over class names, $x$ range over variables, $f$ range over field names, and $m$ range over method names. Values, denoted $v$ or $u$ , are objects, i.e. expressions of the form $new C (v_{1}, \dots, v_{n})$ . We assume a given $Object$ value that has no fields or methods. In addition to the standard expressions of FJ, we introduce a new form $C . m (e)$ . This form is used to identify the method $C . m$ associated with a current evaluation context (aka the “activation frame”). It does not change the semantics, but is a useful feature for our specification of sanitizer endorsement, since return values from sanitizers need to be endorsed; see the $Invoke$ and $Return$ rules in the operational semantics below for its usage.

Fig. 1.

FJ syntax.

Conditional expressions are an important feature of the language for this presentation, since they are a control flow operation that should not be considered in a direct flow analysis. We assume that in any program setting true and false values, denoted T and F, will be specified. When we consider base values and library methods below in Section 3.1.6, we will define a particular boolean value that we will use in this presentation.

For brevity in this syntax, we use vector notations. Specifically we write $\overline{f}$ to denote the sequence $f_{1}, \dots, f_{n}$ , similarly for $\overline{C}$ , $\overline{m}$ , $\overline{x}$ , $\overline{e}$ , etc., and we write $\overline{M}$ as shorthand for $M_{1} \dots M_{n}$ . We write the empty sequence as ∅ and use a comma as a sequence concatenation operator. We write $m \in \overline{m}$ if and only if $m$ is one of the names in $\overline{m}$ . Vector notation is also used to abbreviate sequences of declarations; we let $\overline{C} \overline{f}$ and $\overline{C} \overline{f};$ denote $C_{1} f_{1}, \dots, C_{n} f_{n}$ and $C_{1} f_{1}; \dots; C_{n} f_{n};$ respectively. The notation $this . \overline{f} = \overline{f};$ abbreviates $this . f_{1} = f_{1}; \dots; this . f_{n} = f_{n};$ . Sequences of names and declarations are assumed to contain no duplicate names.

3.1.2. The class table and field and method body lookup

The class table $CT$ maintains class definitions. The manner in which we look up field and method definitions implements inheritance and override, which allows fields and methods to be redefined in subclasses. Given a class table $CT$ , the definitions of ${mbody}_{CT} (m, C)$ and ${fields}_{CT} (C)$ are given in Fig. 2.

3.1.3. Method type lookup

Just as we have defined a function for looking up method bodies in the class table, we also define in Fig. 2 a function ${mtype}_{CT} (C, m)$ that looks up the type of a method $C . m$ in a class table. Although we omit FJ type analysis from this presentation, method type lookup will be useful for taint analysis instrumentation (Definition 4.1).

Fig. 2.

Object field, method body, and method type lookup.

3.1.4. Operational semantics

Now, we can define the operational semantics of FJ. The reduction relation is binary, of the form $κ \to κ^{'}$ , and is defined via the inference rules in Fig. 3.

Fig. 3.

Operational semantics for FJ.

The definition of → assumes given a class table $CT$ which is typically clear from context, but we will write $CT ⊢ κ \to κ^{'}$ to disambiguate class tables used in reductions when necessary. The definition also assumes that boolean values T and F are specified. We use $\to^{*}$ to denote the reflexive, transitive closure of →, and we use $\to^{n}$ to denote an n-step reduction. We will also use the notion of an execution trace τ to range over sequences of configurations $κ_{0} \dots κ_{n}$ where $κ_{i} \to κ_{i + 1}$ for all $0 ⩽ i < n$ . Note that an execution trace τ may represent the partial execution of a program, i.e. the trace τ may be extended with additional configurations as the program continues execution. In general we will write $CT ⊢_{\to} τ$ to disambiguate the class table $CT$ and reduction relation → used for a trace τ when it is not clear from context.

3.1.5. Top-level programs

We define top-level programs $p (a)$ as programs of the form: $\begin{matrix} new TopLevel () . main (a) \end{matrix}$ where a is a primitive object $new C (\overline{ν})$ . We assume that all class tables $CT$ include an entry point $TopLevel . main$ with formal parameter $attack$ , where $TopLevel$ objects have no fields. We write $p (a) ⇓ τ$ iff trace τ begins with the configuration $p (a)$ .

3.1.6. Library methods

In order to study dynamic integrity taint analysis in FJ, we extend the semantics with library methods that allow specification of operations on base values (such as strings and integers). Consideration of these features is important for a thorough modeling of Phosphor-style taint analysis, and of important related issues such as string- vs. character-based taint [18], which have not been considered in previous formal work on taint analysis [41]. Since static analysis is not a topic of this paper, for brevity we omit the standard FJ type analysis which is described in [26].

The abstract calculus described above is not particularly interesting with respect to direct information flow and integrity propagation, especially since method dispatch and conditional expressions are control flows that are discounted in direct data flow. More interesting is the manner in which taint propagates through base values and library operations, since direct flows propagate through some of these methods. Also, for run-time efficiency and ease of coding, some Java taint analysis tools treat even complex library methods as “black boxes” that are instrumented at the top level [25], rather than relying on instrumentation of lower-level operations.

Note that treating library methods as “black boxes” introduces a potential for over- and under-tainting – for example in some systems all string library methods that return strings are instrumented to return tainted results if any of the arguments are tainted, regardless of any direct flow from the argument to result [25]. Clearly this strategy introduces a potential for over-taint. Other systems do not propagate taint from strings to their component characters when decomposed [18], which is an example of under-taint. Part of our goal here is to develop an adequate language model to consider these approaches.

We therefore extend our basic definitions to accommodate base values and their manipulation. Let a primitive field be a field containing a base value. We call a base type any class with primitive fields only, and a library method is any method that operates on base type objects, defined in a primitive class. We expect primitive objects to be object wrappers for primitive values (e.g., $Int (5)$ wrapping primitive value $5$ ), and library methods to be object-oriented wrappers over primitive operations (e.g., $Int plus (Int)$ wrapping primitive operation $+$ ), allowing the latter’s embedding in FJ. As a sanity condition we only allow library methods to select primitive fields or perform primitive operations. Let $LibMeths$ be the set of library method names paired with their corresponding base classes in $BaseTypes$ .

We posit a special set of field names $PrimField$ that access primitive values ranged over by ν that may occur in objects, and a set of operations ranged over by $Op$ that operate on primitive values. We require that special field name selections only occur as arguments to $Op$ , which can easily be enforced in practice by a static analysis. Similarly, primitive values ν may only occur in special object fields and be manipulated there by any $Op$ . $\begin{array}{l} f^{*} \in PrimField \\ e ::= ν ∣ e . f^{*} \\ e ::= \dots ∣ Op (\overline{e}) \\ v ::= new C (\overline{v}) ∣ ν \\ E ::= \dots ∣ Op (\overline{ν}, E, \overline{e}) \end{array}$ The body of any library method is required to be of the form $return new C ({\overline{e}}_{1}, \dots, {\overline{e}}_{n})$ where $C$ is a primitive class.

We define the meaning of operations $Op$ via an “immediate” big-step semantic relation ≈ where the right-hand side of the relation is required to be a primitive value, and we identify expressions up to ≈. For example, to define a library method for integer addition, where $Int$ objects contain a primitive numeric $val$ field, we would define a + operation as follows: $\begin{matrix} + (n_{1}, n_{2}) \approx n_{1} + n_{2} \end{matrix}$ Then we can add to the definition of $Int$ in $CT$ a method $plus$ to support arithmetic in programs: $\begin{matrix} Int plus (Int x) \{return (new Int (+ (this . val, x . val)));\} \end{matrix}$ Similarly, to define string concatenation, we define a concatenation operation @ on primitive strings: $\begin{matrix} @ (s_{1}, s_{2}) \approx s_{1} s_{2} \end{matrix}$ and we extend the definition of $String$ in $CT$ with the following method, where we assume all $String$ objects maintain their primitive representation in a $val$ field: $\begin{matrix} String concat (String x) \{return (new String (@ (this . val, x . val)));\} \end{matrix}$

A boolean class $Bool$ can be defined on the basis of constants $true$ and $false$ and standard boolean connectives; we will subsequently use this encoding for values T and F and conditional guards: $\begin{array}{l} b \in \{true, false\} \qquad T ≜ new Bool (true) \qquad F ≜ new Bool (false) \\ \land (b_{1}, b_{2}) \approx b_{1} \land b_{2} \qquad \lor (b_{1}, b_{2}) \approx b_{1} \lor b_{2} \qquad \neg (b) \approx \neg b \end{array}$ These boolean values can represent the results of base object comparison operators such as a string equality test: $\begin{array}{l} eq (s_{1}, s_{2}) \approx b \quad \text{where} \quad b = \begin{cases} true & \text{if } s_{1} = s_{2} \\ false & \text{otherwise} \end{cases} \\ String eq (String x) \{return (new Bool (eq (this . val, x . val)));\} \end{array}$

3.2. In-depth integrity analysis specified logically

In this section, we demonstrate how in-depth integrity taint analysis for FJ can be expressed as a single uniform policy separate from code. To accomplish this we interpret program traces as information represented by a logical fact base in the style of Datalog. We then define a predicate called $Shadow$ that inductively constructs a “shadow” of configurations reflecting the taint of values.

Java-based taint analyses naturally tend to be object-based, i.e. low-integrity values are objects conceptually, and objects have an assigned taint level in the implementation. The types of tainted objects vary depending on the analysis, but most emphasize taint of base values. We will likewise focus on taint of base values, though we will support taint labeling of all objects. This is partly to generalize the representation, but also for formal convenience: in our logical specification of taint analysis, a shadow expression has a syntactic structure that matches the configuration expression, and associates integrity levels (including “high” ∘ and “low” ∙) with particular objects via shape conformance.

Example 3.1.
Suppose a method $m$ of an untainted $C$ object with no fields is invoked on a pair of tainted $s_{1}$ and untainted $s_{2}$ strings: $\begin{matrix} new C () . m (new String (s_{1}), new String (s_{2})) \end{matrix}$ Its proper shadow is: $\begin{matrix} shadow C (\circ) . m (shadow String (∙), shadow String (\circ)) . \end{matrix}$

On the basis of shadow expressions that correctly track integrity, we can logically specify prospective taint analysis as a property of shadowed trace information, and retrospective taint analysis as a function of shadowed trace information. An extended example of a shadowed trace is presented in Section 3.2.4.
3.2.1. Taint tracking as a logical trace property

In order to specify taint tracking, we define the mapping $toFOL (\cdot)$ , which shows how we concretely model execution traces in FOL by interpreting FJ traces as sets of logical facts (a fact base). Intuitively, in the interpretation each configuration is represented by a $Context$ predicate representing the evaluation context, and a predicate representing the redex (e.g. $Call$ ). Each of these predicates has an initial natural number argument denoting a “timestamp” that orders configurations in a trace.

Fig. 4.

Interpreting expressions as formulas via $toFOL (\cdot)$ .

Definition 3.1.

We define $toFOL (\cdot)$ as a mapping on traces and configurations: $\begin{array}{l} toFOL (τ) = ⋃_{σ \in prefix (τ)} toFOL (σ) \end{array}$ such that $toFOL (σ) = ⋃_{i} toFOL (κ_{i}, i)$ for $σ = κ_{1} \dots κ_{k}$ . We define $toFOL (κ, n)$ in Fig. 4.

Integrity identifiers. We introduce an integrity identifier t that denotes the integrity level associated with objects. To support a notion of “partial endorsement” for partially trusted sanitizers, we define three taint labels, to denote high integrity (∘), low integrity (∙), and uncertain integrity (⊙). We refer to these levels as untainted, tainted, and maybe tainted, respectively. $\begin{matrix} t ::= \circ ∣ ⊙ ∣ ∙ \end{matrix}$ We specify an ordering ⩽ on these labels denoting their integrity relation: $\begin{matrix} ∙ ⩽ ⊙ ⩽ \circ \end{matrix}$ For simplicity in this presentation we will assume that all $Sanitizers$ are partially trusted and cannot raise the integrity of a tainted or maybe tainted object beyond maybe tainted. It would be possible to include both trusted and untrusted sanitizers without changing the formalism.

We posit the usual meet ∧ and join ∨ operations on taint lattice elements, and introduce logical predicates $meet$ and $join$ such that $meet (t_{1} \land t_{2}, t_{1}, t_{2})$ and $join (t_{1} \lor t_{2}, t_{1}, t_{2})$ hold.
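As a concrete illustration, the following minimal Java sketch encodes the three labels as an enum whose declaration order realizes the ordering from low integrity (∙) through uncertain integrity (⊙) to high integrity (∘), so that meet and join reduce to taking the smaller or larger label. The names `TaintLatticeSketch`, `Taint`, `LOW`, `MAYBE`, `HIGH`, `meetHolds`, and `joinHolds` are assumptions made for illustration and are not part of the formal development.

```java
// Hypothetical sketch: the three-point integrity lattice as a Java enum.
final class TaintLatticeSketch {
    /** Declared in increasing integrity order, so ordinal() realizes LOW <= MAYBE <= HIGH. */
    enum Taint {
        LOW,    // low integrity
        MAYBE,  // uncertain integrity
        HIGH;   // high integrity

        boolean leq(Taint other) { return ordinal() <= other.ordinal(); }

        /** Greatest lower bound: the lower-integrity label of the two. */
        Taint meet(Taint other) { return leq(other) ? this : other; }

        /** Least upper bound: the higher-integrity label of the two. */
        Taint join(Taint other) { return leq(other) ? other : this; }
    }

    /** Mirrors the logical predicate meet(t, t1, t2): holds iff t is the meet of t1 and t2. */
    static boolean meetHolds(Taint t, Taint t1, Taint t2) { return t == t1.meet(t2); }

    /** Mirrors the logical predicate join(t, t1, t2): holds iff t is the join of t1 and t2. */
    static boolean joinHolds(Taint t, Taint t1, Taint t2) { return t == t1.join(t2); }
}
```

Because the lattice is a total order, both operations are a single comparison; a richer lattice would need explicit tables.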

3.2.2. Shadow traces, taint propagation, and sanitization

Shadow traces reflect taint information of objects as they are passed around programs. Shadow traces consist of shadow expressions and contexts, which are terms in the logic with the following syntax. Note the structural conformance with closed $e$ and $E$ , but with primitive values replaced with a single dummy value δ that is omitted for brevity in examples, but is necessary to maintain proper arity for field selection. Shadow expressions most importantly assign integrity identifiers t to values in objects; structural conformance is necessary since multiple values can occur in the same structured expression, and labels in shadows need to line up with their corresponding values in expressions. This is illustrated above in Example 3.1, and is discussed at more length in Section 1.3 in an example that we flesh out here in Section 3.2.4. $\begin{array}{l} sv ::= shadow C (t, \overline{sv}) ∣ δ \\ se ::= sv ∣ se . f ∣ se . m (\overline{se}) ∣ shadow C (t, \overline{se}) ∣ C . m (se) ∣ Op (\overline{se}) ∣ if se then se else se \\ SE ::= [] ∣ SE . f ∣ SE . m (\overline{se}) ∣ sv . m (\overline{sv}, SE, {\overline{se}}^{'}) ∣ shadow C (t, \overline{sv}, SE, {\overline{se}}^{'}) ∣ C . m (SE) ∣ Op (\overline{sv}, SE, \overline{se}) ∣ if SE then se else se \end{array}$ The shadowing specification requires that shadow expressions evolve in a shape-conformant way with the original configuration. To this end, we define a metatheoretic function for shadow method bodies, $smbody$ , that imposes untainted tags on all method bodies, defined a priori, and removes primitive values.

Definition 3.2.
Shadow method bodies are defined by the function $smbody$ . $\begin{array}{l} {smbody}_{CT} (m, C) = \overline{x} . srewrite (e), \end{array}$ where ${mbody}_{CT} (m, C) = \overline{x} . e$ and the shadow rewriting function, $srewrite$ , is defined as follows, where $srewrite (\overline{e})$ denotes a mapping of $srewrite$ over the vector $\overline{e}$ : $\begin{array}{l} srewrite (x) = x \\ srewrite (new C (\overline{e})) = shadow C (\circ, srewrite (\overline{e})) \\ srewrite (e . f) = srewrite (e) . f \\ srewrite (e . m ({\overline{e}}^{'})) = srewrite (e) . m (srewrite ({\overline{e}}^{'})) \\ srewrite (C . m (e)) = C . m (srewrite (e)) \\ srewrite (Op (\overline{e})) = Op (srewrite (\overline{e})) \\ srewrite (if e_{1} then e_{2} else e_{3}) = if srewrite (e_{1}) then srewrite (e_{2}) else srewrite (e_{3}) \\ srewrite (ν) = δ \end{array}$
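The eight cases of $srewrite$ can be read as a structural recursion over source expressions. The following Java sketch is one possible rendering under simplifying assumptions: shadow terms are produced as plain strings, the label ∘ is printed as `high`, and the dummy value δ is printed as `delta`. The class `ShadowRewriteSketch` and its constructor helpers (`variable`, `prim`, `newObj`, `field`, `call`, `frame`, `op`, `ite`) are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of srewrite; shadow terms are rendered as strings for brevity.
final class ShadowRewriteSketch {
    /** A source expression that knows how to produce its shadow. */
    interface Expr { String srewrite(); }

    /** srewrite mapped over a vector of expressions, joined with commas. */
    static String mapRewrite(Expr... es) {
        List<String> out = new ArrayList<>();
        for (Expr e : Arrays.asList(es)) out.add(e.srewrite());
        return String.join(", ", out);
    }

    // srewrite(x) = x
    static Expr variable(String x) { return () -> x; }

    // srewrite(nu) = delta: primitive values are erased
    static Expr prim(Object nu) { return () -> "delta"; }

    // srewrite(new C(es)) = shadow C(high, srewrite(es)): method bodies get untainted tags
    static Expr newObj(String c, Expr... es) {
        return () -> "shadow " + c + "(high" + (es.length == 0 ? "" : ", " + mapRewrite(es)) + ")";
    }

    // srewrite(e.f) = srewrite(e).f
    static Expr field(Expr e, String f) { return () -> e.srewrite() + "." + f; }

    // srewrite(e.m(es)) = srewrite(e).m(srewrite(es))
    static Expr call(Expr e, String m, Expr... es) {
        return () -> e.srewrite() + "." + m + "(" + mapRewrite(es) + ")";
    }

    // srewrite(C.m(e)) = C.m(srewrite(e)): the activation-frame form
    static Expr frame(String c, String m, Expr e) {
        return () -> c + "." + m + "(" + e.srewrite() + ")";
    }

    // srewrite(Op(es)) = Op(srewrite(es))
    static Expr op(String name, Expr... es) { return () -> name + "(" + mapRewrite(es) + ")"; }

    // srewrite(if e1 then e2 else e3) rewrites each branch
    static Expr ite(Expr g, Expr t, Expr f) {
        return () -> "if " + g.srewrite() + " then " + t.srewrite() + " else " + f.srewrite();
    }
}
```

For instance, applying the sketch to the body of the string concatenation method, built as `newObj("String", op("@", field(variable("this"), "val"), field(variable("x"), "val")))`, yields `shadow String(high, @(this.val, x.val))`.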

The predicate $Match$ defined in Fig. 5 allows deconstruction of a shadow expression $se$ into its constituent shadow context $SE$ and shadow expression ${se}^{'}$ in the hole; that is, if $Match (se, SE, {se}^{'})$ then $se = SE [{se}^{'}]$ .

Fig. 5.
$Match$ predicate definition.

Fig. 6.
$Shadow$ predicate definition.

Next, in Fig. 6, we define a predicate $Shadow (n, se)$ where $se$ is the relevant shadow expression at execution step n, establishing an ordering for the shadow trace. $Shadow$ has as its precondition a “current” shadow expression, and as its postcondition the shadow expression for the next step of evaluation (with the exception of the rule for shadowing $Op$ s on primitive values, which reflects the “immediate” valuation due to the definition of ≈; note the timestamp is not incremented in the postcondition in that case). We set the shadow of the initial configuration at timestamp 1, and then $Shadow$ inductively shadows the full trace. $Shadow$ is defined by case analysis on the structure of the shadow expression in the hole. The shadow expression in the hole and the shadow evaluation context are derived from the $Match$ predicate definition.³

³ Some notational liberties are taken in Fig. 6 regarding expression and context substitutions, which are defined using predicates elided for brevity.

With respect to control flow, the most notable rules of $Shadow$ are those governing conditional branching, which ignore the taint of the guard, and method dispatch, which ignore the taint of the object associated with the dispatched method. Since we focus on base value taint, method dispatch is essentially a non-issue; conditional branching, however, depends directly on base values, so ignoring the taint of the guard explicitly ignores indirect data flow.

Taint propagation and endorsement. The propagation of taint in the model described in Section 1.2 is embedded in the definition of $Shadow$ , which in particular assumes a set of $Sanitizers$ . For elements of $Sanitizers$ , if the input is tainted then the result is considered to be only partially endorsed (maybe tainted). For library methods, taint is propagated given a user-defined predicate $Prop (t, ι)$ where ι is a compound term of the form $C . m (\overline{t})$ with $\overline{t}$ the given integrity of $this$ followed by the integrity of the arguments to method $C . m$ , and t is the integrity of the result. For example, one could define: $\begin{array}{l} meet (t, t_{1}, t_{2}) \Rightarrow Prop (t, String . concat (t_{1}, t_{2})) \\ meet (t, t_{1}, t_{2}) \Rightarrow Prop (t, String . eq (t_{1}, t_{2})) \end{array}$ Later in Section 5.2.1 we will discuss formal semantic conditions on library methods that ensure sound taint propagation.
3.2.3. In-depth integrity taint analysis policies

Now we can logically specify an in-depth policy for integrity taint analysis, as proposed originally in Section 1.2. In particular we assume a set $Sanitizers$ and a set $SSOs$ . Since objects may inherit a sanitizer or SSO from a superclass, we require that $Sanitizers$ and $SSOs$ are closed under inheritance as a sanity condition, as follows: $\begin{array}{l} \frac{CT (C) = class C extends D \{\overline{C} \overline{f}; K \overline{M}\} \qquad m \notin \overline{M} \qquad D . m \in SSOs}{C . m \in SSOs} \\ \frac{CT (C) = class C extends D \{\overline{C} \overline{f}; K \overline{M}\} \qquad m \notin \overline{M} \qquad D . m \in Sanitizers}{C . m \in Sanitizers} \end{array}$
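The closure condition can be computed as a least fixpoint over the class table. The Java sketch below is a minimal illustration under stated assumptions: a class table is modeled as a map from class names to a hypothetical `ClassDecl` record of the superclass name and the methods declared in the class itself, and members of $SSOs$ or $Sanitizers$ are written as strings of the form `C.m`. The class names in `demo` other than `Sec` are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: closing a set of "C.m" entries under inheritance.
final class InheritanceClosureSketch {
    /** Simplified class-table entry: superclass name and methods declared (not inherited). */
    static final class ClassDecl {
        final String superName;
        final Set<String> declared;
        ClassDecl(String superName, String... declared) {
            this.superName = superName;
            this.declared = new HashSet<>(Arrays.asList(declared));
        }
    }

    /** Least superset of seed such that C extends D, m not declared in C, D.m in set implies C.m in set. */
    static Set<String> close(Map<String, ClassDecl> ct, Set<String> seed) {
        Set<String> closed = new TreeSet<>(seed);
        boolean changed = true;
        while (changed) {                                   // iterate until no rule instance adds an entry
            changed = false;
            for (Map.Entry<String, ClassDecl> entry : ct.entrySet()) {
                ClassDecl decl = entry.getValue();
                for (String dm : new TreeSet<>(closed)) {   // snapshot, since closed is mutated below
                    int dot = dm.indexOf('.');
                    String d = dm.substring(0, dot);
                    String m = dm.substring(dot + 1);
                    if (d.equals(decl.superName) && !decl.declared.contains(m)) {
                        changed |= closed.add(entry.getKey() + "." + m);
                    }
                }
            }
        }
        return closed;
    }

    /** Example class table: two subclasses inherit secureMeth, one overrides it. */
    static Set<String> demo() {
        Map<String, ClassDecl> ct = new HashMap<>();
        ct.put("Sec", new ClassDecl("Object", "secureMeth", "sanitize"));
        ct.put("SubSec", new ClassDecl("Sec"));
        ct.put("SubSubSec", new ClassDecl("SubSec"));
        ct.put("Other", new ClassDecl("Sec", "secureMeth"));
        return close(ct, new HashSet<>(Arrays.asList("Sec.secureMeth")));
    }
}
```

In the example, `SubSec.secureMeth` and `SubSubSec.secureMeth` are added, while `Other.secureMeth` is not, because the rule applies only when the subclass does not redefine the method.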

The in-depth policy has both a prospective and a retrospective component: the former is defined as a safety property [38], while the latter is defined as a logging specification. The prospective component of the policy must identify traces where a tainted value is passed to a secure method. To this end, in Fig. 7 we define the predicate $BAD$ , which identifies traces that should be rejected as unsafe: a bad trace is any trace in which an SSO is executed with a tainted argument. The retrospective component specifies that data of questionable integrity that is passed to a secure method should be logged. The relevant logging specification is specified in terms of a predicate $MaybeBAD$ also defined in Fig. 7.

Definition 3.3.
Let X be the set of rules in Figs 5, 6, and 7 and the set of user-defined rules for $Prop$ . The prospective integrity taint analysis policy is defined as the set of traces that are either free from or end in $BAD$ configurations:⁴ $\begin{array}{l} {SP}_{taint} = \{τ κ ∣ {(⌊ τ ⌋ \otimes C (X))}^{\Rightarrow \{BAD\}} = C (\emptyset)\} . \end{array}$ The retrospective integrity taint analysis policy is the following logging specification, which uses $spec$ as introduced in Definition 2.13. This logging policy maps traces to the set of maybe tainted objects that enter SSOs. $\begin{array}{l} {LS}_{taint} = spec (X, \{MaybeBAD\}) \end{array}$

⁴ This latter condition is not necessary for the specification, and may seem extraneous, but it is in place to allow a clean proof correspondence with the implementation as detailed in Section 4.2. In the implementation, taint checks occur one execution step after an SSO is called, which in the case of an unsafe call will thus occur one step after the $BAD$ configuration, but before anything bad actually happens, which blocking checks prevent.

We immediately observe that ${SP}_{taint}$ is a safety property:
Lemma 3.1.
${SP}_{taint}$ is a safety property.

Finally we define a program as being safe iff it does not produce a bad trace.
Definition 3.4.
We call a program $p (a)$ safe iff for all τ it is the case that $p (a) ⇓ τ$ implies $τ \in {SP}_{taint}$ . We call the program unsafe iff there exists some trace τ such that $p (a) ⇓ τ$ and $τ \notin {SP}_{taint}$ .

Fig. 7.
Predicates for specifying prospective and retrospective properties.
3.2.4. Extended example: Reduction and shadowing

To illustrate the major points of our construction for source program traces and their shadows, we consider an example of a program that contains an SSO call on a string that has been constructed from a sanitized low integrity input.

Example 3.2.
Assume that sanitizer and SSO methods $Sec . sanitize$ and $Sec . secureMeth$ are identity functions for the sake of brevity, i.e.: $\begin{matrix} {mbody}_{CT} (sanitize, Sec) = x, x \qquad {mbody}_{CT} (secureMeth, Sec) = x, x \end{matrix}$ and let ${mbody}_{CT} (main, TopLevel)$ be: $\begin{array}{l} attack, \\ new Sec () . secureMeth (new Sec () . sanitize (attack . concat (new String (\text{“ world”})))) \end{array}$ Assume also that an input string “ $hello$ ” is tainted with low integrity. Fig. 8 depicts a source trace given the initial configuration: $\begin{matrix} new TopLevel () . main (new String (\text{“hello”})) \end{matrix}$ with some reduction steps elided to highlight calls to $Sec . sanitize$ and $Sec . secureMeth$ . In Fig. 9 we show shadows of the configurations depicted in the source trace. We note this trace is in ${SP}_{taint}$ and hence is safe.

Fig. 8.
Example 3.2: source trace.

Fig. 9.
Example 3.2: shadow expressions.
4. Correct instrumentation via program rewriting

Now we define an object-based dynamic integrity taint analysis in a more familiar operational style. Taint analysis instrumentation is added automatically by a program rewriting algorithm $Phos$ that models the Phosphor rewriting algorithm, defined in Section 4.1. It adds taint label fields to all objects, and operations for appropriately propagating taint along direct flow paths. In addition to blocking behavior to enforce prospective checks, we incorporate logging instrumentation to support retrospective measures in the presence of partially trusted sanitization. We illustrate computation of instrumented code via an extended example in Section 4.1.4, which continues the (now running) example introduced in Section 3.2.4.

In Section 4.2 we follow the methods developed in Section 2 and show that $Phos$ is semantics preserving, and that instrumented code generates sound and complete audit logs with respect to the logging specification ${LS}_{taint}$ defined in Section 3.2.3. We will also show that instrumented code respects the safety property ${SP}_{taint}$ defined in the latter section.

4.1. In-depth taint analysis instrumentation

The target language ${FJ}_{taint}$ of the rewriting algorithm $Phos$ has the same syntax as FJ except we add taint labels t as a form of primitive value ν, the type of which we posit as $Taint$ . For the semantics of operations on taint values we define: $\begin{array}{l} \lor (t_{1}, t_{2}) \approx t_{1} \lor t_{2} \qquad \land (t_{1}, t_{2}) \approx t_{1} \land t_{2} \end{array}$ In addition we introduce a “check” operation ? such that $? t \approx t$ iff $t > ∙$ . We also add a convenient sequencing operation of the form $e; e$ to target language expressions, and evaluation contexts of the form $E; e$ .

4.1.1. The $Phos$ algorithm

Now we define the program rewriting algorithm $Phos$ as follows. It incorporates a rewriting function μ that assigns an untainted label to every object in an FJ source program. The class table is manipulated by $Phos$ to specify a $taint$ field for all objects, a $check$ object method that blocks if the argument is tainted, and an $endorse$ method for any object returned by a sanitizer.

Definition 4.1.
For any expression $e$ , the expression $μ (e)$ is syntactically equivalent to $e$ except every subexpression $new C (\overline{e})$ is replaced with $new C (\circ, \overline{e})$ . Given $SSOs$ and $Sanitizers$ , define: $\begin{matrix} Phos (e, CT) = (μ (e), Phos (CT)) \end{matrix}$ where $Phos (CT)$ is the smallest class table satisfying the axioms given in Fig. 10. Furthermore, to correctly mark low integrity input as tainted, given class table $CT$ and top-level program $p (a)$ where $a = new C (\overline{ν})$ we define: $\begin{matrix} Phos (p (a)) = Phos (p, CT) (new C (∙, \overline{ν})) \end{matrix}$

As discussed in Section 1, sanitization is typically taken to be “ideal” for integrity flow analyses; in practice, however, sanitization is imperfect, which creates an attack vector. To support retrospective measures specified in Definition 3.3, we define $endorse$ so it takes object taint t to the join of t and ⊙. The algorithm also adds a $\log$ method call to the beginning of $SSOs$, which will log objects that are maybe tainted or worse. The semantics of $\log$ are defined directly in the operational semantics of ${FJ}_{taint}$ below.

4.1.2. Taint propagation of library methods

Another important element of taint analysis is instrumentation of library methods that propagate taint – the propagation must be made explicit to reflect the interference of arguments with results. The approach to this in taint analysis systems is often motivated by efficiency as much as correctness [25]. We assume that library methods are instrumented to propagate taint as intended (i.e. in accordance with the user defined predicate $Prop$ ).

Here is how addition, string concatenation, and equality testing can be modified to propagate taint. Note that the taint of arguments is propagated to results by taking the meet of argument taint, thus reflecting the degree of integrity corruption: $\begin{array}{l} Int plus (Int x) \\ \{return (new Int (\land (this . taint, x . taint), + (this . val, x . val)));\} \\ String concat (String x) \\ \{return (new String (\land (this . taint, x . taint), @ (this . val, x . val)));\} \\ Bool eq (String x) \\ \{return (new Bool (\land (this . taint, x . taint), eq (this . val, x . val)));\} \end{array}$
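The same propagation pattern can be written in plain Java. The following is a minimal sketch, assuming the three-point chain of labels with tainted lowest; `TaintedString` and `TaintedBool` are hypothetical wrapper classes introduced here for illustration, not classes of the $Phos$ implementation.

```java
/** Sketch of a taint-carrying string; the class and member names are illustrative. */
final class TaintedString {
    /** Assumed chain of labels, lowest integrity first. */
    enum Taint { TAINTED, MAYBE_TAINTED, UNTAINTED }

    final Taint taint;
    final String val;

    TaintedString(Taint taint, String val) {
        this.taint = taint;
        this.val = val;
    }

    /** Meet of two labels: the lower-integrity one. */
    static Taint meet(Taint a, Taint b) {
        return a.ordinal() <= b.ordinal() ? a : b;
    }

    /** Concatenation: the result carries the meet of the operand taints. */
    TaintedString concat(TaintedString x) {
        return new TaintedString(meet(this.taint, x.taint), this.val + x.val);
    }

    /** Equality test: the Boolean result carries the meet of the operand taints. */
    TaintedBool eq(TaintedString x) {
        return new TaintedBool(meet(this.taint, x.taint), this.val.equals(x.val));
    }
}

/** Sketch of a taint-carrying Boolean result. */
final class TaintedBool {
    final TaintedString.Taint taint;
    final boolean val;

    TaintedBool(TaintedString.Taint taint, boolean val) {
        this.taint = taint;
        this.val = val;
    }
}
```

For instance, concatenating an untainted `"a"` with a tainted `"b"` yields a tainted `"ab"`, since the meet picks the lower-integrity label.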

Fig. 10. Axioms for rewriting algorithm.

4.1.3. Operational semantics of ${FJ}_{taint}$

The operational semantics of ${FJ}_{taint}$ are defined in Fig. 11. Configurations in FJ are of the form $(e, L)$, and reductions are defined in terms of a labeled transition relation $\overset{α}{\to}$ on configurations, where α is a possibly empty (ϵ) sequence of security events. These events are either integrity events $iev (v)$, emitted when a check succeeds during evaluation as defined in the $CheckPassed$ rule, or endorsement events $eev (v)$, emitted when a value is endorsed as defined in the $Endorsed$ rule. Labels are needed for our formulation of explicit integrity modulo endorsement, discussed in Section 5.

Audit logs $L$ are added to configurations to support retrospective security via audit logging, and are defined as sets of objects (values). The $\log$ method is the only one that interacts with the log in any way, and its semantics are specified in the $Log$ and $NoLog$ rules, where possibly tainted values are logged, and untainted ones are not. Note that we strip taint tags from values for logging – this is mainly to simplify the correspondence with ${LS}_{taint}$ semantics for our technical development (where taint tags don’t exist). We otherwise “inherit” the reduction semantics of FJ via the $Reduce$ rule.
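The interplay of logging and checking at a secure sink can be sketched as follows. This is an approximation under stated assumptions: the class `AuditLogSketch` and its members are hypothetical, values are modeled as strings, and a blocked check (a stuck configuration in the formal semantics) is modeled by throwing an exception.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch of the Log/NoLog behavior at an instrumented secure sink (names are illustrative). */
final class AuditLogSketch {
    /** Assumed chain of labels, lowest integrity first. */
    enum Taint { TAINTED, MAYBE_TAINTED, UNTAINTED }

    /** The audit log: a set of values with their taint tags stripped. */
    final Set<String> log = new LinkedHashSet<>();

    /** Records the untagged value when it is maybe tainted or worse; otherwise does nothing. */
    void log(Taint taint, String val) {
        if (taint != Taint.UNTAINTED) {
            log.add(val);
        }
    }

    /** An instrumented sink body: the log call comes first, followed by the taint check. */
    String secureMeth(Taint taint, String val) {
        log(taint, val);
        if (taint == Taint.TAINTED) {
            throw new SecurityException("check blocked: tainted argument reached a secure sink");
        }
        return val; // identity body, standing in for the real secure operation
    }
}
```

Because the log call precedes the check, a tainted argument is recorded even though the check then blocks, while a maybe tainted argument is recorded and allowed to proceed.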

Fig. 11. Operational semantics of ${FJ}_{taint}$.

We write $κ_{0} \to_{n}^{α_{0} \dots α_{n - 1}} κ_{n}$ iff $κ_{i} \overset{α_{i}}{\to} κ_{i + 1}$ for all $0 ⩽ i < n$ , and write $κ \to_{*}^{α} κ^{'}$ iff $κ \to_{n}^{α} κ^{'}$ for some n. We may omit transition labels in cases where they are empty (ϵ) or not relevant to discussion, abusing notation →, $\to^{n}$ , and $\to^{*}$ as defined for FJ. We define traces as for FJ, and we write $e ⇓ τ$ iff τ begins with the configuration $(e, \emptyset)$ .

4.1.4. Extended example: Target trace

Fig. 12. Example 3.2: target trace.

Revisiting the example introduced in Section 3.2.4, we show execution of the rewritten program $Phos (p (a))$ in Fig. 12. By definition the rewritten top-level program is: $\begin{matrix} new TopLevel (\circ) . main (new String (∙, \text{“hello”})) \end{matrix}$ We note that additional reduction steps are necessary to evaluate instrumentation code in the target program, and that $eev$ and $iev$ events mark points during reduction when a value is endorsed and when it is checked.

4.2. Operational properties of $Phos$

Now we can leverage machinery developed previously to demonstrate in-depth operational correctness of $Phos$ , i.e. both prospective and retrospective operational correctness.

Recalling our definitions of semantics preservation, soundness, and completeness from Section 2, we state our main results as follows. These results tie together our relevant logging specification ${LS}_{taint}$ and safety property ${SP}_{taint}$ defined in Section 3.2.3. We note that in this section we will ignore transition labels α in target language reduction since they are irrelevant to the properties of interest, and will use → exclusively to refer to the reduction relation in ${FJ}_{taint}$ defined in Section 4.

Regarding our main result for retrospective security, we note that our definition is general with respect to $SSOs$ and $Sanitizers$ defined at the top-level, which fix ${LS}_{taint}$ and ${SP}_{taint}$. Soundness and completeness as defined in Section 2.5 require definition of the notation $τ ⇝ L$, which for ${FJ}_{taint}$ means that $L$ is the log in the last configuration of τ. We define $toFOL (L) = \{MaybeBAD (v) ∣ v \in L\}$, and thus $⌊ L ⌋ = C (toFOL (L))$. Also as required, we need to define the relation ≅, establishing a semantic correspondence between FJ and ${FJ}_{taint}$ traces. Intuitively, the relation holds on a pair of source and target traces if the taint shadows of configurations in the source trace match up with the structure of configurations in the target trace modulo security instrumentation. This definition along with the detailed proofs of $R$’s semantics preservation, soundness, and completeness (Theorem 4.1), and prospective correctness (Theorem 4.2) are given in the accompanying Technical Report [43].

Theorem 4.1.
For all $p (a)$ , $SSOs$ , and $Sanitizers$ , let $R (p (a), {LS}_{taint}) = Phos (p (a))$ . Then $R$ is semantics preserving, sound, and complete up to ${SP}_{taint}$ .

Since the safety property ${SP}_{taint}$ has been defined for FJ, operational correctness for prospective security means that any rewritten unsafe programs are blocked by instrumentation. We can formalize this property as follows, noting it is a consequence of semantics preservation under ≅. This occurs because bad shadows in source code correspond to values that fail security checks in the target.
Definition 4.2.
An ${FJ}_{taint}$ program $e$ causes a security failure iff $\begin{matrix} (e, \emptyset) \to^{*} (E [v . check (new C (∙, \overline{v}))], L) \end{matrix}$ for some $E$, $v$, $new C (∙, \overline{v})$, and $L$.

Operational correctness of the prospective component of $Phos$ can then be stated as follows:

Theorem 4.2.
The FJ program $p (a)$ is unsafe iff $Phos (p (a))$ causes a security failure.

5. The security property of $Phos$

The semantics of information flow has been well studied and is typically characterized via noninterference properties, but surprisingly little work has been done to develop similar properties for taint analysis. In recent years it has been shown that direct flow of data confidentiality is not comparable with noninterference [39], i.e., there are both noninterfering programs with direct leakage of secret data to public domain, and programs without such direct leakages, but interfering. For instance consider the following two statements in a core imperative language, in which s and p are respectively secret and public variables: $\begin{array}{l} if s = 0 then p : = s else p : = 0 \\ if s = 0 then p : = 1 else p : = 2 \end{array}$ The first statement is noninterfering, but direct flow of information from s to p exists, whereas the second statement is interfering due to the indirect flow from s to p, but there are no direct flows from s to p.
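The two statements can be transcribed into Java to make the distinction concrete. This is only an illustrative transcription; the class and method names are hypothetical, and the return value plays the role of the public variable p.

```java
/** The two example statements, with s as the secret input and the result as the public output p. */
final class FlowExamples {
    /** Noninterfering: the result is 0 for every s, yet the assignment p = s is a direct flow. */
    static int first(int s) {
        int p;
        if (s == 0) {
            p = s; // direct (explicit) flow from s to p
        } else {
            p = 0;
        }
        return p;
    }

    /** Interfering: the result reveals whether s is 0, but only through the branch taken. */
    static int second(int s) {
        int p;
        if (s == 0) {
            p = 1; // constant assignment, no direct flow
        } else {
            p = 2; // constant assignment, no direct flow
        }
        return p;
    }
}
```

A direct-flow analysis flags `first` and accepts `second`, whereas noninterference does the opposite, which is why the two notions are incomparable.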

Formal definitions of taint analysis implementations do exist, but they are usually operational in nature. For example, in Section 4.2, we have established an operational correctness result for the prospective enforcement of direct integrity flow. In this section, we propose a hyperproperty to characterize the security property enforced by integrity taint analysis techniques. This hyperproperty is defined in a general, language-agnostic way, though in this section we also show that the instrumentation of ${FJ}_{taint}$ programs by $Phos$ enjoys this property as a correctness condition. We illustrate key points in Section 5.3.

5.1. Direct integrity flow semantics: Explicit integrity

We define explicit integrity as a semantic hyperproperty that builds on (dualizes) the notions of explicit secrecy [39] and attacker power [5]. Similar to explicit secrecy, explicit integrity is language-agnostic. In later sections, we discuss instantiation of this model for ${FJ}_{taint}$ .

Intuitively, a program enjoys explicit secrecy if execution of its state transformation components does not affect the knowledge of a low confidentiality user. By formally specifying state transformation components, control flow operations (such as conditional expressions) can be omitted to only consider direct aka explicit program flows. Knowledge [6] is defined as the set of initial states configurable by a low confidentiality user that generate a particular sequence of observables – the smaller the set, the greater the knowledge. Explicit knowledge [39] restricts this concept to direct program data flow. In this section, we demonstrate how explicit knowledge can be “dualized” for direct integrity flow analysis and applied as a semantic framework for dynamic integrity taint analysis tools, particularly in functional languages with hierarchical data structures ( ${FJ}_{taint}$ ).

Attacker power [5] is introduced as a counterpart to attacker knowledge in the context of integrity, as the set of low integrity inputs that generate the same sequence of high integrity events. Each high integrity event could be a simple assignment to a predefined high integrity variable, a method that manipulates trusted data (secure sinks), etc. according to the language model. The more refined the attacker power is, the more powerful the low integrity attacker becomes, as she becomes more capable of distinguishing between the effects of different attacks on high integrity data.

We define explicit attacker power as the attacker power constrained on direct integrity flows. Then, explicit integrity is defined as the property of preserving explicit attacker power during program execution. In order to limit flows to direct ones, we have followed the techniques introduced in [39] to define state transformers. State transformers extract direct flows semantically by specifying the ways in which program state is modified in each step of execution, along with direct-flow events that are generated.

5.1.1. Model specification

We formulate our explicit integrity semantics following [39]. We first define the interface for our framework. Let K be the set of program configurations for a given object language where κ ranges over configurations. Configurations consist of control and state segments. Control refers to code and state refers to data. Let C be the set of controls with c ranging over the elements of C. Moreover, let S denote the set of states and s represent a given state. We also define a set of high integrity events, E. A high integrity event e may refer to different computations in different language models and settings. For example, it could be as simple as assigning low integrity data to a high integrity variable, or invoking a method with low integrity data as its parameter to store that parameter in a database. We let α range over elements of $E^{*}$ . We assume the existence of the evaluation relation $\to \subseteq K \times E^{*} \times K$ where $(κ, α, κ^{'}) \in \to$ is denoted as $κ \overset{α}{\to} κ^{'}$ . We use $κ \to κ^{'}$ if α is empty (ϵ) or could be elided in the discussion. Notation $\to^{*}$ is used for reflexive and transitive closure of →.

Each configuration is considered to include two segments: control (code) and state (data). These segments are not necessarily disjoint and could overlap in some language models. In this regard, let mappings $state : K \to S$ and $com : K \to C$ extract the state and control segments of configurations, and $⟨ \cdot, \cdot ⟩ : C \times S \to K$ construct a configuration from its control and state segments. These mappings need to satisfy the following property, for any κ: $\begin{matrix} ⟨ com (κ), state (κ) ⟩ = κ . \end{matrix}$

We assume the existence of an entry point $[\cdot]$ in the controls denoted by $c [\cdot]$ by which the attacker can inject low integrity input. The attacker input is denoted by a. Then $c [a]$ represents a control in which the attacker has injected input a. Note that an attack a is a data piece itself, i.e., a is a value.

We define extracted state transformers as follows. A consideration of state transformation, rather than complete program execution, allows us to focus only on direct program flow, rather than indirect control flow e.g. via conditionals. State transformers play the same role that explicit flow statements do in Weak Secrecy [47]. We note that this definition is a slight refinement of the analogous definition in [39] – in their work, a command is assumed to be compatible with all states, whereas we require compatibility of commands and states. This refinement is necessary due to structured expressions in HLLs such as Java, vs. lower level languages. However, we add a completeness condition expressed in Definition 5.3 that ensures we can compare all trust equivalent states via state transformation functions.

Definition 5.1.
Let $κ \to κ^{'}$ and $com (κ) = c$ for some c. $f : S \to S \times E^{*}$ is the function where $f (s) = (state (κ^{″}), α)$ for all s such that $⟨ c, s ⟩$ is defined and for the unique $κ^{″}$ and α such that $⟨ c, s ⟩ \overset{α}{\to} κ^{″}$. We write $κ \to_{f} κ^{'}$ to associate the state transformer f with the reduction $κ \to κ^{'}$. This definition is then extended to multiple evaluation steps by composing state transformers at each step. Let $f (s) = (s^{'}, α)$ and $g (s^{'}) = (s^{″}, α^{'})$. Then, $(g * f) (s) = (s^{″}, α α^{'})$.
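The composition of state transformers in Definition 5.1 can be sketched in Java. The sketch assumes, for simplicity, that transformers are total functions (in the definition they are partial, being defined only on states compatible with the command); `Transformers`, `Result`, and `compose` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch of state transformers that return a new state together with emitted events. */
final class Transformers {
    /** The pair (state, events) produced by one application of a transformer. */
    static final class Result<S, E> {
        final S state;
        final List<E> events;
        Result(S state, List<E> events) {
            this.state = state;
            this.events = events;
        }
    }

    /** Composition: apply f first, then g to the resulting state, concatenating events in order. */
    static <S, E> Function<S, Result<S, E>> compose(
            Function<S, Result<S, E>> g, Function<S, Result<S, E>> f) {
        return s -> {
            Result<S, E> first = f.apply(s);
            Result<S, E> second = g.apply(first.state);
            List<E> events = new ArrayList<>(first.events);
            events.addAll(second.events);
            return new Result<>(second.state, events);
        };
    }
}
```

The events of the earlier step come first in the combined sequence, matching the order α α′ in the definition.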

We now define the power an attacker obtains by observing high integrity events. We capture this by defining a set of high integrity equivalent states that generate the same sequence of high integrity events. We posit the binary relation $=_{\circ}$ on S to denote high integrity equivalent (or trust equivalent) states. The general sense of this relation is that $s =_{\circ} s^{'}$ if s and $s^{'}$ agree on high integrity data. The instantiation of the relation depends on the language model in which the states are defined. For a state s and some state transformer f, the state $s^{'}$ is considered as an element of the explicit attacker power if $s =_{\circ} s^{'}$ and $s^{'}$ agrees with s on the generated high integrity events.
Definition 5.2.
We define explicit attacker power with respect to state s and state transformer f as follows, where projection on the ith element of a tuple is denoted by $π_{i}$. $\begin{array}{l} p_{e} (s, f) = \{s^{'} ∣ s =_{\circ} s^{'}, π_{2} (f (s)) = π_{2} (f (s^{'}))\} . \end{array}$
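Over a finite set of candidate states, explicit attacker power can be computed by direct enumeration. The following Java sketch assumes such a finite universe is given; `AttackerPower`, `power`, and the parameter names are hypothetical, and `eventsOf` stands for the second projection of the state transformer.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Sketch: explicit attacker power over a finite universe of states (names are illustrative). */
final class AttackerPower {
    /**
     * Returns the states in the universe that are trust equivalent to s and on which
     * the transformer emits the same event sequence as it does on s.
     */
    static <S, E> Set<S> power(
            S s,
            Function<S, List<E>> eventsOf,
            BiPredicate<S, S> trustEq,
            Collection<S> universe) {
        List<E> observed = eventsOf.apply(s);
        Set<S> result = new LinkedHashSet<>();
        for (S candidate : universe) {
            if (trustEq.test(s, candidate) && observed.equals(eventsOf.apply(candidate))) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

If states are pairs of a high integrity and a low integrity component, a transformer whose events depend only on the high component leaves the set as large as the trust equivalence class, while one whose events expose the low component shrinks it, i.e. refines the attacker power.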

All state transformers must be complete in the following sense for this definition to be coherent:
Definition 5.3.
A state transformer f is complete iff for all $s_{1}$ , $s_{2}$ where $s_{1} =_{\circ} s_{2}$ we have $f (s_{1})$ is defined iff $f (s_{2})$ is defined.

A control then satisfies explicit integrity for some state iff no state can be excluded from observing the high integrity events generated by the extracted state transformer.
Definition 5.4.
A control c satisfies explicit integrity for state s iff $⟨ c, s ⟩ \to_{f}^{*} κ^{'}$ implies that for any $s^{'}$ and $s^{″}$, if $s^{'} =_{\circ} s^{″}$ then we have $s^{″} \in p_{e} (s^{'}, f)$. A control c satisfies explicit integrity iff for any s, c satisfies explicit integrity for s.

We can now consider explicit integrity in the presence of endorsement in the style of gradual release [6]. We assume that there exists a set of integrity events $E_{e n} \subseteq E$ that are generated when endorsements occur. Explicit attacker power is only allowed to change for such events.
Definition 5.5.
A control c satisfies explicit integrity modulo endorsement for state s iff $⟨ c, s ⟩ \to_{f}^{*} κ^{'} {\overset{α}{\to}}_{g}^{*} κ^{″}$ and $α \notin {E_{e n}}^{*}$ imply that $p_{e} (s, f) = p_{e} (s, g * f)$.

5.2. An instantiation with ${FJ}_{taint}$

In this section, we instantiate explicit integrity for ${FJ}_{taint}$ . Because audit logging and retrospective features are irrelevant to the technical development in this section, we omit them and elide ${FJ}_{taint}$ configurations to just expressions $e$ , and take $\log$ to be the identity function.

First, we define the required interface specified in Section 5.1, beginning with the definition of extracted state transformers for all features. These are extracted from the definition of → – notably, the extracted state transformers for conditional expressions inline conditional branching, disregarding the actual T or F value of the guard, and eliminating the effects of indirect flow from state transformation functions.

Definition 5.6.
The state transformers for ${FJ}_{taint}$ are composed of commands of the form ${select}_{f}$ for all fields f (selection), ${call}_{C . m}$ for all class, method pairs $C . m$ (method dispatch), $return$ (method return), $endorse$ (endorsement), $check$ (successful taint check within an SSO), $sequence$ (sequencing), and ${if}_{T}$ and ${if}_{F}$ (branch inlining). The behavior of these fundamental extracted state transformers is defined in Fig. 13.

Our treatment of library methods, $check$, and $endorse$ bears discussion since it considers these in an atomic, “big step” manner. As noted in Section 3.2.2, when taint is propagated by library methods, for efficiency or implementation convenience it may be the case that taint propagation is not correctly applied until computed results are returned. This includes $check$ and $endorse$, since technically these are library methods as per the definition in Section 3.1.6. Thus we specify that the extracted state transformer of any library method treats it atomically with respect to internal computations. In addition to $check$ and $endorse$, for library methods where no security related events can occur we define a class of state transformers ${call}_{C . m}$ for $C . m \in LibMeths$. This definition will also significantly simplify our proofs, and is irrelevant from a formal perspective since this definition yields the same observable events that a strict “small-step” definition of state transformers would for a given top-level program in the image of $Phos$.

Fig. 13.
Fundamental state transformers extracted from $\overset{α}{\to}$ .

Fig. 14.
Definition of $com$ for ${FJ}_{taint}$ .

Next, we define $com$ , $state$ , and $⟨ \cdot, \cdot ⟩$ for ${FJ}_{taint}$ . The command associated with a particular configuration $e$ can be determined from its redex. We take the state of a configuration $e$ to just be $e$ itself, and combining a command and a state to obtain a configuration requires that the given command matches the form of the redex in the state – i.e. compatibility of the command and the state.
Definition 5.7.
We define $state (e) = e$ and define $com$ as in Fig. 14. We define $⟨ \cdot, \cdot ⟩$ as in Fig. 15.

These definitions clearly satisfy the model requirements.
Lemma 5.1.
For any ${FJ}_{taint}$ configuration κ, we have $⟨ com (κ), state (κ) ⟩ = κ$ .

Trust equivalence and state transformation. Now we define trust equivalence $=_{\circ}$ on ${FJ}_{taint}$ expressions as required. This definition requires structural conformance of related states (expressions), and requires agreement of base values except in the case of tainted base objects. Aside from satisfying the model definition, the definition of trust equivalence will be crucial in our proof of explicit integrity modulo endorsement, as it defines the necessary inductive invariant on extracted state transformations for this result.

Also, since endorsement may allow trust equivalent states to transform into states that are not structurally equivalent, to satisfy the completeness requirement of Definition 5.3 we need to show that transformation preserves a weaker structural conformance relation $=_{*}$ on states (expressions). These relations are very similar, with $=_{*}$ strictly weaker than $=_{\circ}$, and in proofs we will generally consider them together. Hence we define the metavariable $=_{⊛}$ to range over $=_{*}$ and $=_{\circ}$.
Definition 5.8.
The trust equivalence $=_{\circ}$ and shape conformance $=_{*}$ relations on expressions are defined as the least relations inductively satisfying the rules in Fig. 16, where $=_{⊛}$ is a metavariable that ranges over $=_{\circ}$ and $=_{*}$.

Fig. 15.
Definition of $⟨ \cdot, \cdot ⟩$ for ${FJ}_{taint}$ .

Fig. 16.
Definition of trust equivalence and shape conformance relations on expressions.

5.2.1. Sanity conditions on library methods

We define two sanity conditions for library methods: not undertainting and not overtainting. The former condition is required in the implementation in order to meet explicit integrity modulo endorsement, whereas the latter is a good practice in the implementation of taint analysis tools. Hereafter we will assume that library methods are not undertainting.

Definition 5.9.
We say $C . m \in LibMeths$ is not undertainting iff for all ${\overline{v}}_{1}$, ${\overline{u}}_{1}$, ${\overline{v}}_{2}$, ${\overline{u}}_{2}$ where: $\begin{array}{l} {\overline{v}}_{1}, {\overline{u}}_{1} =_{\circ} {\overline{v}}_{2}, {\overline{u}}_{2} \\ {call}_{C . m} (new C ({\overline{v}}_{1}) . m ({\overline{u}}_{1})) = (v_{1}, ϵ) \\ {call}_{C . m} (new C ({\overline{v}}_{2}) . m ({\overline{u}}_{2})) = (v_{2}, ϵ) \end{array}$ we have $v_{1} =_{\circ} v_{2}$.

For example, $String . concat$ is not undertainting if the taint propagation policy is defined as in Section 3.2.2 where the taint of a concatenated string is the meet of its operands’ taints, but it would be undertainting if, e.g., its results were always untainted.
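The contrast between the two concatenation policies can be checked on a concrete pair of inputs. The sketch below rests on an assumed reading of trust equivalence for base objects (equal labels, and equal values unless the label is tainted), since the rules of Fig. 16 are not reproduced here; all names are hypothetical.

```java
/** Sketch contrasting a propagating concat with an undertainting one (names are illustrative). */
final class UndertaintingSketch {
    /** Assumed chain of labels, lowest integrity first. */
    enum Taint { TAINTED, MAYBE_TAINTED, UNTAINTED }

    static final class TStr {
        final Taint taint;
        final String val;
        TStr(Taint taint, String val) {
            this.taint = taint;
            this.val = val;
        }
    }

    /** Assumed trust equivalence on base objects: same label, and same value unless tainted. */
    static boolean trustEq(TStr a, TStr b) {
        return a.taint == b.taint && (a.taint == Taint.TAINTED || a.val.equals(b.val));
    }

    static Taint meet(Taint a, Taint b) {
        return a.ordinal() <= b.ordinal() ? a : b;
    }

    /** Propagating policy: the result carries the meet of the operand taints. */
    static TStr concat(TStr a, TStr b) {
        return new TStr(meet(a.taint, b.taint), a.val + b.val);
    }

    /** Deliberately broken policy: the result is always untainted. */
    static TStr concatUntainted(TStr a, TStr b) {
        return new TStr(Taint.UNTAINTED, a.val + b.val);
    }
}
```

Taking two tainted strings with different values (trust equivalent under the assumed reading) and the same untainted suffix, `concat` returns tainted results that remain trust equivalent, whereas `concatUntainted` returns untainted results with different values, which are not.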

Not overtainting refines the precision of taint tracking with respect to a given state. Intuitively, a library method that only directly depends on its high integrity inputs is not overtainting if its results are untainted.
Definition 5.10.
We say $C . m \in LibMeths$ is not overtainting with respect to input ${\overline{v}}_{1}$, ${\overline{u}}_{1}$ iff for all ${\overline{v}}_{2}$, ${\overline{u}}_{2}$ where: $\begin{array}{l} {\overline{v}}_{1}, {\overline{u}}_{1} =_{\circ} {\overline{v}}_{2}, {\overline{u}}_{2} \\ {call}_{C . m} (new C ({\overline{v}}_{1}) . m ({\overline{u}}_{1})) = (v_{1}, ϵ) \\ {call}_{C . m} (new C ({\overline{v}}_{2}) . m ({\overline{u}}_{2})) = (v_{2}, ϵ) \end{array}$ if $v_{1} = v_{2}$ then $v_{1} = new D (\circ, \overline{v})$ for some $\overline{v}$.

5.3. Extended example

Assume we are given a class table $CT$ containing sanitizer and SSO methods $Sec . sanitize$ and $Sec . secureMeth$, which are identity functions for the sake of brevity, i.e.: $\begin{array}{l} {mbody}_{CT} (Sec, sanitize) = x, x \\ {mbody}_{CT} (Sec, secureMeth) = x, x \end{array}$ and let ${mbody}_{CT} (main, TopLevel) = attack, e$ where $e$ is: $\begin{array}{l} if attack . eq (new String (\text{“foo”})) then \\ new Sec () . secureMeth (attack) \\ else \\ new Sec () . secureMeth (new Sec () . sanitize (attack)) \end{array}$ Note that this is an example of a program that is unsafe by our definition, since a tainted value can flow directly into an SSO, though it is noninterfering modulo endorsement since that value can only be $new String (\text{“foo”})$ (similar to the example at the beginning of this section). However, $Phos$ will place a check that will ensure blocking of unsafe executions. We note that the execution of $Phos (p (new String (\text{“foo”})))$ up to the point it gets stuck within $Sec . check$ is associated with the following state transformer f: $\begin{matrix} f = sequence * return * {call}_{Sec . \log} * {call}_{Sec . secureMeth} * {if}_{T} * {call}_{String . eq} \end{matrix}$ Observe that $π_{2} (f (Phos (p (a)))) = ϵ$ for any a, trivially satisfying the requirements of explicit integrity modulo endorsement. Crucially, note that a need not be the string “$foo$” in order for $f (Phos (p (a)))$ to be defined – even though the program $Phos (p (a))$ would not take the T branch through the conditional during actual execution, it is “forced” that way by f. This is central to the definition of explicit attacker power with respect to $Phos (p (new String (\text{“foo”})))$ and f.

In contrast, the state transformer associated with the actual execution of $Phos (p (new String (s)))$ for $s \neq \text{“foo”}$ up to the point it gets endorsed by $String . endorse$ within $Sec . sanitize$ is: $\begin{matrix} g = endorse * {call}_{Sec . sanitize} * {if}_{F} * {call}_{String . eq} \end{matrix}$ We note that: $\begin{matrix} π_{2} (g (Phos (p (new String (s^{'}))))) = eev (new String (s^{'})) \end{matrix}$ for all $s^{'}$. Furthermore, continued execution of $Phos (p (new String (s)))$ is associated with the following function h which takes the program through the successful check of the sanitized object: $\begin{matrix} h = check * sequence * return * {call}_{Sec . \log} * {call}_{Sec . secureMeth} \end{matrix}$ We note that: $\begin{matrix} π_{2} ((h * g) (Phos (p (new String (s^{'}))))) = eev (new String (s^{'})), iev (new String (s^{'})) \end{matrix}$ for all $s^{'}$. Finally, we observe that: $\begin{matrix} p_{e} (Phos (p (a)), g) = p_{e} (Phos (p (a)), h * g) = \{Phos (p (a))\} \end{matrix}$ for all a, satisfying the requirements of explicit integrity modulo endorsement.

Since f and $h * g$ represent all possible control flow paths through the program that can generate events, it is evident that $Phos (p (a))$ satisfies explicit integrity modulo endorsement for all a.

5.4. Enforcement of explicit integrity modulo endorsement by $Phos$

Our general strategy is to show that non-endorsement events do not change attacker power as required by the definition of explicit integrity modulo endorsement. The complete proof details are given in our Technical Report [43].

Theorem 5.1.
If $e$ is in the image of $Phos$ , then it enjoys explicit integrity modulo endorsement.
Proof.
(Sketch) Proof by contradiction. If $e$ does not enjoy explicit integrity modulo endorsement, then explicit attacker power is refined by a non-endorsement event, i.e., different integrity events can be generated starting from trust equivalent states. This contradicts an intermediate result establishing that state transformers applied to trust equivalent states preserve integrity events. □

6. An implementation of $Phos$ in OpenMRS

In Section 1.1 we discussed an XSS vulnerability in the OpenMRS system (corrected in the current version) that inspired our interest in an in-depth taint analysis to better track data flow into secure operations and to enforce some level of sanitization. To explore and evaluate our proposed methods in practice, we have developed an automated analysis for OpenMRS by direct modification of the Phosphor system [11]. Our modification supports dynamic integrity taint analysis both prospectively and retrospectively. Our implementation is based on the formal model developed in previous sections, which enjoys a correctness guarantee. In this section we describe our implementation and our evaluation of it.

6.1. Modifications to Phosphor

Out of the box, Phosphor provides a binary taint labeling scheme, with no support for endorsement. Users specify their security policy by identifying high integrity sinks, which are then automatically instrumented at the bytecode level with checks for low integrity inputs, by a combination of program rewriting and runtime mechanisms. Thus, to implement our in-depth taint analysis specification we needed to generalize the taint labeling scheme, add an endorsement mechanism, and add support for audit logging to the existing Phosphor codebase. This yielded our $Phos$ implementation, as distinct from Phosphor.

Phosphor distinguishes only between two types of data – tainted and untainted. To support a generalized labeling scheme, in $Phos$ we added to the Phosphor $Taint$ class definition a field containing a $TaintLevel$ enumeration. This latter type is endowed with a partial ordering that is specified by the programmer via an underlying graph definition, and join and meet operations. In our implementation we support the taint label lattice defined in Section 3.2 but this could be easily changed to accommodate others. We also define an $endorse$ operation that takes the join of the input taint label and $MaybeTainted$ as in this paper. Since Phosphor itself adds a $Taint$ object to all program objects, these modifications are propagated through the system by the existing codebase.
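A graph-defined ordering of taint levels with a derived join and endorsement might look like the sketch below. It is not the Phosphor or $Phos$ API: `TaintLevelOrder`, `addEdge`, `leq`, `join`, `endorse`, and `chain` are hypothetical names, and the code assumes the graph is acyclic and describes a lattice.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Sketch of a taint level ordering specified by a graph (hypothetical, not the Phosphor API). */
final class TaintLevelOrder {
    enum TaintLevel { TAINTED, MAYBE_TAINTED, UNTAINTED }

    /** Edges point from a level to the levels directly above it. */
    private final Map<TaintLevel, Set<TaintLevel>> above = new EnumMap<>(TaintLevel.class);

    TaintLevelOrder() {
        for (TaintLevel l : TaintLevel.values()) {
            above.put(l, EnumSet.noneOf(TaintLevel.class));
        }
    }

    void addEdge(TaintLevel lower, TaintLevel higher) {
        above.get(lower).add(higher);
    }

    /** a is below or equal to b iff b is reachable from a (the graph is assumed acyclic). */
    boolean leq(TaintLevel a, TaintLevel b) {
        if (a == b) return true;
        for (TaintLevel next : above.get(a)) {
            if (leq(next, b)) return true;
        }
        return false;
    }

    /** Join: the least common upper bound, assuming the graph describes a lattice. */
    TaintLevel join(TaintLevel a, TaintLevel b) {
        TaintLevel best = null;
        for (TaintLevel c : TaintLevel.values()) {
            if (leq(a, c) && leq(b, c) && (best == null || leq(c, best))) {
                best = c;
            }
        }
        return best;
    }

    /** Endorsement: the join of the input level and MAYBE_TAINTED. */
    TaintLevel endorse(TaintLevel t) {
        return join(t, TaintLevel.MAYBE_TAINTED);
    }

    /** The three-level chain used as the running example. */
    static TaintLevelOrder chain() {
        TaintLevelOrder order = new TaintLevelOrder();
        order.addEdge(TaintLevel.TAINTED, TaintLevel.MAYBE_TAINTED);
        order.addEdge(TaintLevel.MAYBE_TAINTED, TaintLevel.UNTAINTED);
        return order;
    }
}
```

Under the chain ordering, endorsing a tainted value raises it to maybe tainted, while endorsing an untainted value leaves it untainted, so endorsement never lowers integrity.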

As for ordinary Phosphor, in $Phos$ we allow specification of secure sinks; however, the rewriting algorithm also adds instrumentation for audit logging of values at or below a specified taint level ($MaybeTainted$ in our case) that reach any sink. The following information is logged in such a case: the function name of any sink that had a tainted variable pass through it, the taint level of any sunken tainted variable, the value of sunken variables, and a stack trace of the thread when a tainted variable was sunk. Much of this information was already being collected in the unmodified Phosphor.
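The four logged items can be pictured as a small record type. The following Java sketch is illustrative only: `SinkAudit`, `Entry`, and `onSink` are hypothetical names rather than parts of the $Phos$ codebase, and the stack trace is obtained with the standard `Thread.currentThread().getStackTrace()` call.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of sink audit logging with a configurable threshold (hypothetical names). */
final class SinkAudit {
    /** Assumed chain of levels, lowest integrity first. */
    enum TaintLevel { TAINTED, MAYBE_TAINTED, UNTAINTED }

    /** One log entry: sink name, taint level, value, and the thread's stack trace. */
    static final class Entry {
        final String sinkName;
        final TaintLevel level;
        final String value;
        final StackTraceElement[] stack;
        Entry(String sinkName, TaintLevel level, String value, StackTraceElement[] stack) {
            this.sinkName = sinkName;
            this.level = level;
            this.value = value;
            this.stack = stack;
        }
    }

    private final TaintLevel threshold;
    private final List<Entry> entries = new ArrayList<>();

    SinkAudit(TaintLevel threshold) {
        this.threshold = threshold;
    }

    /** Logs the value when its level is at or below the threshold; returns whether it was logged. */
    boolean onSink(String sinkName, TaintLevel level, Object value) {
        if (level.ordinal() > threshold.ordinal()) {
            return false;
        }
        entries.add(new Entry(sinkName, level, String.valueOf(value),
                Thread.currentThread().getStackTrace()));
        return true;
    }

    List<Entry> entries() {
        return entries;
    }
}
```

With the threshold set to `MAYBE_TAINTED`, both tainted and maybe tainted values reaching a sink produce an entry, and untainted values do not.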

We also allow specification of a set of sanitizers in the same manner as secure sink specification, i.e. specific methods are identified in an initial configuration file provided when rewriting a program. The return values of these methods are endorsed by an inserted call to $endorse$. Thus the end product functions the same as the system specified in Section 4: a set of $SSOs$ and $Sanitizers$ is provided as input, along with a program for instrumentation, and the program is rewritten with instrumentation to support ${SP}_{taint}$ and ${LS}_{taint}$. Our $Phos$ implementation also supports a specification of low integrity sources at arbitrary taint levels. The implementation is available on a public GitHub repository [44].

Fig. 17. $Phos$ instrumentation timing overhead for OpenMRS for actions (left) and page loads (right). Numbers on the x-axis identify particular actions and page loads, and the y-axis is completion time in seconds.

6.2. OpenMRS sources, sinks, and sanitizers

To apply $Phos$ to OpenMRS, it is necessary to identify sources, sinks, and sanitizers in the system. Since our concern is mainly defense against injection and XSS attacks, we focused on database interactions. OpenMRS in its current form uses the popular Hibernate ORM framework as a database API, which supports two ways of interacting with databases – via persistent relationally mapped object saving/loading, and via queries. We limited the scope of our work to focus on queries based on data in memory rather than persistent data, since the latter would require persistence of taint information and hence a far more complex implementation task.

The lists of sources, sinks, and sanitizers we identified in OpenMRS are provided in our implementation on GitHub [44]. Our method for identifying sinks and sanitizers was to leverage our knowledge of the Hibernate API. Specifically, to identify sinks, we searched the OpenMRS codebase for methods that employ Hibernate database write functionality. To identify sanitizers, we searched the OpenMRS codebase for methods that employ Hibernate sanitization functionality. The list of sources was determined by searching for methods that use $javax . servlet$ functionality for recovering data from $POST$ requests.

Another subtle but important detail of our integration of $Phos$ with OpenMRS is that in OpenMRS, the arguments for the sinks are not necessarily tainted themselves, but rather are objects containing tainted member variables. Therefore, we also modified Phosphor to check not only sink arguments for taint, but also their member variables.
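One way to picture a check that also inspects member variables is a one-level reflective scan of the sink argument. The sketch below is only an approximation of the idea using standard Java reflection; it does not reflect how Phosphor stores or checks taint, and `MemberTaintCheck`, `Tagged`, `Form`, and `reachesTaint` are hypothetical names.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

/** Sketch: checking a sink argument and its directly declared fields for taint. */
final class MemberTaintCheck {
    /** Hypothetical tagged value; real tools attach taint to data differently. */
    static final class Tagged {
        final boolean tainted;
        final Object value;
        Tagged(boolean tainted, Object value) {
            this.tainted = tainted;
            this.value = value;
        }
    }

    /** Example sink argument that carries no tag itself but holds a tagged member. */
    static final class Form {
        private final Tagged comment;
        Form(Tagged comment) {
            this.comment = comment;
        }
    }

    /** True if the argument itself, or one of its declared instance fields, is tainted. */
    static boolean reachesTaint(Object arg) {
        if (arg == null) return false;
        if (arg instanceof Tagged) return ((Tagged) arg).tainted;
        for (Field f : arg.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) continue;
            try {
                f.setAccessible(true);
                Object member = f.get(arg);
                if (member instanceof Tagged && ((Tagged) member).tainted) {
                    return true;
                }
            } catch (IllegalAccessException | RuntimeException e) {
                // The field cannot be read reflectively; this sketch skips it.
            }
        }
        return false;
    }
}
```

The scan is deliberately shallow (inherited and nested fields are ignored); a fuller version would need to bound recursion and handle cyclic object graphs.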

6.3. Implementation evaluation

To evaluate our instrumentation of OpenMRS with $Phos$, we developed an automated testing method that checks correctness of the implementation and measures timing and memory overhead. Before presenting the results, we explain certain details of the evaluation setup.

Phosphor initialization overhead. When instrumenting Java classes with Phosphor, one can either run the instrumentation manually, specifying all of the source files to the program, or let Phosphor automatically detect and instrument uninstrumented code as the program runs. We chose the latter option for this project, due to the nature of OpenMRS and the onerous overhead of manual instrumentation.

As a consequence of instrumenting Java classes dynamically, an instrumentation overhead occurs as new uninstrumented source code is discovered. Thus, an initial run through a Phosphor-modified OpenMRS would be slower than consecutive runs, which was indeed observed by testing – the initial run of a particular method typically took about twice as long as subsequent runs.

Fig. 18. Phosphor instrumentation memory overhead. The x-axis denotes the fraction of a test run completed, and the y-axis denotes memory used in MB.

Since our main concern in evaluation was to compare the overhead of instrumented vs. unmodified OpenMRS, and initialization overhead is arguably amortized to insignificance over long use sessions, the results we report here are only for pre-initialized testing runs. However, we do note this additional overhead with the use of Phosphor’s dynamic instrumentation feature.

Actions and page loads. The OpenMRS system has a web-based user interface, so we can partition its functionality into two major categories. The first is called action functionality, which results from submitting a form. Since form submission introduces tainted data that is potentially sanitized and destined for a secure sink, we can expect actions to incur overhead in instrumented OpenMRS, so they are clearly important to consider.
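One simple way to obtain pre-initialized measurements of this kind is to execute each operation once without timing it, so that any lazy instrumentation has already happened, and then time the subsequent executions. The sketch below illustrates that idea only; the class `WarmTiming` and its method are hypothetical and do not reproduce the test script used in our evaluation.

```java
final class WarmTiming {
    // Runs the task once untimed as a warm-up, then returns the mean
    // wall-clock time in seconds over `runs` timed executions.
    static double averageSecondsAfterWarmup(Runnable task, int runs) {
        if (runs <= 0) throw new IllegalArgumentException("runs must be positive");
        task.run(); // warm-up: triggers any first-use (initialization) cost
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for an action or page load.
        double avg = averageSecondsAfterWarmup(() -> Math.sqrt(12345.0), 10);
        System.out.println("average seconds per run: " + avg);
    }
}
```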

OpenMRS also offers the potential for page loads, where users navigate between pages in the system by clicking links (but not submitting data). Since the underlying code is also instrumented in these situations, and continues to track taint, some overhead can also be observed. Thus we also evaluated overhead associated with page loads.

6.3.1. Experiments and results

To evaluate $Phos$-instrumented OpenMRS, we developed a script that iterates over 42 actions and 121 page loads, recording timing and memory use. We call one execution of this script a test run. We did a test run over unmodified OpenMRS to establish a baseline, and also did a test run over OpenMRS instrumented with our implementation of $Phos$. Finally, to evaluate how much our modifications impact Phosphor overhead, we did an actions-only test run over OpenMRS instrumented with pre-initialized unmodified Phosphor.

An initial concern of our evaluation was determining whether the system worked correctly, and whether data reaching sinks was maybe tainted, indicating sanitization, as well as being logged properly. We confirmed this, and did not discover instances of unsanitized data reaching sinks. Subsequently, we considered timing and memory consumption.

Timing. Our timing results are summarized in Table 1. Here we show the average time to complete each action and page load for the unmodified OpenMRS baseline, as well as OpenMRS instrumented with our implementation of $Phos$, and OpenMRS instrumented with pre-initialized unmodified Phosphor. These results demonstrate that instrumentation imposes a bit less than $3 \times$ overhead on actions and roughly $1.1 \times$ on page loads, while average times for completion are not onerous. Furthermore, our comparison of $Phos$ and Phosphor shows that our modifications to Phosphor did not add significant overhead to the taint analysis.

Table 1
Average timing and overhead for unmodified OpenMRS (baseline), versus instrumented with Phosphor and with $Phos$

| | Actions: Avg (secs) | Actions: Overhead | Loads: Avg (secs) | Loads: Overhead |
|---|---|---|---|---|
| OpenMRS Baseline | .236 | – | .567 | – |
| OpenMRS + Phosphor | .614 | 261% | – | – |
| OpenMRS + $Phos$ | .670 | 284% | .636 | 112% |
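The overhead figures in Table 1 are consistent with reading "Overhead" as instrumented time expressed as a percentage of baseline time. This reading is inferred from the numbers rather than stated explicitly, and the helper below (a hypothetical `Overhead` class) simply recomputes the percentages from the rounded averages in the table.

```java
final class Overhead {
    // Instrumented time as a percentage of baseline time, rounded to the nearest integer.
    static long percentOfBaseline(double instrumentedSecs, double baselineSecs) {
        return Math.round(100.0 * instrumentedSecs / baselineSecs);
    }

    public static void main(String[] args) {
        System.out.println(percentOfBaseline(0.670, 0.236)); // Phos, actions: 284
        System.out.println(percentOfBaseline(0.636, 0.567)); // Phos, page loads: 112
        System.out.println(percentOfBaseline(0.614, 0.236)); // Phosphor, actions: 260
    }
}
```

Recomputing from the rounded averages gives 260% for unmodified Phosphor rather than the reported 261%; the small difference presumably comes from rounding of the averages shown in the table.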

Figure 17 shows more detailed results, comparing times for the OpenMRS baseline and the $Phos$-instrumented version for each action (left graph) and page load (right graph). These results show that timing overhead is fairly consistent, albeit with some significant anomalies. In particular, overhead for the instrumented version spiked on the $dataExport.list$ action, which is action number 23 in the graph. It is unclear what caused this anomaly, but it appears to be an artifact of the Phosphor implementation (not our modifications).

Memory. Figure 18 shows baseline memory consumption during test runs of unmodified OpenMRS, versus pre-initialized OpenMRS instrumented with $Phos$ . As these results demonstrate, while the instrumentation with $Phos$ does impose memory overhead, the impact on performance is not practically significant.

7. Related work and conclusion

Some of the results in this paper were discussed in a preliminary manuscript [3], but the current work provides a fully developed metatheory, a formulation of the high-level security policy enforced by our system (explicit integrity modulo endorsement), and a complete implementation and empirical evaluation.

Taint analysis is an established solution to enforce confidentiality and integrity policies through direct data flow control. Various systems have been proposed for both low and high level languages. Our policy language and semantics are based on a well-developed formal foundation, where we interpret Horn clause logic as an instance of information algebra [28] in order to specify and interpret retrospective policies. The work presented in this paper supersedes a previous presentation [3] – in the current paper we extend our language model, provide more rigorous proofs of correctness of policy enforcement, consider the hyperproperty of taint analysis in a model of Java, and report on a prototype implementation.

Schwartz et al. [41] define a general model for runtime enforcement of policies using taint tracking for an intermediate language. In Livshits et al. [29], taint analysis is expressed as part of operational semantics, similar to Schwartz et al., and a taxonomy of taint tracking is defined. Livshits et al. [31] propose a solution for a range of vulnerabilities regarding Java-based web applications, including SQL injections, XSS attacks and parameter tampering, and formalize taint propagation including sanitization. The work uses PQL [32] to specify vulnerabilities. However, these works are focused on operational definitions of taint analysis for imperative languages. In contrast we have developed a logical specification of taint analysis for a functional OO language model that is separate from code, and is used to establish correctness of an implementation. Our work also comprises a unique retrospective component to protect against incomplete input sanitization. According to earlier studies [31,48], incomplete input sanitization makes a variety of applications susceptible to injection attacks.

Another related line of work is focused on the optimization of integrity taint tracking deployment in web-based applications. Sekar [42] proposes a taint tracking mechanism to mitigate injection attacks in web applications. The work focuses on input/output behavior of the application, and proposes a lower-overhead, language-independent and non-intrusive technique that can be deployed to track taint information for web applications by blackbox taint analysis with syntax-aware policies. In our work, however, we propose a deep instrumentation technique to enforce taint propagation in a layered in-depth fashion. Wei et al. [49] attempt to lower the memory overhead of TaintDroid taint tracker [21] for Android applications. The granularity of taint tracking plays a significant role in the memory overhead. To this end, TaintDroid trades taint precision for better overhead, e.g., by having a single taint label for an array of elements. Our work reflects a more straightforward object-level taint approach in line with existing Java approaches.

Saxena et al. [37] employ static techniques to optimize dynamic taint tracking done by binary instrumentation, through the analysis of registers and stack frames. They observe that it is common for multiple local memory locations and registers to have the same taint value. A single taint tag is used for all such locations. A shadow stack is employed to retain the taint of objects in the stack. Cheng et al. [17] also study the solutions for taint tracking overhead for binary instrumentation. They propose a byte to byte mapping between the main and shadow memory that keeps taint information. Bosman et al. [15] propose a new emulator architecture for the x86 architecture from scratch with the sole purpose of minimizing the instructions needed to propagate taint. Similar to Cheng et al. [17], they use shadow memory to keep taint information, with a fixed offset from user memory space. Zhu et al. [50] track taint for confidentiality and privacy purposes. In case a sensitive input is leaked, the event is either logged, prohibited or replaced by some random value. We have modeled a similar technique for an OO language, through high level logical specification of shadow objects, so that each step of computation is simulated for the corresponding shadow expressions.

Particularly for Java, Chin et al. [18] propose taint tracking of Java web applications in order to prohibit injection attacks. To this end, they focus on strings as user inputs, and analyze the taint at the character level. For each string, a separate taint tag is associated with each character of the string, indicating whether that character was derived from untrusted input. The instrumentation is only done on the string-related library classes to record taint information, and methods are modified in order to propagate taint information. Haldar et al. [25] propose an object-level tainting mechanism for Java strings. They study the same classes as the ones in Chin et al. [18], and instrument all methods in these classes that have some string parameters and return a string. Then, the returned value of an instrumented method is tainted if at least one of the argument strings is tainted. However, in contrast to our work, only strings are endowed with integrity information, whereas all values are assigned integrity labels in our approach. Recently Bodei et al. [14] have proposed a static enforcement mechanism for taint analysis in IoT devices which predicts the propagation of taint in the system according to the flow of control. These previous works lack retrospective features.
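The difference in precision between character-level and object-level string tainting can be illustrated with a small sketch. The class `TaintedString` below is invented for this purpose and is not the implementation of either cited system: it keeps one taint flag per character, and the object-level view is recovered by asking whether any character is tainted.

```java
import java.util.Arrays;

final class TaintedString {
    final String text;
    final boolean[] charTaint; // one flag per character of text

    // Creates a string whose characters are all tainted or all untainted.
    TaintedString(String text, boolean tainted) {
        this.text = text;
        this.charTaint = new boolean[text.length()];
        Arrays.fill(this.charTaint, tainted);
    }

    private TaintedString(String text, boolean[] charTaint) {
        this.text = text;
        this.charTaint = charTaint;
    }

    // Character-level propagation: each character of the result keeps the flag
    // of the character it was copied from.
    TaintedString concat(TaintedString other) {
        boolean[] flags = Arrays.copyOf(charTaint, charTaint.length + other.charTaint.length);
        System.arraycopy(other.charTaint, 0, flags, charTaint.length, other.charTaint.length);
        return new TaintedString(text + other.text, flags);
    }

    // Object-level view: the whole string counts as tainted if any character is.
    boolean anyTainted() {
        for (boolean flag : charTaint) {
            if (flag) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TaintedString query = new TaintedString("SELECT ", false).concat(new TaintedString("x", true));
        System.out.println(query.anyTainted());  // true: object-level view of the result
        System.out.println(query.charTaint[0]);  // false: the trusted prefix stays untainted
    }
}
```

Under the object-level rule the entire concatenated string is treated as tainted, whereas the per-character flags still distinguish the trusted prefix from the untrusted suffix.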

Recent work has also considered static analysis for ensuring proper context-based sanitization of user input data to defend against XSS attacks, in the JSPChecker system [45]. While this work refines what is meant by “correct” sanitization, it relies on static analysis and thus introduces false positives. In contrast, we propose a runtime tool that marks data generated by imperfect sanitizers for postfacto analysis. Our work is more general in the sense that it can be used for any category of integrity data flow vulnerabilities including XSS.

Phosphor [11,12] is an attempt to apply taint tracking more generally in Java, to any primitive type and object class. Phosphor instruments the application and libraries at bytecode level based on a given list of taint source and sink methods. Input sanitizers with endorsement are not directly supported, however. As Phosphor avoids any modifications to the JVM, the instrumented code is still portable. Our work is an attempt to formalize Phosphor in FJ extended with input sanitization and in-depth enforcement. Our larger goal is to develop an implementation of in-depth dynamic integrity analysis for Java by leveraging the existing Phosphor system.

Secure information flow [20] and its interpretation as the well-known hyperproperty [19] of noninterference [24] is challenging to implement in practical settings [36] due to implicit flows. Taint analysis is thus an established solution to enforce confidentiality and integrity policies since it tracks only direct data flow control. Various systems have been proposed for both low and high level languages. The majority of previous work, however, has been focused on taint analysis policy specification and enforcement (e.g., [29,30,41,49]), rather than capturing the essence of direct information flow which could provide an underlying framework to study numerous taint analysis tools.

Knowledge-based semantics has been introduced by Askarov et al. [6] as a general model for information flow of confidential data, concentrated on cryptographic computations and key release (declassification [35]) and later employed in other data secrecy analyses [4,5,7]. Schoepe et al. [39] have proposed the semantic notion of correctness for taint tracking that enforces confidentiality policies of direct information flow, called explicit secrecy. To this end, they propose a knowledge-based semantics, influenced by Volpano’s weak secrecy [47] and gradual release [6]. Explicit secrecy is defined as a property of a program, where the program execution does not change the explicit knowledge of a public user. The authors show that noninterference is not comparable to explicit secrecy. However, rather than restricting the discussion to direct information flow in a low level language, we model a high level OO language with a functional flavor to represent generality of our framework.

Schoepe et al. [40] have recently employed explicit secrecy to study correctness results for dynamic confidentiality taint analysis in a core imperative setting with pointers and I/O, and deployed a Java-based tool, called DroidFace. A recent framework by Balliu et al. [8] attempts to bring together the general information flow and direct flow analyses using a security condition that models indirect flows which are observable by a low confidentiality user.

A counterpart for attacker knowledge in the realm of general flow of information integrity, called attacker power [5], is introduced as the set of low integrity inputs that generate the same observables. In this regard, Askarov et al. [5] use holes in the syntax of program code for injection points, influenced by [33]. However, their attack model is different as the low integrity and low confidentiality user is able to inject program code in the main program, by which she could gain more knowledge. We have tailored attacker power for explicit flows using state transformers, in order to interpret integrity taint analysis.

Birgisson et al. [13] give a unified framework to capture different flavors of integrity, in particular integrity via information flow and via different types of invariance. Similar to other works in this line, they give a simple imperative language with labeled operational semantics in order to enforce integrity policies through communication with a monitor. In contrast, we use program rewriting techniques to enforce policies regarding flow of data integrity, which are applicable to legacy systems.

In addition to formal properties of direct information flow, our formulation of correctness conditions also considers a formalization of audit logging based on our previous work [2], which considered a safety property unrelated to taint analysis. Other authors have recently considered formal characterizations of auditing based on logics of justification [10,34]. In contrast, we consider a specific security application of auditing in combination with taint analysis where audit logs are “extralinguistic” vestiges of program computation, whereas these related works consider programs that are able to reflect on their own audit trails, which is a distinct theoretical problem.

7.1. Conclusion

In this paper we considered integrity taint analysis in a pure object-oriented language model. Our security model accounts for sanitization methods that may be incomplete, a known problem in practice and one inspired by our study of the OpenMRS medical records software system. We proposed an in-depth security mechanism based on combining prospective measures (to support access control) and retrospective measures (to support auditing and accountability) that address incomplete sanitization. More precisely, we propose treating the results of sanitization as “partially” endorsed, or “maybe tainted”, and allow maybe tainted values to be used in security sensitive operations but record such events in the audit log.
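The combined prospective and retrospective treatment of a security sensitive operation can be summarized in a few lines. The sketch below is a simplified illustration, not the formal policy or the $Phos$ instrumentation: the class `SinkGate`, its `Level` enum, and the string-based audit log are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class SinkGate {
    // Hypothetical integrity levels for the argument of a security sensitive operation.
    enum Level { UNTAINTED, MAYBE_TAINTED, TAINTED }

    private final List<String> auditLog = new ArrayList<>();

    // Prospective check plus retrospective logging for one sink call.
    // Returns true if the call may proceed.
    boolean admit(String sink, Level argumentLevel) {
        switch (argumentLevel) {
            case TAINTED:
                return false;          // prospective: tainted data is blocked
            case MAYBE_TAINTED:
                auditLog.add(sink);    // retrospective: allowed, but recorded
                return true;
            default:
                return true;           // untainted data proceeds silently
        }
    }

    List<String> auditLog() {
        return List.copyOf(auditLog);
    }

    public static void main(String[] args) {
        SinkGate gate = new SinkGate();
        System.out.println(gate.admit("ExampleDao.save", Level.MAYBE_TAINTED)); // true
        System.out.println(gate.auditLog());                                    // [ExampleDao.save]
    }
}
```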

We developed a uniform security policy of dynamic integrity taint analysis that specifies both prospective and retrospective measures, separate from code. The specification is defined in terms of a logical interpretation of program traces and leverages techniques from information algebra, allowing prospective and retrospective measures to be characterized in a uniform and integrated manner. Since the specification is defined separate from code, we use it to establish provable correctness conditions for a rewriting algorithm that instruments in-depth integrity taint analysis. A rewriting approach supports development of tools that can be applied to legacy code without modifying language implementations.

Although our specification of dynamic integrity taint analysis with endorsement establishes correctness conditions for implementations, it is still operational in nature. We therefore developed the hyperproperty of explicit integrity modulo endorsement to characterize the security property of integrity taint analysis in a non-operational manner. It is important to note that this formulation was not simply the dualization of previous formulations of explicit secrecy [39], since these formulations address only low-level code with unstructured data. We subsequently demonstrated that the image of our rewriting algorithm enjoys this security property.

Since our broader goal is to support well-founded practical tools for hardening software, we developed an instrumented version of OpenMRS that integrates our in-depth taint analysis formally specified in our model. Results from our evaluation of this implementation suggest that it is correct and practically feasible. We have made the implementation available on a public GitHub repository [44].

Footnotes

Acknowledgments

This work was supported by the National Science Foundation under Grant No. 1408801. Thanks to Ramy Koudsi and Adam Barson for their work on the implementation of $Phos$ , and to Scott Smith for comments on early drafts of this work.

References

1. Amir-Mohammadian, A formal approach to combining prospective and retrospective security, PhD thesis, The University of Vermont, 2017.
2. Amir-Mohammadian, Chong and Skalka, Correct audit logging: Theory and practice, in: POST, 2016, pp. 139–162.
3. Amir-Mohammadian and Skalka, In-depth enforcement of dynamic integrity taint analysis, in: PLAS, 2016.
4. Askarov, Hunt, Sabelfeld and Sands, Termination-insensitive noninterference leaks more than just a bit, in: ESORICS, 2008, pp. 333–348.
5. Askarov and Myers, A semantic framework for declassification and endorsement, in: ESOP, 2010, pp. 64–84.
6. Askarov and Sabelfeld, Gradual release: Unifying declassification, encryption and key release policies, in: IEEE S&P, 2007, pp. 207–221.
7. Askarov and Sabelfeld, Tight enforcement of information-release policies for dynamic languages, in: CSF, 2009, pp. 43–59.
8. Balliu, Schoepe and Sabelfeld, We are family: Relating information-flow trackers, in: European Symposium on Research in Computer Security, 2017, pp. 124–145.
9. Bauer, Ligatti and Walker, More enforceable security policies, Technical Report TR-649-02, Princeton University, 2002.
10. Bavera and Bonelli, Justification logic and audited computation, Journal of Logic and Computation (2015), exv037.
11. Bell and G.E. Kaiser, Phosphor: Illuminating dynamic data flow in commodity jvms, in: OOPSLA, 2014, pp. 83–101.
12. Bell and G.E. Kaiser, Dynamic taint tracking for java with phosphor (demo), in: ISSTA, 2015, pp. 409–413.
13. Birgisson, Russo and Sabelfeld, Unifying facets of information integrity, in: ICISS, 2010, pp. 48–65.
14. Bodei and Galletta, Tracking sensitive and untrustworthy data in IoT, in: ITASEC, 2017, pp. 38–52.
15. Bosman, Slowinska and H. Bos, Minemu: The world’s fastest taint tracker, in: RAID, 2011, pp. 1–20.
16. J.G. Cederquist, Corin, M.A.C. Dekker, Etalle, J.I. den Hartog and Lenzini, Audit-based compliance control, International Journal of Information Security 6(2–3) (2007), 133–151. doi:10.1007/s10207-007-0017-y.
17. Cheng, Zhao, Yu and S. Hiroshige, Tainttrace: Efficient flow tracing with dynamic binary rewriting, in: IEEE ISCC, 2006, pp. 749–754.
18. Chin and Wagner, Efficient character-level taint tracking for java, in: ACM SWS, 2009, pp. 3–12. doi:10.1145/1655121.1655125.
19. M.R. Clarkson and F.B. Schneider, Hyperproperties, Journal of Computer Security 18(6) (2010), 1157–1210. doi:10.3233/JCS-2009-0393.
20. D.E. Denning and P.J. Denning, Certification of programs for secure information flow, Communications of the ACM 20(7) (1977), 504–513. doi:10.1145/359636.359712.
21. Enck, Gilbert, B.-G. Chun, L.P. Cox, Jung, McDaniel and Sheth, Taintdroid: An information flow tracking system for real-time privacy monitoring on smartphones, Commun. ACM 57(3) (2014), 99–106. doi:10.1145/2494522.
22. Ganapathy, Jaeger, Skalka and Tan, Assurance for defense in depth via retrofitting, in: LAW, 2014.
23. Garg, Jia and Datta, Policy auditing over incomplete logs: Theory, implementation and applications, in: CCS, 2011, pp. 151–162.
24. J.A. Goguen and Meseguer, Security policies and security models, in: IEEE S&P, 1982, pp. 11–20.
25. Haldar, Chandra and Franz, Dynamic taint propagation for java, in: ACSAC, 2005, pp. 303–311.
26. Igarashi, B.C. Pierce and Wadler, Featherweight java: A minimal core calculus for java and GJ, ACM Trans. Program. Lang. Syst. 23(3) (2001), 396–450. doi:10.1145/503502.503505.
27. Kohlas, Information Algebras: Generic Structures for Inference, Discrete Mathematics and Theoretical Computer Science, Springer, 2003.
28. Kohlas and Schmid, An algebraic theory of information: An introduction and survey, Information 5(2) (2014), 219–254. doi:10.3390/info5020219.
29. Livshits, Dynamic taint tracking in managed runtimes, Technical Report MSR-TR-2012-114, Microsoft Research, 2012.
30. Livshits and Chong, Towards fully automatic placement of security sanitizers and declassifiers, in: POPL, 2013, pp. 385–398.
31. Livshits, Martin and M.S. Lam, Securifly: Runtime protection and recovery from web application vulnerabilities, Technical report, Stanford University, 2006.
32. Martin, Livshits and M.S. Lam, Finding application errors using PQL: A program query language, in: OOPSLA, 2005.
33. A.C. Myers, Sabelfeld and Zdancewic, Enforcing robust declassification and qualified robustness, Journal of Computer Security 14(2) (2006), 157–196. doi:10.3233/JCS-2006-14203.
34. Ricciotti and Cheney, Strongly normalizing audited computation, CoRR, 2017, abs/1706.03711.
35. Sabelfeld and D. Sands, Declassification: Dimensions and principles, Journal of Computer Security 17(5) (2009), 517–548. doi:10.3233/JCS-2009-0352.
36. Sabelfeld and A.C. Myers, Language-based information-flow security, IEEE Journal on selected areas in communications 21(1) (2003), 5–19. doi:10.1109/JSAC.2002.806121.
37. Saxena, Sekar and Puranik, Efficient fine-grained binary instrumentation with applications to taint-tracking, in: CGO, 2008, pp. 74–83. doi:10.1145/1356058.1356069.
38. F.B. Schneider, Enforceable security policies, ACM Transactions on Information and System Security 3(1) (2000), 30–50. doi:10.1145/353323.353382.
39. Schoepe, Balliu, B.C. Pierce and Sabelfeld, Explicit secrecy: A policy for taint tracking, in: IEEE EuroS&P, 2016, pp. 15–30.
40. Schoepe, Balliu, Piessens and Sabelfeld, Let’s face it: Faceted values for taint tracking, in: European Symposium on Research in Computer Security, 2016, pp. 561–580.
41. E.J. Schwartz, Avgerinos and Brumley, All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask), in: IEEE S&P, 2010, pp. 317–331.
42. Sekar, An efficient black-box technique for defeating web application attacks, in: NDSS, 2009.
43. Skalka, Amir-Mohammadian and Clark, Dynamic integrity taint analysis in depth, Technical report, University of Vermont, 2019, http://www.cs.uvm.edu/~ceskalka/skalka-pubs/phos-TR19.pdf.
44. Skalka, Amir-Mohammadian and Clark, Retrospective taint analysis for OpenMRS, 2019, https://github.com/uvm-plaid/phosphor-mod.
45. Steinhauser and F. Gauthier, Jspchecker: Static detection of context-sensitive cross-site scripting flaws in legacy web applications, in: Proceedings of the 2016 ACM Workshop on Programming Languages and Analysis for Security, PLAS’16, ACM, New York, NY, USA, 2016, pp. 57–68. doi:10.1145/2993600.2993606.
46. Usage statistics module, 2010, https://wiki.openmrs.org/display/docs/Usage+Statistics+Module, Accessed: 2015-09-27.
47. D.M. Volpano, Safety versus secrecy, in: SAS, 1999, pp. 303–311.
48. Wassermann and Su, Sound and precise analysis of web applications for injection vulnerabilities, in: PLDI, 2007, pp. 32–41.
49. Wei and D. Lie, Lazytainter: Memory-efficient taint tracking in managed runtimes, in: SPSM Workshop at CCS, 2014, pp. 27–38.
50. D.(Y.) Zhu, Jung, Song, Kohno and Wetherall, Tainteraser: Protecting sensitive data leaks using application-level taint tracking, Operating Systems Review 45(1) (2011), 142–154. doi:10.1145/1945023.1945039.


¹ We responsibly disclosed the vulnerabilities we found to the OpenMRS development community, and they have been corrected in current versions.

² We use metavariable p to range over programs in either the source or target language; it will be clear from context which language is used.
