Generative transformation via abstract change script

Abstract

With the popularity of code search, automatically adapting the retrieved source code to the user needs has become an important problem. However, none of the existing code transformation methods solves it: they either fix compile bugs or depend on formal specifications, which limits their practicality. In this paper, we propose a novel generative transformation based on code changes: we generate an abstract change script from code changes and apply the script to transform the retrieved source code. To evaluate our method, we extract 7 topics and collect 5–6 code snippets per topic from GitHub, and perform 5 experiments in which we explore 2 sensitivity-related rules and use them to raise the accuracy step by step. The experimental results show that our method is feasible and practical, reaching 73.84% accuracy.

Keywords

Code reuse, code search, program transformation, code change, change script

1. Introduction

Code reuse is important for efficient program development. It consists of code search and code transformation. The former comprises keyword-based [1], signature-based [2], specification-based [3] and test case-based [4] search. With more and more open source code available on the Internet [5, 6, 7, 8], many good code search methods have been proposed, such as Satsy [9], Code Conjurer [8, 10], Strathcona [6] and Prospector [11]. However, users usually have to change the retrieved source code manually, because it rarely meets their needs directly. Therefore, a meaningful problem is how to change the retrieved source code automatically.

To solve this problem, one would naturally turn to the existing code transformation methods [12], but neither the compilation transformation nor the generative transformation works. The former (e.g., GenProg [13], Par [14] or BugModify [15]) is not suitable because it is only used for fixing compile bugs. By contrast, the latter seems suitable because it generates new source code from candidate source code based on the user needs. Unfortunately, it makes the source code conform to the user needs by means of formal specifications. For example, Gopinath et al. [16] build a SAT formula that encodes the constraints imposed by the specification. If the SAT formula is satisfiable, the new source code can be derived from it. However, formal specifications are difficult to build, which limits the practicality of this approach.

Fortunately, we observed an interesting phenomenon, shown in Fig. 1. We downloaded 3 pieces of source code from GitHub: $m_{A}$ (textChanged), $m_{B}$ (update Action) and $m_{C}$ (selectionChanged). The original methods ( $m_{A}$ , $m_{B}$ and $m_{C}$ ) are shown in black and the changed methods ( $m_{A}^{\prime}$ , $m_{B}^{\prime}$ and $m_{C}^{\prime}$ ) in red.

Figure 1.

Similar changes to three pieces of source code.

Figure 2.

AST comparison between $m_{A}$ and $m_{B}$ .

Suppose that Smith retrieved $m_{A}$ but was dissatisfied with it and changed it manually, by deletion ( $m_{A}$ : lines 3–4), update ( $m_{A}$ : line 6), insertion ( $m_{A}^{\prime}$ : lines 5’–6’) and move ( $m_{A}$ : lines 7–8). Steve did likewise for $m_{B}$ , by deletion ( $m_{B}$ : line 4), update ( $m_{B}$ : line 5), insertion ( $m_{B}^{\prime}$ : lines 5’–6’ and 9’–12’) and move ( $m_{B}$ : lines 6–7). Jim did likewise for $m_{C}$ , by update ( $m_{C}$ : line 4), insertion ( $m_{C}^{\prime}$ : lines 5’–6’) and move ( $m_{C}$ : lines 5–6).

Although $m_{A}$ , $m_{B}$ and $m_{C}$ are different, the changes the 3 users made are similar. We therefore make a hypothesis: if we identify the similar changes between $m_{A}$ and $m_{B}$ , namely {update ( $m_{A}$ : line 6), insertion ( $m_{A}^{\prime}$ : lines 5’–6’), move ( $m_{A}$ : lines 7–8)} in $m_{A}$ and {update ( $m_{B}$ : line 5), insertion ( $m_{B}^{\prime}$ : lines 5’–6’), move ( $m_{B}$ : lines 6–7)} in $m_{B}$ , then we can apply them to $m_{C}$ without Jim’s effort, yielding update ( $m_{C}$ : line 4), insertion ( $m_{C}^{\prime}$ : lines 5’–6’) and move ( $m_{C}$ : lines 5–6).

Inspired by this hypothesis, we propose a generative transformation based on code changes. Its underlying idea is that the common code changes that most users make reflect their needs to some degree. The method consists of an abstracting algorithm and a concretizing algorithm. The former identifies the similar changes among code changes and generates the abstract change script. The latter applies the script to transform the source code automatically.

Algorithm 1 Extraction
Input: $M$ // a set of original source code
$M^{\prime}$ // a set of changed source code
Output: $C+\Delta$ // the abstract change script
/* step 1: Obtain changes */
foreach ( $m_{i}$ , $m_{i}^{\prime}$ ) in ( $M,M^{\prime}$ )
$\mid$ $\Delta_{i}=m_{i}^{\prime}-m_{i}$ ; // obtain the i-th changes
end
/* step 2: Identify common changes */
$\Delta_{c}=\cap_{i=1}^{n}\Delta_{i}$
/* step 3: Generalize abstract changes */
$\Delta_{c}=$ Generalization( $\Delta_{c}$ )
/* step 4: Extract the changes-relevant context */
foreach ( $m_{i}$ , $m_{i}^{\prime}$ , $\Delta_{c_{i}}$ ) in ( $M,M^{\prime},\Delta_{c}$ )
$\mid$ $c_{i}=$ depend( $m_{i}$ , $m_{i}^{\prime}$ , $\Delta_{c_{i}}$ ); // obtain the i-th context $c_{i}$
end
/* step 5: Generalize abstract context */
$C_{c}=\cap_{i=1}^{n}c_{i}$
$C_{c}=$ Generalizing( $C_{c}$ )
/* step 6: Generate the context-aware change pattern */
$C_{c}\rightarrow C$ ; // merge contexts $C_{c}$ into a single context $C$
$\Delta_{c}\rightarrow\Delta$ ; // merge changes $\Delta_{c}$ into a single change $\Delta$
return $C+\Delta$ ; // return the abstract change script

To evaluate our method, we extract 7 topics and collect 5–6 code snippets per topic from GitHub. Then we perform 5 experiments, in which we explore 2 sensitivity-related rules and use them to raise the accuracy step by step. The encouraging results indicate that our method is effective.

Main contributions are as follows:

(1) As a new generative transformation, we transform the source code by using the abstract change script instead of formal specifications.

(2) We propose the abstracting algorithm to generate the abstract change script, as well as the concretizing algorithm to transform the source code automatically by using the script.

2. Our method

We focus on the code changes collected from GitHub and propose the abstracting and the concretizing algorithms.

2.1 Abstracting algorithm

Algorithm 1 describes the six steps that generate the abstract change script.

2.1.1 Step 1: Obtain changes

Let $M=\{m_{1},\ldots,m_{n}\}$ , where $m_{i}$ is the $i$ -th original piece of source code, and let $M^{\prime}=\{m_{1}^{\prime},\ldots,m_{n}^{\prime}\}$ , where $m_{i}^{\prime}$ is the changed version of $m_{i}$ . We represent $m_{i}$ as an Abstract Syntax Tree (AST) and characterize code changes as AST changes $\Delta_{i}$ , a set of edit operations $e_{i}$ , each of which is one of $\text{insert}(u,v,k)$ , $\text{delete}(u,v,k)$ , $\text{update}(u,v)$ and $\text{move}(u,v,k)$ :

Insert (node $u$ , node $v$ , int $k$ ): insert $u$ and position it as the ( $k$ $+$ 1)-th child of $v$ ;

Delete (node $u$ , node $v$ , int $k$ ): delete $u$ at the ( $k$ $+$ 1)-th child of $v$ ;

Update (node $u$ , node $v$ ): replace the label and AST type of $u$ with $v$ while maintaining $u$ ’s position;

Move (node $u$ , node $v$ , int $k$ ): delete $u$ from its current position and insert it as the ( $k$ $+$ 1)-th child of $v$ .

Let $m_{i}+\Delta_{i}=m_{i}^{\prime}$ denote that a piece of source code $m_{i}$ undergoes the changes $\Delta_{i}$ and becomes a new piece of source code $m_{i}^{\prime}$ . We obtain the changes by using ChangeDistiller [17], such that $\Delta_{i}=m_{i}^{\prime}-m_{i}$ . In this process, ChangeDistiller computes a one-to-one node mapper between the ASTs of the before and after versions of $m_{i}$ bottom-up, using bigram string similarity for leaf nodes (e.g., statements and method invocations) and subtree similarity for inner nodes (e.g., while and if statements). A node that is not in the mapper belongs to the changes $\Delta_{i}$ .
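The leaf-matching idea can be illustrated with a minimal sketch. It assumes that "bigram string similarity" means the Dice coefficient over character bigrams; the helper names `bigrams` and `bigram_similarity` are hypothetical and are not taken from the tool's API.

```python
from collections import Counter


def bigrams(text):
    """Return the character bigrams of a string, ignoring case and whitespace."""
    compact = "".join(text.lower().split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


def bigram_similarity(a, b):
    """Dice coefficient over character bigrams, a value in [0, 1] (assumed measure)."""
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    shared = sum((count_a & count_b).values())
    return 2.0 * shared / total
```

Two leaf statements would then be treated as candidates for the mapper when their similarity exceeds a chosen threshold; the threshold value is a design parameter and is left open here.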

The original pieces of source code $m_{A}$ , $m_{B}$ and the changed pieces $m_{A}^{\prime}$ , $m_{B}^{\prime}$ in Fig. 1 are represented as $\textit{AST}_{A}$ , $\textit{AST}_{B}$ and $\textit{AST}_{A}^{\prime}$ , $\textit{AST}_{B}^{\prime}$ in Fig. 2. ‘O’ denotes the nodes in $m_{A}$ and $m_{B}$ , while ‘N’ denotes the nodes in $m_{A}^{\prime}$ and $m_{B}^{\prime}$ . Each node corresponds to one line (statement) of code. For example, $O_{6}$ in $\textit{AST}_{A}$ corresponds to “MVAction action $=$ (MVAction) e.next();” in $m_{A}$ , and $N_{4}$ in $\textit{AST}_{A}^{\prime}$ corresponds to “object next $=$ e.next();” in $m_{A}^{\prime}$ .

In this step, we obtain the code changes ( $\Delta_{A}$ , $\Delta_{B}$ ) of the pieces of source code ( $m_{A}$ , $m_{B}$ ) as follows:

$\displaystyle\Delta_{A}=\{e_{(A_{1})}(\textit{Delete}\>O_{3}\>O_{1}\>1),\ e_{(A_{2})}(\textit{Delete}\>O_{4}\>O_{1}\>2),\ e_{(A_{3})}(\textit{Move}\>O_{5}\>N_{1}\>1),\ e_{(A_{4})}(\textit{Update}\>O_{6}\>N_{4}),\ e_{(A_{5})}(\textit{Move}\>O_{7}\>N_{5}\>1),\ e_{(A_{6})}(\textit{Move}\>O_{8}\>N_{7}\>0),\ e_{(A_{7})}(\textit{Insert}\>N_{5}\>N_{3}\>1),\ e_{(A_{8})}(\textit{Insert}\>N_{6}\>N_{5}\>0)\};$

$\displaystyle\Delta_{B}=\{e_{(B_{1})}(\textit{Delete}\>O_{4}\>O_{3}\>1),\ e_{(B_{2})}(\textit{Update}\>O_{5}\>N_{4}),\ e_{(B_{3})}(\textit{Move}\>O_{5}\>N_{3}\>0),\ e_{(B_{4})}(\textit{Move}\>O_{6}\>N_{5}\>1),\ e_{(B_{5})}(\textit{Move}\>O_{7}\>N_{7}\>0),\ e_{(B_{6})}(\textit{Insert}\>N_{5}\>N_{3}\>1),\ e_{(B_{7})}(\textit{Insert}\>N_{6}\>N_{5}\>0),\ e_{(B_{8})}(\textit{Insert}\>N_{9}\>N_{3}\>2),\ e_{(B_{9})}(\textit{Insert}\>N_{10}\>N_{9}\>0),\ e_{(B_{10})}(\textit{Insert}\>N_{11}\>N_{9}\>1),\ e_{(B_{11})}(\textit{Insert}\>N_{12}\>N_{11}\>0)\}.$

2.1.2 Step 2: Identify common changes

We identify the common changes $\Delta_{c}$ by using the modified Longest Common Edit Operation Subsequence (LCEOS) algorithm [17]. LCEOS iteratively compares the node operations in $\Delta_{i}$ pairwise, such that $\Delta_{c}=\cap_{i=1}^{n}\Delta_{i}$ and, $\forall\,1\leqslant i\leqslant n$ , $\Delta_{c_{i}}\subseteq\Delta_{i}$ , where the threshold $t_{s}$ (0.6) is used to tolerate inexact matches between the node operations of a pair. If it fails to find any identical node operations in the pair, it converts all concrete instances of types, methods and variables in the node operations to the abstract identifiers $\$t$ , $\$m$ and $\$v$ . If the abstracted operations have the same edit type or inheriting type but different labels, they are still considered abstractly equivalent. The result of the matching is a list of common changes $\Delta_{c}$ .

For the pieces of source code $m_{A}$ and $m_{B}$ shown in Fig. 1, we identify the longest common changes as

$\displaystyle\Delta_{c}=\{\textit{pair}_{1(e_{(A_{4})},e_{(B_{2})})},\ \textit{pair}_{2(e_{(A_{5})},e_{(B_{4})})},\ \textit{pair}_{3(e_{(A_{6})},e_{(B_{5})})},\ \textit{pair}_{4(e_{(A_{7})},e_{(B_{6})})},\ \textit{pair}_{5(e_{(A_{8})},e_{(B_{7})})}\},$

i.e., {update ( $m_{A}$ : line 6), insertions ( $m_{A}^{\prime}$ : lines 5’–6’), moves ( $m_{A}$ : lines 7–8)} and {update ( $m_{B}$ : line 5), insertions ( $m_{B}^{\prime}$ : lines 5’–6’), moves ( $m_{B}$ : lines 6–7)}.
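The core of this step is a longest-common-subsequence computation over two edit scripts with a tolerant equivalence test. The following is a minimal sketch of that idea, not the modified LCEOS algorithm itself: operations are simplified to hypothetical (kind, label) tuples, and `difflib.SequenceMatcher` stands in for the similarity measure, with the threshold 0.6 taken from the text.

```python
from difflib import SequenceMatcher


def equivalent(op_a, op_b, threshold=0.6):
    """Operations match if they share the edit kind and have similar labels."""
    kind_a, label_a = op_a
    kind_b, label_b = op_b
    if kind_a != kind_b:
        return False
    return SequenceMatcher(None, label_a, label_b).ratio() >= threshold


def common_operations(ops_a, ops_b, match=equivalent):
    """Longest common subsequence of two edit scripts under `match`."""
    n, m = len(ops_a), len(ops_b)
    # table[i][j] holds the LCS length of ops_a[i:] and ops_b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if match(ops_a[i], ops_b[j]):
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table to recover the matched operation pairs in order.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if match(ops_a[i], ops_b[j]):
            pairs.append((ops_a[i], ops_b[j]))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

Matching on the edit kind and label alone is coarser than the matching described above, which also abstracts identifiers before comparing, so this sketch is not expected to reproduce the five pairs listed for $m_{A}$ and $m_{B}$ .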

Algorithm 2 Generalization
Input: nopSet // a set of node operation pairs (nop)
Output: nopSet // the abstract nopSet
// compare nop pairwise in nopSet
foreach nop in nopSet
$\mid$ nop $\rightarrow c_{i}$ ; /* extract the concrete instances $c_{i}$ (types, methods and variables) from nop */
$\mid$ /* if $c_{i}$ ’s edit type or inheriting type is equivalent */
$\mid$ if AbstractMatch( $c_{i}$ ) is true
$\mid$ $\mid$ // if $c_{i}$ is inconsistent with mapper
$\mid$ $\mid$ if $c_{i}$ <> mapper
$\mid$ $\mid$ $\mid$ omit nop;
$\mid$ $\mid$ $\mid$ continue;
$\mid$ $\mid$ end
$\mid$ $\mid$ /* substitute the abstract identifiers $a_{i}$ ( $\$t$ , $\$m$ and $\$v$ ) for $c_{i}$ in both nop and the method */
$\mid$ $\mid$ nopReplace( $c_{i}$ , $a_{i}$ );
$\mid$ $\mid$ methodReplace( $c_{i}$ , $a_{i}$ );
$\mid$ $\mid$ build mapper( $c_{i}$ , $a_{i}$ );
$\mid$ end
end
return nopSet;

2.1.3 Step 3: Generalize abstract changes

After comparing the node operations in $\Delta_{c}$ pairwise, we generalize them into abstract changes that are applicable to target source code with different identifiers. In a node operation pair, if concrete instances of types, methods or variables have the same edit type or inheriting type but different names, we substitute the abstract identifiers $\$t$ , $\$m$ and $\$v$ for these concrete instances in both node operations of the pair and in the source code itself. Meanwhile, we record the mapper between the abstract identifiers and the concrete instances. In addition, to enforce consistent naming, subsequent node operation pairs that are inconsistent with the current mapper are omitted. Algorithm 2 describes this step.

For the pieces of source code $m_{A}$ and $m_{B}$ shown in Fig. 1, $e_{(A_{4})}$ (Update $O_{6}$ $N_{4}$ ) matches $e_{(B_{2})}$ (Update $O_{5}$ $N_{4}$ ) in $\textit{pair}_{1(e_{(A_{4})},e_{(B_{2})})}$ . We detect the discrepant variable names $e$ vs. iter and record the identifiers mapper {( $e,\$v_{1}$ ), (iter, $\$v_{1}$ )}. Then we substitute the fresh abstract identifier $\$v_{1}$ for $e$ in $e_{(A_{4})}$ (Update $O_{6}$ $N_{4}$ ), $m_{A}$ ( $O_{2}$ , $O_{5}$ , $O_{6}$ ) and $m_{A}^{\prime}$ ( $N_{2}$ , $N_{3}$ , $N_{4}$ ), as well as for iter in $e_{(B_{2})}$ (Update $O_{5}$ $N_{4}$ ), $m_{B}$ ( $O_{2}$ , $O_{3}$ , $O_{5}$ ) and $m_{B}^{\prime}$ ( $N_{2}$ , $N_{3}$ , $N_{4}$ ). We handle $\textit{pair}_{2(e_{(A_{5})},e_{(B_{4})})}$ in the same way: we record the identifiers mapper {(isContentDependent, $\$m_{1}$ ), (isDependent, $\$m_{1}$ )} and substitute $\$m_{1}$ for isContentDependent in $e_{(A_{5})}$ (Move $O_{7}$ $N_{5}$ 1) and $m_{A}$ ( $O_{7}$ ), as well as for isDependent in $e_{(B_{4})}$ (Move $O_{6}$ $N_{5}$ 1) and $m_{B}$ ( $O_{6}$ ).
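The consistent-naming rule of this step can be sketched as follows. This is an illustration under simplifying assumptions rather than Algorithm 2 itself: each node operation pair is reduced to a hypothetical triple (kind, name in $m_{A}$ , name in $m_{B}$ ) with kind being one of 't', 'm' or 'v', and a pair is omitted whenever either name is already bound to a different abstract identifier.

```python
def generalize(pairs):
    """Assign fresh abstract identifiers and omit pairs that break the mapper."""
    map_a, map_b = {}, {}                  # concrete name -> abstract identifier
    counters = {"t": 0, "m": 0, "v": 0}    # next index per identifier kind
    kept, omitted = [], []
    for kind, name_a, name_b in pairs:
        id_a, id_b = map_a.get(name_a), map_b.get(name_b)
        if id_a is None and id_b is None:
            # Both names are new: bind them to the same fresh identifier.
            counters[kind] += 1
            abstract = f"${kind}{counters[kind]}"
            map_a[name_a] = map_b[name_b] = abstract
            kept.append((kind, name_a, name_b))
        elif id_a is not None and id_a == id_b:
            # Both names are already bound to the same identifier.
            kept.append((kind, name_a, name_b))
        else:
            # Inconsistent with the current mapper: omit the pair.
            omitted.append((kind, name_a, name_b))
    return map_a, map_b, kept, omitted
```

With the pairs ( $e$ , iter) and (isContentDependent, isDependent) from the example, this yields the bindings $\$v_{1}$ and $\$m_{1}$ ; a later pair such as ( $e$ , it), which is made up for illustration, would be omitted because $e$ is already bound together with iter.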

2.1.4 Step 4: Extract changes-relevant context

Let $C=\{c_{i}|c_{i}\in$ {DataDepend( $x, y$ ), ControlDepend( $x, y$ ), ContainDepend( $x, y$ )}}. $c_{i}$ is the changes-relevant context of $m_{i}$ ; it contains the unchanged nodes of $\textit{AST}_{i}^{\prime}$ on which the changed nodes in $e_{i}$ depend. The context increases the chance of generating syntactically valid changes and also serves as an anchor for positioning the changes correctly in a new target location.

Formally, the node $y$ depends on the node $x$ if one of the following relationships holds:

DataDepend (node $x$ , node $y$ ): the node $x$ uses or defines a variable whose value is defined in the node $y$ .

ControlDepend (node $x$ , node $y$ ): the node $y$ is control dependent on the node $x$ if $y$ may or may not execute depending on a decision made by $x$ . Formally, given a control-flow graph, $y$ is control dependent on $x$ if (i) $y$ post-dominates every vertex $p$ on a path $x\looparrowright y$ with $p\neq x$ , and (ii) $y$ does not strictly post-dominate $x$ .

ContainDepend (node $x$ , node $y$ ): the node $y$ is containment dependent on the node $x$ if $y$ is a child of $x$ in the AST.

We extract the changes-relevant context $c_{i}$ of the method $m_{i}$ with control, data and containment dependence analysis, and omit the other, irrelevant nodes, because blindly including irrelevant nodes as context would put unnecessary constraints on the potential change locations and result in false negatives during the change location search. After omitting them, we have to recalculate the node positions in $\textit{AST}_{i}$ and $\Delta_{i}$ .

As shown for $\textit{AST}_{A}$ in Fig. 2, we consider (i) $e_{(A_{7})}$ (Insert $N_{5}$ $N_{3}$ 1): the inserted node $N_{5}$ is control dependent on $N_{3}$ and data dependent on $N_{4}$ and $N_{2}$ . Mapping these nodes to the old version yields the context nodes { $O_{2}$ , $O_{5}$ , $O_{6}$ }; (ii) $e_{(A_{5})}$ (Move $O_{7}$ $N_{5}$ 1): the moved node $O_{7}$ depends on the nodes $c_{1}=\{O_{2},O_{5},O_{6}\}$ in the old $m_{A}$ , while $N_{7}$ , $N_{5}$ ’s child at position 1, depends on the nodes $c_{2}=\{N_{2},N_{3},N_{4},N_{5}\}$ in the new $m_{A}^{\prime}$ . After extracting the context, we omit the irrelevant nodes drawn with dotted lines, such as { $O_{3}$ , $O_{4}$ , $O_{9}$ } in $\textit{AST}_{A}$ and $N_{9}$ in $\textit{AST}_{A}^{\prime}$ . Then we reset the node positions in $\textit{AST}_{A}$ and $\textit{AST}_{A}^{\prime}$ .
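Case (i) can be replayed with a small sketch. It assumes the dependence relations have already been computed and are given as a plain dictionary, and it only follows direct dependences. The function name `extract_context` and the new-to-old node mapping ( $N_{2}\mapsto O_{2}$ , $N_{3}\mapsto O_{5}$ , $N_{4}\mapsto O_{6}$ ) are assumptions inferred from the example, not part of the method's description.

```python
def extract_context(changed, depends_on, new_to_old):
    """Collect the nodes that the changed nodes depend on, then map them back.

    changed:     set of changed node ids in the new AST
    depends_on:  dict from node id to the ids it depends on (data, control
                 or containment dependence, precomputed elsewhere)
    new_to_old:  dict mapping new-AST node ids to old-AST node ids
    """
    context_new = set()
    for node in changed:
        for dependency in depends_on.get(node, ()):
            if dependency not in changed:
                context_new.add(dependency)
    # Nodes without a counterpart in the old version are skipped.
    context_old = {new_to_old[n] for n in context_new if n in new_to_old}
    return context_new, context_old


# Example (i): the inserted node N5 depends on N3 (control) and N4, N2 (data).
depends_on = {"N5": {"N3", "N4", "N2"}}
new_to_old = {"N2": "O2", "N3": "O5", "N4": "O6"}  # assumed mapping
context_new, context_old = extract_context({"N5"}, depends_on, new_to_old)
```

Every node outside the returned sets would be a candidate for omission, after which the child positions in the edit operations have to be recomputed.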

2.1.5 Step 5: Generalize abstract contexts

By using the LCEOS algorithm [17], we also identify the common contexts $C_{c}$ by iteratively comparing the nodes in $c_{i}$ pairwise, such that $C_{c}=\cap_{i=1}^{n}c_{i}$ and, $\forall\,1\leqslant i\leqslant n$ , $C_{c_{i}}\subseteq c_{i}$ . Then we compare the nodes in $C_{c}$ pairwise and generalize them into abstract contexts.

For the pieces of source code $m_{A}$ and $m_{B}$ shown in Fig. 1, we identify the common contexts, such as $\textit{pair}_{3(m_{A(O_{2})},m_{B(O_{2})})}$ and $\textit{pair}_{4(m_{A(O_{5})},m_{B(O_{3})})}$ . Then we generalize the abstract contexts for $\textit{pair}_{3(m_{A(O_{2})},m_{B(O_{2})})}$ as we do in step 3: we record the identifiers mapper {(fActions, $\$m_{2}$ ), (getActions, $\$m_{2}$ )} and substitute $\$m_{2}$ for fActions in $m_{A(O_{2})}$ and getActions() in $m_{B(O_{2})}$ .

Figure 3.

Analysis of the poor or good results.

2.1.6 Step 6: Generate abstract change script

We merge all common contexts $C_{c}$ into a single abstract changes-relevant context $C$ , and all common changes $\Delta_{c}$ into a single abstract change $\Delta$ . For the pieces of source code $m_{A}$ and $m_{B}$ shown in Fig. 1, $C$ is as follows:

1. public void method declaration (…)

2. Iterator $\$v_{1}=\$m_{2}$ (…).values().iterator();

3. while ( $\$v_{1}$ .hasNext())

4. MVAction action $=$ (MVAction) $\$v_{1}$ .next();

5. if (action. $\$m_{1}$ ())

6. action.update();

$\Delta$ is as follows:

Update $O_{4}$ $N_{4}$

$O_{4}=$ ’MVAction action $=$ (MVAction) $\$v_{1}$ .next();’

$N_{4}=$ ’object next $=$ $\$v_{1}$ .next();’

Move $O_{5}$ $N_{5}$ 1 $O_{5}=$ ’if(action. $\$m_{1}$ ())’

Move $O_{6}$ $N_{7}$ 0 $O_{6}=$ ’action.update()’

Insert $N_{5}$ $N_{3}$ 1 $N_{5}=$ ’if(next instanceof MVAction)’

Insert $N_{6}$ $N_{5}$ 0 $N_{6}=$ ’MVAction action $=$ (MVAction) next;’

We think of $C$ and $\Delta$ as the abstract change script implying that a piece of source code $C$ would undergo the changes $\Delta$ and become a new piece of source code $C^{\prime}$ , such that $C+\Delta=C^{\prime}$ , as shown in Fig. 3.

2.2 Concretizing algorithm

By using the abstract change script, the source code can be changed automatically. Algorithm 3 describes this process in three steps: produce the mapper, customize the concrete changes and replicate the changes.

2.2.1 Step 1: Produce nodes/identifiers mapper

For the source code $m$ and the abstract context $C$ of the abstract change script, we use MatchingAbstractContexttoTargetTree [18, 19] to establish the nodes mapping: we find the nodes in $m$ that match the nodes in $C$ , and induce the one-to-one identifiers mapper between the abstract identifiers in $C$ and the concrete identifiers in $m$ . If every node in $C$ has a match in $m$ , the concrete changes are derived to customize $m$ ; otherwise, the method reports that the change script cannot replicate the changes on $m$ .

For the piece of source code $m_{C}$ shown in Fig. 1, the nodes mapping ( $C,m_{C}$ ) $=$ {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)} and the instances/identifiers mapper {( $\$v_{1},e$ ), ( $\$m_{1}$ , isSelectionDependent), ( $\$m_{2}$ , fActions)} are established by matching the nodes between $C$ and $\textit{AST}_{C}$ .
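How an identifiers mapper can be induced from one matched node is sketched below. This is a deliberately simplified, text-level stand-in for the tree matching: an abstract context line is turned into a regular expression in which every abstract identifier matches one concrete identifier. The function name `match_line` and the example statements are hypothetical simplifications of the lines of $C$ and $m_{C}$ .

```python
import re

PLACEHOLDER = re.compile(r"\$[tmv]\d+")


def match_line(template, concrete, mapper):
    """Match an abstract line against a concrete line and extend the mapper.

    Returns the extended mapper, or None if the lines do not match or an
    abstract identifier would be bound to two different concrete names.
    """
    pattern, names, position = "", [], 0
    for placeholder in PLACEHOLDER.finditer(template):
        pattern += re.escape(template[position:placeholder.start()]) + r"(\w+)"
        names.append(placeholder.group())
        position = placeholder.end()
    pattern += re.escape(template[position:])
    found = re.fullmatch(pattern, concrete)
    if found is None:
        return None
    extended = dict(mapper)
    for name, value in zip(names, found.groups()):
        if extended.setdefault(name, value) != value:
            return None
    return extended
```

Applying `match_line` to each context line in turn, and stopping at the first `None`, mirrors the rule that the script is applicable only if every node in $C$ has a match in $m$ . The sketch compares text literally, so it is sensitive to whitespace, and it does not check that two abstract identifiers map to different concrete names.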

Algorithm 3 Concretizing
Input: $C+\Delta$ // the abstract change script
$m$ // the target source code
Output: $m^{\prime}$ // the changed source code
// every node in $C$ has a match in $m$
if Match(each node in $C$ , each node in $m$ ) is true
$\mid$ // establish nodes mapping
$\mid$ build the mapper(instances,identifiers);
else
$\mid$ report(“cannot replicate the changes”);
$\mid$ return;
end
$\Delta\xrightarrow[]{mapper}\Delta_{m}$ ; /* customize the concrete changes script for the target source code */
$m+\Delta_{m}=m^{\prime}$ ; // auto-change $m$
return $m^{\prime}$ ;

2.2.2 Step 2: Customize concrete change script

We replace all abstract identifiers in the abstract changes $\Delta$ with the corresponding concrete names of the source code $m$ based on the instances/identifiers mapper. Then we recalculate node positions of $\Delta$ with respect to the concrete nodes in $m$ .

For the piece of source code $m_{C}$ shown in Fig. 1, we customize the concrete changes $\Delta_{m}$ for $m_{C}$ based on the instances/identifiers mapper: we substitute $e$ , isSelectionDependent() and fActions for all instances of $\$v_{1}$ , $\$m_{1}$ () and $\$m_{2}$ () in $\Delta$ , as follows:

Update $O_{4}$ $N_{4}$

$O_{4}=$ ’MVAction action $=$ (MVAction)e.next();’

$N_{4}=$ ’object next $=$ e.next();’

Move $O_{5}$ $N_{5}$ 1 $O_{5}=$ ’if(action.isSelectionDependent())’

Move $O_{6}$ $N_{7}$ 0 $O_{6}=$ ’action.update();’

Insert $N_{5}$ $N_{3}$ 1 $N_{5}=$ ’if (next instanceof MVAction)’

Insert $N_{6}$ $N_{5}$ 0 $N_{6}=$ ’MVAction action $=$ (MVAction) next;’
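The substitution itself is mechanical and can be sketched in a few lines, assuming that abstract identifiers have the textual form `$t`, `$m` or `$v` followed by an index and that the mapper is a dictionary from abstract identifiers to concrete names. The function name `concretize` is hypothetical.

```python
import re


def concretize(abstract_text, mapper):
    """Replace every abstract identifier in the text with its concrete name."""
    def substitute(match):
        identifier = match.group()
        if identifier not in mapper:
            raise KeyError(f"no concrete name recorded for {identifier}")
        return mapper[identifier]

    return re.sub(r"\$[tmv]\d+", substitute, abstract_text)


# Mapper established for the target snippet in the previous step.
mapper = {"$v1": "e", "$m1": "isSelectionDependent", "$m2": "fActions"}
old_label = concretize("MVAction action = (MVAction) $v1.next();", mapper)
moved_label = concretize("if(action.$m1())", mapper)
```

Raising an error for an unmapped identifier is a design choice of this sketch: it prevents a partially abstract statement from being written into the target source code.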

2.2.3 Step 3: Change source code

By using $\Delta_{m}$ , we transform the source code $m$ into the source code $m^{\prime}$ that meets the user needs, such that $m+\Delta_{m}=m^{\prime}$ .

For the piece of source code $m_{C}$ shown in Fig. 1, $m_{C}$ automatically undergoes the update ( $m_{C}$ : line 4), the insertions ( $m_{C}^{\prime}$ : lines 5’–6’) and the moves ( $m_{C}$ : lines 5–6) without Jim’s manual changes.

3. Experiments

To evaluate our method, we (i) used a web spider to grab code snippets from GitHub and organize them by topic, (ii) extracted 7 topics and collected 5–6 code snippets for each topic, and (iii) saved them as the testing set: $T_{1}$ (Conversion of Arabic numbers to Roman numerals, 5), $T_{2}$ (Computing the Easter holiday for a given year, 5), $T_{3}$ (Generating the complementary DNA seq, 6), $T_{4}$ (Sharpening an image, 5), $T_{5}$ (Sorting objects using QuickSort, 6), $T_{6}$ (Computing the MD5 hash of a string, 5) and $T_{7}$ (Capturing the screen into an image, 5). For example, for $T_{1}$ , “Conversion of Arabic numbers to Roman numerals” is the topic post and 5 is the number of code snippets posted as replies.

We used accuracy as the metric, defined as the syntactic similarity between the output produced by our method and the expected output. For each source code pair ( $m_{1},m_{2}$ ) that experienced similar changes, we computed the accuracy by Eq. (1), where matchingNodes( $m_{1}$ , $m_{2}$ ) is the number of matched AST node pairs computed by ChangeDistiller, and size( $m_{1}$ ) and size( $m_{2}$ ) are the numbers of AST nodes in $m_{1}$ and $m_{2}$ .

$\displaystyle\textit{accuracy}(m_{1},m_{2})=\frac{\textit{matchingNodes}(m_{1},m_{2})}{\textit{size}(m_{1})+\textit{size}(m_{2})}$ (1)
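Eq. (1) translates directly into code. The sketch below assumes that the number of matched node pairs and the two AST sizes are already available as integers; the function and parameter names are hypothetical.

```python
def accuracy(matching_nodes, size_1, size_2):
    """Accuracy as in Eq. (1): matched node pairs over the summed AST sizes."""
    total = size_1 + size_2
    if total == 0:
        raise ValueError("both ASTs are empty")
    return matching_nodes / total


# Illustrative numbers only: 6 matched pairs between two ASTs of 10 nodes each.
example = accuracy(6, 10, 10)
```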

Table 1

Accuracy of the five experimental results

		Accuracy (%)
Topic	Num	A	B	C	D	E
T1	2	36.00	65.00	70.00	71.00	74.00
	3	35.00	64.00	69.00	70.00	74.25
	4	35.00	63.00	68.00	69.00	74.50
	5	34.00	63.00	68.00	69.00	74.50
Ave		35.00	63.75	68.75	69.75	74.31
T2	2	35.50	0.00	70.50	71.50	73.00
	3	35.00	0.00	69.25	71.00	73.00
	4	34.00	0.00	68.50	71.00	74.00
	5	33.00	0.00	68.50	71.00	74.00
Ave		34.37	0.00	69.18	71.12	73.50
T3	2	79.00	–	–	–	–
	3	79.00	–	–	–	–
	4	80.00	–	–	–	–
	5	82.00	–	–	–	–
	6	83.00	–	–	–	–
Ave		80.60	–	–	–	–
T4	2	36.00	0.00	0.00	71.50	73.25
	3	35.00	0.00	0.00	71.25	74.00
	4	35.00	0.00	0.00	71.00	74.00
	5	34.00	0.00	0.00	71.00	74.00
Ave		35.00	0.00	0.00	71.18	73.81
T5	2	77.00	–	–	–	–
	3	78.00	–	–	–	–
	4	79.00	–	–	–	–
	5	79.00	–	–	–	–
	6	80.00	–	–	–	–
Ave		78.60	–	–	–	–
T6	2	35.00	65.00	70.25	70.25	74.00
	3	34.00	63.00	69.00	69.00	74.25
	4	33.00	0.00	0.00	0.00	75.00
	5	33.00	0.00	0.00	0.00	76.00
Ave		33.75	32.00	34.81	34.81	74.56
T7	2	35.00	65.00	70.00	71.25	72.50
	3	33.25	64.00	69.50	71.00	73.00
	4	33.00	63.00	68.50	69.00	73.25
	5	33.00	62.00	68.00	68.00	73.50
Ave		33.56	63.50	63.50	69.81	73.06
Total Ave		47.26	31.85	47.24	63.33	73.84

We conducted 5 experiments with this testing set, in which we (i) analyzed the reasons for low accuracy, (ii) explored 2 sensitivity-related rules and (iii) used the rules to improve the accuracy step by step. Table 1 lists the accuracy of the five experiments, where “topic” gives the 7 topics, “num” gives the number of code snippets used for generating the change pattern per topic, and “A–E” give the results of the five experiments. For example, Table 1-A denotes the accuracy results of experiment-A.
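The last row of Table 1 can be recomputed from the per-topic averages. The sketch below assumes, since the text does not state it, that this row is the unweighted mean of the per-topic averages over the topics that report a value in the column; a dash in the table is modelled as `None`.

```python
from statistics import mean


def total_average(topic_averages):
    """Mean of the per-topic averages, skipping topics without a result."""
    reported = [value for value in topic_averages.values() if value is not None]
    return mean(reported)


# Per-topic averages copied from columns A and E of Table 1.
column_a = {"T1": 35.00, "T2": 34.37, "T3": 80.60, "T4": 35.00,
            "T5": 78.60, "T6": 33.75, "T7": 33.56}
column_e = {"T1": 74.31, "T2": 73.50, "T3": None, "T4": 73.81,
            "T5": None, "T6": 74.56, "T7": 73.06}
total_a = total_average(column_a)
total_e = total_average(column_e)
```

Under this assumption, column E gives 369.24 / 5 = 73.848 and column A gives 330.88 / 7, which is about 47.27. Both agree with the reported 73.84% and 47.26% up to rounding.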

3.1 Effect of the different code snippets

Experiment-A: We collected 7 topics, each of which had several different code snippets posted by users. From the code changes of these code snippets, we generated the abstract change script for the topic. Then we applied the script to transform other code snippets in the same topic, and used the accuracy to evaluate whether our method was effective.

Table 1-A shows poor results for $T_{1}$ (35%), $T_{2}$ (34.37%), $T_{4}$ (35%), $T_{6}$ (33.75%) and $T_{7}$ (33.56%), but promising results for $T_{3}$ (80.6%) and $T_{5}$ (78.6%). To find the reason, we explored how sensitive our method is to the similarity and the number of code snippets.

We plotted the poor results ( $T_{1}$ , $T_{2}$ , $T_{4}$ , $T_{6}$ and $T_{7}$ in Table 1-A) in Fig. 3a, where the accuracy goes down. It illustrates that the more code snippets are provided, the less similar they are and the less context they are likely to share. Because using multiple different code snippets reduces the common changes, the derived change is likely to be less accurate.

We plotted the promising results ( $T_{3}$ and $T_{5}$ in Table 1-A) in Fig. 3b. Unlike in Fig. 3a, the accuracy goes up when more code snippets are given. It illustrates that the accuracy does not vary consistently with the number of code snippets but depends strongly on their similarity. If the code snippets are diverse, we extract fewer common changes, which decreases the accuracy; if they are similar, adding code snippets may not decrease the number of common changes but may induce more identifier abstraction and generate more flexible changes, which increases the accuracy.

In sum, if the code snippets are different, the behavior of $T_{1}$ , $T_{2}$ , $T_{4}$ , $T_{6}$ and $T_{7}$ in Table 1-A occurs; otherwise that of $T_{3}$ and $T_{5}$ in Table 1-A occurs. We thus found sensitivity-related rule (1): the accuracy varies with the similarity and the number of code snippets, and the similarity takes precedence over the number. The more similar the code snippets and the more of them are given, the higher the accuracy.

3.2 Effect of the representative code snippet

To improve the above poor results ( $T_{1}$ , $T_{2}$ , $T_{4}$ , $T_{6}$ and $T_{7}$ in Table 1-A), we changed experiment-A to experiment-B based on rule (1). We picked a representative code snippet out of all snippets of a topic. Then we generated the abstract change script from the code changes of the representative code snippet instead of all code snippets, because the similarity between different versions of the same code snippet is higher than that between different code snippets.

In experiment-A, $m_{i}$ $+$ $\Delta_{i}$ $=$ $m_{i}^{\prime}$ : the code snippet $m_{i}$ underwent the changes $\Delta_{i}$ and became a new code snippet $m_{i}^{\prime}$ , so $\Delta_{i}$ is the changes of the $i$ -th code snippet. For example, $\Delta_{1}$ is the changes of the first code snippet $m_{1}$ , and $\Delta_{2}$ is the changes of the second code snippet $m_{2}$ . In experiment-B, we let $m_{(i-1)}+\Delta_{i}=m_{i}^{\prime}$ instead of $m_{i}+\Delta_{i}=m_{i}^{\prime}$ . Each code snippet has many changed versions, and $\Delta_{i}$ represents the changes between the current changed version and the previous one, such that $\Delta_{i}=m_{i}^{\prime}-m_{(i-1)}$ . For example, $\Delta_{1}$ is the first change of the code snippet $m$ , and $\Delta_{2}$ is the second change, applied to the modified code snippet $m_{1}$ . Note that the “num” column in Table 1 then represents the number of versions of the representative code snippet. To evaluate whether the accuracy of $T_{1}$ , $T_{2}$ , $T_{4}$ , $T_{6}$ and $T_{7}$ rises, we conducted experiment-B. Table 1-B shows better results for $T_{1}$ (63.75%) and $T_{7}$ (63.5%) than those in Table 1-A.

3.3 Effect of the stable change pattern

In Table 1-B, the accuracy of $T_{2}$ was 0 because every new change was added on top of the previous ones. Suppose that only $m$ exists; then $m_{1}=m+\Delta_{1}$ , $m_{2}=m+\Delta_{1}+\Delta_{2}$ and $m_{3}=m+\Delta_{1}+\Delta_{2}+\Delta_{3}$ with $\Delta_{1}\neq\Delta_{2}\neq\Delta_{3}$ , such that $(m_{1}-m)\cap(m_{2}-m_{1})\cap(m_{3}-m_{2})=\Delta_{1}\cap\Delta_{2}\cap\Delta_{3}=\Phi$ . In this case, our method decided that there was no common change.

To improve it, we changed experiment-B to experiment-C. In experiment-C, we let $\Delta_{i}=m_{i}^{\prime}-m$ instead of $\Delta_{i}=m_{i}^{\prime}-m_{(i-1)}$ , so that $\Delta_{i}$ represents the accumulated changes of the same code snippet $m$ . For example, $\Delta_{1}$ is the first change of the code snippet $m$ , and $\Delta_{2}$ is the second change of the same code snippet $m$ . In this case, for $T_{2}$ we have $(m_{1}-m)\cap(m_{2}-m)\cap(m_{3}-m)=\Delta_{1}\cap(\Delta_{1}+\Delta_{2})\cap(\Delta_{1}+\Delta_{2}+\Delta_{3})=\Delta_{1}$ , and our method decides that there are common changes $\Delta_{1}$ .
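The difference between experiment-B and experiment-C can be replayed with plain sets. The sketch below makes a strong simplifying assumption: a version of a snippet is modelled as the set of edits applied to it so far, and all edits are additive. The function names are hypothetical.

```python
def incremental_deltas(base, versions):
    """Experiment-B style: each delta is taken against the previous version."""
    deltas, previous = [], base
    for version in versions:
        deltas.append(version - previous)
        previous = version
    return deltas


def cumulative_deltas(base, versions):
    """Experiment-C style: each delta is taken against the original snippet."""
    return [version - base for version in versions]


def common_changes(deltas):
    """Intersection of all deltas; empty if there are no deltas."""
    if not deltas:
        return set()
    return set(deltas[0]).intersection(*deltas[1:])


base = set()
versions = [{"d1"}, {"d1", "d2"}, {"d1", "d2", "d3"}]
common_b = common_changes(incremental_deltas(base, versions))
common_c = common_changes(cumulative_deltas(base, versions))
```

Here `common_b` is empty because the three incremental deltas are disjoint, whereas `common_c` contains the first change, which every accumulated delta shares.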

To evaluate whether $T_{2}$ recovers, we conducted experiment-C. Table 1-C shows that $T_{2}$ recovers from 0% to 69.18%, and that $T_{1}$ and $T_{7}$ are also a little better than in Table 1-B.

3.4 Effect of the first modification

In Table 1-C, the accuracy of $T_{4}$ was still 0. We discovered that all changes $\Delta_{i}$ were similar except the first changes $\Delta_{1}$ , because $\Delta_{1}$ involved a large number of changes to fix bugs while the subsequent $\Delta_{i}$ involved small changes unrelated to bugs. In this case, $\Delta_{1}$ was an outlier such that $\Delta_{\textit{outlier}}\cap\Delta_{i}=\Phi$ for every other $\Delta_{i}$ .

To evaluate whether $T_{4}$ recovers, we changed experiment-C to experiment-D: we identified the common changes $\Delta_{c}$ starting from $\Delta_{2}$ instead of $\Delta_{1}$ , such that $\Delta_{c}=\cap_{i=2}^{n}\Delta_{i}$ . Table 1-D shows that $T_{4}$ recovered from 0% to 71.18%, and that $T_{1}$ , $T_{2}$ and $T_{7}$ were also a little better than in Table 1-C. It illustrates that our method is sensitive not only to the similarity of the code snippets $m_{i}$ but also to the similarity of the changes $\Delta_{i}$ .

3.5 Effect of the similar changes

To explore how sensitive our method is to the similarity of $\Delta_{i}$ , we revisited Table 1-B. We plotted (i) the results of $T_{1}$ as Fig. 4(a- $T_{1}$ ), where the accuracy decreased gradually because no $\Delta_{\textit{outlier}}$ existed; (ii) the results of $T_{4}$ as Fig. 4(a- $T_{4}$ ), where the accuracy dropped to 0 at the second change and never went up again because $\Delta_{1}$ was an outlier; and (iii) the results of $T_{6}$ as Fig. 4(a- $T_{6}$ ), where the accuracy dropped to 0 at the fourth change because $\Delta_{4}$ was an outlier. In sum, the accuracy decreased with the number of $\Delta_{i}$ ; if a $\Delta_{\textit{outlier}}$ existed, it dropped suddenly to 0 and never recovered. Thus sensitivity-related rule (2) is: the more similar the changes $\Delta_{i}$ , the higher the accuracy.

Figure 4. Before and after the heuristic strategy.

To avoid the ill effects of $\Delta_{\textit{outlier}}$, such as the sudden decrease, we modified experiment-D into experiment-E. We designed a heuristic changes-choosing strategy that chooses the top-$N$ similar $\Delta_{i}$ with the largest intersection and actively excludes $\Delta_{\textit{outlier}}$. Here $N$ is the maximum number of changes $\Delta_{i}$ that are intersected to form the common changes $\Delta_{c}$. Suppose that $N=2$. When $\Delta_{1}$ occurs, $\Delta_{c}=\Delta_{1}$; when $\Delta_{2}$ occurs next, $\Delta_{c}=\Delta_{1}\cap\Delta_{2}$; when $\Delta_{3}$ then occurs, if $\Delta_{1}\cap\Delta_{2}\neq\Phi$, $\Delta_{1}\cap\Delta_{3}=\Phi$ and $\Delta_{2}\cap\Delta_{3}=\Phi$, we choose $\Delta_{c}=\Delta_{1}\cap\Delta_{2}$ as the largest intersection and omit $\Delta_{3}$; when $\Delta_{4}$ occurs, if $|\Delta_{1}\cap\Delta_{4}|$ is larger than both $|\Delta_{1}\cap\Delta_{2}|$ and $|\Delta_{2}\cap\Delta_{4}|$, we choose $\Delta_{c}=\Delta_{1}\cap\Delta_{4}$ and omit $\Delta_{2}$; and so on.
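The changes-choosing strategy can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: each $\Delta_{i}$ is assumed to be a set of hashable edit labels, the "largest intersection" is read as the intersection with the most elements, and ties are resolved in favour of the changes already kept. The names `intersect` and `choose_changes` are hypothetical.

```python
from itertools import combinations


def intersect(deltas):
    """Intersection of an iterable of change sets (empty if none are given)."""
    sets = [frozenset(d) for d in deltas]
    return frozenset.intersection(*sets) if sets else frozenset()


def choose_changes(deltas, n=2):
    """Process changes in order, keeping at most n of them with the largest intersection.

    Returns (1-based indices of the kept changes, their common changes Delta_c).
    """
    kept = []  # 0-based indices of the changes currently kept
    for idx in range(len(deltas)):
        candidates = kept + [idx]
        if len(candidates) <= n:
            kept = candidates
            continue
        # Compare every n-subset of the kept changes plus the new one.
        # max() returns the first maximum, and the first subset is the
        # current choice, so a new change only wins if strictly better.
        best = max(
            combinations(candidates, n),
            key=lambda combo: len(intersect(deltas[i] for i in combo)),
        )
        kept = list(best)
    return [i + 1 for i in kept], intersect(deltas[i] for i in kept)


# Hypothetical walk-through of the N = 2 example in the text.
example = [{"a", "b"}, {"a", "c"}, {"x"}, {"a", "b", "d"}]
print(choose_changes(example, n=2))
```

For this input, $\Delta_{3}$ is skipped because it shares nothing with the kept changes, and $\Delta_{4}$ then replaces $\Delta_{2}$ because $\Delta_{1}\cap\Delta_{4}$ has more elements than $\Delta_{1}\cap\Delta_{2}$. Comparing only the kept changes with the newcomer keeps each step cheap, at the cost of never reconsidering a change that was omitted earlier.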

We conducted experiment-E with $N$ set to its default value of 2. Surprisingly, Table 1-E shows that $T_{6}$, whose accuracy previously dropped to 0 at the fourth change, now kept increasing, while $T_{4}$, whose accuracy previously dropped to 0 at the second change, also kept increasing and even recovered from 34.81% to 74.56% on average. $T_{1}$, $T_{2}$, $T_{4}$, $T_{7}$ and $T_{8}$ were also slightly better than in Table 1-D.

In addition, from Table 1-E we (i) plotted the results of $T_{1}$ as Fig. 4(b-$T_{1}$), where the accuracy kept increasing with the number of code snippets; (ii) plotted the results of $T_{4}$ as Fig. 4(b-$T_{4}$), where the accuracy dropped suddenly to 0 at the second change but went up again despite the outlier $\Delta_{1}$; (iii) plotted the results of $T_{6}$ as Fig. 4(b-$T_{6}$), where the accuracy kept increasing despite the outlier $\Delta_{4}$. In sum, the heuristic changes-choosing strategy is effective: no matter how many times the code snippet is changed, the accuracy is maintained or increases with the number of $\Delta_{i}$, or recovers soon after a sudden drop to 0 at the second change.

4. Conclusion and future work

In this paper, we proposed a generative transformation based on code change. Instead of relying on a formal specification, this method transforms the retrieved source code by using an abstract change script. Through a series of experiments, we not only summarized the best practice for our method, but also confirmed that it can help users reuse source code from the Internet with 73.84% accuracy.

However, there are still threats to validity. The first is the human subjects: their limited number and varying programming capabilities may bias the results. In the future, we plan to conduct more experiments and user studies. The second is the small size of the testing set, which contains only 7 topics with 5–6 pieces of source code per topic. In the future, we plan to investigate more queries over a much larger codebase.

Footnotes

https://github.com/explore

Acknowledgments

This work was supported by the Key Scientific Research Projects of Henan Province High Talent Scientific Research Project no. 15A520022.

References

1. Little and Miller R.C., Keyword programming in Java, in: Proc twenty-second IEEE/ACM International Conference on Automated Software Engineering, Atlanta, Georgia, USA, 2007, pp. 84–93.

2. Zaremski A.M. and Wing J.M., Signature matching: A key to reuse, in: Proc SIGSOFT ’93 Proceedings of the 1st ACM SIGSOFT Symposium on Foundations of Software Engineering, Los Angeles, California, USA, 1993, pp. 182–190.

3. Rollins E.J. and Wing J.M., Specifications as search keys for software libraries, ACM Transactions on Software Engineering and Methodology (TOSEM) (1997), 333–369.

4. Lemos, Lazzarini V.A., Ossher, Morla, Baldi and Lopes, CodeGenie: Using test-cases to search and reuse source code, in: Proc twenty-second IEEE/ACM International Conference on Automated Software Engineering, Atlanta, Georgia, USA, 2007, pp. 525–526.

5. Baxter I.D., Pidgeon and Mehlich, DMS®: Program transformations for practical scalable software evolution, in: Proc 26th International Conference on Software Engineering, Society Washington, DC, USA, 2004, pp. 625–634.

6. Eaddy, Zimmermann, Sherwood K.D., Garg, Murphy G.C., Nagappan and Aho A.V., Do crosscutting concerns cause defects, IEEE Transactions on Software Engineering SE-34(4) (2008), 497–515.

7. Gulwani, Automating string processing in spreadsheets using input-output examples, in: Proc 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Austin, Texas, USA, 2011, pp. 317–330.

8. Landauer and Hirakawa, Visual AWK: A model for text processing by demonstration, in: Proc Symposium on Visual Languages, Darmstadt, Germany, 2002, pp. 267–274.

9. Stolee K.T., Elbaum and Dwyer M.B., Code search with input/output queries: Generalizing, ranking, and assessment, Journal of Systems and Software, 2016, pp. 35–48.

10. Gale L.P., Recommendations and proposals for an Ada strategy in the space software development environment, in: Proc International Eurospace-Ada-Europe Symposium, Europe, 1994, pp. 175–203.

11. Kim, Cai and Kim, An empirical investigation into the role of API-level refactorings during software evolution, in: Proc 33rd International Conference on Software Engineering, Waikiki, Honolulu, HI, USA, 2011, pp. 151–160.

12. Reiss S.P., Semantics-based code search, in: Proc 31st International Conference on Software Engineering, Society Washington, DC, USA, 2009, pp. 243–253.

13. Weimer, Nguyen T.V., Goues C.L. and Forrest, Automatically finding patches using genetic programming, in: Proc 31st International Conference on Software Engineering, Society Washington, DC, USA, 2009, pp. 364–374.

14. Kim, Nam, Song and Kim, Automatic patch generation learned from human-written patches, in: Proc 2013 International Conference on Software Engineering, San Francisco, CA, USA, 2013, pp. 802–811.

15. Jeffrey, Feng, Gupta and Gupta, BugFix: A learning-based tool to assist developers in fixing bugs, in: Proc 2009 IEEE 17th International Conference on Program Comprehension, Vancouver, BC, Canada, 2009, pp. 70–79.

16. Gopinath, Malik M.Z. and Khurshid, Specification-based program repair using SAT, in: Proc International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2011, pp. 173–188.

17. Hunt J.W. and Szymanski T.G., A fast algorithm for computing longest common subsequences, Communications of the ACM (1977), 350–353.

18. Meng, Kim and McKinley K.S., Systematic editing: Generating program transformations from an example, in: Proc 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, San Jose, California, USA, 2011, pp. 329–342.

19. Fluri, Wursch, Pinzger and Gall H.C., Change distilling-tree differencing for fine-grained source code change extraction, IEEE Transactions on Software Engineering, SE-33(11) (2007), 725–743.