An Empirical Study of Code Clone Clustering Based on Clone Evolution

Fanlong Zhang, Xiaohong Su, Wen Zhao, Tiantian Wang

(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)

Abstract:

Software contains many code clones, that is, code fragments that are similar to one another. Over the past decades, researchers have proposed a range of methods to detect clones. Code clones are also related to the evolution of software. To explore the relationships between clones and their evolution, we propose a framework that clusters clones with the Fuzzy C-means clustering method. First, we detect all clones using NiCad and build clone genealogies across multiple versions of the software. Second, we extract metrics that describe the clones and their evolution. Finally, we cluster the clone vectors, which are generated from different metrics for different purposes. Experimental results on six open-source software packages show relationships among clone life, the number of change times, clone patterns, and other metrics, which can help developers understand clones.

Key words: code clones; clone clustering; clone analysis; clone evolution; empirical study

DOI: 10.11916/j.issn.1005-9113.15316

CLC Number: TP311.5

Citation: Fanlong Zhang, Xiaohong Su, Wen Zhao, Tiantian Wang. An Empirical Study of Code Clone Clustering Based on Clone Evolution[J]. Journal of Harbin Institute of Technology, 2017, 24(2): 10-18. DOI: 10.11916/j.issn.1005-9113.15316.

Fund: Sponsored by the National Natural Science Foundation of China (Grant No. 61173021)
Corresponding author: Fanlong Zhang, E-mail: hitzhangfanlong@gmail.com
Received: 2015-11-03

1 Introduction

Code clones are code fragments that are similar to one another, and they can make up 7%-23% of a software system^[1]. Several factors lead developers and maintenance engineers to create clones, such as code reuse, maintenance benefits, and the need to work around underlying limitations^[2]. Researchers have proposed many clone detection methods, which can be divided into text-based, token-based, AST-based, PDG-based, and metrics-based approaches. Code clones evolve together with the software, and their evolution can be described by the clone genealogy proposed by Kim^[3]. In addition, Duala-Ekoko tracks clones in software with Clone Region Descriptors (CRD)^[4].

Clones have both positive and negative impacts on software. Some researchers find that clones can introduce more defects, which may be caused by inconsistent changes^[5]. Clones can also incur additional maintenance costs. On the other hand, clones bring some benefits. Some researchers find that cloned code is more stable than non-cloned code^[6-7], and reusing existing code, which creates more clones, can reduce development time. We should therefore take a neutral attitude toward clones, mitigating their negative effects while making full use of their benefits.

Developers can detect millions of clones with clone detection tools and can build clone genealogies to describe their evolution. Clones and their evolution may each reveal something individually. But what about the relationships between clones and their evolution? Can these relationships help developers understand clones more deeply and suggest more effective ways to maintain and manage them? In this paper, to reveal these relationships, we propose a framework that clusters clones using a machine learning method, Fuzzy C-means (FCM). First, we use NiCad, a clone detection tool, to detect all clones in the software and build the clone genealogy from the detection results. Second, we extract metrics that describe the clones and their evolution, such as clone life, the number of change times, and clone patterns. From these metrics we generate a clone clustering vector for each clone. Depending on the maintenance objective, we use different metrics to generate different vectors, and we apply FCM to these vectors rather than to the clone fragments themselves. Finally, we conduct an empirical study on six open-source software packages. Our study reveals some interesting findings that can help developers understand and manage clones better.

The contributions of this paper are as follows:

1) We pose four questions about the relationships between clones and their evolution.

2) We propose a framework to cluster clones with the FCM method. We extract metrics from clones and clone groups and generate a clone clustering vector from these metrics. Using these vectors, we cluster all clones with FCM.

3) To answer the four questions, we conduct an empirical study on six open-source software systems and obtain findings that can help developers better understand and manage clones.

This paper is organized as follows. Section 2 gives a brief review of code clone research. Section 3 presents the motivation and the four questions that describe the relationship between clones and their evolution. Section 4 describes our method for clustering clones with FCM. Section 5 presents the methodology and briefly introduces the six software systems. Section 6 reports and discusses the results. Section 7 describes the limitations and our future work. Section 8 draws the conclusions.

2 Related Work

Clone research dates back to the 1990s. According to its purpose, clone research can be divided into clone detection, clone analysis, and clone management. Clone detection finds all clones in software. Clone analysis aims to understand clones, including the reasons for cloning, clone classification, and clone evolution. Clone management aims to maintain and manage clones, because detection and analysis alone cannot solve the problems that clones cause. In recent years, clone analysis and clone management have attracted growing interest from researchers.

Clone detection is the most active area of clone research. Over the past decades, researchers have proposed and developed several methods and tools to detect clones, such as NiCad^[8], CCFinder^[9], and Deckard^[10]. According to the underlying technique, these methods can be divided into text-based, token-based, AST-based, PDG-based, and metrics-based approaches. NiCad has been shown to yield both high precision and high recall in detecting near-miss intentional clones. CCFinder, a token-based detector, finds clones effectively and provides metrics that help identify characteristics of the software. A detailed review of clone detection, based on a comprehensive set of 213 articles, is given in Ref.[11].

Code clones evolve together with the software. Kim proposes the clone genealogy to describe clone evolution and reports interesting findings about short-lived and long-lived clones^[3]. Duala-Ekoko proposes a clone tracking system that produces CRDs from the output of different clone detection tools and notifies developers of clone modifications^[4]. Tibor presents an approach for mapping code duplications from one version of a software system to another, and proposes a definition of clone smells that can help developers determine which clones are dangerous^[12].

Researchers have also shown strong interest in clone patterns, such as consistent and inconsistent change. Clone stability is a well-known topic in this area. Krinke finds that cloned code is more stable than non-cloned code and that cloned code is also older than non-cloned code on average^[13]. Nicolas observes that only 1.02%-4.00% of all clone genealogies introduce software defects at the release level and suggests that clones do not have a significant impact on the post-release quality of the studied systems^[14]. Krinke also finds that usually half of the changes to code clone groups are inconsistent changes, and that when a clone group has been changed inconsistently in a recent version, additional changes in later versions are rare^[15].

Although clone detection and analysis help developers find and understand clones in software, these two activities cannot by themselves solve the problems that clones cause. Researchers therefore argue that clones should be managed and maintained with more flexible methods. Clone management takes three forms: preventive, compensative, and corrective clone management^[16]. Interest in clone management research has grown over the years. The most popular activity is clone refactoring, for which researchers have proposed methods to identify refactoring opportunities and to refactor clones. Several clone management tools also exist. JSync^[17], a clone management tool, provides two main functions: it helps developers manage the clone enveloping relation and inconsistent changes. CloneTracker^[18], an Eclipse plug-in, supports tracking code clones in evolving software; when developers select clone groups, the tool automatically generates a clone model that is robust to changes in the source code. CeDAR (Clone Detection, Analysis, and Refactoring)^[19], also an Eclipse plug-in, forwards clone detection results to the refactoring engine in Eclipse.

3 Motivation

Clone detection tools find millions of clones that evolve with software, and the clone genealogy describes their evolution. But how can the two together help developers understand clones? To answer this, we extract useful information from the clones and their evolution and use it for further clone analysis. Specifically, we seek to answer the following four questions:

1) What is the relationship between clone life and clone patterns? A clone may exist in software for a long or a short time. It may change during its life, and these changes appear as clone patterns. Is there a relationship between clone life and clone patterns? For example, do long-life clones exhibit particular clone patterns?

2) What is the relationship between the number of change times and clone patterns? During evolution, some clones change many times and others do not. Is there a relationship between the number of change times and clone patterns?

3) What is the relationship between clone granularity and clone patterns? Clones differ in granularity: some are very small and comprise just a few lines. Do large and small clones exhibit different clone patterns?

4) What is the relationship between clone similarity and clone patterns? Do exact clones (those with the highest similarity) exhibit fewer clone patterns?

To explore these relationships, we use machine learning to analyze all clones. We choose an unsupervised method because our prior knowledge of clones is limited. Because Fuzzy C-Means (FCM) clustering can handle large data sets with high accuracy, we choose it to analyze clones. FCM divides all objects into several clusters so that objects in the same cluster are as similar as possible and objects in different clusters are as dissimilar as possible.

4 Clone Clustering with Fuzzy C-Means

4.1 Framework of Clone Clustering with FCM

The clone clustering framework with FCM is shown in Fig. 1. It consists of three stages: pre-processing, metric extraction, and clone clustering. In the pre-processing stage, we use NiCad to detect all clones in the software and then organize the clones collected from consecutive versions into a clone genealogy. In the metric extraction stage, we extract 11 metrics from clones and their clone groups. These metrics capture the relevant information about clones and their evolution. Finally, we generate a clone clustering vector for each clone from the metrics and cluster all clones with FCM. By combining the extracted metrics in different ways, we generate different clustering vectors that capture different aspects of clones, which allows us to address the four questions above.

Figure 1 Framework of clone clustering with FCM

4.2 Pre-processing

In this stage, we detect all clones in consecutive versions of each software system and build a clone genealogy. First, we obtain consecutive versions of the open-source code from project websites and the Subversion version control system. Second, we detect all clones in each version with NiCad, which reports clones grouped by similarity. Finally, we use CRDs to describe the clones and build the clone genealogy by mapping clone fragments and clone groups between neighboring versions. With the clones and the genealogy, we can easily extract the evolutionary metrics. Fig. 2 gives an example of a clone genealogy, which describes clone evolution and clone patterns. The software in this example has at least 5 versions. From version i to i+1, the clone group gains two new clone fragments, so its clone pattern is add. In the next version, the clone group splits into two groups with different clone patterns (one is sub, and the other is consistent change). The definitions of the clone genealogy and clone patterns can be found in Ref.[20].

Figure 2 The evolution of code clones

4.3 Metric Extraction

We extract 11 clone metrics to represent clone information. They fall into two categories: static metrics and evolutionary metrics. Static metrics, such as clone granularity and clone similarity, describe a clone as it appears in a single version and are extracted from the clones found in that version. Evolutionary metrics, such as clone life, the number of change times, and clone patterns, describe how a clone evolves and have to be extracted from the clone evolution (represented by the clone group mapping and the clone genealogy). Informal definitions of these metrics are given below.

1) Clone Granularity: the number of clone lines;

2) Clone Similarity: the degree of similarity among the clones in same clone group. The clone detection tool produces this metric directly;

3) Clone Life: the number of versions in which the clone exists in the software;

4) The Number of Change times: the number of times a clone changes in its entire life.
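
The two evolutionary metrics defined above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the history of one clone fragment is available as a list with one entry per software version, where each entry is either a content fingerprint (any string) or `None` when the fragment does not exist in that version. This representation and the function names are hypothetical and are not taken from the paper or from NiCad.

```python
from typing import Optional, Sequence


def clone_life(history: Sequence[Optional[str]]) -> int:
    """Clone life: number of versions in which the fragment exists."""
    return sum(1 for fingerprint in history if fingerprint is not None)


def change_times(history: Sequence[Optional[str]]) -> int:
    """Number of change times: how often the fragment's content differs
    from its content in the previous version in which it existed."""
    present = [f for f in history if f is not None]
    return sum(1 for prev, cur in zip(present, present[1:]) if prev != cur)


# Hypothetical fragment: absent in the first and last version,
# and modified twice during its life.
history = [None, "a1", "a1", "b7", "b7", "c2", None]
print(clone_life(history), change_times(history))
```

Under this assumed representation, a fragment that is never modified has zero change times regardless of how long it lives, which is the case discussed later for clones whose group changes while the fragment itself does not.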

Clone Pattern: A code clone belongs to one clone group, and a clone group belongs to one clone genealogy. A clone pattern describes the change from the old clone group to the new clone group between two versions. The patterns are defined briefly below; interested readers may refer to our previous work^[21] for details, and all the clone patterns can be seen in Fig. 2.

1) Same: The new clone group is totally the same as the old one;

2) Identical (same number): the number of clones in the new clone group is the same as in the old one;

3) Add: one or more new clones are added to the new clone group;

4) Sub: one or more clones are missing from the new clone group;

5) Split: the old clone group is split into two new clone groups;

6) Consistent Change: all the clones in the new group are obtained from the old group under identical changes;

7) Inconsistent Change: clones in the old group undergo different changes to arrive at the new group.
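
As a rough illustration of how the 11 metrics (two static, two evolutionary counts, and seven patterns) might be held per clone fragment, the sketch below uses a plain record type. The field names, the choice to store each pattern as an occurrence count, and the helper `to_vector` are assumptions made for this example; the paper does not specify how the patterns are encoded.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CloneMetrics:
    # Static metrics (taken from a single version)
    granularity: int          # number of clone lines
    similarity: float         # similarity within the clone group
    # Evolutionary metrics (taken from the clone genealogy)
    life: int                 # number of versions in which the clone exists
    change_times: int         # number of times the clone changed
    # Clone patterns, stored here as occurrence counts (an assumption)
    same: int
    identical: int
    add: int
    sub: int
    split: int
    consistent_change: int
    inconsistent_change: int


STATIC = ("granularity", "similarity")
EVOLUTIONARY = ("life", "change_times", "same", "identical", "add", "sub",
                "split", "consistent_change", "inconsistent_change")


def to_vector(metrics: CloneMetrics, fields: Sequence[str]) -> List[float]:
    """Select the named metrics, in order, as a numeric vector."""
    return [float(getattr(metrics, name)) for name in fields]


example = CloneMetrics(granularity=12, similarity=0.85, life=5, change_times=2,
                       same=2, identical=1, add=1, sub=0, split=0,
                       consistent_change=1, inconsistent_change=1)
print(to_vector(example, EVOLUTIONARY))
```

Selecting fields by name makes it straightforward to build a different vector for each research question, for example all evolutionary metrics for one experiment and the same set without `life` for another.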

4.4 Clone Clustering

We cluster clones by their metrics rather than by the clone fragments themselves. For each clone fragment, we therefore generate a vector, called the clone clustering vector, from selected metrics. By clustering all vectors with FCM, we can analyze all clones efficiently. The clustering stage consists of two steps: generating the clone clustering vectors and clustering them.

1) Generating clone clustering vectors. For each clone, the clone clustering vector is an m-dimensional vector V = {v₁, v₂, …, v_m}, where each v_i is one of the metrics described in the previous subsection. We can choose different metrics for different purposes. After generating the vectors for all n clones, we obtain the clustering vector space X = {x₁, x₂, …, x_n}. For example, a clone may yield a 9-dimensional vector V = {v₁, v₂, …, v₉}, where each v_i (1≤i≤9) is a metric of this clone.

2) Clustering clones with FCM. FCM divides all vectors (that is, all clones) into c fuzzy groups and seeks a cluster center for each group such that a dissimilarity (objective) function is minimized. Each clone belongs to every group to some degree rather than to exactly one group. FCM describes this with a set of membership values between 0 and 1, where each value gives the degree to which the vector belongs to one group. For one clone, the membership values form a vector <u₁, u₂, …, u_c> whose entries sum to 1, meaning that the clone's membership is fully distributed over the c groups. For all n clones, we use a c × n membership matrix U, with each u_ij in [0, 1], to describe the degree to which each clone belongs to each group.
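
For reference, the dissimilarity function and update rules can be written out in the standard textbook form of Fuzzy C-means. The paper does not state them, so the formulation below is the usual one and not necessarily the exact variant used in the experiments. The symbols are assumptions chosen to avoid clashes with the text: c_i denotes the center of group i (with c the number of groups), x_j the vector of clone j, and q > 1 the fuzzifier (commonly 2), written q here because m already denotes the vector dimension.

```latex
% Objective minimized by FCM, subject to memberships of each clone summing to 1
J(U, \mathbf{c}_1, \dots, \mathbf{c}_c)
  = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\,q} \, \lVert \mathbf{x}_j - \mathbf{c}_i \rVert^2,
\qquad \sum_{i=1}^{c} u_{ij} = 1 \quad \text{for all } j .

% Alternating updates applied until the memberships stop changing
\mathbf{c}_i = \frac{\sum_{j=1}^{n} u_{ij}^{\,q} \, \mathbf{x}_j}{\sum_{j=1}^{n} u_{ij}^{\,q}},
\qquad
u_{ij} = \left[ \sum_{k=1}^{c}
  \left( \frac{\lVert \mathbf{x}_j - \mathbf{c}_i \rVert}{\lVert \mathbf{x}_j - \mathbf{c}_k \rVert} \right)^{\frac{2}{q-1}}
\right]^{-1} .
```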

To answer each of the four questions, we select the appropriate metrics and generate the corresponding clone clustering vector for each clone. For instance, to answer the first question, we build the clone clustering vector from all the evolutionary metrics (clone life, the number of change times, and the seven clone patterns). Details are given in the next section.
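
The clustering step can be sketched in a few lines of standard-library Python. This is a minimal textbook Fuzzy C-means with Euclidean distance and random initial memberships, not the authors' implementation; the function name `fcm`, the fuzzifier default `q=2.0`, the stopping tolerance, and the toy three-dimensional vectors (clone life, change times, inconsistent-change count) are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fcm(points: Sequence[Vector], c: int, q: float = 2.0, max_iter: int = 100,
        tol: float = 1e-6, seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (centers, u), where u[j][i] is the membership of point j in group i."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([value / total for value in row])
    centers: List[List[float]] = []
    exponent = 2.0 / (q - 1.0)
    for _ in range(max_iter):
        # Centers are membership-weighted means of the points.
        centers = []
        for i in range(c):
            weights = [u[j][i] ** q for j in range(n)]
            total = sum(weights)
            centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                            for d in range(dim)])
        # Memberships are recomputed from the distances to the centers.
        new_u = []
        for p in points:
            dists = [math.dist(p, center) for center in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with a center: give it full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_u.append([h / sum(hits) for h in hits])
            else:
                new_u.append([1.0 / sum((dists[i] / dists[k]) ** exponent
                                        for k in range(c)) for i in range(c)])
        shift = max(abs(a - b) for old, new in zip(u, new_u) for a, b in zip(old, new))
        u = new_u
        if shift < tol:
            break
    return centers, u


# Hypothetical clone vectors: (clone life, change times, inconsistent changes).
vectors = [[2, 0, 1], [3, 1, 1], [2, 1, 0], [15, 0, 0], [18, 1, 0], [20, 0, 0]]
centers, memberships = fcm(vectors, c=2)
for vector, row in zip(vectors, memberships):
    print(vector, [round(value, 3) for value in row])
```

Assigning each clone to the group with its largest membership value gives a hard partition such as the short-life and long-life groups reported in the next sections. In practice the metrics would likely need scaling before clustering, since features with large ranges (such as granularity) otherwise dominate the Euclidean distance; whether the original study scaled its metrics is not stated.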

5 Case Study

Our experiments are conducted on six open-source software systems written in three programming languages: Java, C, and C#. As shown in Table 1, DNSJava and jEdit are implemented in Java. DNSJava is a Java implementation of the DNS protocol, and jEdit is a text editor for programmers. wget and conky are implemented in C. wget is a command-line download tool, and conky is a system monitor for the X Window System. ProcessHacker and iTextsharp are implemented in C#. ProcessHacker is a process management tool for Windows, and iTextsharp is a library for generating PDF documents. Table 1 gives the details of each system. Versions is the number of versions we used, Start is the first version, and End is the last version. LOC is the number of lines of code, Files is the number of files, and Clones is the number of clones in the software.

Table 1 The open-source software used in the experiments

| Software | DNSJava | jEdit | wget | conky | ProcessHacker | iTextsharp |
| --- | --- | --- | --- | --- | --- | --- |
| Versions | 34 | 22 | 9 | 22 | 30 | 18 |
| Start | 1.2.0 | 3.0.0 | 1.6.0 | 1.3.0 | 1.0.0.0 | 5.0.0 |
| End | 2.1.4 | 5.0.0 | 1.14.0 | 1.7.1 | 1.5.0.0 | 5.4.0 |
| LOC | 9 905-23 963 | 47 474-110 384 | 15 328-75 979 | 10 447-21 430 | 10 308-85 007 | 169 091-199 002 |
| Files | 83-190 | 251-571 | 46-293 | 25-84 | 39-526 | 1 270-1 622 |
| Clones | 611 | 6 636 | 378 | 928 | 2 244 | 15 397 |

We pose four questions to explore the relationships between clones and their evolution. To answer them, we carry out four experiments, each using different metrics to form the clone clustering vectors.

First, to find the relationship between clone life and clone patterns raised in the first question, we use all evolutionary metrics to form the vectors and cluster the clones. This experiment shows that clone life is a distinctive metric: all clones can be divided into two classes according to their clone life. We discuss this in detail in the next section.

Next, we remove clone life to explore the relationship between the number of change times and clone patterns raised in the second question. Here, we find that all clones can be separated by their number of change times. Again, details are given in the next section.

For the third question, we choose a set of metrics that includes clone granularity to cluster the clones. This gives us the relationship between granularity and clone patterns.

Finally, in the same way as for the third question, we include clone similarity in the vectors to explore its relationship with clone patterns and answer the last question.

6 Results and Discussions

In this section, we present and discuss the results of the experiments. We conduct four experiments to answer the four questions above and draw conclusions that can help developers better understand clones.

6.1 Relationship Between Clone Life and Clone Patterns

In the clone life experiment, we form the clone clustering vectors from all evolutionary metrics: clone life, the number of change times, and the clone patterns. We then apply FCM to cluster the clones.

The results of this experiment are shown in Table 2. From the Clone Life row, we find that clones are clustered into two groups by clone life: group 1 (short-life clones) and group 2 (long-life clones). Taking DNSJava as an example, group 1 contains clones with a short life (at most 11 versions) and group 2 contains clones with a long life (more than 11 versions). In the remaining rows, we count the clones in groups 1 and 2 that exhibit each pattern: inconsistent change, add, subtract, and split. As the Number of clones row shows, the short-life group is larger than the long-life group for all six systems, which means that most clones do not survive for long. For each of the four patterns, the short-life group also contains far more clones than the long-life group. We do not report the other metrics (such as the number of change times and the same pattern), because they reveal no useful relationship with clone life or clone patterns; the same applies in the following sections.

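Table 2 The results of clone life and clone patterns (each cell: group 1 / group 2)

| Patterns | DNSJava | jEdit | wget | conky | ProcessHacker | iTextsharp |
| --- | --- | --- | --- | --- | --- | --- |
| Clone Life | ≤11 / >11 | ≤9 / >9 | <5 / ≥5 | ≤9 / >9 | ≤10 / >10 | ≤10 / >10 |
| Number of clones | 485 / 126 | 5 812 / 824 | 350 / 28 | 816 / 112 | 1 755 / 489 | 8 699 / 6 698 |
| IC | 32 / 0 | 248 / 6 | 46 / 4 | 109 / 3 | 169 / 1 | 900 / 11 |
| Add | 66 / 0 | 1 067 / 19 | 222 / 0 | 331 / 1 | 447 / 2 | 1 098 / 21 |
| Sub | 9 / 0 | 97 / 6 | 16 / 4 | 73 / 0 | 153 / 1 | 92 / 11 |
| Split | - / - | 40 / 3 | 6 / 0 | 12 / 2 | 49 / 0 | 26 / 3 |

Note: IC = Inconsistent Change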

However, because the two groups differ in size, these counts alone do not tell us whether short-life clones exhibit more clone patterns (inconsistent change, add, subtract, split) than long-life ones. We therefore calculate the absolute ratio P₁ and the relative ratio P₂, shown in Tables 3 and 4. P₁ is the proportion of clones in a specific group (group 1 or group 2) that underwent a specific pattern. For instance, the absolute ratio of the IC pattern for group 1 of DNSJava is 6.6% (32:485). P₂ is the proportion of all clones that underwent a specific pattern that belong to a specific group. Thus, the relative ratio of the Add pattern for group 1 of conky is 99.7% (331:(331+1)), indicating that almost all occurrences of the Add pattern fall in group 1. Tables 3 and 4 both show that the short-life group contains far more clones than the long-life group. In Table 3, the Add pattern occurs most often among all change patterns, which tells us that the number of clones in the system grows. The second most frequent pattern is inconsistent change, which tells us that inconsistent changes to clones occur relatively frequently. Except for IC and subtract in wget and split in conky, the P₁ values of the short-life group are larger than those of the long-life group. Both wget and conky are written in C. This suggests that short-life clones exhibit more clone patterns (inconsistent change, add, subtract, split) than long-life ones, especially in systems written in object-oriented languages. In Table 4, we first calculate P₂ for the group sizes (the Percentage row) and compare it with P₂ for each clone pattern. The P₂ of each clone pattern in the short-life group is larger than the P₂ of the group itself, except for IC and subtract in wget and split in conky. This confirms the observation from Table 3.
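
The two ratios can be reproduced with a short sketch that uses the counts quoted in the text. The function names are hypothetical, and returning `None` for an empty denominator is an assumption made for this example.

```python
from typing import Optional, Sequence


def absolute_ratio(pattern_count: int, group_size: int) -> Optional[float]:
    """P1: percentage of the clones in one group that underwent the pattern."""
    if group_size == 0:
        return None  # no clones in the group, so the ratio is undefined
    return 100.0 * pattern_count / group_size


def relative_ratio(pattern_count: int, counts_in_all_groups: Sequence[int]) -> Optional[float]:
    """P2: percentage of all clones with the pattern that fall in this group."""
    total = sum(counts_in_all_groups)
    if total == 0:
        return None  # the pattern never occurred, so the ratio is undefined
    return 100.0 * pattern_count / total


# IC pattern, DNSJava, group 1: 32 of 485 short-life clones.
print(f"{absolute_ratio(32, 485):.2f}")        # 6.60
# Add pattern, conky, group 1: 331 of the 331 + 1 clones with that pattern.
print(f"{relative_ratio(331, [331, 1]):.2f}")  # 99.70
```

P₁ normalizes by group size, so it is the ratio to compare when the two groups differ greatly in size, whereas P₂ is only informative when read against each group's share of all clones.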

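Table 3 The rate of P₁ (each cell: group 1 / group 2)

| Patterns | DNSJava | jEdit | wget | conky | ProcessHacker | iTextsharp |
| --- | --- | --- | --- | --- | --- | --- |
| Clone Life | ≤11 / >11 | ≤9 / >9 | <5 / ≥5 | ≤9 / >9 | ≤10 / >10 | ≤10 / >10 |
| Number of clones | 485 / 126 | 5 812 / 824 | 350 / 28 | 816 / 112 | 1 755 / 489 | 8 699 / 6 698 |
| IC (%) | 6.60 / 0 | 4.27 / 0.73 | 13.14 / 14.29 | 13.36 / 2.68 | 9.63 / 0.20 | 10.35 / 0.16 |
| Add (%) | 13.61 / 0 | 18.36 / 2.31 | 63.43 / 0 | 40.56 / 0.89 | 25.47 / 0.41 | 12.62 / 0.31 |
| Sub (%) | 1.86 / 0 | 1.67 / 0.73 | 4.57 / 14.29 | 8.95 / 0 | 8.72 / 0.20 | 1.06 / 0.16 |
| Split (%) | - / - | 0.69 / 0.36 | 1.71 / 0 | 1.47 / 1.79 | 2.79 / 0 | 0.30 / 0.04 |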

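Table 4 The rate of P₂ (each cell: group 1 / group 2)

| Patterns | DNSJava | jEdit | wget | conky | ProcessHacker | iTextsharp |
| --- | --- | --- | --- | --- | --- | --- |
| Clone Life | ≤11 / >11 | ≤9 / >9 | <5 / ≥5 | ≤9 / >9 | ≤10 / >10 | ≤10 / >10 |
| Number of clones | 485 / 126 | 5 812 / 824 | 350 / 28 | 816 / 112 | 1 755 / 489 | 8 699 / 6 698 |
| Percentage (%) | 79.4 / 20.6 | 87.6 / 12.4 | 92.6 / 7.4 | 87.9 / 12.1 | 78.2 / 21.8 | 56.5 / 43.5 |
| IC (%) | 100 / 0 | 97.64 / 2.36 | 92 / 8 | 97.32 / 2.68 | 99.41 / 0.59 | 98.80 / 1.20 |
| Add (%) | 100 / 0 | 98.25 / 1.75 | 100 / 0 | 99.70 / 0.30 | 99.55 / 0.45 | 98.12 / 1.88 |
| Sub (%) | 100 / 0 | 94.17 / 5.83 | 80 / 20 | 100 / 0 | 99.35 / 0.65 | 89.32 / 10.68 |
| Split (%) | - / - | 93.02 / 6.98 | 100 / 0 | 85.71 / 14.29 | 100 / 0 | 89.66 / 10.34 |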

From this experiment, we find that short-life clones far outnumber long-life clones, and that clone patterns (inconsistent change, add, subtract, and split) occur more readily in short-life clones, especially the add and inconsistent change patterns. This tendency appears stronger for systems written in object-oriented languages. These patterns hasten the disbandment of clone groups. Given this conclusion, it makes little sense to focus on short-life clones in general, since they will disappear soon; instead, we should pay more attention to occurrences of the add and inconsistent change patterns. By analyzing the source code of jEdit, we find that inconsistent change occurs frequently in the early and middle stages. A possible reason is that the system is not yet stable in its early stage, and in the middle stage it needs changes to adjust its functionality. For example, in the middle stage of jEdit, the developers removed ColorOptionPane.java and StyleOptionPane.java, which controlled the Color and Style settings in version 4.03, and added SyntaxHiliteOptionPane.java, which controls the Color settings. This resulted in inconsistent change in version 4.10.

6.2 Relationship Between Number of Change Times and Clone Patterns

To investigate the relationship between the number of change times and clone patterns, we remove clone life from the clone clustering vectors, so that the resulting vectors contain the number of change times and the seven clone patterns.

The results for the number of change times are shown in Table 5. From the Change times row, we find that clones can be divided into two groups (groups 1 and 2) by the number of change times. For each group, we count the clones that experienced inconsistent change and calculate P₁ (absolute ratio) and P₂ (relative ratio), shown in the last three rows. The number of clones with inconsistent change is large, which means that inconsistent change occurs frequently. Table 5 also shows that clones with inconsistent change are much more common in group 1 than in group 2, which means that inconsistent change tends to occur in clones that have fewer changes. We do not consider wget because it has few versions.

Table 5 The results of number of change times and clone patterns (each cell: group 1 / group 2)

| Parameters | DNSJava | jEdit(22) | conky | ProcessHacker | iTextsharp |
| --- | --- | --- | --- | --- | --- |
| Change times | n=0 / n>0 | n=0 / n>0 | n≤1 / n>1 | n≤51 / n>1 | n=0 / n>0 |
| Number of clones | 122 / 489 | 6 264 / 372 | 816 / 122 | 2 089 / 1 155 | 1 131 / 14 266 |
| IC | 32 / 0 | 244 / 10 | 107 / 5 | 166 / 4 | 911 / 0 |
| P₁ (%) | 26.23 / 0 | 3.90 / 2.69 | 13.11 / 4.10 | 7.95 / 2.58 | 80.55 / 0 |
| P₂ (%) | 100 / 0 | 96.07 / 3.93 | 99.11 / 0.89 | 97.65 / 2.35 | 100 / 0 |

Note: Change times = the number of change times (n); IC = Inconsistent Change

From the results, we find that inconsistent change occurs frequently in general. We already know that inconsistent change often appears in short-life clones (see Section 6.1), and here we see that it appears in clones with fewer changes. By analyzing the source code, we notice that such a clone itself may not change during evolution, but the clone group to which it belongs may gain a new clone or lose one through deletion or refactoring, which leads to an inconsistent change.

6.3 Relationship Between Clone Granularity and Clone Patterns

In this experiment, we form the clone clustering vector from clone granularity and all evolutionary metrics: clone granularity, clone life, the number of change times, and the clone patterns.

The results for clone granularity are shown in Table 6. From the LOC row, we find that clones can be divided into two groups by clone granularity: small clones (group 1) and large clones (group 2). Group 1 is much larger than group 2, which tells us that most clones are small. In the remaining rows, we count the clones with inconsistent change in groups 1 and 2 and give P₁ (absolute ratio) and P₂ (relative ratio). The P₁ values are not very small, which indicates that inconsistent change occurs readily. From the P₁ and P₂ rows, however, we cannot draw a general conclusion about which group is more prone to inconsistent change. In summary, inconsistent change occurs frequently in clones, but there is no general conclusion about which granularity group it occurs in more readily.

Table 6 The results of clone granularity and clone patterns

Parameters | DNSJava | jEdit | wget | conky | ProcessHacker | iTextsharp
LOC (Number) | <16/≥16 (547/64) | ≤30/>30 (5 388/1 248) | <90/≥90 (362/16) | ≤38/>38 (821/107) | ≤38/>38 (2 000/244) | ≤38/>38 (2 000/244)
IC | 26/6 | 207/47 | 50/0 | 101/14 | 160/10 | 160/10
P₁(%) | 4.75/9.38 | 3.84/3.77 | 13.81/0 | 12.30/13.08 | 8/4.10 | 5.95/5.08
P₂(%) | 81.25/18.75 | 81.50/18.50 | 100/0 | 87.83/12.17 | 94.12/5.88 | 96.49/3.51
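The two ratios in Table 6 can be illustrated with a short calculation. The sketch below assumes, based on the tabulated values, that P₁ divides the number of clones showing the pattern in a group by the size of that group, and that P₂ divides it by the number of such clones across both groups. The function name `ratios` is hypothetical.

```python
def ratios(group_sizes, pattern_counts):
    """Return (P1, P2) as percentages, one value per group.

    P1 (absolute ratio): pattern clones in a group / all clones in that group.
    P2 (relative ratio): pattern clones in a group / pattern clones in all groups.
    These definitions are an assumption inferred from the tabulated values.
    """
    total = sum(pattern_counts)
    p1 = [100.0 * k / n if n else 0.0 for k, n in zip(pattern_counts, group_sizes)]
    p2 = [100.0 * k / total if total else 0.0 for k in pattern_counts]
    return p1, p2

# DNSJava column of Table 6: 547 small and 64 big clones, of which 26 and 6 show IC.
p1, p2 = ratios([547, 64], [26, 6])
print("P1: " + "/".join(f"{v:.2f}" for v in p1))
print("P2: " + "/".join(f"{v:.2f}" for v in p2))
```

A high P₂ for group 1 mostly reflects that group 1 holds far more clones, which is why the two ratios are read together.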

6.4 Relationship Between Clone Life and Exact Clone

In this experiment, we build the clone clustering vector from the clone similarity and all evolutionary metrics (the clone similarity, the clone life, the number of changes, and all possible clone patterns), and cluster all the clones.

The results for clone similarity are shown in Table 7. From the results, we find that the clones can be divided into two groups (groups 1 and 2) by clone life. In the 3rd to 5th rows, we count the clones with a similarity value of 1.0 (called exact clones; the others are near-miss clones) in groups 1 and 2, and report P₁ and P₂. The 4th row of Table 7 (P₁) shows that the ratios are very small: exact clones make up only a small fraction of all clones, and they tend to have a short life. The 4th and 5th rows also show that exact clones are more common in the short-life group than in the long-life group. Exact clones are therefore rare among long-life clones (they disappear for some reason). This suggests that we should pay more attention to near-miss clones.

Table 7 The result of clone life and exact clone

Parameters | DNSJava | jEdit | wget | conky | ProcessHacker
Clone Life (Number) | <12/≥12 (465/146) | ≤13/>13 (6 451/85) | <5/≥5 (350/28) | ≤9/>9 (817/111) | ≤10/>10 (1 755/489)
Similarity=1 | 14/0 | 732/4 | 32/0 | 60/6 | 48/4
P₁(%) | 3.01/0 | 11.35/2.16 | 9.14/0 | 7.34/5.41 | 2.74/0.82
P₂(%) | 100/0 | 99.46/0.54 | 100/0 | 90.90/9.10 | 92.31/7.69
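The counts in the "Similarity=1" row of Table 7 can be obtained by a simple tally once every clone has a life group. The sketch below assumes each clone is reduced to a pair (clone life, similarity) and that the life threshold separating the two groups is already known from the clustering. The function name and the sample data are hypothetical.

```python
def exact_clone_tally(clones, life_threshold):
    """Count all clones and exact clones (similarity == 1.0) per life group.

    clones: iterable of (clone_life, similarity) pairs.
    A clone is 'short' if clone_life < life_threshold, otherwise 'long'.
    """
    tally = {"short": {"total": 0, "exact": 0}, "long": {"total": 0, "exact": 0}}
    for life, similarity in clones:
        group = "short" if life < life_threshold else "long"
        tally[group]["total"] += 1
        if similarity == 1.0:
            tally[group]["exact"] += 1
    return tally

# Hypothetical clones: (clone life in versions, similarity within the clone group)
sample = [(3, 1.0), (5, 0.85), (2, 1.0), (14, 0.9), (20, 0.75), (4, 0.8)]
tally = exact_clone_tally(sample, life_threshold=12)
print(tally)
```

Dividing the exact count of a group by its total, or by the exact count over both groups, then gives the two percentage rows.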

This experiment shows that exact clones are generally short-life clones. By analyzing the source code, we find two possible explanations. One is that the exact clones were deleted in subsequent versions (possibly through refactoring). The other is that the file containing the exact clones was deleted in subsequent versions.

7 Limitations and Future Work

In this paper, we show that applying FCM to cluster clones makes it possible to analyze various clone relationships, which can help developers understand the clones. Although the results obtained are fairly reliable (the current metrics are reasonable), our method still has some limitations. First, it depends on NiCad and the clone mapping algorithm to build the clone genealogy: we use NiCad to detect clones, and the detection results are then mapped across versions to build the genealogy. Second, our approach strongly depends on the availability of clone metrics. While these metrics have served us well, we would like to expand the collection in the future, as new metrics can hint at new relationships between clones and their evolution. In the future, we can also conduct more experiments on different software systems with more clone metrics. We believe that we have presented some meaningful conclusions to help clone analysis and clone maintenance. We can also extend our work to a broader perspective, such as the clone genealogy as a whole, where we believe we will find more meaningful conclusions to help clone analysis and management.

8 Conclusions

In this paper, we propose an approach that clusters clones with the FCM method in order to analyze them and draw useful conclusions. We extract clone metrics that describe clones and their evolution from the clone detection results and the clone genealogy, and generate a clustering vector for each clone so that the clones can be clustered easily. We conduct an empirical study on six open-source software systems to answer the 4 questions, which reveals general relationships between clones and their evolution. The resulting conclusions, which can help developers understand clones, are as follows:

(a) Most clones do not exist for a long time in software. Short-life clones have more clone patterns (inconsistent change, add, subtract, split) than long-life ones, especially in object-oriented programming languages.

(b) Generally speaking, inconsistent change occurs frequently, especially in clones that have fewer changes. This suggests that we should pay attention to the clones that have fewer changes.

(c) Generally, most clones do not have a large granularity. Inconsistent change occurs frequently in clones, but there is no general conclusion about which granularity group it occurs in more easily.

(d) Exact clones make up only a small fraction of all clones and tend to have a short life. Exact clones are rare among long-life clones (they disappear for some reason). This suggests that we should pay more attention to near-miss clones.

These conclusions can provide some guidance for developers to understand clones in software development. However, our work also has some shortcomings, which are discussed in the previous section. In the future, we will extend our method to clone groups and clone genealogy to find new conclusions. The relationships among clone groups and genealogies can be viewed as features of clone evolution, and these features may be helpful for clone management.

References
[1] Koschke R. Survey of research on software clones. Dagstuhl Seminar Proceedings 06301-Duplication, Redundancy, and Similarity in Software, 2007.

[2] Roy C K, Cordy J R. A Survey on Software Clone Detection Research. Kingston: Queen's University at Kingston, 2007.

[3] Kim M, Sazawal V, Notkin D, et al. An empirical study of code clone genealogies. Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering. New York, NY: ACM, 2005: 187-196.

[4] Duala-Ekoko E, Robillard M P. Clone region descriptors: Representing and tracking duplication in source code. ACM Transactions on Software Engineering and Methodology (TOSEM), 2010, 20(1): Article No. 3. DOI:10.1145/1767751.1767754

[5] Juergens E, Deissenboeck F, Hummel B, et al. Do code clones matter? Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society. Piscataway: IEEE, 2009: 485-495.

[6] Göde N, Harder J. Clone stability. Proceedings of 2011 15th European Conference on Software Maintenance and Reengineering (CSMR). Piscataway: IEEE, 2011: 65-74.

[7] Harder J, Göde N. Cloned code: stable code. Journal of Software: Evolution and Process, 2013, 25(10): 1063-1088. DOI:10.1002/smr.v25.10

[8] Roy C K, Cordy J R. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. Proceedings of the 16th IEEE International Conference on Program Comprehension, 2008. Piscataway: IEEE, 2008: 172-181.

[9] Kamiya T, Kusumoto S, Inoue K. CCFinder: a multi-linguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 2002, 28(7): 654-670. DOI:10.1109/TSE.2002.1019480

[10] Jiang L, Misherghi G, Su Z, et al. Deckard: Scalable and accurate tree-based detection of code clones. Proceedings of the 29th international conference on Software Engineering, IEEE Computer Society. Piscataway: IEEE, 2007: 96-105.

[11] Rattan D, Bhatia R, Singh M. Software clone detection: A systematic review. Information and Software Technology, 2013, 55(7): 1165-1199. DOI:10.1016/j.infsof.2013.01.008

[12] Bakota T. Tracking the evolution of code clones. SOFSEM 2011: Theory and Practice of Computer Science. Berlin: Springer, 2011: 86-98.

[13] Krinke J. Is cloned code older than non-cloned code? Proceedings of the 5th International Workshop on Software Clones. New York: ACM, 2011: 28-33.

[14] Bettenburg N, Shang W, Ibrahim W, et al. An empirical study on inconsistent changes to code clones at release level. Proceedings of 16th Working Conference on Reverse Engineering, 2009: 85-94.

[15] Krinke J. A study of consistent and inconsistent changes to code clones. Proceedings of the 14th Working Conference on Reverse Engineering, 2007. WCRE 2007. Piscataway: IEEE, 2007: 170-178.

[16] Koschke R. Frontiers of software clone management. Frontiers of Software Maintenance, 2008, 2008: 119-128.

[17] Nguyen H A, Nguyen T, Pham N H, et al. Clone management for evolving software. IEEE Transactions on Software Engineering, 2012, 38(5): 1008-1026. DOI:10.1109/TSE.2011.90

[18] Duala-Ekoko E, Robillard M P. Clonetracker: tool support for code clone management. Proceedings of the 30th International Conference on Software Engineering. New York: ACM, 2008: 843-846.

[19] Tairas R, Gray J. Increasing clone maintenance support by unifying clone detection and refactoring activities. Information and Software Technology, 2012, 54(12): 1297-1307. DOI:10.1016/j.infsof.2012.06.011

[20] Ci M, Su X H, Wang T T, et al. A New Clone Group Mapping Algorithm for Extracting Clone Genealogy on Multi-version Software. Proceedings of the 3rd International Conference on Instrumentation, Measurement, Computer, Communication and Control (IMCCC). Piscataway: IEEE, 2013: 848-853.