SPECIFICATION MINING

Mrs. Sangeetha K; Pankaj Dalal

doi:10.17577/IJERTCONV2IS10039

NCETECE - 2014 (Volume 2 - Issue 10)


Open Access
Authors : Mrs. Sangeetha K, Pankaj Dalal
Paper ID : IJERTCONV2IS10039
Volume & Issue : NCETECE – 2014 (Volume 2 – Issue 10)
Published (First Online): 30-07-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Mrs. Sangeetha K, M-Tech Scholar (Software Engineering), sangi.biju@gmail.com
Prof. Pankaj Dalal, Associate Professor (CSE), pkjdalal@gmail.com

Shrinathji Institute of Technology & Engineering, Nathdwara-313301

ABSTRACT

Mined specifications can aid program testing efforts and help program verification tools find bugs and ensure the correctness of systems. The specification mining process starts with a program under analysis and/or a set of test cases. Traces are produced from the program by applying static or dynamic analysis. Techniques employing dynamic analysis require running the test cases to produce a set of traces, which is later analyzed. Program traces are often long and involve several phases. Semantic clustering can be used to split the traces vertically into phases based on comments or annotations in the source code. Our technique merges the ideas of SMArTIC (Specification Mining Architecture with Trace fIltering and Clustering) and semantic clustering to split the traces both horizontally (into separate clusters) and vertically (into phases), which improves the accuracy, robustness and scalability of specification miners that use trace filtering and clustering. This also brings the benefit of a divide-and-conquer approach to the specification mining process.

Keywords- Specification mining, filtering, clustering, LSI

INTRODUCTION

Software bugs are prevalent. Bugs not only make software systems more expensive to develop because of the high cost of debugging, but may also cause various security vulnerabilities. The absence of specifications makes it harder to locate bugs. Many existing bug-finding tools, e.g., model checkers, require specifications in order to locate bugs, which are defined as anomalies or violations of these specifications.

To address the above challenges (i.e., to improve program understanding and to find bugs), specification mining has been proposed. It is a program analysis method that automatically infers the specification of a program from examples of correct usage. Usage refers to the manner in which the program or its exposed methods are invoked. For example, correct usage of a resource such as a file or a new connection follows an acceptable invocation sequence: acquisition, access and then release. Similarly, to use an individual method correctly, the parameters passed to it should meet the necessary preconditions. These are the implicit rules, followed by most programs but not explicitly stated, that mining techniques attempt to uncover.

Techniques have been developed for mining various specification formats, such as automata and temporal rules. In general, specification mining techniques employ data mining or machine learning on execution traces to generate models that are useful in program verification. These techniques work under the assumption that, by observing sufficient executions of a good software implementation, inferences regarding the specification (or expected behavior) of the software can be made.

Specification mining starts with a program under analysis and/or a set of test cases. There are both dynamic and static approaches to specification mining. Broadly, dynamic specification mining techniques rely on actual executions of programs. In contrast, static approaches extract the specification by reasoning about the control flow of a subject program or of other client programs that invoke the subject. Static specification mining can be performed if program source code is available. However, to obtain precise specifications, expensive analysis may have to be performed to eliminate infeasible paths. This obstacle is even greater in the distributed case, where feasible scenarios (the number of processes and how they interact) have to be inferred from the static view provided by the program source code executed by each process.

BACKGROUND

The term specification mining was first coined by Ammons et al. in [1]. According to them, specification mining is a process of inferring models or properties that hold for a system. It discovers some of the temporal and data-dependence relationships that a program follows when it interacts with an application programming interface (API) or abstract data type (ADT).

Alur et al. propose an approach to infer a finite state machine from Application Programming Interface (API) code [3]. Mariani and Pezzè [4] propose a new grammar inference engine specially suited for program executions, named k-behavior. Acharya et al. [5] propose a static analysis approach to mine finite state machines from systems. Their tool leverages a model checker to statically generate traces of a system. Dallmeier et al. [6] propose a hybrid static and dynamic analysis approach to infer a finite state machine. First, a set of methods termed mutators (i.e., those that change a system state) and another set termed inspectors (i.e., those that read/inspect a system state) are determined statically. Acharya et al. extend their previous work [5] in [8]. Static traces are first extracted from program code, as in their previous work. Relevant segments of the traces are then recovered. These segments are fed to a frequent partial order miner. The resultant partial orders are then composed to generate a specification in the form of a finite state machine.

Mariani et al. extend their previous work [4] in [9]. They propose a technique to generate prioritized regression test cases for the integration of Commercial-off-the-Shelf (COTS) components. A model in the form of a finite state machine and a set of boolean expressions on a set of variables, operators, and values is learned based on an older version of a system. Lorenzoli et al. mine extended finite state machines (EFSMs) in [10]. An extended finite state machine enriches a standard finite state machine with value-based invariants. Mariani and Pastore mine finite state machines to identify failure causes from system logs [11]. System logs are first collected. Several pre-processing modules are then employed to detect events in the logs and to transform the data into an appropriate format that abstracts away concrete values.

Lo and Khoo extend the work by Ammons et al. [1] by proposing a metric of precision and recall for evaluating the quality of a specification mining engine that produces finite state machines [7]. They also introduce trace clustering and trace filtering to reduce the effect of bad traces and inaccuracies during the inference of mined specifications. They developed a mining architecture called SMArTIC (Specification Mining Architecture with Trace fIltering and Clustering), an API specification mining architecture used to improve the accuracy, robustness and scalability of specification miners.

MINING PROCESS

The mining process begins with traces of a program's run-time interaction with an API or ADT. According to Ammons et al. [1], a specification mining system is composed of four parts: tracer, flow dependence annotator, scenario extractor, and automaton learner (Figure 1).

Figure 1: Specification Mining System. Stages shown: Program → Tracer → Instrumented Program; Test Inputs → Run → Traces → Flow Dependence Annotator → Annotated traces → Scenario Extractor (with scenario seeds) → Scenario strings → Automaton learner → Specification.

The tracer instruments programs so that they trace and record their interactions with an API or ADT, as well as compute their usual results. Tracers produce traces in a standard form, so that the rest of the process is independent of the tracing technology.

Flow dependence annotation is the first step in refining the traces into interaction scenarios, which can be fed to the learner. It connects an interaction that produces a value with the interactions that consume the value. Next, the scenario extractor uses these dependences to extract interaction scenarios (small sets of dependent interactions) and puts the scenarios into a standard, abstract form.
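The two steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Ammons et al.: the event format, the `PRODUCERS` set, and the function names `annotate_flow` and `extract_scenario` are hypothetical, and the sketch ignores the fact that descriptor numbers can be reused after a close.

```python
from collections import defaultdict

# Hypothetical trace: (call name, {argument: value}, return value).
trace = [
    ("socket", {}, 7),
    ("bind", {"so": 7}, 0),
    ("listen", {"so": 7}, 0),
    ("accept", {"so": 7}, 8),
    ("read", {"fd": 8}, 12),
    ("close", {"fd": 8}, 0),
]

# Assumption: only these calls produce values that later calls consume.
PRODUCERS = {"socket", "accept"}


def annotate_flow(events):
    """Return (producer_index, consumer_index) pairs for every value use."""
    defined_at = {}  # produced value -> index of the event that produced it
    deps = []
    for i, (name, args, ret) in enumerate(events):
        for value in args.values():
            if value in defined_at:
                deps.append((defined_at[value], i))
        if name in PRODUCERS:
            defined_at[ret] = i
    return deps


def extract_scenario(events, deps, seed):
    """Collect the calls reachable from the seed event along dependences."""
    successors = defaultdict(list)
    for src, dst in deps:
        successors[src].append(dst)
    reached, stack = {seed}, [seed]
    while stack:
        for nxt in successors[stack.pop()]:
            if nxt not in reached:
                reached.add(nxt)
                stack.append(nxt)
    return [events[i][0] for i in sorted(reached)]


deps = annotate_flow(trace)
print(deps)
print(extract_scenario(trace, deps, seed=3))
```

Seeding the extractor at the accept event yields only the calls that depend on the accepted connection, which is the kind of small, dependent set of interactions that is passed on to the learner.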

The automaton learner is composed of two parts: an off-the-shelf probabilistic finite state automaton (PFSA) learner and a postprocessor called the corer.
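To make the two parts concrete, the sketch below estimates transition probabilities from scenario strings and then keeps only the frequently taken transitions. It is a deliberately simplified stand-in: it uses the last call as the state (a first-order model) rather than a real PFSA learning algorithm, and it assumes that the postprocessor's job is to discard rarely taken transitions. The names `learn_pfsa` and `core` and the threshold value are hypothetical.

```python
from collections import Counter, defaultdict


def learn_pfsa(scenarios):
    """Estimate P(next call | current call) from scenario strings."""
    counts = defaultdict(Counter)
    for scenario in scenarios:
        state = "START"
        for call in scenario + ["END"]:
            counts[state][call] += 1
            state = call
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


def core(pfsa, threshold):
    """Keep only transitions whose probability reaches the threshold."""
    return {
        state: {nxt for nxt, p in edges.items() if p >= threshold}
        for state, edges in pfsa.items()
    }


scenarios = [
    ["accept", "read", "write", "close"],
    ["accept", "read", "write", "close"],
    ["accept", "read", "write", "close"],
    ["accept", "read", "write"],  # hypothetical bad scenario: close is missing
]

pfsa = learn_pfsa(scenarios)
print(pfsa["write"])                       # {'close': 0.75, 'END': 0.25}
print(core(pfsa, threshold=0.5)["write"])  # {'close'}
```

In this toy input the rare transition from write straight to the end of the scenario is dropped, so the remaining automaton requires a close after the final write.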

The mining process can be illustrated with the example program given below.

int s = socket(AF_INET, SOCK_STREAM, 0);
…
bind(s, &serv_addr, sizeof(serv_addr));
…
listen(s, 5);
…
while (1) {
    int ns = accept(s, &addr, &len);
    if (ns < 0) break;
    do {
        read(ns, buffer, 255);
        …
        write(ns, buffer, size);
        if (cond1) return;
    } while (cond2);
    close(ns);
}
close(s);

Figure 2: An example program using the socket API.

This program uses the server-side socket API [14]. It generally observes the correct protocol: create a new socket s through a call to socket, prepare s to accept connections by calling bind and listen, call accept for each connection, service each connection, and finally call close to destroy s. Unfortunately, the program is buggy: if the return statement on line 14 (if (cond1) return;) is executed, s is never closed.
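The protocol described above can be written down as a small checker that replays a sequence of calls and reports the descriptors still open at the end. This is a minimal sketch, assuming a simplified event format of (call, descriptor) tuples with accept carrying both the listening and the returned descriptor; the function name `open_descriptors` and the state names are hypothetical.

```python
def open_descriptors(events):
    """Replay socket calls and return descriptors never closed.

    Each event is (call, fd), except accept, which is
    ("accept", listening_fd, new_fd).
    """
    state = {}  # fd -> "created" | "bound" | "listening" | "connected"
    for call, *fds in events:
        fd = fds[0]
        if call == "socket":
            state[fd] = "created"
        elif call == "bind" and state.get(fd) == "created":
            state[fd] = "bound"
        elif call == "listen" and state.get(fd) == "bound":
            state[fd] = "listening"
        elif call == "accept" and state.get(fd) == "listening":
            state[fds[1]] = "connected"
        elif call in ("read", "write") and state.get(fd) == "connected":
            pass  # servicing a connection does not change its state
        elif call == "close" and fd in state:
            del state[fd]
        else:
            raise ValueError(f"unexpected {call} on descriptor {fd}")
    return sorted(state)


setup = [("socket", 7), ("bind", 7), ("listen", 7),
         ("accept", 7, 8), ("read", 8), ("write", 8)]

good_run = setup + [("close", 8), ("close", 7)]
buggy_run = setup  # the early return skips both close calls

print(open_descriptors(good_run))   # []
print(open_descriptors(buggy_run))  # [7, 8]
```

A run that takes the early return leaves the listening socket open, and in this sketch the accepted connection as well, while a run that leaves the loop normally closes everything.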

Even though a program is buggy, individual interaction traces can be correct. Figure 3 shows one such trace. Such traces can be given as input to any mining system. After processing these traces, an automaton-based miner will produce a specification automaton.

1. socket (domain = 2, type = 1, proto = 0, return = 7)
2. bind (so = 7, addr = 0x400120, addr_len = 6, return = 0)
3. listen (so = 7, backlog = 5, return = 0)
4. accept (so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)
5. read (fd = 8, buf = 0x400320, len = 255, return = 12)
6. write (fd = 8, buf = 0x400320, len = 12, return = 12)
7. read (fd = 8, buf = 0x400320, len = 255, return = 7)
8. write (fd = 8, buf = 0x400320, len = 7, return = 7)
9. close (fd = 8, return = 0)
10. accept (so = 7, addr = 0x400200, addr_len = 0x400240, return = 10)
11. read (fd = 10, buf = 0x400320, len = 255, return = 13)
12. write (fd = 10, buf = 0x400320, len = 13, return = 13)
13. close (fd = 10, return = 0)

Figure 3: A trace of an execution of the program in Figure 2 (input to a mining system)
  
Figure 4 shows the output of a mining system: the specification automaton for the socket protocol of the program segment given in Figure 2.

Figure 4: The output of the mining process: a specification automaton for the socket protocol. Transition labels shown: socket(return = x), bind(so = x), listen(so = x), accept(so = x, return = y), read(fd = y), write(fd = y), close(fd = y), close(fd = x).

MINING ARCHITECTURE

Lo and Khoo proposed a novel architecture for automaton-based specification mining [7]. It explores the art and science behind the construction of such a miner. It achieves specification mining by pipelining four functional components: error-trace filtering, clustering, learning, and automaton merging.

This architecture, known as SMArTIC (Specification Mining Architecture with Trace fIltering and Clustering), is an API specification mining architecture used to improve the accuracy, robustness and scalability of specification miners.

Figure 5: SMArTIC Structure

The overall structure of SMArTIC is shown in Figure 5. It comprises four major blocks: filtering, clustering, learning and merging. Each block is in turn composed of several major elements. The filtering block filters erroneous traces to address the robustness issue. The clustering block divides traces into groups of similar traces to address the scalability issue. The learning block generates specifications in the form of automata. The merging block merges the automata generated from each cluster into a unified one.

1. FILTERING BLOCK

Figure 6: Filtering Block

The filtering block aims to filter out erroneous traces based on common behavior found in a multi-set of program traces. Since a trace is a temporal or sequential ordering of events, it is appropriate to represent common behavior by statistically significant temporal rules. Temporal rules based on the full set of temporal logics would certainly be a good candidate, but a more lightweight solution is desirable.
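One lightweight form such rules could take is "whenever a occurs, b eventually follows". The sketch below mines rules of that shape from a multi-set of traces and discards the traces that violate a rule most other traces obey. It is a simplified illustration rather than the rule language or algorithm used in SMArTIC; the function names and the confidence threshold are assumptions.

```python
from itertools import permutations


def satisfies(trace, a, b):
    """True if every occurrence of a is eventually followed by b."""
    if a not in trace:
        return True
    last_a = len(trace) - 1 - trace[::-1].index(a)
    return b in trace[last_a + 1:]


def mine_rules(traces, min_conf):
    """Return the (a, b) rules satisfied by at least min_conf of the
    traces that contain a."""
    alphabet = sorted({event for t in traces for event in t})
    rules = []
    for a, b in permutations(alphabet, 2):
        with_a = [t for t in traces if a in t]
        conf = sum(satisfies(t, a, b) for t in with_a) / len(with_a)
        if conf >= min_conf:
            rules.append((a, b))
    return rules


def filter_traces(traces, min_conf=0.75):
    """Drop traces that violate any sufficiently confident rule."""
    rules = mine_rules(traces, min_conf)
    return [t for t in traces if all(satisfies(t, a, b) for a, b in rules)]


traces = [
    ["accept", "read", "write", "close"],
    ["accept", "read", "write", "close"],
    ["accept", "read", "write", "close"],
    ["accept", "read", "write"],  # hypothetical erroneous trace
]

print(len(filter_traces(traces)))  # 3
```

Here the rule "accept is eventually followed by close" holds in three of four traces, so the trace that never closes its connection is filtered out before learning.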

Implementation-wise, the structure of the filtering block is as shown in Figure 6.

CLUSTERING ENHANCEMENT

Lo and Khoo have proposed an architecture that groups similar traces into clusters (a horizontal split). Program traces are often long and involve several phases. Semantic clustering can be used to split the traces vertically into phases based on comments or annotations in the source code. By merging the ideas of SMArTIC and semantic clustering, it may be possible to split the traces not only horizontally (into separate clusters) but also vertically (into phases). This brings the benefit of a divide-and-conquer approach to the specification mining process: after the erroneous traces have been filtered out of the set of traces, the remaining traces are fed to the clustering block.
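A vertical split can be pictured as cutting one trace wherever its phase label changes. The sketch below assumes each event already carries a phase label, for instance one derived from source annotations; the labels, the event format, and the function name `split_into_phases` are hypothetical.

```python
from itertools import groupby


def split_into_phases(trace):
    """Cut a trace into consecutive runs of events sharing a phase label."""
    return [
        (label, [call for call, _ in group])
        for label, group in groupby(trace, key=lambda event: event[1])
    ]


# Hypothetical trace of (call, phase label) pairs.
trace = [
    ("socket", "setup"), ("bind", "setup"), ("listen", "setup"),
    ("accept", "serve"), ("read", "serve"), ("write", "serve"),
    ("close", "serve"),
    ("close", "teardown"),
]

for label, calls in split_into_phases(trace):
    print(label, calls)
```

Each phase can then be mined separately, so an inaccuracy learned for one phase stays local to that phase.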

Unlike the original SMArTIC, our technique splits the traces into both clusters and phases. This localizes the inaccuracies in learning one sub-specification and prevents them from permeating other sub-specifications. Also, by grouping related traces together, better generalization can be achieved when learning from each cluster.

The success of this process depends on the choice of algorithms. For splitting the traces horizontally into clusters, we can use the k-medoid algorithm, as used in SMArTIC [7]. The clustered traces are then analyzed for semantic similarity and grouped, according to that similarity, into different phases. The semantics of the source code, i.e., the names of identifiers, comments, etc., can be analyzed using an information retrieval technique called Latent Semantic Indexing (LSI) [15].
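The horizontal split can be illustrated with a compact k-medoid loop. This is a sketch under stated assumptions, not the SMArTIC implementation: it measures trace dissimilarity with the Jaccard distance over sets of call names (an illustrative choice), starts from the first k traces as medoids, and assumes non-empty traces. The function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Dissimilarity of two non-empty traces, compared as sets of calls."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)


def k_medoids(traces, k, max_iter=20):
    """Return clusters as lists of trace indices."""
    medoids = list(range(k))  # deterministic start: the first k traces
    for _ in range(max_iter):
        # Assignment step: attach each trace to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, t in enumerate(traces):
            nearest = min(medoids,
                          key=lambda m: jaccard_distance(t, traces[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance in its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(jaccard_distance(traces[c], traces[j])
                                  for j in members)))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return list(clusters.values())


traces = [
    ["socket", "bind", "listen", "accept"],
    ["open", "read", "close"],
    ["socket", "bind", "listen"],
    ["open", "write", "close"],
    ["open", "read", "write", "close"],
]

print(k_medoids(traces, k=2))  # [[0, 2], [1, 3, 4]]
```

Because a medoid is always one of the traces themselves, only pairwise distances are needed, which suits sequence data for which an average trace is not well defined.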

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.
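In the usual formulation, which is stated here for reference and whose symbols are not taken from the original text, LSI starts from a term-document matrix A with one row per term and one column per document. The SVD factors A, the k largest singular values are kept, and documents are compared by the cosine of their reduced vectors. In the setting of this paper a "document" would be the text (identifiers, comments) associated with a trace or code unit, which is an assumption of this sketch.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
\hat{d}_i = \Sigma_k V_k^{\mathsf{T}} e_i, \qquad
\operatorname{sim}(d_i, d_j) =
  \frac{\hat{d}_i \cdot \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Here e_i is the i-th standard basis vector, so the reduced vector of document i is the i-th column of the k-row matrix formed from the kept singular values and right singular vectors. Two documents can score as similar even when they share no term, provided their terms co-occur with the same other terms across the collection.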

It is called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text; it was first applied to text at Bell Laboratories in the late 1980s. The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and uses it to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI return results that are conceptually similar in meaning to the search criteria, even if the results don't share a specific word or words with the search criteria.

Combining SMArTIC with semantic clustering in this way may make it possible to split the traces not only horizontally (into separate clusters) but also vertically (into phases), bringing the benefit of a divide-and-conquer approach to the specification mining process.

CONCLUSION

Specification mining is a process of inferring models or properties that hold for a system. It starts with a program under analysis and/or a set of test cases, and it also involves running the test cases to produce a set of traces, which is later analyzed. Research on this topic has been ongoing for over a decade, and more than 50 papers have been published. These papers can be classified into studies on the extraction of finite state machines, works that mine value-based invariants, works that mine rules and patterns, and studies that extract sequence-diagram-like specifications.

The specification mining process, which originated in the fields of software engineering and programming languages, can be improved through the synergy of various computer science domains: data mining, software engineering, programming languages, learning theory, automata theory, etc. This synergy results in new and more objective evaluation frameworks, more accurate and more compact mining results, scalable and manageable mining processes, automation of manual processes, and novel applications. It even has an impact on the original domains from which several of the techniques originate.

D. Lo and S.-C. Khoo have proposed a novel architecture called SMArTIC [7], which addresses research issues such as accuracy, robustness and scalability. They proposed a metric of precision and recall for evaluating the quality of a specification mining engine that produces finite state machines. They also introduced trace clustering and trace filtering to reduce the effect of bad traces and inaccuracies during the inference of mined specifications.

We can now extend the clustering process of specification mining by using Latent Semantic Indexing to split the traces into phases. This brings the benefit of a divide-and-conquer approach to the specification mining process.

REFERENCES

[1] G. Ammons, R. Bodik, and J. R. Larus. Mining specifications. In Proc. of Principles of Programming Languages, 2002.
[2] J. E. Cook and A. L. Wolf. Automating process discovery through event data analysis. In Proceedings of ACM/IEEE International Conference on Software Engineering, pages 73-82, 1995.
[3] R. Alur, P. Cerny, G. Gupta, and P. Madhusudan. Synthesis of interface specifications for Java classes. In Proceedings of ACM Symposium on Principles of Programming Languages, pages 98-109, 2005.
[4] L. Mariani and M. Pezzè. Behavior capture and test: automated analysis for component integration. In Proceedings of IEEE International Conference on Engineering of Complex Computer Systems, pages 292-301, 2005.
[5] M. Acharya, T. Xie, and J. Xu. Mining interface specifications for generating checkable robustness properties. In Proceedings of International Symposium on Software Reliability Engineering, pages 311-320, 2006.
[6] V. Dallmeier, C. Lindig, A. Wasylkowski, and A. Zeller. Mining object behavior with ADABU. In Proceedings of International Workshop on Dynamic Analysis, pages 17-24, 2006.
[7] D. Lo and S.-C. Khoo. SMArTIC: Specification mining architecture with trace filtering and clustering. SoC-NUS tech. report TRA 8/06, 2006.
[8] M. Acharya, T. Xie, J. Pei, and J. Xu. Mining API patterns as partial orders from source code: from usage scenarios to specifications. In Proceedings of Joint Meeting of the European Software Engineering Conference and the ACM International Symposium on Foundations of Software Engineering, pages 25-34, 2007.
[9] L. Mariani, S. Papagiannakis, and M. Pezzè. Compatibility and regression testing of COTS-component-based software. In Proceedings of ACM/IEEE International Conference on Software Engineering, pages 85-95, 2007.
[10] D. Lorenzoli, L. Mariani, and M. Pezzè. Automatic generation of software behavioral models. In Proceedings of ACM/IEEE International Conference on Software Engineering, pages 501-510, 2008.
[11] L. Mariani and F. Pastore. Automated identification of failure causes in system logs. In Proceedings of International Symposium on Software Reliability Engineering, pages 117-126, 2008.
[12] W. Damm and D. Harel. LSCs: breathing life into message sequence charts. Journal on Formal Methods in System Design, 19(1):45-80, 2001.
[13] A. Kuhn, S. Ducasse, and T. Girba. Enriching reverse engineering with semantic clustering. In Proc. of Work. Conf. on Reverse Engineering, 2005.
[14] D. E. Comer and D. L. Stevens. Internetworking with TCP/IP: Client-Server Programming and Applications, BSD Socket Version. Prentice-Hall, Englewood Cliffs, NJ 07632, USA, 1993.
[15] A. Kuhn, S. Ducasse, and T. Girba. Enriching reverse engineering with semantic clustering. In Proc. of Work. Conf. on Reverse Engineering, 2005.
[16] D. Lo, S.-C. Khoo, J. Han, and C. Liu. Mining Software Specifications: Methodologies and Applications.
