SPECIFICATION MINING

DOI : 10.17577/IJERTCONV2IS10039

Download Full-Text PDF Cite this Publication

Text Only Version

SPECIFICATION MINING

Mrs. Sangeetha K Prof. Pankaj Dalal

M-Tech Scholar (Software Engineering) Associate Professor (CSE)

sangi.biju@gmail.com pkjdalal@gmail.com

Shrinathji Institute of Technology & Engineering, Shrinathji Institute of Technology & Engineering, Nathdwara-313301 Nathdwara-313301

ABSTRACT

To aid program testing eorts and help program verication tools to nd bugs and ensure correctness of systems the mined specications can be used. Mining of specification process starts with a program under analysis and/or a set of test cases. By applying static or dynamic analysis method traces should be produced from the program. Techniques employing dynamic analysis require the running of the test cases to produce a set of traces which is later analyzed. Program traces are often long and involve several phases. Semantic clustering can be used to split the traces vertically into phases based on comments or annotations on source code. Our technique merge the idea of SMArTIC (Specification Mining Architecture with Trace filtering and Clustering) and semantic clustering, to split the traces into horizontally (splitting traces into separate clusters) and also vertically (splitting traces into phases) which improves the accuracy, robustness and scalability of specification miners with trace filtering and clustering. This will also leverage the benefit of divide and conquer approach in the specification mining process.

Keywords- Specification mining, filtering, clustering, LSI

  1. INTRODUCTION

    Software bugs are prevalent. Bugs not only make it more expensive in developing software systems due to the high cost involved in debugging, but also may cause various security vulnerabilities. The absence of specications has made it harder to locate bugs. Many existing bug nding tools, e.g., model checking, require the availability of specications in order to locate bugs which are dened as anomalies or violations of these specications.

    To address the above challenges (i.e., to improve program understanding and to nd bugs), specication mining has been proposed. It is a program analysis method to automatically infer the specification of a program based on examples of correct usage. Usage refers to the manner in which the program or its exposed methods are invoked. For example, correct usage of resources such as a file or new connection follows acceptable invocation sequence: acquisition, access and then release. Similarly to use individual methods correctly, the parameters passed to it should meet the necessary preconditions. These are the implicit rules, followed by most programs but not explicitly stated, that mining techniques attempt to uncover.

    The mining of various specication formats such as automata and temporal rules has been developed. In general, specication mining techniques employ data mining or machine learning on execution traces to generate

    models that are useful in program verication. These techniques work under the assumption that by observing sucient executions of a good software implementation, inferences regarding the specication (or expected behavior) of the software can be made.

    Specication mining starts with a program under analysis and/or a set of test cases. There have been both dynamic and static approaches for specication mining. Broadly, dynamic specication mining techniques rely on actual executions of programs. In contrast, static approaches look to extract the specication by reasoning on the control ow of a subject program or of other client programs that invoke the subject. Static specication mining can be performed if program source code is available. However, to obtain precise specications, expensive analysis may have to be performed to eliminate infeasible paths. This obstacle is more overwhelming in the distributed case, where feasible scenarios (the number of processes and how they will interact) have to infer based on the static view provided by the program source code executed by each process.

  2. BACKGROUND

    The term specication mining is rst coined by Ammons et al.in [1]. According to him, Specication mining is a process of inferring models or properties that hold for a system. It discovers some of the temporal and data- dependence relationships that a program follows when it interacts with an application programming interface (API) or abstract data type (ADT).

    Alur et al. propose an approach to infer a nite state machine from Application Programming Interface (API) code [3]. Mariani and Pezz´e[4] propose a new grammar inference engine specially suited for program executions named k-behavior. Acharya et al [5] propose a static analysis approach to mine nite state machines from systems. Their tool leverages a model checker to statically generate traces of a system. Dallmeier et al [6] propose a hybrid static and dynamic analysis approach to infer a nite state machine. First, a set of methods termed as mutators (i.e., those that change a system state) and another set termed as inspectors (i.e., those that read/inspect a system state) are determined statically Acharya et al. extend their previous work [5] in [8]. Static traces are rst extracted from program code similar to their previous work. Relevant segments of traces are then recovered. These segments of traces fed to a frequent partial order miner. The resultant partial orders are then composed to generate a specication in the form of a nite state machine.

    Mariani et al. extend their previous work [4] in [9]. They propose a technique to generate prioritized regression test cases for the integration of Commercial-o-the-Shelf (COTS) components. A model in the form of a nite state machine and a set of boolean expressions on a set of variables, operators, and values is learned based on an

    Program

    Tracer

    Instrumented Program

    older version of a system. Lorenzoli et al. mine extended nite state machines (EFSMs) in [10]. An extended nite state machine enriches a standard nite state machine with value-based invariants. Mariani and Pastore mine nite

    Test Inputs Run Traces

    state machines to identify failure causes from system logs[11]. System logs are rst collected. Several pre- processing modules to detect for events in logs and to transform data to an appropriate format that abstracts away

    Flow Dependence Annotator

    Annotated traces

    concrete values are rst employed.

    Scenario Scenario

    Lo and Khoo extend the work by Ammons et al.[1] by proposing a metric of precision and recall in evaluating the quality of a specication mining engine producing nite state machines[7]. They also introduce trace clustering and trace ltering to reduce the eect of bad traces and inaccuracies during the inference of mined specications. They developed a mining architecture called SMArTIC which is Specication Mining Architecture with Trace fIltering and Clustering and is a API specication mining

    seeds

    Scenario Extractor

    Automation learner

    Figure 1: Specification Mining System

    strings

    Specification

    architecture used to improve the accuracy, robustness and scalability of specication miners.

  3. MINING PROCESS

    Mining process begins with traces of a programs run-time interaction with an API or ADT. According to Ammons et al. [1] specication mining system is composed of four parts: tracer, ow dependence annotator, scenario extractor, and automaton learner (Figure1).

    The tracer instruments programs so that they trace and record their interactions with an API or ADT, as well a compute their usual results. The tracers produce traces in a standard form, so that the rest of the process is independent of the tracing technology.

    Flow dependence annotation is the rst step in rening the traces into interaction scenarios, which can be fed to the learner. It connects an interaction that produces a value with the interactions that consume the value. Next, the scenario extractor uses these dependences to extract interaction scenariossmall sets of dependent interactionsand puts the scenarios into a standard, abstract form.

    The automaton learner is composed of two parts: an off- the-shelf probabilistic nite state automaton (PFSA) learner and a post processor called the corer.

    This process can be illustrated with the help of an example program statement as given below.

    int s = socket(AF_INET, SOCK_STREAM, 0);

    bind(s, &serv_addr, sizeof(serv_addr));

    listen(s, 5);

    while(1) {

    int ns = accept(s, &addr, &len); if (ns < 0) break;

    do { read(ns, buffer, 255);

    write(ns, buffer, size); if (cond1) return; }

    while (cond2) 16 close(ns); } close(s);

    Figure 2: An example program using the socket API.

    This program uses the server-side socket API [14]. It generally observes the correct protocol: create a new socket s through a call to socket, prepare s to accept connection s by calling bind and listen, call accept for each connection, service each connection, and nally call close to destroy s. Unfortunately, the program is buggy: if the return statement on line 14 is executed, s is never closed.

    Even though a program is buggy, individual interaction traces can be correct. The following figure shows one such trace. These traces can be given as the input to any mining system. After processing these traces, the automaton based miner will produce a specification automaton.

    1. socket (domain = 2, type = 1, proto = 0, return = 7)

    2. bind (so = 7, addr = 0x400120, addr_len = 6, return = 0)

    3. listen (so = 7, backlog = 5, return = 0)

    4. accept(so=7,addr = 0x400200,addr_len = 0x400240,return = 8)

    5 read (fd = 8, buf = 0x400320, len = 255, return = 12) 6 write (fd = 8, buf = 0x400320, len = 12, return = 12) 7 read (fd = 8, buf = 0x400320, len = 255, return = 7)

    1. write (fd = 8, buf = 0x400320, len = 7, return = 7)

    2. close (fd = 8, return = 0)

      10 accept (so = 7, addr = 0x400200, addr_len = 0x400240, return = 10)

      11 read (fd = 10, buf = 0x400320, len = 255, return = 13) 12 write (fd = 10, buf = 0x400320, len = 13, return = 13) 13 close (fd = 10, return = 0)

      Figure 3: A trace of an execution of the program in figure 2- Input to

      a mining system

      The figure 4 shows the output of a mining system. It shows the specification automaton for the socket protocol of the program segment given in the figure 2.

      socket(return = x) bind(so = x)

      listen(so = x) accept(so = x, return = y)

      read(fd = y) write(fd = y)

      Figure 5: SMArTIC Structure

      The overall structure of SMArTIC is as shown in Figure 5. It comprises 4 major blocks, namely ltering, clustering, learning and merging blocks. Each block is in turn com- posed of several major elements. The ltering block lters erroneous traces to address the robustness issue. The clustering block divides traces into groups of similar traces to address scalability issue. The learning block generates specications in the form of automata. The merging block merges the automatons generated from each cluster into a unied one.

      1. FILTERING BLOCK

        close(fd = x)

        close(fd = y)

        Figure 6: Filtering Block

        The ltering block aims to lter out erroneous traces based on common behavior found in a multi-set of program

        Figure 4: The output of the mining process: a specication automaton for the socket protocol.

  4. MINING ARCHITECTURE

    Lo and Khoo proposed a novel architecture for automaton based specification mining [7]. It explores the art and science behind the construction of such a miner. It achieves specication mining through pipelining of four functional components: Error-trace ltering, clustering, learning, and automaton merging.

    This architecture known as SMArTIC is Specication Mining Architecture with Trace fIltering and Clustering which is a API specication mining architecture used to improve the accuracy, robustness and scalability of specication miners.

    traces. Since a trace is a temporal or sequential ordering of events, representing common behavior by statistically signicant temporal rules will be appropriate. Certainly, temporal rules based on full set of temporal logics will be a good candidate, but it is desirable to have a more light- weight solution.

    Implementation-wise, the structure of the ltering block is as shown in Figure 6..

      1. CLUSTERING BLOCK

        Input traces might be mixed up from several unrelated scenarios. Grouping unrelated traces together for a learner to learn might multiply the effect of inaccuracies in learning a scenario. Such inaccuracies can be further permeated into other scenarios through generalization. The clustering block converts a set of traces into groups of related traces. Clustering is meant to localize inaccuracies in learning one sub-specication and prevent the inaccuracies from being permeated to other sub- specications. Furthermore, by grouping related traces together, better generalization can be achieved when learning from each cluster. Two major issues pertaining to clustering are: the choice of clustering algorithm and an

        appropriate similarity metric; i.e., measurement of similarity between two traces. The performance of the clustering algorithm is aected by appropriate similarity/distance metric. The general structure of the clustering block is as shown in Figure 7.

        1] CLUSTERING ALGORITHM: A classical o-the-shelf clustering algorithm, namely the k-medoid algorithm is used in SMArTIC. The k-medoid algorithm works by computing the distance between pairs of data items based on a similarity metric; this corresponds to computing the distance between pairs of traces. It then groups the traces with small distances apart into the same cluster. The k in k- medoid is the number of clusters to be created. In SMArTIC, the Turn algorithm presented by Foss is adapted into the k-medoid algorithm. The Turn algorithm can automatically determine the number of clusters to be created by considering the similarities within each cluster and dierences among clusters. This algorithm will repetitively increase the number of clusters. For each repetition, it will divide datasets into clusters and evaluate a measure of similarities within each cluster and dierences among dierent clusters. The algorithm will terminate once a local maximum is reached.

        Similarity metric for the program traces is calculated from the regular expression of the program traces. The regular expression representation will be obtained by converting a trace to its hierarchical grammar representation using Sequitur. The output of sequitur will be post-processed to construct the regular expression representation and then be fed in as input to global sequence alignment. With these a method to nd a reasonable distance metric for the similarity of program traces will be obtained. Generated sequitur grammar of a trace will be processed through several passes of reduction to produce a regular expression representation

        Figure 7: Clustering Block Structure

      2. LEARNING BLOCK

        Although temporal rules have also been used to capture certain information of a program specication, automata

        have been commonly used in capturing specications, especially protocol specications. The purpose of this learning block is to learn automatons from clusters of ltered traces. This block is actually a placeholder in the architecture. Dieret PFSA specication miners can be placed into this block, as long as they meet the input-output specication of a learner. Once a learner is plugged in, it will be used to mine the traces obtained from each cluster. At the end, the learner produces one mined automaton for each cluster. Sk-strings learner is used by Ammons [1] to mine the specication of the X11 windowing library. It is an extension of the k-tails heuristic algorithm of Biermann and Feldman for learning stochastic automata. In k-tails, two nodes in a constructed automaton are checked for equivalence by looking at subsequent k-length strings that can be generated from them. Dierent from k-tails, in sk- strings, subsequent strings need not necessarily end at an end node, except for strings of length less than k. Furthermore, only the top s% of the most probable strings that can be generated from both nodes is considered. Implementation-wise, the sk-strings learner rst builds a canonical machine similar to a prex-tree acceptor from the traces. The nodes in this canonical machine are later merged if they are indistinguishable with respect to the top s% of the most probable strings of length at most k that can be generated starting from them.

      3. MERGING BLOCK

        The merging process aims to merge multiple PFSAs produced by the learner into one such that there is no loss in precision, recall and likelihood before and after the merge. Equivalently, the merged PFSA accepts exactly the same set of sentences as the combined set (i.e., union) of sentences accepted by the multiple PFSAs. The primary purpose of the merging process is to reduce the number of states residing in the output PFSA by collapsing those transitions behaving equivalently in two or more input PFSAs, thus improving scalability.

        Merging of two PFSAs involves identication of equivalent transitions between the two PFSAs. The output PFSA contains all transitions available in the input PFSAs, with each set of equivalent transitions represented by a single transition.

        Merger block will merge PFSAs produced by learner block to a unied one. The merging is performed in the safe way where no further generalization is performed during the merging process. Hence the set of language accepted by the nal PFSA will be the union of the set of languages accepted by each of the PFSAs. The merging process is performed iteratively by merging 2 PFSAs at a time. Lets call them X and Y for ease of reference.

        There are 6 steps in automaton merge algorithm. They are

        1. Handling of exception cases,

        2. Creation of uniable list,

        3. Creation of mergable list,

        4. Creation of equivalence class,

        5. Creation of structural merge

        6. Creation of full merge.

  5. CLUSTERING ENHANCEMENT

    Lo and Khoo has proposed an architecture which groups the similar traces into cluster (horizontal split). Program traces are often long and involve several phases. Now semantic clustering can be used to split the traces vertically into phases based on comments or annotations on source code. By merging the idea of SMArTIC and semantic clustering, it may be possible to split the traces not only horizontally (splitting traces into separate clusters) but also vertically (splitting traces into phases). This will leverage the benefit of divide and conquer approach in the specification mining process. i.e, in the mining process after filtering the erroneous traces from the set of traces, it should be fed to the clustering block.

    Apart from the previous SMArTIC, our technique will split the traces both into clusters as well as phases. This will localize the inaccuracies in learning one sub specification and prevent the inaccuracies from being permeated to other sub specifications. Also by grouping related traces together, better generalization can be achieved when learning from each cluster.

    The success of this process depends upon the choice of the algorithms. For splitting the traces into clusters or splitting the traces horizontally, we can use k- medoid algorithm as used in SMArTIC [7]. Then the splitted traces will be given to retrieve the semantic similarity between different traces and we cluster these traces according to their similarity into different phases. The semantics of the source code i.e., the names of identifiers, comments etc can be analyzed by using an information retrieval technique called Latent Semantic Indexing (LSI) [15].

    Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.

    It is called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text; it was first applied to text at Bell Laboratories in the late 1980s. The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results dont share a specific word or words with the search criteria.

    By merging the idea of SMArTIC and semantic clustering, it may be possible to split the traces not only horizontally (splitting traces into separate clusters) but also vertically

    (splitting traces into phases). This will leverage the benefit of divide and conquer approach in the specification mining process

  6. CONCLUSION

Specication mining is a process of inferring models or properties that hold for a system. It starts with a program under analysis and/or a set of test cases and also involves the running of the test cases to produce a set of traces which is later analyzed. This topic has been ongoing for over a decade and more than 50 papers have been published on this topic. These papers can be classified into studies on the extraction of nite state machines, works that mine for value-based invariants, works that mine rules and patterns and studies extracting sequence diagram like specications.

Specication mining process originated in the eld of software engineering and programming language can be improved by the synergy of various computer science domains: data mining, software engineering, programming language, learning theory, automata theory, etc., resulting in: new and more objective evaluation frameworks, more accurate mining results, more compact mining results, scalable and manageable mining processes, automation of manual processes, novel applications and even has impact on the original domains where several techniques used are originated.

D.Lo and S. C. Khoo has proposed a novel architecture called SMArTIC [7] which addresses the research issues like accuracy, robustness and scalability. They proposed a metric of precision and recall in evaluating the quality of a specication mining engine producing nite state machines. They also introduce trace clustering and trace ltering to reduce the eect of bad traces and inaccuracies during the inference of mined specications.

Now we can extend the clustering process of specification mining, by using the concept of Latent Semantic Indexing for splitting the traces into phases. This will leverage the benefit of divide and conquer approach in the specification mining process.

REFERENCES

  1. G. Ammons, R. Bodik, and J. R. Larus. ining specication.

    In Proc. of Principles of Programming Languages, 2002

  2. J.E. Cook and A.L. Wolf. Automating process discovery through event data analysis. In Proceedings of ACM/IEEE International Conference on Software Engineering, pages 73 82, 1995.

  3. R. Alur, P. Cerny, G. Gupta, and P. Madhusudan. Synthesis of interface specications for java classes. In Proceedings of ACM Symposium on Principles of Programming Languages, pages 98109, 2005.

  4. L. Mariani and M. Pezz`e. Behavior capture and test: automated analysis for component integration. In Proceedings of IEEE International Conference on Engineering of Complex Computer Systems, pages 292301,2005.

  5. M. Acharya, T. Xie, and J. Xu. Mining interface specications for generating checkable robustness properties. In Proceedings of International Symposium on Software Reliability Engineering, pages 311320, 2006.

  6. V. Dallmeier, C. Lindig, A. Wasylkowski, and A. Zeller. Mining object behavior with ADABU. In Proceedings of International Workshop on Dynamic Analysis, pages 1724, 2006.

  7. D. Lo and S-C. Khoo. SMArTIC: Specication mining architecture with trace ltering and clustering. In SoC-NUS tech. report, TRA 8/06, 2006.

  8. M. Acharya, T. Xie, J. Pei, and J. Xu. Mining API patterns as partial orders from source code: from usage scenarios to specications. In Proceedings of Joint Meeting of the European Software Engineering Conference and the ACM International Symposium on Foundations of Software Engineering, pages 2534, 2007.

  9. L. Mariani, S. Papagiannakis, and M. Pezz`e. Compatibility and regression testing of COTS-component-based software. In Proceedings of ACM/IEEE International Conference on Software Engineering, pages 8595, 2007.

  10. D. Lorenzoli, L. Mariani, and M. Pezz`e. Automatic Generation of Software Behavioral Models. In Proceedings of ACM/IEEE International Conference on Software Engineering, pages 501510, 2008.

  11. L. Mariani and F. Pastore. Automated identication of failure causes in system logs. In Proceedings of International Symposium on Software Reliability Engineering, pages 117 126, 2008.

  12. W. Damm and D. Harel. LSCs: breathing life into message sequence charts. Journal on Formal Methods in System Design, 19(1):4580, 2001.

  13. A. Kuhn, S. Ducasse, and T. Girba. Enriching reverse engineering with semantic clustering. In Proc. of Work. Conf. on Reverse Engineering, 2005.

  14. Douglas E.Comer and David L.Stevens. Internetworking with TCP/IP.Client-server Programming and Applications, BSD Socket Version. Prentice-Hall,Englewood Cliffs,NJ 07632, USA,1993.

  15. A. Kuhn, S. Ducasse, and T. Girba. Enriching reverse engineering with semantic clustering. In Proc. of Work. Conf. on Reverse Engineering, 2005.

  16. Mining Software Specifications: Methodologies and Applications By: David Lo; Siau-Cheng Khoo; Jiawei Han; Chao Liu0

Leave a Reply