Ranking of Database Query Results Based on Concept Hierarchies

Kaligotla Ravi Kumar

doi:10.17577/IJERTV3IS10951

Volume 03, Issue 01 (January 2014)


Open Access
Authors : Kaligotla Ravi Kumar
Paper ID : IJERTV3IS10951
Volume & Issue : Volume 03, Issue 01 (January 2014)
Published (First Online): 29-01-2014
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Kaligotla Ravi Kumar

Lecturer, Department of Computer Science Engineering, Bule Hora University, Ethiopia, Africa.

Abstract:

Search queries on biomedical databases, such as PubMed, often return a large number of results, only a small subset of which is relevant to the user. Ranking and categorization, which can also be combined, have been proposed to alleviate this information overload problem. Results categorization for biomedical databases is the focus of this work. A natural way to organize biomedical citations is according to their MeSH annotations. MeSH is a comprehensive concept hierarchy used by PubMed. In this paper, we present the BioNav system, a novel search interface that enables the user to navigate a large number of query results by organizing them using the MeSH concept hierarchy. First, the query results are organized into a navigation tree. At each node expansion step, BioNav reveals only a small subset of the concept nodes, selected such that the expected user navigation cost is minimized. In contrast, previous works expand the hierarchy in a predefined static manner, without navigation cost modeling. We show that the problem of selecting the best concepts to reveal at each node expansion is NP-complete and propose an efficient heuristic as well as a feasible optimal algorithm for relatively small trees. We show experimentally that BioNav outperforms state-of-the-art categorization systems with respect to the user navigation cost. We have implemented BioNav for the MEDLINE database at http://db.cse.buffalo.edu/bionav.

Introduction

A literature survey is one of the most important steps in the software development process. Before developing the tool, it is necessary to assess the available time, the budget, and the strength of the organization. Once these are satisfied, the next step is to determine which operating system and language can be used for developing the tool. Once the programmers start building the tool, they need considerable external support, which can be obtained from senior programmers, from books, or from websites. These considerations were taken into account before building the proposed system.

Many solutions have been proposed to address the problem of oversized query results, commonly referred to as information overload [1], [2], [3], [9], [16]. These approaches can be broadly classified into two classes, ranking and categorization, which can also be combined. Ranking presents the user with a list of results ordered by some metric of relevance [9] or by content similarity to a result or a set of results [16]. In categorization [1], [2], [3], query results are grouped based on hierarchies, keywords, tags, or attribute values. User studies have demonstrated the usefulness of categorization in finding relevant results of exploratory queries [12]. While ranked results are useful when the ranking function is aligned with user preferences or the result list is small, categorization is generally employed by users when ranking fails or the query is too broad [12]. BioNav belongs primarily to the categorization class, which is especially suitable for this domain given the rich concept hierarchies (e.g., MeSH [19]) available for biomedical data.

An intuitive way to categorize the results of a query on PubMed is by using the MeSH static concept hierarchy [19], thus, utilizing the initiative of the US National Library of Medicine (NLM) to build and maintain such a comprehensive structure. Each citation in MEDLINE is associated with several MeSH concepts in two ways: 1) by being explicitly annotated with them, and 2) by mentioning those in their text (see Section 7 for details). Since these associations are provided by PubMed, a relatively straightforward interface to navigate the query result would first attach the citations to the corresponding MeSH concept nodes and then let the user navigate the navigation tree. Fig. 1 displays a snapshot of such an interface where shown next to each node label is the count of distinct citations in the sub-tree rooted at that node.
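The count shown next to each node can be illustrated with a minimal sketch. It assumes a small in-memory tree; the `ConceptNode` class, the parent concept "Proteins", and the citation IDs are hypothetical and are not taken from BioNav or from actual MeSH tree positions. A citation attached to several concepts in the same sub-tree is counted once, which is why sets are used.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """One concept of the hierarchy plus the citations attached directly to it."""
    label: str
    citations: set = field(default_factory=set)   # IDs of citations attached here
    children: list = field(default_factory=list)  # child ConceptNode objects

    def subtree_citations(self) -> set:
        """Return the distinct citation IDs attached anywhere in this sub-tree."""
        found = set(self.citations)
        for child in self.children:
            found |= child.subtree_citations()
        return found

    def count(self) -> int:
        """Number shown next to the node label: distinct citations in the sub-tree."""
        return len(self.subtree_citations())


# Hypothetical toy data: citation 102 is attached to two concepts of one sub-tree.
histones = ConceptNode("Histones", {101, 102})
proteins = ConceptNode("Proteins", {102, 103}, [histones])
transcription = ConceptNode("Transcription, Genetic", {103, 104})
root = ConceptNode("MeSH", set(), [proteins, transcription])

for node in (root, proteins, histones, transcription):
    print(f"{node.label} ({node.count()})")
```

Recomputing the set at every call keeps the sketch short; a real interface would more likely cache the per-node sets once the tree is built.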

A typical navigation starts by revealing the children of the root ranked by their citation count, and is continued by the user expanding one or more of them, revealing their ranked children and so on, until she clicks on a concept and inspects the citations attached to it. A similar interface and navigation method is used by e-commerce sites, such as Amazon and eBay. For this example interaction, we assume that some of the citations the user is interested in are available on the three indicated concepts corresponding to three independent lines of research related to prothymosin, and therefore the user is interested in navigating to these concepts. These are Histones, which play a role in gene regulation and are essential for virus replication and tumor growth; Cell Growth Processes; and Transcription, Genetic, a key process for the synthesis and replication of RNA that thus plays an important role in the duplication of cancer cells.
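The static navigation step described above, in which every child of the expanded node is revealed and ranked by citation count, could be sketched as follows. The dictionary-based tree, the function names, the parent concepts, and the citation IDs are all hypothetical; only the three target concept labels come from the example.

```python
def subtree_citations(tree, attached, node):
    """Collect the distinct citation IDs attached to `node` or any descendant."""
    ids = set(attached.get(node, ()))
    for child in tree.get(node, ()):
        ids |= subtree_citations(tree, attached, child)
    return ids


def expand_static(tree, attached, node):
    """Reveal all children of `node`, ranked by distinct citation count (descending).

    Ties are broken alphabetically so that the order is deterministic.
    """
    ranked = [
        (child, len(subtree_citations(tree, attached, child)))
        for child in tree.get(node, ())
    ]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Hypothetical toy hierarchy: parent -> children.
tree = {
    "Root": ["Proteins", "Cell Growth Processes", "Genetic Processes"],
    "Proteins": ["Histones"],
    "Genetic Processes": ["Transcription, Genetic"],
}
# Hypothetical citation IDs attached directly to each concept.
attached = {
    "Proteins": {3, 4},
    "Histones": {1, 2, 3},
    "Cell Growth Processes": {5},
    "Transcription, Genetic": {2, 6},
}

first_level = expand_static(tree, attached, "Root")
print(first_level)
```

The user would then call `expand_static` again on whichever revealed child she chooses, repeating until she reaches a concept whose citations she wants to inspect.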
1. Overview of Computer Attacks
  
  A computer attack is any malicious activity directed at a computer system or the services provided by the system. Attacks can take the form of viruses, Denial of Service (DoS), exploitation of bugs, unauthorized use of services, or even a physical attack against the computer hardware. An understanding of the techniques used by crackers, or even script kiddies, goes a long way toward building a system that protects institutions against intrusions. Hackers have employed a variety of techniques to break into systems. These include, but are not limited to:
  - Social Engineering: This is the type of attack where the attacker fools an authorized network user into divulging information that leads to access to network resources. The details may include passwords, security policies, and network or host addresses. The attacker could also impersonate genuine users, service providers, or support staff, and sometimes uses this identity to deliver harmful software such as a Trojan.
  - Abuse of Feature: In abuse of feature, authorized users perform actions that are considered illegitimate. They may open illegitimate sites, send requests, or open telnet sessions to prohibited servers or hosts, which could, for example, fill up the user's assigned quota or storage space.
  - Configuration Errors: The attacker may gain access to a network through configuration errors. Such weaknesses include allowing access without prompting for any form of authentication, for example by assigning users guest accounts that have no password requirement. Users may erroneously be assigned administrative privileges. A user may also obtain the privileges of other users by obtaining their passwords or by bypassing controls that restrict access (root attack).
    - Masquerading: Having monitored the flow of traffic, the attacker can modify a TCP packet, or originate a packet and send it with a forged source address, to give the impression that it comes from a trusted source. The attacker can then send a Trojan, perform a denial of service, or request sensitive data.
    - Worms: These are destructive self-replicating programs that spread across the network. They can be sent as e-mail attachments to hamper network performance.
    - Viruses: These are programs that replicate when a user performs some action, such as running a program. Viruses may have the ability to destroy sensitive documents.
    - Implementation Bug: Most trusted applications and programs have bugs. These bugs are exploited by hackers to gain access to computer systems or networks. Examples include buffer overflows, race conditions, and mishandled files. This is also referred to as a client attack, where the attacker may exploit bugs in the client, the server, the network software, or the protocol.
      
      Generally, these attacks can be classified into four major classes: interception, interruption, modification, and fabrication.
    - Interception: This means that some unauthorized party gains access to an asset. The party could be a program, a person, or a computing system. Examples are wiretapping and illicit copying of program or data files. The damage can be worse when the attacker leaves no traces on the network.
    - Interruption: In this situation, an asset of the system is made unavailable or unusable. Examples are malicious destruction of a hardware device, erasure of a program or data file, and malfunctioning of an operating system so that it cannot, say, access a particular disk.
    - Modification: The attacker both accesses and tampers with the asset. This may include altering a program so that it performs an additional computation, or modifying data being transmitted. Modifications range from simple changes to more subtle changes that may not even be detected.
    - Fabrication: This involves creating counterfeit objects on the computing system. When skillfully done, a fabrication may also go undetected and is thus a very serious threat to network security.
Theoretical Analysis

In this paper, we present the BioNav system, a novel search interface that enables the user to navigate a large number of query results by organizing them using the MeSH concept hierarchy. First, the query results are organized into a navigation tree. At each node expansion step, BioNav reveals only a small subset of the concept nodes, selected such that the expected user navigation cost is minimized. In contrast, previous works expand the hierarchy in a predefined static manner, without navigation cost modeling. We show that the problem of selecting the best concepts to reveal at each node expansion is NP-complete and propose an efficient heuristic as well as a feasible optimal algorithm for relatively small trees. We show experimentally that BioNav outperforms state-of-the-art categorization systems with respect to the user navigation cost. We have implemented BioNav for the MEDLINE database at http://db.cse.buffalo.edu/bionav.
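To make the idea of cost-driven expansion concrete, the sketch below exhaustively searches for the subset of concepts to reveal that minimizes an expected cost, which is only practical for small inputs. It is not the BioNav cost model or algorithm: the flat list of candidate concepts, the cost formula, the interest probabilities, and all names are assumptions chosen for illustration. The assumed cost charges one unit per revealed label, plus the citations the user must inspect, where a user interested in an unrevealed concept has to scan every citation left under the unrevealed remainder.

```python
from itertools import combinations


def expected_cost(revealed, citations, interest):
    """Toy expected navigation cost of revealing the concepts in `revealed`.

    Assumed model (not from the paper): reading each revealed label costs 1;
    a user targeting a revealed concept inspects only its citations, while a
    user targeting a hidden concept inspects every citation of the hidden ones.
    """
    hidden_pool = set().union(
        *(ids for concept, ids in citations.items() if concept not in revealed)
    )
    cost = len(revealed)
    for concept, probability in interest.items():
        if concept in revealed:
            cost += probability * len(citations[concept])
        else:
            cost += probability * len(hidden_pool)
    return cost


def best_expansion(citations, interest, max_revealed):
    """Try every subset of at most `max_revealed` concepts and keep the cheapest."""
    best = None
    for size in range(max_revealed + 1):
        for subset in combinations(sorted(citations), size):
            cost = expected_cost(set(subset), citations, interest)
            if best is None or cost < best[0]:
                best = (cost, subset)
    return best


# Hypothetical citation IDs and interest probabilities (the latter sum to 1).
citations = {
    "Histones": {1, 2},
    "Transcription, Genetic": set(range(3, 11)),
    "Cell Growth Processes": {11, 12, 13, 14},
    "Apoptosis": {15, 16},
}
interest = {
    "Histones": 0.5,
    "Transcription, Genetic": 0.25,
    "Cell Growth Processes": 0.125,
    "Apoptosis": 0.125,
}

cost, revealed = best_expansion(citations, interest, max_revealed=2)
print(cost, revealed)
```

The number of subsets examined grows combinatorially with the number of candidate concepts, which is consistent with reserving an exact search for small trees and using a heuristic otherwise.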
1. Existing System
  
  In the existing search operation, information overload is a major problem when searching biomedical databases such as PubMed, where typically a large number of citations are returned, of which only a small subset is relevant to the user.
2. Proposed System
  
  Earlier categorization proposals [2], [3] dynamically categorize SQL query results by inferring a hierarchy based on the characteristics of the result tuples. Their domain is the tuple attributes, and their problem is how to organize these attributes hierarchically in order to minimize the navigation cost. They also decide the value ranges for each attribute, both categorical and numerical, and how to rank them. One of the systems takes the user's preferences into consideration during the inference for a more personalized experience. Once the hierarchy is inferred, they follow a static navigation method. BioNav is distinct since it offers dynamic navigation on a predefined hierarchy, such as the MeSH concept hierarchy. Hence, BioNav is complementary to these systems, since it can be used to optimize the navigation after these systems construct the navigation tree.
3. System Study
  The purpose of this study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance depends solely on the methods employed to educate the user about the system and to make the user familiar with it. The user's confidence must be raised so that the user is also able to offer constructive criticism, which is welcome, since the user is the final user of the system.
Experimental Investigation
1. System Design
Implementation
1. Introduction
  
  Implementation is the stage of the project when the theoretical design is turned into a working system. It can thus be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.
  
  The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, designing of methods to achieve changeover and evaluation of changeover methods.
2. Modules
  - Operating System: Windows 2000/XP
  - Application Server: Tomcat 7.0.11
  - Front End: J2EE (HTML, Java, JSP, Servlets)
  - Scripts: JavaScript
  - Development Tool: NetBeans IDE 7.0
  - Build Tool: Ant
  - Server-Side Script: Java Server Pages
  - Database: MS Access
  - Database Connectivity: JDBC
Experimental Results
1. Testing Introduction
  
  The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies, and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests, and each test type addresses a specific testing requirement.
2. Test Strategy and approach
  Integration Testing
  
  Software integration testing is the incremental testing of two or more integrated software components on a single platform to expose failures caused by interface defects.
  
  The task of the integration test is to check that components or software applications (for example, components in a software system or, one step up, software applications at the company level) interact without error.
  
  Test Results: All the test cases mentioned above passed successfully. No defects encountered.
  
  Acceptance Testing
  
  User Acceptance Testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements.
  
  Test Results: All the test cases mentioned above passed successfully. No defects encountered.
Discussion of Results

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a bibliographic database of life science and biomedical information.

MEDLINE is freely available on the Internet and searchable via PubMed and the USNLM (United States National Library of Medicine) National Center for Biotechnology Information.
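For orientation, a PubMed keyword query of the kind whose results BioNav organizes can be expressed as a request to the NCBI Entrez Programming Utilities [7]. The sketch below only builds the request URL for the ESearch utility and performs no network access. The helper name `build_esearch_url` and the parameter values are illustrative assumptions, and it is not implied that BioNav retrieves its results this way.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI Entrez Programming Utilities.
ESEARCH_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, max_results=20):
    """Build an ESearch URL that asks PubMed for citation IDs matching `term`."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # the user's keyword query
        "retmax": max_results,   # upper bound on the number of returned IDs
        "retmode": "xml",        # response format
    }
    return f"{ESEARCH_ENDPOINT}?{urlencode(params)}"


url = build_esearch_url("prothymosin")
print(url)
```

Fetching this URL returns a list of PubMed identifiers; attaching those citations to MeSH concepts is a separate step that this sketch does not cover.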

Search queries on biomedical databases, such as PubMed, often return a large number of results, only a small subset of which is relevant to the user. Ranking and categorization, which can also be combined, have been proposed to alleviate this information overload problem. Results categorization for biomedical databases is the focus of this work. A natural way to organize biomedical citations is according to their MeSH annotations. MeSH is a comprehensive concept hierarchy used by PubMed. In this paper, we present the BioNav system, a novel search interface that enables the user to navigate a large number of query results by organizing them using the MeSH concept hierarchy. First, the query results are organized into a navigation tree. At each node expansion step, BioNav reveals only a small subset of the concept nodes, selected such that the expected user navigation cost is minimized. We have implemented BioNav for the MEDLINE database.
Conclusion

Information overload is a major problem when searching biomedical databases such as PubMed, where typically a large number of citations are returned, of which only a small subset is relevant to the user. In this paper, we presented the BioNav system to address this problem. Our solution is to organize the query results according to their associations to concepts of the MeSH concept hierarchy, and to provide a dynamic navigation method that minimizes the information overload observed by the user. When the user expands a MeSH concept on our web interface, BioNav reveals only a selective list of descendant concepts, ranked based on their estimated relevance to the user's query, instead of simply showing all its children. We formally stated the underlying framework and the navigation and cost models used for the evaluation of our approach. Our complexity result proved that the problem of expanding the navigation tree in a way that minimizes the user's navigation cost is NP-complete. A feasible (for small trees) optimal algorithm and an efficient heuristic were developed. Experimental results validated the effectiveness of the proposed heuristic for diverse sets of queries and navigation trees, when compared to categorization systems using a static navigation method. The architecture of the BioNav system was implemented and is available at http://db.cse.buffalo.edu/bionav.

References:

[1] S. Agrawal, S. Chaudhuri, G. Das and A. Gionis: Automated Ranking of Database Query Results. In Proceedings of First Biennial Conference on Innovative Data Systems Research (CIDR), 2003.
[2] K. Chakrabarti, S. Chaudhuri and S.W. Hwang: Automatic Categorization of Query Results. SIGMOD Conference 2004: 755-766.
[3] Z. Chen and T. Li: Addressing Diverse User Preferences in SQL-Query-Result Navigation. SIGMOD Conference 2007: 641-652.
[4] L. Comtet: Advanced Combinatorics: The Art of Finite and Infinite Expansions, rev. enl. ed. Dordrecht, Netherlands: Reidel, pp. 176-177, 1974.
[5] R. Delfs, A. Doms, A. Kozlenkov and M. Schroeder: GoPubMed: Ontology-Based Literature Search Applied to Gene Ontology and PubMed. German Conference on Bioinformatics 2004: 169-178.
[6] D. Demner-Fushman and J. Lin: Answer Extraction, Semantic Clustering, and Extractive Summarization for Clinical Question Answering. International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics, 2006: 841-848.
[7] (2008) Entrez Programming Utilities. Available: http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
[8] U. Feige, D. Peleg and G. Kortsarz: The Dense k-Subgraph Problem. Algorithmica 29 (2001) 410-421.
[9] V. Hristidis and Y. Papakonstantinou: DISCOVER: Keyword Search in Relational Databases. In Proc. of VLDB Conference, 2002.
[10] R. Hoffman and A. Valencia: A gene network for navigating the literature. Nature Genetics, 36(7):664, 2004.
[11] (2008) Humboldt-Universität zu Berlin Ali Baba: PubMed as a graph. [Online]. Available: http://alibaba.informatik.huberlin.
[12] (2008) iHOP – Information Hyperlinked over Proteins. Available: http://www.ihop-net.org/UniPub/iHOP/
[13] A. Kashyap, V. Hristidis, M. Petropoulos, and S. Tavoulari: BioNav: Effective Navigation on Query Results of Biomedical Databases. (Short Paper), ICDE 2009, to appear. Available at http://www.cs.fiu.edu/~vagelis/publications/BioNavICDE09.pdf
[14] S. Kundu and J. Misra: A Linear Tree Partitioning Algorithm. SIAM J. Comput. 6(1): 151-154 (1977).
[15] W. Lee, L. Raschid, H. Sayyadi and P. Srinivasan: Exploiting Ontology Structure and Patterns of Annotation to Mine Significant Associations between Pairs of Controlled Vocabulary Terms. DILS 2008: 44-60.
[16] J. Lin and W.J. Wilbur: PubMed Related Articles: A Probabilistic Topic Based Model for Content Similarity. BMC Bioinformatics, vol. 8, article no. 423, 2007.
[17] D. Lindberg, B. Humphreys, and A. McCray: The Unified Medical Language System. Methods of Information in Medicine, vol. 32, no. 4, pp. 281-291, 1993.
[18] D. Maglott, J. Ostell, K.D. Pruitt, and T. Tatusova: Entrez Gene: Gene-Centered Information at NCBI. Nucleic Acids Research, vol. 33, pp. D54-D58, Jan. 2005.
[19] Medical Subject Headings (MeSH), http://www.nlm.nih.gov/mesh/, 2010.
[20] J.A. Mitchell, A.R. Aronson, and J.G. Mork: Gene Indexing: Characterization and Analysis of NLM's GeneRIFs. Proc. AMIA Ann. Symp., pp. 460-464, Nov.
