Text Mining and Software Engineering: An Integrated Source Code and Document Analysis Approach*

René Witte and Qiangqiang Li
Institut für Programmstrukturen und Datenorganisation (IPD)
Fakultät für Informatik
Universität Karlsruhe (TH), Germany

Yonggang Zhang and Juergen Rilling
Department of Computer Science and Software Engineering
Concordia University, Montréal, Canada

* This paper is a postprint of a paper submitted to and accepted for publication in the IET Software Journal, Vol. 2, No. 1, 2008, and is subject to IET copyright [http://www.iet.org]. The copy of record is available at http://link.aip.org/link/?SEN/2/3/1.

Abstract

Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents. A particular novelty is the integration of results from automated source code analysis into a natural language processing (NLP) pipeline, which allows software artifacts represented in code and in natural language to be cross-linked on a semantic level.

1 Introduction

With the ever-increasing number of computers and their support for business processes, an estimated 250 billion lines of source code were being maintained in 2000, with that number rapidly increasing. The relative cost of maintaining and managing the evolution of this large software base now represents more than 90% of the total cost associated with a software product.

One of the major challenges for software engineers while performing a maintenance task is the need to comprehend a multitude of often disconnected artifacts created originally as part of the software development process. These artifacts include, among others, source code and corresponding software documents, e.g., requirements specifications, design descriptions, and user's guides. From a maintainer's perspective, it becomes essential to establish and maintain the semantic connections among all these artifacts.

Automated source code analysis, implemented in integrated development environments like Eclipse, has improved software maintenance significantly. However, integrating the often large amount of corresponding documentation requires new approaches to the analysis of natural language documents that go beyond simple full-text search or information retrieval (IR) techniques.

In this paper, we propose a Text Mining (TM) approach to analyse software documents at a semantic level. A particular feature of our system is its use of formal ontologies (in OWL-DL format), both during the analysis process and as an export format for the results. In combination with a source code analysis system for populating code-specific parts of the ontol-...