Identifying Semantic Outliers of Source Code Artifacts and Their Application to Software Architecture Recovery

Bibliographic Details
Main Authors: Ki-Seong Lee, Chan-Gun Lee
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/9269345/
Description
Summary: Understanding software architecture is essential to software maintenance. There has been much effort to derive software architecture views from source code artifacts. Typically, along with structural information, the semantic information derived from identifier names and comments is helpful. However, because the choice of code vocabulary depends on each developer's subjective decisions, some source code may contain text of low semantic quality, leading to inaccurate architecture recovery. This paper aims to improve the architecture recovery of a software system by identifying and removing the semantic outliers among its source code artifacts. Accordingly, we propose a novel measure, Conceptual Conformity (CC), which computes the similarity between two latent topic distributions: one obtained from a source code file and one obtained from its package. We use CC to identify source code that is not relevant to its package's semantic context and define such code as a semantic outlier. Because semantic outliers may cause inaccurate architecture recovery, we remove them during the recovery process. We apply our approach to three open-source projects. The results demonstrate that, for projects with low recovery performance, removing outliers leads to higher recovery accuracy.
ISSN: 2169-3536
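
The summary describes CC only as a similarity between the latent topic distribution of a source file and that of its package; it does not say which similarity function or cutoff the authors use. The following is a minimal sketch of that idea under stated assumptions, not the paper's implementation: it assumes the topic distributions have already been produced by some topic model, it swaps in one minus the Jensen-Shannon divergence (base 2) as the similarity, and the names `conformity`, `semantic_outliers`, and the `threshold` value are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Midpoint distribution between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def conformity(file_topics, package_topics):
    """Assumed stand-in for CC: 1.0 means identical topic mixes, 0.0 disjoint."""
    return 1.0 - js_divergence(file_topics, package_topics)


def semantic_outliers(files, package_topics, threshold=0.5):
    """Return names of files whose conformity to the package is below threshold.

    `files` maps a file name to its topic distribution; `threshold` is an
    illustrative cutoff, not a value taken from the paper.
    """
    return [
        name
        for name, topics in files.items()
        if conformity(topics, package_topics) < threshold
    ]
```

For example, with a hypothetical package distribution `[0.7, 0.2, 0.1]` and files `{"OrderService.java": [0.6, 0.3, 0.1], "StringUtils.java": [0.0, 0.05, 0.95]}`, `semantic_outliers` flags only `StringUtils.java`, whose topic mix diverges sharply from the package's. The Jensen-Shannon form was chosen here because it is symmetric and bounded, which makes a fixed cutoff easy to interpret; other similarity measures would fit the same structure.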