Summary: | Code cloning considerably facilitates software development, but it also leads to recurring bugs and other software quality problems. In this paper, we propose a fast code clone detection method based on weighted recursive autoencoders (RAE) that measures code similarity at the function level. Unlike approaches that rely on manually defined features, our deep learning-based method learns program features automatically. First, we analyze program abstract syntax trees (ASTs) using the weighted RAE, extract program features, and encode each function as a vector. During modeling, we take node weights in the AST into account so that important nodes contribute a larger share of the information in the final vector representation of a program. Second, we report functions with similar vectors as code clone pairs. This second phase is time-consuming on large software systems because it requires a quadratic number of pairwise comparisons. To solve this problem, we recast clone detection as an approximate nearest neighbor search (ANNS) over a set of high-dimensional vectors and use the navigating spreading-out graph to reduce the computational time complexity. Experimental results on BigCloneBench show that our method outperforms the compared algorithm based on unweighted RAE in precision, recall, and AUC, and returns clone pairs in approximately 33 min, whereas the compared algorithm requires approximately 14 days to perform pairwise comparisons among the vectors of 785,438 functions. Our method also outperforms many prominent tools, including Oreo, in detecting Moderately Type-3 or Type-4 clones, and its false positive rate (FPR) is 0.055, which indicates few false positives. More importantly, our method requires no labeled data, and all of the source code is released to ensure experimental reproducibility.
|
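
To make the cost of the second phase concrete, the following is a minimal sketch of the brute-force baseline that the summary describes as quadratic: every pair of function vectors is compared, and pairs above a similarity threshold are reported as clones. It is not the released implementation. The use of cosine similarity, the threshold of 0.9, the toy three-dimensional vectors, and the names `cosine`, `clone_pairs_bruteforce`, and `pair_count` are all assumptions made for illustration; the summary does not state which similarity measure or threshold is used.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clone_pairs_bruteforce(vectors, threshold=0.9):
    """Compare every pair of function vectors: O(n^2) comparisons.

    `vectors` maps a function name to its vector; `threshold` is a
    hypothetical cut-off, not a value taken from the paper.
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(vectors.items(), 2):
        if cosine(vec_a, vec_b) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


def pair_count(n):
    """Number of unordered pairs among n vectors: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Toy vectors standing in for learned function encodings (made up).
vectors = {
    "read_file_a": [0.9, 0.1, 0.0],
    "read_file_b": [0.8, 0.2, 0.1],
    "sort_list": [0.0, 0.1, 0.9],
}

print(clone_pairs_bruteforce(vectors))
print(pair_count(785438))
```

For the 785,438 functions mentioned above, `pair_count` gives roughly 3.1 × 10^11 comparisons, which is the workload that an approximate nearest neighbor index is meant to avoid by examining only a small candidate set per query vector.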