Dataset for Source Code Fragment Summarization

Recent studies have applied various approaches to summarizing software artifacts, yet very few efforts have addressed summarizing the source code fragments available on the web.

This project investigates a novel approach to generating summary lines for code fragments using supervised machine learning algorithms and a crowdsourcing mechanism. We introduce crowdsourcing as a problem-solving model for extracting the source code features used in summarizing software artifacts. To the best of our knowledge, this is the first effort to employ crowdsourcing in summarizing software artifacts.

Our corpus consists of 127 code fragments retrieved from the Eclipse and NetBeans FAQs on the web. The corpus can be downloaded here.

In this project, we organized the crowdsourcing activity as an open call on the intranet of our institution. Altogether, 10 individuals responded to our call and nine submitted their work. On average, these individuals have three to four years of software development and research experience.

The list of features extracted through crowdsourcing can be accessed here.

Source code for our SVM and Naive Bayes classifiers is available here.
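For readers who want a feel for the classification step before downloading the code, the following is a minimal sketch only, not the project's implementation. It assumes each line of a code fragment is encoded as a binary feature vector and labeled 1 if it belongs in the summary and 0 otherwise. The three feature names and the six training rows are hypothetical and made up for illustration; the real features are the crowdsourced ones listed above.

```python
import math


class BernoulliNaiveBayes:
    """Tiny Bernoulli Naive Bayes over binary feature vectors (Laplace smoothing)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.feature_prob = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # Smoothed estimate of P(feature = 1 | class)
            self.feature_prob[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)
            ]
        return self

    def predict(self, x):
        def score(c):
            # Log prior plus the log likelihood of each observed feature value
            s = self.log_prior[c]
            for value, p in zip(x, self.feature_prob[c]):
                s += math.log(p if value else 1.0 - p)
            return s

        return max(self.classes, key=score)


# Hypothetical binary features per code line (illustrative names only):
#   [contains_method_call, is_declaration, shares_term_with_faq_title]
X_train = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
# 1 = line belongs in the summary, 0 = it does not (made-up labels)
y_train = [1, 1, 1, 0, 0, 0]

model = BernoulliNaiveBayes().fit(X_train, y_train)
```

Calling `model.predict([1, 0, 1])` on this toy data labels the line as a summary line, while `model.predict([0, 1, 0])` does not. An SVM could be swapped in at the same point in the pipeline; the feature encoding would stay the same.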

People

1. Najam Nazar

2. He Jiang

3. Guojun Gao

4. Tao Zhang*

5. Xiaochen Li

6. Zhilei Ren

* Tao Zhang is currently with the Hong Kong Polytechnic University.

Source Code Fragment Summarization with Crowdsourcing Based Features