software defect detection, where the labeled training examples are limited while the unlabeled examples are abundant. Recently, Seliya and Khoshgoftaar [8] applied a generative-model-based semi-supervised learning method to software defect detection and achieved performance improvement. Note that [8] adopted a generative approach for exploiting unlabeled data while the proposed method adopts a discriminative approach. Thus, we did not include it in our empirical study for the purpose of fair comparison.

Learning from Imbalanced Data

In many real-world applications such as software defect detection, the class distribution of the data is imbalanced, that is, the examples from the minority class are (much) fewer than those from the other class. Since it is easy to achieve good performance by keeping the majority-class examples classified correctly, the sensitivity of the classifiers to the minority class may be very low when learning directly from the imbalanced data. To achieve better sensitivity to the minority class, the class-imbalance problem should be explicitly tackled. Popular class-imbalance learning techniques include sampling [11, 43-44] and cost-sensitive learning [45-46]. Since the sampling technique is used in this paper, we introduce sampling in more detail.

Sampling attempts to achieve a balanced class distribution by altering the dataset. Under-sampling reduces the number of majority-class examples while over-sampling increases the number of minority-class examples [11], both of which have been shown to be effective for class-imbalance problems.
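The two basic sampling strategies above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the feature matrix `X`, the label convention (+1 minority, −1 majority), and the random seed are assumptions for the example.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Drop majority-class examples until both classes are equally sized."""
    min_idx = np.flatnonzero(y == 1)    # minority class (+1)
    maj_idx = np.flatnonzero(y == -1)   # majority class (-1)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

def random_oversample(X, y, rng):
    """Duplicate minority-class examples until both classes are equally sized."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == -1)
    extra = rng.choice(min_idx, size=len(maj_idx), replace=True)
    idx = np.concatenate([maj_idx, extra])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # 100 examples, d = 5
y = np.where(np.arange(100) < 10, 1, -1)     # 10 minority, 90 majority

Xu, yu = random_undersample(X, y, rng)
Xo, yo = random_oversample(X, y, rng)
print(len(yu), int((yu == 1).sum()))         # 20 10
print(len(yo), int((yo == 1).sum()))         # 180 90
```

Both functions return a dataset with a 1:1 class ratio; under-sampling discards information from the majority class, while over-sampling risks overfitting to repeated minority examples, which motivates the more sophisticated variants discussed next.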
Sophisticated methods can be employed to balance the class distribution, such as adding synthetic minority-class examples generated from the interpolation of neighboring minority-class examples [43]; discarding the non-representative majority-class examples to balance the class distribution [44]; combining different sampling methods for further improvement [47]; and using the ensemble technique for exploratory under-sampling to avoid the removal of useful majority-class examples [48].

Class-imbalance learning methods have seldom been used in software defect detection. Recently, Pelayo and Dick [9] studied the effectiveness of SMOTE [43] on software defect detection, and found that balancing the skewed class distribution is beneficial to software defect detection.

3 Proposed Approach

Let L = {(x1, y1), (x2, y2), ..., (xm0, ym0)} denote the set of labeled examples and let U = {xm0+1, xm0+2, ..., xN} denote the set of unlabeled examples, where xi is a d-dimensional feature vector and yi ∈ {−1, +1} is the class label. Conventionally, +1 denotes the minority class (e.g., "defective" in software defect detection). Hereinafter, we refer to class +1 as the minority class and −1 as the majority class. Both L and U are independently drawn from the same unknown distribution D whose marginal distributions satisfy PD(yi = +1) ≪ PD(yi = −1), and hence, L and U are imbalanced datasets in essence.

As mentioned in Section 1, directly applying semi-supervised learning to imbalanced data would be risky. Since L is imbalanced and usually small, very few examples of the minority class would be used to initiate the semi-supervised learning process. The resulting model may have poor sensitivity to the minority class and hence can hardly identify the examples of the minority class from the unlabeled set.
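The scarcity of labeled minority examples described above can be made concrete with a small simulation. All figures here (dataset size, a 5% minority prior, |L| = 20) are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, p_min = 1000, 8, 0.05              # assumed: 1000 examples, 5% minority prior
X = rng.normal(size=(N, d))
y = np.where(rng.random(N) < p_min, 1, -1)

m0 = 20                                  # a small labeled set L; the rest form U
idx = rng.permutation(N)
L_idx, U_idx = idx[:m0], idx[m0:]

# With a 5% prior and only 20 labeled examples, L is expected to contain
# about one minority example -- barely enough to initiate self-labeling.
print("minority examples in L:", int((y[L_idx] == 1).sum()))
print("|U| =", len(U_idx))
```

Under these assumptions an iterative self-labeling process starts from roughly one positive example, which is why a learner with weak generalization can easily collapse to predicting only the majority class.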
In this case, the learner would have to use little information from the minority class and overwhelming information from the majority class for model refinement, which leads to even poorer sensitivity to the minority class. As the iterative semi-supervised learning proceeds, the learned model would become biased towards predicting every example as the majority class.

In order to successfully conduct iterative semi-supervised learning on imbalanced data, the learner should have the following two properties. First, the learner should have strong generalization ability, such that even if provided with a small labeled training set with an imbalanced class distribution, the learner would not have zero sensitivity to the minority-class examples during the automatic labeling process; second, the influence of the overwhelming number of newly labeled majority-class examples should be further reduced in order to improve the sensitivity of the learner to the minority examples after its refinement in each learning iteration. Based on these two considerations, we propose the Rocus method to exploit the imbalanced unlabeled examples.

To meet the first requirement, we train multiple classifiers and then combine them for prediction. The reason behind this specific choice of the ensemble learning paradigm is that an ensemble of classifiers can usually achieve better generalization performance than a single classifier. Such superiority is more obvious when the training set is small [37] and the class distribution is imbalanced [48]. Thus, by exploiting this generalization power, the ensemble trained from L is able to identify some minority-class examples from U effectively. Since multiple classifiers are used, we employ the disagreement-based semi-supervised learning paradigm [10] to exploit the unlabeled examples in U.
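A minimal sketch of the "train multiple classifiers and combine them by voting" idea follows. The base learner (a nearest-centroid classifier trained on bootstrap samples) is an assumption chosen for brevity; Rocus itself specifies its own base learners and refinement loop, which this sketch does not reproduce:

```python
import numpy as np

class CentroidClassifier:
    """A deliberately weak base learner: predicts the class of the nearer centroid."""
    def fit(self, X, y):
        self.c_pos = X[y == 1].mean(axis=0)
        self.c_neg = X[y == -1].mean(axis=0)
        return self

    def predict(self, X):
        d_pos = np.linalg.norm(X - self.c_pos, axis=1)
        d_neg = np.linalg.norm(X - self.c_neg, axis=1)
        return np.where(d_pos < d_neg, 1, -1)

def train_ensemble(X, y, C=5, rng=None):
    """Train C base learners on bootstrap samples of the labeled set L."""
    if rng is None:
        rng = np.random.default_rng(0)
    ensemble = []
    for _ in range(C):
        idx = rng.integers(len(X), size=len(X))      # bootstrap sample
        while len(np.unique(y[idx])) < 2:            # ensure both classes appear
            idx = rng.integers(len(X), size=len(X))
        ensemble.append(CentroidClassifier().fit(X[idx], y[idx]))
    return ensemble

def vote(ensemble, X):
    """Combine the individual predictions by majority voting."""
    votes = np.sum([h.predict(X) for h in ensemble], axis=0)
    return np.where(votes >= 0, 1, -1)

rng = np.random.default_rng(0)
X_pos = rng.normal(2.0, 0.5, size=(10, 2))           # minority class (+1)
X_neg = rng.normal(0.0, 0.5, size=(50, 2))           # majority class (-1)
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(10, int), -np.ones(50, int)])

ensemble = train_ensemble(X, y, C=5)
print(vote(ensemble, np.array([[2.0, 2.0], [0.0, 0.0]])))
```

With C odd, the vote is never tied; the same voting machinery is what the disagreement-based paradigm reuses when one subset of classifiers labels examples for another.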
In detail, after the initial ensemble of classifiers {h1, h2, ..., hC} is constructed, some individual classifiers select some examples in U to label according to a disagreement level, and then teach the other classifiers with the newly labeled examples. Here, similar to [37], we adopt a simple case where the classifiers H−i = {h1, ..., hi−1, hi+1, ..., hC} are responsible for selecting confidently labeled unlabeled examples in U for an individual classifier hi. Given an unlabeled example, we first label this example using the majority voting of the C − 1 individual classifiers, and then estimate