and position, but for clarity in Fig. 6 detections are only spread out in position. A threshold is applied to these values, and the centroids (in both position and scale) of all above-threshold regions are computed. All detections contributing to a centroid are collapsed down to a single point. Each centroid is then examined in order, starting from the ones which had the highest number of detections within the specified neighborhood. If any other centroid locations represent a face overlapping with the current centroid, they are removed from the output pyramid. All remaining centroid locations constitute the final detection result. In the face detection work described in [3], similar observations about the nature of the outputs were made, resulting in the development of heuristics similar to those described above.

Arbitration among Multiple Networks

To further reduce the number of false positives, we can apply multiple networks, and arbitrate between their outputs to produce the final decision. Each network is trained in a similar manner, but with random initial weights, random initial nonface images, and permutations of the order of presentation of the scenery images. As will be seen in the next section, the detection and false positive rates of the individual networks will be quite close. However, because of different training conditions and because of self-selection of negative training examples, the networks will have different biases and will make different errors.

The implementation of arbitration is illustrated in Fig. 7. Each detection at a particular position and scale is recorded in an image pyramid, as was done with the previous heuristics. One way to combine two such pyramids is by ANDing them. This strategy signals a detection only if both networks detect a face at precisely the same scale and position. Due to the different biases of the individual networks, they will rarely agree on a false detection of a face.
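As a minimal sketch (not the authors' implementation), the thresholding/overlap-elimination heuristics and the simple arbitration strategies might look like the following. Assumptions: detections are stored as (x, y, scale) tuples rather than in image pyramids, a fixed 20x20 face size, and a crude box-overlap test; the `radius` and `threshold` parameters are illustrative, not the paper's values.

```python
def count_neighbors(detections, center, radius=2, scale_radius=1):
    """Number of detections near `center` in both position and scale."""
    cx, cy, cs = center
    return sum(1 for (x, y, s) in detections
               if abs(x - cx) <= radius
               and abs(y - cy) <= radius
               and abs(s - cs) <= scale_radius)

def overlaps(a, b, face_size=20):
    """Crude test of whether two face centroids represent overlapping faces."""
    (ax, ay, ascale), (bx, by, bscale) = a, b
    r = face_size * max(ascale, bscale) / 2
    return abs(ax - bx) < r and abs(ay - by) < r

def collapse_detections(detections, threshold=2, face_size=20):
    """Thresholding and overlap elimination: keep well-supported centroids,
    examined in decreasing order of support, dropping any centroid that
    overlaps one already accepted."""
    scored = sorted(((count_neighbors(detections, d), d) for d in detections),
                    key=lambda nd: -nd[0])
    accepted = []
    for n, d in scored:
        if n >= threshold and not any(overlaps(d, a, face_size)
                                      for a in accepted):
            accepted.append(d)
    return accepted

# Simple arbitration strategies over per-network detection sets.
def and_pyramids(p1, p2):
    return p1 & p2   # signal only where both networks detect a face

def or_pyramids(p1, p2):
    return p1 | p2   # keep a detection reported by either network

def vote(pyramids, min_votes=2):
    counts = {}
    for p in pyramids:
        for d in p:
            counts[d] = counts.get(d, 0) + 1
    return {d for d, n in counts.items() if n >= min_votes}
```

As in the text, these arbitration functions could be applied either to raw detection sets or to the outputs of `collapse_detections`; in the latter case the exact-match set operations would be relaxed to a neighborhood test.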
This allows ANDing to eliminate most false detections. Unfortunately, this heuristic can decrease the detection rate because a face detected by only one network will be thrown out. However, we will see later that individual networks can all detect roughly the same set of faces, so that the number of faces lost due to ANDing is small. Similar heuristics, such as ORing the outputs of two networks, or voting among three networks, were also tried. Each of these arbitration methods can be applied before or after the “thresholding” and “overlap elimination” heuristics. If applied afterwards, we combine the centroid locations rather than actual detection locations, and require them to be within some neighborhood of one another rather than precisely aligned.

Arbitration strategies such as ANDing, ORing, or voting seem intuitively reasonable, but perhaps there are some less obvious heuristics that could perform better. To test this hypothesis, we applied a separate neural network to arbitrate among multiple detection networks. For a location of interest, the arbitration network examines a small neighborhood surrounding that location in the output pyramid of each individual network. For each pyramid, we count the number of detections in a 3x3 pixel region at each of three scales around the location of interest, resulting in three numbers for each detector, which are fed to the arbitration network, as shown in Fig. 8. The arbitration network is trained to produce a positive output for a given set of inputs only if that location contains a face, and to produce a negative output for locations without a face. As will be seen in the next section, using an arbitration network in this fashion produced results comparable to (and in some cases, slightly better than) those produced by the heuristics presented earlier.

3 Experimental Results

A number of experiments were performed to evaluate the system.
We first show an analysis of which features the neural network is using to detect faces, then present the error rates of the system over two large test sets.

Sensitivity Analysis

In order to determine which part of its input image the network uses to decide whether the input is a face, we performed a sensitivity analysis using the method of [2]. We collected a positive test set based on the training database of face images, but with different randomized scales, translations, and rotations than were used for training. The negative test set was built from a set of negative examples collected during the training of other networks. Each of the 20x20 pixel input images was divided into 100 2x2 pixel subimages. For each subimage in turn, we went through the test set, replacing that subimage with random noise, and tested the neural network. The resulting root mean square error of the network on the test set is an indication of how important that portion of the image is for the detection task. Plots of the error rates for two networks we trained are shown in Fig. 9. Network 1 uses two sets of the hidden units illustrated in Fig. 1, while Network 2 uses three sets.

The networks rely most heavily on the eyes, then on the nose, and then on the mouth (Fig. 9). Anecdotally, we have seen this behavior on several real test images. In cases in which only one eye is visible, detection of a face is possible, though less reliable than when the entire face is visible. The system is less sensitive to the occlusion of the nose or mouth.

Testing

The system was tested on two large sets of images, which are distinct from the training sets. Test Set 1 consists of a total of 130 images collected at CMU, including images from the World Wide Web, scanned from photographs and newspaper pictures, and digitize