do. In principle, we should be able to reduce that idle time by increasing the load on the node. In our more extensive experiments, we found that we could keep the idle time down to between 10 and 15%. Finally, we observe that the cycles wasted on the I-cache or on branch mispredictions are relatively small (a few percent), stalls due to the fixed-point units account for 10% of the cycles, and stalls caused by the memory hierarchy (D-cache, reject, ERAT, and other LSU) represent approximately 20% of the cycles. The fraction of cycles wasted in the memory hierarchy is similar to the fraction of cycles doing useful work. That means that a perfect memory system could at most double the performance of the existing processors: even if every memory-stall cycle were instead spent doing useful work, the number of useful cycles per unit time would no more than double. A similar benefit could be obtained by doubling the number of processors while maintaining the memory hierarchy per processor.

Experiments in a scale-up system

To run query on the p5 575, we first configured Nutch/Lucene as shown in Figure 8. We ran a single frontend and a single backend on the machine. The actual data to be searched was stored in the external DS4100 storage controller, connected to the p5 575 through Fibre Channel. The driver was running on a separate machine, and we measured the throughput at the driver.

Figure 8: Query on the p5 575: pure scale-up.

Throughput results for the configuration in Figure 8, for different data set sizes, are shown in Figure 9. We plot both the absolute queries per second (blue line, left y-axis) and the more meaningful metric of queries per second times the data set size (magenta line, right y-axis). We observe that the latter metric peaks at about 1100 queries/second*GB, for a data set size of 250 GB. We found these results disappointing, since we had measured the peak value for a JS21 blade (with only four processors) at approximately 900 queries/second*GB.

Figure 9: Throughput as a function of data set size for query on the p5 575: pure scale-up.

To investigate ways to improve the performance of query on an SMP, we experimented with the configuration shown in Figure 10. In that configuration we ran multiple backends inside the SMP. Each backend is responsible for 10 GB of data, so a larger data set uses more backends. We call this configuration scale-out-in-a-box.

Figure 10: Running query on the POWER5 p5 575: scale-out-in-a-box configuration.

Throughput results for the configuration in Figure 10, for different data set sizes, are shown in Figure 11. We show the results for pure scale-up and scale-out-in-a-box in the same figure so that we can better compare them. We see a much improved behavior with this configuration. The throughput * data set size metric peaks at approximately 4000 queries/second*GB for a data set size of 250 GB.

Figure 11: Throughput as a function of data set size for query on the p5 575: pure scale-up and scale-out-in-a-box.

Scale-out experiments

We start by reporting results from the experiments using the configuration shown in Figure 12.
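Both the scale-out-in-a-box and the scale-out configurations rely on the same partition/fan-out pattern: the frontend broadcasts each query to every backend, each backend searches only its own data segment, and the frontend merges the partial results into a single ranked list. The sketch below illustrates that pattern in plain Java as a minimal, hypothetical simplification; the Backend interface, the Hit record, and the top-k merge shown here are assumptions for illustration, not the actual Nutch/Lucene classes used in these experiments.

import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of the frontend's fan-out/merge step: one search task
// per backend (each backend owns one data segment), with the partial results
// merged by score into a single global top-k list.
public class FrontendSketch {
    // Simplified stand-ins for the real Nutch/Lucene search interfaces.
    interface Backend { List<Hit> search(String query, int k) throws Exception; }
    record Hit(String docId, float score) {}

    private final List<Backend> backends;
    private final ExecutorService pool;

    FrontendSketch(List<Backend> backends) {
        this.backends = backends;
        this.pool = Executors.newFixedThreadPool(backends.size());
    }

    List<Hit> search(String query, int k) throws Exception {
        // Broadcast the query: every backend sees every query.
        List<Future<List<Hit>>> partials = new ArrayList<>();
        for (Backend b : backends) {
            partials.add(pool.submit(() -> b.search(query, k)));
        }
        // Merge the per-segment results, keeping only the k highest scores.
        PriorityQueue<Hit> best =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
        for (Future<List<Hit>> f : partials) {
            for (Hit h : f.get()) {
                best.offer(h);
                if (best.size() > k) best.poll();   // drop the lowest-scoring hit
            }
        }
        List<Hit> merged = new ArrayList<>(best);
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return merged;
    }
}

In the experiments each backend owns a fixed-size segment (roughly 10 GB of data in the scale-out-in-a-box runs), so growing the data set means adding backends rather than growing any individual index.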
In this particular implementation of the architecture shown in Figure 3, there is one frontend running on a JS21 blade and a variable number of backends, each on its own JS21 blade. The data segment for each backend is stored in an ext3 file system on the local disk of each blade.

Figure 12: Configuration with each data segment in an ext3 file system on the local disk of each JS21 backend blade.

Throughput measurements (queries per second) as a function of the number of backends are shown in Figure 13 for three different data segment sizes (per backend): 10, 20, and 40 GB/backend. The total data set size, therefore, varies from 10 GB (one backend with 10 GB) to 480 GB (12 backends with 40 GB each). Figure 14 is a plot of the average CPU utilization in the backends as a function of the number of backends. This latter plot shows that the CPUs are well utilized in this workload. (100% utilization corresponds to all 4 cores in the JS21 running all the time.)

Figure 13: Total queries per second as a function of the number of backends. Data sets on local disk and ext3 file system.

Figure 14: Average processor utilization in the backends as a function of the number of backends. Data sets on local disk and ext3 file system.

We observe in Figure 13 that the throughput increases with the number of backends. At first, this is a surprising result, since as we increase the number of backends, each query is sent to all the backends. We would expect a flat, or maybe even declining, throughput, as the frontend has to do more work. We can explain the observed behavior as follows. Each query operation has two main phases: the search for the indexing terms in the backends (includin