interfaces. It also has its own dedicated DS4100 storage controller. (See below for a description of the DS4100.)

Scale-out systems come in many different shapes and forms, but they generally consist of multiple interconnected nodes, each running a self-contained operating system. We chose BladeCenter as our platform for scale-out, a natural choice given the scale-out orientation of this platform. The first form of scale-out system to become popular in commercial computing was the rack-mounted cluster. The IBM BladeCenter [3], [5] solution (and similar systems from companies such as HP and Dell) represents the next step after rack-mounted clusters in scale-out systems for commercial computing. The blade servers [6] used in BladeCenter are similar in capability to the densest rack-mounted cluster servers: 4-processor configurations, 16-32 GiB of maximum memory, built-in Ethernet, and expansion cards for Fiber Channel, InfiniBand, Myrinet, or 10-Gbit/s Ethernet. Double-wide blades with up to 8-processor configurations and additional memory are also offered.

Figure 2 is a high-level view of our cluster architecture. The basic building block of the cluster is a BladeCenter-H (BC-H) chassis. We couple each BC-H chassis with one DS4100 storage controller through a 2-Gbit/s Fiber Channel link. The chassis themselves are interconnected through two nearest-neighbor networks: a 4-Gbit/s Fiber Channel network and a 1-Gbit/s Ethernet network. The cluster consists of eight chassis of blades (112 blades in total) and eight DS4100 storage subsystems.

The BladeCenter-H chassis is the newest BladeCenter chassis from IBM. As with the previous BladeCenter-1 chassis, it has 14 blade slots for blade servers. It also has space for up to two (2) management modules, four (4) switch modules, four (4) bridge modules, and four (4) high-speed switch modules. (Switch modules 3 and 4 and bridge modules 3 and 4 share the same slots in the chassis.) We have populated each of our chassis with two 1-Gbit/s Ethernet switch modules and two Fiber Channel switch modules.

Three different kinds of blades were used in our cluster: JS21 (PowerPC processors), HS21 (Intel Woodcrest processors), and LS21 (AMD Opteron processors). Each blade (JS21, HS21, or LS21) has both a local disk drive (73 GB of capacity) and a dual Fiber Channel network adapter. The Fiber Channel adapter connects the blades to the two Fiber Channel switches plugged into each chassis. Approximately half of the cluster (4 chassis) is composed of JS21 blades. These are quad-processor (dual-socket, dual-core) PowerPC 970 blades, running at GHz. Each blade has 8 GiB of memory. For the experiments reported in this paper, we focus on these JS21 blades.

The DS4100 storage subsystem consists of dual storage controllers, each with a 2-Gbit/s Fiber Channel interface, and space for 14 SATA drives in the main drawer. Although each DS4100 is paired with a specific BladeCenter-H chassis, any blade in the cluster can see any of the LUNs in the storage system, thanks to the Fiber Channel network we implement.

3. The Nutch/Lucene workload

Nutch/Lucene [4] is a framework for implementing search applications. It is representative of a growing class of applications that are based on search of unstructured data (web pages). We are all used to search engines such as Google and Yahoo that operate on the open Internet. However, search is also an important operation within intranets, the internal networks of companies. Nutch/Lucene is implemented entirely in Java and its code is open source.
Nutch/Lucene, as a typical search framework, has three major components: (1) crawling, (2) indexing, and (3) query. In this paper, we present our results for the query component. For completeness, we briefly describe the other components.

Crawling is the operation that navigates and retrieves the information in web pages, populating the set of documents that will be searched. This set of documents is called the corpus, in search terminology. Crawling can be performed on internal networks (intranets) as well as external networks (the Internet). Crawling, particularly on the Internet, is a complex operation. Either intentionally or unintentionally, many web sites are difficult to crawl. The performance of crawling is usually limited by the bandwidth of the network between the system doing the crawling and the system being crawled.

The Nutch/Lucene search framework includes a parallel indexing operation written using the MapReduce programming model [2]. MapReduce provides a convenient way of addressing an important (though limited) class of real-life commercial applications by hiding parallelism and fault-tolerance issues from the programmers, letting them focus on the problem domain. MapReduce was published by Google in 2004 and quickly became a de facto standard for this kind of workload.

Figure 2: Hardware architecture of our BladeCenter cluster.

A parallel indexing operation in the MapReduce model works as follows. First, the data to be indexed is partitioned into segments of approximately equal size. Each segment is then processed by a mapper task that generates the (key, value) pairs for that segment, where key is an indexing term and value is the set of documents that contain that term (and the location of the term in the document). This corresponds to the map phase in MapReduce. In the next phase, the reduce phase, the (key, value) pairs produced by the different mappers are collected and merged by term to build the final index.
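To make the map and reduce phases concrete, the sketch below is a minimal, framework-independent Java illustration of the indexing flow just described: a mapper emits (term, posting) pairs for the documents in its segment, and a grouping step merges them into term-to-document lists. It is not Nutch's actual indexing code; all class and method names (IndexMapperSketch, Posting, and so on) are hypothetical.

import java.util.*;

/**
 * Minimal, framework-independent sketch of the parallel indexing flow
 * described above. Illustrative only: Nutch's real indexer runs as mapper
 * and reducer tasks inside a MapReduce framework and uses Lucene to build
 * the index. All names here are hypothetical.
 */
public class IndexMapperSketch {

    /** A single (term -> (docId, position)) pair emitted by the map phase. */
    record Posting(String term, String docId, int position) {}

    /**
     * Map phase for one segment (docId -> document text): emit one
     * (term, (docId, position)) pair per token in each document.
     */
    static List<Posting> map(Map<String, String> segment) {
        List<Posting> emitted = new ArrayList<>();
        for (Map.Entry<String, String> doc : segment.entrySet()) {
            String[] tokens = doc.getValue().toLowerCase().split("\\W+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (!tokens[pos].isEmpty()) {
                    emitted.add(new Posting(tokens[pos], doc.getKey(), pos));
                }
            }
        }
        return emitted;
    }

    /**
     * Reduce phase (simplified): group the emitted pairs by term, so each
     * term maps to the documents (and positions) that contain it.
     */
    static Map<String, List<Posting>> reduce(List<Posting> pairs) {
        Map<String, List<Posting>> index = new TreeMap<>();
        for (Posting p : pairs) {
            index.computeIfAbsent(p.term(), t -> new ArrayList<>()).add(p);
        }
        return index;
    }

    public static void main(String[] args) {
        // One "segment" of the corpus: docId -> document text.
        Map<String, String> segment = Map.of(
                "doc1", "scale-out systems come in many shapes",
                "doc2", "search of unstructured data");
        Map<String, List<Posting>> index = reduce(map(segment));
        index.forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}

In the actual framework, many mapper tasks would run in parallel across the blades and reducer tasks would merge the pairs arriving from all of them (typically into Lucene index segments); the sketch only mirrors the data flow of the (key, value) pairs described above.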