y need to read
• Master aggregates and orchestrates sorting of needed chunks
  – Assigns log chunks to be sorted to different tablet servers
  – Servers sort chunks by tablet, write sorted data to local disk
• Other tablet servers ask master which servers have sorted chunks they need
• Tablet servers issue direct RPCs to peer tablet servers to read sorted data for their tablets

Compression
• Many opportunities for compression
  – Similar values in the same row/column at different timestamps
  – Similar values in different columns
  – Similar values across adjacent rows
• Within each SSTable for a locality group, encode compressed blocks
  – Keep blocks small for random access (~64KB compressed data)
  – Exploit fact that many values are very similar
  – Needs to be low CPU cost for encoding/decoding
• Two building blocks: BMDiff, Zippy

BMDiff
• Bentley, McIlroy DCC'99: "Data Compression Using Long Common Strings"
• Input: dictionary * source
• Output: sequence of
  – COPY: x bytes from offset y
  – LITERAL: literal text
• Store hash at every 32-byte aligned boundary in
  – Dictionary
  – Source processed so far
• For every new source byte
  – Compute incremental hash of last 32 bytes
  – Lookup in hash table
  – On hit, expand match forwards & …

  – … efficient parallelization/distribution
  – Fault-tolerance, I/O scheduling, status/monitoring
  – User writes Map and Reduce functions
• Heavily used: ~3000 jobs, 1000s of machine days each day
See: "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04
BigTable can be input and/or output for MapReduce computations

Typical Cluster
[Diagram: cluster-wide services (Cluster Scheduling Master, Lock Service, GFS Master) and Machines 1 to 3. Each machine runs a Scheduler Slave and a GFS Chunkserver on Linux, plus a User Task (Machines 1 and 2) or a Single Task (Machine 3). Further labels: BigTable Server (twice), BigTable Master.]

BigTable Overview
• Data Model
• Implementation Structure
  – Tablets, compactions, locality groups, …
• API
• Details
  – Shared logs, compression, replication, …
• Current/Future Work

Basic Data Model
• Distributed multi-dimensional sparse map
  (row, column, timestamp) → cell contents
• Good match for most of our applications
[Figure: a table with ROWS and COLUMNS axes and a TIMESTAMPS dimension; a cell in the "contents" column holds "html…" values at timestamps t1, t2, t3.]

Rows
• Name is an arbitrary string
  – Access to data in a row is atomic
  – Row creation is implicit upon storing data
• Rows ordered lexicographically
  – Rows close together lexicographicall