PNUTS / SHERPA: To Help You Scale Your Mountains of Data

Yahoo!'s Serving Storage Problem
- Small records: 100KB or less
- Structured records: lots of fields, and the set of fields evolves
- Extreme data scale: tens of TB
- Extreme request scale: tens of thousands of requests/sec
- Low latency globally: 20+ datacenters worldwide
- High availability: outages cost millions of dollars
- Variable usage patterns: applications and users change over time

[Figure: an example table of records (key, payload, home region), e.g. A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E, replicated across regions]

What Is PNUTS/Sherpa?
- Parallel database
- Geographic replication
- Structured, flexible schema, e.g. CREATE TABLE Parts (ID VARCHAR, StockNumber INT, Status VARCHAR, …)
- Hosted, managed infrastructure

What Will It Become?
- Indexes and views

Design Goals
- Scalability: thousands of machines; easy to add capacity; restrict the query language to avoid costly queries
- Geographic replication: asynchronous replication around the globe; low-latency local access
- High availability and fault tolerance: automatically recover from failures; serve reads and writes despite failures
- Consistency: per-record guarantees; timeline model; option to relax consistency if needed
- Multiple access paths: hash table and ordered table; primary and secondary access
- Hosted service: applications plug and play; operational cost is shared

Technology Elements
- PNUTS: query planning and execution; index maintenance; PNUTS API and Tabular API for applications
- Distributed infrastructure for tabular data: data partitioning; update consistency; replication
- YDHT FS: hash tables
- YDOT FS: ordered tables
- Tribble: pub/sub messaging
- Zookeeper: consistency service
- YCA: authorization

Data Manipulation
- Per-record operations: get, set, delete
- Multi-record operations: multiget, scan, getrange

Tablets — Hash Table (see the routing sketch below)
[Figure: records (Name, Description, Price) for fruits — Apple, Avocado, Banana, Grape, Kiwi, Lemon, Lime, Orange, Strawberry, Tomato — placed in a hash space running from 0x0000 to 0xFFFF, with tablet boundaries at values such as 0x2AF3 and 0x911F]

Tablets — Ordered Table
[Figure: the same records stored in tablets ordered by name from A to Z, with tablet boundaries at keys such as H and Q]
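To make the hash-tablet layout concrete, here is a minimal sketch (not PNUTS code) of how a router might map a key into the 0x0000–0xFFFF space from the figure and pick the owning tablet. The boundary values, storage-unit names, and hash choice are hypothetical, loosely based on the figure's example boundaries.

```python
import bisect
import hashlib

# Hypothetical tablet boundaries in the 16-bit hash space from the figure:
# tablet i owns hashes in [BOUNDARIES[i], BOUNDARIES[i+1]).
BOUNDARIES = [0x0000, 0x2AF3, 0x911F, 0x10000]
TABLET_TO_SU = {0: "su-1", 1: "su-2", 2: "su-3"}  # tablet index -> storage unit

def key_hash(key: str) -> int:
    """Hash a record key into the 0x0000-0xFFFF space."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:2], "big")

def route(key: str) -> str:
    """Return the storage unit whose tablet covers this key's hash."""
    h = key_hash(key)
    tablet = bisect.bisect_right(BOUNDARIES, h) - 1
    return TABLET_TO_SU[tablet]

for name in ["Apple", "Lemon", "Kiwi"]:
    print(name, hex(key_hash(name)), "->", route(name))
```

An ordered table works the same way, except the interval map is keyed by the record's primary key rather than its hash, which is what makes range retrieval efficient in YDOT.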
Flexible Schema

Posted date | Listing id | Item  | Price
------------|------------|-------|------
6/1/07      | 424252     | Couch | $570
6/1/07      | 763245     | Bike  | $86
6/3/07      | 211242     | Car   | $1123
6/5/07      | 421133     | Lamp  | $15

Individual records can also carry extra fields, e.g. Color (Red) or Condition (Good, Fair); records in the same table need not all have the same columns.

Detailed Architecture
- Clients issue requests through a REST API to routers
- The local region consists of routers, a tablet controller, and storage units
- Tribble propagates updates to remote regions

Tablet Splitting and Balancing
- Each storage unit holds many tablets (horizontal partitions of the table)
- Tablets may grow over time; overfull tablets split
- A storage unit may become a hotspot; load is shed by moving tablets to other storage units

QUERY PROCESSING

Accessing Data
1. Client sends "get key k" to a router
2. The router forwards the request to the storage unit holding k's tablet
3. The storage unit returns the record for key k
4. The router returns the record to the client

Bulk Read (a scatter/gather sketch appears at the end of this section)
1. Client sends a set of keys {k1, k2, …, kn} to a scatter/gather server
2. The scatter/gather server splits the set by tablet, issues the gets (Get k1, Get k2, Get k3, …) to storage units 1, 2, 3, … in parallel, and assembles the replies

Range Queries in YDOT
- Clustered, ordered retrieval of records

MapReduce: Reduce Example
The surviving fragment here is the summing loop of the word-count reduce function, which adds up the per-word counts emitted by map:

```
result = 0
for each v in intermediate_values:
    result += ParseInt(v)
```

GFS: Atomic Record Append
- GFS chooses the offset (not the client), returns it, and appends the data to each replica at least once
- Heavily used by Google's distributed applications: no need for a distributed lock manager

Atomic Record Append: How? (see the client-retry sketch at the end of this section)
- Follows a control flow similar to ordinary mutations
- The primary tells the secondary replicas to append at the same offset as the primary
- If the append fails at any replica, the client retries it, so replicas of the same chunk may contain different data, including duplicates, whole or in part, of the same record
- GFS does not guarantee that all replicas are bitwise identical; it only guarantees that the data is written at least once as an atomic unit
- Data must be written at the same offset on all chunk replicas for success to be reported

Detecting Stale Replicas
- The master keeps a chunk version number to distinguish up-to-date replicas from stale ones
- The version number is increased whenever the master grants a lease on the chunk
- If a replica is unavailable, its version number is not increased
- The master detects stale replicas when a chunkserver reports its set of chunks and their version numbers
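The Bulk Read steps above describe a scatter/gather server. Here is a minimal sketch of that pattern, assuming a route(key) function like the one in the earlier routing sketch and a hypothetical su_get(su, keys) RPC that returns a {key: record} dict from one storage unit.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def multiget(keys, route, su_get):
    """Scatter/gather bulk read: group keys by owning storage unit,
    fetch each group in parallel, and merge the replies."""
    by_su = defaultdict(list)
    for k in keys:
        by_su[route(k)].append(k)              # scatter: one batch per unit
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(su_get, su, ks) for su, ks in by_su.items()]
        for f in futures:
            results.update(f.result())         # gather: merge {key: record}
    return results
```

Batching per storage unit is the point of the scatter/gather server: n keys cost one round trip per unit touched, not one per key.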
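The client-side retry in "Atomic Record Append: How?" is what makes replicas diverge. A sketch of that behavior, assuming a hypothetical gfs_append(chunk, record) call that returns the chosen offset on success and raises when any replica fails; tagging records with a unique id is one way readers can de-duplicate under these at-least-once semantics.

```python
import uuid

class ReplicaAppendError(Exception):
    """Raised by the (hypothetical) gfs_append call when a replica fails."""

def record_append(gfs_append, chunk, payload, max_retries=5):
    """Retry until an append succeeds at every replica. Each failed attempt
    may leave a partial or duplicate copy of the record in some replicas,
    so the record is tagged with a unique id readers can de-duplicate on."""
    record = uuid.uuid4().bytes + payload      # unique id + application data
    for _ in range(max_retries):
        try:
            return gfs_append(chunk, record)   # GFS picks and returns the offset
        except ReplicaAppendError:
            continue                           # retries are what create duplicates
    raise RuntimeError("record append failed after retries")
```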
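Finally, a sketch of the version-number bookkeeping behind stale-replica detection; the Master class and its fields are hypothetical stand-ins, not GFS code.

```python
class Master:
    """Version-number bookkeeping for stale-replica detection (sketch)."""

    def __init__(self):
        self.version = {}                      # chunk id -> current version

    def grant_lease(self, chunk):
        # The version is bumped whenever a lease is granted; a replica that
        # is unavailable at that moment keeps its old version number.
        self.version[chunk] = self.version.get(chunk, 0) + 1
        return self.version[chunk]

    def report_chunks(self, chunk_versions):
        # Called when a chunkserver reports its chunks and their versions;
        # any replica whose version lags the master's is stale.
        return [c for c, v in chunk_versions.items()
                if v < self.version.get(c, 0)]
```

After two lease grants on a chunk, a chunkserver that was down for the second grant still reports version 1 while the master records 2, so its replica is flagged as stale and can be garbage-collected.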