Engineers run 1TB or 10TB sorts as regression tests on a regular basis, because obscure bugs tend to be more visible at a large scale. However, the real fun begins when we increase the scale even further. In this post I'll talk about our experience with some petabyte-scale sorting experiments we did a few years ago, including what we believe to be the largest MapReduce job ever: a 50PB sort.

These days, GraySort is the large-scale sorting benchmark of choice. In GraySort, you must sort at least 100TB of data (as 100-byte records with the first 10 bytes being the key), lexicographically, as fast as possible. The site sortbenchmark.org tracks official winners for this benchmark; we never entered the official competition.

MapReduce happens to be a good fit for this problem, because the way it implements reduce is by sorting the keys. Combined with the appropriate (lexicographic) partitioning function, the output of a MapReduce job is a sequence of files containing the final sorted data.
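To make that last point concrete, here is a minimal sketch of lexicographic range partitioning over GraySort-style records. This is an illustrative toy, not the MapReduce implementation described in this post: the record layout matches the benchmark, but the function names and the uniform-key scaling trick are assumptions of the sketch.

```python
# Minimal sketch: lexicographic range partitioning for GraySort-style data.
# Records are 100 bytes; the first 10 bytes are the sort key. Names and the
# uniform-key assumption are illustrative, not Google's code.

RECORD_SIZE = 100
KEY_SIZE = 10

def partition(key: bytes, num_shards: int) -> int:
    """Map a key to a shard so all keys in shard i sort before shard i+1.

    Interprets the first 8 key bytes as a big-endian integer and scales
    it into [0, num_shards); even shard sizes assume uniform random keys.
    """
    prefix = int.from_bytes(key[:8], "big")
    return (prefix * num_shards) >> 64

def iter_records(blob: bytes):
    """Yield (key, record) pairs from a buffer of fixed-size records."""
    for off in range(0, len(blob), RECORD_SIZE):
        record = blob[off:off + RECORD_SIZE]
        yield record[:KEY_SIZE], record

def shard_and_sort(blob: bytes, num_shards: int) -> list[list[bytes]]:
    """Route records to shards by key, then sort each shard independently.

    Concatenating shards 0..n-1 afterwards yields globally sorted output,
    which is how a MapReduce's output files end up forming the sorted data.
    """
    shards: list[list[bytes]] = [[] for _ in range(num_shards)]
    for key, record in iter_records(blob):
        shards[partition(key, num_shards)].append(record)
    for shard in shards:
        shard.sort()  # records sort by their 10-byte key prefix
    return shards
```

A production sorter would sample the input to choose shard boundaries rather than assume uniformly distributed keys, but the invariant is the same: every key in shard i sorts before every key in shard i+1.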
Once in a while, when a new cluster came up, we in the MapReduce team got the opportunity to play with it before the real workload moved in. This was our chance to "burn in" the cluster, stretch the limits of the hardware, destroy some hard drives, play with some really expensive equipment, learn about system performance, and win the (unofficial) sorting benchmark.

2007 (1PB)

At that time, we were mostly happy that the job finished the sort at all, although we had some doubts about the result (we never verified its correctness). We suspected GFS (the Google File System), which we used to store the input and output. Unfortunately, the file format used for the benchmark doesn't have any embedded checksums for MapReduce to use (the typical MapReduce job at Google runs on file formats that do have embedded checksums).

2008 (1PB)

2008 was the first time we focused on tuning, and we blogged about the result here. The bottleneck ended up being writing the three-way replicated output to GFS, which was the standard we used at Google at the time; anything less would have created a high risk of data loss.

2010 (1PB)

For this test we used a new version of the GraySort benchmark with incompressible data. In the previous years, while we were reading and writing 1PB to GFS, the amount of data actually shuffled was only about 300TB, because the data in those years was in ASCII format, which compresses well. This was also the year we moved to Colossus, the successor to GFS, and we no longer had the corruption issues we had encountered with GFS. For the first time, we also validated that the output was correct. To reduce the impact of stragglers, we used a dynamic sharding technique called reduce subsharding, the precursor to the fully dynamic sharding later used in Dataflow.

2011 (1PB)

This year we enjoyed faster networking and started to pay more attention to per-machine efficiency, particularly in I/O. We made sure that all our disk I/O operations were performed in large 2MB blocks, versus blocks sometimes as small as 64kB. We used SSDs for part of the data. That got us the first Petasort in under an hour (33 minutes, to be exact), and we blogged about it here.
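To show what the block-size change means in practice, here is a rough sketch of sequential reads at the two block sizes mentioned above. The micro-benchmark itself, including the file name, is invented for illustration; the point is simply that fewer, larger requests amortize per-request overhead (seeks and syscalls) on spinning disks.

```python
# Illustrative sketch: sequential reads in large blocks amortize per-request
# overhead. Block sizes match the post; the file name and this benchmark
# are invented for illustration.
import time

SMALL = 64 * 1024          # 64kB blocks
LARGE = 2 * 1024 * 1024    # 2MB blocks

def read_all(path: str, block_size: int) -> float:
    """Read a file sequentially in fixed-size blocks; return elapsed seconds."""
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:   # unbuffered: real read() sizes
        while f.read(block_size):
            pass
    return time.monotonic() - start

if __name__ == "__main__":
    path = "testdata.bin"  # hypothetical large input file
    for size in (SMALL, LARGE):
        print(f"{size >> 10}kB blocks: {read_all(path, size):.2f}s")
```

On a warm page cache the gap shrinks, so a fair comparison needs a file larger than memory (or dropped caches); at petabyte scale, the input is always effectively cold.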