[Main Text]
A system that combines MapReduce with parallel database management systems.

Abstract: There is currently considerable interest in large-scale data analysis based on the MapReduce (MR) paradigm. Although the basic control flow of this framework has existed in parallel SQL database management systems for more than 20 years, some have called MR the newest computing model. In this paper, we describe and compare the two paradigms. In addition, we evaluate the performance and development complexity of both systems. Finally, we define a benchmark comprising a set of tasks, which we run on an open-source MR platform and on two parallel database management systems. For each task, we measure each system's parallel performance in various respects on a cluster of 100 machines. Our results reveal some interesting trade-offs. Although loading data into the parallel database management systems and tuning their execution took more time than with MR, the observed performance of these database management systems was significantly better. We speculate about the causes of the large performance difference and consider the strengths that future systems should draw from both architectures.

ABSTRACT: There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting tradeoffs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and