

Sampling for Big Data

[Overview] Common themes: Why reduce? Most big data is big! Many applications (telecoms, ISPs, search engines) can't keep everything. Why sample? Sampling is (usually) easy to understand. Graph sampling and other topics are all worthy of study, in other tutorials.

  

Cost Optimization for Sampling

Several different approaches optimize for different objectives:
1. Fixed sample size IPPS (inclusion probability proportional to size) sample
   – Variance-optimal (VarOpt) sampling: minimal-variance unbiased estimation
2. Structure-aware sampling
   – Improve estimation accuracy for subnet queries using a topological cost
3. Fair sampling
   – Adaptively balance the sampling budget over subpopulations of flows
   – Uniform estimation accuracy regardless of subpopulation size
4. Stable sampling
   – Increase stability of the sample set by imposing a cost on changes

IPPS Stream Reservoir Sampling

• Each arriving item:
  – Provisionally include the item in the reservoir
  – If there are now m+1 items, discard one item randomly:
    □ Calculate the threshold z that samples m items on average: z solves Σi pz(xi) = m
    □ Discard item i with probability qi = 1 − pz(xi)
    □ Adjust the m surviving xi with the Horvitz-Thompson estimate x'i = xi/pi = max{xi, z}
• Efficient implementation:
  – Computational cost O(log m) per item, amortized cost O(log log m)
  [Cohen, Duffield, Lund, Kaplan, Thorup; SODA 2009; SIAM J. Comput. 2011]
• Example (m = 9; reservoir holds x1, …, x9 and new item x10 arrives):
  – Recalculate the threshold z: Σi=1..10 min{1, xi/z} = 9
  – Recalculate the discard probabilities: qi = 1 − min{1, xi/z}
  – Adjust the surviving weights: x'i = max{xi, z}

Structure (Un)Aware Sampling

• Sampling is oblivious to structure in keys (e.g. the IP address hierarchy)
  – Estimation disperses the weight of discarded items onto surviving samples
• Queries are structure aware: subset sums over related keys (IP subnets)
  – Accuracy on the left-hand subtree is decreased by discarding weight on the right-hand subtree
[Figure: binary trie over the key space 000–111]

Localizing Weight Redistribution

• Initial weight set {xi : i ∈ S} for some S ⊂ Ω
  – e.g. Ω = the set of possible IP addresses, S = the observed IP addresses
• Attribute a "range cost" C({xi : i ∈ R}) to each weight subset R ⊂ S
  – Possible factors for the range cost:
    □ Sampling variance
    □ Topology, e.g. height of the lowest common ancestor
  – Heuristic: R* = nearest-neighbor pair {xi, xj} of minimal xi + xj
• Sample k items from S:
  – Progressively remove one item from the subset with minimal range cost:
  – While |S| > k:
    □ Find R* ⊂ S of minimal range cost
    □ Remove a weight from R* with VarOpt
  [Cohen, Cormode, Duffield; PVLDB 2011]
• No change outside the subtree below the closest ancestor
• Order-of-magnitude reduction in average subnet error vs. VarOpt

Fair Sampling Across Subpopulations

• Analysis queries often focus on specific subpopulations
  – e.g. networking: different customers, user applications, network paths
• Wide variation in subpopulation size
  – 5 orders of magnitude variation in traffic across the interfaces of an access router
• With uniform sampling across subpopulations:
  – Poor estimation accuracy on subset sums within small subpopulations
  – Interesting items occur in proportion to subpopulation size, so their proportion within small subpopulations is difficult to track
• Minimize relative variance by sharing the budget m over subpopulations
  – Total of n objects in subpopulations of sizes n1, …, nd with Σi ni = n
  – Allocate a budget mi to each subpopulation ni with Σi mi = m
  – Minimize the average population relative variance R = const · Σi 1/mi
• Theorem: R is minimized when {mi} are the max-min fair shares of m under the demands {ni}
• Streaming problem: the subpopulation sizes {ni} are not known in advance
• Solution: progressive fair sharing as a reservoir sample
  – Provisionally include each arrival
  – Discard one item as a VarOpt sample from a maximal subpopulation
• Theorem [Duffield; Sigmetrics 2012]:
  – Max-min fair at all times; equal in distribution to VarOpt samples {mi from ni}

Stable Sampling

• Setting: sampling a population over successive time periods
• Sampling independently at each time period incurs a cost from sample churn
  – e.g. time-series analysis over a set of relatively stable keys
• Find sampling probabilities through cost minimization
  – Minimize Cost = Estimation Variance + z · E[Churn]
• Size-m sample with maximal expected churn D:
  – Weights {xi}, previous sampling probabilities {pi}
  – Find new sampling probabilities {qi} that minimize the cost of taking m samples:
  – Minimize Σi xi²/qi subject to 0 ≤ qi ≤ 1, Σi qi = m, and Σi |pi − qi| ≤ D
  [Cohen, Cormode, Duffield, Lund 13]

Summary of Part 1

• Sampling as a powerful, general summarization technique
• Unbiased estimation via Horvitz-Thompson estimators
• Sampling from streams of data
  – Uniform sampling: reservoir sampling
  – Weighted generalizations: sample and hold, counting samples
• Advances in stream sampling
  – The cost principle for sample design, and IPPS methods
  – Threshold, priority, and VarOpt sampling
  – Extending the cost principle:
    □ Structure-aware, fair, stable, and sketch-guided sampling

Graham Cormode, University of Warwick
Nick Duffield, Texas A&M University
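The IPPS reservoir step (threshold calculation, random discard, Horvitz-Thompson adjustment) can be sketched in Python. This is an illustrative sketch, not the cited implementation: `ipps_threshold` and `varopt_step` are assumed names, and a simple bisection search stands in for the O(log m) data structures of the Cohen et al. paper.

```python
import random

def inclusion_prob(x, z):
    # IPPS inclusion probability p_z(x) = min{1, x/z}
    return min(1.0, x / z)

def ipps_threshold(weights, m, iters=100):
    # z solves sum_i p_z(x_i) = m; the sum is decreasing in z, so bisect.
    # Assumes len(weights) > m and all weights positive.
    lo, hi = min(weights) / 2.0, float(sum(weights))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(inclusion_prob(x, mid) for x in weights) > m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def varopt_step(reservoir, x_new, m, rng=random):
    # Provisionally include the new item, giving m+1 weights
    items = reservoir + [x_new]
    z = ipps_threshold(items, m)
    # Discard probabilities q_i = 1 - p_z(x_i); they sum to (m+1) - m = 1,
    # so exactly one item is discarded
    q = [1.0 - inclusion_prob(x, z) for x in items]
    drop = rng.choices(range(len(items)), weights=q)[0]
    # Horvitz-Thompson adjustment of survivors: x'_i = max{x_i, z}
    return [max(x, z) for i, x in enumerate(items) if i != drop]
```

Note that items with xi ≥ z have discard probability 0, so large weights always survive with their exact weight; only small items are subsampled and rounded up to z.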
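One way to realize the nearest-neighbor heuristic for localizing weight redistribution is sketched below, assuming keys are equal-length bit strings in a binary trie. The pair with the deepest common ancestor is merged by a two-item VarOpt step, so discarded weight stays inside the pair's subtree. All function names are assumptions for illustration.

```python
import random

def lca_depth(a, b):
    # Depth of the lowest common ancestor of two equal-length bit strings
    # in a binary trie: the length of their shared prefix
    d = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        d += 1
    return d

def remove_one_structure_aware(sample, rng=random):
    # sample: dict mapping bit-string key -> weight
    keys = list(sample)
    # Nearest-neighbor pair: deepest LCA first, then minimal combined weight
    _, _, a, b = max((lca_depth(a, b), -(sample[a] + sample[b]), a, b)
                     for i, a in enumerate(keys) for b in keys[i + 1:])
    z = sample[a] + sample[b]  # two-item VarOpt threshold
    # Discard a with probability 1 - x_a/z = x_b/z; the survivor takes
    # weight max{x, z} = z, so no weight leaves the pair's subtree
    victim, survivor = (a, b) if rng.random() < sample[b] / z else (b, a)
    del sample[victim]
    sample[survivor] = z
    return sample
```

Because the survivor absorbs the full pair weight, subset sums over any subtree at or above the pair's common ancestor are unchanged, which is the "no change outside subtree" property on the slide.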
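The max-min fair shares {mi} of the budget m under demands {ni} can be computed by water-filling: repeatedly satisfy the smallest demands in full and split what remains evenly. A minimal sketch (function name assumed):

```python
def maxmin_fair(demands, budget):
    # Max-min fair shares of `budget` under `demands` (water-filling)
    shares = [0.0] * len(demands)
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    remaining = float(budget)
    while active:
        level = remaining / len(active)
        i = active[0]
        if demands[i] <= level:
            # Smallest remaining demand fits under the water level:
            # satisfy it in full and redistribute the rest
            shares[i] = float(demands[i])
            remaining -= demands[i]
            active.pop(0)
        else:
            # Everyone left demands more than an even split: share equally
            for j in active:
                shares[j] = level
            break
    return shares
```

For example, demands (2, 8, 100) with budget 12 give shares (2, 5, 5): the small subpopulation is fully sampled while the large ones split the remainder, which is what equalizes relative variance across subpopulations.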
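The stable-sampling problem is a constrained convex program; solving it is beyond the slide's scope, but the objective and constraints it states can be evaluated directly for a candidate {qi}. A sketch with assumed names and an assumed numerical tolerance:

```python
def estimation_variance(weights, q):
    # The variance term of the cost: sum_i x_i^2 / q_i
    return sum(x * x / qi for x, qi in zip(weights, q))

def churn(p, q):
    # Expected churn between old probabilities p and new probabilities q
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def feasible(q, p, m, D, tol=1e-9):
    # Constraints from the slide: 0 <= q_i <= 1, sum_i q_i = m, churn <= D
    return (all(-tol <= qi <= 1.0 + tol for qi in q)
            and abs(sum(q) - m) <= tol
            and churn(p, q) <= D + tol)
```

Tightening the churn bound D restricts how far {qi} may move from {pi}, trading estimation variance for sample stability.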
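The Horvitz-Thompson estimators recalled in the summary can be illustrated with a short Monte Carlo check of unbiasedness; the item weights and inclusion probabilities below are arbitrary examples, not from the tutorial.

```python
import random

def ht_estimate(items, rng):
    # One Horvitz-Thompson estimate of the total weight: include item i
    # with probability p_i, and weight each survivor by 1/p_i
    return sum(x / p for x, p in items if rng.random() < p)

rng = random.Random(42)
items = [(3.0, 1.0), (1.0, 0.25), (2.0, 0.5)]  # (weight, inclusion prob)
# Averaging many estimates approaches the true total 3 + 1 + 2 = 6,
# since each item contributes p_i * (x_i / p_i) = x_i in expectation
avg = sum(ht_estimate(items, rng) for _ in range(20000)) / 20000
```

Any single estimate can be far from 6 (e.g. 12 when the rare item survives), but the estimator is unbiased, which is what makes subset-sum queries over samples trustworthy.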