…commodity hardware
• Potentially offers flexibility in traffic measurement
  – Allocate system resources to measurement tasks as needed
  – Dynamic reconfiguration, fine-grained tuning of sampling
  – Stateful packet inspection and sampling for network security
• Technical challenges:
  – High-rate packet processing in software
  – Transparent support from commodity hardware
  – OpenSketch [Yu, Jose, Miao; NSDI 2013]
• Same issues arise in other applications: use of commodity programmable HW

Stream Sampling: Sampling as Cost Optimization

Matching Data to Sampling Analysis
• Generic problem 1: counting objects, weight xi = 1
  – Bernoulli (uniform) sampling with probability p works fine
  – Estimated subset count X'(S) = #{samples in S} / p
  – Relative variance of X'(S) = (1/p − 1) / X(S)
    □ Given p, get any desired accuracy for large enough S
• Generic problem 2: xi follow a Pareto distribution, e.g. an 80-20 law
  – A small proportion of objects possess a large proportion of the total weight
    □ How best to sample objects to accurately estimate weight?
  – Uniform sampling? (compared in the simulation sketch after the Error Estimates slide below)
    □ Likely to omit heavy objects → big hit on accuracy
    □ Making the selection set S large doesn't help
  – Select the m largest objects?
    □ Biased: the smaller objects are never represented

Structure (Un)Aware Sampling
• Sampling is oblivious to structure in keys (e.g. the IP address hierarchy)
  – Estimation disperses the weight of discarded items onto surviving samples
• Queries are structure-aware: subset sums over related keys (IP subnets)
  – Accuracy on the LHS is decreased by discarding weight on the RHS
[Figure: binary prefix tree over keys 000–111; weight discarded in one subtree is dispersed to samples elsewhere]

Localizing Weight Redistribution
• Initial weight set {xi : i ∈ S} for some S ⊂ Ω
  – e.g. Ω = possible IP addresses, S = observed IP addresses
• Attribute a "range cost" C({xi : i ∈ R}) to each weight subset R ⊂ S
  – Possible factors for range cost:
    □ Sampling variance
    □ Topology, e.g. height of the lowest common ancestor
  – Heuristic: R* = nearest-neighbor pair {xi, xj} of minimal xi·xj
• Sample k items from S by progressively removing one item from the subset of minimal range cost (a simplified sketch follows the Error Estimates slide below):
  – While |S| > k:
    □ Find R* ⊆ S of minimal range cost
    □ Remove a weight from R* with VarOpt [Cohen, Cormode, Duffield; SIGMETRICS 2011]
  – Max-min fair at all times
[Figure: weights x1…x10 on the prefix tree, and the surviving estimates x'i after removal]

Minimal Cost Sampling: IPPS
• IPPS: Inclusion Probability Proportional to Size
• Minimize cost Σi (xi²(1/pi − 1) + z²·pi) subject to 0 ≤ pi ≤ 1
• Solution: pi = pz(xi) = min{1, xi/z}
  – Small objects (xi < z) selected with probability proportional to size
  – Large objects (xi ≥ z) selected with probability 1
  – Call z the "sampling threshold"
  – Unbiased estimator x'i = xi/pi = max{xi, z}
• Perhaps reminiscent of importance sampling, but not the same:
  – No assumptions are made concerning the distribution of the xi
[Figure: pz(x) rises linearly in x and saturates at 1 for x ≥ z]

Error Estimates and Bounds
• Variance based:
  – HT sampling variance for a single object of weight xi
    □ Var(x'i) = xi²(1/pi − 1) = xi²(1/min{1, xi/z} − 1) ≤ z·xi
  – Subset sum X(S) = Σi∈S xi is estimated by X'(S) = Σi∈S x'i
    □ Var(X'(S)) ≤ z·X(S)
• Exponential bounds:
  – e.g. Prob[X'(S) = 0] ≤ exp(−X(S)/z)
• Bounds are simple and powerful:
  – They depend only on the subset sum X(S), not on its individual constituents
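To make IPPS and its variance bound concrete, here is a minimal NumPy sketch; it is not from the tutorial, and the Pareto shape, n, and m are illustrative assumptions. It draws heavy-tailed weights as in "generic problem 2", bisects for the threshold z that yields a target expected sample size, and compares uniform Bernoulli sampling against IPPS (HT estimator x'i = max{xi, z}) on a subset sum:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Heavy-tailed weights, as in "generic problem 2" (Pareto-like, 80-20 law).
n = 100_000
x = rng.pareto(1.5, size=n) + 1.0

def ipps_threshold(x, m, iters=100):
    """Bisect for z such that the expected sample size sum_i min(1, x_i/z) ~ m."""
    lo, hi = 0.0, float(x.sum())
    for _ in range(iters):
        z = 0.5 * (lo + hi)
        if np.minimum(1.0, x / z).sum() > m:
            lo = z          # too many samples expected: raise the threshold
        else:
            hi = z
    return 0.5 * (lo + hi)

m = 2**10                   # target expected sample size (illustrative)
z = ipps_threshold(x, m)

# IPPS / threshold sampling: p_i = min(1, x_i/z); HT estimate x_i/p_i = max(x_i, z).
p = np.minimum(1.0, x / z)
ipps_est = np.where(rng.random(n) < p, np.maximum(x, z), 0.0)

# Uniform Bernoulli sampling at the same expected sample size m.
q = m / n
unif_est = np.where(rng.random(n) < q, x / q, 0.0)

# Estimate the weight of a random subset S holding ~1% of the keys.
S = rng.random(n) < 0.01
X = x[S].sum()
print(f"true X(S)        = {X:10.1f}")
print(f"IPPS estimate    = {ipps_est[S].sum():10.1f}  (RSD bound sqrt(z/X) = {np.sqrt(z / X):.3f})")
print(f"uniform estimate = {unif_est[S].sum():10.1f}")
```

On heavy-tailed data the uniform estimate is typically far noisier, since the estimate swings on whether the few dominant weights happen to be sampled; IPPS keeps those with probability 1.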
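The removal loop from the Localizing Weight Redistribution slide can be sketched as below. This is a simplification assumed for illustration, not the paper's exact algorithm: range cost is approximated by the nearest-neighbor product xi·xj (tie-broken by LCA depth), and the VarOpt step on a two-item subset keeps one item with probability proportional to its weight and assigns it the pair's combined weight, which preserves unbiased subset sums. Keys are assumed to be equal-length bit strings (prefix-tree leaves):

```python
import random

def localized_downsample(weights, k, rng=None):
    """Greedy structure-aware downsampling sketch (assumes k >= 1).

    weights: dict mapping equal-length bit-string keys (prefix-tree leaves)
    to positive weights. Repeatedly finds the nearest-neighbor pair of
    minimal range cost and removes one of the two, moving its weight to
    the survivor (a two-item VarOpt step), until k items remain.
    """
    rng = rng or random.Random(0)
    items = dict(weights)

    def lca_depth(a, b):
        # Length of the common prefix = depth of the lowest common ancestor.
        d = 0
        for ca, cb in zip(a, b):
            if ca != cb:
                break
            d += 1
        return d

    while len(items) > k:
        keys = sorted(items)  # lexicographic order ~ left-to-right leaves
        # Range-cost heuristic: minimal product x_i*x_j, preferring deep LCAs.
        cost, a, b = min(
            ((items[u] * items[v], -lca_depth(u, v)), u, v)
            for u, v in zip(keys, keys[1:])
        )
        # Two-item VarOpt step: keep one of {a, b} with probability
        # proportional to its weight and give it the combined weight,
        # so the expectation of any subset sum is unchanged.
        total = items[a] + items[b]
        keep, drop = (a, b) if rng.random() < items[a] / total else (b, a)
        items[keep] = total
        del items[drop]
    return items

# Example: 8 leaves 000..111 with hypothetical weights, keep k = 4.
w = {f"{i:03b}": float(v) for i, v in enumerate([5, 1, 2, 8, 1, 1, 3, 9])}
print(localized_downsample(w, k=4))
```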
Sampled IP Traffic Measurements
• Packet Sampled NetFlow (see the two-stage sketch at the end of this section)
  – Sample the packet stream in the router to limit the rate of key lookups: uniform 1/N
  – Aggregate sampled packets into flow records by key
• Model: packet stream of (key, bytesize) pairs {(ki, bi)}
• Packet-sampled flow record (k, b) where b = Σ{bi : i sampled ∧ ki = k}
  – HT estimate N·b of the total bytes in the flow
• Downstream sampling of flow records in the measurement infrastructure
  – IPPS sampling, probability min{1, N·b/z}
• Chained variance bound for any subset sum X of flows
  – Var(X') ≤ (z + N·bmax)·X, where bmax = maximum packet byte size
  – Regardless of how packets are distributed amongst flows
  – [Duffield, Lund, Thorup; IEEE ToIT 2005]

Estimation Accuracy in Practice
• Estimate any subset sum comprising at least a fraction f of the total weight
• Suppose: sample size m
• Analysis: typical estimation error ε (relative standard deviation) obeys ε ≤ 1/√(f·m)
  – e.g. with m = 2^16 samples, estimate a fraction f = 0.1% with typical relative error ≈ 12% (checked numerically at the end of this section)
• 2^16 samples = the same storage needed for aggregates over 16-bit address prefixes
  □ But sampling gives more flexibility to estimate traffic within aggregates
[Figure: RSD ε = 1/√(f·m) vs. fraction f, for m = 2^16 samples]

Heavy Hitters: Exact vs. Aggregate vs. Sampled
• Sampling does not tell you where the interesting features are
  – But it does speed up the ability to find them with existing tools
• Example: heavy hitter detection
  – Setting: flow records reporting a 10 GB/s traffic stream
  – Aim: find heavy hitters = IP prefixes comprising ≥ 0.1% of traffic
  – Response time needed: 5 minutes
• Compare:
  – Exact: 10 GB/s × 5 minutes yields upwards of 300M flow records
  – 64k aggregates over 16-bit prefixes: no deeper drilldown possible
  – Sampled: 64k flow records: any aggregate…
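As referenced above, a sketch of the two-stage pipeline from the Sampled IP Traffic Measurements slide: uniform 1-in-N packet sampling aggregated into flow records, followed by IPPS on the HT estimates N·b. The stream parameters (packet count, key range, byte sizes) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Synthetic packet stream of (key, bytesize) pairs; all counts and sizes
# here are illustrative assumptions, not values from the tutorial.
num_pkts = 500_000
keys  = rng.integers(0, 5_000, size=num_pkts)    # flow keys k_i
sizes = rng.integers(40, 1501, size=num_pkts)    # packet bytes b_i
b_max = 1500

N = 100        # stage 1: uniform 1-in-N packet sampling (sampled NetFlow)
z = 50_000.0   # stage 2: IPPS threshold on estimated flow bytes

# Stage 1: sample packets with probability 1/N, aggregate by key into
# flow records (k, b) with b = sum of sampled bytes for key k.
mask = rng.random(num_pkts) < 1.0 / N
flows: dict[int, int] = {}
for k, b in zip(keys[mask].tolist(), sizes[mask].tolist()):
    flows[k] = flows.get(k, 0) + b

# Stage 2: IPPS on the HT estimate N*b of each flow: p = min(1, N*b/z).
total_est, kept = 0.0, 0
for k, b in flows.items():
    est = N * b                   # HT estimate of the flow's total bytes
    p = min(1.0, est / z)
    if rng.random() < p:
        total_est += est / p      # chained HT estimate
        kept += 1

X = float(sizes.sum())            # true total bytes
# Chained bound from the slides: Var(X') <= (z + N*b_max) * X, so the
# relative standard deviation is at most sqrt((z + N*b_max) / X).
print(f"true bytes {X:.3e}, estimate {total_est:.3e}, kept {kept} records")
print(f"RSD bound  {np.sqrt((z + N * b_max) / X):.3f}")
```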
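Finally, a back-of-envelope check of the accuracy slide's worked example, ε ≤ 1/√(f·m) with m = 2^16 and f = 0.1%:

```python
from math import sqrt

m = 2**16    # sample size: same storage as 64k aggregates over 16-bit prefixes
f = 0.001    # subset comprising f = 0.1% of the total weight
eps = 1 / sqrt(f * m)
print(f"typical relative error: {eps:.1%}")   # ~12.4%, the slides' "12%"
```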