# A scalable bootstrap for massive data

@article{Kleiner2011ASB, title={A scalable bootstrap for massive data}, author={Ariel Kleiner and Ameet S. Talwalkar and Purnamrita Sarkar and Michael I. Jordan}, journal={Journal of The Royal Statistical Society Series B-statistical Methodology}, year={2011}, volume={76}, pages={795-816} }

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets, which are increasingly prevalent, the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to specification…
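As a rough illustration of the kind of procedure this page is about, here is a minimal sketch of the Bag of Little Bootstraps (BLB), which the citing ICML entry describes as combining features of the bootstrap and subsampling. It is not the authors' implementation: the function name `blb_standard_error`, the parameters `gamma`, `s` and `r` and their defaults, and the choice of the sample mean as the estimator are all assumptions made for this example.

```python
import random
import statistics
from collections import Counter


def blb_standard_error(data, gamma=0.6, s=10, r=50, seed=0):
    """Sketch of a BLB estimate of the standard error of the sample mean.

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))  # subset size b = n^gamma, much smaller than n

    per_subset = []
    for _ in range(s):
        # Step 1: draw a small subset of size b without replacement.
        subset = rng.sample(data, b)

        estimates = []
        for _ in range(r):
            # Step 2: resample n points from the subset with replacement,
            # stored compactly as counts over the b distinct points.
            counts = Counter(rng.choices(range(b), k=n))
            # Step 3: evaluate the estimator (here, the mean) on the resample.
            weighted_mean = sum(subset[i] * c for i, c in counts.items()) / n
            estimates.append(weighted_mean)

        # Step 4: quality measure for this subset (standard error).
        per_subset.append(statistics.stdev(estimates))

    # Step 5: average the quality measure across subsets.
    return statistics.fmean(per_subset)


if __name__ == "__main__":
    demo_rng = random.Random(1)
    sample = [demo_rng.gauss(0.0, 1.0) for _ in range(2000)]
    print(blb_standard_error(sample))
```

For the sample mean of roughly standard normal data, the result should land near the textbook value σ/√n. The point of the design is that each resample touches only b distinct data points even though it has nominal size n, so the estimator can be computed from a short list of weighted points.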

#### 291 Citations

The Big Data Bootstrap

- Computer Science, Mathematics
- ICML
- 2012

The Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality, is presented.

Robust, Scalable, and Fast Bootstrap Method for Analyzing Large Scale Data

- Computer Science, Mathematics
- IEEE Transactions on Signal Processing
- 2016

This paper proposes a scalable, statistically robust and computationally efficient bootstrap method, compatible with distributed processing and storage systems, and demonstrates the scalability, low complexity and robust statistical performance of the method in analyzing large data sets.

A Subsampled Double Bootstrap for Massive Data

- Computer Science, Mathematics
- 2015

A new resampling method, the subsampled double bootstrap, is proposed, which is superior to BLB in running time, offers greater sample coverage, and can be implemented automatically with fewer tuning parameters for a given time budget.

SFB 823 A subsampled double bootstrap for massive data

- 2015

The bootstrap is a popular and powerful method for assessing precision of estimators and inferential methods. However, for massive datasets which are increasingly prevalent, the bootstrap becomes…

Fast and robust bootstrap in analysing large multivariate datasets

- Computer Science
- 2014 48th Asilomar Conference on Signals, Systems and Computers
- 2014

The proposed bootstrap method facilitates the use of highly robust statistical methods in analyzing large-scale data sets with significant savings in computation: the estimator is not recomputed for each bootstrap sample but is instead obtained analytically using a smart approximation.

Scalable Statistical Inference Using Distributed Bootstrapping And Iterative ℓ1-Norm Minimization

- Computer Science
- 2018 52nd Asilomar Conference on Signals, Systems, and Computers
- 2018

This paper proposes a scalable distributed bootstrap method that uses iterative estimation equations favoring sparse solutions and gives smaller root MSE and significantly lower bias than a bootstrap employing the widely used sparse estimator BPDN.

Support for scalable analytics over databases and data-streams

- Computer Science
- 2012

This thesis provides an improved bootstrap approach that uses the Bag of Little Bootstraps along with other recent advances in bootstrap and time-series theory to provide an effective Hadoop-based implementation for assessing the quality of a time-series sample.

Hyperparameter Selection for Subsampling Bootstraps

- Computer Science, Mathematics
- 2020

A hyperparameter selection methodology is developed for choosing the tuning parameters of subsampling methods, and an analytically simple and elegant relationship is found between the asymptotic efficiency of various subsampled estimators and their hyperparameters.

Sparsity-promoting bootstrap method for large-scale data

- Computer Science
- 2016 50th Asilomar Conference on Signals, Systems and Computers
- 2016

A scalable nonparametric bootstrap method is proposed that operates with a smaller number of distinct data points on multiple disjoint subsets of data and is compatible with distributed storage systems and distributed and parallel processing architectures.

A Bootstrap Metropolis–Hastings Algorithm for Bayesian Analysis of Big Data

- Computer Science, Medicine
- Technometrics
- 2016

The so-called bootstrap Metropolis–Hastings (BMH) algorithm is proposed, which provides a general framework for how to tame powerful MCMC methods to be used for big data analysis, that is, to replace the full data log-likelihood by a Monte Carlo average of the log-likelihoods that are calculated in parallel from multiple bootstrap samples.

#### References

Richardson Extrapolation and the Bootstrap

- Mathematics
- 1988

Simulation methods [particularly Efron's (1979) bootstrap] are being applied more and more frequently in statistical inference. Given data (X_1, …, X_n) distributed according to P, which…

ON THE CHOICE OF m IN THE m OUT OF n BOOTSTRAP AND CONFIDENCE BOUNDS FOR EXTREMA

- Mathematics
- 2008

For i.i.d. samples of size n, the ordinary bootstrap (Efron (1979)) is known to be consistent in many situations, but it may fail in important examples (Bickel, Götze and van Zwet (1997)). Using…

More Efficient Bootstrap Computations

- Mathematics
- 1990

This article concerns computational methods for the bootstrap that are more efficient than the straightforward Monte Carlo methods usually used. The bootstrap is considered in its simplest…

The Jackknife and the Bootstrap for General Stationary Observations

- Mathematics
- 1989

We extend the jackknife and the bootstrap method of estimating standard errors to the case where the observations form a general stationary sequence. We do not attempt a reduction to i.i.d. values…

The stationary bootstrap

- Mathematics
- 1994

This article introduces a resampling procedure called the stationary bootstrap as a means of calculating standard errors of estimators and constructing confidence regions for parameters…

Gap bootstrap methods for massive data sets with an application to transportation engineering

- Mathematics
- 2012

In this paper we describe two bootstrap methods for massive data sets. Naive applications of common resampling methodology are often impractical for massive data sets due to computational burden and…

Bootstrapping General Empirical Measures

- Mathematics
- 1990

It is proved that the bootstrapped central limit theorem for empirical processes indexed by a class of functions F and based on a probability measure P holds a.s. if and only if F ∈ CLT(P) and ∫ F² dP…

How Many Bootstraps

- Computer Science
- 1985

This document proposes an adaptive sequential method that estimates the accuracy of the bootstrap based on the current bootstrap samples until the estimated accuracy is high enough.

An Introduction to the Bootstrap

- 2007

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

- Computer Science
- NSDI
- 2012

Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.