What is Bootstrapping
Bootstrapping is a method for estimating the properties of some quantity, such as its expected value or its variance, by resampling with replacement.
The idea is that, when we have a sample, we can estimate quantities (e.g. the mean) from that sample. But we know that the sample isn’t going to be a perfect representation of the population, and that if we collected several samples, our estimate of the mean (or whatever other quantity) would differ from one sample to the next. Bootstrapping gives us a tool to estimate this variability in our quantity of interest without having to collect multiple samples.
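It can be a nice approach for quantifying the uncertainty of estimates with small data sets, with datasets that are non-normal, or with data where outliers make the usual formulas unreliable. A benefit of bootstrapping is that it makes no assumptions about the shape of the distribution your data came from, which is why it is useful when normal-theory methods are questionable. It does still assume that the sample is reasonably representative of the population, so very small or outlier-dominated samples can still give misleading results.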
How It Works
Bootstrapping works by resampling with replacement from a sample, estimating the quantity of interest on the resample, and then repeating this process lots of times, often 1,000 or more. After all of these repetitions, we have a distribution of the quantity of interest, so we can get a sense of its expected value (the mean of that distribution) as well as its standard error (the standard deviation of that distribution). This approach lends itself well to constructing confidence intervals, too.
The general process is:
- If you have a dataset x (vector, matrix, whatever) with i observations, resample i observations with replacement from x. Note that you don’t have to draw exactly i observations in each resample, but it’s the default approach.
- Estimate your quantity of interest (e.g. mean, quantile, regression coefficient, whatever) on your resampled dataset.
- Repeat the process n times, where n is a fairly large number (usually at least 1,000).
- Use your n estimates as the distribution of your quantity. You can use this to calculate the mean, standard error, confidence intervals, etc.
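To make the four steps concrete, here is a minimal sketch in Julia that bootstraps the mean of a vector. The helper name `bootstrap_estimates`, the simulated data, the seed, and the choice of a 95% percentile interval are all assumptions made for illustration, and the sketch assumes the statistic returns a single real number.

```julia
using Random
using Statistics

# Hypothetical helper (not from any package): returns n bootstrap
# estimates of the statistic `stat` computed on resamples of x.
function bootstrap_estimates(stat, x; n=1_000, rng=Random.default_rng())
    i = length(x)                       # number of observations in x
    estimates = Vector{Float64}(undef, n)
    for b in 1:n
        resample = rand(rng, x, i)      # step 1: draw i observations with replacement
        estimates[b] = stat(resample)   # step 2: estimate the quantity of interest
    end                                 # step 3: repeat n times
    return estimates
end

# Simulated data, purely for illustration.
rng = MersenneTwister(42)
x = randn(rng, 50) .* 2 .+ 10

boot = bootstrap_estimates(mean, x; n=2_000, rng=rng)

# Step 4: summarize the distribution of the n estimates.
boot_mean = mean(boot)                  # bootstrap estimate of the mean
boot_se = std(boot)                     # standard error = std of the estimates
ci = quantile(boot, [0.025, 0.975])     # 95% percentile confidence interval
```

Passing a different function as `stat` (for example `median`) reuses the same loop for another quantity. The percentile interval shown is only one of several ways to build a bootstrap confidence interval.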
Implementation
Below is a basic demonstration (in Julia) of bootstrapping to estimate various percentiles of a distribution.