Are you happy with your logging solution? Would you help us out by taking a 30-second survey? Click here

quantiles

Approximations to Histograms

Subscribe to updates I use quantiles


Statistics on quantiles

Number of watchers on Github 28
Number of open issues 2
Main language Rust
Average time to merge a PR 2 days
Open pull requests 0+
Closed pull requests 0+
Last commit over 1 year ago
Repo Created over 3 years ago
Repo Last Updated almost 2 years ago
Size 1020 KB
Organization / Authorpostmates
Latest Release0.2.0
Contributors4
Page Updated
Do you use quantiles? Leave a review!
View open issues (2)
View on github
Fresh, new opensource launches 🚀🚀🚀
Trendy new open source projects in your inbox! View examples

Subscribe to our mailing list

Evaluating quantiles for your project? Score Explanation
Commits Score (?)
Issues & PR Score (?)

quantiles

Build Status Codecov Crates.io

This crate is intended to be a collection of approxiate quantile algorithms that provide guarantees around space and computation. Recent literature has advanced approximation techniques but none are generally applicable and have fundamental tradeoffs.

Initial work was done to support internal Postmates projects but the hope is that the crate can be generally useful.

The Algorithms

CKMS - Effective Computation of Biased Quantiles over Data Streams

This is an implementation of the algorithm presented in Cormode, Korn, Muthukrishnan, Srivastava's paper Effective Computation of Biased Quantiles over Data Streams. The ambition here is to approximate quantiles on a stream of data without having a boatload of information kept in memory. This implementation follows the IEEE version of the paper. The authors' self-published copy of the paper is incorrect and this implementation will not make sense if you follow along using that version. Only the 'full biased' invariant is used. The 'targeted quantiles' variant of this algorithm is fundamentally flawed, an issue which the authors correct in their Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams

use quantiles::CKMS;

let mut ckms = CKMS::<u16>::new(0.001);
for i in 1..1001 {
    ckms.insert(i as u16);
}

assert_eq!(ckms.query(0.0), Some((1, 1)));
assert_eq!(ckms.query(0.998), Some((998, 998)));
assert_eq!(ckms.query(0.999), Some((999, 999)));
assert_eq!(ckms.query(1.0), Some((1000, 1000)));

Queries provide an approximation to the true quantile, +/- n. In the above, is set to 0.001, n is 1000. Minimum and maximum quantiles--0.0 and 1.0--are already precise. The error for the middle query is then +/- 0.998. (This so happens to be the exact quantile, but that doesn't always hold.)

For an error this structure will require T*(floor(1/(2*)) + O(1/ log n)) + f64 + usize + usize words of storage, where T is the specialized type.

In local testing, insertion per point takes approximately 4 microseconds with a variance of 7%. This comes to 250k points per second.

Misra Gries - -approximate frequency counts

Misra-Gries calculates an -approximate frequency count for a stream of N elements. The output is the k most frequent elements.

  1. the approximate count f'[e] is smaller than the true frequency f[e] of e, but by at most N, i.e., (f[e] - N) f'[e] f[e]
  2. any element e with a frequency f[e] N appears in the result set

The error bound = 1/(k+1) where k is the number of counters used in the algorithm. When k = 1 i.e. a single counter, the algorithm is equivalent to the Boyer-Moore Majority algorithm.

If you want to check for elements that appear at least N times, you will want to perform a second pass to calculate the exact frequencies of the values in the result set which can be done in constant space.

use quantiles::misra_gries::*;

let k: usize = 3;
let numbers: Vec<u32> = vec![1,3,2,1,3,4,3,1,2,1];
let counts = misra_gries(numbers.iter(), k);
let bound = numbers.len() / (k+1);
let in_range = |f_expected: usize, f_approx: usize| {
  f_approx <= f_expected &&
  (bound >= f_expected || f_approx >= (f_expected - bound))
};
assert!(in_range(4usize, *counts.get(&1).unwrap()));
assert!(in_range(2usize, *counts.get(&2).unwrap()));
assert!(in_range(3usize, *counts.get(&3).unwrap()));

Greenwald Khanna - -approximate quantiles

Greenwald Khanna calculates -approximate quantiles. If the desired quantile is `, the -approximate quantile is any element in the range of elements that rank between(-)Nand(+)N`

The stream summary datastructure can cope with up to max[usize] observations.

The beginning and end quantiles are clamped at the Minimum and maximum observed elements respectively.

This page explains the theory: http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/08-Quantile/Greenwald.html

use quantiles::greenwald_khanna::*;

let epsilon = 0.01;

let mut stream = Stream::new(epsilon);

let n = 1001;
for i in 1..n {
    stream.insert(i);
}
let in_range = |phi: f64, value: u32| {
  let lower = ((phi - epsilon) * (n as f64)) as u32;
  let upper = ((phi + epsilon) * (n as f64)) as u32;
  (epsilon > phi || lower <= value) && value <= upper
};
assert!(in_range(0f64, *stream.quantile(0f64)));
assert!(in_range(0.1f64, *stream.quantile(0.1f64)));
assert!(in_range(0.2f64, *stream.quantile(0.2f64)));
assert!(in_range(0.3f64, *stream.quantile(0.3f64)));
assert!(in_range(0.4f64, *stream.quantile(0.4f64)));
assert!(in_range(1f64, *stream.quantile(1f64)));

Upgrading

0.2 -> 0.3

This release introduces two new algorithms, Greenwald Khanna and Misra Gries. The existing CKMS has been moved from root to its own submodule. You'll need to update your imports from

use quantiles::CMKS;

to

use quantiles::ckms::CKMS;
quantiles questions on Stackoverflow (View All Questions)
  • loop through quantiles to create unique column in dataset
  • Colorize cells of a column based on its quantiles using DT package and do it for any column
  • How can I adjust the lengths and widths of bean lines and quantiles in beanplots using R?
  • ddply multiple quantiles by group
  • Plot Additional Quantiles on Seaborn Violin Plots
  • Lower and upper quantiles within grouping factors
  • Calendar heatmap: how to define the range of % for each color instead of using quantiles
  • Plotting Quantiles values of boxplot in R inside a for loop
  • Pig: how to compute quantiles for several variables in a Group By?
  • Access p-value from quantile regression for multiple quantiles in R
  • Replace outliers by quantiles in R
  • R - Wrong quantiles probs returned for edcf(x)
  • Find T^2 quantiles for multivariate data in R
  • Standardized residuals x Theoretical Quantiles plot in ggplot2 package
  • Estimating quantiles of conditional distribution in R
  • Pandas GroupBy Doesn't Persist After Applying Quantiles?
  • obtaining quantiles from complete gaussian fit of data in R
  • Nested 'ifelse'-statement for quantiles
  • create boxplots of quantiles against binary endpoint
  • Calculate percentiles/quantiles for a timeseries with resample or groupby - pandas
  • Plot quantiles in R
  • Incorrect local quantiles with the raster package's focal function
  • converting back from equicoordinate quantiles mvtnorm package R
  • Difference in median and other quantiles between two samples
  • How to find quantiles of an empirical cumulative density function (ECDF)
  • Counting quantiles for subgroups in weighted sample in R
  • R mark quantiles in a plot
  • quantiles dependent on additional variables
  • How can I specify probability values when invoking quantiles within apply in R?
  • Get date quantiles in pandas
quantiles list of languages used
quantiles latest release notes

Make CKMS serializable, impl AddAssign (#7)

This commit fiddles with the interface of CKMS a bit by adding the AddAssign trait--useful for merging two CKMS's together or for using CKMS in a numeric context--and serde serialization. It's now possible to flush this happy thing to disk or over the wire.

A few bugs were corrected:

  • It's now possible to set the error bounds, which previously was hard-coded at 0.001 and
  • partial_cmp is used in place of home-grown comparision.

Insert into samples immediately

The CKMS paper discusses buffering points for batch or 'cursor' insertion. After some experimentation it was determined that this buffering decreased throughput in the present implementation. A benchmark to stress the machine to its limits is desirable.

On the plus side, there is no longer any need for a flush_0 function and the user will not be tripped up by a lack of points that have been inserted but not removed from buffer.

Adjust query_1 to be immutable

Previously a call to query_1 would privately invoke flush_0. This commit prefers to avoid this to allow query_1 to be immutable which improves the ergonomics of the library. This does mean, potentially, that a client may have to call flush_0 themselves, as noted in the docs.

Other projects in Rust