Number of watchers on Github  28 
Number of open issues  2 
Main language  Rust 
Average time to merge a PR  2 days 
Open pull requests  0+ 
Closed pull requests  0+ 
Last commit  over 1 year ago 
Repo Created  over 3 years ago 
Repo Last Updated  almost 2 years ago 
Size  1020 KB 
Organization / Author  postmates 
Latest Release  0.2.0 
Contributors  4 
Page Updated  20180317 
Do you use quantiles? Leave a review!  
View open issues (2)  
View on github  
Fresh, new opensource launches 🚀🚀🚀  
Trendy new open source projects in your inbox!
View examples

This crate is intended to be a collection of approxiate quantile algorithms that provide guarantees around space and computation. Recent literature has advanced approximation techniques but none are generally applicable and have fundamental tradeoffs.
Initial work was done to support internal Postmates projects but the hope is that the crate can be generally useful.
This is an implementation of the algorithm presented in Cormode, Korn,
Muthukrishnan, Srivastava's paper Effective Computation of Biased Quantiles
over Data Streams
. The ambition here is to approximate quantiles on a stream of
data without having a boatload of information kept in memory. This
implementation follows the
IEEE version
of the paper. The authors' selfpublished copy of the paper is incorrect and
this implementation will not make sense if you follow along using that
version. Only the 'full biased' invariant is used. The 'targeted quantiles'
variant of this algorithm is fundamentally flawed, an issue which the authors
correct in their Space and TimeEfficient Deterministic Algorithms for Biased
Quantiles over Data Streams
use quantiles::CKMS;
let mut ckms = CKMS::<u16>::new(0.001);
for i in 1..1001 {
ckms.insert(i as u16);
}
assert_eq!(ckms.query(0.0), Some((1, 1)));
assert_eq!(ckms.query(0.998), Some((998, 998)));
assert_eq!(ckms.query(0.999), Some((999, 999)));
assert_eq!(ckms.query(1.0), Some((1000, 1000)));
Queries provide an approximation to the true quantile, +/ n. In the above, is set to 0.001, n is 1000. Minimum and maximum quantiles0.0 and 1.0are already precise. The error for the middle query is then +/ 0.998. (This so happens to be the exact quantile, but that doesn't always hold.)
For an error this structure will require T*(floor(1/(2*)) + O(1/ log n)) +
f64 + usize + usize
words of storage, where T is the specialized type.
In local testing, insertion per point takes approximately 4 microseconds with a variance of 7%. This comes to 250k points per second.
MisraGries calculates an approximate frequency count for a stream of N elements. The output is the k most frequent elements.
The error bound = 1/(k+1) where k is the number of counters used in the algorithm. When k = 1 i.e. a single counter, the algorithm is equivalent to the BoyerMoore Majority algorithm.
If you want to check for elements that appear at least N times, you will want to perform a second pass to calculate the exact frequencies of the values in the result set which can be done in constant space.
use quantiles::misra_gries::*;
let k: usize = 3;
let numbers: Vec<u32> = vec![1,3,2,1,3,4,3,1,2,1];
let counts = misra_gries(numbers.iter(), k);
let bound = numbers.len() / (k+1);
let in_range = f_expected: usize, f_approx: usize {
f_approx <= f_expected &&
(bound >= f_expected  f_approx >= (f_expected  bound))
};
assert!(in_range(4usize, *counts.get(&1).unwrap()));
assert!(in_range(2usize, *counts.get(&2).unwrap()));
assert!(in_range(3usize, *counts.get(&3).unwrap()));
Greenwald Khanna calculates approximate quantiles.
If the desired quantile is `, the approximate
quantile is any element in the range of elements that rank
between
()Nand
(+)N`
The stream summary datastructure can cope with up to max[usize] observations.
The beginning and end quantiles are clamped at the Minimum and maximum observed elements respectively.
This page explains the theory: http://www.mathcs.emory.edu/~cheung/Courses/584StreamDB/Syllabus/08Quantile/Greenwald.html
use quantiles::greenwald_khanna::*;
let epsilon = 0.01;
let mut stream = Stream::new(epsilon);
let n = 1001;
for i in 1..n {
stream.insert(i);
}
let in_range = phi: f64, value: u32 {
let lower = ((phi  epsilon) * (n as f64)) as u32;
let upper = ((phi + epsilon) * (n as f64)) as u32;
(epsilon > phi  lower <= value) && value <= upper
};
assert!(in_range(0f64, *stream.quantile(0f64)));
assert!(in_range(0.1f64, *stream.quantile(0.1f64)));
assert!(in_range(0.2f64, *stream.quantile(0.2f64)));
assert!(in_range(0.3f64, *stream.quantile(0.3f64)));
assert!(in_range(0.4f64, *stream.quantile(0.4f64)));
assert!(in_range(1f64, *stream.quantile(1f64)));
This release introduces two new algorithms, Greenwald Khanna
and Misra
Gries
. The existing CKMS has been moved from root to its own submodule. You'll
need to update your imports from
use quantiles::CMKS;
to
use quantiles::ckms::CKMS;
Make CKMS serializable, impl AddAssign (#7)
This commit fiddles with the interface of CKMS a bit by adding the AddAssign traituseful for merging two CKMS's together or for using CKMS in a numeric contextand serde serialization. It's now possible to flush this happy thing to disk or over the wire.
A few bugs were corrected:
Insert into samples immediately
The CKMS paper discusses buffering points for batch or 'cursor' insertion. After some experimentation it was determined that this buffering decreased throughput in the present implementation. A benchmark to stress the machine to its limits is desirable.
On the plus side, there is no longer any need for a flush_0 function and the user will not be tripped up by a lack of points that have been inserted but not removed from buffer.
Adjust query_1 to be immutable
Previously a call to query_1 would privately invoke flush_0. This commit prefers to avoid this to allow query_1 to be immutable which improves the ergonomics of the library. This does mean, potentially, that a client may have to call flush_0 themselves, as noted in the docs.