Statistics
Measures of central tendency
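These examples calculate measures of central tendency for a data set stored in a Rust array. An empty data set has no mean, median, or mode to calculate, so each function returns an [Option] for the caller to handle.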
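The first example calculates the mean (the sum of all values divided by the number of values) by creating an iterator of references over data, then using [sum] and [len] to determine the total and the count of the values respectively.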
fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];

    let sum = data.iter().sum::<i32>() as f32;
    let count = data.len();

    let mean = match count {
        positive if positive > 0 => Some(sum / count as f32),
        _ => None,
    };

    println!("Mean of the data is {:?}", mean);
}
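The example above sums the values as i32 before converting to f32, so a large enough data set could overflow the intermediate total. The following is a minimal sketch of one way around that, assuming it is acceptable to accumulate in f64 instead. The helper name mean_f64 is hypothetical and not part of the original example.

```rust
// Hypothetical helper (not from the original example): mean with an f64 accumulator.
fn mean_f64(data: &[i32]) -> Option<f64> {
    if data.is_empty() {
        // An empty data set has no mean.
        return None;
    }
    // Convert each value to f64 before summing, so the running total
    // is never held in an i32 and cannot overflow it.
    let sum: f64 = data.iter().map(|&x| f64::from(x)).sum();
    Some(sum / data.len() as f64)
}

fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];
    println!("Mean of the data is {:?}", mean_f64(&data));
    println!("Mean of an empty slice is {:?}", mean_f64(&[]));
    // Summing these two values as i32 would overflow.
    println!("Mean of two large values is {:?}", mean_f64(&[i32::MAX, i32::MAX]));
}
```

Returning early on an empty slice keeps the same Option contract as the original: the caller still decides what a missing mean means.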
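The second example calculates the median using the quickselect algorithm, which avoids a full [sort] by working only on the partitions of the data set that can still contain the median. It uses [cmp] and [Ordering] to decide concisely which partition to examine next, and [split_at] to take the first element of the current partition as the pivot at each step.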
use std::cmp::Ordering;

// Splits the data around its first element, which serves as the pivot.
fn partition(data: &[i32]) -> Option<(Vec<i32>, i32, Vec<i32>)> {
    match data.len() {
        0 => None,
        _ => {
            let (pivot_slice, tail) = data.split_at(1);
            let pivot = pivot_slice[0];
            let (left, right) = tail.iter().fold((vec![], vec![]), |mut splits, next| {
                {
                    let (left, right) = &mut splits;
                    if next < &pivot {
                        left.push(*next);
                    } else {
                        right.push(*next);
                    }
                }
                splits
            });

            Some((left, pivot, right))
        }
    }
}

// Returns the value that would sit at index `k` if the data were sorted.
fn select(data: &[i32], k: usize) -> Option<i32> {
    let part = partition(data);

    match part {
        None => None,
        Some((left, pivot, right)) => {
            let pivot_idx = left.len();

            match pivot_idx.cmp(&k) {
                Ordering::Equal => Some(pivot),
                Ordering::Greater => select(&left, k),
                Ordering::Less => select(&right, k - (pivot_idx + 1)),
            }
        }
    }
}

fn median(data: &[i32]) -> Option<f32> {
    let size = data.len();

    match size {
        // An empty data set has no median; this arm also prevents
        // the usize underflow in `even / 2 - 1` below.
        0 => None,
        even if even % 2 == 0 => {
            let fst_med = select(data, (even / 2) - 1);
            let snd_med = select(data, even / 2);

            match (fst_med, snd_med) {
                (Some(fst), Some(snd)) => Some((fst + snd) as f32 / 2.0),
                _ => None,
            }
        }
        odd => select(data, odd / 2).map(|x| x as f32),
    }
}

fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];

    let part = partition(&data);
    println!("Partition is {:?}", part);

    let sel = select(&data, 5);
    println!("Selection at ordered index {} is {:?}", 5, sel);

    let med = median(&data);
    println!("Median is {:?}", med);
}
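The partition function above allocates two new vectors at every step. As an alternative sketch, the standard library's select_nth_unstable method on slices performs the selection in place. This swaps in a different technique from the hand-written quickselect above, and the helper name median_in_place is hypothetical. The version below assumes that copying the input once is acceptable so the caller's data stays in its original order.

```rust
// Hypothetical helper (not from the original example): median via the
// standard library's in-place selection instead of a hand-written quickselect.
fn median_in_place(data: &[i32]) -> Option<f32> {
    if data.is_empty() {
        return None;
    }
    // Work on a private copy so the caller's slice is left untouched.
    let mut buf = data.to_vec();
    let mid = buf.len() / 2;
    // After this call, the element at `mid` is the one a full sort would put
    // there, and every element of `lower` is less than or equal to it.
    let (lower, &mut upper_mid, _) = buf.select_nth_unstable(mid);
    if data.len() % 2 == 1 {
        Some(upper_mid as f32)
    } else {
        // With an even count, the other middle value is the largest of `lower`.
        let lower_mid = *lower.iter().max()?;
        Some((lower_mid as f32 + upper_mid as f32) / 2.0)
    }
}

fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];
    println!("Median is {:?}", median_in_place(&data));
}
```

Converting each middle value to f32 before adding also avoids the i32 overflow that adding two very large values first could cause.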
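The final example calculates the mode using a mutable [HashMap] to collect the count of each distinct integer in the set, with a [fold] and the [entry] API. The most frequent value in the [HashMap] is then retrieved with [max_by_key].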
use std::collections::HashMap;

fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];

    let frequencies = data.iter().fold(HashMap::new(), |mut freqs, value| {
        *freqs.entry(value).or_insert(0) += 1;
        freqs
    });

    let mode = frequencies
        .into_iter()
        .max_by_key(|&(_, count)| count)
        .map(|(value, _)| *value);

    println!("Mode of the data is {:?}", mode);
}
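When two values share the highest count, the example above can return either of them, because HashMap iteration order is unspecified and max_by_key keeps the last maximum it sees. The following sketch shows one possible way to make the result deterministic, assuming the smallest tied value is the one wanted. The helper name mode_smallest is hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

// Hypothetical helper (not from the original example): the mode, with ties
// resolved in favour of the smallest value.
fn mode_smallest(data: &[i32]) -> Option<i32> {
    let mut frequencies: HashMap<i32, usize> = HashMap::new();
    for &value in data {
        *frequencies.entry(value).or_insert(0) += 1;
    }
    frequencies
        .into_iter()
        // Compare by count first; on equal counts, `Reverse` makes the
        // smaller value compare as the greater key, so it wins.
        .max_by_key(|&(value, count)| (count, Reverse(value)))
        .map(|(value, _)| value)
}

fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];
    println!("Mode of the data is {:?}", mode_smallest(&data));
    // 1 and 2 both appear twice; the smaller one is reported.
    println!("Mode with a tie is {:?}", mode_smallest(&[2, 2, 1, 1, 3]));
}
```

Because every value appears only once as a map key, the composite key is unique and the result no longer depends on iteration order.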
Standard deviation
This example calculates the standard deviation and z-score of a set of measurements.
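The standard deviation is defined as the square root of the variance, calculated here with f32's [sqrt]. The variance is the [sum] of the squared differences between each measurement and the [mean], divided by the number of measurements.
The z-score is the number of standard deviations a single measurement lies above or below the [mean] of the data set: (measurement - mean) / standard deviation.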
fn mean(data: &[i32]) -> Option<f32> {
    let sum = data.iter().sum::<i32>() as f32;
    let count = data.len();

    match count {
        positive if positive > 0 => Some(sum / count as f32),
        _ => None,
    }
}

fn std_deviation(data: &[i32]) -> Option<f32> {
    match (mean(data), data.len()) {
        (Some(data_mean), count) if count > 0 => {
            let variance = data.iter().map(|value| {
                let diff = data_mean - (*value as f32);

                diff * diff
            }).sum::<f32>() / count as f32;

            Some(variance.sqrt())
        },
        _ => None
    }
}

fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];

    let data_mean = mean(&data);
    println!("Mean is {:?}", data_mean);

    let data_std_deviation = std_deviation(&data);
    println!("Standard deviation is {:?}", data_std_deviation);

    let zscore = match (data_mean, data_std_deviation) {
        (Some(mean), Some(std_deviation)) => {
            let diff = data[4] as f32 - mean;

            Some(diff / std_deviation)
        },
        _ => None
    };
    println!("Z-score of data at index 4 (with value {}) is {:?}", data[4], zscore);
}
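The example above computes the z-score of a single measurement. If every measurement is identical, the standard deviation is zero and the division yields NaN rather than a meaningful score. The sketch below applies the same population formulas to the whole data set and, as an assumption of this sketch, returns None in that case. The helper name z_scores is hypothetical.

```rust
// Hypothetical helper (not from the original example): z-scores for every
// measurement, using the population variance as in the example above.
fn z_scores(data: &[i32]) -> Option<Vec<f32>> {
    if data.is_empty() {
        return None;
    }
    let count = data.len() as f32;
    let mean = data.iter().map(|&v| v as f32).sum::<f32>() / count;
    let variance = data
        .iter()
        .map(|&v| {
            let diff = v as f32 - mean;
            diff * diff
        })
        .sum::<f32>()
        / count;
    let std_deviation = variance.sqrt();
    if std_deviation == 0.0 {
        // All measurements are equal, so a z-score is undefined.
        return None;
    }
    Some(data.iter().map(|&v| (v as f32 - mean) / std_deviation).collect())
}

fn main() {
    let data = [3, 1, 6, 1, 5, 8, 1, 8, 10, 11];
    println!("Z-scores are {:?}", z_scores(&data));
    println!("Z-scores of constant data are {:?}", z_scores(&[4, 4, 4]));
}
```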