BayesKNN.jl
BayesKNN.jl implements the probabilistic nearest-neighbour classifier of Holmes and Adams (2002). It samples the posterior distribution of the neighbourhood size k and neighbour-strength parameter beta, then averages over posterior draws to produce class probabilities for new observations.
Quick Start
Rows are observations and columns are predictors. Put predictors on a common scale before fitting.
using BayesKNN, Random
X = [0.0; 0.2; 3.8; 4.0;;]
y = ["a", "a", "b", "b"]
fit = fit_bayesknn(X, y; k_values = [1, 2], nsamples = 1_000, rng = MersenneTwister(1))
pred = predict_proba(fit, [0.1; 3.9;;])
Input Convention
Xtrain and Xtest must be real-valued matrices with observations in rows and predictors in columns. Labels may be strings, numbers, or other sortable values. Missing labels and non-finite predictor values are rejected.
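One way to meet the scaling and finiteness requirements is to z-score each column using training statistics only, then apply the same shift and scale to the test matrix. The sketch below uses only the Statistics standard library; `standardize_columns` is a hypothetical helper, not part of BayesKNN.jl.

```julia
using Statistics

# Hypothetical helper (not part of BayesKNN.jl): z-score each column using
# the training means and standard deviations only.
function standardize_columns(Xtrain::AbstractMatrix{<:Real}, Xtest::AbstractMatrix{<:Real})
    all(isfinite, Xtrain) && all(isfinite, Xtest) ||
        throw(ArgumentError("predictors must be finite"))
    mu = mean(Xtrain; dims = 1)
    sigma = std(Xtrain; dims = 1)
    # Guard against constant columns, which would otherwise divide by zero.
    sigma = map(s -> s > 0 ? s : one(s), sigma)
    return (Xtrain .- mu) ./ sigma, (Xtest .- mu) ./ sigma
end

Xtrain = [0.0 100.0; 0.2 300.0; 3.8 200.0; 4.0 400.0]
Xtest = [0.1 150.0; 3.9 350.0]
Ztrain, Ztest = standardize_columns(Xtrain, Xtest)
```

The standardized matrices keep the rows-as-observations layout, so they can be passed wherever Xtrain and Xtest are expected.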
Beta Prior
The neighbour-strength parameter beta is constrained to be nonnegative. By default, fit_bayesknn uses truncated(Normal(0.0, 5.0), 0.0, Inf). You can pass a different prior with nonnegative support:
using Distributions
fit = fit_bayesknn(X, y; beta_prior = Gamma(2.0, 1.0))
API Reference
BayesKNN.BayesKNNFit — Type
BayesKNNFit
Container for a fitted multiclass probabilistic nearest-neighbour model.
Fields:
chain: named tuple (beta::Vector{Float64}, k_idx::Vector{Int}) of posterior draws.
Xtrain: training predictor matrix, observations in rows.
ytrain: encoded training labels as integers 1, 2, ..., M.
classes: sorted original class labels; column m of any probability matrix corresponds to classes[m].
k_values: candidate neighbourhood sizes used during fitting.
train_order: n_train × (n_train - 1) matrix of neighbour indices for each training point, sorted by increasing distance.
cumcounts_train: n_train × length(k_values) × M array of cumulative class counts among neighbours for each training point.
tree: KDTree built from the training data, reused for test-set queries.
diagnostics: named tuple of convergence diagnostics. Always contains ess_beta and ess_k (effective sample size for beta and k). When nchains > 1, also contains rhat_beta and rhat_k (split-R-hat); values near 1.0 indicate good mixing, values above 1.1 suggest the chains have not converged. fake_fit and other test helpers may store nothing.
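To make the ytrain, classes, and cumcounts_train fields concrete, the following sketch encodes labels against their sorted unique values and tallies class counts among the first k neighbours of a single point. The helper names `encode_labels` and `cumulative_counts` are hypothetical and are not claimed to match the package internals.

```julia
# Illustrative only: these helper names are hypothetical, not BayesKNN.jl internals.
function encode_labels(y::AbstractVector)
    classes = sort(unique(y))
    # Position of each label in the sorted class list gives its code in 1:M.
    encoded = [searchsortedfirst(classes, v) for v in y]
    return classes, encoded
end

# Class counts among the first k neighbours of one point, for each candidate k.
# `order` lists that point's neighbour indices by increasing distance.
function cumulative_counts(order::AbstractVector{Int}, encoded::AbstractVector{Int},
                           k_values::AbstractVector{Int}, M::Int)
    counts = zeros(Int, length(k_values), M)
    for (a, k) in enumerate(k_values), j in order[1:k]
        counts[a, encoded[j]] += 1
    end
    return counts
end

classes, encoded = encode_labels(["b", "a", "b", "a"])
# Neighbours of training point 1, nearest first (made-up ordering).
counts = cumulative_counts([2, 4, 3], encoded, [1, 2, 3], length(classes))
```

Row a of `counts` corresponds to k_values[a] and column m to classes[m], mirroring one slice of the n_train × length(k_values) × M layout described above.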
BayesKNN.fit_bayesknn — Function
fit_bayesknn(
Xtrain::AbstractMatrix,
ytrain::AbstractVector;
k_values = collect(1:min(size(Xtrain, 1) - 1, 50)),
beta_prior = truncated(Normal(0.0, 5.0), 0.0, Inf),
beta_step::Float64 = 0.5,
nsamples::Int = 5_000,
discard_initial::Int = 1_000,
nchains::Int = 1,
rng = Random.default_rng(),
)
Fit the multiclass probabilistic nearest-neighbour model.
Arguments:
Xtrain::AbstractMatrix: real-valued training predictors, with observations in rows and predictors in columns. Values must be finite. Predictors are assumed to already be standardized or otherwise on an appropriate common scale.
ytrain::AbstractVector: training labels, with length equal to size(Xtrain, 1). Labels must not be missing and must be sortable because class labels are stored in sorted order.
Keyword arguments:
k_values::AbstractVector{<:Integer}: candidate neighbourhood sizes. Defaults to collect(1:min(size(Xtrain, 1) - 1, 50)). Values must satisfy 1 <= k < size(Xtrain, 1).
beta_prior: prior distribution for the nonnegative neighbour-strength parameter beta. Defaults to truncated(Normal(0.0, 5.0), 0.0, Inf). The prior must support rand, logpdf, minimum, and maximum, and its support must not include negative values.
beta_step::Float64: standard deviation of the log-scale random-walk proposal for beta. Defaults to 0.5. Increase if the acceptance rate for beta is too high; decrease if it is too low.
nsamples::Int: number of posterior samples returned per chain. Defaults to 5_000. The chain runs for discard_initial + nsamples steps in total.
discard_initial::Int: number of initial samples discarded as burn-in. Defaults to 1_000. These steps are run but not stored.
nchains::Int: number of MCMC chains. Defaults to 1. When nchains > 1, chains are sampled in parallel using available threads; if Julia was started with one thread, a warning is issued and chains run serially.
rng: random number generator. Defaults to Random.default_rng().
Returns a BayesKNNFit containing the posterior draws, encoded training labels, original class labels, candidate k values, neighbour information, the training KDTree, and convergence diagnostics.
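The roles of beta_step, nsamples, and discard_initial can be illustrated with a generic log-scale random-walk Metropolis sampler. This is a sketch of the general technique, not the package's sampler: `sample_beta` is a hypothetical function, and `log_target` is a stand-in density (a half-normal with scale 5 and no likelihood term).

```julia
using Random

# Hypothetical stand-in for the unnormalised log posterior of beta: a half-normal
# prior with scale 5 and no likelihood term.
log_target(beta) = beta >= 0 ? -0.5 * (beta / 5.0)^2 : -Inf

function sample_beta(log_target, beta0::Float64; beta_step = 0.5, nsamples = 5_000,
                     discard_initial = 1_000, rng = Random.default_rng())
    beta = beta0
    draws = Vector{Float64}(undef, nsamples)
    accepted = 0
    for t in 1:(discard_initial + nsamples)
        # Random walk on log(beta), so proposals stay strictly positive.
        proposal = beta * exp(beta_step * randn(rng))
        # log(proposal) - log(beta) is the Jacobian term for proposing on the log scale.
        log_ratio = log_target(proposal) - log_target(beta) + log(proposal) - log(beta)
        if log(rand(rng)) < log_ratio
            beta = proposal
            accepted += 1
        end
        # Burn-in steps are run but not stored.
        t > discard_initial && (draws[t - discard_initial] = beta)
    end
    return draws, accepted / (discard_initial + nsamples)
end

draws, acceptance_rate = sample_beta(log_target, 1.0; rng = MersenneTwister(1))
```

A larger beta_step makes bolder proposals and lowers the acceptance rate, which is why the step is increased when acceptance is too high and decreased when it is too low.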
BayesKNN.predict_proba — Function
predict_proba(fit::BayesKNNFit, Xtest::AbstractMatrix)
Posterior predictive class probabilities for each test point.
Arguments:
fit::BayesKNNFit: fitted model returned by fit_bayesknn.
Xtest::AbstractMatrix: real-valued test predictors, with observations in rows and predictors in columns. Values must be finite, and size(Xtest, 2) must equal size(fit.Xtrain, 2).
Returns a named tuple with:
p_mean: n_test × M matrix of posterior mean probabilities
p_lo: n_test × M matrix of 2.5% quantiles
p_hi: n_test × M matrix of 97.5% quantiles
p_draws: n_draws × n_test × M array of posterior predictive probabilities
classes: original class labels corresponding to columns of the probability matrices
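For intuition about how per-draw probabilities might be formed and summarized, the sketch below handles a single test point. It assumes the predictive form usually attributed to Holmes and Adams (2002), a softmax of beta times the fraction of the k nearest neighbours in each class; whether the package computes it exactly this way is an assumption. `class_probs` is a hypothetical helper and all numbers are made up.

```julia
using Statistics

# Assumed predictive form: class probabilities are a softmax of beta times the
# fraction of the k nearest neighbours that fall in each class.
function class_probs(counts::AbstractVector{<:Real}, beta::Real, k::Integer)
    w = exp.(beta .* counts ./ k)
    return w ./ sum(w)
end

# Toy posterior draws and neighbour class counts for ONE test point (made-up numbers).
beta_draws = [1.0, 2.0, 3.0, 2.5]
k_draws = [1, 3, 3, 5]
counts_by_k = Dict(1 => [1, 0], 3 => [2, 1], 5 => [3, 2])

# n_draws × M matrix: one row of class probabilities per posterior draw.
p_draws = reduce(vcat, [class_probs(counts_by_k[k], b, k)' for (b, k) in zip(beta_draws, k_draws)])

# Summaries across draws, one value per class.
p_mean = vec(mean(p_draws; dims = 1))
p_lo = [quantile(col, 0.025) for col in eachcol(p_draws)]
p_hi = [quantile(col, 0.975) for col in eachcol(p_draws)]
```

Because every draw is a proper probability vector, the entries of `p_mean` sum to one; the quantile bounds are computed per class and need not sum to one.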
BayesKNN.predict_class — Function
predict_class(fit::BayesKNNFit, Xtest::AbstractMatrix)
Posterior class predictions for each test point.
Arguments:
fit::BayesKNNFit: fitted model returned by fit_bayesknn.
Xtest::AbstractMatrix: real-valued test predictors, with observations in rows and predictors in columns. Values must be finite, and size(Xtest, 2) must equal size(fit.Xtrain, 2).
Returns a named tuple with:
yhat_encoded: predicted encoded labels in 1:M
yhat: predicted original labels
p_mean: posterior mean class probabilities
p_lo: lower credible limits for class probabilities
p_hi: upper credible limits for class probabilities
classes: original class labels corresponding to columns of the probability matrices
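The relationship between yhat_encoded, yhat, and classes can be sketched as follows, assuming the predicted class is the one with the highest posterior mean probability. The source does not state the decision rule, so treat that as an assumption; the probabilities are made up.

```julia
classes = ["a", "b", "c"]
# Rows are test points, columns follow the order of `classes` (made-up values).
p_mean = [0.7 0.2 0.1;
          0.1 0.3 0.6]

# Encoded prediction: index of the largest probability in each row.
yhat_encoded = [argmax(row) for row in eachrow(p_mean)]
# Map encoded predictions back to the original labels.
yhat = classes[yhat_encoded]
```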
BayesKNN.posterior_k_pmf — Function
posterior_k_pmf(fit)
Posterior PMF over candidate neighbourhood sizes.
Returns a named tuple with:
k: candidate neighbourhood sizes
posterior_prob: posterior probabilities corresponding to k
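A posterior PMF of this kind can be estimated as the relative frequency of each candidate index among the posterior draws. The sketch below assumes draws are stored as indices into k_values, as in the chain's k_idx field; the draws themselves are made up.

```julia
k_values = [1, 3, 5]
# Made-up posterior draws, stored as indices into k_values.
k_idx = [1, 2, 2, 3, 2, 2, 1, 2]

# Relative frequency of each candidate among the draws.
counts = [count(==(i), k_idx) for i in eachindex(k_values)]
posterior_prob = counts ./ length(k_idx)
```

Candidates that never appear in the draws receive probability zero, so the result always has one entry per element of k_values.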
BayesKNN.posterior_beta_summary — Function
posterior_beta_summary(fit)
Posterior summary for the neighbour-strength parameter beta.
BayesKNN.anomaly_score — Function
anomaly_score(
fit::BayesKNNFit,
Xtest::AbstractMatrix;
threshold = nothing,
)
Distance-based anomaly scores for test points using the posterior over k.
For each posterior draw of k, the anomaly score is the distance from the test point to its kth nearest training neighbour.
Arguments:
fit::BayesKNNFit: fitted model returned by fit_bayesknn.
Xtest::AbstractMatrix: real-valued test predictors, with observations in rows and predictors in columns. Values must be finite, and size(Xtest, 2) must equal size(fit.Xtrain, 2).
threshold: optional distance threshold; if provided, the function also returns the posterior probability that the kth neighbour distance exceeds this threshold.
Returns a named tuple with:
mean_distance
median_distance
q025_distance
q975_distance
p_gt_threshold
distance_draws
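The scoring idea can be sketched by brute force for one test point: compute the distance to the kth nearest training row for each posterior draw of k, then summarize across draws. The sketch assumes Euclidean distance and uses a hypothetical helper `kth_neighbour_distance`; the package itself queries a KDTree, and the draws here are made up.

```julia
using LinearAlgebra, Statistics

# Brute-force distance from x to its k-th nearest training row (Euclidean assumed).
function kth_neighbour_distance(Xtrain::AbstractMatrix, x::AbstractVector, k::Integer)
    d = [norm(row .- x) for row in eachrow(Xtrain)]
    return partialsort(d, k)
end

Xtrain = reshape([0.0, 1.0, 3.0, 6.0], :, 1)   # four observations, one predictor
x = [0.5]
k_draws = [1, 2, 2, 3]                          # made-up posterior draws of k
threshold = 1.0

# One score per posterior draw, then summaries across draws.
distance_draws = [kth_neighbour_distance(Xtrain, x, k) for k in k_draws]
mean_distance = mean(distance_draws)
p_gt_threshold = mean(distance_draws .> threshold)
```

Larger draws of k reach further into the training set, so a point scores as anomalous only when the posterior puts weight on neighbourhood sizes whose kth neighbour is far away.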
Citation
Holmes, C. C. and Adams, N. M. (2002). A probabilistic nearest neighbour method for statistical pattern recognition. Journal of the Royal Statistical Society: Series B, 64(2), 295-306.