BayesKNN.jl

BayesKNN.jl implements the probabilistic nearest-neighbour classifier of Holmes and Adams (2002). It samples the posterior distribution of the neighbourhood size k and neighbour-strength parameter beta, then averages over posterior draws to produce class probabilities for new observations.
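To make the roles of beta and k concrete, the sketch below computes class probabilities for a single point from the class counts among its k nearest neighbours, using the softmax form in which the Holmes and Adams (2002) model is usually written. The helper name class_probs is hypothetical, and whether BayesKNN.jl evaluates exactly this expression internally is an assumption.

```julia
# Hypothetical helper, not part of BayesKNN.jl.
# counts[m] = number of the k nearest neighbours that belong to class m.
function class_probs(counts::AbstractVector{<:Integer}, beta::Real, k::Integer)
    scores = (beta / k) .* counts
    w = exp.(scores .- maximum(scores))   # subtract the maximum for numerical stability
    return w ./ sum(w)
end

p_weak = class_probs([3, 1], 0.0, 4)     # beta = 0 ignores the neighbours entirely
p_strong = class_probs([3, 1], 4.0, 4)   # larger beta favours the majority class
```

With beta = 0 every class is equally likely; as beta grows, the probability concentrates on the class that dominates the neighbourhood.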

Quick Start

Rows are observations and columns are predictors. Put predictors on a common scale before fitting.

using BayesKNN, Random

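# Four observations of one predictor; the trailing `;;` makes a 4×1 matrix (Julia 1.7+).
X = [0.0; 0.2; 3.8; 4.0;;]
y = ["a", "a", "b", "b"]

# Sample the posterior over k and beta with a seeded RNG for reproducibility.
fit = fit_bayesknn(X, y; k_values = [1, 2], nsamples = 1_000, rng = MersenneTwister(1))
# Class probabilities for two new observations (a 2×1 matrix).
pred = predict_proba(fit, [0.1; 3.9;;])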
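The Quick Start assumes the predictors already share a common scale. One common way to get there is to z-score each column using statistics from the training data only. The sketch below uses a hypothetical helper, standardize, that is not part of BayesKNN.jl.

```julia
using Statistics

# Hypothetical helper: z-score each column using training statistics only.
function standardize(Xtrain::AbstractMatrix, Xtest::AbstractMatrix)
    mu = mean(Xtrain; dims = 1)
    sigma = std(Xtrain; dims = 1)
    sigma = map(s -> s > 0 ? s : one(s), sigma)   # leave constant columns unscaled
    return (Xtrain .- mu) ./ sigma, (Xtest .- mu) ./ sigma
end

Xtrain = [1.0 100.0; 2.0 300.0; 3.0 500.0]
Xtest = [2.0 300.0]
Ztrain, Ztest = standardize(Xtrain, Xtest)
```

Applying the training means and standard deviations to the test matrix keeps both sets in the same coordinate system, which matters for any distance-based method.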

Input Convention

Xtrain and Xtest must be real-valued matrices with observations in rows and predictors in columns. Labels may be strings, numbers, or other sortable values. Missing labels and non-finite predictor values are rejected.
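If you want to screen data before calling the package, the documented rules can be checked in a few lines. This is a minimal sketch with a hypothetical helper, inputs_ok; the package's own validation and error messages may differ.

```julia
# Hypothetical pre-flight check mirroring the documented input rules.
function inputs_ok(X::AbstractMatrix, y::AbstractVector)
    return size(X, 1) == length(y) &&
           all(x -> x isa Real && isfinite(x), X) &&
           !any(ismissing, y)
end

ok = inputs_ok([0.0 1.0; 2.0 3.0], ["a", "b"])
bad_value = inputs_ok([0.0 NaN; 2.0 3.0], ["a", "b"])
bad_label = inputs_ok([0.0 1.0; 2.0 3.0], ["a", missing])
```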

Beta Prior

The neighbour-strength parameter beta is constrained to be nonnegative. By default, fit_bayesknn uses truncated(Normal(0.0, 5.0), 0.0, Inf). You can pass a different prior with nonnegative support:

using Distributions

fit = fit_bayesknn(X, y; beta_prior = Gamma(2.0, 1.0))
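For intuition about the default, a Normal(0.0, 5.0) truncated to [0, Inf) is a half-normal distribution with scale 5. The sketch below writes its log-density with the standard library only. The helper halfnormal_logpdf is hypothetical and exists purely for illustration; the package itself takes a prior object.

```julia
# Log-density of a Normal(0, sigma) truncated to [0, Inf), i.e. a half-normal.
# Hypothetical helper for illustration only.
function halfnormal_logpdf(beta::Real; sigma::Real = 5.0)
    beta < 0 && return -Inf               # no mass on negative values
    return log(2) - log(sigma) - 0.5 * log(2π) - beta^2 / (2 * sigma^2)
end

lp_zero = halfnormal_logpdf(0.0)
lp_negative = halfnormal_logpdf(-1.0)
```

The factor of 2 (the log(2) term) is the renormalisation that results from discarding the negative half of the normal density.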

API Reference

BayesKNN.BayesKNNFit - Type
BayesKNNFit

Container for a fitted multiclass probabilistic nearest-neighbour model.

Fields:

  • chain: named tuple (beta::Vector{Float64}, k_idx::Vector{Int}) of posterior draws.
  • Xtrain: training predictor matrix, observations in rows.
  • ytrain: encoded training labels as integers 1, 2, ..., M.
  • classes: sorted original class labels; column m of any probability matrix corresponds to classes[m].
  • k_values: candidate neighbourhood sizes used during fitting.
  • train_order: n_train × (n_train - 1) matrix of neighbour indices for each training point, sorted by increasing distance.
  • cumcounts_train: n_train × length(k_values) × M array of cumulative class counts among neighbours for each training point.
  • tree: KDTree built from the training data, reused for test-set queries.
  • diagnostics: named tuple of convergence diagnostics. Fits returned by fit_bayesknn always contain ess_beta and ess_k (effective sample sizes for beta and k). When nchains > 1, the tuple also contains rhat_beta and rhat_k (split-R-hat); values near 1.0 indicate good mixing, and values above 1.1 suggest the chains have not converged. Fits built by test helpers such as fake_fit may store nothing here instead.
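The shapes of train_order and cumcounts_train are easier to follow with a small example. The brute-force sketch below builds arrays with the documented dimensions. All names are hypothetical, squared Euclidean distance is an assumption, and the package may construct these arrays differently.

```julia
# Brute-force sketch of the neighbour bookkeeping described in the field list.
# Hypothetical helper; y holds encoded labels in 1:M.
function neighbour_tables(X::AbstractMatrix, y::Vector{Int}, k_values::Vector{Int}, M::Int)
    n = size(X, 1)
    train_order = Matrix{Int}(undef, n, n - 1)
    cumcounts = zeros(Int, n, length(k_values), M)
    for i in 1:n
        d = [sum(abs2, X[i, :] .- X[j, :]) for j in 1:n]
        others = [j for j in sortperm(d) if j != i]   # exclude the point itself
        train_order[i, :] = others
        # Count how many of the k nearest neighbours fall in each class.
        for (c, k) in enumerate(k_values), j in others[1:k]
            cumcounts[i, c, y[j]] += 1
        end
    end
    return train_order, cumcounts
end

X = [0.0; 0.2; 3.8; 4.0;;]
y = [1, 1, 2, 2]
train_order, cumcounts = neighbour_tables(X, y, [1, 2], 2)
```

Here cumcounts[i, c, m] is the number of class-m points among the k_values[c] nearest neighbours of training point i, so each slice cumcounts[i, c, :] sums to k_values[c].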

BayesKNN.fit_bayesknn - Function
fit_bayesknn(
    Xtrain::AbstractMatrix,
    ytrain::AbstractVector;
    k_values = collect(1:min(size(Xtrain, 1) - 1, 50)),
    beta_prior = truncated(Normal(0.0, 5.0), 0.0, Inf),
    beta_step::Float64 = 0.5,
    nsamples::Int = 5_000,
    discard_initial::Int = 1_000,
    nchains::Int = 1,
    rng = Random.default_rng(),
)

Fit the multiclass probabilistic nearest-neighbour model.

Arguments:

  • Xtrain::AbstractMatrix: real-valued training predictors, with observations in rows and predictors in columns. Values must be finite. Predictors are assumed to already be standardized or otherwise on an appropriate common scale.
  • ytrain::AbstractVector: training labels, with length equal to size(Xtrain, 1). Labels must not be missing and must be sortable because class labels are stored in sorted order.

Keyword arguments:

  • k_values::AbstractVector{<:Integer}: candidate neighbourhood sizes. Defaults to collect(1:min(size(Xtrain, 1) - 1, 50)). Values must satisfy 1 <= k < size(Xtrain, 1).
  • beta_prior: prior distribution for the nonnegative neighbour-strength parameter beta. Defaults to truncated(Normal(0.0, 5.0), 0.0, Inf). The prior must support rand, logpdf, minimum, and maximum, and its support must not include negative values.
  • beta_step::Float64: standard deviation of the log-scale random-walk proposal for beta. Defaults to 0.5. Increase if the acceptance rate for beta is too high; decrease if it is too low.
  • nsamples::Int: number of posterior samples returned per chain. Defaults to 5_000. The chain runs for discard_initial + nsamples steps in total.
  • discard_initial::Int: number of initial samples discarded as burn-in. Defaults to 1_000. These steps are run but not stored.
  • nchains::Int: number of MCMC chains. Defaults to 1. When nchains > 1, chains are sampled in parallel using available threads; if Julia was started with one thread, a warning is issued and chains run serially.
  • rng: random number generator. Defaults to Random.default_rng().
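To show what beta_step controls, the sketch below implements a generic Metropolis-Hastings update with a random walk on log(beta), applied to a toy target. The function names and the Exponential(1) target are assumptions made for illustration; this is not the package's sampler.

```julia
using Random

# One Metropolis-Hastings update for a positive parameter, random walk on log(beta).
# `logtarget` is any unnormalised log-density on beta > 0 (a stand-in for the posterior).
function log_rw_step(rng::AbstractRNG, beta::Float64, logtarget, step::Float64)
    proposal = exp(log(beta) + step * randn(rng))
    # log(proposal) - log(beta) is the Jacobian correction for proposing on the log scale.
    log_ratio = logtarget(proposal) - logtarget(beta) + log(proposal) - log(beta)
    accepted = log(rand(rng)) < log_ratio
    return (accepted ? proposal : beta), accepted
end

# Toy target: the Exponential(1) density, whose mean is 1.
function run_chain(rng, nsteps; step = 0.5)
    beta, n_accept = 1.0, 0
    draws = Vector{Float64}(undef, nsteps)
    for t in 1:nsteps
        beta, accepted = log_rw_step(rng, beta, b -> -b, step)
        n_accept += accepted
        draws[t] = beta
    end
    return draws, n_accept / nsteps
end

draws, acceptance_rate = run_chain(MersenneTwister(1), 20_000)
```

Running run_chain with a larger step makes proposals more ambitious and lowers the acceptance rate, which is the trade-off behind the tuning advice for beta_step.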

Returns a BayesKNNFit containing the posterior draws, encoded training labels, original class labels, candidate k values, neighbour information, and training KDTree.
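When nchains > 1, the fit's diagnostics include split-R-hat values. As a rough guide to what that statistic measures, the sketch below computes the classic split-R-hat (without rank normalisation) for draws stored as one vector per chain. The helper split_rhat is hypothetical, and the package may compute its diagnostic differently.

```julia
using Statistics

# Classic split-R-hat for draws stored as one vector per chain.
# Hypothetical helper for illustration only.
function split_rhat(chains::Vector{Vector{Float64}})
    half = minimum(length, chains) ÷ 2
    # Split every chain into two halves so that within-chain drift is also detected.
    parts = vcat([c[1:half] for c in chains], [c[(end - half + 1):end] for c in chains])
    W = mean(var.(parts))                  # mean within-chain variance
    B = half * var(mean.(parts))           # between-chain variance
    var_plus = (half - 1) / half * W + B / half
    return sqrt(var_plus / W)
end

# Two chains from the same distribution, then two chains centred five units apart.
mixed = split_rhat([randn(1_000), randn(1_000)])
stuck = split_rhat([randn(1_000), randn(1_000) .+ 5.0])
```

Chains that explore the same distribution give a value close to 1, while chains stuck in different regions push the between-chain variance, and hence the statistic, well above 1.1.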

BayesKNN.predict_proba - Function
predict_proba(fit::BayesKNNFit, Xtest::AbstractMatrix)

Posterior predictive class probabilities for each test point.

Arguments:

  • fit::BayesKNNFit: fitted model returned by fit_bayesknn.
  • Xtest::AbstractMatrix: real-valued test predictors, with observations in rows and predictors in columns. Values must be finite, and size(Xtest, 2) must equal size(fit.Xtrain, 2).

Returns a named tuple with:

  • p_mean: n_test × M matrix of posterior mean probabilities
  • p_lo: n_test × M matrix of 2.5% quantiles
  • p_hi: n_test × M matrix of 97.5% quantiles
  • p_draws: n_draws × n_test × M array of posterior predictive probabilities
  • classes: original class labels corresponding to columns of the probability matrices
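The summary matrices relate to p_draws in a simple way: each is a reduction along the draw dimension. The sketch below reproduces that relationship on synthetic draws. The helper summarize_draws and the input values are made up for illustration and are not output from a real fit.

```julia
using Statistics

# Summarise an n_draws × n_test × M array of probabilities along the draw dimension.
function summarize_draws(p_draws::AbstractArray{<:Real,3})
    _, n_test, M = size(p_draws)
    p_mean = dropdims(mean(p_draws; dims = 1); dims = 1)
    p_lo = [quantile(view(p_draws, :, i, m), 0.025) for i in 1:n_test, m in 1:M]
    p_hi = [quantile(view(p_draws, :, i, m), 0.975) for i in 1:n_test, m in 1:M]
    return (; p_mean, p_lo, p_hi)
end

# Synthetic draws for 2 test points and 2 classes.
a = range(0.6, 0.8; length = 101)
p_draws = zeros(101, 2, 2)
p_draws[:, 1, 1] .= a
p_draws[:, 1, 2] .= 1 .- a
p_draws[:, 2, 1] .= 0.5
p_draws[:, 2, 2] .= 0.5
out = summarize_draws(p_draws)
```

Rows of p_mean sum to 1 because every individual draw does, whereas p_lo and p_hi are computed per class and need not sum to 1.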

BayesKNN.predict_class - Function
predict_class(fit::BayesKNNFit, Xtest::AbstractMatrix)

Posterior class predictions for each test point.

Arguments:

  • fit::BayesKNNFit: fitted model returned by fit_bayesknn.
  • Xtest::AbstractMatrix: real-valued test predictors, with observations in rows and predictors in columns. Values must be finite, and size(Xtest, 2) must equal size(fit.Xtrain, 2).

Returns a named tuple with:

  • yhat_encoded: predicted encoded labels in 1:M
  • yhat: predicted original labels
  • p_mean: posterior mean class probabilities
  • p_lo: lower credible limits for class probabilities
  • p_hi: upper credible limits for class probabilities
  • classes: original class labels corresponding to columns of the probability matrices
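The returned fields suggest how encoded and original labels relate. The sketch below assumes the predicted class is the one with the largest posterior mean probability, a rule this reference does not state explicitly. The helper argmax_labels is hypothetical.

```julia
# Pick the class with the largest posterior mean probability in each row.
# Assumption: predict_class applies an argmax rule like this one.
function argmax_labels(p_mean::AbstractMatrix{<:Real}, classes::AbstractVector)
    yhat_encoded = [argmax(view(p_mean, i, :)) for i in axes(p_mean, 1)]
    return (; yhat_encoded, yhat = classes[yhat_encoded])
end

p_mean = [0.9 0.1; 0.3 0.7]
result = argmax_labels(p_mean, ["a", "b"])
```

Indexing classes with the encoded predictions recovers the original labels, which is why column m of every probability matrix is tied to classes[m].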

BayesKNN.posterior_k_pmf - Function
posterior_k_pmf(fit)

Posterior PMF over candidate neighbourhood sizes.

Returns a named tuple with:

  • k: candidate neighbourhood sizes
  • posterior_prob: posterior probabilities corresponding to k
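A posterior PMF of this kind can be estimated as the relative frequency of each candidate among the posterior draws. The sketch below assumes that chain.k_idx stores indices into k_values; the helper k_pmf and the example draws are hypothetical.

```julia
# Relative frequency of each candidate k among the posterior draws.
# Assumes k_idx[t] is the index into k_values visited at draw t.
function k_pmf(k_idx::Vector{Int}, k_values::Vector{Int})
    counts = zeros(Int, length(k_values))
    for idx in k_idx
        counts[idx] += 1
    end
    return (; k = k_values, posterior_prob = counts ./ length(k_idx))
end

pmf = k_pmf([1, 2, 2, 2, 1, 2, 2, 2], [1, 2])
```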

BayesKNN.anomaly_score - Function
anomaly_score(
    fit::BayesKNNFit,
    Xtest::AbstractMatrix;
    threshold = nothing,
)

Distance-based anomaly scores for test points using the posterior over k.

For each posterior draw of k, the anomaly score is the distance from the test point to its kth nearest training neighbour.

Arguments:

  • fit::BayesKNNFit: fitted model returned by fit_bayesknn.
  • Xtest::AbstractMatrix: real-valued test predictors, with observations in rows and predictors in columns. Values must be finite, and size(Xtest, 2) must equal size(fit.Xtrain, 2).
  • threshold: optional distance threshold. If provided, the function also returns the posterior probability that the kth neighbour distance exceeds this threshold.

Returns a named tuple with:

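  • mean_distance: posterior mean of the kth-neighbour distance for each test point
  • median_distance: posterior median of the kth-neighbour distance
  • q025_distance: 2.5% posterior quantile of the kth-neighbour distance
  • q975_distance: 97.5% posterior quantile of the kth-neighbour distance
  • p_gt_threshold: posterior probability that the kth-neighbour distance exceeds threshold
  • distance_draws: kth-neighbour distances for the individual posterior draws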
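The scoring rule described above can be sketched for a single test point with brute-force distances. Euclidean distance, the helper names, and the example draws of k are all assumptions; the package queries its stored KDTree rather than looping over the training rows.

```julia
using Statistics

# Distance from `x` to its k-th nearest training row, by brute force.
function kth_distance(Xtrain::AbstractMatrix, x::AbstractVector, k::Integer)
    d = [sqrt(sum(abs2, view(Xtrain, i, :) .- x)) for i in axes(Xtrain, 1)]
    return partialsort(d, k)
end

# One distance per posterior draw of k, then posterior summaries.
function anomaly_summary(Xtrain, x, k_draws::Vector{Int}; threshold = nothing)
    distance_draws = [kth_distance(Xtrain, x, k) for k in k_draws]
    p_gt = threshold === nothing ? nothing : mean(distance_draws .> threshold)
    return (; mean_distance = mean(distance_draws),
            median_distance = median(distance_draws),
            p_gt_threshold = p_gt, distance_draws)
end

Xtrain = [0.0; 0.2; 3.8; 4.0;;]
k_draws = [1, 1, 2, 2]                      # made-up posterior draws of k
score = anomaly_summary(Xtrain, [10.0], k_draws; threshold = 6.1)
```

Because larger k reaches further into the training set, draws with larger k give larger distances, so the spread of distance_draws reflects posterior uncertainty about the neighbourhood size.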

Citation

Holmes, C. C. and Adams, N. M. (2002). A probabilistic nearest neighbour method for statistical pattern recognition. Journal of the Royal Statistical Society: Series B, 64(2), 295-306.