Penalizing complexity priors for Bayesian inference of circular models

Abstract

Advancements in computational power and methodologies have enabled research on massive datasets. However, tools for analyzing data with directional or periodic characteristics, such as wind directions and customers’ arrival times on a 24-hour clock, remain underdeveloped. While statisticians have proposed circular distributions for such analyses, significant challenges persist in constructing circular statistical models, particularly in the context of Bayesian methods. These challenges stem from limited theoretical development and a lack of historical studies on prior selection for circular distribution parameters.

In this article, we propose a framework for selecting hyperpriors that contract to a simpler model in circular scenarios, especially when there is insufficient information to guide prior selection. We introduce well-examined Penalized Complexity (PC) priors for the most widely used circular distributions. Comprehensive comparisons with existing hyperpriors in the literature are conducted through simulation studies and a practical case study. Finally, we discuss the contributions and implications of our work, providing a foundation for further advancements in constructing Bayesian circular statistical models.

Keywords

Bayesian analysis; circular distribution; concentration parameter; directional statistics; Penalized Complexity prior

1 Introduction

Advancements in computational power have provided researchers with access to massive and diverse datasets across various fields, enabling the exploration of complex scientific and practical challenges. However, tools for analyzing data with complex structures, such as directional or periodic characteristics, remain underdeveloped, and many of these datasets require specialized methods for analysis. For instance, wind directions and bird migration paths are naturally represented on a compass, while hospital patient arrival times and passenger density fluctuations follow a 24-hour clock (Mardia and Jupp, 2009). The orientation of earthquake epicentres, directions of cosmic rays and stellar objects in astronomy (Cabella and Marinucci, 2009; Ley and Verdebout, 2017; Pewsey and García-Portugués, 2021), joint angles prone to injury, protein structure with angular measures in bioinformatics (Boomsma et al., 2008; Mardia et al., 2018), typhoon trajectories and the analysis of angular components in multivariate extreme value statistics are additional examples of data that align with circular scales. These examples underscore the growing need for statistical frameworks capable of handling circular data.

Directional statistics offers appropriate tools for analyzing such data, with circular distributions—probability distributions defined on the circumference of a circle (Jammalamadaka, 2001) and characterized by angular measures or radians—playing a central role. These methods are crucial for modelling data in fields such as meteorology, earth sciences, bioinformatics, ecology, medicine (Pardo et al., 2016; Vuollo et al., 2016), genetics, neurology, astronomy (Cabella and Marinucci, 2009; Marinucci and Peccati, 2011), image analysis (Jung et al., 2011; Esteves et al., 2018), text mining (Dhillon and Modha, 2001; Banerjee et al., 2005), machine learning (Sra, 2016) and beyond (Ley and Verdebout, 2017; Pewsey and García-Portugués, 2021).

Applying Bayesian methods to circular data, however, poses significant challenges, particularly in the selection of priors, a pivotal step in Bayesian analysis. Unlike Euclidean distributions, circular distributions have unique parameterizations and behaviour, often requiring specialized approaches. For example, in Wallace and Dowe (1993), priors for the concentration parameter κ of the von Mises (vM) distribution are constructed through the techniques of Minimum Message Length (MML) (Section 2.1). These priors are also used in Dowe et al. (1996) and Marrelec and Giron (2024). Another popular prior for the vM distribution is the joint conjugate prior proposed in Damien and Walker (1999). In addition, in the general procedure for Bayesian analysis with wrapped distributions proposed by Ravindran and Ghosh (2011), the Beta (a,a) prior is employed. Nuñez-Antonio et al. (2011) proposed to fit the Bayesian circular model through the projected normal distribution with normal priors. A prior for the location parameter is often more intuitive to formulate, while priors for hyperparameters such as dispersion parameters are notoriously hard to conceptualize. Moreover, when specific values of the hyperparameters result in model complexity reduction, care should be taken in the prior construction.

Since circular variables are defined on a compact and curved space rather than in linear Euclidean space, their geometric properties make it difficult to develop strong intuition about their behaviour. This further complicates prior selection, increasing the risk of using priors that either dominate the posterior or lead to poorly performing models. When a model can be viewed as a complex model containing a simpler counterpart, a prior allocating insufficient mass to the simpler model can result in inferring a complex model that is not supported by the data. A detailed discussion of the simpler model is presented in Section 3.1.

To address these challenges, we set two goals for this research. First, we establish a prior selection framework for the hyperparameters of a general circular model, inspired by the Penalized Complexity (PC) prior framework proposed by Simpson et al. (2017). The PC prior framework is a reliable choice for constructing default priors, as it balances model complexity and prior informativeness, particularly in scenarios with limited prior knowledge, where objective or uninformative priors are often considered. Second, we derive the explicit expressions for the PC priors for the most commonly used circular distributions’ hyperparameters and provide a way to quantify prior information through a user-defined parameter.

The structure of this article is as follows: Section 2 reviews the most commonly used circular distributions, their properties and existing priors in the literature. Section 3 introduces the proposed framework for prior selection, the procedure for deriving PC priors for circular distributions and the formulations for widely used circular distributions. Section 4 evaluates the proposed priors through comparison studies and simulation studies, while Section 5 demonstrates their application to real-world datasets. Finally, Section 6 discusses broader implications and potential directions for future research.

2 Preliminaries

A proper circular distribution should have a probability density function $p (x ∣ ξ)$ that satisfies $p (x ∣ ξ) = p (x + 2 π k ∣ ξ)$ for any integer $k$ and $x \in [0, 2 π)$ , where $ξ$ denotes the parameters of the probability density function. Most circular distributions are constructed by using one of the following four general approaches: wrapping, conditioning, projection and perturbation (Ley and Verdebout, 2017).

Circular distributions in the wrapped family are constructed by taking a distribution defined on ℝ and wrapping it around a circle, that is, reducing it modulo 2π (e.g., wrapped normal distribution, wrapped Cauchy (WC) distribution, wrapped double exponential distribution). The conditioning approach obtains circular distributions by constructing a joint distribution of polar coordinates (radius r and angle θ) and finding the conditional distribution $p (θ ∣ r)$ by restricting the radius to r = 1 (e.g., vM distribution). The projection approach projects a distribution on $ℝ^{2}$ onto the unit circle (e.g., projected normal distribution). The perturbation approach provides flexible choices to extend a circular density to a more general form by multiplying it with a proper function (e.g., cardioid distribution). Details can be found in Chapter 2.2 of Ley and Verdebout (2017).
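As a minimal sketch of the wrapping construction, the following Python snippet draws from a Cauchy distribution on ℝ and reduces each draw modulo 2π. The function name `sample_wrapped_cauchy` and its `seed` argument are illustrative assumptions, and the relation $ρ = \exp (- γ)$ between the Cauchy scale γ and the WC concentration ρ is the standard one for the wrapped Cauchy distribution rather than something stated in the text above.

```python
import math
import random

def sample_wrapped_cauchy(mu, rho, n, seed=0):
    """Draw n angles on the circle by wrapping Cauchy(mu, gamma) draws (0 < rho < 1)."""
    rng = random.Random(seed)
    gamma = -math.log(rho)  # Cauchy scale that corresponds to concentration rho
    draws = []
    for _ in range(n):
        u = rng.random()
        # Inverse-CDF draw from the Cauchy distribution on the real line.
        linear = mu + gamma * math.tan(math.pi * (u - 0.5))
        # Wrapping step: reduce the linear draw modulo 2*pi.
        draws.append(linear % (2.0 * math.pi))
    return draws

angles = sample_wrapped_cauchy(mu=1.0, rho=0.6, n=20000, seed=1)
```

For a large sample, the mean resultant length of `angles` should be close to the chosen ρ, which gives a quick sanity check on the construction.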

Notably, many of these distributions include the circular uniform distribution as a special case (Figure 1), defined by a probability density function given by $p_{U} (x) = \frac{1}{2 π}, x \in [0, 2 π)$ . Three of the most widely used circular distributions—the vM distribution, the cardioid distribution and the WC distribution—serve as the foundation for many generalizations and extensions in circular statistics (Ley and Verdebout, 2017), such as the Jones-Pewsey distribution (Jones and Pewsey, 2005) and the Kato-Jones distribution (Kato and Jones, 2010), underscoring their importance in the field. One commonality between these three distributions is that they all include two parameters: one location parameter and one concentration parameter. The concentration parameter is essentially a scaling parameter for circular distributions. When data is distributed on a unit circle (or sphere/hypersphere), the concept of the standard deviation, as defined for linear data, becomes meaningless, since its interpretation is no longer clear, particularly when the standard deviation exceeds half a circle (π). To describe the spread or sparsity of circular data, most circular distributions introduce a ‘concentration’ parameter, which quantifies the extent of concentration (or dispersion) within the data, indicating how densely the data are clustered around a mean direction (or how widely they are spread). In this work, we focus on the vM distribution, the cardioid distribution, and the WC distribution and on their Bayesian inference.

Figure 1

Relationship between popular circular distributions. $κ, l$ and ρ are the concentration parameters for von Mises, cardioid and wrapped Cauchy distributions.

2.1 Von Mises distribution

The vM distribution (Mardia and Jupp, 2009), often referred to as the circular normal distribution (Gumbel et al., 1953), is the circular analogue of the normal distribution. It is one of the most widely used and versatile circular distributions (Mardia and Jupp, 2009). Its probability density function is given by:

\begin{matrix} p_{V M} (x ∣ μ, κ) = \frac{1}{2 π I_{0} (κ)} \exp \{κ \cos (x - μ)\}, \quad μ \in [0, 2 π), \quad κ \in [0, \infty), \end{matrix}

(2.1)

where μ is the location parameter, $I_{a} (\cdot)$ is the modified Bessel function of the first kind of order $a \in ℕ$ and κ is the concentration parameter. The plot for this density is given in Figure 2.
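The density in Equation 2.1 can be evaluated with the Python standard library alone, as in the sketch below. The helper names `bessel_i`, `vm_pdf` and `integrate` are hypothetical, and the Bessel function is computed from its power series purely for self-containment; in practice a library routine such as `scipy.special.iv` would usually be preferred.

```python
import math

def bessel_i(order, x, tol=1e-16):
    """Modified Bessel function I_order(x) for x >= 0, via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total = term
    k = 0
    while term > tol * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * (k + order))
        total += term
    return total

def vm_pdf(x, mu, kappa):
    """von Mises density of Equation 2.1."""
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i(0, kappa))

def integrate(f, a, b, n=2000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# The density integrates to one over a full period and is 2*pi-periodic.
total_mass = integrate(lambda x: vm_pdf(x, math.pi, 2.0), 0.0, 2.0 * math.pi)
# At kappa = 0 the density reduces to the circular uniform value 1 / (2*pi).
uniform_value = vm_pdf(0.7, math.pi, 0.0)
```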

Figure 2

vM density for small (left) and large (right) κ values with μ = π.

The vM distribution includes the circular uniform distribution as a special case when $κ = 0$ , while $κ \to \infty$ indicates that the data are highly concentrated around the mean direction μ. Thus, κ can be interpreted as analogous to the precision of the normal distribution $(1 / σ^{2})$ . Since the Gamma distribution is a common prior choice for the precision in normal distributions, the Gamma ( $a, b$ ) distribution with density given in Equation 2.2 is naively considered as a prior for κ.

\begin{matrix} p (x | a, b) = \frac{b^{a}}{Γ (a)} x^{a - 1} exp \{- b x\} \end{matrix}

(2.2)

Guttorp and Lockhart (1988) proposed a joint conjugate prior for μ and κ, expressed as:

\begin{matrix} p (μ, κ) \propto {[I_{0} (κ)]}^{- c} \exp \{κ R_{0} \cos (μ - μ_{0})\} \end{matrix}

(2.3)

where $c$, $R_{0}$ and $μ_{0}$ are prior hyperparameters. When $c \in ℕ$, $c$ can be interpreted as representing c prior observations concentrated around the direction $μ_{0}$ , with $R_{0}$ corresponding to the component of the resultant vector in the known direction (Damien and Walker, 1999).

Despite its intuitive interpretation, deriving the posterior distribution under this prior requires introducing additional hyperparameters and several latent variables during the sampling process (see Damien and Walker 1999 for details). This construction, however, increases computational complexity and often results in greater Monte Carlo variability of the posterior estimates, owing to the mixing and dependence among latent components. In some cases, particularly when c and R₀ are not well chosen or when their combinations convey weak prior information, it may also lead to wider posterior credible intervals.

Dowe et al. (1996) and Marrelec and Giron (2024) both considered two different priors for κ originally proposed by Wallace and Dowe (1993) derived from MML. These priors are defined as:

\begin{matrix} h_{2} (κ) = \frac{2}{π (1 + κ^{2})} \quad \text{and} \quad h_{3} (κ) = \frac{κ}{{(1 + κ^{2})}^{3 / 2}} . \end{matrix}

(2.4)

MML is a Bayesian information-theoretic restatement of Occam's razor, favoring models that describe the data most efficiently. It encodes both the model and the data as one message, where a more complex model requires a longer description and is only preferred if it leads to a shorter overall message. In this sense, priors derived from MML, such as $h_{2} (κ)$ and $h_{3} (κ)$, naturally penalize overly complex or highly concentrated models unless the data provide strong evidence to support them.
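For illustration, the two priors in Equation 2.4 can be coded directly. The CDFs and quantile functions below are not given in the text; they are obtained here by integrating $h_{2}$ and $h_{3}$ over $[0, κ]$ and inverting, and all function names are hypothetical.

```python
import math

def h2(kappa):
    """MML prior h_2 of Equation 2.4."""
    return 2.0 / (math.pi * (1.0 + kappa ** 2))

def h3(kappa):
    """MML prior h_3 of Equation 2.4."""
    return kappa / (1.0 + kappa ** 2) ** 1.5

def h2_cdf(kappa):
    """Integral of h2 over [0, kappa] (derived here, not from the text)."""
    return 2.0 / math.pi * math.atan(kappa)

def h3_cdf(kappa):
    """Integral of h3 over [0, kappa] (derived here, not from the text)."""
    return 1.0 - 1.0 / math.sqrt(1.0 + kappa ** 2)

def h2_quantile(u):
    """Inverse of h2_cdf for 0 <= u < 1, usable for inverse-CDF sampling."""
    return math.tan(math.pi * u / 2.0)

def h3_quantile(u):
    """Inverse of h3_cdf for 0 <= u < 1, usable for inverse-CDF sampling."""
    return math.sqrt(1.0 / (1.0 - u) ** 2 - 1.0)
```

Both CDFs tend to one as κ grows, so both priors are proper on $[0, \infty)$; $h_{2}$ places its mode at $κ = 0$, whereas $h_{3}$ vanishes there.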

2.2 Cardioid distribution

The cardioid distribution is a cardioid perturbation of the circular uniform distribution (Ley and Verdebout, 2017) with probability density function given by

\begin{matrix} p_{C} (x | μ, l) = \frac{1}{2 π} (1 + 2 l \cos (x - μ)), \quad μ \in [0, 2 π), \quad l \in [0, 1 / 2) . \end{matrix}

(2.5)

Here, $l$ is the concentration parameter which controls the deviation from the circular uniform distribution. Larger values of $l$ result in stronger departures from uniformity (see the left-hand-side plot in Figure 3). It is obvious from the plot that the cardioid density is much less concentrated compared with the vM density, even when the concentration parameter $l$ for the former density approaches its supremum. This property makes the cardioid distribution more appropriate for dispersed data (heavy-tailed data) than the vM distribution. Thus, the cardioid distribution is primarily used as a small concentration approximation to the vM distribution, as noted by Mardia and Jupp (2009). Two reasonable prior choices for $l$ could be the Uniform (0,0.5) prior and the 0.5 × Beta (a,b) prior, since both priors fit the requirement of support being (0,0.5).

Figure 3

Cardioid density (left) and wrapped Cauchy density (right) with $μ = π$ for different concentration parameter values.

2.3 Wrapped Cauchy distribution

The WC distribution is one of the few wrapped distributions whose probability density can be expressed in closed form (Ley and Verdebout, 2017). Further details of the WC distribution are discussed in Chapter 3.5.7 of Mardia and Jupp (2009). The probability density function is given by:

\begin{matrix} p_{W C} (x ∣ μ, ρ) = \frac{1}{2 π} \frac{1 - ρ^{2}}{1 + ρ^{2} - 2 ρ \cos (x - μ)}, \quad μ \in [0, 2 π), \quad ρ \in [0, 1) . \end{matrix}

(2.6)

The parameter ρ serves as a measure of concentration: $ρ = 0$ corresponds to the circular uniform distribution, indicating no concentration, and as $ρ \to 1$ the data become increasingly concentrated, converging to a point mass distribution. The density plot for different ρ values is given in the right-hand side of Figure 3.

For Bayesian analysis with a WC distribution, one reasonable choice for the prior for $ρ \in [0, 1)$ is a Beta (a, b) prior, which has support on the unit interval. Ravindran and Ghosh (2011) mentioned that Beta (a,a) is a class of non-informative priors for ρ. However, the Beta prior behaves markedly differently for different choices of a and b, which therefore need to be chosen carefully.

The circular distributions and priors discussed above show the diversity and utility of existing methods for modelling circular data. However, many of the priors currently in use are either heuristic or tailored to specific cases, lacking a unified framework. This motivates the need for a cohesive approach for default prior selection which can account for the unique properties of circular distributions, as presented in the next section.

3 Methodology

This section introduces a framework for constructing contraction hyperpriors for circular distributions that favor simpler circular models. The principles underlying the framework are presented in Section 3.1, together with the corresponding prior formulation that satisfies them. Specific priors for widely used circular distributions are proposed in Section 3.2.

3.1 Penalizing complexity prior for circular models

When there is insufficient information supporting the need for a complicated model, the principle of Occam's razor (MacKay, 2003) suggests favoring simpler alternatives. In this context, model complexity should be carefully considered when specifying prior distributions. One effective way to penalize Bayesian model complexity is through the choice of an appropriate prior, ensuring that it exerts a reasonable influence on the effective complexity of the model. Motivated by this idea, Simpson et al. (2017) proposed the PC prior framework, which is built upon four principles: (1) Occam's razor; (2) measure of complexity; (3) constant rate penalization; and (4) user-defined scaling.

Apart from the first principle, the measure of complexity principle states that the prior should be constructed using a model complexity measure defined by the ‘distance’ d between the model of interest and a simpler base model. The constant rate penalization principle specifies that the PC prior density decays at a constant rate $r \in (0, 1)$ with respect to an increment δ in the distance d, satisfying

\begin{matrix} \frac{p_{d} (d + δ)}{p_{d} (d)} = r^{δ}, d, δ \geq 0. \end{matrix}

(3.1)

Finally, the user-defined scaling principle suggests that users should have some understanding of the data, or of the expected model complexity, in order to define the strength of penalization accordingly.

The PC prior framework has been applied in a range of fields. For instance, it has been used in spatial and spatio-temporal modelling (Fuglstad et al., 2019; Cabral et al., 2023; Rodriguez Avellaneda et al., 2025), time-series analysis (Sørbye and Rue, 2017), survival modelling (Van Niekerk et al., 2021) and nonparametric regression (Ventrucci and Rue, 2016). These applications illustrate how PC priors have been used to control model complexity across different modelling contexts, motivating their extension to circular data.

In Euclidean space, statistical measures such as standard deviation and variance are intuitive and straightforward in terms of interpretability. However, as discussed in Section 2, these quantities lose their intuitive meanings for data collected on a circular domain (or transformed into circular form) and concentration parameters are used. The parameterization of the concentration parameter varies among circular distributions but typically includes the value ‘0’ as a special case, representing no concentration (or maximum dispersion). In most common cases, the distribution is simplified to the circular uniform distribution. It should be noted that the mean direction is not likelihood identifiable in the circular uniform distribution.

In contrast, as the concentration parameter increases, most circular distributions tend toward a point mass, representing extreme concentration around a specific direction (mean). An exception is the cardioid distribution, which tends toward a cardioid curve rather than collapsing into a point mass.

Although most circular distributions exhibit desirable properties when their concentration parameters approach the boundary values, the inference for these distributions can be unstable because the data are defined on a limited range ( $[0, 2 π)$ ). As a result, small variations in the data can correspond to large changes on the circular scale, making the concentration parameter highly sensitive. In particular, when the data deviate from the boundary cases, that is, when they are neither strongly clustered nor close to circular uniformity, the underlying model structure becomes more complex and the estimation of the concentration parameters becomes difficult and less stable.

This sensitivity and instability indicate the need for an approach to regularizing model complexity. One natural way to achieve this is to view the boundary cases as simpler circular models, which can be regarded as the base model in the PC prior framework, that is, the simplest models representing minimal structure and the most straightforward interpretation. The notion of model complexity can then be described by the measure of complexity principle, where the model complexity is quantified as the distance from the base model. We let $p (x ∣ ξ)$ be the model of interest, where x is the data and ξ is the parameter of interest, and let $p (x ∣ ξ_{0})$ represent the base model, with $ξ = ξ_{0}$ being the corresponding parameter value at the base model. Simpson et al. (2017) proposed a reasonable distance measure, which in circular scenarios can be written as

\begin{matrix} d (ξ) = \sqrt{KLD (p (x | ξ) ‖ p (x | ξ_{0}))} = \sqrt{\int_{0}^{2 π} p (x | ξ) \log (\frac{p (x | ξ)}{p (x | ξ_{0})}) d x}, \end{matrix}

(3.2)

where $KLD (\cdot ‖ \cdot)$ denotes the Kullback-Leibler divergence (KLD) (Kullback and Leibler, 1951) between two distributions, which is one of the most widely used quantitative measures of the difference between probability distributions.
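When no closed form is available, the distance in Equation 3.2 can be approximated by numerical quadrature. The sketch below is one such approximation under stated assumptions: the midpoint rule, the grid size `n` and the function names are illustrative choices, and the cardioid density of Equation 2.5 is used only as an example model against the circular uniform base model.

```python
import math

TWO_PI = 2.0 * math.pi

def kld(p, p0, n=4000):
    """KLD(p || p0) over [0, 2*pi), approximated by the composite midpoint rule."""
    h = TWO_PI / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        px = p(x)
        if px > 0.0:  # the contribution 0 * log(0) is taken as 0
            total += px * math.log(px / p0(x))
    return h * total

def distance(p, p0, n=4000):
    """d = sqrt(KLD(p || p0)) as in Equation 3.2."""
    return math.sqrt(max(kld(p, p0, n), 0.0))

def uniform_pdf(x):
    """Circular uniform density."""
    return 1.0 / TWO_PI

def cardioid_pdf(x, mu, l):
    """Cardioid density of Equation 2.5."""
    return (1.0 + 2.0 * l * math.cos(x - mu)) / TWO_PI

# Distance of a cardioid model with l = 0.25 from the circular uniform base model.
d_example = distance(lambda x: cardioid_pdf(x, math.pi, 0.25), uniform_pdf)
```

Because the integrand is smooth and periodic, the midpoint rule converges quickly here; a less regular density may need a finer grid.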

Following the formulation of Simpson et al. (2017), the PC prior for a model parameter ξ is defined as an exponential probability density function with respect to the distance, as follows:

\begin{matrix} p (d) = λ exp \{- λ d\} \Leftrightarrow p (ξ) = λ exp \{- λ d (ξ)\} |\frac{\partial d (ξ)}{\partial ξ}| . \end{matrix}

(3.3)

where λ is the scaling parameter of the PC prior, defined such that $P_{P C} (Q (ξ) > U) = α$ , with $Q (ξ)$ being a function of $ξ$, $U$ being a reasonable value of $Q (ξ)$ and $α \in (0, 1)$ a value representing probability. The formulations of $Q (ξ)$ are discussed later in this section.
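The change of variables in Equation 3.3 and the constant rate property in Equation 3.1 can be sketched numerically as follows. This is a generic illustration, assuming a user-supplied distance function `dist` and a finite-difference Jacobian; the function names and the toy distance $d (ξ) = \sqrt{ξ}$ used in the example are hypothetical and do not correspond to any circular model.

```python
import math

def pc_density_on_distance(d, lam):
    """Exponential PC prior density on the distance scale, p(d) = lam * exp(-lam * d)."""
    return lam * math.exp(-lam * d)

def pc_density(xi, dist, lam, eps=1e-6):
    """p(xi) = lam * exp(-lam * d(xi)) * |d'(xi)|, with a central-difference Jacobian."""
    jacobian = abs(dist(xi + eps) - dist(xi - eps)) / (2.0 * eps)
    return lam * math.exp(-lam * dist(xi)) * jacobian

lam = 2.0
# Constant rate penalization: p(d + delta) / p(d) = r ** delta with r = exp(-lam).
r = math.exp(-lam)
ratio = pc_density_on_distance(1.3 + 0.4, lam) / pc_density_on_distance(1.3, lam)

# Toy distance d(xi) = sqrt(xi), for which p(xi) = lam * exp(-lam * sqrt(xi)) / (2 * sqrt(xi)).
toy_value = pc_density(4.0, math.sqrt, lam)
```

The decay rate $r = \exp (- λ)$ lies in $(0, 1)$ for any $λ > 0$, so a larger λ corresponds to a stronger penalty per unit of distance.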

To illustrate how the PC prior behaves with respect to the distance measure, Figure 4 is presented. The left-hand-side plot shows the distance $d (κ)$ between the vM distribution and the circular uniform distribution. If we express the prior density for $κ, p (κ)$ , in terms of distance, p(d), then from the right-hand-side plot of Figure 4, we can observe that the prior density decreases as the distance increases. Since PC priors are exponential with respect to distance, for any concentration parameter ξ of a circular distribution, the density of its PC prior will always be penalized as the distance between that circular distribution and the base model increases. Moreover, because the prior is defined to be exponential on the distance scale, this penalization occurs at a constant rate with respect to increments in distance. Therefore, the PC prior satisfies the constant rate penalization principle and inherently favors simpler models, penalizing the prior density exponentially as the distance $d (ξ)$ increases. The scaling parameter, λ, of the PC prior determines the strength of penalization for deviations away from the base model. To satisfy the user-defined scaling principle and to appropriately select the scaling parameter, users need to have some understanding of the data and its properties. A general strategy for selecting λ is to set $Q (ξ) = ψ$ , where $ψ \in [0, 1]$ denotes the mean resultant length. Following Ley and Verdebout (2017) and Lund et al. (2017), it is defined as

\begin{matrix} ψ = \frac{1}{n} \sqrt{{(\sum_{i = 1}^{n} cos (x_{i}))}^{2} + {(\sum_{i = 1}^{n} sin (x_{i}))}^{2}}, \end{matrix}

(3.4)

for data $x_{i}, i = 1, 2, \dots, n$ . Alternative definitions of this quantity also exist, such as moving the factor $\frac{1}{n}$ inside the two squared summations (Mardia and Jupp, 2009). This quantity is a measure of the concentration around the circle. A value of $ψ = 0$ corresponds to a circular uniform distribution where there is no preferred direction, while $ψ = 1$ indicates that all observations coincide at the same angle (maximum concentration). Under this interpretation, $P_{P C} (Q (ξ) > U) = α$ expresses the belief that ‘the probability that the mean resultant length of the data exceeds U is α’.

Figure 4

Distance $d (κ)$ between vM distribution and circular uniform distribution (left) and the PC prior density in distance scale for different λ values (right).

This interpretation is model-independent, as the user can also express ψ explicitly as a function of the model parameter ξ and therefore interpret $Q (ξ)$ in terms of ξ. For instance, in the vM distribution, $ψ (κ) = \frac{I_{1} (κ)}{I_{0} (κ)}$ , whilst for the cardioid and WC distributions, ψ equals the concentration parameters $l$ and ρ, respectively. Conceptually, the mean resultant length serves as a sample-based indicator of data concentration, while the concentration parameter of a circular distribution controls the shrinkage behaviour of the corresponding probability density. Therefore, the mean resultant length links the circular distribution to the observed data, incorporating the information from the data into the density function.
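The sample quantity in Equation 3.4 and its population counterparts can be checked numerically. The sketch below assumes hypothetical helper names and uses the midpoint rule to evaluate the population mean resultant length of each density; the Bessel series is included only to keep the example self-contained.

```python
import math

TWO_PI = 2.0 * math.pi

def mean_resultant_length(angles):
    """Sample mean resultant length of Equation 3.4."""
    c = sum(math.cos(x) for x in angles)
    s = sum(math.sin(x) for x in angles)
    return math.hypot(c, s) / len(angles)

def population_mrl(pdf, n=4000):
    """|E[exp(iX)]| for a density on [0, 2*pi), by the composite midpoint rule."""
    h = TWO_PI / n
    xs = [(i + 0.5) * h for i in range(n)]
    c = h * sum(pdf(x) * math.cos(x) for x in xs)
    s = h * sum(pdf(x) * math.sin(x) for x in xs)
    return math.hypot(c, s)

def bessel_i(order, x, tol=1e-16):
    """Modified Bessel function I_order(x) for x >= 0, via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total, k = term, 0
    while term > tol * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * (k + order))
        total += term
    return total

def vm_pdf(x, mu, kappa):
    return math.exp(kappa * math.cos(x - mu)) / (TWO_PI * bessel_i(0, kappa))

def cardioid_pdf(x, mu, l):
    return (1.0 + 2.0 * l * math.cos(x - mu)) / TWO_PI

def wc_pdf(x, mu, rho):
    return (1.0 - rho ** 2) / (TWO_PI * (1.0 + rho ** 2 - 2.0 * rho * math.cos(x - mu)))

psi_vm = population_mrl(lambda x: vm_pdf(x, 1.0, 2.0))            # compare with I_1(2) / I_0(2)
psi_cardioid = population_mrl(lambda x: cardioid_pdf(x, 1.0, 0.3))  # compare with l = 0.3
psi_wc = population_mrl(lambda x: wc_pdf(x, 1.0, 0.7))            # compare with rho = 0.7
```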

This unified approach covers various circular models under one framework. However, it may not be intuitive for users inexperienced with circular data. Therefore, an alternative, more intuitive approach is also suggested.

For most circular distributions, where the bounds of the concentration parameter ξ correspond to the circular uniform distribution and a point mass, a suitable transformation $Q (ξ)$ can often be found. For instance, one could define $Q (ξ)$ such that the circular uniform density corresponds to $Q (ξ) \to 2 π$ and the point mass corresponds to $Q (ξ) \to 0$ . Under this transformation, $Q (ξ) \in (0, 2 π)$ , making $P_{P C} (Q (ξ) > U) = α$ represent ‘the probability that the data fall outside a radian U around the mean direction being equal to α’. Small values of U indicate a high concentration toward a point mass, whilst large values of U mean that the user believes the data are widely spread. This transformation is intuitive and allows users to gain insights into appropriate values for U and α by visualizing the data.

It is worth noting that the proposed PC priors are flexible in terms of informativeness. That is, they can be either weakly or strongly informative (Simpson et al., 2017), depending on the user's belief about the data. As an illustration, consider selecting the value of λ using the mean resultant length $(Q (ξ) = ψ)$ . If the user believes that ‘the mean resultant length of the data is nearly impossible to be higher than 0.8’, then one can set U = 0.8 and $α = 0.001$ to obtain a strongly informative prior. Conversely, if the user has no prior knowledge about the concentration of the data, λ can be chosen by setting U = 0.5 and $α = 0.5$ , indicating uncertainty about whether the mean resultant length of the data should be larger or smaller than the median, which consequently yields a weakly informative prior. Therefore, the PC prior can always be employed regardless of whether sufficient prior information is available.
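As a worked sketch of this calibration, consider the vM distribution with the circular uniform base model, whose PC prior CDF is given in Equation 3.7 of Section 3.2.1. Since $ψ (κ) = I_{1} (κ) / I_{0} (κ)$ is increasing in κ, the condition $P_{P C} (ψ > U) = α$ becomes $\exp \{- λ d (κ_{U})\} = α$ with $ψ (κ_{U}) = U$. The code below follows this reasoning; the bisection bounds, the function names and the restriction $U < ψ (200)$ are assumptions made for illustration.

```python
import math

def bessel_i(order, x, tol=1e-16):
    """Modified Bessel function I_order(x) for x >= 0, via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total, k = term, 0
    while term > tol * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * (k + order))
        total += term
    return total

def psi_vm(kappa):
    """Mean resultant length of the vM distribution, I_1(kappa) / I_0(kappa)."""
    return bessel_i(1, kappa) / bessel_i(0, kappa)

def kappa_from_psi(u, hi=200.0, iters=100):
    """Invert psi_vm by bisection; assumes 0 < u < psi_vm(hi)."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if psi_vm(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def vm_distance_uniform(kappa):
    """d(kappa) for the base model kappa_0 = 0, as it appears in Equation 3.7."""
    return math.sqrt(max(kappa * psi_vm(kappa) - math.log(bessel_i(0, kappa)), 0.0))

def lambda_from_psi(U, alpha):
    """Solve P(psi(kappa) > U) = alpha, i.e. exp(-lambda * d(kappa_U)) = alpha."""
    return -math.log(alpha) / vm_distance_uniform(kappa_from_psi(U))

strong = lambda_from_psi(0.8, 0.001)  # strongly informative setting from the text
weak = lambda_from_psi(0.5, 0.5)      # weakly informative setting from the text
```

The strongly informative setting yields a larger λ than the weakly informative one, reflecting a heavier penalty on departures from the circular uniform base model.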

3.2 Specific cases

3.2.1 Von Mises distribution

The first step in deriving the PC prior is to compute the KLD. For the vM distribution as defined in Equation 2.1, the KLD is given below:

\begin{matrix} KLD (V M ‖ V M_{0}) = \int_{0}^{2 π} p (x | μ, κ) \log (\frac{p (x | μ, κ)}{p (x | μ, κ_{0})}) d x \\ = \log (I_{0} (κ_{0})) - \log (I_{0} (κ)) + \frac{κ I_{1} (κ)}{I_{0} (κ)} - \frac{κ_{0} I_{1} (κ)}{I_{0} (κ)} . \end{matrix}

(3.5)
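The closed form in Equation 3.5 can be checked against direct numerical integration of the KLD. The following sketch assumes hypothetical function names, a power-series Bessel helper and a midpoint rule with `n` grid points; it is a consistency check rather than part of the derivation.

```python
import math

TWO_PI = 2.0 * math.pi

def bessel_i(order, x, tol=1e-16):
    """Modified Bessel function I_order(x) for x >= 0, via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total, k = term, 0
    while term > tol * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * (k + order))
        total += term
    return total

def vm_kld_closed(kappa, kappa0):
    """Closed-form KLD between vM(mu, kappa) and vM(mu, kappa0), Equation 3.5."""
    i0, i1 = bessel_i(0, kappa), bessel_i(1, kappa)
    return math.log(bessel_i(0, kappa0)) - math.log(i0) + (kappa - kappa0) * i1 / i0

def vm_kld_numeric(kappa, kappa0, mu=math.pi, n=4000):
    """The same KLD by the composite midpoint rule over [0, 2*pi)."""
    norm, norm0 = TWO_PI * bessel_i(0, kappa), TWO_PI * bessel_i(0, kappa0)
    h = TWO_PI / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        p = math.exp(kappa * math.cos(x - mu)) / norm
        q = math.exp(kappa0 * math.cos(x - mu)) / norm0
        total += p * math.log(p / q)
    return h * total

gap = abs(vm_kld_closed(3.0, 1.0) - vm_kld_numeric(3.0, 1.0))
```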

Recall the behaviour of the vM density with respect to changes in the concentration parameter κ (Section 2.1). Two natural choices for base model for the concentration parameter are the circular uniform distribution ( $κ_{0} = 0$ ), representing no concentration and the point mass ( $κ_{0} ≫ κ$ ), representing high concentration. With these two chosen base models, the corresponding PC priors for κ can be computed and are given in Proposition 1 and Proposition 2, respectively. The proofs are given in Appendix A.1.

Proposition 1. The PC prior for the concentration parameter κ of the vM distribution with the base model at $κ_{0} = 0$ has density

\begin{matrix} p (κ) = λ exp \{- λ \sqrt{\frac{κ I_{1} (κ)}{I_{0} (κ)} - log (I_{0} (κ))}\} \frac{\frac{κ (I_{0} (κ) + I_{2} (κ))}{2 I_{0} (κ)} - \frac{κ I_{1} {(κ)}^{2}}{I_{0} {(κ)}^{2}}}{2 \sqrt{\frac{κ I_{1} (κ)}{I_{0} (κ)} - log (I_{0} (κ))}}, \end{matrix}

(3.6)

where $λ > 0$ , and the corresponding CDF is

\begin{matrix} F (κ) = 1 - \exp \{- λ \sqrt{\frac{κ I_{1} (κ)}{I_{0} (κ)} - \log (I_{0} (κ))}\} . \end{matrix}

(3.7)
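A direct transcription of Proposition 1 is sketched below, together with a finite-difference check that the density in Equation 3.6 is the derivative of the CDF in Equation 3.7. The function names are hypothetical, and the density is only evaluated for $κ > 0$ because the expression takes the form 0/0 at $κ = 0$.

```python
import math

def bessel_i(order, x, tol=1e-16):
    """Modified Bessel function I_order(x) for x >= 0, via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total, k = term, 0
    while term > tol * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * (k + order))
        total += term
    return total

def pcu_distance(kappa):
    """Distance from the circular uniform base model (kappa_0 = 0)."""
    i0, i1 = bessel_i(0, kappa), bessel_i(1, kappa)
    return math.sqrt(max(kappa * i1 / i0 - math.log(i0), 0.0))

def pcu_cdf(kappa, lam):
    """CDF of Equation 3.7."""
    return 1.0 - math.exp(-lam * pcu_distance(kappa))

def pcu_pdf(kappa, lam):
    """Density of Equation 3.6, for kappa > 0."""
    i0, i1, i2 = (bessel_i(a, kappa) for a in (0, 1, 2))
    d = pcu_distance(kappa)
    jacobian = (kappa * (i0 + i2) / (2.0 * i0) - kappa * i1 ** 2 / i0 ** 2) / (2.0 * d)
    return lam * math.exp(-lam * d) * jacobian

# Central finite difference of the CDF, to be compared with the density.
eps = 1e-5
fd = (pcu_cdf(1.5 + eps, 2.0) - pcu_cdf(1.5 - eps, 2.0)) / (2.0 * eps)
```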

Proposition 2. The PC prior for the concentration parameter κ of the vM distribution with the base model at $κ_{0} ≫ κ$ has density

\begin{matrix} p (κ) = λ exp \{- λ \sqrt{1 - \frac{I_{1} (κ)}{I_{0} (κ)}}\} \frac{\frac{I_{0} (κ) + I_{2} (κ)}{2 I_{0} (κ)} - \frac{I_{1} {(κ)}^{2}}{I_{0} {(κ)}^{2}}}{2 \sqrt{1 - \frac{I_{1} (κ)}{I_{0} (κ)}}}, \end{matrix}

(3.8)

where $λ > 0$ , and the corresponding CDF is

\begin{matrix} F (κ) = exp \{- λ \sqrt{1 - \frac{I_{1} (κ)}{I_{0} (κ)}}\} . \end{matrix}

(3.9)

The densities of both PC priors and the corresponding log-log densities are presented in Figure 5 (prior with $κ_{0} = 0$ base model) and Figure 6 (prior with $κ_{0} \to \infty$ base model). The plots show that both PC priors vary with the value of the scaling parameter λ. For the PC prior with the circular uniform density as the base model ( $κ_{0} = 0$ ), the prior density peaks at $κ = 0$ , which indicates that this prior prefers a more uniformly distributed model. The PC prior with a point mass as the base model (Figure 6) assigns more density away from $κ = 0$ as λ increases. In other words, this PC prior expects the data to be more concentrated than does the PC prior with the circular uniform base model (PCU). Since $κ \in [0, \infty)$ , following the discussion on interpretable transformations in Section 3.1, we suggest the transformation $Q (κ) = \frac{2 π}{1 + κ} \in (0, 2 π]$ , which can be read as an angle in radians on the circle. Therefore, the scaling parameter λ for the PC prior for the concentration parameter of the vM distribution can be computed and calibrated by

\begin{matrix} P (Q (κ) > U) = P (\frac{2 π}{1 + κ} > U) = P (κ < \frac{2 π}{U} - 1) = F (\frac{2 π}{U} - 1) = α . \end{matrix}

(3.10)

Figure 5

$κ_{0} = 0$ base model PC prior density (left) and log-log density (right).

Figure 6

$κ_{0} \to \infty$ base model PC prior density (left) and log-log density (right).

Therefore, the expressions for computing λ for PC priors for κ are given by

\begin{matrix} λ = \begin{cases} - \frac{\log (1 - α)}{\sqrt{\frac{(\frac{2 π}{U} - 1) I_{1} (\frac{2 π}{U} - 1)}{I_{0} (\frac{2 π}{U} - 1)} - \log (I_{0} (\frac{2 π}{U} - 1))}} & \text{for base model at } κ_{0} = 0; \\ - \frac{\log (α)}{\sqrt{1 - \frac{I_{1} (\frac{2 π}{U} - 1)}{I_{0} (\frac{2 π}{U} - 1)}}} & \text{for base model at } κ_{0} \to \infty . \end{cases} \end{matrix}

(3.11)
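The calibration condition in Equation 3.10 can be solved for λ directly from the CDFs in Equations 3.7 and 3.9, as in the sketch below. The function name `lambda_vm`, the `base` labels and the requirement $0 < U < 2 π$ (so that $2 π / U - 1 > 0$) are assumptions made for this illustration.

```python
import math

def bessel_i(order, x, tol=1e-16):
    """Modified Bessel function I_order(x) for x >= 0, via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total, k = term, 0
    while term > tol * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * (k + order))
        total += term
    return total

def lambda_vm(U, alpha, base="uniform"):
    """Scaling parameter solving F(2*pi/U - 1) = alpha for the chosen base model."""
    kappa_star = 2.0 * math.pi / U - 1.0  # requires 0 < U < 2*pi
    i0 = bessel_i(0, kappa_star)
    ratio = bessel_i(1, kappa_star) / i0
    if base == "uniform":
        # Equation 3.7: 1 - exp(-lambda * d) = alpha.
        d = math.sqrt(kappa_star * ratio - math.log(i0))
        return -math.log(1.0 - alpha) / d
    if base == "point":
        # Equation 3.9: exp(-lambda * d) = alpha.
        d = math.sqrt(1.0 - ratio)
        return -math.log(alpha) / d
    raise ValueError("base must be 'uniform' or 'point'")

lam_uniform = lambda_vm(1.0, 0.1, base="uniform")
lam_point = lambda_vm(1.0, 0.1, base="point")
```

Substituting either value back into the corresponding CDF at $κ = 2 π / U - 1$ should return α, which is a convenient check on any implementation.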

3.2.2 Cardioid distribution

For the cardioid distribution as defined in (2.5), we have the KLD as

\begin{matrix} KLD (C ‖ C_{0}) = \int_{0}^{2 π} p (x | μ, l) \log (\frac{p (x | μ, l)}{p (x | μ, l_{0})}) d x \\ = 1 - \sqrt{1 - 4 l^{2}} - \frac{l}{l_{0}} (1 - \sqrt{1 - 4 l_{0}^{2}}) + \log (l) - \log (l_{0}) \\ + \frac{1}{2} \log (1 + \sqrt{1 - 4 l^{2}}) - \frac{1}{2} \log (1 - \sqrt{1 - 4 l^{2}}) \\ + \frac{1}{2} \log (1 - \sqrt{1 - 4 l_{0}^{2}}) - \frac{1}{2} \log (1 + \sqrt{1 - 4 l_{0}^{2}}) . \end{matrix}

(3.12)

The base models for the concentration parameter $l$ of the cardioid distribution are unusual, as $l$ controls the extent to which the cardioid curve deviates from the circular uniform distribution. Therefore, the two base models for $l$ are the circular uniform distribution $(l_{0} = 0)$ and the most non-uniform cardioid curve $(l_{0} \to \frac{1}{2})$ . The corresponding PC priors for $l$ are given in Proposition 3 and Proposition 4. The proofs are given in Appendix A.2.

Proposition 3. The PC prior for concentration parameter $l$ of the cardioid distribution with the base model at $l_{0} = 0$ has density

\begin{matrix} p (l) = λ exp \{- λ d (l)\} \frac{2 l}{(1 - exp \{- λ \sqrt{1 - log (2)}\}) (1 + \sqrt{1 - 4 l^{2}}) d (l)}, \end{matrix}

(3.13)

where $λ > 0$ , and

\begin{matrix} d (l) = \sqrt{1 - \sqrt{1 - 4 l^{2}} + log (l) + \frac{1}{2} log (\frac{1 + \sqrt{1 - 4 l^{2}}}{1 - \sqrt{1 - 4 l^{2}}})} . \end{matrix}

(3.14)

The corresponding CDF is

\begin{matrix} F (l) = \frac{1 - exp \{- λ d (l)\}}{1 - exp \{- λ \sqrt{1 - log (2)}\}} . \end{matrix}

(3.15)

Proposition 4. The PC prior for concentration parameter $l$ of the cardioid distribution with the base model at $l_{0} \to \frac{1}{2}$ has density

\begin{matrix} p (l) = λ exp \{- λ d (l)\} \frac{2 l + \sqrt{1 - 4 l^{2}} - 1}{2 l d (l)} . \end{matrix}

(3.16)

where $λ > 0$ , and

\begin{matrix} d (l) = \sqrt{1 + log (2) - 2 l - \sqrt{1 - 4 l^{2}} + log (l) + \frac{1}{2} log (\frac{1 + \sqrt{1 - 4 l^{2}}}{1 - \sqrt{1 - 4 l^{2}}})} . \end{matrix}

(3.17)

The corresponding CDF is

\begin{matrix} F (l) = exp \{- λ d (l)\} . \end{matrix}

(3.18)

The density of each PC prior and the respective log-log density are presented in Figure 7 (prior with $l_{0} = 0$ base model) and Figure 8 (prior with $l_{0} \to 0.5$ base model). Figure 7 illustrates that the PC prior with a uniform base model assigns higher density at $l = 0$ for larger values of λ. However, there is always a sharp increase in density as $l \to {0.5}^{-}$ . In contrast, for the PC prior with $l_{0} \to 0.5$ as the base model (Figure 8), the prior density increases monotonically with $l$ , and the rate of increase becomes steeper as $l$ approaches 0.5. The parameter $l$ serves as a shape-concentration parameter, controlling the ‘cardiodity’ of the density. Unlike for the vM and WC distributions, it is not appropriate to define $Q (l)$ in the same way, because of the unique role of $l$ in determining the cardioid shape. Instead, we propose the transformation $Q (l) = 2 l \in (0, 1)$ , which provides a meaningful interpretation as the ‘cardiodity rate’ from 0 to 1. Using this transformation, the scaling parameter λ in Equations 3.13 and 3.16 can be calculated by:

\begin{matrix} P (Q (l) > U) = P (2 l > U) = P (l > \frac{U}{2}) = 1 - F (\frac{U}{2}) = α \end{matrix}

(3.19)

Figure 7

$l_{0} = 0$ base model PC prior density (left) and log-log density (right).

This transformation ensures an intuitive and interpretable way to set U, representing the threshold for the ‘cardiodity rate’ and α, which controls the weight assigned to the tail of the PC prior.

Through calculation, the expressions for computing λ for the PC priors for $l$ are given by

λ = \{\begin{matrix} - \frac{log (α + (1 - α) exp \{- λ \sqrt{1 - log 2}\})}{d (\frac{U}{2})} & for base model at l_{0} = 0; \\ - \frac{log (1 - α)}{d (\frac{U}{2})} & for base model at l_{0} \to \frac{1}{2}, \end{matrix}

(3.20)

where $d (\cdot)$ for both cases are given in Proposition 3 and Proposition 4, respectively.
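Note that, in the first case of (3.20), λ appears on both sides, so it has to be found numerically. The sketch below is one possible way to do this, using plain bisection on $1 - F (\frac{U}{2}) = α$ with the CDF from (3.15); the bisection routine, the function names, the bracket $[10^{-8}, 10^{3}]$ and the example values $U = 0.5$ , $α = 0.1$ are assumptions made for illustration and are not taken from the article. Under (3.15) as printed, the tail probability appears to be bounded above by $1 - d (\frac{U}{2}) / \sqrt{1 - log 2}$ , so the sketch rejects larger α.

```python
import math

# d(l) at l = 1/2 for the l0 = 0 base model, i.e. sqrt(1 - log 2).
C_MAX = math.sqrt(1.0 - math.log(2.0))


def d_uniform_base(l):
    """Distance d(l) from Proposition 3 (base model l0 = 0)."""
    s = math.sqrt(1.0 - 4.0 * l * l)
    return math.sqrt(1.0 - s + math.log(l)
                     + 0.5 * math.log((1.0 + s) / (1.0 - s)))


def d_half_base(l):
    """Distance d(l) from Proposition 4 (base model l0 -> 1/2)."""
    s = math.sqrt(1.0 - 4.0 * l * l)
    return math.sqrt(1.0 + math.log(2.0) - 2.0 * l - s + math.log(l)
                     + 0.5 * math.log((1.0 + s) / (1.0 - s)))


def tail_prob_uniform_base(lam, U):
    """P(Q(l) > U) = 1 - F(U / 2), with F taken from Equation (3.15)."""
    d = d_uniform_base(U / 2.0)
    return 1.0 - math.expm1(-lam * d) / math.expm1(-lam * C_MAX)


def lambda_uniform_base(U, alpha, lo=1e-8, hi=1e3):
    """Solve the implicit first case of (3.20) by bisection (illustrative)."""
    upper = 1.0 - d_uniform_base(U / 2.0) / C_MAX  # tail prob. as lam -> 0
    if not 0.0 < alpha < upper:
        raise ValueError("alpha must lie in (0, %.4f) for this U" % upper)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tail_prob_uniform_base(mid, U) > alpha:
            lo = mid  # tail still too heavy: a larger lambda is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lambda_half_base(U, alpha):
    """Explicit second case of (3.20) (base model l0 -> 1/2)."""
    return -math.log(1.0 - alpha) / d_half_base(U / 2.0)


if __name__ == "__main__":
    print(lambda_uniform_base(0.5, 0.1), lambda_half_base(0.5, 0.1))
```

Bisection is chosen here only because the tail probability is monotone in λ; a fixed-point iteration on (3.20) or any standard root finder could be used instead.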

3.2.3 Wrapped Cauchy distribution

Recall that $ρ = 1$ is not well defined for the WC density (Section 2.3) and that the WC distribution is preferred as a model with less concentration than the vM distribution. The WC distribution is thus usually employed when we believe the data are not strongly concentrated, so it is more reasonable to consider the circular uniform distribution ( $ρ_{0} = 0$ ) as the only base model for the concentration parameter ρ of the WC distribution. The KLD between $p (x ∣ μ, ρ)$ and $p (x ∣ μ, ρ_{0} = 0)$ is

\begin{matrix} KLD (W C ‖ W C_{0}) = \int p (x | μ, ρ) log (\frac{p (x | μ, ρ)}{p (x | μ, ρ_{0})}) d x \\ = - log (1 - ρ^{2}) - log (1 - ρ_{0}^{2}) + 2 log (1 - ρ_{0} ρ) \\ = - log (1 - ρ^{2}) \end{matrix}

(3.21)

With this result, the PC prior for ρ is given in Proposition 5.

Proposition 5. The PC prior for concentration parameter ρ of the WC distribution with the base model at $ρ_{0} = 0$ has density

\begin{matrix} p (ρ) = λ exp \{- λ \sqrt{- log (1 - ρ^{2})}\} \frac{ρ}{(1 - ρ^{2}) \sqrt{- log (1 - ρ^{2})}}, \end{matrix}

(3.22)

where $λ > 0$ , and the corresponding CDF is

\begin{matrix} F (ρ) = 1 - exp \{- λ \sqrt{- log (1 - ρ^{2})}\} . \end{matrix}

(3.23)

The proof is provided in Appendix A.3, and the density and log-log density plots of this PC prior are shown in Figure 9. The density plot indicates that higher values of λ result in greater prior density at $ρ = 0$ and that the prior density increases sharply as ρ approaches one. On the log-log scale, the prior shows a consistent trend across different values of λ, suggesting similar characteristics despite variations in λ. Although the point mass boundary $ρ = 1$ of the WC distribution is not well defined, ρ still controls the concentration behaviour of the distribution from uniform to approximate point mass. Therefore, to compute the scaling parameter of the PC prior, the same idea of an interpretable transformation of the parameter can be employed. For $ρ \in [0, 1)$ , we propose to use the transformation $Q (ρ) = 2 π (1 - ρ) \in (0, 2 π]$ . Now, $Q (ρ) \to 2 π$ when $ρ \to 0$ and $Q (ρ) \to 0$ when $ρ \to 1$ . Then, by solving $P (Q (ρ) > U) = F (1 - \frac{U}{2 π}) = α$ , we have $λ = - log (1 - α) / \sqrt{- log (\frac{U}{π} - \frac{U^{2}}{4 π^{2}})}$ .
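The calibration rule above can be sketched in a few lines. The code below is a minimal illustration that uses only the CDF in (3.23) and the transformation $Q (ρ) = 2 π (1 - ρ)$ ; the function names are hypothetical, the inverse-CDF helper is an addition for convenience (it is not part of the article), and the values $U = \frac{π}{2}$ , $α = 0.5$ are examples only.

```python
import math


def wc_distance(rho):
    """d(rho) = sqrt(KLD) = sqrt(-log(1 - rho^2)), from Equation (3.21)."""
    return math.sqrt(-math.log1p(-rho * rho))


def wc_pc_cdf(rho, lam):
    """CDF of the PC prior for rho, Equation (3.23)."""
    return 1.0 - math.exp(-lam * wc_distance(rho))


def wc_pc_lambda(U, alpha):
    """Scaling parameter solving P(Q(rho) > U) = alpha, Q(rho) = 2*pi*(1 - rho)."""
    if not (0.0 < U < 2.0 * math.pi and 0.0 < alpha < 1.0):
        raise ValueError("need 0 < U < 2*pi and 0 < alpha < 1")
    rho_star = 1.0 - U / (2.0 * math.pi)  # Q(rho) > U  <=>  rho < rho_star
    return -math.log(1.0 - alpha) / wc_distance(rho_star)


def wc_pc_quantile(u, lam):
    """Inverse of the CDF in (3.23); maps u in [0, 1) to rho in [0, 1)."""
    d = -math.log(1.0 - u) / lam          # exponential quantile on the distance scale
    return math.sqrt(-math.expm1(-d * d))  # invert d^2 = -log(1 - rho^2)


if __name__ == "__main__":
    lam = wc_pc_lambda(math.pi / 2.0, 0.5)  # example values only
    print(lam, wc_pc_cdf(0.75, lam), wc_pc_quantile(0.5, lam))
```

With $U = \frac{π}{2}$ the threshold is $ρ = 0.75$ , so the calibrated prior places probability α on $ρ < 0.75$ ; the quantile helper could also be used for inverse-transform sampling from the prior.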

Figure 8

$l_{0} \to 0.5$ base model PC prior density (left) and log-log density (right).

Figure 9

PC prior for ρ: density (left) and log-log density (right).

4 Simulation study and comparisons

The logic behind our proposed prior selection framework is to assess whether a prior favors simpler models and discourages unnecessarily complex specifications. To further illustrate how to use our framework and to demonstrate the advantages of employing PC priors for circular models, this section compares the proposed priors in Subsubsection 3.2.1 with existing priors from the literature. The comparison focuses on their behaviour in both the parameter scale and the distance scale (Subsection 4.1). Additionally, a comprehensive set of simulation studies is presented in Subsection 4.2.

4.1 Properties of PC and other common priors

In this section, we illustrate the behaviour of the proposed PC priors as well as other priors from the literature for the three models under consideration.

4.1.1 Von Mises distribution

Figure 10 illustrates the behaviour of the PC prior, the popular Gamma(1, b) prior and the h₂ and h₃ priors (Equation 2.4) in both the parameter scale and the distance scale. The distance measure is defined between the vM density and the circular uniform density. For fairness, the scaling parameter λ of the PC prior and the rate parameter b of the Gamma(1, b) prior are calibrated so that both priors share the same median.

Figure 10

Priors for κ in parameter (left) and distance scale (right) with the base model at $κ_{0} = 0$ . The solid line is the PC $(λ = 0.92)$ prior, the dashed line is the Gamma $(1,0.34)$ prior, the dotted line is the $h_{2}$ prior and the dot-dash line is the $h_{3}$ prior.

From the plot of the density on the distance scale, it is evident that the h₃ prior assigns zero density at d = 0, indicating that it never suggests the data being uniform. This behaviour implies a preference for more complex models. The Gamma (1, b) prior has reasonable density at the base model, but its density is non-monotonic, with a global maximum around $d \approx 1$ rather than at d = 0, which indicates limited contraction toward the base model. The h₂ prior does not, by construction, penalize the density exponentially with respect to distance; however, it behaves reasonably on the distance scale, in that its density peaks at d = 0 and decreases monotonically as $d \to \infty$ .

When the base model is at point mass ( $κ_{0} \to \infty$ ), Figure 11 reveals that only the PC prior has non-zero density at the simplest case $(d = 0)$ , whereas the Gamma (1, b), h₂ and h₃ priors assign zero density to d = 0, which limits their ability to regularize toward the base model in this scenario. Moreover, except for the PC prior, the priors appear to have non-smooth density curves, with zero density for d > 1. In other words, these priors cannot support a model that has moderate distance from the point mass either.

Figure 11

Priors for κ in parameter (left) and distance scale (right) with the base model at $κ_{0} \to \infty$ . The solid line is the PC $(λ = 1.26)$ prior, the dashed line is the Gamma $(1,0.34)$ prior, the dotted line is the $h_{2}$ prior and the dot-dash line is the $h_{3}$ prior.

Recall that the h₂ and h₃ priors are constructed based on the MML criterion, which is intended to penalize model complexity. From the plots given above, we can see that the MML-based priors assign reasonable density to small distances from the base models but do not always allow the model to reach the boundary when the distance measure is defined by Equation 3.2.

Based on these observations, we recommend using the PC prior for the concentration parameter κ of the vM distribution, particularly when limited prior information or belief is available. However, in cases where the data exhibit low concentration, the h₂ prior can also be a reasonable alternative.

4.1.2 Cardioid distribution

Figures 12 and 13 show the boundary behaviour of the priors for parameter $l$ with the base models at $l_{0} = 0$ and $l_{0} \to 0.5$ , respectively. The plots show that only the PC priors perform well at the boundaries, assigning non-zero density at $d = 0$ , whilst the densities of the uniform and $0.5 \times$ Beta priors are not even smooth. This feature illustrates a possible inherent caveat of priors that are specified to be smooth and well-behaved on the parameter scale.

Figure 12

Priors for $l$ in parameter (left) and distance scale (right) with the base model at $l_{0} = 0$ . The solid line is the PC $(λ = 2.86)$ prior, the dashed line is the Beta $(5,2)$ prior and the dotted line is the uniform prior.

Figure 13

Priors for $l$ in parameter (left) and distance scale (right) with the base model at $l_{0} \to 0.5$ . The solid line is the PC $(λ = 1.13)$ prior, the dashed line is the Beta $(5,2)$ prior and the dotted line is the uniform prior.

4.1.3 Wrapped Cauchy distribution

The Beta prior is one of the most commonly used choices for the concentration parameter ρ of the WC distribution. The prior density on both the parameter scale and the distance scale is illustrated in Figure 14. The behaviour of the Beta prior is particularly interesting, as it exhibits different trends near the two boundaries depending on the choice of the a and b hyperparameters. The right-hand-side plot shows that the Beta prior assigns non-zero density at d = 0 only when a < 1, so that it does not exclude the base model a priori. Therefore, a Beta prior with a < 1 can be a suitable prior choice for ρ. When users have sufficient information to set an informative prior, a Beta prior with a < 1 is a reasonable choice, as it maintains the ability to prevent ‘overconfidence’. However, selection of the hyperparameter b significantly influences the behaviour of the Beta prior. Thus, we should always be careful when employing this Beta prior in practice.
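The distance-scale view used here amounts to a change of variables, $p (d) = p (ρ (d)) | \frac{d ρ}{d d} |$ , with $d = \sqrt{- log (1 - ρ^{2})}$ from (3.21). The sketch below illustrates that mapping for a Beta prior and for the PC prior in (3.22); the function names and the particular hyperparameter values are illustrative assumptions, not the settings used for Figure 14.

```python
import math


def beta_pdf(rho, a, b):
    """Beta(a, b) density on (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * rho ** (a - 1.0) * (1.0 - rho) ** (b - 1.0)


def wc_pc_pdf(rho, lam):
    """PC prior density for rho, Equation (3.22)."""
    dist = math.sqrt(-math.log1p(-rho * rho))
    return lam * math.exp(-lam * dist) * rho / ((1.0 - rho * rho) * dist)


def rho_from_distance(d):
    """Invert d = sqrt(-log(1 - rho^2)) for rho in [0, 1)."""
    return math.sqrt(-math.expm1(-d * d))


def to_distance_scale(pdf_rho, d):
    """Change of variables: p(d) = p(rho(d)) * |d rho / d d|."""
    rho = rho_from_distance(d)
    jacobian = d * math.exp(-d * d) / rho  # derivative of rho(d) w.r.t. d
    return pdf_rho(rho) * jacobian


if __name__ == "__main__":
    d_small = 1e-4  # a point very close to the base model
    for a in (0.5, 2.0):  # illustrative shape values only
        value = to_distance_scale(lambda r: beta_pdf(r, a, 2.0), d_small)
        print("Beta(%.1f, 2) at d = %g: %.4g" % (a, d_small, value))
    print("PC(1) at d = 0.5:", to_distance_scale(lambda r: wc_pc_pdf(r, 1.0), 0.5))
```

On the distance scale the PC prior reduces to the exponential density $λ exp \{- λ d\}$ , which gives a convenient check of the transformation, while the Beta density near $d = 0$ changes markedly with the shape parameter a.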

Figure 14

Beta prior and PC $(λ = 1)$ prior density in parameter $(p (ρ)$ , left) and distance scale $(p (d)$ , right).

4.2 Simulation study

In the previous subsections, we illustrated how to employ the proposed framework and examined the behaviour of the discussed priors. In this section, we further show the performance of the priors and the robustness of the proposed circular PC priors through comprehensive simulation studies. The simulation study compares the posteriors of the concentration parameter ξ (κ for the vM distribution, $l$ for the cardioid distribution and ρ for the WC distribution) under different priors. For consistency, $p (μ)$ is chosen to be the circular uniform density; thus, the joint posterior is $p (μ, ξ ∣ x) \propto p (x ∣ μ, ξ) p (μ) p (ξ)$ .

In the study, posterior samples are obtained using the No-U-Turn Sampler, an adaptive variant of Hamiltonian Monte Carlo, implemented through the Stan interface (Carpenter et al., 2017; Stan Development Team, 2025). Sampling is performed using the rstan package version 2.32.7 together with Stan version 2.32.2 to ensure reproducibility. For consistency, the data are sampled from $p (x ∣ μ, ξ)$ (vM, cardioid and WC distribution, respectively) with $N = 100, 300, 1000$ sample sizes and location parameter $μ = π$ . In addition, the result is obtained by averaging over 100 replicates of the experiment. For reproducibility, the random seed is fixed at 520 across all experiments, with individual seeds for the 100 replicates ranging sequentially from 521 to 620. The code used in the simulation study is available at https://github.com/XiangYEstats/PCpriors-circular.

The convergence behaviour of the model is assessed using the Gelman-Rubin convergence diagnostic, $\hat{R}$ (Gelman and Rubin, 1992) (see details in the Posterior Analysis section of Stan Development Team 2025). An $\hat{R}$ value close to 1 indicates good model convergence. Models with $\hat{R} > 1.01$ raise convergence concerns and $\hat{R} > 1.1$ is generally considered unacceptable.
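For intuition about what $\hat{R}$ measures, the sketch below implements the classic potential scale reduction factor in the spirit of Gelman and Rubin (1992), comparing between-chain and within-chain variance. It is an illustration only: the diagnostic reported by Stan is a refined variant and may differ in detail, and the function name, the synthetic chains and the seed are assumptions made for this example.

```python
import math
import random
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n  # pooled variance estimate
    return math.sqrt(var_plus / within)


if __name__ == "__main__":
    rng = random.Random(520)  # arbitrary seed for the synthetic example
    good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
    # Shift one chain to mimic chains that have not converged to the same target.
    bad = good[:3] + [[x + 5.0 for x in good[3]]]
    print(gelman_rubin_rhat(good), gelman_rubin_rhat(bad))
```

Chains drawn from the same distribution give a value very close to 1, whereas the shifted chain pushes the value well above the 1.1 threshold mentioned above.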

The simulation study for the priors of the vM distribution is presented below, while the simulation studies for the priors of the cardioid and WC distributions are presented in Appendix B.

For $p (κ)$ we consider the Gamma (1, b) (exponential) priors, the $h_{2}$ and $h_{3}$ priors (Equation 2.4, MML priors), the PC prior with circular uniform base model (PCU, Equation 3.6) and the PC prior with point mass base model (PCP, Equation 3.8). The data for the simulation study are sampled with $κ = 0.02, 1, 7, 59$ . The hyperparameter of the Gamma $(1, b)$ prior takes the values $b = 0.01, 0.05, 0.1, 1$ and 5. The scaling parameters for both PCU and PCP priors are set with $U = \frac{π}{2}$ and $α = 0.01, 0.1, 0.3, 0.5, 0.7, 0.9$ and 0.99. The $\hat{R}$ values for this simulation study are shown in Figure 15. The plots indicate that all priors exhibit good convergence behaviour when using the Stan model, with the maximum $\hat{R}$ value being less than 1.004.

Figure 15

$\hat{R}$ values for κ in the simulation study. The left plot shows the maximum $\hat{R}$ for each setting and the right plot shows the $\hat{R}$ density by prior. For each setting, the $\hat{R}$ values for PCU, PCP, and exponential priors are averaged over scaling parameter values, and the $h_{2}$ and $h_{3}$ priors are averaged and presented under ‘MML’.

From the results in Figure 16, we can see that both PCU and PCP priors perform consistently across different values of α (resulting in different values of λ), whereas the posterior for Gamma ( $1, b$ ) priors behaves differently, especially when the sample size is small and the concentration is high. This indicates that the Gamma $(1, b)$ prior is sensitive to the choice of hyperparameter, whereas the PC priors show less sensitivity to the values of U and α. Notably, the $h_{2}$ prior also performs well in fitting the model. However, based on the discussion in Subsubsection 4.1.1, it is not recommended to employ the $h_{2}$ prior when the data are highly concentrated, as it may favor models that are excessively distant from the base model.

Figure 16

Simulation study for priors on the concentration parameter κ of the vM distribution across different sample sizes and true κ values. The horizontal dashed line indicates the true κ, with points showing posterior means and vertical bars indicating 95% posterior credible intervals for each prior class.

Through the comparative study presented in this section, we have illustrated a strategy for selecting priors that discourage unnecessarily complex model specifications in circular distributions, emphasizing that the PC prior is consistently an appropriate choice. We acknowledge that other priors may also perform adequately and may outperform the PC prior when sufficient prior information is available to specify them appropriately. The procedure outlined in the previous sections offers a systematic way to examine prior behaviour on the distance scale, allowing practitioners to assess whether a chosen prior may favor model specifications that are more complex than warranted by the available information.

5 Application

In this section, we present a real-data case study to show the practical performance of the discussed priors. The data used in this study are the wind data (Figure 17) stored in the ‘circular’ package (Lund et al., 2017) in R, recorded by a meteorological station at ‘Col de la Roa’ in the Italian Alps. The dataset contains 310 measurements of wind direction taken daily between 3 a.m. and 4 a.m. from 29 January 2001 to 31 March 2001. As a circular analogue of the normal distribution, the vM distribution is employed for our model, and the Bayesian model is the same as that in Subsection 4.2. For comparison, Stan models are constructed with $p (κ)$ being the PC prior, the Gamma (1, b) prior (exponential prior), the $h_{2}$ prior and the $h_{3}$ prior.

Figure 17

Wind data.

We choose b = 0.1, 1, 2, 5 and 10 for the Gamma (1, b) prior. From Figure 17, we can observe that this dataset is not highly concentrated. Therefore, for the PC prior, the circular uniform base model would be a more natural choice compared with the point mass base model. However, for the purpose of comparison and illustration, models with PC prior with both base models are constructed. We choose the scaling parameter λ following the procedure discussed in Section 3.1 with $U = π, \frac{π}{2}, \frac{π}{4}, \frac{π}{8}, \frac{π}{16}$ , indicating different radians around $x = 0$ . The α value is chosen to be 0.2, 0.35, 0.5, 0.65 and 0.8, respectively.

After fitting the model, the posterior mean and the 95% credible interval for κ are given in Figure 18. In the plot, the horizontal dashed line indicates the frequentist maximum likelihood estimate of κ, with points showing posterior means and vertical bars indicating 95% posterior credible intervals for each prior class. From the plot, the PCU and PCP priors fit the data well in terms of the stability of the credible interval and posterior mean. For this dataset, as the more natural choice, the PCU prior is indeed more robust than the PCP prior, since the value of the scaling parameter λ does not vary much with the same given U and α values.
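For reference, the maximum likelihood estimate of κ for a vM sample solves $I_{1} (κ) / I_{0} (κ) = \bar{R}$ , where $\bar{R}$ is the mean resultant length of the data. The sketch below shows one standard-library way to compute it; it is not the code used to produce Figure 18, and the series-based Bessel routine, the bisection bracket $[0, 100]$ , the function names and the synthetic sample are all assumptions made for illustration.

```python
import math
import random


def bessel_i(order, x, terms=200):
    """Modified Bessel function I_order(x), order 0 or 1, via its power series."""
    q = 0.25 * x * x
    term = 1.0 if order == 0 else 0.5 * x  # k = 0 term of the series
    total = term
    for k in range(1, terms):
        term *= q / (k * (k + order))
        total += term
        if term < 1e-17 * total:
            break
    return total


def bessel_ratio(kappa):
    """A(kappa) = I_1(kappa) / I_0(kappa), increasing in kappa."""
    return bessel_i(1, kappa) / bessel_i(0, kappa)


def mean_resultant_length(angles):
    """Length of the average unit vector of the observed angles."""
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(c, s)


def vm_kappa_mle(angles, kappa_max=100.0):
    """Solve A(kappa) = R-bar by bisection on [0, kappa_max]."""
    r_bar = mean_resultant_length(angles)
    if r_bar >= bessel_ratio(kappa_max):
        raise ValueError("sample too concentrated for the chosen bracket")
    lo, hi = 0.0, kappa_max
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid) < r_bar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = random.Random(520)  # synthetic data, not the wind dataset
    sample = [rng.vonmisesvariate(math.pi, 2.0) for _ in range(20000)]
    print(vm_kappa_mle(sample))
```

The power series is adequate for the moderate κ values in the bracket; for very large κ a scaled Bessel routine from a numerical library would be the safer choice.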

Figure 18

Posterior means and credible intervals of κ.

We evaluate predictive performance using the expected log predictive density (ELPD), leave-one-out information criterion (LOOIC) (Vehtari et al., 2017), Watanabe-Akaike information criterion (WAIC) (Watanabe and Opper, 2010) and deviance information criterion (DIC) (Spiegelhalter et al., 2002). The ELPD, LOOIC and WAIC are computed through R package loo (Vehtari et al., 2015) version 2.9.0.

The ELPD evaluates out-of-sample predictive accuracy by summing the log predictive density of each observation. Under leave-one-out cross-validation (LOO), it is defined as

{ELPD}_{loo} = \sum_{i = 1}^{N} log p (y_{i} | y_{- i})

where $p (y_{i} ∣ y_{- i})$ denotes the posterior predictive density of observation i given the data with that observation excluded. Higher ELPD values indicate better predictive performance.

In the R package loo, ${ELPD}_{loo}$ is approximated using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO) (Vehtari et al., 2017, 2024). The computation is based on the matrix of pointwise log-likelihood contributions evaluated at posterior draws. The reliability of the importance sampling approximation is assessed using Pareto- k diagnostics. Following Vehtari et al. (2017), values $k \leq 0.7$ indicate reliable approximation, values $0.7 < k \leq 1$ suggest potential instability and values $k > 1$ indicate that the approximation may fail and alternative approaches such as exact leave-one-out cross validation for those observations should be considered. We report the estimated ELPD together with its standard error (SE (ELPD)), as well as differences relative to the best-performing model ( $△ ELPD$ ) and their associated standard errors (SE ( $△ ELPD$ )).

The LOOIC is defined as LOOIC $= - 2 {ELPD}_{loo}$ , so that smaller values indicate better predictive performance. WAIC is computed from the same pointwise log-likelihood matrix and is defined as WAIC $= - 2 (lppd - p_{WAIC})$ where

lppd = \sum_{i = 1}^{N} log (\frac{1}{S} \sum_{s = 1}^{S} p (y_{i} | θ^{(s)})) and p_{WAIC} = \sum_{i = 1}^{N} {Var}_{s = 1}^{S} (log p (y_{i} ∣ θ^{(s)})) .
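The two quantities above can be computed directly from the matrix of pointwise log-likelihood values. The sketch below is a plain-Python illustration of the definitions, not the implementation in the loo package; the function names and the nested-list layout `log_lik[s][i]` (draw s, observation i) are assumptions, and the sample variance over draws is used for $p_{WAIC}$ .

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """Return (lppd, p_waic, waic) from log_lik[s][i] = log p(y_i | theta^(s))."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        lppd += log_mean_exp(column)  # log of the posterior-averaged density
        p_waic += variance(column)    # sample variance of the log density over draws
    return lppd, p_waic, -2.0 * (lppd - p_waic)


if __name__ == "__main__":
    # Toy matrix with 2 posterior draws and 2 observations (made-up numbers).
    toy = [[math.log(0.5), math.log(0.2)],
           [math.log(0.25), math.log(0.4)]]
    print(waic(toy))
```

Working on the log scale with `log_mean_exp` avoids underflow when individual likelihood values are very small, which is common for large N.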

We report LOOIC and WAIC together with their standard errors, as well as differences relative to the best-performing model and their associated standard errors. The DIC is defined as $DIC = \bar{D} + p_{D}$ , where $D (θ) = - 2 log p (y | θ)$ , $\bar{D}$ is its posterior mean and $p_{D}$ denotes the effective number of parameters. Since DIC does not naturally provide a standard error, we report only differences in DIC (ΔDIC) across models.

The predictive performance metrics are provided in Appendix C. Note that this comparison is intended to further illustrate the stability of the model, ensuring that the estimation does not deteriorate under different prior specifications. Therefore, we do not focus on identifying which prior yields the best predictive fit. For all models evaluated in this case study, all Pareto- k values are $\leq 0.7$ , indicating that the importance sampling approximation is reliable and exact leave-one-out refitting is not required.

In this application, all models have similar performance with respect to the chosen predictive metrics. The $△ ELPD, △ LOOIC, △ WAIC$ and $△ DIC$ values suggest that the predictive performance under the PCP and PCU priors is more stable across different hyperparameter settings compared with the Exp prior. Figure 18 further shows that the posterior mean and credible interval under the Exp prior are quite sensitive to the choice of b, and there is limited guidance for selecting the hyperparameter of the Gamma ( $1, b$ ) prior. In addition, the $h_{2}$ and $h_{3}$ priors both perform well in this example. As discussed in Subsubsection 4.1.1, the $h_{2}$ prior also favors a circular uniform base model. Therefore, when the data have low concentration, the $h_{2}$ prior appears to be a reasonable choice for κ.

This practical example demonstrates that the PC priors have stable and comparable performance in circular scenarios, while ensuring that the simpler model is favored a priori.

6 Conclusion and outlook

A framework for selecting priors for parameters of circular distributions is essential for constructing robust Bayesian circular models and for avoiding unnecessarily complex specifications when the parameter space is difficult to interpret. The framework proposed in this article focuses on assessing whether a prior adequately controls model complexity, thereby supporting a more reliable construction of Bayesian circular models. Additionally, it provides a simple and practical way to choose prior hyperparameters so that the resulting priors can be either strongly or weakly informative, depending on the level of prior knowledge available.

The derived PC priors are inherently designed to regularize model complexity and remain stable choices, particularly when prior knowledge is limited or uncertain. In the empirical studies considered in this article, posterior sampling under the PC prior showed no evidence of poor mixing or numerical instability, including in boundary scenarios. It is worth noting that the advantage of the PC prior lies in its control of model complexity, which helps guard against poorly supported model specifications rather than aiming to achieve the best possible fit. However, when reliable and well-calibrated prior information is available, carefully specified informative priors tailored to a particular setting may provide improved performance. In these situations, the proposed framework can be employed to evaluate prior behaviour with respect to model complexity and assess the suitability of alternative priors. The present analysis does not include a direct numerical diagnostic for excessive model complexity. Instead, model complexity is assessed theoretically by examining whether a prior places sufficient mass near the chosen base model, which can be clearly seen when representing the prior on the distance scale. While a dedicated quantitative metric could further support this assessment, introducing such tools is beyond the scope of this work and may be explored in future research.

Building upon the prior selection strategies and the PC prior for circular distributions addressed in this article, we develop a general Bayesian regression framework capable of handling circular responses, circular covariates, linear covariates or any combination of these (Ye et al., 2026). Future efforts will also focus on extending the prior selection framework to accommodate joint priors and providing conditional PC priors, which are important for handling multi-parameter and multi-modal distributions and models.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

ORCID ID

Xiang Ye

References

Banerjee

, Dhillon

, Ghosh

, Sra

and Ridgeway

(2005) Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research , 6, 1345–1382.

Boomsma

, Mardia

, Taylor

, Ferkinghoff-Borg

, Krogh

and Hamelryck

(2008) A generative, probabilistic model of local protein structure. Proceedings of the National Academy of Sciences , 105, 8932–8937.

Cabella

and Marinucci

(2009) Statistical challenges in the analysis of cosmic microwave background radiation. The Annals of Applied Statistics , 3, 61–95.

Cabral

, Bolin

and Rue

(2023) Controlling the flexibility of non-Gaussian processes through shrinkage priors. Bayesian Analysis , 18, 1223–1246.

Carpenter

, Gelman

, Hoffman

, Lee

, Goodrich

, Betancourt

, Brubaker

, Guo

, Li

and Riddell

(2017) Stan: A probabilistic programming language. Journal of Statistical Software , 76, 1–32.

Damien

and Walker

(1999) A full Bayesian analysis of circular data using the von mises distribution. The Canadian Journal of Statistics/La Revue Canadienne de Statistique , 27, 291–298.

Dhillon

and Modha

(2001) Concept decompositions for large sparse text data using clustering. Machine Learning , 42, 143–175.

Dowe

, Oliver

, Baxter

and Wallace

(1996) Bayesian estimation of the von Mises concentration parameter. In Maximum Entropy and Bayesian Methods: Santa Fe, New Mexico, USA, 1995 Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods, pages 51–60. New York: Springer.

Esteves

, Allen-Blanchette

, Makadia

and Daniilidis

(2018) Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–68. Cham:

Springer

10.

G-A

Fuglstad

, Simpson

, Lindgren

and Rue

(2019) Constructing priors that penalize the complexity of Gaussian random fields. Journal of the American Statistical Association , 114, 445–452.

11.

Gelman

and Rubin

(1992) Inference from iterative simulation using multiple sequences. Statistical Science , 7, 457–472.

12.

Gumbel

, Greenwood

and Durand

(1953) The circular normal distribution: Theory and tables. Journal of the American Statistical Association , 48, 131–152.

13.

Guttorp

and Lockhart

(1988) Finding the location of a signal: A Bayesian analysis. Journal of the American Statistical Association , 83, 322–330.

14.

Jammalamadaka

(2001) Topics in Circular Statistics . Vol. 336. Singapore: World Scientific.

15.

Jones

and Pewsey

(2005) A family of symmetric distributions on the circle. Journal of the American Statistical Association , 100, 14221428.

16.

Jung

, Foskey

and Marron

(2011) Principal arc analysis on direct product manifolds. The Annals of Applied Statistics , 5, 578–603.

17.

Kato

and Jones

(2010) A family of distributions on the circle with links to, and applications arising from, Möbius transformation. Journal of the American Statistical Association , 105, 249–262.

18.

Kullback

and Leibler

(1951) On information and sufficiency. The Annals of Mathematical Statistics , 22, 79–86.

19.

Ley

and Verdebout

(2017) Modern Directional Statistics . Boca Raton, FL: Chapman and Hall/CRC.

20.

Lund

, Agostinelli

and Agostinelli

(2017) Package ‘circular’. Repository CRAN , 775, 20–135.

21.

MacKay

(2003) Information Theory, Inference and Learning Algorithms . Cambridge: Cambridge University Press.

22.

Mardia

and Jupp

(2009) Directional Statistics . Chichester: John Wiley and Sons.

23.

Mardia

, Foldager

and Frellsen

(2018) Directional statistics in protein bioinformatics. In Applied Directional Statistics, pages 17–40. Boca Raton, FL: Chapman and Hall/CRC.

24.

Marinucci

and Peccati

(2011) Random Fields on the Sphere: Representation, Limit Theorems and Cosmological Applications . Vol. 389. Cambridge: Cambridge University Press.

25.

Marrelec

and Giron

(2024) Estimating the concentration parameter of a von mises distribution: A systematic simulation benchmark. Communications in Statistics-Simulation and Computation , 53, 117–129.

26.

Nuñez-Antonio

, Gutiérrez-Peña

and Escarela

(2011) A Bayesian regression model for circular data based on the projected normal distribution. Statistical Modelling , 11, 185201.

27.

Pardo

, Real

, Krishnaswamy

, López-Higuera

, Pogue

and Conde

(2016) Directional kernel density estimation for classification of breast tissue spectra. IEEE Transactions on Medical Imaging , 36, 64–73.

28.

Pewsey

and García-Portugués

(2021) Recent advances in directional statistics. Test , 30, 1–58.

29.

Ravindran

and Ghosh

(2011) Bayesian analysis of circular data using wrapped distributions. Journal of Statistical Theory and Practice , 5, 547–561.

30.

Rodriguez Avellaneda

, Chacón-Montalván

and Moraga

(2025) Multivariate disaggregation modelling of air pollutants: A case study of PM2.5, PM10 and ozone prediction in Portugal and Italy. The American Statistician , 79, 1–26.

31.

Simpson

, Rue

, Riebler

, Martins

and Sorbye

(2017) Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science , 32, 1–28.

32.

Sørbye

and Rue

(2017) Penalised complexity priors for stationary autoregressive processes. Journal of Time Series Analysis , 38, 923–935.

33.

Spiegelhalter

, Best

, Carlin

and Van der Linde

(2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 64, 583–639.

34.

Sra

(2016) Directional statistics in machine learning: A brief review. In Applied Directional Statistics. pages 145–160. Boca Raton, FL:

Chapman and Hall/CRC

35.

Team

Stan Development

(2025) Stan: The R interface to Stan . R package version 2.32.7. URL https://mc-stan.org/.

36.

Van Niekerk

, Bakka

and Rue

(2021) A principled distance-based prior for the shape of the Weibull model. Statistics and Probability Letters , 174, 109098.

37.

Vehtari

, Gabry

, Magnusson

, Yao

, Bürkner

, Paananen

and Gelman

(2015) loo: Efficient Leave-one-out Cross-validation and WAIC for Bayesian Models . R package version 2.32.0.

Vienna:

The R Foundation for Statistical Computing.

38. Vehtari, Gelman and Gabry (2017) Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.

39. Vehtari, Simpson, Gelman, Yao and Gabry (2024) Pareto smoothed importance sampling. Journal of Machine Learning Research, 25, 1–58.

40. Ventrucci and Rue (2016) Penalized complexity priors for degrees of freedom in Bayesian P-splines. Statistical Modelling, 16, 429–453.

41. Vuollo, Holmström, Aarnivala, Harila, Heikkinen, Pirttiniemi and Valkama (2016) Analyzing infant head flatness and asymmetry using kernel density estimation of directional surface data from a craniofacial 3D model. Statistics in Medicine, 35, 4891–4904.

42. Wallace and Dowe (1993) MML Estimation of the von Mises Concentration Parameter. Clayton, Victoria: Monash University, Department of Computer Science.

43. Watanabe and Opper (2010) Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.

44. , Van Niekerk and Rue (2026) A Bayesian regression framework for circular models with INLA. arXiv preprint arXiv:2602.08413.


