Big data problems in Geostatistics have generated substantial interest over the last decade. Gaussian Processes (GPs) based models are among the most popular for modeling location-referenced spatial data. Fitting GP based models to large datasets, however, requires matrix operations that quickly become prohibitive (see, e.g., S. Banerjee (2017)). An early approach of approximating the likelihood of GP based models (Vecchia 1988) has attracted much attention recently due to its proven effectiveness (also see Stein (2014)). More recently, innovated by Vecchiaâ€™s approach, A. Datta et al. (2016) proposed an approach of constructing a GP called Nearest Neighbor Gaussian Process (NNGP) based upon a given GP (parent GP). The merit of NNGP is that it introduces sparsity in the precision matrix of its finite realization, which lessens the storage requirement and computational burden in the model fitting, while the inference obtained from an NNGP based model is comparable to its parent GP based model. Some of the practical implementations of the NNGP is discussed in Zhang, Datta, and Banerjee (2018), while Katzfuss and Guinness (2017) discuss certain generalizations.

In this note, we focus on discussing an interesting claim about the performance of two NNGP based models: the response NNGP and latent NNGP models. We use Kullback-Leibler divergence (KL-D) as the criterion in the performance comparison. The note is organized as follows. First, we introduce the KL-D and discuss the KL-D for Gaussian models with latent variables. Second, we clarify a result from Katzfuss and Guinness (2017) comparing the response NNGP and latent NNGP models with some numerical examples. Third, we suggest a heuristic explanation on why in general we observe that the latent NNGP model outperforms the response NNGP model in approximating the parent GP based model.

The Kullback-Leibler divergence (KL-D) is widely used as a measure of the difference between two probability distributions. The KL-D from probability distribution \(Q\) to probability distribution \(P\) is defined as

\[\begin{equation} D_{KL}(P || Q) = \int \log{\frac{dP}{dQ}}dP\; , \; \end{equation}\] where \(\frac{dP}{dQ}\) is the Radon-Nikodym derivative of \(P\) with respect to \(Q\). In particular, for two Gaussian distributions \(P, Q\) on \(\mathcal{R}^N\) with mean vector \(\mu_P, \mu_Q\) and covariance matrix \(\Sigma_P, \Sigma_Q\), respectively, the KL-D from \(Q\) to \(P\) is given by:

\[\begin{equation}\label{eq: KL-D_Gaussian} D_{KL}(P || Q) = -\frac{1}{2}\{\log{det\Sigma_P} - \log{det\Sigma_Q} + N - tr(\Sigma_Q^{-1}\Sigma_P) - (\mu_P - \mu_Q)^{\top} \Sigma_Q^{-1} (\mu_P - \mu_Q)\} \end{equation}\]

A KL-D of zero indicates that \(P, Q\) satisfy \(P = Q\) almost everywhere. The higher the KL-D is, the more differently the two distributions behave. Thus, the value of KL-D is popularly recognized as a criterion in comparing the performance of models. A valid comparison based on KL-D requires that the distributions based on models are all defined on the same space. For hierarchical models, latent variables are often introduced to model unobserved variables, which results in a distribution on the joint space of the latent and observed variables. It is common to use the joint distribution of the latent and observed variables to calculate KL-D for theoretical comparison. However, in practice, the hierarchical models with latent variables are not preferred for obtaining inference in Markov Chain Monte Carlo (MCMC) algorithms. The latent variables are treated as parameters to be sampled since they are unobserved, but needed in each iteration of MCMC algorithms. This enlarges the dimension of the parameter space and dramatically lowers the sampling efficacy in MCMC algorithms. It is popular to integrate out latent variables and use the marginal distribution on the collapsed space to obtain the inference in MCMC algorithms. While one model may be better than another based on the KL-Ds on the joint space of observed and latent variables, the question remains whether the model will still be better on the collapsed space of observed variables. The following example shows that marginalization neither preserves volume nor keeps the order of KL-D.

Assume the true model is \(y | w \sim N(w, 1)\) with latent variable \(w \sim N(0, 1)\), now consider two models

\[\begin{equation} \begin{aligned} model1: y | w &\sim N(w, 0.5)\; , \; w \sim N(0, 0.5) \; \\ model2: y | w &\sim N(w, 2.5)\; , \; w \sim N(0, 0.5) \end{aligned} \end{equation}\] The true distribution \(P\) and the probability distributions for two models \(Q_1, Q_2\) follows

\[\begin{equation} \begin{aligned} P(y, w) \sim N \left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}2.0 & 1.0\\ 1.0 & 1.0 \end{bmatrix} \right) ; \; &P(y) \sim N(0, 2.0)\\ Q_1(y, w) \sim N \left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1.0 & 0.5\\ 0.5 & 0.5 \end{bmatrix} \right) ; \; &Q_1(y) \sim N(0, 1.0)\\ Q_2(y, w) \sim N \left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}3.0 & 0.5\\ 0.5 & 0.5 \end{bmatrix} \right) ; \; &Q_2(y) \sim N(0, 3.0) \end{aligned} \end{equation}\] Then through \eqref{eq: KL-D_Gaussian}, the KL-D from \(Q1\) to \(P\) and the KL-D from \(Q2\) to \(P\) satisfy:

\[\begin{equation}\label{eq: KL-D_com1} D_{KL}(P(y, w) || Q_1(y, w)) < D_{KL}(P(y, w) || Q_2(y, w)) ; \; D_{KL}(P(y) || Q_1(y)) > D_{KL}(P(y) || Q_2(y)) \end{equation}\]

This indicates that \(Q_1\) is closer to \(P\) than \(Q_2\) on the joint space of observed and latent variables, while \(Q_1\) is further from \(P\) than \(Q_2\) on the collapsed space of observed variables. Now, let the second model be:

\[\begin{equation} model2: y | w \sim N(w, 1.5)\; , \; w \sim N(0, 1.5) \end{equation}\] With the correponding \(Q_2\)

\[\begin{equation} Q_2(y, w) \sim N \left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}3.0 & 1.5\\ 1.5 & 1.5 \end{bmatrix} \right) ; \; Q_2(y) \sim N(0, 3.0) \end{equation}\] Then the KL-D comparison becomes:

\[\begin{equation} D_{KL}(P(y, w) || Q_1(y, w)) > D_{KL}(P(y, w) || Q_2(y, w)); \; D_{KL}(P(y) || Q_1(y)) > D_{KL}(P(y) || Q_2(y))\; , \end{equation}\] which shows that \(Q_1\) is closer to \(P\) than \(Q_2\) on both the joint and the collapsed spaces. The above example illustrates that the comparison made on the augmented space may not hold on the collapsed space when using KL-Ds. In the next section, we will show that a claim in (Katzfuss and Guinness 2017) may fail when we use KL-Ds on a collapsed space.

The preceding section illustrates the comparison of KL-Ds with a trivial hierarchical model. In this section, we discuss the performance of NNGP based models using the KL-Ds from them to their parent GP based models. Consider \(\{y(s): s \in \mathcal{D}\}\) be the process of interest over domain \(\mathcal{D} \subset R^d, d \in N^+\), and \(y(s)\) can be decomposed as \(y(s) = w(s) + \epsilon(s)\) for some latent process \(w(s)\) and white noise process \(\epsilon(s)\). Assume that \(y(s)\) and \(w(s)\) are known to follow certain GPs. We use an NNGP derived from the true GP in modeling as an alternative of the true GP. Depending on which process NNGP is assigned for, there will be a response NNGP model and a latent NNGP model. The former assigns NNGP for \(y(s)\) and the latter assigns NNGP for \(w(s)\). Using a Directed Acyclic Graph (DAG) built on an augmented latent space, Katzfuss and Guinness (2017) show that the KL-D from the latent NNGP model to the true model is no more than that of the response NNGP model, suggesting that the latent NNGP model is better than the response NNGP model. However, as pointed out in the last section, the claim based on KL-Ds on an augmented space is not guaranteed to hold on a collapsed space. Here we provide an example with numerical results to show a response NNGP model that might outperform a latent NNGP model on a collapsed space.

Assume the observed location set is \(S = \{s_1, s_2, s_3\}\), \(w(s_1, s_2, s_3)\) has covariance matrix \(\sigma^2 R\) with correlation matrix:

\[\begin{equation} R = \begin{bmatrix} 1& \rho_{12} & \rho_{13}\\ \rho_{12}& 1 & \rho_{23} \\ \rho_{13}& \rho_{23} & 1 \end{bmatrix} \end{equation}\]

Let us suppress the connection between knots \(s_2\) and \(s_3\) in the DAG corresponding to the finite realization of the NNGP on \(S\). Then the covariance matrix of \(y(s_1, s_2, s_3)\) of the response NNGP model \(\Sigma_R\) and that of the latent NNGP model \(\Sigma_l\) have the following forms:

\[\begin{equation} \Sigma_R = \sigma^2 \begin{bmatrix} 1 + \delta^2 & \rho_{12} & \frac{\rho_{12} \rho_{23}}{ 1 + \delta^2}\\ \rho_{12}& 1 + \delta^2 & \rho_{23} \\ \frac{\rho_{12} \rho_{23}}{ 1 + \delta^2}& \rho_{23} & 1 + \delta^2 \end{bmatrix}\; , \; \Sigma_l = \sigma^2 \begin{bmatrix} 1 + \delta^2 & \rho_{12} & \rho_{12} \rho_{23}\\ \rho_{12}& 1 + \delta^2 & \rho_{23} \\ \rho_{12} \rho_{23}& \rho_{23} & 1 + \delta^2 \; , \end{bmatrix} \end{equation}\]

where \(\delta^2 = \frac{\tau^2}{\sigma^2}\) is the noise-to-signal ratio with \(\tau^2\) as the variance of the noise process \(\epsilon(s)\). By the sufficient and necessary condition of \(R\) being positive-definite:

\[\begin{equation} 1- (\rho_{12}^2 + \rho_{13}^2 + \rho_{23}^2) + 2\rho_{12}\rho_{13}\rho_{23} > 0 \; ,\; 1 - \rho_{12}^2 > 0\; , \end{equation}\] it is easy to show that \(\Sigma_R\) and \(\Sigma_l\) are also positive-definite. If \(\rho_{13} = \frac{\rho_{12} \rho_{23}}{ 1 + \delta^2}\), then the KL-D from the response NNGP model to the true model always equals zero, which is no more than the KL-D from the latent NNGP model to the true model. If \(\rho_{13} = \rho_{12} \rho_{23}\), then the KL-D of the latent NNGP model to the true model always equals zero, which reverses the relationship.

This chunk shows a numerical study of the above example:

```
# Parameters settings:
sigma.sq = 2; tau.sq = 1; rho12 = 0.6; rho23 = 0.6; deltasq = tau.sq / sigma.sq
# Set the value of rho13
rho13 = rho12 * rho23/(1 + deltasq)
# Covariance matrix of GP (Sigma), response NNGP (Sigma_R) and latent NNGP (Sigma_l)
R <- matrix(c(1, rho12, rho13, rho12, 1, rho23, rho13, rho23, 1), nrow = 3)
Sigma <- sigma.sq*(R + (deltasq)*diag(3))
Sigma_R <- sigma.sq*matrix(c(1 + deltasq, rho12, rho12*rho23/(1 + deltasq),
rho12, 1 + deltasq, rho23,
rho12*rho23/(1 + deltasq), rho23, 1 + deltasq), nrow = 3)
Sigma_l <- sigma.sq*matrix(c(1 + deltasq, rho12, rho12*rho23,
rho12, 1 + deltasq, rho23,
rho12*rho23, rho23, 1 +deltasq), nrow = 3)
# Test whether all the above matrics are positive definited or not
((min(eigen(R)$values) > 0) & (min(eigen(Sigma)$values) > 0) &
(min(eigen(Sigma_R)$values) > 0) & (min(eigen(Sigma_l)$values) > 0))
```

`[1] TRUE`

```
# KL-D_R - KL-D_l
diff = 0.5 * (determinant(Sigma_R)$modulus[1] - determinant(Sigma_l)$modulus[1]
+ 3 - sum(diag(chol2inv(chol(Sigma_l)) %*% Sigma))); diff
```

`[1] -0.004597478`

```
# Reset the value of rho13
rho13 = rho12 * rho23
# Covariance matrix of GP (Sigma), response NNGP (Sigma_R) and latent NNGP (Sigma_l)
R <- matrix(c(1, rho12, rho13, rho12, 1, rho23, rho13, rho23, 1), nrow = 3)
Sigma <- sigma.sq*(R + (deltasq)*diag(3))
Sigma_R <- sigma.sq*matrix(c(1 + deltasq, rho12, rho12*rho23/(1 + deltasq),
rho12, 1 + deltasq, rho23,
rho12*rho23/(1 + deltasq), rho23, 1 + deltasq), nrow = 3)
Sigma_l <- sigma.sq*matrix(c(1 + deltasq, rho12, rho12*rho23,
rho12, 1 + deltasq, rho23,
rho12*rho23, rho23, 1 +deltasq), nrow = 3)
# Test whether all the above matrics are positive definited or not
((min(eigen(R)$values) > 0) & (min(eigen(Sigma)$values) > 0) &
(min(eigen(Sigma_R)$values) > 0) & (min(eigen(Sigma_l)$values) > 0))
```

`[1] TRUE`

```
# KL-D_R - KL-D_l
diff = 0.5 * (determinant(Sigma_R)$modulus[1] - determinant(Sigma_l)$modulus[1]
+ 3 - sum(diag(chol2inv(chol(Sigma_l)) %*% Sigma))); diff
```

`[1] 0.00455584`

More examples were found in the trial and error of this study. We illstruate a simulation with 500 observations below, where the KL-D on the collapsed space shows that the response NNGP model is better than the latent NNGP model.

```
setwd("/Users/luzhang/Documents/github/interpolation_spatial/KL-D_com")
rm(list = ls())
library(MASS)
library(fields)
library(spBayes)
library(spNNGP) # Build neighbor index
source("NNmatrix.R") # Build matrix including nearest neighbor information
source("ADfunctionsMatern.R") # Build AD
#------------------------------- Parameter set -------------------------------#
sigma.sq = 2
tau.sq = 0.25
phi = 16
N = 500
nu = 1.5
set.seed(124)
coords <- cbind(runif(N), runif(N))
#---------------------- Build neighbor index by NNMatrix ----------------------#
M = 3 # Number of Nearest Neighbors
NN.matrix <- NNMatrix(coords = coords, n.neighbors = M, n.omp.threads = 2)
str(NN.matrix)
```

```
List of 5
$ ord : int [1:500] 66 458 445 431 169 340 269 382 364 136 ...
$ coords.ord: num [1:500, 1:2] 0.000451 0.00244 0.00391 0.004332 0.006786 ...
$ NN_ind : num [1:499, 1:3] 1 1 2 4 1 1 7 3 4 9 ...
$ NN_distM : num [1:499, 1:3] 0 0.00971 0.00971 0.27809 0.00971 ...
$ NN_dist : num [1:499, 1:3] 0.00971 0.21106 0.27809 0.16939 0.00679 ...
```