Geometric methods for the estimation of structured covariances



Consider a zero-mean, real-valued, discrete-time, stationary random process {x(t), ~ tin mathbb{Z} }. Let

 r(t):=E(x(k)x(k+t)),~ k,tinmathbb{Z},

denote the autocorrelation function, and

 T:= left[ begin{array}{cccc}
 r(0) & r(1) & ldots & r(n-1)
 r(-1) & r(0) & ldots & r(n-2)
 vdots & vdots & ddots & vdots
 r(-n+1) & r(-n+2) & ldots & r(0)
 end{array} right]

the covariance of the finite observation vector

 {{bf x}} = [x(0),~ x(1),~ldots,~x(n-1)]'

i.e. T=E({{bf x}{bf x}}'). Estimating T from an observation record is often the first step in spectral analysis. However, the sample covariance

 hat{T} :=frac{1}{m}sum_{k=1}^m {bf x}_k {bf x}_k^prime,

formed from m observation records {bf x}_1,ldots,{bf x}_m, may not have the Toeplitz structure. We seek a Toeplitz-structured estimate by solving

 min_{{color{red} T}in mathcal{T}} d({color{red}T}, {color{blue}hat{T}}), hspace*{3cm}(1)

where mathcal{T}:=left{T: Tgeq 0, T~ mbox{being Toeplitz} right}, and d represents a suitable notion of distance. In what follows, we review several notions of distance and the relations among them. If the distance is not symmetric, we write d(T||hat{T}) instead. This approach carries over verbatim to the approximation of state covariances.
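Before turning to specific distances, a minimal numpy sketch illustrates why problem (1) arises: the sample covariance of a stationary process is generally not Toeplitz. The AR(1) model and the diagonal-averaging comparison below are illustrative choices, not part of the accompanying MATLAB routines.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 50  # illustrative window length and number of records

# Draw m observation records from a stationary AR(1) process x(t) = a x(t-1) + w(t).
a = 0.8
X = np.empty((m, n))
for k in range(m):
    x = rng.standard_normal() / np.sqrt(1 - a**2)  # start in steady state
    for t in range(n):
        X[k, t] = x
        x = a * x + rng.standard_normal()

# Sample covariance: symmetric and positive semi-definite, but generally NOT Toeplitz.
T_hat = X.T @ X / m

# Compare against the matrix obtained by averaging each diagonal
# (a Toeplitz matrix agrees with its own diagonal averages).
T_proj = np.empty_like(T_hat)
for i in range(n):
    for j in range(n):
        T_proj[i, j] = np.diagonal(T_hat, offset=j - i).mean()

deviation = np.linalg.norm(T_hat - T_proj)  # strictly positive in general
```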

Likelihood divergence

Assuming that {x(t), tin mathbb{Z}} is Gaussian, and letting {bf x}:= [{bf x}_1,ldots, {bf x}_m], the joint density function is

 p({{bf x}}; T)=(2pi)^{- frac{mn} {2}} |T|^{-frac{m}{2}} exp(-frac{m}{2}{rm trace}(T^{-1}hat{T}) ),

where hat{T}=frac{1}{m}{bf x}{bf x}^prime is the sample covariance matrix. Given hat{T}, it is natural to seek a Tin mathcal{T} such that p({bf x}; T), equivalently log p({bf x}; T), is maximal. Alternatively, one may consider the likelihood divergence

 begin{array}{rl}
 d_{rm L}(T||hat{T}) := & frac{1}{m}(log p({{bf x}}; hat{T})-log p({{bf x}}; T))
 = & frac{1}{2}left(-log|hat{T}| +log|T|+ {rm trace} (hat{T}T^{-1})-n right).
 end{array}

With d_{rm L}(T||hat{T}) as the distance, the relevant optimization problem in (1) is

 min_{{color{red} T}in mathcal{T}} left{ -log|{color{blue}hat{T}}| +log|{color{red} T}|+ {rm trace} ({color{blue}hat{T}}{color{red}T}^{-1})-nright},

which is not convex. A necessary condition for a local minimum is given in [Burg, et al.]:

 {rm trace} left( ( T^{-1}hat{T} T^{-1}-T^{-1}) Q right)=0,

for all Toeplitz Q. An iterative algorithm for solving the above equation and obtaining a (local) minimum is proposed in [Burg, et al.]. This is implemented in CovEst_blw.m.
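The closed form of d_{rm L} is easy to check numerically. The following numpy sketch (illustrative only; the document's implementation is the MATLAB routine CovEst_blw.m) evaluates d_{rm L} and verifies that the stationarity condition holds trivially at T = hat{T} when hat{T} itself is positive definite Toeplitz.

```python
import numpy as np

def d_L(T, T_hat):
    """Likelihood divergence d_L(T || T_hat), closed form from the text."""
    n = T.shape[0]
    _, logdet_T = np.linalg.slogdet(T)
    _, logdet_Th = np.linalg.slogdet(T_hat)
    return 0.5 * (-logdet_Th + logdet_T + np.trace(T_hat @ np.linalg.inv(T)) - n)

# An illustrative positive definite Toeplitz matrix.
c = [1.0, 0.5, 0.25]
T_hat = np.array([[c[abs(i - j)] for j in range(3)] for i in range(3)])

# At T = T_hat the divergence vanishes and the gradient-like term
# T^{-1} T_hat T^{-1} - T^{-1} is the zero matrix, so the necessary
# condition trace((T^{-1} T_hat T^{-1} - T^{-1}) Q) = 0 holds for every Q.
Ti = np.linalg.inv(T_hat)
G = Ti @ T_hat @ Ti - Ti
```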

Kullback-Leibler divergence

The Kullback-Leibler (KL) divergence between two probability density functions p, hat{p} on mathbb{R}^{n} is

 d_{rm KL}(p|| hat{p}) :=int_{mathbb{R}^n} plog (frac{p}{hat{p}}) dx.

In the case where p and hat{p} are normal with zero-mean and covariance T and hat{T}, respectively, their KL divergence becomes

 d_{rm KL}(p|| hat{p}) =frac{1}{2}left(log|hat{T}| -log|T|+ {rm trace} (That{T}^{-1})-n right),

while switching the order recovers the likelihood divergence given earlier, i.e., d_{rm KL}(hat{p}|| p)= d_{rm L}(T||hat{T}). If d_{rm KL}(p|| hat{p}) is used in (1), then the optimization problem becomes

 min_{{color{red} T}in mathcal{T}} left{ log|{color{blue}hat{T}}| -log|{color{red}T}|+ {rm trace} ({color{red}T}{color{blue}hat{T}}^{-1})-nright},

which is convex. This is implemented in CovEst_KL.m, which requires the installation of cvx.
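As a quick sanity check of the closed form above (a numpy sketch; the matrices A, B are arbitrary illustrative covariances), one can verify that d_{rm KL} vanishes when the two covariances coincide, is positive otherwise, and is not symmetric in its arguments:

```python
import numpy as np

def d_KL(T, T_hat):
    """KL divergence between N(0, T) and N(0, T_hat), closed form from the text."""
    n = T.shape[0]
    _, logdet_T = np.linalg.slogdet(T)
    _, logdet_Th = np.linalg.slogdet(T_hat)
    return 0.5 * (logdet_Th - logdet_T + np.trace(T @ np.linalg.inv(T_hat)) - n)

# Two illustrative positive definite covariances.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.0, -0.2], [-0.2, 1.5]])
```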

Fisher metric and geodesic distance

The KL divergence induces the Fisher metric, which is the quadratic term of d_{rm KL}(p|| p+delta)

 g_{p,{rm Fisher}}(delta)=int frac{delta^2}{p}dx.

For probability distributions parameterized by a vector theta, the corresponding metric is often referred to as the Fisher-Rao metric and given by

 g_{p,{rm Fisher-Rao}}(delta_theta)=delta_{theta}'Eleft[ left( frac{partial log p}{partialtheta}right)left( frac{partial log p}{partialtheta}right)'  right]delta_{theta}.

For zero-mean Gaussian distributions parametrized by the corresponding covariance matrices, it becomes the Rao metric

 g_{T, {rm Rao}}(Delta)=left|T^{-1/2} Delta T^{-1/2} right|_F^2.

The geodesic distance between two points T and hat{T} is the log-deviation distance

 d_{rm Log}(T,hat{T})=|log (hat{T}^{-1/2}T hat{T}^{-1/2}) |_F.

When d_{rm Log}(T,hat{T}) is used in (1), the corresponding optimization problem is not convex. Linearization of the objective function about hat{T} may be used instead; this leads to

 min_{{color{red}T}in mathcal{T}}left{ |{color{blue}hat{T}}^{-1/2}{color{red}T} {color{blue}hat{T}}^{-1/2}-I |_Fright}

which is convex.
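The geodesic distance d_{rm Log} can be evaluated with two eigendecompositions. The sketch below (numpy, illustrative matrices) also checks a known property of this metric, not stated above: invariance under congruence T -> M T M' for any invertible M.

```python
import numpy as np

def sqrtm_spd(A):
    """Principal square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

def logm_spd(A):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def d_log(T, T_hat):
    """Geodesic (log-deviation) distance |log(T_hat^{-1/2} T T_hat^{-1/2})|_F."""
    S = np.linalg.inv(sqrtm_spd(T_hat))
    return np.linalg.norm(logm_spd(S @ T @ S))

# Illustrative positive definite matrices and an invertible congruence.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.2], [0.2, 3.0]])
M = np.array([[1.0, 1.0], [0.0, 2.0]])

d_AB = d_log(A, B)
d_congruent = d_log(M @ A @ M.T, M @ B @ M.T)  # congruence invariance
```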

Bures metric and Hellinger distance

The Fisher information metric is the unique Riemannian metric for which stochastic maps are contractive [N. Cencov, 1982]. In quantum mechanics, a similar property has been sought for density matrices, which are positive semi-definite matrices with trace equal to one. There are several metrics for which stochastic maps are contractive. They take the form

 {rm trace}(Delta D_{T}(Delta)).

For example, in the Bures metric

 D_{T,{rm Bures}}(Delta):=M, ~mbox{where~}frac12 (TM+MT)=Delta.

The Bures metric can also be written as

 g_{T, {rm Bures}}(Delta) =min_{Y, W} Big{ ~|Y|_F^2 ~mid~ Delta=frac12(YW'+WY'),  ~mbox{and}~T=WW' Big}.

If {rm trace}(T)=|W|_F^2=1, we may regard W as a point on the unit sphere; this point is not unique, since W may be replaced by WU for any orthogonal U without changing T=WW'. The geodesic distance between two density matrices is the smallest great-circle distance between corresponding points, which is called the Bures length. The smallest “straight-line” distance is the Hellinger distance, which is

 begin{array}{rl}
 d_{rm H}(T, hat{T}) := & min_{U, V} left{ |T^{frac12}U-hat{T}^{frac12}V|_F ~mid~ UU'=I,~ VV'=I right}
 = & min_{U} left{ |T^{frac12}U-hat{T}^{frac12}|_F ~mid~ UU'=I right}.
 end{array}

The distance has a closed form

 d_{rm H}(T, hat{T})=left({rm trace}(T+hat{T}-2(hat{T}^frac12 T hat{T}^frac12)^frac12) right)^frac12.

When the above distance is used in (1), the corresponding optimization problem is actually equivalent to a linear matrix inequality (LMI) [1]. This will be shown in the following section.
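The closed form can be checked against the defining minimization by sampling orthogonal matrices U: every sample gives an upper bound on d_{rm H}. A numpy sketch with illustrative matrices A, B:

```python
import numpy as np

def sqrtm_spd(A):
    """Principal square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

def d_H(T, T_hat):
    """Hellinger distance, closed form from the text."""
    C = sqrtm_spd(T_hat)
    val = np.trace(T) + np.trace(T_hat) - 2.0 * np.trace(sqrtm_spd(C @ T @ C))
    return np.sqrt(max(val, 0.0))

A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.array([[1.0, 0.5], [0.5, 1.0]])

# Every orthogonal U gives an upper bound |A^{1/2} U - B^{1/2}|_F >= d_H(A, B).
rng = np.random.default_rng(1)
bounds = []
for _ in range(20):
    Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal matrix
    bounds.append(np.linalg.norm(sqrtm_spd(A) @ Q - sqrtm_spd(B)))
```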

Transportation distance

Let {bf x} and bf{y} be random variables in mathbb{R}^n having pdf's p_x and p_y. A formulation of the Monge-Kantorovich transportation problem (with a quadratic cost) is as follows. Determine

 d^2_{rm W_2} (p_x, p_y)!:=!inf_{p}left{E (|{bf x}-{bf y}|^{2})~mid~ int_x p!=!p_y,int_y p!=!p_x right},

where p is the joint distribution of {bf x} and {bf y}. The metric d_{{rm W}_2} is known as the Wasserstein metric. For {bf x} and {bf y} zero-mean and jointly normal with covariances T and hat{T}, respectively, we obtain

 d^2_{rm{W_2}}(p, hat{p})= min_{S}left{ {rm trace}(T+hat{T}-S-S') ~mid~ left[ begin{array}{cc}
 T & S
 S' & hat{T}
 end{array} right] geq 0 right},

where S:=E({bf x} bf{y}^prime). The distance d_{rm{W_2}}(p, hat{p}) has a closed form which is given by

 d_{rm{W_2}}(p, hat{p})=left({rm trace}(T+hat{T}-2(hat{T}^frac12 T hat{T}^frac12)^frac12) right)^frac12.

Hence d_{rm{W_2}}(p, hat{p})=d_{rm H}(T, hat{T}). Using either d_{rm{W_2}}(p, hat{p}) or d_{rm H}(T, hat{T}), the relevant optimization problem (1) becomes

 min_{{color{red}T}in mathcal{T},; S}left{ {rm trace}({color{red}T}+{color{blue}hat{T}}-S-S') ~mid~ left[ begin{array}{cc}
 {color{red}T} & S
 S' & {color{blue}hat{T}}
 end{array} right] geq 0 right}.

This is an LMI, and thus a convex optimization problem. It is implemented in CovEst_transp.m.
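The equivalence between the LMI and the closed form can also be verified numerically: a standard optimal-transport fact (not derived above) gives the optimal coupling S explicitly, and with this S the LMI constraint is satisfied while the cost matches the closed form. A numpy sketch with illustrative T, hat{T}:

```python
import numpy as np

def sqrtm_spd(A):
    """Principal square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

T = np.array([[2.0, 0.4], [0.4, 1.0]])
T_hat = np.array([[1.5, -0.3], [-0.3, 2.0]])

# Optimal coupling for Gaussian transport (standard fact):
#   S = T_hat^{-1/2} (T_hat^{1/2} T T_hat^{1/2})^{1/2} T_hat^{1/2}.
H = sqrtm_spd(T_hat)
S = np.linalg.inv(H) @ sqrtm_spd(H @ T @ H) @ H

cost = np.trace(T + T_hat - S - S.T)                           # LMI objective at S
closed_form = np.trace(T + T_hat - 2.0 * sqrtm_spd(H @ T @ H))

# The block matrix [[T, S], [S', T_hat]] is positive semi-definite.
block = np.block([[T, S], [S.T, T_hat]])
min_eig = np.linalg.eigvalsh(0.5 * (block + block.T)).min()
```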

An alternative interpretation for the transportation distance is as follows. We postulate the stochastic model

 {bf hat x}={bf x}+{bf v}

where {bf v} represents noise, and hat{T} and T denote the covariances of {bf hat x} and {bf x}, respectively. Seeking the least possible noise variance, while allowing possible coupling between {bf x} and {bf v}, yields precisely the minimum transportation distance. An analogous rationale, albeit with different assumptions, has been used to justify other methods. For instance, assuming that {bf hat x}= {bf x}+{bf v} with {bf x} and {bf v} independent leads to

 min_{Tinmathcal{T}}left{ {rm trace}(hat{T}-T)~mid~ hat{T}-Tgeq0right}.

This is implemented in CovEst_Q.m (SCovEst_Q.m for state-covariance estimation). Alternatively, assuming a “symmetric” noise contribution as in

 { bf hat {x}}+{bf hat{v}}={bf x}+{bf v},

where the noise vectors hat{bf v} and {bf v} are independent of {bf x} and {bf hat{x}}, we are led to solve

 min_{Tin {mathcal{T}}, Q,hat{Q}}left{ {rm trace}(hat{Q}+Q)~mid~ hat{T}+hat{Q}=T+Q,~Q,hat{Q}geq0right},

where hat{Q} and Q designate covariances of hat{bf v} and {bf v}, respectively. This is implemented in CovEst_QQh.m (SCovEst_QQh.m for state covariance estimation).

Mean-based covariance approximation

We investigate the concept of the mean as a way to fuse together sampled statistics from different sources into a structured one. In particular, given a set of positive semi-definite matrices {hat{T}_i, ~i=1,ldots, N }, we consider the mean-based structured covariance

 T^*:=argmin_{Tin{mathcal{T}}}~ frac{1}{N}sum_{i=1}^N  d^2(T,hat{T}_i),

for a metric d(T,hat{T}) or a divergence d(T||hat{T}) (and a suitable choice of power). Below, we consider each of the distances mentioned earlier.

Mean based on KL divergence and likelihood

The optimization problem based on d_{rm KL} is

 min_{Tin mathcal{T} } ~!!left{frac{1}{N}sum_{i=1}^{N}left(log |hat{T}_i|!-!log |T|!+!{rm trace}(That{T}_i^{-1})!-!nright)right},

and is equivalent to

 min_{Tin mathcal{T} }~left{-log |T|+{rm trace}(T hat{T}^{-1})right},

where hat{T}=left(frac{1}{N}sum_{i=1}^N hat{T}_i^{-1}right)^{-1} is the harmonic mean of the hat{T}_i's. On the other hand, if the likelihood divergence d_{rm L} is used, the optimization problem

 min_{Tin mathcal{T} }~!! left{frac{1}{N}sum_{i=1}^{N}left(-log|hat{T}_i|!+!log|T|!+!{rm trace} (hat{T}_i T^{-1})!-!nright)right}

is equivalent to

 min_{Tin mathcal{T} }~left{log|T|+{rm trace} (hat{T} T^{-1})right},

where hat{T}=frac{1}{N}sum_{i=1}^N hat{T}_i is the arithmetic mean of hat{T}_i's.
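The two divergence-based means thus differ only in which mean of the hat{T}_i enters the problem. The sketch below (numpy, illustrative matrices) computes both and checks the classical ordering of the harmonic and arithmetic means in the positive semi-definite order, a standard fact not stated above:

```python
import numpy as np

T1 = np.array([[2.0, 0.5], [0.5, 1.0]])
T2 = np.array([[1.0, -0.2], [-0.2, 3.0]])

arith = 0.5 * (T1 + T2)                                              # likelihood-based mean
harm = np.linalg.inv(0.5 * (np.linalg.inv(T1) + np.linalg.inv(T2)))  # KL-based mean

# Harmonic mean <= arithmetic mean in the positive semi-definite order.
gap_eigs = np.linalg.eigvalsh(arith - harm)
```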

Mean based on log-deviation

In this case

 min_{Tin mathcal{T}}~left{frac{1}{N}sum_{i=1}^{N} |log (hat{T}_i^{-1/2}T hat{T}_i^{-1/2}) |_F^2right}

is not convex in T. If the admissible set mathcal{T} is relaxed to the set of positive definite matrices, the minimizer is precisely the geometric mean, and it is the unique positive definite solution to the following equation

 sum_{i=1}^N log(hat{T}_i^{-1}T)=0.

When N=2, T is the unique geometric mean of the two positive definite matrices

 T=hat{T}_1^frac12 (hat{T}_1^{-frac12}hat{T}_2hat{T}_1^{-frac12})^frac12 hat{T}_1^frac12.
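This N=2 geometric mean can be computed directly. The numpy sketch below checks two standard properties of it (known facts, not derived above): symmetry in the two arguments, and the Riccati characterization T hat{T}_1^{-1} T = hat{T}_2.

```python
import numpy as np

def sqrtm_spd(A):
    """Principal square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

def geometric_mean(A, B):
    """A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}, as in the text."""
    R = sqrtm_spd(A)
    Rinv = np.linalg.inv(R)
    return R @ sqrtm_spd(Rinv @ B @ Rinv) @ R

T1 = np.array([[2.0, 0.3], [0.3, 1.0]])
T2 = np.array([[1.0, 0.4], [0.4, 2.0]])

G = geometric_mean(T1, T2)
G_swapped = geometric_mean(T2, T1)   # the mean is symmetric in its arguments
riccati = G @ np.linalg.inv(T1) @ G  # satisfies G T1^{-1} G = T2
```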

One may consider instead the linearized optimization problem

 min_{Tin mathcal{T}}~left{sum_{i=1}^N |hat T_i^{-1/2}That T_i^{-1/2}-I|_F^2 right}

which is convex in T.

Mean based on transportation/Bures/Hellinger distance

Using the transportation/Bures/Hellinger distance, the optimization problem becomes

 min_{Tin mathcal{T},; S_i}~bigg{ frac{1}{N}sum_{i=1}^N {rm trace}(T+hat{T}_i-S_i-S_i') ~mid~ left[ begin{array}{cc}
 T & S_i
 S_i' & hat{T}_i
 end{array} right] geq 0,~ i=1,ldots, N bigg}.

When T is only constrained to be positive semi-definite, it has been shown that the transportation mean T is the unique solution of the following equation

  frac{1}{N}sum_{i=1}^N (T^frac12 hat{T}_i T^frac12)^frac12=T.
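When the Toeplitz constraint is dropped, the fixed-point equation above suggests an iteration (a known scheme for Wasserstein barycenters of Gaussian distributions; the scheme itself is not described in the text). The numpy sketch below runs it on three illustrative covariances and checks the residual of the defining equation.

```python
import numpy as np

def sqrtm_spd(A):
    """Principal square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

T_hats = [np.array([[2.0, 0.3], [0.3, 1.0]]),
          np.array([[1.0, -0.2], [-0.2, 1.5]]),
          np.array([[1.2, 0.4], [0.4, 0.8]])]
N = len(T_hats)

# Fixed-point iteration suggested by the equation above:
#   T <- T^{-1/2} ( (1/N) sum_i (T^{1/2} T_hat_i T^{1/2})^{1/2} )^2 T^{-1/2}
T = np.mean(T_hats, axis=0)  # start at the arithmetic mean
for _ in range(200):
    R = sqrtm_spd(T)
    Rinv = np.linalg.inv(R)
    M = sum(sqrtm_spd(R @ Th @ R) for Th in T_hats) / N
    T = Rinv @ (M @ M) @ Rinv
    T = 0.5 * (T + T.T)  # symmetrize against round-off

# Residual of the defining equation (1/N) sum_i (T^{1/2} T_hat_i T^{1/2})^{1/2} = T.
R = sqrtm_spd(T)
residual = np.linalg.norm(sum(sqrtm_spd(R @ Th @ R) for Th in T_hats) / N - T)
```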

The routine Tmean.m, which relies on cvx, solves the mean-based Toeplitz estimation problem.


Identifying a spectral line in noise

We compare the performance of the likelihood estimation and the transportation-based method in identifying a single spectral line in white noise. Consider a single “observation record”

 x(t)=cos(frac{pi}{4}t+phi)+v(t),; t=0,1,ldots, 10,

with the “noise” v generated from a zero-mean Gaussian distribution. The resulting sample covariance matrix

 hat{T}=left[ begin{array}{c}
 x(0)
 x(1)
 vdots
 x(10)
 end{array} right] left[ begin{array}{cccc} x(0) & x(1) & cdots & x(10) end{array} right]

is 11times 11 and has rank equal to 1. We use CovEst_blw and CovEst_transp to approximate hat{T} with a covariance having the admissible structure. We compare the corresponding maximum-entropy spectral estimates (subplots 2 and 3) with the spectrum obtained using pburg (a MATLAB command), shown in subplot 1. The true spectral line is marked by an arrow in each plot.


Averaging multiple spectra

Consider 3 different runs

     x_k(t)=cos(omega_k t+phi_k)+v_k(t),~t=0,1,ldots, 10

with omega_1=frac{pi}{4}-frac{pi}{100}, omega_2=frac{pi}{4}, omega_3=frac{pi}{4}+frac{pi}{100}, phi_1=frac{pi}{4}, phi_2=frac{2pi}{4}, phi_3=frac{3pi}{4}, and v_1, v_2, v_3 generated from a zero-mean Gaussian distribution with variance 0.0025. We consider the three rank-one sample covariances. We estimate the likelihood and transportation means of these three sample covariances with the requirement that the mean also be Toeplitz (using CovEst_blw and Tmean), and plot the corresponding maximum-entropy spectra. For comparison, we also average the three observation records and compute Burg's spectrum from the averaged data. The three results are shown in the following figure. Both the likelihood and Burg's methods produce two peaks near the central frequency frac{pi}{4}, whereas the transportation-based method produces a single peak at approximately the mean of the three instantiations of the underlying sinusoid.




[1] L. Ning, X. Jiang and T.T. Georgiou, “Geometric methods for the estimation of structured covariances,” submitted.