New Journal of Physics – the open-access journal for physics

Quantum learning: asymptotically optimal classification of qubit states

Mădălin Guță^1,3 and Wojciech Kotłowski^2

^1 School of Mathematical Sciences, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
^2 CWI, Science Park 123, 1098 XG, Amsterdam, The Netherlands

E-mail: [email protected]

New Journal of Physics 12 (2010) 123032 (21pp)

Received 15 April 2010
Published 23 December 2010
Online at http://www.njp.org/
doi:10.1088/1367-2630/12/12/123032

Abstract. Pattern recognition is a central topic in learning theory, with numerous applications such as voice and text recognition, image analysis and computer diagnosis. The statistical setup in classification is the following: we are given an i.i.d. training set (X_1, Y_1), ..., (X_n, Y_n), where X_i represents a feature and Y_i ∈ {0, 1} is a label attached to that feature. The underlying joint distribution of (X, Y) is unknown, but we can learn about it from the training set, and we aim at devising low-error classifiers f : X → Y used to predict the label of new incoming features. In this paper, we solve a quantum analogue of this problem, namely the classification of two arbitrary unknown mixed qubit states. Given a number of 'training' copies from each of the states, we would like to 'learn' about them by performing a measurement on the training set. The outcome is then used to design measurements for the classification of future systems with unknown labels. We find the asymptotically optimal classification strategy and show that typically it performs strictly better than a plug-in strategy, which consists of estimating the states separately and then discriminating between them using the Helstrom measurement. The figure of merit is the excess risk, equal to the difference between the probability of error and the probability of error of the optimal measurement for known states. We show that the excess risk scales as n^{-1} and compute the exact constant of the rate.

^3 Author to whom any correspondence should be addressed.


Contents

1. Introduction
2. Classical and quantum learning
   2.1. Classical learning
   2.2. Quantum learning
   2.3. Local minimax formulation of optimality
3. Local formulation of the qubit state classification problem
   3.1. The loss function
   3.2. The training set
4. Optimal classifier
   4.1. Plug-in classifier based on optimal state estimation
   4.2. The case of unknown priors
5. Conclusions
Acknowledgments
Appendix. Local asymptotic normality
References

1. Introduction

Statistical learning theory is a broad research field encompassing statistics and computer science, whose general goal is to devise algorithms that have the ability to learn from data [1]–[4]. One of the central learning problems is how to recognize patterns [5], with practical applications in speech and text recognition, image analysis, computer-aided diagnosis and data mining. The paradigm of quantum information theory is that quantum systems carry a new type of information with potentially revolutionary applications such as faster computation and secure communication [6]. Motivated by these theoretical challenges, quantum engineering is developing new tools for controlling and accurately measuring individual quantum systems [7]. In the process of engineering exotic quantum states, statistical validation has become a standard experimental procedure [8, 9] and quantum statistical inference has passed from its purely theoretical status in the 1970s [10, 11] to a more practically oriented theory at the interface between the classical and quantum worlds [12]–[15].

In this paper, we put forward a new type of quantum statistical problem inspired by learning theory, namely quantum state classification. In a nutshell, the problem is the following: we are given n_1 quantum systems prepared identically in an unknown state ρ and n_2 systems prepared in a different unknown state σ, and we would like to design a measurement that discriminates between future copies of ρ and σ as well as possible. If ρ and σ and their a priori probabilities π0 and π1 are known, then the optimal strategy for discriminating between them is the Helstrom measurement [11]: measure the observable π0 ρ − π1 σ and guess 'ρ' on a positive outcome and 'σ' on a negative one. However, even this optimal strategy may fail, with probability P_e = (1 − Tr|π1 σ − π0 ρ|)/2. In our problem, the states ρ and σ are themselves unknown, so we must learn the measurement that distinguishes them.

How could such a scenario occur? Suppose we send one bit of information through an unknown noisy quantum channel. To decode the information we need to discriminate between the output states ρ and σ corresponding

to the two inputs 0 and 1. However, the states are unknown, so we need to learn how to distinguish them by probing the channel with several known inputs and measuring the output states. For example, we can estimate ρ and σ from these labelled outputs and use the Helstrom measurement of the estimated states as a classifier for future unlabelled states. One of our results is that this 'plug-in' strategy is typically not optimal, and is outperformed by a direct learning method that aims at estimating the Helstrom measurement without passing through state estimation. In particular, the optimal learning measurement is collective with respect to all n_1 + n_2 training systems.

The natural figure of merit is the excess risk: the difference between the expected error probability and the error probability of the Helstrom measurement. We show that for qubits the excess risk of both the optimal and the plug-in methods decreases as n^{-1}, where n is the size of the training set, and we compute the exact constant factors. Our analysis is valid for arbitrary mixed qubit states and is performed in a pointwise, local minimax (rather than Bayesian) setting, which captures the behaviour of the risk around any pair of states. The key theoretical tool is the recently developed theory of local asymptotic normality (LAN) for quantum states [23]–[26], which is an extension of the classical concept in mathematical statistics introduced by Le Cam [27]. Roughly, LAN says that the collective state of n i.i.d. quantum systems can be approximated by a joint Gaussian state of a classical and a quantum continuous variable system. This was used to derive optimal state estimation strategies for arbitrary mixed states of arbitrary finite dimension, and also in finding quantum teleportation benchmarks for multiple qubit states [28]. In this paper, LAN is used to identify the (asymptotically) optimal measurement on the training set as a linear measurement on a two-mode continuous variable system. As in the case of state estimation, such collective measurements perform strictly better than local ones [29, 30]. Moreover, the optimal learning measurement is different from the optimal state estimation measurement, emphasizing once again that different quantum decision problems cannot in general be solved optimally simultaneously.

Related work. Somewhat similar ideas have already appeared in the physics [16]–[19] and learning [20]–[22] literature, but here we emphasize the close connection with learning and we aim at going beyond the special models based on group symmetry and pure states. Sasaki and Carlini [16] defined a quantum matching machine, which aims at pairing a given 'feature' state with the closest out of a set of 'template' states. The problem is formulated in a Bayesian framework with uniform priors over the feature and template pure states, which are considered to be unknown. Bergou and Hillery [17] introduced a discrimination machine, which corresponds to our setup in the special case when the training set is of size n = 1. The papers [18, 19] deal with the problem of quantum state identification as defined in this paper. The special case of Bayesian risk with uniform priors over pure states was solved in [18], with the small difference that the learning and classification steps are done in a single measurement over n + 1 systems. However, as in the case of state estimation [31], the proof relies on the special symmetry of the problem and does not cover mixed states.
Finally, the concept of quantum classification was already proposed in a series of papers [20]–[22]. However, the authors mostly focused on problem formulation, reduction between different problem classes and general issues regarding learnability. Other related papers that fall outside the scope of our investigation include [32, 33].

This paper is organized as follows. Section 2 gives a short overview of the classical classification setup and introduces its quantum analogue.

In section 3, we reformulate the classification problem in the asymptotic (local) framework, as an estimation problem with quadratic loss for a Gaussian training set. The main result is theorem 5 of section 4, which gives the minimax excess risk for the case of known priors. The case of unknown priors is treated in section 4.2. In section 4.1, it is shown that the optimal classifier typically outperforms the naive plug-in procedure based on optimal state estimation. The geometry of the problem is captured by the Bloch ball illustrated in figure 3. The asymptotic results rely on the theory of LAN for quantum states, which is briefly reviewed in the appendix.

2. Classical and quantum learning

This section gives a formal definition of the quantum classification problem, in analogy with its classical counterpart. The next subsection reviews some basic concepts from classical learning that are helpful for understanding the quantum version but can be skipped at a first reading.

2.1. Classical learning

Let (X, Y) be a pair of random variables with joint distribution P over the measure space (X × {0, 1}, Σ). In supervised learning the goal is to learn to predict the unknown label Y by observing the object X. In a first stage, we are given a training set of n i.i.d. pairs {(X_1, Y_1), ..., (X_n, Y_n)} with distribution P, from which we would like to 'learn' about P. In the second stage, we are presented with a new sample X and asked to guess its unseen label Y. For this we construct a (random) classifier, ĥ_n : X → {0, 1}, which depends on the data (X_1, Y_1), ..., (X_n, Y_n). Its overall accuracy is measured in terms of the expected error rate according to the data distribution P,

P_e(ĥ_n) = P(ĥ_n(X) ≠ Y) = E[1_{ĥ_n(X) ≠ Y}],

where 1_C is the indicator function, equal to 1 if C is true and 0 otherwise. However, the error rate itself does not give a good indication of the performance of the learning method. Indeed, even an 'oracle' who knows P exactly typically has a non-zero error: in this case the optimal ĥ is the Bayes classifier h*, which chooses the label that is more probable with respect to the conditional distribution P(Y|X),

h*(X) = 0 if η(X) ≤ 1/2,   h*(X) = 1 if η(X) > 1/2,    (1)

where η(X) := P(Y = 1|X) will be referred to as the regression function. The Bayes risk is

P_e(h*) = E[E[1_{h*(X) ≠ Y}|X]] = (1 − E[|1 − 2η(X)|])/2.

An alternative view of the Bayes classifier that fits more naturally in the quantum setup is the following. We decompose P into the prior distribution P(Y) and the conditional distributions P_0(X) := P(X|Y = 0) and P_1(X) := P(X|Y = 1).

We are thus in a Bayesian setup where the hypotheses are chosen randomly with the prior distribution, and the optimal solution is to choose the hypothesis with higher posterior probability:

h*(X) = 0 if π0 p0(X) ≥ π1 p1(X),   h*(X) = 1 if π0 p0(X) < π1 p1(X),

where p_i are the densities of P(X|Y = i) with respect to some common reference measure, and π_i = P(Y = i). The decision rule h* can easily be verified to be identical to the previously defined Bayes classifier. The Bayes risk can be written as

P_e* = (1 − ‖π0 p0 − π1 p1‖_1)/2.    (2)

Returning to the classification setup where P is unknown, we see that a more informative performance measure for ĥ_n is the excess risk,

R(ĥ_n) = P_e(ĥ_n) − P_e(h*) ≥ 0,    (3)

which measures how much worse the procedure ĥ_n performs compared with the performance of the oracle classifier. In statistical learning theory, one is primarily interested in consistent classifiers, for which the excess risk converges to 0 as n → ∞, and then in finding classifiers with fast convergence rates [2, 3].

But how do we compare different learning procedures? One can always design algorithms that work well for certain distributions and badly for others. Here we take the statistical approach and consider that all prior information about the data is encoded in the statistical model {P_θ : θ ∈ Θ}, i.e. the data come from a distribution that depends on some unknown parameter θ belonging to a parameter space Θ. The latter may be a subset of R^k (parametric) or a large class of distributions with certain 'smoothness' properties (non-parametric). One can then define the maximum excess risk of ĥ_n,

R_max(ĥ_n) := sup_{θ ∈ Θ} R_θ(ĥ_n),    (4)

where R_θ denotes the excess risk when the underlying distribution is P_θ. A procedure h̃_n is called minimax if its maximum excess risk is smaller than that of any other procedure:

R_max(h̃_n) = inf_{ĥ_n} R_max(ĥ_n) = inf_{ĥ_n} sup_{θ ∈ Θ} R_θ(ĥ_n).    (5)

Example 1. Let (X, Y) ∈ {0, 1}² with unknown parameters η(0), η(1) and P(X = 0), satisfying η(0) < 1/2 and η(1) > 1/2. Then the Bayes classifier is h*(0) = 0 and h*(1) = 1. On the other hand, from the training sample one can estimate η(i) and obtain the concentration result

P[η̂_n(0) < 1/2 and η̂_n(1) > 1/2] = 1 − O(exp(−cn)).

Thus the plug-in estimator ĥ_n obtained by replacing η by η̂_n in (1) is equal to h* with high probability, and the excess risk is exponentially small. The crucial feature leading to the exponentially small risk is the fact that the regression function η(X) is bounded away from the critical value 1/2. Indeed, by (1) the Bayes classifier depends only on the sign of η(X) − 1/2, so if |η(X) − 1/2| ≥ ε > 0, then the sign can be estimated with exponentially small error and the corresponding classifier coincides with the Bayes one with high probability. This situation is rather special but shows that the behaviour of the excess risk depends on the properties of η around the value 1/2.
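The exponential decay in example 1 is easy to see in simulation. The following sketch (the parameter values are our own illustrative choice, not taken from the paper) estimates how often the plug-in rule disagrees with the Bayes rule; since the excess risk is at most a constant times this disagreement probability, it inherits the exponential decay.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (our own choice): any values with eta(0) < 1/2 < eta(1) work.
p_x0 = 0.4                 # P(X = 0)
eta0, eta1 = 0.3, 0.8      # eta(x) = P(Y = 1 | X = x)

def plug_in_disagrees(n):
    """Draw a training set of size n and check whether the plug-in rule differs from the Bayes rule."""
    x = (rng.random(n) >= p_x0).astype(int)
    y = (rng.random(n) < np.where(x == 0, eta0, eta1)).astype(int)
    for v, eta_v in ((0, eta0), (1, eta1)):
        mask = x == v
        eta_hat = y[mask].mean() if mask.any() else 0.5
        if (eta_hat > 0.5) != (eta_v > 0.5):
            return True
    return False

for n in (20, 40, 80, 160):
    freq = np.mean([plug_in_disagrees(n) for _ in range(5000)])
    print(f"n = {n:4d}   P(plug-in differs from Bayes) ~ {freq:.4f}")
# The disagreement probability, and with it the excess risk, decays exponentially in n.
```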

A different behaviour is observed in the next example, where N(m, V) denotes the normal distribution with mean m and variance V.

Figure 1. Likelihood functions π0 p0(x) and π1 p1(x) for two normal distributions with means a and b; the Bayes threshold (a + b)/2 and its estimate (â + b̂)/2 are marked on the horizontal axis.

The Bayes risk is the area of the orange triangle. The excess risk is the area of the green triangle.

Example 2. Let (X, Y) ∈ R × {0, 1} be random variables with P(X|Y = 0) = N(a, 1) and P(X|Y = 1) = N(b, 1) for some unknown means a < b, and P(Y = 0) = 1/2. From figure 1, we can see that p0(x) ≤ p1(x) if and only if x ≥ (a + b)/2, so that the Bayes classifier is

h*(x) = 0 if x < (a + b)/2,   h*(x) = 1 if x ≥ (a + b)/2.

The Bayes risk is equal to the orange area under the two curves. Again, a natural classifier is obtained by estimating the midpoint (a + b)/2 and plugging it into the above formula. The additional error is the area of the green triangle. Since (â + b̂)/2 − (a + b)/2 ≈ 1/√n, one can deduce that R(ĥ_n) = O(n^{-1}), and it can be shown that this rate of convergence is optimal [34]. From this example, we see that the rate is determined by the behaviour of the regression function's distribution around 1/2, namely in this case

P(|η(x) − 1/2| ≤ t) = O(t),   t > 0,

which is called the margin condition. Roughly speaking, in a parametric model satisfying the margin condition, the excess risk goes to zero as O(n^{-1}). In non-parametric models (which are the main focus of learning theory), arbitrarily slow rates are possible, depending on the complexity of the model and the behaviour of the regression function [34].
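A quick numerical check of the n^{-1} rate in example 2 (with class means a = 0, b = 2 chosen by us for illustration): the plug-in midpoint classifier is trained on n labelled points and its exact error probability is compared with the Bayes risk.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

a, b = 0.0, 2.0            # illustrative class means (our choice); unit variances, priors 1/2

def Phi(x):                 # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def error_rate(t):
    """Error probability of the rule 'guess 1 iff x >= t'."""
    return 0.5 * ((1.0 - Phi(t - a)) + Phi(t - b))

bayes_risk = error_rate((a + b) / 2)

def excess_risk_plugin(n, reps=4000):
    """Monte Carlo estimate of E[Pe(t_hat)] - Pe(t*) for the plug-in midpoint rule."""
    total = 0.0
    for _ in range(reps):
        y = rng.integers(0, 2, size=n)
        x = rng.normal(np.where(y == 0, a, b), 1.0)
        a_hat = x[y == 0].mean() if np.any(y == 0) else a
        b_hat = x[y == 1].mean() if np.any(y == 1) else b
        total += error_rate((a_hat + b_hat) / 2)
    return total / reps - bayes_risk

for n in (50, 100, 200, 400, 800):
    r = excess_risk_plugin(n)
    print(f"n = {n:4d}   excess risk ~ {r:.5f}   n * excess ~ {n * r:.3f}")
# n * excess risk settles to a constant, illustrating the O(1/n) rate.
```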

According to Vapnik [3], one of the principles of statistical learning is: 'when solving a problem of interest, do not solve a more general problem as an intermediate step.' This is interpreted as saying that learning procedures that first estimate the statistical model (or regression function) and then plug this estimate into the Bayes classifier are less efficient than methods that aim at constructing ĥ(x) directly. Recently, it has been shown [34] that this is not necessarily the case if some type of margin condition is assumed, and that plug-in estimators ĥ_PLUG-IN(x) = 1_{η̂(x) > 1/2} can achieve 'fast' n^{-1} rates. In this paper, we show that, at least as far as the constant in front of the rate is concerned, direct quantum learning performs better than plug-in methods based on optimal state estimation. This phenomenon stems from the incompatibility between the optimal measurements for estimation and learning.

2.2. Quantum learning

We now consider the quantum counterpart of the learning problem, the classification of quantum states. In this case, X is replaced by a Hilbert space of dimension d. To find the counterpart of P we write P(dx, y) = P(dx|y)P(y) and replace the conditional distributions P(dx|y = 0) and P(dx|y = 1) by density matrices ρ and σ, while P(y) describes the prior probabilities over the states, usually denoted by π_y := P(Y = y). The training set consists of n i.i.d. pairs {(τ_1, Y_1), ..., (τ_n, Y_n)}, where τ_i = ρ if Y_i = 0 and τ_i = σ if Y_i = 1. Thus we are randomly given copies of ρ and σ together with their labels, but we do not know what ρ and σ are. After a permutation, the joint state of the training set can be concisely written as ρ^{⊗n_0} ⊗ σ^{⊗n_1}, where n_y is the number of copies for which Y_j = y.

The experimenter is allowed to make any physical operations on the training set (such as unitary evolution or measurements) and outputs a binary-valued measurement with POVM elements M̂_n := (P̂_n, 1 − P̂_n). This (non-deterministic) POVM plays the role of the classical classifier ĥ_n: given a new quantum system, this time with unknown label, we apply the measurement M̂_n to guess whether the state is ρ or σ. The accuracy is measured in terms of the expected misclassification error,

P_e(M̂_n) = E[π0 Tr[ρ(1 − P̂_n)] + π1 Tr[σ P̂_n]],

where the expectation is taken over the outcomes P̂_n.

The Bayes classifier M* is nothing but the Helstrom measurement [11], which optimally discriminates between known states ρ, σ with priors π0, π1. In this case M* = (P*, 1 − P*), where P* is the projection onto the subspace of positive eigenvalues of the operator π0 ρ − π1 σ, i.e. P* = [π0 ρ − π1 σ]_+. Note that if both eigenvalues are of the same sign, the optimal procedure is to choose the state with higher π_i without making any measurement at all. The Helstrom risk can be expressed as

P_e* = (1 − Tr|π1 σ − π0 ρ|)/2,

which is the quantum extension of (2). As before, the performance of an arbitrary classifier M̂_n is measured by the excess risk,

R(M̂_n) = P_e(M̂_n) − P_e* = E Tr[(π1 σ − π0 ρ)(P̂_n − P*)],    (6)

where the expectation is taken with respect to the distribution of M̂_n.

In table 1, we summarize the analogous concepts in the classical and the quantum learning setup. Besides these obvious correspondences we would like to point out some interesting differences. Based on the coin toss example 1, one may expect that the classification of two qubit states should exhibit similar exponentially fast rates. In fact, as we will show in this paper, the rate is n^{-1} as in example 2, where the data are not discrete but continuous and the regression function is not bounded away from 1/2.

Table 1. Comparison of classical and quantum learning.

Element            Classical learning                              Quantum learning
Distribution       (p0, p1) with priors (π0, π1)                   (ρ, σ) with priors (π0, π1)
Training example   (x, y)                                          (ρ, 0) or (σ, 1)
Training set       {(X_1, Y_1), ..., (X_n, Y_n)}                   ρ^{⊗n_0} ⊗ σ^{⊗n_1}
Function           classifier ĥ                                    measurement P̂
Optimal function   h*(X) = 1_{π1 p1(X) > π0 p0(X)}                 P* = [π0 ρ − π1 σ]_+
Minimum risk       (1 − ‖π0 p0 − π1 p1‖_1)/2                       (1 − Tr|π1 σ − π0 ρ|)/2
Risk               E‖(π0 p0 − π1 p1)(ĥ − h*)‖_1                    E Tr[(π1 σ − π0 ρ)(P̂ − P*)]
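To make the quantum column concrete, here is a small numpy sketch (ours, not from the paper) that builds qubit density matrices from Bloch vectors, computes the Helstrom projection P* = [π0 ρ − π1 σ]_+ and the minimum risk, and evaluates the excess risk of an arbitrary fixed projective classifier.

```python
import numpy as np

# Pauli matrices
I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def qubit_state(r):
    """Density matrix with Bloch vector r (|r| <= 1)."""
    return 0.5 * (I2 + r[0] * sx + r[1] * sy + r[2] * sz)

def positive_projection(A):
    """Projector onto the strictly positive eigenspace of a Hermitian matrix A (the paper's [A]_+)."""
    vals, vecs = np.linalg.eigh(A)
    P = np.zeros_like(A)
    for lam, v in zip(vals, vecs.T):
        if lam > 0:
            P += np.outer(v, v.conj())
    return P

def helstrom(rho, sigma, pi0, pi1):
    """Helstrom projection P* and minimum error probability for known states and priors."""
    P_star = positive_projection(pi0 * rho - pi1 * sigma)
    p_err = 0.5 * (1 - np.abs(np.linalg.eigvalsh(pi1 * sigma - pi0 * rho)).sum())
    return P_star, p_err

def excess_risk(rho, sigma, pi0, pi1, P_hat, P_star):
    """Tr[(pi1 sigma - pi0 rho)(P_hat - P*)] for a fixed (non-random) classifier P_hat."""
    return np.real(np.trace((pi1 * sigma - pi0 * rho) @ (P_hat - P_star)))

# Illustrative mixed states (numbers are our own choice)
rho   = qubit_state([0.3, 0.1, 0.6])
sigma = qubit_state([-0.2, 0.4, -0.5])
pi0 = pi1 = 0.5
P_star, p_err = helstrom(rho, sigma, pi0, pi1)
P_hat = qubit_state([0.0, 0.0, 1.0])          # a (suboptimal) guess: project onto |0><0|
print("Helstrom risk:", p_err)
print("excess risk of P_hat:", excess_risk(rho, sigma, pi0, pi1, P_hat, P_star))
```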

[Figure 2 is a schematic: the label Y and the measurement choice P̂ enter a box that outputs data X with conditional distribution P(X|Y, P̂); a classifier maps X to a guess Ŷ.]

Figure 2. Quantum learning seen as classical learning with data distribution

depending on additional parameters controlled by the experimenter.

A possible explanation is the fact that for quantum systems the classifiers (measurements) form a continuous set, as in example 2, unlike the discrete case of example 1. A helpful way to think about it is illustrated in figure 2. The unknown label is the input of a black box, which outputs the data X with conditional distribution P(X|Y). In the quantum case the box has an additional input, the measurement choice, which appears as a parameter in the conditional distribution and is controlled by the experimenter. The game is to learn from the training set the optimal value of this parameter, for which the identification of the label Y is easiest. This setup resembles that of active learning [35], where the training data X_i are actively chosen rather than collected randomly.

2.3. Local minimax formulation of optimality

We now give the precise formulation of what we mean by asymptotic optimality of a learning strategy {M̂_n : n ∈ N}. As in the classical case, we construct a model that contains all unknown parameters of the problem: the two states ρ, σ and the prior π0. We denote these parameters collectively by θ, which belongs to a parameter space Θ ⊂ R^k. When some prior information is available about the model, it can be included by restricting to a sub-model of the general one. As in the classical case, we denote by R_θ(M̂_n) the risk of M̂_n at θ, and we can define the maximum excess risk as in (4). However, in the asymptotic setup, the maximum excess risk is in general too pessimistic, as it captures the worst behaviour over the whole parameter space. For typical 'smooth' parametric models, the unknown parameter can be 'localized' with high probability within a ball of size n^{-1/2+ε} around a rough estimator θ0 obtained by measuring a small proportion n^{1−ε} ≪ n of the training set (see lemma 2.1 of [24]). Here, ε > 0 is a small number whose particular value becomes irrelevant in the asymptotics. Thus a more informative

performance measure is the local maximum excess risk around θ0,

R_max^{(l)}(M̂_n; θ0) := sup_{‖θ − θ0‖ ≤ n^{-1/2+ε}} R_θ(M̂_n).    (7)

Since typically this risk scales as n^{-1}, in the asymptotics we are interested in finding the best (smallest) constant factor in front of the n^{-1} rate.

Definition 3. The local asymptotic minimax risk at θ0 is defined as

R_minmax^{(l)}(θ0) := lim sup_{n→∞} inf_{M̂_n} n R_max^{(l)}(M̂_n; θ0).

A sequence of classifiers {M̃_n : n ∈ N} is called locally asymptotically minimax at θ0 if

lim sup_{n→∞} n R_max^{(l)}(M̃_n; θ0) = R_minmax^{(l)}(θ0).

We identify two general learning strategies. The first one consists of optimally estimating the states ρ, σ and the prior π0 to obtain ρ̂, σ̂, π̂0, and then constructing the classifier (measurement) as

P̂_PLUG-IN = [π̂0 ρ̂ − π̂1 σ̂]_+.    (8)
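A rough sketch of this first, plug-in strategy (our own illustration: it uses naive single-qubit Pauli tomography on the training copies rather than the optimal collective estimation of [24], and ignores the possibility that the estimated Bloch vector falls outside the Bloch ball):

```python
import numpy as np

rng = np.random.default_rng(2)
I2 = np.eye(2)
paulis = [np.array([[0, 1], [1, 0]], dtype=complex),
          np.array([[0, -1j], [1j, 0]], dtype=complex),
          np.array([[1, 0], [0, -1]], dtype=complex)]

def bloch_estimate(r_true, n):
    """Estimate a Bloch vector by measuring sigma_x, sigma_y, sigma_z on n/3 copies each."""
    r_hat = np.zeros(3)
    for k in range(3):
        m = n // 3
        # outcome +1 occurs with probability (1 + r_k)/2 on each copy
        outcomes = rng.choice([1, -1], size=m, p=[(1 + r_true[k]) / 2, (1 - r_true[k]) / 2])
        r_hat[k] = outcomes.mean()
    return r_hat          # note: may have norm > 1; a proper estimator projects back to the ball

def plug_in_projection(r_hat, s_hat, pi0, pi1):
    """P_PLUG-IN = [pi0 rho_hat - pi1 sigma_hat]_+ built from estimated Bloch vectors."""
    rho_hat   = 0.5 * (I2 + sum(r_hat[k] * paulis[k] for k in range(3)))
    sigma_hat = 0.5 * (I2 + sum(s_hat[k] * paulis[k] for k in range(3)))
    A = pi0 * rho_hat - pi1 * sigma_hat
    vals, vecs = np.linalg.eigh(A)
    P = np.zeros_like(A)
    for lam, v in zip(vals, vecs.T):
        if lam > 0:
            P += np.outer(v, v.conj())
    return P

# Example: unknown states with Bloch vectors r, s; equal priors; n training copies of each
r, s = np.array([0.3, 0.1, 0.6]), np.array([-0.2, 0.4, -0.5])
pi0 = pi1 = 0.5
n = 3000
P_hat = plug_in_projection(bloch_estimate(r, n), bloch_estimate(s, n), pi0, pi1)
print(np.round(P_hat, 3))
```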

The second strategy aims at estimating the Helstrom projection P* directly from the training set, without passing through state estimation. After having introduced the general quantum classification setup, the rest of the paper deals with the special case of asymptotically optimal classification of qubit states. In particular, we will show that the direct strategy performs better than the plug-in one. In section 3, we show how to reduce the local risk to an expectation of a quadratic form in the local parameters. This will simplify the optimal measurement problem for the training set to that of finding the optimal measurement of a Gaussian state for a quadratic loss function [10]. The technical ingredient is the concept of LAN, reviewed in the appendix.

3. Local formulation of the qubit state classification problem

In this section, we reformulate the problem of quantum state classification in the 'local' setup. This allows us to replace, on the one hand, the excess error probability with a quadratic form in local parameters and, on the other hand, the training set consisting of i.i.d. qubits with a simpler Gaussian model. Throughout this section we restrict ourselves to the case where the priors π0, π1 are known. In section 4.2, we show that the results can be extended to unknown priors by simply estimating them from the counts of ρ and σ states in the training sample. To help the reader navigate through the different Bloch ball vector notations, we have collected them in table 2 and illustrated the geometry of the problem in figure 3.

3.1. The loss function

Remember that in order to classify the unknown qubit states ρ and σ, we must measure the training set and produce a 'classifier' M̂_n := (P̂_n, 1 − P̂_n), which is itself a (non-deterministic) binary measurement on C². The accuracy of the procedure is measured by the excess risk (6),

R(M̂_n) = E Tr[(π1 σ − π0 ρ)(P̂_n − P*)],    (9)

Table 2. List of Bloch ball notations.

Vector notation                                      Defined in/on
(r0, s0)                                             Bloch vectors of (ρ0, σ0), equation (10)
(u, v)                                               local vectors of (ρ, σ), equation (10)
(p0, p, p̂_n)                                         Bloch vectors of (P0, P*, P̂_n), equations (12)–(15)
d0 := π0 r0 − π1 s0                                  relative vector, equation (12)
z := (π0 u − π1 v) − p0 ((π0 u − π1 v)·p0)           vector to be estimated, equation (14)
ẑ_n, z̃_n                                            estimators of z, equation (15) and section 4.1
(a1, a2, a3) and (b1, b2, b3)                        reference frames for u and v, section 3.2
(p0, l0, k0)                                         reference frame for z, section 4

Figure 3. Bloch ball geometry of the learning problem. The unknown states are localized in the two balls centred at r0 and s0 and have local vectors u/√n and v/√n. The vector to be estimated, z, and its estimator ẑ_n are contained in the green equatorial plane orthogonal to p0. The three reference systems (a1, a2, a3), (b1, b2, b3) and (p0, l0, k0) are shown in red.

with P* := [π0 ρ − π1 σ]_+. Since any binary measurement is a mixture of projective POVMs [36], we can assume without loss of generality that P̂_n is a projection.

As discussed in the previous section, the unknown states (ρ, σ) can be localized with high probability in n^{-1/2+ε} neighbourhoods of some fixed states (ρ0, σ0) by sacrificing a small

proportion of the training set. This means that (ρ0, σ0) are known and can be used in the classification procedure, and we can express the states as

ρ = ρ_{u/√n} = (1/2)(1 + (r0 + u/√n)·σ⃗),
σ = σ_{v/√n} = (1/2)(1 + (s0 + v/√n)·σ⃗),    (10)

where (r0, s0) are the Bloch vectors of (ρ0, σ0), and (u, v) are fixed but unknown parameters. Let P0 := [π0 ρ0 − π1 σ0]_+ be the optimal projection corresponding to the pair (ρ0, σ0). We can distinguish two cases: P0 has dimension one, or P0 is zero or the identity. The next lemma shows that in the latter case the optimal measurement is trivial (one can guess the state without measuring) and the excess risk vanishes exponentially.

Lemma 4. Suppose that (ρ0, σ0) and (π0, π1) satisfy ‖π0 r0 − π1 s0‖ < |π0 − π1|. Then P0 is either zero or the identity, and the local minimax excess risk satisfies

R_minmax^{(l)}(ρ0, σ0) = O(exp(−cn)),   c > 0.

Proof. Note that the inequality is satisfied only if π0 ≠ π1, and it implies that

π0 ρ0 − π1 σ0 = ((π0 − π1)/2) (1 + ((π0 r0 − π1 s0)/(π0 − π1))·σ⃗)

is a strictly positive or negative operator, depending on the sign of π0 − π1. Then by standard concentration inequalities we obtain P* = P0 ∈ {0, 1} with exponentially small error probability, so choosing P̂ := P0 gives the desired result. □

From now on, we will work under the assumption that

‖π0 r0 − π1 s0‖ > |π0 − π1|,    (11)

so that P0 := [π0 ρ0 − π1 σ0]_+ is a one-dimensional (1D) projection whose Bloch vector is

p0 = (π0 r0 − π1 s0)/‖π0 r0 − π1 s0‖ := d0/‖d0‖.    (12)

Similarly, let P* be the Helstrom projection for the pair (ρ, σ), with the Bloch vector

p = (d0 + (π0 u − π1 v)/√n) / ‖d0 + (π0 u − π1 v)/√n‖,    (13)

and since p is a unit vector, the component of π0 u − π1 v along p0 can be neglected in the first order of approximation, so that

p = (d0 + z/√n)/‖d0 + z/√n‖ + o(n^{-1/2}),   z := (π0 u − π1 v) − p0 ((π0 u − π1 v)·p0).    (14)

Similarly, P̂_n is a small rotation of P0 and its Bloch vector can be written as

p̂_n = (p0 + ẑ_n/√n)/‖p0 + ẑ_n/√n‖,    (15)

with ẑ_n = O(n^ε) being a vector in the plane orthogonal to p0. Plugging (14) and (15) into (9), after some calculations we find the following expression for the risk:

R(M̂_n) = E Tr((π1 σ − π0 ρ)(P̂_n − P*))
        = (1/2) E[(d0 + (π0 u − π1 v)/√n)·(p − p̂_n)]
        = (1/(4n‖d0‖)) E‖z − ẑ_n‖² + o(n^{-1}).

This confirms that the excess risk decreases as n^{-1}, so it makes sense to optimize over measurements M̂_n the rescaled version of the local maximum excess risk (7). As we are interested in asymptotically optimal procedures, the contribution coming from the o(n^{-1}) term can be dropped, and the initial problem has been cast into that of an asymptotically optimal estimation of the linear transformation z of the local parameters (u, v), with respect to the quadratic loss function

L((u, v), ẑ_n) := (1/(4‖d0‖)) ‖z − ẑ_n‖².    (16)

The essence of (16) is that optimal classification amounts to optimal estimation (on the training set) of the rotation that maps P0 to P*, i.e. the local coefficients of the appropriate rotation generators. An interesting question is whether a similar phenomenon happens for higher dimensional systems.

3.2. The training set

The problem of asymptotically optimal classification has been reduced to that of asymptotically optimal estimation of a certain linear transformation z of the local parameters (u, v). The difference with optimal state estimation is in the loss function: we are not interested in knowing all parameters (u, v) but only a part of them, namely z. The solution employs the machinery of LAN, reviewed in the appendix, which says that the multiple qubits model of the training set

T_n := {ρ_{u/√n}^{⊗nπ0} ⊗ σ_{v/√n}^{⊗nπ1} : ‖u‖, ‖v‖ ≤ n^ε}

can be approximated by a simple classical–quantum Gaussian shift model

G^{(2)} := {N_u ⊗ N_v ⊗ Φ_u ⊗ Φ_v : u, v ∈ R³}.    (17)

In particular, according to lemma 11, the asymptotic minimax risk of the former (with respect to the loss function (16)) is the same as the minimax risk of the latter. The explicit expression for the minimax risk will be derived in the next section, but for the moment we will list the expressions for the Gaussian states and distributions involved in G^{(2)}. For that we write the local Bloch vectors in terms of their coordinates (u_1, u_2, u_3) and (v_1, v_2, v_3) with respect to the coordinate systems (a1, a2, a3) and (b1, b2, b3), respectively, defined through the conditions (see figure 3):

(i) a3 is parallel to r0 and b3 is parallel to s0;
(ii) a1, b1 are in the plane (r0, s0);
(iii) a2 = b2.

The components of the limit Gaussian model are (cf the appendix)

N_u := N(√π0 u_3, 1 − ‖r0‖²),
N_v := N(√π1 v_3, 1 − ‖s0‖²),
Φ_u := Φ(√(π0/(2‖r0‖)) u_1, √(π0/(2‖r0‖)) u_2; 1/(2‖r0‖)),    (18)
Φ_v := Φ(√(π1/(2‖s0‖)) v_1, √(π1/(2‖s0‖)) v_2; 1/(2‖s0‖)),

where Φ(q, p; v) is a Gaussian state with means (q, p) and covariance matrix v·1.
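For reference, a direct transliteration of (18) into code (a sketch of ours; the coordinates of u and v are understood in the frames (a1, a2, a3) and (b1, b2, b3) of section 3.2, so only the norms of r0 and s0 enter):

```python
import numpy as np

def limit_gaussian_model(r0_norm, s0_norm, pi0, u, v):
    """Parameters of the limit model (17)-(18): classical Gaussians N_u, N_v and
    displaced Gaussian states Phi_u, Phi_v, for local Bloch coordinates u = (u1, u2, u3),
    v = (v1, v2, v3)."""
    pi1 = 1.0 - pi0
    return {
        "N_u":   {"mean": np.sqrt(pi0) * u[2], "var": 1 - r0_norm**2},
        "N_v":   {"mean": np.sqrt(pi1) * v[2], "var": 1 - s0_norm**2},
        "Phi_u": {"q": np.sqrt(pi0 / (2 * r0_norm)) * u[0],
                  "p": np.sqrt(pi0 / (2 * r0_norm)) * u[1],
                  "var": 1 / (2 * r0_norm)},
        "Phi_v": {"q": np.sqrt(pi1 / (2 * s0_norm)) * v[0],
                  "p": np.sqrt(pi1 / (2 * s0_norm)) * v[1],
                  "var": 1 / (2 * s0_norm)},
    }

# Illustrative values (our own choice)
print(limit_gaussian_model(0.6, 0.5, 0.5, u=(1.0, -0.5, 0.2), v=(0.0, 1.0, 0.3)))
```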

4. Optimal classifier

In this section, we formulate our main result characterizing the asymptotically optimal measurement on the training set and derive the expression for the optimal excess risk. Summarizing the previous section, we transformed the original problem into that of estimating z ∈ R², which is a linear transformation of the local parameters (u, v) ∈ R³ × R³ of the Gaussian shift model (17). The corresponding loss function L((u, v), ẑ) is given in (16).

Since the local parameters contain both classical and quantum components, it is convenient to express the loss function L((u, v), ẑ) in terms of these components. Let (p0, l0, k0) be the reference frame with l0 in the plane (r0, s0). Designate by φ0, φ1 the angles between (r0, l0) and (s0, l0), respectively (see figure 3). Then z = z_l l0 + z_k k0 with components

z_l = (π0 cos φ0 u_3 − π1 cos φ1 v_3) + (π0 sin φ0 u_1 + π1 sin φ1 v_1) := z_l^{(c)} + z_l^{(q)},
z_k = π0 u_2 − π1 v_2,

where z_l was split into a contribution coming from the 'classical' parameters (u_3, v_3) and another one from the 'quantum' parameters. Since the classical and quantum parts of the Gaussian model are independent, it is easy to verify that the optimal estimator ẑ can be written as

ẑ = (ẑ_l^{(c)} + ẑ_l^{(q)}) l0 + ẑ_k k0,    (19)

where ẑ_l^{(c)} is the optimal estimator of z_l^{(c)}, and (ẑ_l^{(q)}, ẑ_k) are optimal estimators of (z_l^{(q)}, z_k) obtained by (jointly) measuring the two quantum Gaussian components. The excess risk can be written as

4‖d0‖ E[L((u, v), ẑ)] = E[(z_l^{(c)} − ẑ_l^{(c)})²] + E[(z_l^{(q)} − ẑ_l^{(q)})² + (z_k − ẑ_k)²],

where the classical and quantum contributions can be optimized separately. The optimal choice for the classical estimator is

ẑ_l^{(c)} = √π0 cos φ0 X_r − √π1 cos φ1 X_s,

where (X_r, X_s) ∼ N_u ⊗ N_v denote the random variables making up the classical part of the limit Gaussian model. Its contribution to the excess risk is

E[(z_l^{(c)} − ẑ_l^{(c)})²] = π0 (1 − ‖r0‖²) cos² φ0 + π1 (1 − ‖s0‖²) cos² φ1.    (20)

On the other hand, (z_l^{(q)}, z_k) are the means of the canonical coordinates

Q^{(l)} := √(2‖r0‖π0) sin φ0 Q_1 + √(2‖s0‖π1) sin φ1 Q_2,
Q^{(k)} := √(2‖r0‖π0) P_1 − √(2‖s0‖π1) P_2,    (21)

whose commutator is

[Q^{(l)}, Q^{(k)}] = i(2‖r0‖π0 sin φ0 − 2‖s0‖π1 sin φ1) 1 := i c 1.

Now, the optimal joint measurement of canonical variables is of the heterodyne type, where the non-commuting coordinates are combined with the coordinates of an additional CV system prepared in a squeezed state [10, 24]. The optimal mean square error is

E[(z_l^{(q)} − ẑ_l^{(q)})² + (z_k − ẑ_k)²] = Var(Q^{(l)}) + Var(Q^{(k)}) + |c|
    = π0 sin² φ0 + π1 sin² φ1 + 1 + 2|π0 ‖r0‖ sin φ0 − π1 ‖s0‖ sin φ1|.    (22)

Adding the classical and quantum contributions (20) and (22), we obtain the minimax risk

R_minmax^{(l)}(ρ0, σ0) = (2 + 2|π0 ‖r0‖ sin φ0 − π1 ‖s0‖ sin φ1| − π0 ‖r0‖² cos² φ0 − π1 ‖s0‖² cos² φ1) / (4‖d0‖),    (23)

which only depends on the states (ρ0, σ0) and the priors (π0, π1).

Theorem 5. Consider the quantum classification problem with training set ρ^{⊗π0 n} ⊗ σ^{⊗π1 n}, where ρ, σ are unknown qubit states and (π0, π1) are known. Then, under the assumption (11), the local minimax risk is given by (23). The measurement consisting of the following steps is asymptotically optimal:
(i) construct rough estimators of ρ and σ by measuring n^{1−ε} systems;
(ii) transfer the localized spin states through the channels T_n as in theorem 10;
(iii) perform the optimal coherent measurement of (Q^{(l)}, Q^{(k)}) and combine with the classical estimator ẑ_l^{(c)} to produce the estimator P̂_n.

4.1. Plug-in classifier based on optimal state estimation

Here we compute the asymptotics of the renormalized risk for the plug-in classifier based on optimal state estimation [24]. The first two steps of the measurement procedure are identical to those of theorem 5, while in the third step we perform separate heterodyne measurements on the modes (Q_1, P_1) and (Q_2, P_2) and observe the classical components, to obtain the estimators ũ_n and ṽ_n. The plug-in classifier is M̃_n := (P̃_n, 1 − P̃_n), where P̃_n has the Bloch vector defined by (15) with ẑ_n replaced by z̃_n := (π0 ũ_n − π1 ṽ_n)_⊥. We write z̃_n as in (19) and compute the risk in a similar fashion. While the contribution from the first term is given by (20), the 'quantum components' have different variances, due to the fact that we performed a different heterodyne measurement. By using (21) and the fact that heterodyne detection adds 1/2 to the variance of the canonical coordinates, we obtain

E[(z_l^{(q)} − z̃_l^{(q)})²] = π0 sin² φ0 (‖r0‖ + 1) + π1 sin² φ1 (‖s0‖ + 1),
E[(z_k − z̃_k)²] = π0 (‖r0‖ + 1) + π1 (‖s0‖ + 1).    (24)

Adding the three contributions, we obtain the asymptotic local risk for the plug-in classifier

R̃^{(l)}(ρ0, σ0) = (1/(4‖d0‖)) [2 + π0 (‖r0‖ sin² φ0 + ‖r0‖ − ‖r0‖² cos² φ0) + π1 (‖s0‖ sin² φ1 + ‖s0‖ − ‖s0‖² cos² φ1)].    (25)

Now, comparing the minimax risk (23) with the risk (25) of the plug-in classifier, we obtain

R̃^{(l)}(ρ0, σ0) − R_minmax^{(l)}(ρ0, σ0) = [π0 ‖r0‖ (1 ± sin φ0)² + π1 ‖s0‖ (1 ∓ sin φ1)²] / (4‖d0‖),

with the signs chosen according to the sign of π0 ‖r0‖ sin φ0 − π1 ‖s0‖ sin φ1. The difference is equal to zero if and only if sin φ0 = ∓1 and sin φ1 = ±1, which means that the vectors r0 and s0 are parallel and point in the same direction. On the other hand, the difference is maximum when r0 and s0 point in opposite directions and have length one. This can be easily understood from the Gaussian model. When the vectors are parallel, optimal learning requires a joint measurement of the non-commuting variables (Q_1 − Q_2, P_1 − P_2), whose risk is the same as that of heterodyning the CV systems first and constructing linear combinations. In the antiparallel case, we need to measure the commuting variables (Q_1 + Q_2, P_1 − P_2), which can be done directly, without any loss.

4.2. The case of unknown priors

The analysis so far deals with known priors π0, π1, which is the standard setup usually considered in quantum statistics. In general, the priors may be unknown but can be estimated from the training set with a standard n^{-1/2} error. Since the Helstrom measurement also depends on (π0, π1), this uncertainty will bring an additional contribution to the excess risk. To find it, one needs to go back to the derivation of the quadratic loss function and add another unknown local parameter δ for the prior: π0 = q_0 + δ/√n. This amounts to replacing z by z + δ(r0 + s0)_⊥ in (13) and, by going through the same steps, we obtain the quadratic loss function

L((u, v, δ), ẑ_n) := (1/(4‖d0‖)) ‖z + δ(r0 + s0)_⊥ − ẑ_n‖²,    (26)

where (r0 + s0)_⊥ is the component orthogonal to p0. As before, the training set can be cast into a Gaussian model, with an additional independent component Z ∼ N(δ, π0 π1). This means that the minimax risk has an additional contribution

(‖(r0 + s0)_⊥‖² / (4‖d0‖)) Var(Z) = π0 π1 ‖(r0 + s0)_⊥‖² / (4‖d0‖).
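Before concluding, here is a small numerical sketch (ours, not from the paper) that evaluates the closed-form constants of this section, as reconstructed above, for a given pair of Bloch vectors and priors: the minimax risk (23), the plug-in risk (25), their difference and the extra term for unknown priors. The angles φ0, φ1 are computed with our reading of the sign conventions of figure 3, and the sketch assumes r0 and s0 are not (anti)parallel.

```python
import numpy as np

def qubit_classification_risks(r0, s0, pi0):
    """Leading excess-risk constants (n * excess risk): minimax (23), plug-in (25),
    their difference, and the unknown-prior correction of section 4.2."""
    r0, s0 = np.asarray(r0, float), np.asarray(s0, float)
    pi1 = 1.0 - pi0
    d0 = pi0 * r0 - pi1 * s0
    nd0 = np.linalg.norm(d0)
    assert nd0 > abs(pi0 - pi1), "assumption (11) must hold"
    p0 = d0 / nd0
    nr, ns = np.linalg.norm(r0), np.linalg.norm(s0)
    # in-plane unit vector orthogonal to p0 (direction of l0); fails if r0, s0 are (anti)parallel
    l0 = (r0 + s0) - np.dot(r0 + s0, p0) * p0
    l0 = l0 / np.linalg.norm(l0)
    sin0, cos0 = np.dot(r0, p0) / nr, np.dot(r0, l0) / nr
    sin1, cos1 = -np.dot(s0, p0) / ns, np.dot(s0, l0) / ns

    minimax = (2 + 2 * abs(pi0 * nr * sin0 - pi1 * ns * sin1)
                 - pi0 * nr**2 * cos0**2 - pi1 * ns**2 * cos1**2) / (4 * nd0)   # eq. (23)
    plug_in = (2 + pi0 * (nr * sin0**2 + nr - nr**2 * cos0**2)
                 + pi1 * (ns * sin1**2 + ns - ns**2 * cos1**2)) / (4 * nd0)     # eq. (25)
    rs_perp = (r0 + s0) - np.dot(r0 + s0, p0) * p0
    prior_term = pi0 * pi1 * np.dot(rs_perp, rs_perp) / (4 * nd0)               # section 4.2
    return minimax, plug_in, plug_in - minimax, prior_term

# Illustrative Bloch vectors and equal priors (our own choice)
print(qubit_classification_risks([0.3, 0.0, 0.6], [-0.2, 0.0, -0.5], 0.5))
```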

5. Conclusions

We solved the problem of classifying two unknown qubit states in the asymptotic local minimax statistical framework. Asymptotically, the problem reduces to that of optimally estimating a sub-parameter of a quantum Gaussian model consisting of two continuous variable systems whose states are Gaussian with unknown means. The estimator is then used to construct an approximation of the (unknown) Helstrom measurement, which is used to classify unlabelled states. The optimal procedure has excess risk of the order of n^{-1} and we computed the exact constant factor R_minmax^{(l)}(ρ0, σ0) as a function of the two unknown states. Except in the special case of states with

parallel Bloch vectors, the optimal procedure performs strictly better than the plug-in classifier obtained by estimating the states and applying the corresponding Helstrom measurement. The difference is only a constant factor, but it would probably become significant in more interesting infinite-dimensional models.

Finally, let us briefly discuss the Bayesian analogue of our result. In the Bayesian framework, one would choose a 'regular' prior µ(dρ × dσ) over the two types of states and try to find the (asymptotically) optimal Bayes risk for this prior,

R_opt^µ := lim sup_{n→∞} inf_{M̂_n} n ∫ µ(dρ × dσ) R_{(ρ,σ)}(M̂_n).

When the states are pure and the prior is uniform, this has been done (even non-asymptotically) in [18], but the proof relies on the symmetry of the prior and cannot be applied to general priors and mixed states. Based on a similar analysis done for state estimation [37], we expect that our result can be used to prove that

R_opt^µ = ∫ R_minmax^{(l)}(ρ0, σ0) µ(dρ0 × dσ0).

The intuitive explanation is that when n → ∞, the features of the prior µ are washed out and the posterior distribution concentrates in a local neighbourhood of the true parameter, where the behaviour of the classifiers is governed by the local minimax risk. Proving this relation is, however, beyond the scope of this paper.

Acknowledgments

We thank Richard Gill for useful discussions and one of the referees for numerous suggestions for improving the manuscript. MG was supported by the EPSRC Fellowship EP/E052290/1. WK was supported by the NWO (Netherlands Organisation for Scientific Research) VICI grant 639-023-302.

Appendix. Local asymptotic normality

In this appendix, we give a short overview of the concept of LAN in classical statistics and its extension to quantum statistics, inasmuch as it is necessary for this paper; see [24] for proofs and more details. LAN gives a precise mathematical formulation of the idea that, asymptotically with the number of samples (or quantum systems), a priori complicated statistical models can be approximated by much simpler Gaussian ones, due to a central limit behaviour of sufficient statistics. Using this tool, one can cast the problem of (asymptotically) optimal state estimation into a much simpler one of estimating the mean of a quantum Gaussian state with known variance [23]–[26].

A.1. Local asymptotic normality in classical statistics

A typical statistical problem is the estimation of some unknown parameter θ from a sample X_1, ..., X_n ∈ X of independent, identically distributed random variables drawn from a distribution P_θ over a measure space (X, Σ). If θ belongs to an open subset of R^k for some finite dimension k and if the map θ → P_θ is sufficiently smooth, then widely used estimators

θ̂_n(X_1, ..., X_n), such as the maximum likelihood estimator, are asymptotically optimal in the sense that they converge to θ at rate n^{-1/2}, and the error has an asymptotically normal distribution,

√n (θ̂_n − θ) → N(0, I^{-1}(θ))   in distribution,    (A.1)

where the right side is the lower bound set by the Cramér–Rao inequality for unbiased estimators. To give a simple example, if X_i ∈ {0, 1} is the result of a coin toss with P[X_i = 1] = θ and P[X_i = 0] = 1 − θ, then the sufficient statistic θ̂_n = (1/n) Σ_{i=1}^n X_i satisfies (A.1) by the central limit theorem (CLT).

Naturally, the first inquiries into quantum statistics concentrated on generalizing the Cramér–Rao inequality to quantum statistical models and on finding asymptotically optimal estimators that achieve the quantum version of the Fisher information matrix [10, 11, 38]. However, it was found that, due to the additional uncertainty introduced by the non-commutative nature of quantum mechanics, the situation is essentially different from the classical case. A summary of these findings is the following:

(i) the multi-dimensional version of the Cramér–Rao bound is in general not achievable;
(ii) the optimal measurement depends on the loss function, i.e. the quadratic form (θ̂ − θ)^t G(θ̂ − θ), and different weight matrices G lead in general to incompatible measurements.

As we will see, these issues can be overcome by adopting a more modern perspective on asymptotic statistics provided by the LAN technique [27, 39]. Instead of analysing particular estimation problems, the idea is to consider the structure of the statistical model underlying the data and to approximate it with a simpler model for which the statistical problems are easy to solve. To obtain a non-trivial limit model, it makes sense to rescale the parameters according to their uncertainty, so we assume that θ is localized in a region of size n^{-1/2+ε} and we can write θ = θ0 + h/√n, with θ0 known and h ∈ R^k the local parameter to be estimated. As already discussed in section 2.3, this does not restrict the generality of the problem, since one can use a two-step localization procedure where a rough estimate θ0 is obtained in the first step using a small part of the sample.

Local asymptotic normality means that the sequence of (local) statistical models

P_n := {P_{θ0 + h/√n}^n : ‖h‖ ≤ n^ε},   n ∈ N,    (A.2)

depending 'smoothly' on h, approaches a sequence of Gaussian shift models

G_n := {N(h, I^{-1}(θ0)) : ‖h‖ ≤ n^ε},    (A.3)

where we observe a single Gaussian variable with mean h and fixed, known variance I^{-1}(θ0). The convergence is defined in terms of the Le Cam distance between two models, which quantifies the extent to which each model can be 'simulated' by using data from the other.

Definition 6. A positive linear map T : L_1(X, A, P) → L_1(Y, B, Q) is called a stochastic operator (or randomization) if ‖T(p)‖_1 = ‖p‖_1 for every p ∈ L_1^+(X, A, P).

For simplicity, we consider only dominated models, for which all distributions have densities with respect to some fixed reference distribution. In this case, a randomization is the classical analogue of a quantum channel.

Definition 7. Let P := {P_θ : θ ∈ Θ} and Q := {Q_θ : θ ∈ Θ} be two dominated statistical models with distributions having probability densities p_θ := dP_θ/dP and q_θ := dQ_θ/dQ. The deficiencies δ(P, Q) and δ(Q, P) are defined as

δ(P, Q) := inf_T sup_{θ ∈ Θ} ‖T(p_θ) − q_θ‖_1,
δ(Q, P) := inf_S sup_{θ ∈ Θ} ‖S(q_θ) − p_θ‖_1,

with the infima taken over all randomizations T, S. The Le Cam distance between P and Q is

Δ(P, Q) := max(δ(Q, P), δ(P, Q)).

Local asymptotic normality for i.i.d. parametric models can now be formulated as follows.

Theorem 8. The sequence of local models (A.2) converges in the Le Cam distance to the sequence (A.3) of Gaussian shift models:

lim_{n→∞} Δ(P_n, G_n) = 0.

A.2. Local asymptotic normality in quantum statistics

We will now describe the quantum version of LAN for the simplest case of a family of qubit states. The generalization to finite-dimensional systems can be found in [25]. We are given n independent qubits identically prepared in the state ρ = (1 + r·σ⃗)/2, where r is the unknown Bloch vector of the state and σ⃗ = (σ_x, σ_y, σ_z) are the Pauli matrices in M(C²). Following the methodology of the previous section, we concentrate on the structure of the statistical model itself rather than on optimal state estimation. As argued in the discussion leading up to equation (7), we can restrict ourselves to parameters within a ball of size n^{-1/2+ε} around a rough estimator ρ0 := (1 + r0·σ⃗)/2 and label the states in this ball by the local parameter u,

ρ_{u/√n} = (1 + (r0 + u/√n)·σ⃗)/2.

We define the local quantum statistical model by

Q_n := {ρ_u^n : ‖u‖ ≤ n^ε},   ρ_u^n := ρ_{u/√n}^{⊗n}.    (A.4)

Heuristically, the emergence of a Gaussian model for large n follows from the CLT and can be illustrated with the 'big Bloch sphere' picture commonly used to describe spin coherent [40] and spin-squeezed states [41] (see figure A.1). Let

L_j := Σ_{i=1}^{n} σ_j^{(i)},   j = 1, 2, 3,

be the collective spin operators and consider for simplicity that r0 points in the vertical direction. We describe the state ρ0^{⊗n} as a vector of length n‖r0‖ whose components are the expected values of the total spin operators. By the CLT, the quantum fluctuations around the means are asymptotically Gaussian, with

(1/√n)(L_3 − n‖r0‖) → N(0, 1 − ‖r0‖²),   (1/√n) L_{1,2} → N(0, 1)   in distribution,

so that the total spin vector should be pictured with a 3D Gaussian blob of size √n at its tip.

Figure A.1. Big ball picture of the collective state of identical mixed spins. The total spin is represented as a vector of length n‖r0‖ with a 3D uncertainty blob of size √n in the x- and y-directions and √(n(1 − ‖r0‖²)) in the z-direction.

Furthermore, by a law of large numbers heuristic, we estimate the commutators

[L_1/√n, L_2/√n] = (2i/n) L_3 ≈ 2i‖r0‖ 1,   [L_{1,2}/√n, L_3/√n] ≈ 0.

This suggests that L_1/√(2‖r0‖n) and L_2/√(2‖r0‖n) converge to the canonical coordinates Q and P of a continuous variable system, and the limit state is the mixed Gaussian state

Φ := (1 − p) Σ_{k=0}^{∞} p^k |k⟩⟨k|,   p := (1 − ‖r0‖)/(1 + ‖r0‖),

where {|k⟩ : k ≥ 0} represents the Fock basis. Moreover, the (rescaled) component (1/√n)(L_3 − n‖r0‖) converges to a classical Gaussian variable X with distribution N := N(0, 1 − ‖r0‖²), which is independent of the quantum state. Thus, the Gaussian limit state has both quantum and classical components and should be identified with the state Φ ⊗ N on the von Neumann algebra B(ℓ²(N)) ⊗ L^∞(R).

How does this Gaussian state change when the qubits are in the state ρ_u^n? By applying the same argument, we find that the variables Q, P, X pick up expectations which (to first order in n^{-1/2}) are proportional to the local parameters (u_1, u_2, u_3), while the variances remain unchanged. More precisely, the continuous variable system is in the Gaussian state Φ_u := D(u) Φ D(u)*, where D(u) := exp(i(−u_2 Q + u_1 P)/√(2‖r0‖)) is a displacement operator, and the classical component has distribution N_u := N(u_3, 1 − ‖r0‖²).
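The two Gaussian limits displayed above are easy to illustrate numerically. The sketch below (ours; r0 is an illustrative value, taken along z) simulates independent σ_z and σ_x measurements on n qubits — i.e. the marginal statistics of the longitudinal and transverse collective spin components, which is what the displayed limits refer to — and compares the empirical variances with 1 − ‖r0‖² and 1.

```python
import numpy as np

rng = np.random.default_rng(3)

r0 = 0.6          # length of the Bloch vector, taken along z (illustrative value)
n = 20_000        # number of qubits
reps = 2_000      # Monte Carlo repetitions

# Each sigma_z outcome is +-1 with probabilities (1 +- r0)/2; each sigma_x outcome is a fair +-1 coin.
z_scaled = np.empty(reps)
x_scaled = np.empty(reps)
for t in range(reps):
    sz = rng.choice([1, -1], size=n, p=[(1 + r0) / 2, (1 - r0) / 2])
    sx = rng.choice([1, -1], size=n)
    z_scaled[t] = (sz.sum() - n * r0) / np.sqrt(n)   # (L3 - n r0)/sqrt(n)
    x_scaled[t] = sx.sum() / np.sqrt(n)              # L1/sqrt(n)

print("var of (L3 - n r0)/sqrt(n):", z_scaled.var(), " expected:", 1 - r0**2)
print("var of L1/sqrt(n):         ", x_scaled.var(), " expected:", 1.0)
```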

Definition 9. The quantum Gaussian shift model G is defined by the family of quantum–classical states on B(ℓ²(N)) ⊗ L^∞(R),

G := {Φ_u ⊗ N_u : u ∈ R³}.    (A.5)

Having defined the local models Q_n and the Gaussian shift model G, we need to specify the quantum counterparts of randomizations and of convergence of models. The natural analogue of a classical randomization is a quantum channel, i.e. a completely positive, trace-preserving map C : T_1(H) → T_1(K), where T_1(H) denotes the trace-class operators on H. Since the limit model has both classical and quantum components, this needs to be extended to arbitrary 'quantum probability spaces'; the proper mathematical framework is that of von Neumann algebras and channels between their preduals, which in finite dimensions means that we deal with channels between block-diagonal matrix algebras. The Le Cam distance between two quantum models is defined as in definition 7, with classical randomizations replaced by quantum ones and ‖·‖_1 representing the norm on the predual, which for density matrices is the trace norm.

Theorem 10. Let Q_n be the sequence of statistical models (A.4) for n i.i.d. local spin states and let G_n be the restriction of the Gaussian shift model (A.5) to the range of parameters ‖u‖ ≤ n^ε. Then

lim_{n→∞} Δ(Q_n, G_n) = 0,

i.e. there exist sequences of channels T_n and S_n such that

lim_{n→∞} sup_{‖u‖ ≤ n^ε} ‖Φ_u ⊗ N_u − T_n(ρ_u^n)‖_1 = 0,
lim_{n→∞} sup_{‖u‖ ≤ n^ε} ‖ρ_u^n − S_n(Φ_u ⊗ N_u)‖_1 = 0.    (A.6)

Let us make a few comments on the significance of the above result. Although it was intuitively illustrated using the CLT, the concept of LAN provides a stronger characterization of the 'Gaussian approximation'. Indeed, the convergence in theorem 10 is strong (in L_1) rather than weak (in distribution), it is uniform over a range of local parameters rather than at a single point, and it has an operational meaning based on quantum channels. One can exploit these features to devise asymptotically optimal measurement strategies for state estimation and to prove that the Holevo bound [10] is asymptotically attainable [42]. The following technical lemma [24] is a consequence of LAN and shows that, as far as the asymptotics are concerned, for any estimation problem with a locally quadratic loss function on the state space the original i.i.d. model can be replaced by the limit Gaussian one.

Lemma 11. Let L be a loss function whose leading term in a local expansion is L(û, u) := (û − u)·G(û − u)/n, with G = G(r0) a positive linear transformation. Then the corresponding local asymptotic minimax risks for the sequences Q_n and G_n are the same and coincide with the minimax risk of the Gaussian model G.

References

[1] Mitchell T 1997 Machine Learning 1st edn (New York: McGraw-Hill)
[2] Devroye L, Györfi L and Lugosi G 1996 A Probabilistic Theory of Pattern Recognition 1st edn (Berlin: Springer)
[3] Vapnik V 1998 Statistical Learning Theory (New York: Wiley)

[4] Friedman J H, Hastie T and Tibshirani R 2003 Elements of Statistical Learning: Data Mining, Inference and Prediction (Berlin: Springer)
[5] Bishop C M 2006 Pattern Recognition and Machine Learning (Berlin: Springer)
[6] Nielsen M and Chuang I 2000 Quantum Computation and Quantum Information (Cambridge: Cambridge University Press)
[7] Wiseman H M and Milburn G J 2009 Quantum Measurement and Control (Cambridge: Cambridge University Press)
[8] Smithey D T, Beck M, Raymer M G and Faridani A 1993 Phys. Rev. Lett. 70 1244–7
[9] Häffner H, Hänsel W, Roos C F, Benhelm J, Chek-al-kar D, Chwalla M, Körber T, Rapol U D, Riebe M, Schmidt P O et al 2005 Nature 438 643–6
[10] Holevo A S 1982 Probabilistic and Statistical Aspects of Quantum Theory (Amsterdam: North-Holland)
[11] Helstrom C W 1976 Quantum Detection and Estimation Theory (New York: Academic)
[12] Leonhardt U 1997 Measuring the Quantum State of Light (Cambridge: Cambridge University Press)
[13] Hayashi M (ed) 2005 Asymptotic Theory of Quantum Statistical Inference: Selected Papers (Singapore: World Scientific)
[14] Paris M G A and Řeháček J (ed) 2004 Quantum State Estimation (Berlin: Springer)
[15] Barndorff-Nielsen O E, Gill R and Jupp P E 2003 J. R. Stat. Soc. B 65 775–816
[16] Sasaki M and Carlini A 2002 Phys. Rev. A 66 022303
[17] Bergou J and Hillery M 2005 Phys. Rev. Lett. 94 160501
[18] Hayashi A, Horibe M and Hashimoto T 2005 Phys. Rev. A 72 052306
[19] Hayashi A, Horibe M and Hashimoto T 2006 Phys. Rev. A 73 012328
[20] Aïmeur E, Brassard G and Gambs S 2006 Proc. 19th Canadian Conf. on Artificial Intelligence (Canadian AI'06) (Québec City, Canada) (Berlin: Springer) pp 433–44
[21] Aïmeur E, Brassard G and Gambs S 2007 Proc. 24th Int. Conf. on Machine Learning (ICML'07) (Corvallis, OR, 2007) pp 1–8
[22] Gambs S 2008 Quantum classification arXiv:0809.0444
[23] Guță M and Kahn J 2006 Phys. Rev. A 73 052108
[24] Guță M, Janssens B and Kahn J 2008 Commun. Math. Phys. 277 127–60
[25] Kahn J and Guță M 2009 Commun. Math. Phys. 289 597–652
[26] Guță M and Jenčová A 2007 Commun. Math. Phys. 276 341–79
[27] Le Cam L 1986 Asymptotic Methods in Statistical Decision Theory (New York: Springer)
[28] Guță M, Bowles P and Adesso G 2010 Quantum teleportation benchmarks for independent and identically distributed spin states and displaced thermal states Phys. Rev. A 82 042310
[29] Bagan E, Baig M, Muñoz-Tapia R and Rodriguez A 2004 Phys. Rev. A 69 010304
[30] Bagan E, Ballester M A, Gill R D, Muñoz-Tapia R and Romero-Isart O 2006 Phys. Rev. Lett. 97 130501
[31] Bagan E, Ballester M A, Gill R D, Monras A and Muñoz-Tapia R 2006 Phys. Rev. A 73 032301
[32] Gammelmark S and Mølmer K 2009 New J. Phys. 11 033017
[33] Bisio A, Chiribella G, D'Ariano G M, Facchini S and Perinotti P 2010 Phys. Rev. A 81 032324
[34] Audibert J Y and Tsybakov A B 2007 Ann. Stat. 35 608–33
[35] Cohn D A, Ghahramani Z and Jordan M I 1996 J. Artif. Intell. Res. 4 129–45
[36] D'Ariano G M, Presti P L and Perinotti P 2005 J. Phys. A: Math. Gen. 38 5979
[37] Gill R D 2008 Quantum Stochastics and Information: Statistics, Filtering and Control ed V P Belavkin and M Guță (Singapore: World Scientific) pp 239–61
[38] Belavkin V P 1976 Theor. Math. Phys. 26 213–22
[39] van der Vaart A 1998 Asymptotic Statistics (Cambridge: Cambridge University Press)
[40] Radcliffe J M 1971 J. Phys. A: Math. Gen. 4 313–23
[41] Kitagawa M and Ueda M 1993 Phys. Rev. A 47 5138–43
[42] Guță M and Kahn J 2010 Optimal state estimation: attainability of the Holevo bound (unpublished work)

