
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 1, JANUARY 2010

Directly Lower Bounding the Information Capacity for Channels With I.I.D. Deletions and Duplications

Adam Kirsch and Eleni Drinea

Abstract—In this paper, we directly lower bound the information capacity for channels with independent identically distributed (i.i.d.) deletions and duplications. Our approach differs from previous work in that we focus on the information capacity using ideas from renewal theory, rather than focusing on the transmission capacity by analyzing the error probability of some randomly generated code using a combinatorial argument. Of course, the transmission and information capacities are equal, but our change of perspective allows for a much simpler analysis that gives more general theoretical results. We then apply these results to the binary deletion channel to improve existing lower bounds on its capacity.

Index Terms—Channel capacity, deletion channels, insertion channels.

I. INTRODUCTION

THIS work gives a new approach to lower bounding the asymptotic capacity of channels with independent identically distributed (i.i.d.) deletions and duplications with arbitrary finite alphabets. Specifically, we consider channels that send an i.i.d. number (possibly zero) of copies of each transmitted symbol. These channels are a subset of the class of channels with synchronization errors, first analyzed by Dobrushin [3], who generalized Shannon’s channel coding theorem to show that for any such channel, the information and transmission capacities are equal. We apply Dobrushin’s result to obtain lower bounds on the transmission capacity of our channels by directly lower bounding their information capacities. Our techniques are substantially different from those in prior works, which typically lower bound the transmission capacity directly through a more combinatorial approach. Using our techniques, which are based on elementary facts from renewal theory, we are able to achieve more general theoretical results than prior work with a much cleaner analysis, along with improved lower bounds on the transmission capacities of the channels that we consider.

Manuscript received January 26, 2008; revised September 09, 2009. Current version published December 23, 2009. The work of A. Kirsch was supported in part by the National Science Foundation (NSF) under a Graduate Research Fellowship and Grant CCF-0634923. The work of E. Drinea was supported by the NSF under Grant CCF-0634923 and EU Project NETwork REsearch FOUNDations (NETReFound) FP6-IST-034413. The material in this paper was presented at the 2007 IEEE International Symposium on Information Theory, Nice, France, January 2007. A. Kirsch is with the School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA (e-mail: [email protected]). E. Drinea was with the New England Complex Systems Institute (NECSI) and the School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA. She is now with the School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland (e-mail: [email protected]). Communicated by I. Kontoyiannis, Associate Editor for Shannon Theory. Color version of Figure 3 in this paper is available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIT.2009.2034883

We start by giving a very brief overview of previous work along these lines and its connections to this paper; for more details, see the survey [9]. For concreteness, we temporarily restrict our attention to the case of a binary alphabet. After Dobrushin’s work [3], for the specific case of the deletion channel, a series of theoretical works [1], [5], [6] (and to some extent [10]) gave successively improved lower bounds on the transmission capacity. All of these works adhere to the following basic paradigm. First, a set of codewords of some common length is chosen at random according to some symmetric first-order Markov chain, with the size of the set determining the rate of the code. (In other words, the codewords are obtained by generating alternating blocks of zeros and ones; the block lengths are i.i.d. samples from some geometric distribution.) Second, a decoding algorithm is proposed and analyzed. The analysis of the decoding algorithm results in an upper bound for the probability of an incorrect decoding. The supremum of the set of rates for which this upper bound vanishes as the codeword length grows is then a lower bound on the transmission capacity of the channel. (As an aside, Mitzenmacher and Drinea [10] follow this paradigm indirectly by deriving a relationship between the deletion channel and another channel and then lower bounding the transmission capacity of the other channel using the basic paradigm.)

We take a different approach. By Dobrushin’s result [3], we can lower bound the transmission capacity by lower bounding the information capacity. That is, rather than generating codewords randomly from some distribution over strings of a given length, introducing an explicit decoding algorithm, and then deriving asymptotic error probability bounds, we simply compute the limiting normalized mutual information between the input and the received sequence that results when the input is transmitted across the channel. For the commonly studied case where the input consists of the initial steps of some symmetric first-order Markov chain on the alphabet, this limit has a natural stochastic interpretation. Intuitively, as we scan the input from left to right, it restarts after every block. Thus, it is natural to bring in ideas from renewal theory for the analysis. (In fairness, this intuition is implicit in the combinatorial analyses in [5] and [6]. In fact, the basic expression that we use appears in a similar form in [6], but there it is quickly bounded in the interest of analyzing the decoding algorithm proposed in that work; most of our attention in this paper is spent on the inaccuracy resulting from that bound.) This results in an exact expression for the limiting normalized mutual information, whereas prior work only gives lower bounds.
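To make this change of perspective concrete, the display below restates the bound being exploited. The symbols are placeholders of ours (C for the transmission capacity, X^n for the first n channel inputs, Y^n for the corresponding output), since the paper's own notation was lost in extraction, and the display is a standard consequence of Dobrushin's theorem rather than a formula copied from the paper.

```latex
% Hedged sketch: Dobrushin's coding theorem for channels with i.i.d.
% synchronization errors identifies the transmission capacity C with the
% information capacity, so any fixed input process yields a lower bound.
\[
  C \;=\; \lim_{n \to \infty} \frac{1}{n} \max_{p_{X^n}} I(X^n; Y^n)
    \;\;\ge\;\; \lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n)
  \qquad \text{for the block-i.i.d.\ input process considered here.}
\]
```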



As an added bonus, our analysis allows for a much wider variety of channels and block length distributions than prior work (e.g., [5] and [6]). Specifically, our results apply to i.i.d. channels governed by an arbitrary deletion/duplication distribution with finite entropy and sources with i.i.d. block lengths generated from any distribution with finite mean and entropy, whereas the analyses in prior works force artificial restrictions on these distributions. Furthermore, since the lower bounds on the transmission capacity from [6] can be interpreted as lower bounds on the same limiting quantity, we can exactly identify the improvement of our bounds over those in [6]. However, the expression for this improvement is too computationally challenging to evaluate in order to obtain numerical bounds. Even obtaining lower bounds for this expression is a very time-consuming process. In fact, an exact numerical computation seems too slow and so we resort to simulations. On the positive side, the addition of this term to the previous bounds in [6] does seem to incur a nonnegligible improvement for all but very small deletion probabilities, and therefore its evaluation appears worthwhile.

In effect, this work exhibits the best possible (theoretical) lower bounds on the transmission capacity that can be obtained by using random codes whose codewords are generated with i.i.d. block lengths. The analyses in most prior works (e.g., [1], [5], and [6]) intrinsically rely on this model. Indeed, if codewords are generated independently from some distribution over strings of a given length, then the best possible resulting lower bound on the transmission capacity is the corresponding limiting normalized mutual information, which we determine exactly in the case where the codeword distribution has i.i.d. block lengths. Thus, for further improvements, one must consider codewords with dependencies among the block lengths (e.g., codewords generated by higher order Markov chains), and so the analysis in this paper no longer applies.

Before delving into the body of the paper, we observe that since the first published (conference) version of this work [4], there have been several interesting developments in research on deletion and related channels with synchronization errors. Perhaps of broadest interest, Mitzenmacher [9] has written a survey on the state of research in this area (including this work). Diggavi, Mitzenmacher, and Pfister [2] and Fertonani and Duman [7] have also made significant progress on upper bounds for the capacity of the deletion channel, a problem on which there had been little progress in recent years.

II. PRELIMINARIES

We consider finite alphabet channels with i.i.d. deletions and duplications. Formally, we fix some finite alphabet and assume without loss of generality that its symbols are the integers modulo the alphabet size. We also fix some distribution on the nonnegative integers governing how many copies the channel makes of each transmitted symbol, and we introduce shorthand for its point probabilities. For concreteness, we assume that we measure all entropies and related functions in bits, and we use the base-2 logarithm throughout. We also define some notation for strings: we use a dedicated symbol for the concatenation of two strings (we avoid the more common and succinct multiplicative notation because we frequently must differentiate between a concatenated string and the corresponding pair of strings in our proofs), and we define a block to be a string of nonzero length in which all characters are equal. We consider

Fig. 1. Procedure for generating Y , Y , . . . and X , X , . . . given X .

Fig. 2. Example demonstrating a possible generation of the (X ; Y )’s from a fixed ternary input sequence. The actual randomness used by the channel is specified in Table I.

the channel which, for some input string, independently chooses, for each symbol of the input, a number of copies from the fixed deletion/duplication distribution, and then outputs the string consisting of the chosen number of copies of the first symbol, followed by the chosen number of copies of the second symbol, etc. We let the block length law be an arbitrary probability distribution on the positive integers, and we let the block lengths be independent random variables with this common distribution. We let the first block symbol be distributed uniformly on the alphabet and, for each subsequent block, we let the conditional distribution of its symbol given the previous one be uniform on the remaining symbols. We now define the infinite source accordingly: it is an infinite string consisting of the first block length many copies of the first block symbol, followed by the second block length many copies of the second block symbol, followed by the third block length many copies of the third block symbol, etc.

Now we define a random variable that corresponds to the sequence received when we send this source through the channel. We do this by sending the source through the channel block by block. For each block sent, we look at the corresponding received sequence and see whether this string is entirely contained in the current block of the output, or whether this string is the prefix of a new block of the output (one of these cases must occur since the channel does not allow arbitrary insertions, just duplications; note that if the transmitted block is completely deleted, the former case applies). In this way, we can not only build the blocks of the output, but we can also associate each block of the output with the group of blocks in the source from which it arose. This procedure is expressed formally in Fig. 1, and we provide an example in Fig. 2 and Table I (a simulation sketch of the channel and source model appears after the table captions).

The procedure in Fig. 1 allows us to define the random variables describing the output blocks and the associated input groups in a natural way. For any fixed index, once the counter in the procedure is strictly greater than that index, the corresponding variables remain constant, and we define the values of the random variables to be those final values.

TABLE I
SAMPLES OF D FOR THE EXAMPLE IN FIG. 2

TABLE II
G’s CORRESPONDING TO THE EXAMPLE IN FIG. 2
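For readers who prefer code, the following is a minimal simulation sketch of the source and channel model just described. It is illustrative only: the alphabet size, the geometric block length law, and the particular deletion/duplication probabilities (`p_del`, `p_dup`) are placeholder assumptions of ours, not values from the paper, and the sketch does not reproduce the block-by-block bookkeeping of the procedure in Fig. 1.

```python
import random

def geometric(p):
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

def copies(p_del=0.3, p_dup=0.1):
    """Placeholder deletion/duplication law: 0 copies with prob. p_del,
    2 copies with prob. p_dup, otherwise exactly 1 copy."""
    u = random.random()
    if u < p_del:
        return 0
    if u < p_del + p_dup:
        return 2
    return 1

def sample_source(num_blocks, alphabet_size=2, block_len=lambda: geometric(0.5)):
    """Source with i.i.d. block lengths; each block's symbol is uniform over
    the symbols different from the previous block's symbol."""
    blocks, prev = [], None
    for _ in range(num_blocks):
        candidates = [s for s in range(alphabet_size) if s != prev]
        sym = random.choice(candidates)
        blocks.append((sym, block_len()))
        prev = sym
    return blocks

def transmit(blocks, copy_law=copies):
    """Each transmitted symbol is independently replaced by copy_law() copies."""
    out = []
    for sym, length in blocks:
        for _ in range(length):
            out.extend([sym] * copy_law())
    return out

blocks = sample_source(10)
print(blocks)          # list of (symbol, length) source blocks
print(transmit(blocks))  # one possible received sequence
```

Running the sketch prints the source blocks and one possible received string; with a binary alphabet, deleting an entire block merges its two neighbors, which is exactly the synchronization issue that the groups defined in the text are designed to track.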

It is not hard to see that with probability 1, line 10 is executed infinitely often and the lengths of the variables and are always finite, and therefore and are well-defined random variables with . (We use the notation to denote both the length of a string and the absolute value of a real number, with the meaning always being clear from context.) Thus, we may write and , where and for all , and for . We note that since the channel treats each symbol of independently, really does correspond to the received sequence that results from sending through the channel symbol by symbol. That is, it does not matter whether we generate by sending through the channel symbol by symbol or block by block, as we do in the procedure in Fig. 1.

Next we define some new notation. First, define and so that . Let and for , let . Clearly, the ’s are independent, and for , they are distributed uniformly on . For symbols and , define . For a string and symbol , define . For the empty string , we set . For , we let . We think of as the group of blocks of that gives rise to , normalized to remove any information about ; we give an example in Table II. The ’s are useful to us because the channel is symmetric with respect to transmitted symbols, and so we expect them to be identically distributed ( is special since it is generated in a slightly different way than the other ’s). Indeed, we have the following key lemma, which forms the basis of our subsequent analysis.

Lemma 1: 1) The ’s are independent and, for , they have some common distribution , where is uniformly distributed on and is independent of , , , . 2)

Proof: For the most part, the proof is a sequence of fairly routine calculations, and so we defer it until Appendix II. However, the fact that under such general conditions may seem unintuitive at first. We give some intuition for the claim that is equally valid for the other claim. The first observation is that is essentially the sum of some geometric-tailed number of loosely dependent block lengths with common distribution , and . It is then not difficult to believe that . From there, one can obtain the fact that by conditioning on and noting that the resulting conditional entropy is no more than if every character in were independent and uniformly distributed over . This establishes that . We can then obtain by showing that , which follows from the observation that, conditioned on , the output of the channel when is sent through it is determined by samples from , which satisfies . (We take a slightly different route in the proof in Appendix II in order to avoid having to derive a bound on before bounding , but we still follow this intuition fairly closely.)

III. THE MAIN RESULT

We are now ready to state the main result of this work. Let denote the first symbols of , and let denote the prefix of that is received if is sent across the channel. Then we have the following theorem, whose proof is the subject of this section.

Theorem 1:

where for and for . Furthermore, the sequence is nondecreasing and bounded.

Before moving to the proof of Theorem 1, we first give some intuition about what it is saying. The quantity is essentially the entropy left in if one knows the infinite string , the entire output from the channel on this string, and where to break to obtain . By attempting to take the limit of as , we expect to find the uncertainty left in if the entire infinite string and its output are revealed. It is not particularly surprising that this limit exists, or that the ’s are nondecreasing. Indeed, there is much independence in the input to the channel and the channel itself, and so we expect that by looking at larger values of , corresponding to larger portions of the input , we are conditioning on less information. As for the limit itself, intuitively it seems that there must be some notion of the “infinite conditional entropy” that would result from , and so it is not surprising that exists.


At a higher level, Theorem 1 essentially tells us that, modulo some fraction of this “infinite conditional entropy,” the information communicated across the channel is the same as if the ’s were sent independently. That is, characterizes the error in analyzing the channel under the (faulty) assumption that there are no synchronization problems in determining how the blocks of the channel input map to the blocks of the channel output, and that the only synchronization errors introduced by the channel are in how a particular output block is generated from its corresponding input blocks. Of course, this assumption is actually correct if the channel cannot introduce deletions, and indeed, from looking at the formulas it is easy to see that and all of the ’s are in this case. But in the general case where the channel does allow deletions, one can easily see that there are synchronization issues in determining what input blocks are responsible (together with the channel) for generating a particular output block.

As for the significance of Theorem 1, the term in Theorem 1 is the (theoretical) improvement of our bounds over the bounds in [6]. Furthermore, as noted in the introduction, the expression in Theorem 1 exactly achieves the best possible bounds obtainable in the model of that work. Also, since it is the limit of a nondecreasing sequence, we can lower bound it simply by lower bounding any term in the sequence. (Indeed, the lower bound in [6] corresponds to dropping this term.) We use this observation in our simulations in Section IV.

The rest of this section is devoted to the proof of Theorem 1. Let denote the last block of , let be one less than the number of blocks in , and write

tively restarting). Conditioned on this information, the process of sending through the channel is the same as independently sending the appropriate ’s through the channel, which is easy to analyze. The hard part is determining how conditioning on affects the mutual information between and (this is another way of thinking about the ’s and ). With this in mind, we now informally examine each of the terms in (1). The first term is fairly easy to think about (and formally analyze). Each block length of has entropy , and is, on average, symbols long. Furthermore, for each , the conditional distribution of the symbol used in the th block of given is uniform over a set of size . Therefore, we have

Note that is a nonempty prefix of and is a nonempty prefix of . For brevity, we define . Also, we let denote the last block of (which is also the last block of ), define to be one less than the number of blocks in , and write . We now express as follows:

(1)

We prove Theorem 1 by analyzing each of the terms in (1) separately. The full proof is somewhat technical, so we start by giving heuristic arguments. First, we give a high-level explanation of the role of . Essentially, it is the partitioning information that tells us when the process in Fig. 1 that generates and generates new output blocks (effec-

The second term is slightly more challenging. As , we expect to behave like , which is essentially the same as , up to the ’s. (We neglect since it is special and should not have any effect on the asymptotics.) Similarly, as , we expect to behave like , up to the ’s. Now, since the ’s are independent and have common distribution for , the conditional entropy of given is for any . Since the average length of each is for , we have that, conditioned on , we accumulate roughly bits of conditional entropy for every symbols of . Thus


For the corresponding lower bound, we fix any , let , and write

(2)

where the only nontrivial steps are the fourth and sixth equalities. The sixth equality holds by Lemma 1. The fourth equality essentially holds because form a Markov chain (this is clear from Fig. 1). Using the standard notation for a sequence of three random variables , , and that form a Markov chain when considered in order (or equivalently, in reverse order), it follows that

and therefore (using the standard fact that if for any function whose domain contains the support of , then ) that

where we have used Lemma 2. Taking the limit as then completes the analysis of the third term in (1).

We now give the formal proof of Theorem 1. First, we formalize the claim that and using basic arguments from renewal theory.

Lemma 3: and almost surely as .

Proof: Let and let for . For , let . By Lemma 1, the ’s are independent and have common distribution . Therefore, is a delayed renewal process. Since , we may apply a standard result from renewal theory (e.g., [11, Prop. 3.14]) to obtain almost surely as . Thus, almost surely. The proof that almost surely is entirely similar.
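The renewal-theory fact invoked in Lemma 3 (the number of renewals by time n, divided by n, converges almost surely to the reciprocal of the mean inter-renewal time) is easy to check numerically. The sketch below is ours and is not tied to the paper's specific random variables; the geometric inter-renewal law and the parameter 0.25 are arbitrary stand-ins.

```python
import random

def geometric(p):
    """Trials up to and including the first success of a Bernoulli(p) sequence."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

def renewals_by_time(n, inter_arrival):
    """Count renewals occurring by time n for i.i.d. inter-renewal times."""
    t, count = 0, 0
    while True:
        t += inter_arrival()
        if t > n:
            return count
        count += 1

p = 0.25                      # mean inter-renewal time is 1/p = 4
n = 100_000
estimate = renewals_by_time(n, lambda: geometric(p)) / n
print(f"N(n)/n = {estimate:.4f}, 1/E[L] = {p:.4f}")  # the two should be close
```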

Lemma 2: The sequence is nondecreasing and bounded.

Proof: By Lemma 1, each . Thus, it suffices to show that the sequence is nondecreasing. Indeed, for any

We can now sketch the formal asymptotic analyses of the first two terms in (1). The analyses rely heavily on the notion of uniform integrability, and so we encourage readers who are unfamiliar with that concept to read Appendix I before continuing, for the definition, standard results, and intuition that is extremely useful in understanding our proofs. We define the following notation for jointly distributed discrete random variables. We let denote the support of . For any , we let denote the support of the conditional distribution of given . Finally, we let denote the random variable whose value is the support of the conditional distribution of given . We start by stating two general technical lemmas.

where the second step follows from the fact that the ’s are independent, and the third step follows from the fact that

Lemma 4: Let random variables with stant . Then, In particular, the family formly integrable. Proof: We write

This observation establishes the necessary step. Now we examine the terms in (2).

determines

be any family of discrete for some con. is uni-

.

Returning to (2), Lemma 2 tells us that an arbitrarily large constant fraction of the terms in the sum converge to , and assuming that this convergence occurs sufficiently quickly, then as . we should obtain Indeed, by Lemma 2, we have that for any

as


where we have used Jensen’s inequality and the concavity of on (which is easy to check by examining the second derivative of ). It follows that

and therefore, the ’s are uniformly integrable (by, e.g., [8, Example 7.10.5]).

Proof: We start by noting that and that . Thus, by Lemma 4 and Theorem 2 (in Appendix I), it suffices to show that almost surely as . To this end, we write

Lemma 5: Let be a sequence of independent discrete random variables where for all and, for sufficiently large , the ’s have some common distribution . Let be a sequence of positive integral random variables defined on the same probability space as the ’s, and suppose that almost surely for some constant . Then, almost surely

where we have used the fact that conditional distribution of given . Next, we write

Proof: Fix some surely, we have

By Lemma 3, we have (and ) almost surely, and therefore, Lemma 5 implies that

. Since

almost

and that for , the is uniform on

for sufficiently large , almost surely. It follows that, almost surely, for sufficiently large

It follows that

almost surely as

, completing the proof.

The analysis of the second term in (1) is essentially the same as for the first term, but slightly more technical. Lemma 7:

(3) Proof: We write by the strong law of large numbers. A union bound over all now tells us that (3) holds almost surely for all simultaneously, and therefore, we may take the limit as to obtain

almost surely. The proof of the corresponding upper bound is entirely similar. We are now ready to analyze the first term in (1). Lemma 6:

Note that and that can be represented as a (possibly empty) binary string of the form

which

has

length

. Thus, , and so Lemma 4 im’s are uniformly plies that the integrable. By Theorem 2 (in Appendix I), it now suffices to show that, almost surely


We do this by showing that, almost surely

(4)

(5) We start with (4), writing

random varifor the first two terms, because the relevant ables do not factor into i.i.d. random variables. However, we for can still write . The ’s are easily shown to be uniformly integrable using Lemma 4. Recalling that and as for , we have (heuristically) any

This argument is essentially the justification for (2), and from there, our previous analysis is sufficient to complete the proof of Theorem 1. In reality, we do not prove (2) directly, but instead perform a direct analysis of the third term in (1) using the previous arguments as intuition. We now give the formal proof as a sequence of lemmas. Lemma 8:

Proof: For the upper bound, we write (recalling that is just the last block of , and is the remainder)

Therefore

By Lemma 1, the ’s are independent and, for they have common distribution with

Since almost surely as almost surely

,

(and ) (by Lemma 3), Lemma 5 tells us that,

Thus, we have established (4). The proof of (5) is similar. We write

so that

Lemmas 1, 3, and 5 now yield (5), completing the proof. The formal analysis of the third term in (1) is fairly technical. The added difficulty is due primarily to the fact that we cannot use the same sort of almost sure convergence trick as

For the lower bound, we have

where the fourth step is the only nontrivial one, and it follows from the fact that is conditionally independent of [which determines ] given . Indeed, given , we can take a sample from the appropriate conditional distribution of in the following way [here we temporarily assume ]:
1) Let be the last character of , and choose randomly from .
2) Take a sample from the conditional distribution of given that if the first characters in are sent through the channel, the result is a nonempty block of ’s.
3) Compute the sample of by adding (modulo ) to every character of , and taking the first characters of the result.
4) Compute the sample of by taking a sample from the conditional distribution of the channel output when the


sample of is sent through the channel, given that the output contains a single nonempty block of ’s. is the same, except (The sampling algorithm for we set and use instead of .) Since this sampling , we are done. algorithm only requires knowledge of Lemma 9: Consider some random variable and any other discrete random variable defined on the same probability space. Then by Lemma 3 and the strong law of large numbers. Lemma 11: Let be an event with Proof: Since

Proof: First, we write

be a nonnegative random variable, and let . Then, . is nonnegative

Next, we observe that where we use the standard notation ment of the event .

to denote the comple-

Lemma 12:

Finally, we have Proof: First, we note that since for any

As we saw in the proof of Lemma 7, as a binary string of the form which is the entropy of the conditional distribution of given that , and the entropy of any distribution with support is at most . contained in Lemma 10: Proof: We write

which

has

length

. Now, for any

where the first step follows from Lemma 9, and the second step is contained follows from the fact that the support of . Since as , it suffices to in as . To this end, we note show that that , and therefore, it is enough to show that as almost surely, by the dominated con, since vergence theorem. Indeed, is a prefix of . Therefore, almost surely

(6) can be represented

. Thus, , and so Lemma 4 tells us that the ’s are uniformly integrable. and , define the event

Fix any , and let denote the indicator function. such By Theorem 3 (in Appendix I), there exists some and any event with that for all

By Lemma 3, ciently large , we have

, and so for suffi-


It follows that for large enough , we have

(7) For convenience, let define

For brevity, let

for

. Also

Bounding the other terms in (8) requires more care. The main random idea is that, conditioned on , we can modify the variable inside the expectation in a way that would not be valid in the original probability space. We can then remove the conditioning from the expectation through an application of Lemma 11. Once the conditioning is removed from the expectation, the expectation once again becomes a conditional entropy, but a different conditional entropy than at the beginning of the proof. The new conditional entropy is easily bounded by . . To shorten our equations, define Fix some

for

. Then Now, we have that independent given given

and and

are conditionally , because for any ,

forms a Markov chain. To see this, consider the following algorithm for taking a sample from the conditional distribution of: (8) We now upper bound the terms of the sum in (8). We start with the first term. By Lemma 11

(9) We can bound the last few terms of the sum in (8) similarly

(10)

given

and

and

1) Let be the last character of (which is determined by the input), and choose randomly from . Then, choose sequentially, choosing randomly from .
2) Take a sample from the conditional joint distribution of independent samples from given that and that sending the first characters of through the channel results in a single nonempty block of ’s.
3) For , compute the sample of by adding (modulo ) to every character in .
4) Sample according to the conditional distribution of the channel output when the sample of is sent through, given that the output corresponding to the first bits is a single nonempty block of ’s. For , sample according to the distribution of the channel output when is sent through.
Note that this algorithm does not use the input value of at all, implying the promised conditional independence result. Continuing, it follows that if , then for any


Of course, letting

given for any

, and therefore, and


Proof: The proof is very similar in spirit to that of Lemma 4, although more technical, and so we defer it to Appendix III.

, we have Lemma 14:

(11) where the third step follows from the fact that, when occurs, is determined by the other random variables on which we are conditioning, and the fourth step follows from Lemma 11. Now

Proof: Fix any and define

and . Define as in the proof of Lemma 12, and

For brevity, let

Then

(13) (12) The first step is obvious. The second step follows from the fact that

and

determines

(In the case where the former tuple implies that , this fact is easily seen upon expanding the notation. , Otherwise, the former tuple implies that and it is easy to see that in this case, this tuple determines and and , and that the first two of these three quantities can be used to from the third; from this point, the recover claim can be easily seen by expanding the notation.) ’s are The third step follows from the fact that the . Finally, the fourth step follows identically distributed for from Lemma 2. now gives Combining (6)–(12) and taking limits as

Now, for any independent of and Indeed, given

, we claim that and and

is conditionally given and .

forms a Markov chain. To see this, consider the following algorithm for taking a sample from the conditional distribution of:

and since are arbitrary, we can take limits as to obtain the desired result. For convenience, we define , we define for and

(14)

. Also, and .

given

and

and

Lemma 13: The family

is uniformly integrable.

1) The input already specifies and . Thus, it is enough to generate a sample from the appropriate conditional distribution of . Also, for the rest of the algorithm, note that the input specifies .
2) Let be the last character of , and choose randomly from . Then, sample for sequentially by choosing at random from .
3) Take a sample from the conditional joint distribution of independent samples of given that
4) For , compute a sample of the appropriate conditional distribution of by adding (modulo ) to each of the characters in . Concatenate these strings to obtain the required sample from the appropriate conditional distribution of .
This algorithm does not use the input value of at all, implying the claimed conditional independence result. Continuing, letting for , we have that given

Substituting (18) into (16) now gives

(19) Now we lower bound (17). We start by writing

(20) We now use Lemma 13 to bound (20). Fix some . By Lemma 13 and Theorem 3 (in Appendix I), there exists some such that for all and any event with , we have

Since , we have ciently large . Therefore, for sufficiently large

for suffi-

(21) (15) Substituting (19) into (16) and substituting (21) into (20) into (17) and taking limits as now gives

Combining (13)–(15) now gives

(22)

(16)

(17)

Since (22) holds for sufficiently small take the limit in (22) as result.

, we may to obtain the desired

Combining Lemmas 8, 10, 12, and 14 now gives the following result, which completes the proof of Theorem 1. Lemma 15:

We lower bound (16) and (17) separately, starting with(16). , Lemma 2 yields For

IV. SIMULATION RESULTS FOR THE DELETION CHANNEL

(18)

We now apply Theorem 1 to the binary deletion channel to gain a more concrete sense of our improvement on previous work. Our objective here is merely to demonstrate that the third term in Theorem 1 is significant enough to warrant further study. We leave the issue of analyzing this term more precisely for future work.


Similarly to [6], we can use Theorem 1 to lower bound the capacity of the binary deletion channel by choosing to be some geometric distribution. Theorem 1 then tells us that the capacity of the channel is at least

(23)

All of the quantities in (23) except for two have closed forms. One of the two can be written as an infinite sum that can be numerically approximated (see [6, Lemma 1]). These calculations become more time consuming as the deletion probability increases, because this causes the average length of a group to increase: indeed, from the proof of Lemma 1 we obtain an expression for the average group length, which increases with the deletion probability. Estimating the other is much more challenging, even if all we seek is a lower bound (which, as discussed in Section III, is sufficient to improve the lower bounds on the capacity shown in [6]). We write

(24)

where the first step follows from Lemma 2 and the last step follows from Lemma 1. Since we can numerically approximate the former quantity, it suffices to estimate (or even just lower bound) the latter. Similarly, the latter can be written as an infinite sum. However, unlike for the former, numerically approximating this sum seems to require an enormous amount of time. The basic underlying reason for the increase in the complexity of the computations is the same as the reason for the increase in the difficulty of analytically deriving the third term in the expression of Theorem 1. In essence, our approach attempts block decoding in the standard information theoretic sense, which was previously suggested in [6]. More specifically, Drinea and Mitzenmacher [6] consider a received block of zeros (or ones) as the output symbol; for better decoding, more consecutive received blocks could be considered as the new output symbol. However, besides the increased computational effort, the combinatorial nature of the approach in [6] yields very complicated formulas, even when considering output symbols consisting of only two received blocks, making it difficult to quantify the improvement due to block decoding. Here, we succinctly identify this improvement for output symbols consisting of any number of consecutive received blocks. Indeed, adding the third term of Theorem 1 to the first two terms essentially mimics block decoding for any number of consecutive received blocks.

Of course, the main drawback of block decoding is that it is computationally challenging; in our case, considering multiple blocks of the received sequence instead of just one introduces some very complicated dependencies. The details are fairly technical, but in brief, these dependencies present themselves as a large number of nested summations, making it much more difficult to compute than the other quantity, where this nesting can be reduced. Thus, we resort to estimating it through simulation.
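The estimation we turn to is the standard plug-in (empirical) entropy estimate, computed from independent samples as described in the next paragraph. The sketch below shows only the estimator; `sample_G`, the sampler for the relevant group random variable, is a hypothetical placeholder that we do not implement here.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate of a list of samples, in bits."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def estimate_entropy(sample_G, num_samples=100_000):
    """Estimate the entropy from num_samples independent draws of the (hashable)
    random object returned by sample_G().  The plug-in estimate is biased low,
    consistent with the convergence from below noted in the text."""
    samples = [sample_G() for _ in range(num_samples)]
    return empirical_entropy(samples)
```

With samples represented as hashable objects (e.g., tuples of block lengths), `estimate_entropy(sample_G)` returns the estimate in bits.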

Fig. 3. Lower bounds on the capacity of the binary deletion channel.

We estimate the quantity by taking independent samples from the distribution of the relevant random variable and computing the entropy of the resulting empirical distribution. Our simulations suggest that the estimate converges from below as the relevant parameter increases, but that the chosen value gives a sufficiently good approximation for the resulting lower bound on the capacity of the channel to be reasonably accurate (for the values that we consider). We give the results of our simulations in Fig. 3. Since the computation time increases quickly as the relevant parameter grows, we only have results for a limited range of deletion probabilities. However, it is evident that our theoretical results allow for improved lower bounds on the capacity. Indeed, at one of the deletion probabilities we consider, the new lower bound exceeds that of [6] by more than 10%. Furthermore, we expect this improvement to increase with the deletion probability.

V. CONCLUSION

We give a new approach to lower bounding the capacity of channels with i.i.d. deletions and duplications that is simple, intuitive, and provides stronger theoretical results than currently obtainable using the more standard combinatorial approach. Specifically, we consider channels governed by any deletion/duplication distribution with finite entropy, and we consider sources with i.i.d. block lengths given by any distribution with finite mean and entropy, as opposed to prior work [5], [6], which requires that one of these distributions have geometrically decreasing tails and that the other be geometric or have finite support. Furthermore, our results essentially reveal the limits of using i.i.d. block lengths to lower bound the channel capacity, which is currently the standard technique for deriving such lower bounds. Finally, we show that our techniques allow for improved lower bounds on the capacity of the binary deletion channel.

APPENDIX I
UNIFORM INTEGRABILITY

This Appendix gives an overview of the concept of uniform integrability and a preview of its importance in our analysis. We include this Appendix so that readers who are unfamiliar with


uniform integrability can gain some additional intuition before reading our proofs that rely on it. Of course, we give only a brief overview; for more information see, for example, [8, Ch. 7.10]. We begin with the standard definition.

Definition 1: A family of random variables $\{X_i\}$ is uniformly integrable if
\[
  \lim_{a \to \infty} \ \sup_{i} \ E\bigl[\,|X_i|\,\mathbf{1}(|X_i| \ge a)\bigr] \;=\; 0,
\]
where $\mathbf{1}(\cdot)$ denotes the indicator function.
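As a concrete instance of Definition 1 (and of the role that Lemma 4 plays in the main text), a family with uniformly bounded second moments is uniformly integrable; the symbols $X_i$, $c$, and $a$ below are generic placeholders of ours rather than the paper's notation.

```latex
% On the event {|X_i| >= a} we have |X_i| <= X_i^2 / a, so a uniform second
% moment bound sup_i E[X_i^2] <= c forces the tail expectations to vanish:
\[
  \sup_i \, E\bigl[\,|X_i|\,\mathbf{1}(|X_i| \ge a)\bigr]
  \;\le\; \sup_i \, \frac{E[X_i^2]}{a}
  \;\le\; \frac{c}{a} \;\longrightarrow\; 0
  \quad \text{as } a \to \infty,
\]
% which is exactly the condition in Definition 1.
```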

In our proofs, we are interested in computing a limiting expectation, and we essentially do this by computing the limit of the conditional expectation given a corresponding “good” event. We use this technique because conditioning on the “good” event provides us with a lot of structure that we can use in the analysis. However, to argue that

we need to show that

At first glance, Definition 1 is not particularly enlightening. However, if we write

then we can see that Definition 1 formalizes the idea that as $a$ grows very large, $P(|X_i| \ge a)$ shrinks to $0$, and $E[\,|X_i| \mid |X_i| \ge a\,]$ does not grow fast enough to prevent the product from converging to $0$. Furthermore, this effect happens uniformly over all $i$. Indeed, this uniformity is what makes uniform integrability stronger than just saying that each $X_i$ is integrable (that is, $E|X_i| < \infty$ for every $i$).

There are two important characterizations of uniform integrability that we use in our analysis. We begin with the first that we use.

Theorem 2 [8, Th. 7.10.3]: Suppose that $X_n \to X$ in probability as $n \to \infty$. Then, the following are equivalent.
1) The family $\{X_n\}$ is uniformly integrable.
2) $E|X_n| < \infty$ for all $n$, $E|X| < \infty$, and $X_n \to X$ in mean as $n \to \infty$.
3) $E|X_n| < \infty$ for all $n$, and $E|X_n| \to E|X| < \infty$ as $n \to \infty$.

In our applications of Theorem 2, the family is almost surely always nonnegative, and we know that $X_n \to X$ almost surely. In this case, the equivalence of the first and third items in the theorem essentially tells us that uniform integrability is a characterization of when one may interchange limit and expectation operations. In particular, the theorem allows for generalizations of the dominated convergence theorem, which is the result that is usually used to justify such an interchange.

The second characterization of uniform integrability that we present is slightly more technical in appearance.

Theorem 3 [8, Lemma 7.10.6]: The family $\{X_n\}$ is uniformly integrable if and only if both of the following conditions hold.
1) $\sup_n E|X_n| < \infty$.
2) For any $\varepsilon > 0$, there exists some $\delta > 0$ such that for all $n$ and any event $A$ with $P(A) \le \delta$, we have $E[\,|X_n|\,\mathbf{1}_A\,] \le \varepsilon$.

In our proofs, we always know that the family is uniformly integrable and nonnegative when we apply Theorem 3, and we are interested in obtaining the second item. The second item is significant to us because if we have some sequence of “good” events $A_n$ such that $P(A_n) \to 1$, then we may write

(25)

Theorem 3 gives us the mechanism to prove (25). We fix some arbitrary $\varepsilon > 0$ and apply Theorem 3 to guarantee the existence of some $\delta > 0$ such that for all $n$ and any event $A$ with $P(A) \le \delta$, we have $E[\,|X_n|\,\mathbf{1}_A\,] \le \varepsilon$. Now, since $P(A_n) \to 1$, we know that $P(A_n^c) \le \delta$ for sufficiently large $n$, and therefore

Since the above equation holds for any $\varepsilon > 0$, we may take the limit as $\varepsilon \to 0$ to obtain (25).
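Written out with generic symbols (ours, not the paper's), the mechanism just described is the following one-line chain: for a uniformly integrable, nonnegative family $\{X_n\}$ and events $A_n$ with $P(A_n) \to 1$, fix $\varepsilon > 0$, take the $\delta$ supplied by Theorem 3, and note that once $P(A_n^c) \le \delta$,

```latex
\[
  0 \;\le\; E[X_n] - E\bigl[X_n \mathbf{1}_{A_n}\bigr]
    \;=\; E\bigl[X_n \mathbf{1}_{A_n^c}\bigr]
    \;\le\; \varepsilon ,
\]
% dividing by P(A_n) -> 1 turns E[X_n 1_{A_n}] into the conditional expectation
% E[X_n | A_n], so the conditional and unconditional expectations share the
% same limit, which is the content of (25).
```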

APPENDIX II
PROOF OF LEMMA 1

That the ’s are independent is obvious from an examination of Fig. 1. Similarly, it is easy to see that the randomness used between successive invocations of line 10 is identically distributed, and therefore, for , the ’s are identically distributed. Furthermore, it is also clear that is uniformly distributed on and is independent of . The rest of the proof is a sequence of calculations. We start with the case where (i.e., the channel does not perform deletions). In this case, is just the th block of the input to the channel, and so for . It follows that . Furthermore, conditioned on , we have that has the same distribution as the sum of independent samples from , which has entropy at most (the joint entropy of independent samples from ). It follows that

We now have

as required. We now handle the more interesting case where , starting with some technical preliminaries. Suppose that we send symbols across the channel. Let denote the conditional distribution of given that all symbols are deleted. Let denote i.i.d. samples from , and let denote the distribution of , where the geometric random variable is independent of the ’s. (Throughout


this paper,

denotes the geometric distribution with for .) Then

and by Wald’s equation

Next, let denote the longest suffix of that does not contain the symbol , and let denote the number of blocks in . if .) To describe the distributions of (Note that and , consider sending across the channel. is uniformly distributed on and that for Recall that , the conditional distribution of given is . Suppose we send across the channel uniform on block by block, stopping just after the first block where a symbol comes out of the channel. Let denote the prefix of that we send, and construct from by dropping the last block (the first block in resulting in a symbol coming out of the channel). Then, the distribution of is the same as the conditional distrigiven that does not contain the symbol . It bution of follows that has the same distribution as , where is independent of the ’s. Furthermore, for any , conditioned on any particular values for the first blocks of , the probability that the next block sent across the channel is not a block of ’s and is entirely deleted by the channel is . Thus, the distribution of is stochastically dominated by the distribution . (A real-valued distribution stochastically dominates another if real-valued distribution for every real ; by a standard coupling argument, this implies for every nondecreasing function . Note that we use stochastic dominance here because geometric distributions are easy to analyze, but is not geometric as it can be .) By Wald’s equation, it follows that

We are now ready to commence the calculations in earnest. Define independent random variables and and and so that:
• ;
• has distribution ;
• has distribution ;
• has distribution ;
• has distribution ;
• the ’s are identically distributed for .
For a string , define so that , where denotes the indicator function. For the empty string , let . Let . For , let . It is easy to see that is distributed as


In other words, the first block of is a block of zeros whose length is distributed according to the conditional distribution of given that at least one symbol in a block of length is transmitted through the channel. The blocks of then alternate between ’s and ’s, where a block length of ’s is just determined from , and a block length of ’s represents all of the symbols that the channel deletes between subsequent occurrences of the in . This alternation between blocks of ’s and symbol corresponding to the last blocks of ’s continues until in in . Then, there is one last block of ’s, occurrence of possibly empty, corresponding to all of the symbols deleted beand the next block of that is tween this last occurrence of not entirely deleted by the channel. Using Wald’s equation, we can compute

Next, we write

Now

and for

and so

Similarly

and

Therefore


Now let denote the samples of channel in transmitting . Then


used by the

because and are each distributed as the conditional distribution of something determined by the process of sending zeros through the channel, given that the process satisfies some condition; specifically, that at least one comes out of the channel (for ), or that all symbols are deleted (for ). Also, it is easy to see that , and therefore, . Thus

Furthermore

and the conditional distribution of given is the same as i.i.d. random variables whose common distribution is the conditional distribution of given that sending symbols across the channel does not result in any received symbols. (we saw that Thus, earlier in the proof). Also

and so have

where we have used the fact that the conditional distribution of given has support at most , and therefore, has entropy at most . Now let . Then

. It is easy to see that

, and we

since we can think of as being determined by a sample from and the random variable . Finally, using the same technique as for yields , implying that

which completes the proof. where the inequality follows from the fact that mine . For any in the support of

and

where we have used the fact that the conditional distribution can be thought of as the conditional disof given tribution of independent samples from given that they satisfy some condition; since conditioning does not increase entropy, this distribution has entropy at most . Thus, , and so

We have now shown that . The analysis of is similar. Let denote the number of blocks that are completely transmitted across the channel before the first symbol is received. Then, has the same distribution as

where if

APPENDIX III PROOF OF LEMMA 13

deter-

is a random variable that is the empty string and whose conditional distribution given and is the same as . It is clear that has distribution and that is a stopping time for . Thus, Wald’s equation gives

We proceed using a generalization of the technique used in the proof of Lemma 4. Specifically, we show that

by showing that as

For . Let in

, let

be the number of blocks of contained in denote the total number of blocks . Then, given , we can encode as a binary string of the form

(This string uniquely specifies when is known since the lengths of the blocks of contained in are easily seen.) The length of the string is clearly , and therefore Now, as


symbol used for the previous block. Now, once we reach the next block of ’s in , the length of that block simply has distribubetween tion , and then the distribution of the portion of this block of ’s and the next block of ’s is determined as before. This process continues until the last block of ’s in , corresponding to the last block of ’s in . The remainder of the has distribution (by definition of ), corresponding to deleted after the last occurrence of . all of the symbols in It now follows that has the same distribution as

The first two steps are obvious. The third step follows from Jensen’s inequality and the concavity of on (which is easy to check by examining the second derivative of ). The fourth, fifth, and sixth steps are obvious, and the seventh step follows from the Cauchy–Schwarz inequality. For the eighth step, looking back to the procedure ’s have a common disin Fig. 1, it is easy to see that the tribution for . Furthermore, it is also easy to see that is stochastically dominated by the sum of two independent samples from . Thus, it suffices to show that , which we do by a direct calculation. Define random variables and and and and , , and such that: ; • are i.i.d. with common distribution ; • for ; • • are i.i.d. with common distribution ; has distribution , where is • as defined in the proof of Lemma 1; • is uniformly distributed on ; • has the distribution from the proof of Lemma 1; • is the number of blocks in ; • the ’s, ’s, ’s, , , and are independent; , the conditional distribution of given the • for ’s, ’s, ’s, , , and is uniform on if for any , and uniform on otherwise. It is easy to see that has the same distribution as

Recall from the proof of Lemma 1 that is stochastically dominated by a random variable for some , so . A simple calculation now yields

completing the proof. In other (less precise) words, for any , the first block of is a block of ’s whose length is distributed according to the conditional distribution of given that at least one symbol in a block of length is transmitted through the channel. The next blocks of correspond to the blocks that are deleted between the first block and the next block of ’s in . The number of such blocks clearly has distribution , which is the common distribution of the ’s. Furthermore, the symbol for each of these blocks, conditioned on all previous blocks, is uniform over all of , with the exception of and the

ACKNOWLEDGMENT

The authors would like to thank M. Mitzenmacher for his comments and many useful discussions and the anonymous reviewers for helping to improve the presentation of this paper.

REFERENCES

[1] S. Diggavi and M. Grossglauser, “On information transmission over a finite buffer channel,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1226–1237, Mar. 2006.
[2] S. Diggavi, M. Mitzenmacher, and H. Pfister, “Capacity upper bounds for deletion channels,” in Proc. Int. Symp. Inf. Theory, 2007, pp. 1716–1720.


[3] R. L. Dobrushin, “Shannon’s theorems for channels with synchronization errors,” Probl. Inf. Transmission, vol. 3, no. 4, pp. 11–26, 1967. Translated from Problemy Peredachi Informatsii, vol. 3, no. 4, pp. 18–36, 1967.
[4] E. Drinea and A. Kirsch, “Directly lower bounding the information capacity for channels with I.I.D. deletions and duplications,” in Proc. IEEE Symp. Inf. Theory, 2007, pp. 1731–1735.
[5] E. Drinea and M. Mitzenmacher, “On lower bounds for the capacity of deletion channels,” IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4648–4657, Oct. 2006.
[6] E. Drinea and M. Mitzenmacher, “Improved lower bounds for I.I.D. deletion and insertion channels,” IEEE Trans. Inf. Theory, vol. 53, no. 8, pp. 2693–2714, Aug. 2007.
[7] D. Fertonani and T. M. Duman, “Novel bounds on the capacity of the binary deletion channel,” [Online]. Available: http://arxiv.org/abs/0810.0785v1
[8] G. Grimmett and D. Stirzaker, Probability and Random Processes, 3rd ed. Oxford, U.K.: Oxford Univ. Press, 2003.
[9] M. Mitzenmacher, “A survey of results for deletion channels and related synchronization channels,” [Online]. Available: http://www.eecs.harvard.edu/~michaelm/
[10] M. Mitzenmacher and E. Drinea, “A simple lower bound for the capacity of the deletion channel,” IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4657–4660, Oct. 2006.
[11] S. Ross, Applied Probability Models with Optimization Applications. San Francisco, CA: Holden-Day, 1970.

Adam Kirsch received the Sc.B. degree in mathematics-computer science (magna cum laude) from Brown University, Providence, RI, in 2003 and the S.M. and Ph.D. degrees in computer science from Harvard University, Cambridge, MA, in 2005 and 2008, respectively. His research interests are primarily in the applications of techniques from theoretical computer science, including the probabilistic analysis of computer processes and the design and analysis of algorithms and data structures for massive data, data streams, and long-term systems. Dr. Kirsch received a National Science Foundation (NSF) Graduate Research Fellowship in 2004.

Eleni Drinea received the B.S./M.S. degree in computer engineering and informatics from the Computer Engineering and Informatics Department, University of Patras, Patras, Greece, in 1999 and the M.Sc. and Ph.D. degrees in computer science from Harvard University, Cambridge, MA, in November 2005. In 2006, she joined the New England Complex Systems Institute (NECSI) and Brandeis University as a Postdoctoral Fellow. She is a Research Associate at the School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland, since June 2007.
