Tuesday, October 22, 2013 CC-BY-NC
The pumping lemma for CFLs

Maintainer: admin

1The pumping lemma for CFLs

1.1Relation to regular languages

Just as we had for regular languages, context-free languages also have a pumping lemma that can be used to prove that a language is not context-free. However, this pumping lemma is quite different from the one we've seen before. The major difference between the machines for recognising context-free languages and those for regular languages is that the latter are finite-state machines, while the former are more like infinite-state machines due to the existence of the unbounded stack. Thus, the argument we used to prove the pumping lemma for DFAs will not work for PDAs.

So where do we find the structure needed to prove the pumping lemma? In the grammar, of course. We learned earlier that any CFL can be described by a grammar in Chomsky normal form. We'll use this fact in our proof of the pumping lemma.

1.2Sketch of the proof

Assume that $L$ has a context-free grammar $G$ in Chomsky normal form, where $G$ contains $n$ non-terminals (where $n$ is finite).

Recall that in CNF, our rules are all of the form $A \to BC$ and $A \to a$ (or $S \to \epsilon$). Consider a string of length $2^{n+1}$. By the form of the rules in CNF, we know that the branching in the tree for generating this string is at most binary. So the smallest tree you can build to represent such a tree has a height of $n+1$.

Here's another way to think about this: going down the tree, you must have hit at least $n+1$ non-terminals on the way. Since there are only $n$ distinct non-terminals in this grammar, then, by the pigeonhole principle, at least two occurrences along the path must be using the same non-terminal, which we'll call $X$.

Pumping lemma diagram

Now divide the string up into 5 pieces: $uvwxy$. We'll call the tree that generates $w$ $t$, and the (larger) tree that generates $vwx$ $T$, as in the diagram below.

Another pumping lemma diagram

This shows us where we have the opportunity to pump. In a context-free language, we could just replace $t$ by $T$ or vice versa (pumping up vs. pumping down). So then you would have either $uvxy$ or $uvvwxxy$, and in the latter case you could keep pumping indefinitely, resulting in $uv^iwx^iy$ for some integer $i$.

This is the essence of the proof for the pumping lemma. We won't formalise it.

1.3Formal statement of the pumping lemma

Let $L$ be a CFL. $\exists p \geq 0$ such that $\forall s \in L$ with $|s| \geq p$, $\exists u, v, w, x, y \in \Sigma^*$ where $s = uvwxy$ such that

  1. $|vx| > 0$ (i.e., $v$ and $x$ can't both be the empty string);
  2. $|vwx| \leq p$;
  3. $\forall i \geq 0$, $uv^iwx^iy \in L$.

1.4Using the contrapositive

As with regular languages, we can use the contrapositive of the pumping lemma to prove that a language is not context-free. Here's a formal statement of the contrapositive:

Given some language $L$. If, $\forall p \geq 0$, $\exists s \in L$ such that $|s| \geq p$, and $\forall u, v, w, x, y \in \Sigma^*$ where $s = uvwxy$, with $|vx| > 0$ and $|vwx| \leq p$, $\exists i \geq 0$ such that $uv^iwx^iy \notin L$, then $L$ is not context-free.

1.4.1Examples

$$L = \{a^nb^na^n\}$$

We see intuitively that a pushdown automaton with only one stack is incapable of recognising this language. To prove it, we can use the pumping lemma. We can use the same sort of game formulation that we used with regular languages:

  • Adversary chooses some $p$
  • We produce a string $s = a^pb^pa^p$, which is clearly $\in L$
  • Adversary chooses some $u,v,w,x,y\in \Sigma^*$ such that $s = uvwxy$.

Our job, now, is to choose some $i$ such that $uv^iwx^iy \notin L$. We can do this on a case-by-case basis, making use of the fact that $|vwx|$ must be $\leq p$ (and so $v$ and $x$ cannot be greater than $p$ characters apart). We break up the string $s$ into blocks as follows:

$$\underbrace{a^p}_{\text{block 1}} \underbrace{b^p}_{\text{block 2}} \underbrace{a^p}_{\text{block 3}}$$

  • Case 1: both $v$ and $x$ are in block 1. Choose $i \neq 1$. Then $uv^iwx^iy$ would have too many $a$'s at the beginning, and so the resulting string would not be part of $L$.
  • Case 2: $v$ or $x$ straddles a block boundary. Choose $i = 2$. Then, the $a$'s and $b$'s would be out of order in the resulting string.
  • Case 3: $v = \epsilon$ and $x$ is fully within a block or $x = \epsilon$ and $v$ is fully within a block. Choose $i\neq 1$. Then there would be too many of one letter.

(There are in fact many other ways of choosing the cases - the solution presented above is merely one of many possibilities.)

$$L = \{ww \mid w \in \Sigma^* \} \tag{$\Sigma = \{a, b\}$}$$

This one is trickier since the language is so unconstrained. Instead of using the same approach that we used for the last question, we'll make the following fact: if $L$ is context-free and $R$ is regular, then $L \cap R$ is context-free. We define a language $L' = L \cap \{a^*b^*a^*b^*\} = \{a^nb^ma^nb^m\}$. Since $L'$ is the intersection of $L$ with a regular language, we can use the pumping lemma on $L'$ (which is more structured and thus easier to pump), as proving that $L'$ is not context-free suffices to prove that $L$ is not context-free.

  • Adversary picks some $p$.
  • Choose $s = a^pb^pa^pb^p$.
  • Adversary chooses some $u,v,w,x,y\in \Sigma^*$ such that $s = uvwxy$.

Now we break up the string $s$ into two equal blocks (halves). As before, we analyse the different possibilities for $v$ and $x$:

  • Case 1: $v$ and $x$ are both fully contained in the same block. Choose $i \neq 1$. Then the two halves will no longer match and so the resulting word is not in $L'$.
  • Case 2: $v$ and $x$ are both fully contained in different blocks. Because they can only be a maximum of $p$ letters apart, $v$ must be in the $b$ section of the first block and $x$ must be in the $a$ section of the second block. If we choose $i \neq 1$, the two halves will again no longer match.
  • Case 3: either $v$ or $x$ straddles the boundary between the two blocks. If we pump up, then there will be a middle section in the resulting word where the $a$'s and $b$'s are out of order.

$$L = \{a^ib^j \mid j = i^2 \}$$

Left as an exercise for the reader.

2Further notes on CFLs

2.1One-letter alphabets

There is no CFL over a one-letter alphabet that is not regular. The proof is left as an exercise for the reader.

2.2Designing a CFL from its complement

Recall $L = \{ww \mid w \in \Sigma^* \}$, which we showed is not context-free. However, its complement, $\overline{L}$, is in fact context-free. Let's design a grammar for $\overline{L}$:

$$\begin{align} S & \to AB \mid BA \mid A \mid B \\ A & \to CAC \mid a \\ B & \to CBC \mid b \\ C & \to a \mid b \end{align}$$

Let's check that this grammar indeed generates $\overline{L}$. This means that no words generated by $G$ can be of the form $ww$, and that all words not of the form $ww$ must be generated by this grammar.

Well, between the rules $S \to A$ and $S \to B$, we can generate all the possible odd-length strings. $A$ generates all odd-length strings with an $a$ in the middle (through $AS \overset{*}{\to} C \ldots C a \ldots C$) and $B$ generates all odd-length strings with a $b$ in the middle. Then, $S \to AB$ and $S \to BA$ together generate all even-length strings that are not of the form $ww$. To convince yourself that this is true, consider any word $s$ generated by $S \to AB$. Suppose the $A$ part generates a string $xay$ where $x, y \in \Sigma^*$ and $|x| = |y| = n$, and suppose the $B$ part generates a string $ubv$ where $|u| = |v| = m$. Then the string $s$ can be divided into two (not necessarily equal) parts:

$$s = \underbrace{xay}_{A \overset{*}{\to}} \mid \underbrace{ubv}_{B \overset{*}{\to}}$$

Recall that $|y| = n$ and $|u| = m$. So the $a$ and $b$ shown above are exactly $m+n$ characters apart. The word as a whole has a length of $2(m+n+1)$, which means that each half has a length of $m+n+1$. Now, the first $a$ is $n$ letters away from the start of the word. But the $b$ is also $n$ letters away from the midpoint, simply because each half must have $m+n+1$ characters. Here's an illustration:

Not the clearest diagram in the world but deal with it

So the two halves of the resulting string differ - one half has an $a$ at some position where the other half has a $b$. Similar reasoning holds for words generated by the $S \to BA$ rule.

Thus we have an example of a CFL whose complement is not a CFL, illustrating that CFLs are not closed under complementation1. We also learned that CFLs are not closed under intersection; here's an illustration of that:

$$L_1 = \{a^nb^nc^i\} \quad L_2 = \{a^ib^nc^n \} \quad L_1 \cap L_2 = \{a^nb^nc^n \} \mbox{where $L_1$, $L_2$ are CFLs but $L_1 \cap L_2$ is not}$$

  1. CFLs that can be represented by deterministic PDAs are closed under complementation, however.