The pumping lemma: statement, proof, and how to use it to prove that a language is not regular.
1The pumping lemma¶
1.1Statement of the lemma¶
For any regular language $L$, there exists a positive integer $p$ such that for every word $w \in L$ (where $w \geq p$), there exists a triple of strings $x, y, z \in \Sigma^*$ such that:
(i) $xy \leq p$
(ii) $y > 0$ (equivalently: $y \neq \epsilon$)
(iii) for every $i \geq 0$, $xyiz \in L$.
In (mostly) symbolic notation, this is:
$\forall L \,\text{reg}, \exists p > 0$ such that $\forall w \in L$ with $w \geq p, \exists x, y, z \in \Sigma^*$ such that: (0) $w = xyz$; (i) $xy \leq p$; (ii) $y \neq \epsilon$; (iii) $\forall i,\, xy^iz \in L$.
1.2Interpretation¶
What this is saying is that for any word of a certain length (or more) in a regular language, we can remove a particular middle section of it, or repeat that middle section any number of times, and still get a valid word in that language.
Now, at first glance this might seem to imply that any regular language must be infinite! Of course, this is not true, as finite regular languages clearly exist. We can reconcile this apparent contradiction as follows: any finite language $L$ has a longest word, which we'll call $w_0$. Let $p = w_0 + 1$. Then the pumping lemma applies to any word in $L$ with length $\geq p$. But, of course, there are no such words. Thus the pumping lemma is only of interest in relation to infinite languages.
1.3Proof¶
We will make use of the fact that regular languages can be recognised by DFAs. Suppose $L$ is a regular language, which is recognised by $M = (S, s_0, \delta, F)$.
Choose $p$ to be $S$ (note: this is merely an example of a value of $p$ that will work, not the only value). Consider any word $w \in L$ such that $w \geq p$. So there are at least $p$ letters in $w$, which means that when $w$ is fed as input to $M$, there are at least $p + 1$ states visited (including the initial state). Bu the total number of states is $p$. So, by the pigeonhole principle^{1}, the machine must visit the same state twice (at least) when processing $w$! Indeed, there could be more than one state visited more than once, but we only need to concentrate on one state that is visited more than once. We'll call this state $t$. If the start state is $s_0$, and the accept state that $w$ ends up in is $s_f$, the path taken through the machine looks something like this:
Hopefully the above diagram alone is sufficient for convincing you that the pumping lemma holds, because that's it for this section. $\blacksquare$
1.4Using the contrapositive¶
The pumping lemma tells us that "a regular language" $\to$ "pumpable". The contrapositive, that "not pumpable" $\to$ "not regular", will be very useful for proving that a language is not regular.
First, let's find the negation of the pumping lemma:
$$\forall p \geq 0 \exists w\in L, w \geq p, \forall x,y,z \in \Sigma^*, w=xyz, xy \leq p, y > 0, \exists i \geq 0, \,\text{such that} \,xy^iz \notin L$$
Thus, if we can find a language for which this property holds, then we know that language is not regular.
1.4.1Game formulation¶
We can use a game formulation of predicate calculus that will both assist our understanding of the formula above (since it's difficult to intuitively grasp any formula with so many quantifiers) and provide a simple way of proving that a language is not regular. Given a formula, pretend that you are battling an adversary who wants to falsify it, while you want to make it valid. You get to choose all the $\exists$ variables, and your adversary gets to choose all the $\forall$ variables, in their order of appearance.
Let's apply this to the formula above. Suppose the adversary chooses some fixed $p$. You must then choose some $w \in L$ (with $w \geq p$). Then, your adversary will choose $x$, $y$, and $z$, subject to the constraints. Your job is to choose some $i$ such that $xy^iz \notin L$. If this is possible for an arbitrary adversary, then this language is not regular.
Thus, $L$ is not pumpable $\Leftrightarrow$ you always have a winning strategy (no matter what the adversary chooses).
1.4.2Examples¶
$L = \{a^nb^nn \geq 0\}$. This is clearly not regular, since you'd have to count to keep track of how many $a$'s have been seen, and counting isn't possible for memoryless machines like DFAs. We can prove this via the game formulation above as follows:
 A $\to p$ (i.e., your adversary picks some $p$)
 I $\to w = a^pb^p$ (i.e., you choose $w$ to be $a^pb^p$)
 Then, since your adversary must choose $x,y,z$ such that $xy \leq p$, then $xy$ must consist purely of $a$'s! And $y$ can't be $\epsilon$. So $y = a^j$ where $j > 0$. (The values of $x$ and $z$ are irrelevant for this example.)
 Then you choose $i = 2$. So then $xy^2z = a^{p+j}b^p$ which is clearly not in $L$, since $p + j \neq p$. $\blacksquare$
$L = \{$ strings with an equal number of $a$'s and $b$'s $\}$. Since this is a superset of the previous language, we can use the exact same argument (remember that you choose the word).
$L = \{a^mb^nm \neq n\}$. This is clearly not regular for the same reason as the first example. However, it would be difficult to use the pumping lemma for this language. What we can do instead is reduce this to something we already know. We know that if $L$ is regular, then its complement is also regular. Furthermore, since $a^*b^*$ is regular, then $\overline{L} \cap a^*b^*$ should also be regular, if $L$ is regular. But $\overline{L} \cap a^*b^*$ is equivalent to the language from the first example. So $L$ can't be regular (proof by contradiction).
$\Sigma = \{a\}$, $L = \left \{a^{2^n} \,\mid\, n \geq 0 \right\}$. Again, clearly not regular because, counting.
 A picks $p$
 Choose $w = a^{2^p}$
 A picks $x, y, z$ such that $xyz = w$. Since $xy \leq p$, then the length of $y$ is constrained as follows: $0 < y \leq p < 2^p$ (where $p < 2^p$ is just something we know to be true for natural numbers). Let $k = y$.
 Choose $i = 2$. Then $w' = xyyz = a^{2^p + k}$.
Now, it's obvious that $2^p + k > 2^p$, but is it equal to some larger power of 2? Is there a $m \in \mathbb{N}$ such that $2^p + k = 2^m$? Well, no, because the next largest power of 2 is $2^{p+1} = 2^p + 2^p$, but since $k < 2^p$, then $2^p + k < 2^p + 2^p$, so yeah. $\blacksquare$
$L = \{ww\,\,w \in \Sigma^*\}$
 A picks $p$
 Choose $w=a^pba^pb$
 The adversary is limited to all $a$'s for $y$. So $y = a^k$ where $k > 0$.
 Choose $i=2$, then we win ($a^{p+k}ba^pb$). $\blacksquare$
(We could also have proven irregularity using MyhillNerode, but this method is much nicer.)
$L = \{a^ib^j \, \mid \, i > j \geq 0\}$
 A picks $p$, as always
 Choose $w = a^pb^{p1}$
 As before, the adversary can only choose $a$'s. So $y = a^k$ where $k > 0$.
 Choose $i = 0$ (pumping down, instead of pumping up, for once). Then $xy^iz = xz = a^{pk}b^{p1} \notin L$. $\blacksquare$

Incidentally, this is the reason that $xy \leq p$. ↩