Thursday, September 12, 2013 CC-BY-NC
Constructions on languages and regular expressions

Maintainer: admin

Constructions on regular languages, and introduction to regular expressions.

1Constructions on regular languages

1.1Closure properties of regular languages

Proposition: If $L, L_1, L_2$ are regular languages, then so are:

(i) $L_1 \cup L_2$
(ii) $\overline{L}$ (the set complement)
(iii) $L_1 \cdot L_2$ (concatenation - so if $L_1 = \{aa, bb\}$ and $L_2 = \{c\}$, $L_1 \cdot L_2 = \{aac, bbc\}$)
(iv) $L_1 \cap L_2$
(v) and $L^*$, which is defined as $\displaystyle L^* = \bigcup_{i=0}^{\infty} \underbrace{L\cdot L \cdot L \cdots L}_{i \text{ times}}$. The finite unions (so $\displaystyle \bigcup_{i=0}^{n}$ for some finite $n$) are also regular languages: $L^0 = \{\epsilon\}$, $L^1 = L$, $L^2 = L \cdot L$, etc.

1.1.1Proofs

We'll prove these by constructing an NFA, either by giving a description of the new machine or explicitly defining the variables. (During the lecture, diagrams were drawn for some of them, but it's too much work to do here. You can see them in Prakash's scanned lecture notes on this subject.)

(i) Suppose $M_1$ and $M_2$ are machines (DFAs) for $L_1$ and $L_2$, respectively. Create a new start state with 2 epsilon transitions, one to the start state of $M_1$ and the other to the start state of $M_2$.

(ii) Given a machine for $L$, simply reverse the acceptedness of accepted/reject states.

(iii) Given $M_1$ and $M_2$ (both DFAs), we make $M_1$'s start state the new start state for $M$, change $M_1$'s accept states into reject states, and add $\epsilon$ transitions from them to $M_2$'s start state.

(iv) This is equivalent to $\overline{\overline{L_1} \cup \overline{L_2}}$, which we can do by (i) and (ii). But here's an explicit construction anyway, as a DFA (since it's cool): given $M_1 = (S_1, s_1, \delta_1, F_1)$ and $M_2 = (S_2, s_2, \delta_2, F_2)$, we build a new machine $M = (S, s, \delta, F)$ that runs both $M_1$ and $M_2$ in parallel:

  • $S = S_1 \times S_2$
  • $s = (s_1, s_2)$
  • $F = F_1 \times F_2$
  • $\delta((p, q), a) = (\delta_1(p, a), \delta_2(q, a))$ (where $a \in \Sigma$, as usual) where $p \in S_1$ and $q \in S_2$

(We don't need to prove that this machine works, since it's simple.)

(v) Add a new start state (which should be an accept state), and add an $\epsilon$-transition from this to the old start state. Then, add $\epsilon$-transitions from the final state to the old start state (which should not be an accept state). Note: the description given in class was slightly incorrect. This is the correct version description, confirmed by Prakash.

2Regular expressions

Regular expressions arose out of the need for a machine-independent way of defining regular languages and operations on them. There are various distinct notations being used today (e.g., UNIX-style regex vs algebraic regex), but we'll stick with the original notation.

2.1Inductive definition

For a fixed $L$, we have:

(i) $\epsilon$ is a regular expression
(ii) $\forall a \in \Sigma$, $a$ is a regular expression
(iii) $\varnothing$ is a regular expression (the machine for this has just one state, which is a reject state)
(iv) If $r$ and $s$ are regexes, so is $r + s$ (where $+$ means "or" (or, $\cup$)
(v) Same for $r \cdot s$ (concatenating, as usual)
(vi) also $r^*$ (same interpretation as $L^*$; also exactly how it works in UNIX-style regex)

2.2Examples

Let $\Sigma = \{a, b\}$.

  • How do we do $\Sigma^*$ as a regex? One method is $(a + b)^*$. Another (equivalent) method is $(a^* \cdot b^*)^*$.
  • Everything that ends in 2 $a$'s: $(a+b)^*\cdot (aa)$

Now let $\Sigma = \{0, 1\}$. If we wanted to write a regular expression that recognises "all strings that, when intrepreted as an binary integer, are divisible by 3", the end result would be extremely complex. And yet, the machine is simple! This just goes to show that there is often no correlation between the complexity of the machine and that of the corresponding regex.

2.3Connection to regular languages

Theorem: regular expressions define regular languages. Note that this is not a trivial fact that simply arises as a consequence of them both being called "regular"; rather, it's due to the fact that we can convert explicitly from a regular expression to a regular language, and vice versa.

2.3.1Proof

Given any regex, we can construct an NFA (with $\epsilon$-transitions) to recognise the language. We can prove this by induction, on the structure of regular expressions.

Base cases:

  • $\varnothing$: the machine is easy (described previously)
  • $\epsilon$: also easy (2 states, start is accept, other is death state that takes in everything)
  • $a \in \Sigma$: 3 states (second is accept, others are reject, the transitions are pretty obvious)

Now the inductive step. We can prove this via constructions on machines. Conveniently, these constructions were provided earlier in the lecture! The full proof is omitted. You may be able to find more details in the posted lecture notes on closures.

Next, let's prove the other direction. Given a DFA $M = (S, s_0, \delta, F)$ with $n$ states (which we explicitly enumerate from 1 to $n$, with 1 being the start state), we can construct a family of regexes

$$R^{(k)}_{ij} \tag{where $i, j$ are from $1, \ldots, n$}$$

where each $R$ describes all the possible paths to lead from state $i$ to state $j$, using only states whose number is $\leq k$ as intermediate states. (This is a bit confusing; just note that $k$ does not refer to quantity of intermediate states, but rather to the number of each state.) $i$ and $j$ do not count as intermediate states.

Now, if $k=0$, this is easy to evaluate: either there's a direct transition from $i$ to $j$, or there's no direct transition. Formally: Assuming $i \neq j$, then if $\delta(i, a) = j$ for any $a \in \Sigma$, we will include $a$ in $R_{ij}^{(0)}$. So $R_{ij}^{(0)} = a_1 + a_2 + \cdots + a_k$ for each $a_p \in \Sigma$, where $\delta(i, a_p) = j$. If $i = j$, then we do the same thing (look for loops, but we have to add $\epsilon$ as well).

Then, we induct on $k$. Assume that we know $R^{(k-1)}_{ij}$ - that is, the path from $i$ to $j$ using only $1, \ldots, k-1$ as intermediate states - for all states $i, j \in S$. Obviously, all of these paths are still valid for $R_{ij}^{(k)}$ (since this just means that we can now use $k$ as well, if we want). But new paths are allowed: we can now stop at $k$. Well, we know $R_{ik}^{(k-1)}$ and $R_{kj}^{(k-1)}$. But then we know $R_{ij}$:

$$R_{ij}^{(k)} = R_{ij}^{(k-1)} + R_{ik}^{(k-1)} \cdot \underbrace{\left ( R_{kk}^{(k-1)} \right )^*}_{\text{accounting for loops}} \cdot R_{ij}^{(k-1)}$$

This concludes the proof. $\blacksquare$

More details (including a diagram explaining how we arrived at the above formula for $R_{ij}^{(k)}$) can be found in the scanned lecture notes.

2.4Algebraic properties

Now onto the algebraic properties of regular expressions (where algebra, in its simplest form, refers to "equations"):

  1. $R + \varnothing = \varnothing + R = R$ (identity for $+$)
  2. $R + S = S + R$ (commutativity of $+$)
  3. $R + (S + T) = (R + S) + T$ (associativity of $+$)
  4. $R + R = R$ (not sure what to call this)
  5. $R \cdot \varnothing = \varnothing \cdot R = \varnothing$ (the zero element for $\cdot$)
  6. $R \cdot \epsilon = \epsilon \cdot R = R$ (the identity/unit element for $\cdot$)
  7. $R \cdot (S \cdot T) = (R \cdot S) \cdot T$ (associativity of $\cdot$)
  8. $R \cdot (S + T) = R \cdot S + R \cdot T$ (distributivity)
  9. $(R + S) \cdot T = R \cdot T + S \cdot T$ (also distributivity)
  10. $\epsilon + R\cdot R^* = \epsilon + R^*\cdot R = R^*$

Also: $(R + S)^* = (R^*S)^*R^*$. For proof of this, and of the 10 rules above, see the scanned lecture notes.

In addition, note that we can drop the $\cdot$ when its presence is clear (e.g., $RR^*$ instead of $R\cdot R^*$)

2.4.1Solving equations

How would we solve $X = SX + T$ where $S, T$ are regular expressions and $X$ is a variable?

Actually, this doesn't have a general solution! In fact, there only exists a solution when $\epsilon \notin S$, in which case we have $X = S^*T$ (and this is the unique solution).