Thursday, September 5, 2013 CC-BY-NC
DFAs for determining validity of words

Maintainer: admin

How can we use a DFA to determine if a particular string is a valid word in a language?

1Definitions

Alphabet
A finite1 set of symbols/letters denoted $\Sigma$. Example: $\Sigma = \{a, b, c\}$ or $\Sigma = \{0, 1\}$ (bits).
Word
A finite sequence of letters. Example: $a$, $b$, $ab$, $aaabac$, etc. There is also the empty word: $\epsilon$.

We define the concatenation operation on words in the usual way: $w_1 \cdot w_2$ (with the dot symbol often omitted) simply means we write the letters in $w_1$ followed by the letters in the word $w_2$ to get a new word. This is an associative operation (so we can drop the parentheses), but it's not commutative, and there is an identity element ($\epsilon$). Thus, this is a monoid, but not a group, as groups have inverses and obviously concatenation does not.

$\Sigma^*$
The collection of all possible words that can be formed from an alphabet $\Sigma$. If $\Sigma \neq \varnothing$, then clearly $\Sigma^*$ is infinite. Note that even if $\Sigma = \varnothing$, $\Sigma^*$ is still not an empty set - it would be $\{\epsilon\}$.
Language
A language $L$ is a subset of $\Sigma^*$. You can think of a language as a set of words that are valid.

Much of this class is going to focus on how we describe languages, and how we can identify whether or not a particular word $w$ belongs to a language $L$. We'll find that for most languages, there isn't actually any way to do this! So we'll restrict our study to languages that we can describe, called well-behaved languages.

2Theory of deterministic finite automata (DFA)

This theory was developed by Stephen Kleene (pronounced "clay-knee"), not that it matters.

A DFA is defined by the 4-tuple $(S, s_0, \delta, F)$, for a fixed alphabet $\Sigma$:

$S$
the finite set of possible states
$s_0$
The initial state ($s_0 \in S$)
$\delta$
A function, $S \times \Sigma \to S$, which, given a state and the alphabet, gives the next state. As this is a function, it is deterministic by definition, hence the "deterministic" part of DFA.
$F$
Final/accept states ($F \subseteq S$)

Here's how the machine works:

  1. Start in state $s_0$.
  2. Read one letter and move to the next state according to $\delta$.
  3. Move one letter forward and repeat step 2 until we run out of letters. Note that there is no backtracking.

When the word is complete (and the machine has stopped), if the current state is an element of $F$, then we "accept" the word as a valid word in the language. Note that all the DFA keeps in its memory is the state, and the position in the word; it has no memory of how it got to any state.

Incidentally, any finite language can be recognised by a DFA. (Proof omitted.) In general, a language that can be recognised by a DFA is called a regular language. We'll be studying these in more detail.

2.1DFA diagrams

Example DFA

The circles are states; the initial state is indicated via an arrow; double-stroke circles are accept states. The behaviour of the machine is as easy to understand as following the arrows. For instance, if I'm in state $s_0$, and I read an a, I should move to $s_1$.

Note that while technically every state should have $|\Sigma|$ arrows coming out, in practice we can get rid of the arrows that lead to "death states", i.e., states from which we can never reach an accept state (like $s_3$ in the above example). In fact, we can omit drawing death states and any associated arrows completely - if we reach a state for which the next letter in our word does not have an arrow, then we have an automatic reject.

This particular DFA will accept any word that starts with an "ab". Incidentally, not all machines have such a nice natural language description; sometimes the best description is the machine itself.

We could also make a machine that accepts any word:

The "Mickey Mouse" machine

Notably, this illustrates that the complexity of the DFA diagram is independent of the size of the language (i.e., the number of words that are accepted); the Mickey Mouse machine above accepts all of $\Sigma^*$, which, as you should recall, is infinitely large.

2.1.1Combination lock example

Let $\Sigma = \{1,2,3,4,5\}$. We can represent a combination lock that only opens for the combination 4-1-2-3-5 as a DFA with the following diagram (death states omitted):

Combination lock DFA

2.1.2Divisibility by 3 example

$\Sigma = \{0, 1\}$, interpreted as a binary number (where $\epsilon = 0$). We define a language $L$ as only those words which, when interpreted as binary, are divisible by 3. If we were to enumerate these words, we'd have $\epsilon$, 0, 11, 110, 1001, ... (infinite set, of course).

Recall that the machine is reading the number from left to right. Let's say we have the following word $w$, represented in terms of its letters as follows:

$$w = \underbrace{\_\_\_\_\_}_{x} \_ \tag{where the last letter is either 0 or 1}$$

If the next letter is 0, then the value of the word $w$ becomes $2x$. If the next letter is 1, on the other hand, the value is $2x+1$.

A key observation for this problem is the fact that we only need to remember the state mod 3. When we divide something by 3, the possibilities are 0, 1, and 2. So we'd only need 3 states! Using this information, plus the rules we determined in the previous paragraph, we can easily draw the DFA diagram:

Divisibility by 3 DFA

We will now prove that the machine we have just drawn is the most optimal machine (as in, fewest states) for this language. This is a fairly straightforward proof by contradiction: consider any machine with exactly 2 states, A and B, which accepts exactly the same words as the machine we defined above. Consider the strings 100, 101, and 110. Only 110 is divisible by 3, so that should end up in the accept state (which we'll take to be A, WLOG). 100 and 101, on the other hand, are not divisible by 3, so they should end up in the reject state, B.

Of course, once we are in state B, the machine has no memory of how we got there. So then if the next letter is 1, we get either 1001 or 1011; the former is divisible by 3 and thus needs to go to A, while the latter is not divisible by 3 and thus needs to stay in B. Clearly we get a contradiction, etc. $\blacksquare$

The general strategy for proving $n$ as a lower bound on the number of states is to find $n$ pairs of strings. For each pair $(u, v)$, find a string $x$ such that $ux$ is accepted but $vx$ is rejected.

2.2$\delta^*$

Given $M = (S, s_0, \delta, F)$, we define $\delta^*$ to be a function, $S \times \Sigma^* \to S$ (so a map from states and words to new states).

We can define this by induction on the length of the word. First, the base case: $\delta^*(s, \epsilon) = s$ (so if we're given an empty word, stay on state $s$). Then, $\delta^*(s, a\omega) = \delta^*(\delta(s, a), \omega)$ where $a$ is a letter and $\omega$ is a word. This concept is pretty straightforward - it's just extending the concept of $\delta$ to entire words; this inductive definition merely formalises it.

2.3Optional topic: monoids

This topic is optional; excuse the free-form quality of the notes in this section

$(\Sigma^*, \cdot, \epsilon)$ forms a free monoid.

Another example of a monoid: the positive integers (including zero), with addition. This is not a group since there are no negative integers (for inverses).

Also: $(f: S\to S, \circ, \text{the identity function})$ where $\circ$ is function composition, again not a group.

Just like we have group homomorphisms, we also have monoid homomorphisms!

Any DFA has a natural monoid structure.

Also, there is a map $\Psi: \Sigma^* \to (S \to S)$ which takes a word and returns a function from state to state. This is defined by $\Psi(w) = s \to \delta^*(s, w)$.

Exercise: show that $\Psi$ is a monoid homomorphism.

  1. Not always finite, but in this course, we'll only consider the finite case.