Scanning

Scanning is the process of partitioning an input program (string of characters as a Regular Language) into a sequence of valid tokens.

In CS241E, a token is annotated by pair of a kind and a lexeme - kind = the kind of of the token (ex: number, ID, see LACS Language) - lexeme = the actual string of characters (ex: 42, x, def, or +)

// Example
def main(foo:Int,bar:Int)={foo+bar} // input
|def|main|(|foo|:|Int|,|bar|:|Int|)|=|{|foo|+|bar|}| // Output generated after scan

We state the scanning problem precisely as follows (refresher of terminology in Formal Language):

Input: is a word $w$ and a language $L$ of valid tokens
Output: is a sequence of non-empty words $w_{1}, w_{2}, \dots, w_{n}$ such that:
1. If we concatenate all the words in the sequence, we get back $w : w_{1} w_{2} \dots, w_{n} = w$
2. Each of the words is in the language of valid tokens: $\forall i : w_{i} \in L$

However, there is an issue: there might be multiple solutions produced. For example, when our input $w = aaaaa$ and our regular language $L = {aa, aaa}$ , there are 2 solutions: $w = aa, aaa$ or $w = aaa, aa$ . Which one do we pick?

We therefore add a 3rd constraint so that the output is unique. We call this into the maximal munch scanning problem (extension of the scanning problem): 3. Each $w_{i}$ is the longest prefix of the remainder of the input $w_{i} w_{i + 1} \dots w_{n}$ that is in $L$

The idea is that we greedily find the longest token possible before continuing to the rest of the input to try to find more tokens.

token is a pair of a kind and a lexeme , where the lexeme is the actual string of characters that were scanned to make up the token, such as 42, x, def, or +.

Implementation

Typically, $L$ is a regular language, so we can specify it through a DFA.

We can use this observation to define the following algorithm to solve the maximal munch scanning problem:

Run the DFA for L on the remaining input until the DFA gets stuck or the end of input is reached.
If the DFA is in a non-accepting state, backtrack the DFA and the input to the last-visited accepting state. If no accepting state was visited, there is no solution, so output an error.
Output a token corresponding to the accepting state that the DFA is in.
Reset the DFA to the start state and repeat, starting again from Step 1.

The Backtracking can be easily implemented by keeping record of the last accepting state that was visited and the position in the input at which it was visited.

Below is the implementation that I wrote:

def scanOne(input: List[Char], state: State, backtrack: (List[Char], State) ): (List[Char], State) = {  
  var new_backtrack = backtrack  
  if (dfa.accepting(state)) {  
    new_backtrack = (input, state)  
  }  
  if (input.length != 0 && dfa.transition.isDefinedAt(state, input(0))) {  
    scanOne(input.drop(1), dfa.transition((state, input(0))), new_backtrack)  
  } else {  
      new_backtrack  
  }  
}

We then call this code on the following

val start = (List.empty, "")
val (new_input, new_state) = scanOne(input, dfa.start, start)
if ((new_input, new_state) == start) {
	sys.error("Cannot make progress")
} else {
	recur(new_input, accumulator :+ Token(new_state, listDiff(input, new_input).mkString))
}

It always greedily picks the longest one. The bad thing is that sometimes, when there is a solution, it cannot produce one → I was confused about this, but below is an example.

Let $L = {abb, abba, abbabba}$ and $w = abbaabbabb$ .

First iteration: $w_{1} = abba$ , $x = abbabb$ .
Second iteration: $w_{2} = abba$ , $x = bb$ .
Third iteration: an error is thrown.

The maximal munch scanner concludes that no solution is found, even though a valid scan exists: $[abba] [abb] [abb]$ .

The maximal munch scanner cannot find this, since it always greedily picks the longest word in the language

🛠️ Steven Gong

Table of Contents

Scanning

Implementation

Graph View

Backlinks