Pattern Matching for Strings

Formally learned in CS240.

Pattern Matching Definition

Search for a string (pattern) in a large body of text

$T [0.. n - 1]$ – The text (or haystack) being searched within

$P [0.. m - 1]$ – The pattern (or needle) being searched for

Strings over alphabet $Σ$

Return smallest $i$ such that $P [j] = T [i + j] for 0 \leq j \leq m - 1$

This is the first occurrence of $P$ in $T$

If $P$ does not occur in $T$ , return FAIL

Applications:

Information Retrieval (text editors, search engines)
Bioinformatics
Data Mining

How to solve

The easiest way to solve this is brute forcing, by checking every possible guess. But worst case performance is $Θ ((n - m) \cdot m)$ .

How to improve?

Do extra preprocessing on the pattern $P$
- Karp-Rabin Algorithm
- Boyer-Moore Algorithm
- Deterministic finite automata (DFA), KMP Algorithm
- We eliminate guesses based on completed matches and mismatches.
Do extra preprocessing on the text $T$
- Suffix Tree
- Suffix Array
- We create a data structure to find matches easily.

🛠️ Steven Gong

Table of Contents

Pattern Matching for Strings

How to solve

Graph View

Backlinks