DSA Lab | String Matching

What Is String Matching?

You have a text (length n) and a pattern (length m). You want to find where the pattern appears in the text — maybe the first occurrence, maybe all of them. This is the "needle in a haystack" problem, except the haystack is a string and the needle is a substring you're searching for.

The naive approach checks every possible starting position: align the pattern at position 0, compare character by character, if a mismatch happens, slide one position right and start over from scratch. That's O(n·m) in the worst case — fine for short strings, but punishing when the text is a 3-billion-character genome and the pattern is a 200-character gene sequence.

All the clever algorithms share one core insight: when a mismatch happens, don't throw away what you already know. The characters you already matched tell you something about where the next match could start. The naive algorithm forgets everything on a mismatch. KMP, Rabin-Karp, and Boyer-Moore remember.

A Bit of History

The KMP algorithm was developed by Donald Knuth, James Morris, and Vaughan Pratt in 1977, though Morris and Pratt had independently discovered the core idea earlier. The algorithm was groundbreaking because it proved you could match strings in guaranteed O(n+m) time — no worst case regression. Before KMP, the best known algorithms all had O(nm) worst cases.

Rabin-Karp came in 1987 from Michael Rabin and Richard Karp. Their insight was different: use hashing to compare substrings in O(1) instead of O(m). It's not the fastest for single-pattern search, but it extends beautifully to multi-pattern matching.

Boyer-Moore (1977, by Robert Boyer and J Strother Moore) took the surprising approach of comparing right-to-left and using mismatches to skip ahead — sometimes by the entire pattern length. It's the fastest in practice for natural language text and is the basis of what grep uses internally.

The Three Approaches

KMP (Knuth-Morris-Pratt): Pre-computes a failure function (also called the prefix table or partial match table) from the pattern. On a mismatch, it tells you exactly how far back to shift without re-checking characters. Guaranteed O(n+m), no worst case. The workhorse of competitive programming.
Rabin-Karp: Uses a rolling hash — a hash function that can be updated incrementally as you slide the window. Compare the pattern's hash against each window's hash. Only do character-by-character comparison when hashes match (to confirm it's not a hash collision — two different strings producing the same hash value). Great for multi-pattern search.
Boyer-Moore: Scans the pattern right-to-left (backwards). Uses the "bad character" rule and "good suffix" rule to skip ahead — sometimes by the entire pattern length on a mismatch. Fastest in practice for natural language text. Sub-linear: it often reads fewer than n characters total.

Real-World Uses

grep, ripgrep, Ctrl+F — text search tools use variants of Boyer-Moore and Aho-Corasick
DNA sequence alignment — finding gene sequences in chromosomes (billions of characters)
Plagiarism detection — Rabin-Karp with multiple pattern hashes, comparing k-gram fingerprints
Intrusion detection systems — matching network packet payloads against known attack signatures in real-time
Compilers — lexer/tokenizer pattern matching during the first phase of compilation
Spam filters — scanning email bodies for known spam patterns
Bioinformatics — BLAST and similar tools for sequence similarity search

How They Work

KMP — The Failure Function

KMP's secret weapon is the failure function. For each position in the pattern, it stores the length of the longest proper prefix (a prefix that isn't the entire string) that's also a suffix (the same characters appear at the end).

That sounds abstract, so here's a concrete example. For pattern "ABCABD":

Position 0 ("A"): longest proper prefix = suffix: 0 (single character — no proper prefix)
Position 1 ("AB"): no prefix equals suffix: 0
Position 2 ("ABC"): no match: 0
Position 3 ("ABCA"): "A" is both a prefix and suffix: 1
Position 4 ("ABCAB"): "AB" is both a prefix and suffix: 2
Position 5 ("ABCABD"): no match with "D" at the end: 0

So the failure table is: [0, 0, 0, 1, 2, 0].

When a mismatch happens at position i in the pattern, the failure function tells you: "you can safely skip ahead because the last failure[i-1] characters of what you already matched form a prefix of the pattern." Instead of restarting from scratch, you resume comparison from that known-good position. The text pointer never moves backward — this is what guarantees O(n) search time.

Building the failure function itself is also clever. It uses the same "match-and-fallback" logic that the search phase uses. You're essentially running KMP on the pattern against itself. This takes O(m) time, so the total algorithm is O(n + m).

💡 Why "failure" function? It's named for what happens on failure — a mismatch. It answers: "given that we just failed at position i, what's the best position to fall back to?" The answer is always the length of the longest prefix-suffix match, because those characters are guaranteed to already be correct in the text. No need to re-verify them.

Rabin-Karp — Rolling Hashes

Instead of comparing strings character by character, Rabin-Karp compares hash values. A hash function converts a string into a number. If two strings have different hashes, they're definitely different (no need to compare characters). If hashes match, they might be the same — you need to verify character by character to rule out a collision.

Compute the hash of the pattern, then slide a window of length m across the text, computing the hash of each window. The "rolling" part is the trick — you don't recompute the hash from scratch each time. To slide the window one position right:

Subtract the contribution of the leftmost character (the one leaving the window)
Multiply by the base (shifts everything left in the polynomial hash)
Add the new rightmost character (the one entering the window)

Each slide takes O(1), so scanning the entire text is O(n). With a good hash function and a large prime modulus (the number you divide by to keep hash values bounded), collisions are rare and expected time is O(n+m).

⚠️ Hash collisions are real A bad hash function or adversarial input can cause O(nm) worst-case behavior — every window's hash matches the pattern's hash, forcing full verification each time. In interviews, always mention that Rabin-Karp has O(nm) worst case but O(n+m) expected case. Using a large prime modulus (like 10^9 + 7 or a random large prime) makes collisions astronomically unlikely.

Boyer-Moore — Skip Ahead

Boyer-Moore is the outlier: it compares the pattern right-to-left (starting from the last character). This seems backwards, but it enables powerful skip rules. When it hits a mismatch, it uses two heuristics to figure out how far to skip:

Bad Character Rule: Look at the character in the text that caused the mismatch. Find the rightmost occurrence of that character in the pattern (before the mismatch position). Shift the pattern to align them. If the character doesn't appear in the pattern at all, skip the entire pattern length — that's the best case.
Good Suffix Rule: If part of the pattern matched before the mismatch, look for the next occurrence of that matched suffix earlier in the pattern. Shift to align it. If the suffix doesn't reoccur, shift based on the longest prefix of the pattern that matches a suffix of the matched portion.

Boyer-Moore is sub-linear in practice — it often examines fewer than n characters total, because mismatches on the last character of the pattern allow skipping by m positions. For English text and patterns longer than ~5 characters, it's typically the fastest algorithm.

When to Use Which

Guaranteed O(n+m), simple implementation: KMP. The go-to for competitive programming and interviews.
Multiple patterns simultaneously: Rabin-Karp (hash all patterns, check the text against the set). For many patterns, use Aho-Corasick instead.
Fastest in practice for one pattern in natural text: Boyer-Moore. What grep and most text editors use.
Pattern changes frequently, text is static: Build a suffix array or suffix tree on the text (O(n) preprocessing), then search in O(m log n) per query.

Interactive Visualization

Type a text and pattern to watch KMP's failure function in action. Green highlights show matches, red shows mismatches, and the failure table updates as the pattern is built. Click "Run KMP" to see the search animated step by step.

KMP String Matching

Text

Pattern

Common Mistakes

⚠️ Off-by-one errors in the failure function The failure function loop starts at index 1, not 0. failure[0] is always 0 by definition (a single character has no proper prefix that's also a suffix). Starting at 0 and comparing pattern[0] with pattern[0] gives a wrong table. This is the #1 KMP implementation bug.

⚠️ Forgetting to handle overlapping matches After finding a match at position i, the naive approach restarts the pattern from 0. But KMP can find overlapping matches: pattern "ABA" in text "ABABA" matches at both position 0 and position 2. After a full match, set j = failure[m-1] to allow overlaps. If the problem wants non-overlapping matches, reset j to 0 instead.

⚠️ Integer overflow in Rabin-Karp hash The rolling hash computation involves multiplication and addition that can overflow. Always apply the modulus at every step: (hash * BASE + char) % MOD. A common mistake is computing the full value first and modding at the end — the intermediate value overflows long before you get there. Python handles big integers natively, but Java/C++ do not.

⚠️ Negative hash values in Rabin-Karp When removing the leftmost character, the hash can go negative: hash - char * h might be negative. In Python this is fine (Python's modulo handles negatives correctly), but in Java/C++ you need to add MOD before taking modulo: hash = ((hash - char * h) % MOD + MOD) % MOD. Missing this causes false negatives — valid matches that your algorithm skips.

⚠️ Using Rabin-Karp without verification Hash matches are not guaranteed true matches — they could be collisions. Always verify with a character-by-character comparison when hashes match. Skipping verification means your algorithm might return false positives, which is wrong. The verification is O(m) per match but happens rarely with a good hash.

Operations & Complexity

Algorithm	Preprocessing	Search Time	Space	Best For
Brute Force	O(1)	O(nm)	O(1)	Short strings, simplicity
KMP	O(m)	O(n)	O(m)	Guaranteed linear, streams
Rabin-Karp	O(m)	O(n) expected	O(1)	Multi-pattern, plagiarism
Boyer-Moore	O(m + σ)	O(n/m) best	O(m + σ)	Long patterns, natural text
Aho-Corasick	O(Σm_i)	O(n + matches)	O(Σm_i × σ)	Many patterns simultaneously

Here σ (sigma) is the alphabet size — the number of distinct characters possible. For lowercase English that's 26, for ASCII it's 256, for Unicode it's much larger. Boyer-Moore's O(n/m) best case is remarkable: with a pattern of length 10 in a text of length 1000, it might only examine ~100 characters. Each mismatch on the last pattern character lets it skip forward by 10 positions.

Aho-Corasick deserves mention even though it's not covered in depth here. It's a trie-based algorithm that searches for multiple patterns simultaneously in O(n + total_matches) time. Think of it as "KMP for many patterns at once." It's what powers tools like fgrep and many intrusion detection systems.

Implementation

KMP Algorithm

Python

def kmp_search(text, pattern):
    """Find all occurrences of pattern in text using KMP.
    
    Returns list of starting indices where pattern appears.
    Time: O(n + m) guaranteed. Space: O(m) for the failure table.
    
    The text pointer never moves backward — each character is examined
    at most twice (once for matching, once for falling back). This is
    what makes KMP O(n) regardless of input.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return []

    # === Step 1: Build failure function ===
    # failure[i] = length of longest proper prefix of pattern[0..i]
    #              that is also a suffix of pattern[0..i]
    failure = [0] * m
    j = 0  # length of previous longest prefix-suffix
    
    for i in range(1, m):  # Start at 1! failure[0] is always 0.
        # Fall back until we find a match or hit the start
        while j > 0 and pattern[i] != pattern[j]:
            j = failure[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        failure[i] = j

    # === Step 2: Search text using the failure function ===
    matches = []
    j = 0  # position in pattern (how many chars matched so far)
    
    for i in range(n):
        # On mismatch, don't restart — jump to the best fallback position
        while j > 0 and text[i] != pattern[j]:
            j = failure[j - 1]
        
        if text[i] == pattern[j]:
            j += 1
        
        if j == m:  # Full pattern matched!
            matches.append(i - m + 1)  # Starting index of this match
            j = failure[j - 1]  # Allow overlapping matches by falling back
                                # (set j = 0 here for non-overlapping matches)

    return matches

JavaScript

function kmpSearch(text, pattern) {
  const n = text.length, m = pattern.length;
  if (m === 0) return [];

  // Build failure function
  const fail = new Array(m).fill(0);
  let j = 0;
  for (let i = 1; i < m; i++) {
    while (j > 0 && pattern[i] !== pattern[j]) j = fail[j - 1];
    if (pattern[i] === pattern[j]) j++;
    fail[i] = j;
  }

  // Search
  const matches = [];
  j = 0;
  for (let i = 0; i < n; i++) {
    while (j > 0 && text[i] !== pattern[j]) j = fail[j - 1];
    if (text[i] === pattern[j]) j++;
    if (j === m) {
      matches.push(i - m + 1);
      j = fail[j - 1]; // Allow overlapping matches
    }
  }
  return matches;
}

Rabin-Karp Algorithm

Python

def rabin_karp(text, pattern):
    """Rolling hash string matching.
    
    Expected O(n+m), worst O(nm) due to hash collisions.
    
    Hash function: treat the string as a number in base BASE,
    modulo MOD. For "ABC": ord('A')*256^2 + ord('B')*256 + ord('C')
    all mod MOD.
    
    The "rolling" trick: to slide the window right by one position,
    remove the leftmost character's contribution and add the new
    rightmost character. This is O(1) per slide.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    BASE = 256    # Number of possible characters (ASCII)
    MOD = 1000000007  # Large prime — reduces collision probability

    # Hash the pattern and first window of text
    p_hash = 0
    t_hash = 0
    # h = BASE^(m-1) mod MOD — the multiplier for the leftmost character
    h = pow(BASE, m - 1, MOD)

    for i in range(m):
        p_hash = (p_hash * BASE + ord(pattern[i])) % MOD
        t_hash = (t_hash * BASE + ord(text[i])) % MOD

    matches = []
    for i in range(n - m + 1):
        if p_hash == t_hash:
            # Hashes match — MUST verify character by character
            # (hash collisions exist; this prevents false positives)
            if text[i:i + m] == pattern:
                matches.append(i)

        # Roll the hash forward: remove leftmost char, add next char
        if i < n - m:
            t_hash = ((t_hash - ord(text[i]) * h) * BASE + ord(text[i + m])) % MOD
            # In Python, % handles negatives correctly.
            # In Java/C++, add MOD: t_hash = (t_hash % MOD + MOD) % MOD

    return matches

JavaScript

function rabinKarp(text, pattern) {
  const n = text.length, m = pattern.length;
  if (m > n) return [];

  const BASE = 256;
  const MOD = 1000000007; // Large prime to minimize collisions
  const matches = [];

  // Compute hash of pattern and first window
  let pHash = 0, tHash = 0;
  // h = BASE^(m-1) mod MOD — multiplier for leftmost character
  let h = 1;
  for (let i = 0; i < m - 1; i++) {
    h = (h * BASE) % MOD;
  }

  for (let i = 0; i < m; i++) {
    pHash = (pHash * BASE + pattern.charCodeAt(i)) % MOD;
    tHash = (tHash * BASE + text.charCodeAt(i)) % MOD;
  }

  for (let i = 0; i <= n - m; i++) {
    if (pHash === tHash) {
      // Verify character by character (hash collision possible)
      if (text.substring(i, i + m) === pattern) {
        matches.push(i);
      }
    }
    // Roll the hash: remove leftmost char, add next char
    if (i < n - m) {
      tHash = ((tHash - text.charCodeAt(i) * h) * BASE
               + text.charCodeAt(i + m)) % MOD;
      // Handle negative values (JavaScript % can be negative)
      if (tHash < 0) tHash += MOD;
    }
  }
  return matches;
}

// rabinKarp("ABABCABABCABAB", "ABABC") → [0, 5]

Rabin-Karp for Multiple Patterns

Python

def rabin_karp_multi(text, patterns):
    """Search for multiple patterns simultaneously using Rabin-Karp.
    
    This is where Rabin-Karp shines over KMP — searching for k patterns
    of the same length takes O(n + k*m) expected time instead of O(n*k).
    
    Hash all patterns into a set, then roll across the text checking
    if the window hash is in the set. Verification on match.
    
    Used in plagiarism detection: hash every m-gram of both documents,
    compare the sets. Shared hashes = potentially copied passages.
    """
    if not patterns:
        return {}
    
    BASE = 256
    MOD = 1000000007
    n = len(text)
    results = {p: [] for p in patterns}
    
    # Group patterns by length (Rabin-Karp works on fixed window size)
    by_length = {}
    for p in patterns:
        by_length.setdefault(len(p), []).append(p)
    
    for m, group in by_length.items():
        if m > n:
            continue
        
        # Hash all patterns of this length
        pattern_hashes = {}
        for p in group:
            h = 0
            for c in p:
                h = (h * BASE + ord(c)) % MOD
            pattern_hashes.setdefault(h, []).append(p)
        
        # Roll across text
        h_power = pow(BASE, m - 1, MOD)
        t_hash = 0
        for i in range(m):
            t_hash = (t_hash * BASE + ord(text[i])) % MOD
        
        for i in range(n - m + 1):
            if t_hash in pattern_hashes:
                window = text[i:i + m]
                for p in pattern_hashes[t_hash]:
                    if window == p:
                        results[p].append(i)
            
            if i < n - m:
                t_hash = ((t_hash - ord(text[i]) * h_power) * BASE + ord(text[i + m])) % MOD
    
    return results

Boyer-Moore (Simplified — Bad Character Rule Only)

Python

def boyer_moore_simple(text, pattern):
    """Boyer-Moore with bad character rule only.
    
    The bad character rule: on a mismatch, look at the text character
    that failed. Find its rightmost occurrence in the pattern. Shift
    the pattern to align them. If it doesn't appear, skip the whole
    pattern length.
    
    The full Boyer-Moore also uses the "good suffix" rule for even
    bigger skips, but bad character alone gives great practical speed.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    
    # Preprocess: last occurrence of each character in the pattern
    # bad_char[c] = rightmost index of c in pattern, or -1 if absent
    bad_char = {}
    for i, c in enumerate(pattern):
        bad_char[c] = i
    
    matches = []
    i = 0  # Text index — where the pattern's left end is aligned
    
    while i <= n - m:
        # Compare right-to-left
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        
        if j < 0:
            # Full match found
            matches.append(i)
            i += 1  # Simple shift (full BM uses good suffix here)
        else:
            # Mismatch at pattern[j] vs text[i+j]
            # Shift pattern so the bad character aligns, or skip past it
            bad_pos = bad_char.get(text[i + j], -1)
            shift = max(1, j - bad_pos)
            i += shift
    
    return matches

JavaScript

function boyerMooreSimple(text, pattern) {
  const n = text.length, m = pattern.length;
  if (m === 0) return [];

  // Preprocess: last occurrence of each character in pattern
  const badChar = new Map();
  for (let i = 0; i < m; i++) {
    badChar.set(pattern[i], i);
  }

  const matches = [];
  let i = 0; // left edge of pattern alignment

  while (i <= n - m) {
    let j = m - 1; // compare right-to-left

    while (j >= 0 && pattern[j] === text[i + j]) {
      j--;
    }

    if (j < 0) {
      // Full match
      matches.push(i);
      i++; // simple shift; full BM would use good suffix
    } else {
      // Mismatch: shift based on bad character rule
      const badPos = badChar.has(text[i + j]) ? badChar.get(text[i + j]) : -1;
      i += Math.max(1, j - badPos);
    }
  }
  return matches;
}

// boyerMooreSimple("ABAAABCD", "ABC") → [4]

Z-Algorithm (KMP Alternative)

The Z-algorithm computes the Z-array: for each position i, z[i] is the length of the longest substring starting at i that matches a prefix of the string. For pattern matching, concatenate pattern + "$" + text and look for z-values ≥ pattern length. Some competitive programmers prefer it over KMP because the logic feels more direct.

Python

def z_function(s):
    """Compute the Z-array for string s.
    
    z[i] = length of the longest substring starting at position i
    that matches a prefix of s.
    
    Uses a "Z-box" [l, r) to track the rightmost known match window,
    avoiding redundant comparisons. Runs in O(n) time.
    """
    n = len(s)
    z = [0] * n
    z[0] = n  # By convention, z[0] = length of the string
    l, r = 0, 0  # Current Z-box: s[l..r-1] matches s[0..r-l-1]
    
    for i in range(1, n):
        if i < r:
            # We're inside the Z-box — reuse previously computed info
            z[i] = min(r - i, z[i - l])
        
        # Try to extend the match beyond what we already know
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        
        # Update Z-box if we extended past the current right boundary
        if i + z[i] > r:
            l, r = i, i + z[i]
    
    return z

def z_search(text, pattern):
    """Pattern matching using Z-algorithm.
    
    Concatenate pattern + "$" + text ($ is a sentinel that doesn't
    appear in either string). Compute Z-array. Any position where
    z[i] >= len(pattern) indicates a match.
    
    Time: O(n + m). Space: O(n + m) for the concatenated string.
    """
    combined = pattern + "$" + text
    z = z_function(combined)
    m = len(pattern)
    
    matches = []
    for i in range(m + 1, len(combined)):
        if z[i] >= m:
            matches.append(i - m - 1)  # Convert back to text index
    
    return matches

# z_search("ABABCABABCABAB", "ABABC") → [0, 5]

JavaScript

function zFunction(s) {
  const n = s.length;
  const z = new Array(n).fill(0);
  z[0] = n;
  let l = 0, r = 0;

  for (let i = 1; i < n; i++) {
    if (i < r) z[i] = Math.min(r - i, z[i - l]);
    while (i + z[i] < n && s[z[i]] === s[i + z[i]]) z[i]++;
    if (i + z[i] > r) { l = i; r = i + z[i]; }
  }
  return z;
}

function zSearch(text, pattern) {
  const combined = pattern + '$' + text;
  const z = zFunction(combined);
  const m = pattern.length;
  const matches = [];
  for (let i = m + 1; i < combined.length; i++) {
    if (z[i] >= m) matches.push(i - m - 1);
  }
  return matches;
}

Real-World Applications

1. Log File Pattern Matching

System administrators and security analysts constantly search through log files for specific patterns — error codes, IP addresses, attack signatures. When you're scanning gigabytes of logs, the algorithm matters.

Python

import os
from datetime import datetime

def scan_logs_for_attacks(log_dir, attack_signatures):
    """Scan server logs for known attack patterns using Rabin-Karp.
    
    Real IDS (Intrusion Detection Systems) like Snort use similar
    multi-pattern matching. The key insight: you need to check MANY
    patterns against EVERY packet/line. Rabin-Karp's multi-pattern
    variant (or Aho-Corasick for production) does this efficiently.
    
    log_dir: path to directory of log files
    attack_signatures: dict of {name: pattern_string}
    
    Returns: list of (file, line_number, attack_name, line) matches
    """
    # Use Rabin-Karp for multiple patterns (same idea as above, simplified)
    BASE = 256
    MOD = 1000000007
    
    alerts = []
    
    for filename in sorted(os.listdir(log_dir)):
        filepath = os.path.join(log_dir, filename)
        if not os.path.isfile(filepath):
            continue
        
        with open(filepath, 'r', errors='ignore') as f:
            for line_num, line in enumerate(f, 1):
                # Check each signature against this line
                for sig_name, pattern in attack_signatures.items():
                    if kmp_search(line, pattern):
                        alerts.append({
                            "file": filename,
                            "line": line_num,
                            "attack": sig_name,
                            "content": line.strip()[:200],
                            "timestamp": datetime.now().isoformat()
                        })
    
    return alerts

# Common attack patterns to look for in web server logs
ATTACK_SIGNATURES = {
    "SQL Injection":       "' OR 1=1",
    "XSS Attempt":         "

String Matching

On This Page

What Is String Matching?

A Bit of History

The Three Approaches

Real-World Uses

How They Work

KMP — The Failure Function

Rabin-Karp — Rolling Hashes

Boyer-Moore — Skip Ahead

When to Use Which

Interactive Visualization

Common Mistakes

Operations & Complexity

Implementation

KMP Algorithm

Rabin-Karp Algorithm

Rabin-Karp for Multiple Patterns

Boyer-Moore (Simplified — Bad Character Rule Only)

Z-Algorithm (KMP Alternative)

Real-World Applications

1. Log File Pattern Matching