Finding a pattern inside a text โ it sounds trivial until you need to do it fast. Brute force is O(nm), but KMP, Rabin-Karp, and Boyer-Moore all achieve O(n+m) or better. Here's how they pull it off.
AdvancedYou have a text (length n) and a pattern (length m). You want to find where the pattern appears in the text โ maybe the first occurrence, maybe all of them. This is the "needle in a haystack" problem, except the haystack is a string and the needle is a substring you're searching for.
The naive approach checks every possible starting position: align the pattern at position 0, compare character by character, if a mismatch happens, slide one position right and start over from scratch. That's O(nยทm) in the worst case โ fine for short strings, but punishing when the text is a 3-billion-character genome and the pattern is a 200-character gene sequence.
All the clever algorithms share one core insight: when a mismatch happens, don't throw away what you already know. The characters you already matched tell you something about where the next match could start. The naive algorithm forgets everything on a mismatch. KMP, Rabin-Karp, and Boyer-Moore remember.
The KMP algorithm was developed by Donald Knuth, James Morris, and Vaughan Pratt in 1977, though Morris and Pratt had independently discovered the core idea earlier. The algorithm was groundbreaking because it proved you could match strings in guaranteed O(n+m) time โ no worst case regression. Before KMP, the best known algorithms all had O(nm) worst cases.
Rabin-Karp came in 1987 from Michael Rabin and Richard Karp. Their insight was different: use hashing to compare substrings in O(1) instead of O(m). It's not the fastest for single-pattern search, but it extends beautifully to multi-pattern matching.
Boyer-Moore (1977, by Robert Boyer and J Strother Moore) took the surprising approach of comparing right-to-left and using mismatches to skip ahead โ sometimes by the entire pattern length. It's the fastest in practice for natural language text and is the basis of what grep uses internally.
KMP's secret weapon is the failure function. For each position in the pattern, it stores the length of the longest proper prefix (a prefix that isn't the entire string) that's also a suffix (the same characters appear at the end).
That sounds abstract, so here's a concrete example. For pattern "ABCABD":
So the failure table is: [0, 0, 0, 1, 2, 0].
When a mismatch happens at position i in the pattern, the failure function tells you: "you can safely skip ahead because the last failure[i-1] characters of what you already matched form a prefix of the pattern." Instead of restarting from scratch, you resume comparison from that known-good position. The text pointer never moves backward โ this is what guarantees O(n) search time.
Building the failure function itself is also clever. It uses the same "match-and-fallback" logic that the search phase uses. You're essentially running KMP on the pattern against itself. This takes O(m) time, so the total algorithm is O(n + m).
Instead of comparing strings character by character, Rabin-Karp compares hash values. A hash function converts a string into a number. If two strings have different hashes, they're definitely different (no need to compare characters). If hashes match, they might be the same โ you need to verify character by character to rule out a collision.
Compute the hash of the pattern, then slide a window of length m across the text, computing the hash of each window. The "rolling" part is the trick โ you don't recompute the hash from scratch each time. To slide the window one position right:
Each slide takes O(1), so scanning the entire text is O(n). With a good hash function and a large prime modulus (the number you divide by to keep hash values bounded), collisions are rare and expected time is O(n+m).
Boyer-Moore is the outlier: it compares the pattern right-to-left (starting from the last character). This seems backwards, but it enables powerful skip rules. When it hits a mismatch, it uses two heuristics to figure out how far to skip:
Boyer-Moore is sub-linear in practice โ it often examines fewer than n characters total, because mismatches on the last character of the pattern allow skipping by m positions. For English text and patterns longer than ~5 characters, it's typically the fastest algorithm.
grep and most text editors use.Type a text and pattern to watch KMP's failure function in action. Green highlights show matches, red shows mismatches, and the failure table updates as the pattern is built. Click "Run KMP" to see the search animated step by step.
failure[0] is always 0 by definition (a single character has no proper prefix that's also a suffix). Starting at 0 and comparing pattern[0] with pattern[0] gives a wrong table. This is the #1 KMP implementation bug.
j = failure[m-1] to allow overlaps. If the problem wants non-overlapping matches, reset j to 0 instead.
(hash * BASE + char) % MOD. A common mistake is computing the full value first and modding at the end โ the intermediate value overflows long before you get there. Python handles big integers natively, but Java/C++ do not.
hash - char * h might be negative. In Python this is fine (Python's modulo handles negatives correctly), but in Java/C++ you need to add MOD before taking modulo: hash = ((hash - char * h) % MOD + MOD) % MOD. Missing this causes false negatives โ valid matches that your algorithm skips.
| Algorithm | Preprocessing | Search Time | Space | Best For |
|---|---|---|---|---|
| Brute Force | O(1) | O(nm) | O(1) | Short strings, simplicity |
| KMP | O(m) | O(n) | O(m) | Guaranteed linear, streams |
| Rabin-Karp | O(m) | O(n) expected | O(1) | Multi-pattern, plagiarism |
| Boyer-Moore | O(m + ฯ) | O(n/m) best | O(m + ฯ) | Long patterns, natural text |
| Aho-Corasick | O(ฮฃm_i) | O(n + matches) | O(ฮฃm_i ร ฯ) | Many patterns simultaneously |
Here ฯ (sigma) is the alphabet size โ the number of distinct characters possible. For lowercase English that's 26, for ASCII it's 256, for Unicode it's much larger. Boyer-Moore's O(n/m) best case is remarkable: with a pattern of length 10 in a text of length 1000, it might only examine ~100 characters. Each mismatch on the last pattern character lets it skip forward by 10 positions.
Aho-Corasick deserves mention even though it's not covered in depth here. It's a trie-based algorithm that searches for multiple patterns simultaneously in O(n + total_matches) time. Think of it as "KMP for many patterns at once." It's what powers tools like fgrep and many intrusion detection systems.
def kmp_search(text, pattern):
"""Find all occurrences of pattern in text using KMP.
Returns list of starting indices where pattern appears.
Time: O(n + m) guaranteed. Space: O(m) for the failure table.
The text pointer never moves backward โ each character is examined
at most twice (once for matching, once for falling back). This is
what makes KMP O(n) regardless of input.
"""
n, m = len(text), len(pattern)
if m == 0:
return []
# === Step 1: Build failure function ===
# failure[i] = length of longest proper prefix of pattern[0..i]
# that is also a suffix of pattern[0..i]
failure = [0] * m
j = 0 # length of previous longest prefix-suffix
for i in range(1, m): # Start at 1! failure[0] is always 0.
# Fall back until we find a match or hit the start
while j > 0 and pattern[i] != pattern[j]:
j = failure[j - 1]
if pattern[i] == pattern[j]:
j += 1
failure[i] = j
# === Step 2: Search text using the failure function ===
matches = []
j = 0 # position in pattern (how many chars matched so far)
for i in range(n):
# On mismatch, don't restart โ jump to the best fallback position
while j > 0 and text[i] != pattern[j]:
j = failure[j - 1]
if text[i] == pattern[j]:
j += 1
if j == m: # Full pattern matched!
matches.append(i - m + 1) # Starting index of this match
j = failure[j - 1] # Allow overlapping matches by falling back
# (set j = 0 here for non-overlapping matches)
return matches
function kmpSearch(text, pattern) {
const n = text.length, m = pattern.length;
if (m === 0) return [];
// Build failure function
const fail = new Array(m).fill(0);
let j = 0;
for (let i = 1; i < m; i++) {
while (j > 0 && pattern[i] !== pattern[j]) j = fail[j - 1];
if (pattern[i] === pattern[j]) j++;
fail[i] = j;
}
// Search
const matches = [];
j = 0;
for (let i = 0; i < n; i++) {
while (j > 0 && text[i] !== pattern[j]) j = fail[j - 1];
if (text[i] === pattern[j]) j++;
if (j === m) {
matches.push(i - m + 1);
j = fail[j - 1]; // Allow overlapping matches
}
}
return matches;
}
def rabin_karp(text, pattern):
"""Rolling hash string matching.
Expected O(n+m), worst O(nm) due to hash collisions.
Hash function: treat the string as a number in base BASE,
modulo MOD. For "ABC": ord('A')*256^2 + ord('B')*256 + ord('C')
all mod MOD.
The "rolling" trick: to slide the window right by one position,
remove the leftmost character's contribution and add the new
rightmost character. This is O(1) per slide.
"""
n, m = len(text), len(pattern)
if m > n:
return []
BASE = 256 # Number of possible characters (ASCII)
MOD = 1000000007 # Large prime โ reduces collision probability
# Hash the pattern and first window of text
p_hash = 0
t_hash = 0
# h = BASE^(m-1) mod MOD โ the multiplier for the leftmost character
h = pow(BASE, m - 1, MOD)
for i in range(m):
p_hash = (p_hash * BASE + ord(pattern[i])) % MOD
t_hash = (t_hash * BASE + ord(text[i])) % MOD
matches = []
for i in range(n - m + 1):
if p_hash == t_hash:
# Hashes match โ MUST verify character by character
# (hash collisions exist; this prevents false positives)
if text[i:i + m] == pattern:
matches.append(i)
# Roll the hash forward: remove leftmost char, add next char
if i < n - m:
t_hash = ((t_hash - ord(text[i]) * h) * BASE + ord(text[i + m])) % MOD
# In Python, % handles negatives correctly.
# In Java/C++, add MOD: t_hash = (t_hash % MOD + MOD) % MOD
return matches
function rabinKarp(text, pattern) {
const n = text.length, m = pattern.length;
if (m > n) return [];
const BASE = 256;
const MOD = 1000000007; // Large prime to minimize collisions
const matches = [];
// Compute hash of pattern and first window
let pHash = 0, tHash = 0;
// h = BASE^(m-1) mod MOD โ multiplier for leftmost character
let h = 1;
for (let i = 0; i < m - 1; i++) {
h = (h * BASE) % MOD;
}
for (let i = 0; i < m; i++) {
pHash = (pHash * BASE + pattern.charCodeAt(i)) % MOD;
tHash = (tHash * BASE + text.charCodeAt(i)) % MOD;
}
for (let i = 0; i <= n - m; i++) {
if (pHash === tHash) {
// Verify character by character (hash collision possible)
if (text.substring(i, i + m) === pattern) {
matches.push(i);
}
}
// Roll the hash: remove leftmost char, add next char
if (i < n - m) {
tHash = ((tHash - text.charCodeAt(i) * h) * BASE
+ text.charCodeAt(i + m)) % MOD;
// Handle negative values (JavaScript % can be negative)
if (tHash < 0) tHash += MOD;
}
}
return matches;
}
// rabinKarp("ABABCABABCABAB", "ABABC") โ [0, 5]
def rabin_karp_multi(text, patterns):
"""Search for multiple patterns simultaneously using Rabin-Karp.
This is where Rabin-Karp shines over KMP โ searching for k patterns
of the same length takes O(n + k*m) expected time instead of O(n*k).
Hash all patterns into a set, then roll across the text checking
if the window hash is in the set. Verification on match.
Used in plagiarism detection: hash every m-gram of both documents,
compare the sets. Shared hashes = potentially copied passages.
"""
if not patterns:
return {}
BASE = 256
MOD = 1000000007
n = len(text)
results = {p: [] for p in patterns}
# Group patterns by length (Rabin-Karp works on fixed window size)
by_length = {}
for p in patterns:
by_length.setdefault(len(p), []).append(p)
for m, group in by_length.items():
if m > n:
continue
# Hash all patterns of this length
pattern_hashes = {}
for p in group:
h = 0
for c in p:
h = (h * BASE + ord(c)) % MOD
pattern_hashes.setdefault(h, []).append(p)
# Roll across text
h_power = pow(BASE, m - 1, MOD)
t_hash = 0
for i in range(m):
t_hash = (t_hash * BASE + ord(text[i])) % MOD
for i in range(n - m + 1):
if t_hash in pattern_hashes:
window = text[i:i + m]
for p in pattern_hashes[t_hash]:
if window == p:
results[p].append(i)
if i < n - m:
t_hash = ((t_hash - ord(text[i]) * h_power) * BASE + ord(text[i + m])) % MOD
return results
def boyer_moore_simple(text, pattern):
"""Boyer-Moore with bad character rule only.
The bad character rule: on a mismatch, look at the text character
that failed. Find its rightmost occurrence in the pattern. Shift
the pattern to align them. If it doesn't appear, skip the whole
pattern length.
The full Boyer-Moore also uses the "good suffix" rule for even
bigger skips, but bad character alone gives great practical speed.
"""
n, m = len(text), len(pattern)
if m == 0:
return []
# Preprocess: last occurrence of each character in the pattern
# bad_char[c] = rightmost index of c in pattern, or -1 if absent
bad_char = {}
for i, c in enumerate(pattern):
bad_char[c] = i
matches = []
i = 0 # Text index โ where the pattern's left end is aligned
while i <= n - m:
# Compare right-to-left
j = m - 1
while j >= 0 and pattern[j] == text[i + j]:
j -= 1
if j < 0:
# Full match found
matches.append(i)
i += 1 # Simple shift (full BM uses good suffix here)
else:
# Mismatch at pattern[j] vs text[i+j]
# Shift pattern so the bad character aligns, or skip past it
bad_pos = bad_char.get(text[i + j], -1)
shift = max(1, j - bad_pos)
i += shift
return matches
function boyerMooreSimple(text, pattern) {
const n = text.length, m = pattern.length;
if (m === 0) return [];
// Preprocess: last occurrence of each character in pattern
const badChar = new Map();
for (let i = 0; i < m; i++) {
badChar.set(pattern[i], i);
}
const matches = [];
let i = 0; // left edge of pattern alignment
while (i <= n - m) {
let j = m - 1; // compare right-to-left
while (j >= 0 && pattern[j] === text[i + j]) {
j--;
}
if (j < 0) {
// Full match
matches.push(i);
i++; // simple shift; full BM would use good suffix
} else {
// Mismatch: shift based on bad character rule
const badPos = badChar.has(text[i + j]) ? badChar.get(text[i + j]) : -1;
i += Math.max(1, j - badPos);
}
}
return matches;
}
// boyerMooreSimple("ABAAABCD", "ABC") โ [4]
The Z-algorithm computes the Z-array: for each position i, z[i] is the length of the longest substring starting at i that matches a prefix of the string. For pattern matching, concatenate pattern + "$" + text and look for z-values โฅ pattern length. Some competitive programmers prefer it over KMP because the logic feels more direct.
def z_function(s):
"""Compute the Z-array for string s.
z[i] = length of the longest substring starting at position i
that matches a prefix of s.
Uses a "Z-box" [l, r) to track the rightmost known match window,
avoiding redundant comparisons. Runs in O(n) time.
"""
n = len(s)
z = [0] * n
z[0] = n # By convention, z[0] = length of the string
l, r = 0, 0 # Current Z-box: s[l..r-1] matches s[0..r-l-1]
for i in range(1, n):
if i < r:
# We're inside the Z-box โ reuse previously computed info
z[i] = min(r - i, z[i - l])
# Try to extend the match beyond what we already know
while i + z[i] < n and s[z[i]] == s[i + z[i]]:
z[i] += 1
# Update Z-box if we extended past the current right boundary
if i + z[i] > r:
l, r = i, i + z[i]
return z
def z_search(text, pattern):
"""Pattern matching using Z-algorithm.
Concatenate pattern + "$" + text ($ is a sentinel that doesn't
appear in either string). Compute Z-array. Any position where
z[i] >= len(pattern) indicates a match.
Time: O(n + m). Space: O(n + m) for the concatenated string.
"""
combined = pattern + "$" + text
z = z_function(combined)
m = len(pattern)
matches = []
for i in range(m + 1, len(combined)):
if z[i] >= m:
matches.append(i - m - 1) # Convert back to text index
return matches
# z_search("ABABCABABCABAB", "ABABC") โ [0, 5]
function zFunction(s) {
const n = s.length;
const z = new Array(n).fill(0);
z[0] = n;
let l = 0, r = 0;
for (let i = 1; i < n; i++) {
if (i < r) z[i] = Math.min(r - i, z[i - l]);
while (i + z[i] < n && s[z[i]] === s[i + z[i]]) z[i]++;
if (i + z[i] > r) { l = i; r = i + z[i]; }
}
return z;
}
function zSearch(text, pattern) {
const combined = pattern + '$' + text;
const z = zFunction(combined);
const m = pattern.length;
const matches = [];
for (let i = m + 1; i < combined.length; i++) {
if (z[i] >= m) matches.push(i - m - 1);
}
return matches;
}
System administrators and security analysts constantly search through log files for specific patterns โ error codes, IP addresses, attack signatures. When you're scanning gigabytes of logs, the algorithm matters.
import os
from datetime import datetime
def scan_logs_for_attacks(log_dir, attack_signatures):
"""Scan server logs for known attack patterns using Rabin-Karp.
Real IDS (Intrusion Detection Systems) like Snort use similar
multi-pattern matching. The key insight: you need to check MANY
patterns against EVERY packet/line. Rabin-Karp's multi-pattern
variant (or Aho-Corasick for production) does this efficiently.
log_dir: path to directory of log files
attack_signatures: dict of {name: pattern_string}
Returns: list of (file, line_number, attack_name, line) matches
"""
# Use Rabin-Karp for multiple patterns (same idea as above, simplified)
BASE = 256
MOD = 1000000007
alerts = []
for filename in sorted(os.listdir(log_dir)):
filepath = os.path.join(log_dir, filename)
if not os.path.isfile(filepath):
continue
with open(filepath, 'r', errors='ignore') as f:
for line_num, line in enumerate(f, 1):
# Check each signature against this line
for sig_name, pattern in attack_signatures.items():
if kmp_search(line, pattern):
alerts.append({
"file": filename,
"line": line_num,
"attack": sig_name,
"content": line.strip()[:200],
"timestamp": datetime.now().isoformat()
})
return alerts
# Common attack patterns to look for in web server logs
ATTACK_SIGNATURES = {
"SQL Injection": "' OR 1=1",
"XSS Attempt": "