Diagrams: Regex Explained
Regex Engine Matching Flow
How the regex engine processes a pattern against a string, step by step.
flowchart TD
INPUT["Input: text = 'hello 42 world'<br/>Pattern: r'\\d+'"] --> START["Engine starts at position 0"]
START --> POS0{"Position 0: 'h'<br/>Matches \\d?"}
POS0 -->|"No"| POS1{"Position 1: 'e'<br/>Matches \\d?"}
POS1 -->|"No"| POS2{"Positions 2-5: 'llo '<br/>Matches \\d?"}
POS2 -->|"No"| POS6{"Position 6: '4'<br/>Matches \\d?"}
POS6 -->|"Yes!"| EXTEND{"Position 7: '2'<br/>Matches \\d+ (more)?"}
EXTEND -->|"Yes!"| EXTEND2{"Position 8: ' '<br/>Matches \\d+ (more)?"}
EXTEND2 -->|"No — stop"| MATCH["Match found!<br/>'42' at positions 6-7"]
MATCH --> MODE{"Which function?"}
MODE -->|"search()"| RETURN_FIRST["Return first match: '42'"]
MODE -->|"findall()"| CONTINUE["Continue scanning<br/>from position 8..."]
MODE -->|"match()"| NO_MATCH["No match<br/>(string doesn't start with \\d)"]
style INPUT fill:#cc5de8,stroke:#9c36b5,color:#fff
style MATCH fill:#51cf66,stroke:#27ae60,color:#fff
style RETURN_FIRST fill:#51cf66,stroke:#27ae60,color:#fff
style NO_MATCH fill:#ff6b6b,stroke:#c92a2a,color:#fff
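The flow above can be reproduced with Python's `re` module. This is a minimal sketch, assuming the diagram's `search()`, `findall()`, and `match()` refer to the `re` functions of the same names; the variable names are illustrative only.

```python
import re

text = "hello 42 world"
pattern = re.compile(r"\d+")

# search() scans forward from position 0 and returns the first match found.
first = pattern.search(text)
print(first.group(), first.span())  # 42 (6, 8)

# findall() keeps scanning after each match and collects every match.
print(pattern.findall(text))  # ['42']

# match() only tries position 0, so it fails here: 'h' is not a digit.
print(pattern.match(text))  # None
```

Note that `span()` reports an exclusive end index, so `(6, 8)` covers the same characters as "positions 6-7" in the diagram.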
Character Class Hierarchy
How regex character classes relate to each other, from general to specific.
flowchart TD
DOT[". (dot)<br/>Any character except newline"] --> WORD["\\w — Word characters<br/>Letters + Digits + Underscore"]
WORD --> ALPHA["[a-zA-Z]<br/>Letters only"]
WORD --> DIGIT["\\d — Digits<br/>[0-9]"]
WORD --> UNDER["[_]<br/>Underscore"]
DOT --> SPACE["\\s — Whitespace<br/>Space, tab, newline"]
DOT --> PUNCT["[^\\w\\s]<br/>Punctuation and symbols"]
subgraph NEGATIONS ["Negated Classes"]
ND["\\D — NOT a digit<br/>Same as [^0-9]"]
NW["\\W — NOT a word char<br/>Same as [^a-zA-Z0-9_]"]
NS["\\S — NOT whitespace<br/>Same as [^ \\t\\n\\r\\f\\v]"]
end
subgraph CUSTOM ["Custom Classes"]
CC1["[aeiou] — Vowels only"]
CC2["[A-Fa-f0-9] — Hex digits"]
CC3["[^aeiou] — NOT vowels"]
end
style DOT fill:#cc5de8,stroke:#9c36b5,color:#fff
style WORD fill:#4a9eff,stroke:#2670c2,color:#fff
style DIGIT fill:#51cf66,stroke:#27ae60,color:#fff
style SPACE fill:#ff922b,stroke:#e8590c,color:#fff
style NEGATIONS fill:#ff6b6b,stroke:#c92a2a,color:#fff
style CUSTOM fill:#ffd43b,stroke:#f59f00,color:#000
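The classes in this diagram can be checked against a short string. The sketch below assumes Python's `re` module and passes `re.ASCII` so that `\d`, `\w`, and `\s` behave like the bracket equivalents shown above (without that flag, Python also matches non-ASCII digits and letters). The `sample` string is made up for illustration.

```python
import re

sample = "user_42 paid $3.50!\n"

# Shorthand classes, restricted to their ASCII definitions.
digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)

# Everything that is neither a word character nor whitespace.
punct = re.findall(r"[^\w\s]", sample, re.ASCII)

# A custom class: lowercase vowels only.
vowels = re.findall(r"[aeiou]", sample)

print(digits)  # ['4', '2', '3', '5', '0']
print(words)   # ['user_42', 'paid', '3', '50']
print(spaces)  # [' ', ' ', '\n']
print(punct)   # ['$', '.', '!']
print(vowels)  # ['u', 'e', 'a', 'i']

# A negated shorthand and its bracket form select the same characters.
print(re.findall(r"\D", "a1b2", re.ASCII) == re.findall(r"[^0-9]", "a1b2"))  # True

# The dot skips newlines unless re.DOTALL is set.
print(re.findall(r".", "a\nb"))  # ['a', 'b']
```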
Common Pattern Decision Tree
Not sure which regex pattern to use? Follow this flowchart.
flowchart TD
START["What are you matching?"] --> Q1{"Digits?"}
Q1 -->|"Yes"| Q1A{"How many?"}
Q1A -->|"Any amount"| D_PLUS["\\d+"]
Q1A -->|"Exact count"| D_EXACT["\\d{3} (exactly 3)"]
Q1A -->|"Range"| D_RANGE["\\d{2,4} (2 to 4)"]
Q1 -->|"No"| Q2{"Words/letters?"}
Q2 -->|"Yes"| Q2A{"Include digits<br/>and underscore?"}
Q2A -->|"Yes"| W_PLUS["\\w+"]
Q2A -->|"Letters only"| ALPHA_PLUS["[a-zA-Z]+"]
Q2 -->|"No"| Q3{"Specific format?"}
Q3 -->|"Email"| EMAIL["[\\w.+-]+@[\\w-]+\\.[\\w.]+"]
Q3 -->|"Date"| DATE["\\d{4}-\\d{2}-\\d{2}"]
Q3 -->|"Phone"| PHONE["\\d{3}[-.]\\d{3}[-.]\\d{4}"]
Q3 -->|"IP address"| IP["\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"]
Q3 -->|"No"| Q4{"Boundaries?"}
Q4 -->|"Start of string"| CARET["^pattern"]
Q4 -->|"End of string"| DOLLAR["pattern$"]
Q4 -->|"Whole word"| BOUNDARY["\\bword\\b"]
Q4 -->|"Entire string"| FULL["^pattern$"]
style D_PLUS fill:#51cf66,stroke:#27ae60,color:#fff
style D_EXACT fill:#51cf66,stroke:#27ae60,color:#fff
style D_RANGE fill:#51cf66,stroke:#27ae60,color:#fff
style W_PLUS fill:#4a9eff,stroke:#2670c2,color:#fff
style ALPHA_PLUS fill:#4a9eff,stroke:#2670c2,color:#fff
style EMAIL fill:#cc5de8,stroke:#9c36b5,color:#fff
style DATE fill:#cc5de8,stroke:#9c36b5,color:#fff
style PHONE fill:#cc5de8,stroke:#9c36b5,color:#fff
style IP fill:#cc5de8,stroke:#9c36b5,color:#fff
Greedy vs Lazy Matching
By default, quantifiers are greedy: they match as much text as possible while still letting the overall pattern succeed. Appending ? to a quantifier makes it lazy: it matches as little text as possible.
flowchart TD
INPUT["Text: '<b>bold</b> and <b>more</b>'<br/>Pattern: <b>.*</b>"]
INPUT --> GREEDY["Greedy: .*<br/>Match as MUCH as possible"]
INPUT --> LAZY["Lazy: .*?<br/>Match as LITTLE as possible"]
GREEDY --> G_RESULT["'<b>bold</b> and <b>more</b>'<br/>Grabs everything between first <b><br/>and LAST </b>"]
LAZY --> L_RESULT["'<b>bold</b>'<br/>Stops at the FIRST </b><br/>it finds"]
subgraph RULE ["The Rule"]
R1["* + ? {} → Greedy by default"]
R2["*? +? ?? {}? → Add ? for lazy"]
end
style GREEDY fill:#ff6b6b,stroke:#c92a2a,color:#fff
style LAZY fill:#51cf66,stroke:#27ae60,color:#fff
style G_RESULT fill:#ff6b6b,stroke:#c92a2a,color:#fff
style L_RESULT fill:#51cf66,stroke:#27ae60,color:#fff
style RULE fill:#4a9eff,stroke:#2670c2,color:#fff
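The difference shown in the diagram can be observed with the same text and both forms of the pattern. A minimal sketch, assuming Python's `re` module; the capture-group variant at the end is an addition for illustration and is not part of the diagram.

```python
import re

html = "<b>bold</b> and <b>more</b>"

# Greedy: .* runs to the end of the string, then backs up to the LAST </b>.
greedy = re.search(r"<b>.*</b>", html).group()
print(greedy)  # <b>bold</b> and <b>more</b>

# Lazy: .*? expands one character at a time and stops at the FIRST </b>.
lazy = re.search(r"<b>.*?</b>", html).group()
print(lazy)  # <b>bold</b>

# With a capture group, the lazy form extracts each tag's contents separately.
contents = re.findall(r"<b>(.*?)</b>", html)
print(contents)  # ['bold', 'more']
```

For real HTML, a parser is usually a safer choice than a regex; the tags here only serve as a compact demonstration of quantifier behavior.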