Python Regular Expressions Interview Questions
\d - Matches any digit [0-9]
\w - Matches word character [a-zA-Z0-9_]
\s - Matches whitespace
^ - Matches start of string
$ - Matches end of string
[] - Character class
| - OR operator
+ - 1 or more occurrences
? - 0 or 1 occurrence
{n} - Exactly n occurrences
{n,} - n or more occurrences
{n,m} - Between n and m occurrences
Example: a{2,4} matches 'aa', 'aaa', or 'aaaa'.
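A minimal sketch of the bounded quantifier above. The sample text is hypothetical, and the \b anchors are an addition so that each run of 'a' is judged as a whole.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "a aa aaa aaaa aaaaa"

# \b on both sides keeps each run of 'a' whole, so only runs of 2 to 4 qualify.
bounded = re.findall(r"\ba{2,4}\b", text)
print(bounded)  # ['aa', 'aaa', 'aaaa']

# Without boundaries, a{2,4} greedily takes four characters out of a longer run.
unbounded = re.findall(r"a{2,4}", "aaaaa")
print(unbounded)  # ['aaaa']
```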
Lookahead: (?=...) - matches if ... follows
Negative lookahead: (?!...) - matches if ... doesn't follow
Lookbehind: (?<=...) - matches if ... precedes
Negative lookbehind: (?<!...) - matches if ... doesn't precede
Example: r'\d(?=px)' matches digit only if followed by 'px'.
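The four lookaround forms can be compared side by side. This is a sketch on a made-up string; the variable names are illustrative only.

```python
import re

# Hypothetical input mixing units and a currency amount.
text = "width: 12px; height: 30em; cost: $45"

# Lookahead: a single digit directly followed by 'px'.
single_digit = re.findall(r"\d(?=px)", text)          # ['2']

# Lookahead with \d+ returns the whole number in front of 'px'.
px_values = re.findall(r"\d+(?=px)", text)            # ['12']

# Negative lookahead: numbers not followed by another digit or by 'p'.
not_px = re.findall(r"\d+(?![\dp])", text)            # ['30', '45']

# Lookbehind: digits directly preceded by '$'.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)      # ['45']

# Negative lookbehind: numbers not preceded by '$' or by another digit.
not_dollar = re.findall(r"(?<![\d$])\d+", text)       # ['12', '30']
```

None of the lookarounds consume text, which is why 'px' and '$' never appear in the results.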
re.IGNORECASE or re.I - case-insensitive matching
re.MULTILINE or re.M - ^ and $ match start/end of each line
re.DOTALL or re.S - . matches newline
Example: re.search('python', 'PYTHON', re.I) matches case-insensitively.
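A short sketch of what each flag changes, using a hypothetical three-line log string.

```python
import re

# Hypothetical multi-line input.
log = "INFO start\nerror: disk full\nINFO done"

# re.I: letter case is ignored.
ignore_case = re.search("python", "PYTHON", re.I) is not None

# re.M: ^ anchors at the start of every line, not just the string.
first_words_default = re.findall(r"^\w+", log)            # ['INFO']
first_words_multiline = re.findall(r"^\w+", log, re.M)    # ['INFO', 'error', 'INFO']

# re.S: the dot may cross newlines.
dot_default = re.search(r"start.*done", log)              # None
dot_all = re.search(r"start.*done", log, re.S)            # a match object
```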
2. Greedy matching when non-greedy needed
3. Forgetting to escape special characters
4. Not handling edge cases
5. Performance issues with complex patterns
6. Unicode issues with \w and \b
Always test with various inputs and consider performance for large texts.
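Two of the pitfalls above, greedy matching and unescaped special characters, can be reproduced in a few lines. The inputs are made up for illustration.

```python
import re

# Hypothetical markup fragment.
html = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the last closing tag and swallows everything between.
greedy = re.findall(r"<b>.*</b>", html)     # ['<b>one</b> and <b>two</b>']

# Non-greedy .*? stops at the first closing tag.
lazy = re.findall(r"<b>.*?</b>", html)      # ['<b>one</b>', '<b>two</b>']

# An unescaped dot matches any character, so '3x14' slips through.
unescaped = re.findall(r"3.14", "3.14 3x14")            # ['3.14', '3x14']

# re.escape turns the literal text into a safe pattern.
escaped = re.findall(re.escape("3.14"), "3.14 3x14")    # ['3.14']
```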
Tricky interview questions
re.match tries only at the start of the string, while re.search scans forward for the first match anywhere. Interviewers trip up candidates who assume match means “find substring.”
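A minimal sketch of the difference, on hypothetical strings:

```python
import re

# The number is not at the start, so match finds nothing.
at_start = re.match(r"\d+", "order 42")           # None

# search scans forward and finds it.
anywhere = re.search(r"\d+", "order 42").group()  # '42'

# match succeeds once the number leads the string.
leading = re.match(r"\d+", "42 orders").group()   # '42'
```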
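With re.MULTILINE, ^ and $ anchor at every line boundary, not only at the ends of the whole string. Without the flag, ^ matches only at the very start, and $ only at the very end or just before a trailing newline.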
No, by default the dot does not match a newline. Enable re.DOTALL if the dot must span lines; multiline HTML and logs commonly trip this up.
Nested quantifiers such as (a+)+ can make backtracking work grow exponentially on inputs that almost match. Mitigate by simplifying the pattern, refactoring it so each character can be matched in only one way, or using a real parser for nested languages.
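One way to picture the refactoring, as a sketch: the nested pattern and its flattened form accept the same strings, but only the flattened one is safe on a long near-miss. The patterns and inputs are illustrative assumptions, and the nested pattern is deliberately never run on the long input.

```python
import re

risky = re.compile(r"(a+)+$")   # nested quantifier: many ways to split a run of 'a'
safe = re.compile(r"a+$")       # flattened: one way to match the same strings

# On short inputs both patterns agree.
samples = ["a", "aaaa", "aab", "", "b"]
agree = all(bool(risky.match(s)) == bool(safe.match(s)) for s in samples)

# Only the flattened pattern is tried on a long input that almost matches;
# the nested one can take impractically long on inputs like this.
near_miss = "a" * 5000 + "b"
rejected = safe.match(near_miss) is None
```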
If the pattern passed to re.findall contains capturing groups, the results are the groups (a tuple per match when there are several) rather than the whole matches. Use (?:...) non-capturing groups when you want whole-match strings.
(?:...) groups for precedence without capturing, which avoids polluting findall results and keeps group numbering stable.
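A sketch of how capturing groups change what findall returns, on made-up inputs:

```python
import re

# Several capturing groups: one tuple per match.
dates = re.findall(r"(\d{4})-(\d{2})", "2024-01 and 2025-12")
# [('2024', '01'), ('2025', '12')]

# One capturing group: only the group's last captured text is returned.
captured = re.findall(r"(ab)+", "abab ab")        # ['ab', 'ab']

# Non-capturing group: the whole match comes back.
whole = re.findall(r"(?:ab)+", "abab ab")         # ['abab', 'ab']
```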
str patterns use Unicode semantics by default, so letters beyond ASCII can match \w. Use re.ASCII when you need strict ASCII word rules.
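A small sketch of the difference. The accented letters are written as escapes so the example does not depend on file encoding.

```python
import re

# "café naïve" with precomposed accented letters.
text = "caf\u00e9 na\u00efve"

# Default: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)            # two words

# re.ASCII: only [a-zA-Z0-9_] count, so words break at the accents.
ascii_words = re.findall(r"\w+", text, re.ASCII)    # ['caf', 'na', 've']
```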
In Python's re module, lookbehind patterns must be fixed width; an arbitrary-length lookbehind fails at compile time. Other engines differ.
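A sketch of the compile-time failure and a fixed-width alternative. The currency patterns are hypothetical.

```python
import re

try:
    # \d+ has no fixed width, so the standard re module rejects this lookbehind.
    re.compile(r"(?<=\$\d+) USD")
    compiled = True
except re.error as exc:
    compiled = False
    reason = str(exc)   # the engine's explanation of the failure

# A fixed-width lookbehind (exactly '$' plus two digits) compiles and matches.
fixed = re.search(r"(?<=\$\d{2}) USD", "$45 USD")
```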
Use re.fullmatch when validating that the entire string conforms (IDs, tokens); it avoids accidentally accepting partial substring matches.
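A minimal sketch with a hypothetical ID format (two capital letters, four digits):

```python
import re

ID = re.compile(r"[A-Z]{2}\d{4}")   # hypothetical ID format

# match accepts a valid prefix and ignores the trailing junk.
prefix_ok = ID.match("AB1234-extra") is not None     # True

# fullmatch requires the whole string to conform.
whole_bad = ID.fullmatch("AB1234-extra") is None     # True
whole_ok = ID.fullmatch("AB1234") is not None        # True

# $ also matches just before a trailing newline; fullmatch does not allow it.
dollar_newline = re.match(r"[A-Z]{2}\d{4}$", "AB1234\n") is not None   # True
full_newline = ID.fullmatch("AB1234\n") is None                        # True
```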
An inline (?i) at the start of a pattern turns on case-insensitive matching for the whole pattern, and the scoped form (?i:...) applies it to one segment. Handy, but it can hurt readability if overused.
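A sketch of the global and scoped forms, on hypothetical inputs:

```python
import re

# Global inline flag at the start: the whole pattern ignores case.
whole = re.search(r"(?i)python", "PYTHON")

# Scoped flag: only 'py' ignores case.
scoped_hit = re.search(r"(?i:py)thon", "PYthon")

# 'THON' is outside the scoped group, so it must still be lowercase.
scoped_miss = re.search(r"(?i:py)thon", "PYTHON")
```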
With re.VERBOSE, whitespace is ignored outside character classes, so you can spread a pattern across lines and comment it. Escape spaces and literal # characters where needed.
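A sketch of a commented pattern. The "date plus ticket number" format is invented for the example.

```python
import re

DATE_TAG = re.compile(
    r"""
    (\d{4}) - (\d{2}) - (\d{2})   # year, month, day; the spaces here are ignored
    [ ]                           # a literal space has to sit in a class (or be escaped)
    \# (\d+)                      # an escaped hash; a bare one would start a comment
    """,
    re.VERBOSE,
)

match = DATE_TAG.fullmatch("2024-01-15 #7")
print(match.groups())  # ('2024', '01', '15', '7')
```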
Patterns that can match the empty string produce empty matches. Advance indices carefully or you risk infinite loops when consuming input manually.
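A sketch of a manual scanning loop that steps past empty matches. The scan helper is hypothetical, written only to show the index handling.

```python
import re

DIGITS = re.compile(r"\d*")   # can match the empty string

def scan(pattern, text):
    """Collect match spans by hand, stepping past empty matches."""
    spans = []
    pos = 0
    while pos <= len(text):
        m = pattern.match(text, pos)
        if m is None:
            pos += 1
            continue
        spans.append(m.span())
        # An empty match consumes nothing, so move forward by one manually.
        pos = m.end() if m.end() > pos else pos + 1
    return spans

spans = scan(DIGITS, "a1b")
print(spans)  # [(0, 0), (1, 2), (2, 2), (3, 3)]
```

Dropping the "pos + 1" branch would leave pos unchanged after an empty match and the loop would never finish.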
When the replacement passed to re.sub is a function, it receives a match object per occurrence, which enables conditional replacements beyond literal strings.
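A minimal sketch of a conditional replacement. The describe function and its threshold are illustrative assumptions.

```python
import re

def describe(match):
    """Replace numbers of 10 or more with 'many'; keep smaller ones as they are."""
    value = int(match.group())
    return "many" if value >= 10 else match.group()

result = re.sub(r"\d+", describe, "3 apples and 10 pears")
print(result)  # 3 apples and many pears
```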
With named groups (?P<name>...), groupdict() yields readable keys, which scales better than remembering numeric indexes in complex parsers.
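A sketch with two named groups. The pattern is a deliberately loose illustration, not an address validator.

```python
import re

# Hypothetical, simplified pattern for the example only.
ADDRESS = re.compile(r"(?P<user>\w+)@(?P<host>[\w.]+)")

m = ADDRESS.match("alice@example.com")
fields = m.groupdict()
print(fields)          # {'user': 'alice', 'host': 'example.com'}
print(m.group("host")) # example.com
```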
With re.split, delimiters captured by groups appear in the resulting list, which differs from simple string split.
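A minimal sketch of the difference a capturing group makes:

```python
import re

# No group: delimiters are dropped.
plain = re.split(r"[,;]", "a,b;c")        # ['a', 'b', 'c']

# Capturing group: delimiters are kept between the pieces.
kept = re.split(r"([,;])", "a,b;c")       # ['a', ',', 'b', ';', 'c']
```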
Standard matching is non-overlapping. Overlapping scans need lookahead tricks or shifting indices yourself.
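One common lookahead trick, sketched: wrap the pattern in a capturing group inside a lookahead so that each match consumes nothing.

```python
import re

# Ordinary scan: each match consumes its characters.
non_overlapping = re.findall(r"aa", "aaaa")          # ['aa', 'aa']

# Zero-width lookahead: a match can start at every position.
overlapping = re.findall(r"(?=(aa))", "aaaa")        # ['aa', 'aa', 'aa']
```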
Yes, compiled patterns can be reused: pattern objects are immutable, and compiling once saves work in hot loops (still measure if micro-optimizing).
Classic interview trap: regex can extract snippets, but nested or malformed HTML needs a parser (html.parser, lxml). The same goes for strictly validating email addresses.
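A sketch of the parser route using the standard library's html.parser. The LinkCollector class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however deeply they are nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<div><a href="/one">1<div><a href="/two">2</a></div></a></div>')
print(collector.links)  # ['/one', '/two']
```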
A raw string cannot end with a lone backslash; you still have to balance quotes and escapes where the lexer demands.
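A sketch of what raw strings do and do not change. The folder path is made up.

```python
# A raw string keeps backslashes as written.
same_pattern = r"\d+" == "\\d+"      # True
raw_length = len(r"\n")              # 2: a backslash and the letter n

# Writing r"C:\temp\" is a syntax error, because the final backslash
# would escape the closing quote. Append the backslash separately instead.
folder = r"C:\temp" + "\\"
```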
finditer streams matches lazily, which suits large texts. For many passes, reuse compiled patterns and avoid re-scanning unchanged slices unnecessarily.
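A sketch combining a reused compiled pattern with finditer. The first_word_offsets helper is hypothetical; it stops after a few matches instead of building the full list.

```python
import re
from itertools import islice

WORD = re.compile(r"\w+")   # compiled once, reused on every call

def first_word_offsets(text, limit):
    """Return start offsets of the first `limit` words, consuming matches lazily."""
    return [m.start() for m in islice(WORD.finditer(text), limit)]

offsets = first_word_offsets("alpha beta gamma delta", 2)
print(offsets)  # [0, 6]
```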