RegEx (Regular Expressions) NOTES

A regular expression (RegEx) is a sequence of characters, arranged according to specific rules, that defines a search pattern. It helps us find specific patterns in a string.

Suppose we have a big string and we want to find numbers or some other pattern in it; we can use RegEx to do that.
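
As a minimal sketch of that idea, the snippet below pulls every number out of a longer string in JavaScript. The sample text and the function name extractNumbers are made up for illustration; the pieces of the pattern are explained in later sections.

```javascript
// Hypothetical helper: returns every run of digits found in the text.
function extractNumbers(text) {
  // \d+ means "one or more digits"; the g flag asks for all matches.
  const matches = text.match(/\d+/g);
  // match() returns null when nothing is found, so fall back to [].
  return matches === null ? [] : matches;
}

const sample = "Order 66 shipped 3 boxes on day 21";
console.log(extractNumbers(sample));           // ["66", "3", "21"]
console.log(extractNumbers("no digits here")); // []
```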

---------------------------------------------------- IMPORTANT ----------------------------------------------------

1. You can use RegEx in VS Code by pressing CTRL + F (the Find shortcut). You will see 3 buttons in the find input box; select the 3rd one (Use Regular Expression). Then you can write any regex you want in that input box.

2. Note: If you are using regex in VS Code through CTRL + F, you don't need to type the delimiters; the input box takes the bare pattern.

1. Delimiters

A delimiter marks the limits (boundaries) of something.

For example: HTML comments have boundaries like <!-- and -->, and a div has an opening and a closing tag, like <div> boundaries </div>.

The concept of a "delimiter" for a regular expression (regex) itself isn't universal across all programming languages, as some languages integrate regex directly into string methods or dedicated regex objects without explicit delimiters. However, in many languages, especially those influenced by Perl, you'll find clear delimiters used to define the regex pattern.

Languages with Explicit Delimiters:

Perl: This is where the concept of regex delimiters is most prominent.
- Common: Forward slashes /pattern/ are the most common.
- Alternatives: You can use almost any non-alphanumeric, non-backslash, non-whitespace character as a delimiter. This is useful if your pattern contains forward slashes and you want to avoid escaping them. Examples include m{pattern}, m(pattern), m#pattern#, m~pattern~, etc. The same applies to substitution (e.g., s/pattern/replacement/) and transliteration (e.g., tr/searchlist/replacementlist/).
PHP: PHP's PCRE (Perl Compatible Regular Expressions) functions require delimiters.
- Common: Forward slashes /pattern/ are widely used.
- Alternatives: Similar to Perl, you can use other non-alphanumeric, non-backslash, non-whitespace characters like #, ~, %, {}, (), [], <>.
JavaScript: When using regular expression literals, delimiters are required.
- Common: Forward slashes /pattern/ are the standard way to define a regex literal.
- No other literal delimiters: Unlike Perl or PHP, JavaScript does not allow other characters as literal delimiters. If you need to define a regex dynamically (e.g., from a string variable), you use the RegExp constructor, where the pattern is a string: new RegExp("pattern").
Ruby: Ruby's regex literals typically use slashes.
- Common: Forward slashes /pattern/ are the most common.
- Alternatives: You can also use %r{pattern} (or (), [], <>) to define a regex, which can be useful if your pattern contains slashes.
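
Since the rest of these notes use JavaScript, here is a rough sketch of the two JavaScript forms described above: the /pattern/ literal and the RegExp constructor. The sample path is invented for illustration.

```javascript
// A literal uses slashes as delimiters, so a slash inside the pattern must be escaped.
const literal = /\/home\/user/;

// The constructor takes a plain string: no delimiters, so slashes need no escaping.
const constructed = new RegExp("/home/user");

// In a string literal, a backslash must itself be doubled to reach the regex engine.
const digitsFromString = new RegExp("\\d+"); // same pattern as /\d+/

const samplePath = "/home/user/notes42.txt"; // hypothetical sample input
console.log(literal.test(samplePath));               // true
console.log(constructed.test(samplePath));           // true
console.log(samplePath.match(digitsFromString)[0]);  // "42"
```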

Languages without Explicit Delimiters (or where they are implied by method calls):

Python: Python's re module doesn't use explicit delimiters like slashes in its function calls. You pass the regex pattern as a string.
- Example: re.search('pattern', 'string') or re.split(r'pattern', 'string').
- Raw strings: It's common practice to use raw strings (prefixed with r, e.g., r'\d+') for regex patterns in Python to avoid issues with backslash escaping.
Java: Java's java.util.regex package also uses strings to define regex patterns.
- Example: Pattern.compile("pattern") or string.split("pattern").
- Double backslashes: Because string literals in Java interpret backslashes, you often need to double escape them in regex patterns (e.g., "\\d+" to match a digit).
C# (.NET): The System.Text.RegularExpressions namespace in C# takes regex patterns as strings.
- Example: Regex.Match("string", "pattern") or Regex.Split("string", "pattern").
- Verbatim string literals: C# has verbatim string literals (prefixed with @, e.g., @"pattern") that can be helpful for regex to avoid double escaping backslashes, similar to Python's raw strings.
Go: Go's regexp package uses string patterns.
- Example: regexp.Compile("pattern") or regexp.MustCompile("pattern").Split(text, -1).

2. Literals (or Literal Characters)

Literal characters are characters that match themselves directly. If you put a character in your regex that isn't a metacharacter, it will be treated as a literal.

Examples:

/a/ will match the character "a".
/hello/ will match the exact string "hello".
/123/ will match the exact string "123".
/-_=/ will match the exact string "-_=".

Escaping Metacharacters:

If you want to match a metacharacter literally, you need to "escape" it by placing a backslash (\) before it. This tells the regex engine to treat the special character as a regular character.

Example:

If you want to match a literal dot (.), you'd use /\./. (Otherwise, . is a metacharacter that matches any character except newline).
If you want to match a literal asterisk (*), you'd use /\*/. (Otherwise, * is a metacharacter for zero or more repetitions).
If you want to match a literal backslash (\), you'd use /\\/.
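
The difference between an escaped and an unescaped metacharacter can be sketched as follows. The sample strings are invented, and escapeRegExp is a hypothetical helper (not a built-in) for the case where the text to match comes from a variable.

```javascript
const price = "Total: 3.50 (approx.) * 2";

// An unescaped dot matches ANY character, so "3x50" also matches.
console.log(/3.50/.test("3x50"));  // true
// An escaped dot matches only a literal ".".
console.log(/3\.50/.test("3x50")); // false
console.log(/3\.50/.test(price));  // true

// An escaped asterisk and an escaped backslash are matched literally too.
console.log(/\*/.test(price));      // true
console.log(/\\/.test("C:\\temp")); // true

// Hypothetical helper: put a backslash in front of every metacharacter
// so arbitrary text can be embedded in a pattern as a literal.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literalPattern = new RegExp(escapeRegExp("(approx.)"));
console.log(literalPattern.test(price)); // true
```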

3. Regex Flags in JavaScript

g (Global search)

Meaning: This flag ensures that the regular expression will search for all matches in the string, rather than stopping after the first match.
Behavior without g: If g is not used, methods like String.prototype.match() will return only the first match, and RegExp.prototype.exec() will also find only one match per call (though subsequent calls on the same regex object will find the next match).

Example:

JavaScript
const str = "apple banana apple orange";
const regexWithoutG = /apple/;
const regexWithG = /apple/g;

console.log(str.match(regexWithoutG)); // ["apple", index: 0, input: "apple banana apple orange", groups: undefined]
console.log(str.match(regexWithG));    // ["apple", "apple"]

let match;
while ((match = regexWithG.exec(str)) !== null) {
    console.log(`Found ${match[0]} at index ${match.index}`);
}
// Output:
// Found apple at index 0
// Found apple at index 13
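
One more option, not covered in the notes above, is String.prototype.matchAll() (ES2020). A small sketch, assuming a runtime that supports it; the variable names are made up:

```javascript
const fruitText = "apple banana apple orange";

// matchAll() requires the g flag and returns an iterator of full match objects,
// so each match keeps its index (unlike match() with g).
const positions = [];
for (const m of fruitText.matchAll(/apple/g)) {
  positions.push(m.index);
}
console.log(positions); // [0, 13]
```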

i (Case-insensitive search)

Meaning: This flag makes the regular expression perform a case-insensitive match. It ignores the difference between uppercase and lowercase letters.

Example:

JavaScript
const str = "Hello World";
const regexWithoutI = /hello/;
const regexWithI = /hello/i;

console.log(str.match(regexWithoutI)); // null
console.log(str.match(regexWithI));    // ["Hello", index: 0, input: "Hello World", groups: undefined]

m (Multiline search)

Meaning: This flag changes the behavior of ^ (start of string) and $ (end of string) anchors.
- Without m: ^ matches only the very beginning of the entire input string, and $ matches only the very end of the entire input string.
- With m: ^ matches the beginning of the entire input string and the beginning of each line (after a newline character \n or \r). Similarly, $ matches the end of the entire input string and the end of each line.

Example:

JavaScript
const str = "Line 1\nLine 2\nLine 3";
const regexWithoutM = /^Line/g; // Note: 'g' is used here to find all occurrences within the string
const regexWithM = /^Line/gm;

console.log(str.match(regexWithoutM)); // ["Line"]
console.log(str.match(regexWithM));    // ["Line", "Line", "Line"]

const regexEndWithoutM = /3$/g;
const regexEndWithM = /2$/gm;

console.log(str.match(regexEndWithoutM)); // ["3"]
console.log(str.match(regexEndWithM));    // ["2"]

s (DotAll mode) - (Introduced in ES2018)

Meaning: This flag changes the behavior of the . (dot) special character.
- Without s: The . matches any character except newline characters (\n, \r, \u2028, \u2029).
- With s: The . matches any character, including newline characters.

Example:

JavaScript
const str = "First line\nSecond line";
const regexWithoutS = /line.Second/;
const regexWithS = /line.Second/s;

console.log(str.match(regexWithoutS)); // null
console.log(str.match(regexWithS));    // ["line\nSecond", index: 5, input: "First line\nSecond line", groups: undefined]

u (Unicode support) - (Introduced in ES6)

Meaning: This flag enables full Unicode support for the regular expression. It's crucial when working with Unicode characters beyond the basic Latin set (e.g., emojis, characters from different languages).
Key impacts of u:
- Unicode code point escapes: Allows you to use \u{xxxx} for code points greater than 0xFFFF.
- Proper handling of astral plane characters: Characters like emojis (which occupy two JavaScript "characters" because they're represented by surrogate pairs) are treated as single characters.
- Unicode property escapes (\p{...} and \P{...}): Available since ES2018 whenever the u flag (or the newer v flag, see below) is set.

Example:

JavaScript
// Example with astral plane character
const emoji = "👍"; // Unicode code point U+1F44D
const regexWithoutU = /./; // Matches one JavaScript "character"
const regexWithU = /./u;   // Matches one Unicode code point

console.log(emoji.match(regexWithoutU)[0].length); // 1 (only the first surrogate code unit is matched)
console.log(emoji.match(regexWithU)[0].length);    // 2 (the whole surrogate pair, i.e. one code point)

console.log("Match of regexWithoutU: ", emoji.match(regexWithoutU)); // ["\uD83D", index: 0, input: "👍", groups: undefined] (a lone surrogate; often displays as a replacement character)
console.log("Match of regexWithU: ", emoji.match(regexWithU));     // ["👍", index: 0, input: "👍", groups: undefined]

// Example with Unicode code point escape
console.log("a\u{00F1}b".match(/a\u{F1}b/u)); // ["a\xF1b", index: 0, input: "a\xF1b", groups: undefined]

y (Sticky search) - (Introduced in ES6)

Meaning: This flag makes the regex match only from the lastIndex property of the regex object. It ensures that subsequent matches are "sticky" to the position where the previous match ended, or to the position specified by lastIndex.
Behavior:
- If a match is found, lastIndex is updated to the end of the match.
- If no match is found at lastIndex, the exec method returns null, and lastIndex is reset to 0.
Important: This flag is primarily useful with RegExp.prototype.exec() and RegExp.prototype.test(), where you control lastIndex yourself. String.prototype.match() with a sticky (non-global) regex also starts at lastIndex, but it is rarely used that way.

Example:

JavaScript
const str = "foo bar baz";
const regexY = /bar/y;

regexY.lastIndex = 4; // Set lastIndex to the start of "bar"
console.log(regexY.exec(str)); // ["bar", index: 4, input: "foo bar baz", groups: undefined]

console.log(regexY.lastIndex); // 7 (lastIndex is updated)

// Try to match again from the new lastIndex (7). The text at index 7 is " baz", not "bar", so it won't match.
console.log(regexY.exec(str)); // null
console.log(regexY.lastIndex); // 0 (reset to 0 because no match was found)

// Without 'y' (and without 'g'), lastIndex is ignored, so the regex still finds "bar"
const regexWithoutY = /bar/;
regexWithoutY.lastIndex = 8; // ignored: the search still starts at index 0
console.log(regexWithoutY.exec(str)); // ["bar", index: 4, input: "foo bar baz", groups: undefined]
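
A typical use of the sticky flag is reading tokens one after another with no gaps. The following is only a sketch under that assumption; tokenize and its tiny grammar (numbers and four operators) are hypothetical.

```javascript
// Hypothetical mini-tokenizer: splits "12+34*5" into number and operator tokens.
function tokenize(input) {
  const tokenPattern = /\d+|[+*\-\/]/y; // y: a match must start exactly at lastIndex
  const tokens = [];
  tokenPattern.lastIndex = 0;
  while (tokenPattern.lastIndex < input.length) {
    const start = tokenPattern.lastIndex; // remember it: a failed exec() resets lastIndex to 0
    const match = tokenPattern.exec(input);
    if (match === null) {
      // With y, a failed match means the text at `start` is not a valid token.
      throw new SyntaxError(`Unexpected character at index ${start}`);
    }
    tokens.push(match[0]);
  }
  return tokens;
}

console.log(tokenize("12+34*5")); // ["12", "+", "34", "*", "5"]
```

Without y, exec() would silently skip over characters it cannot match; with y, the gap is reported as an error.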

d (Has indices) - (Introduced in ES2022)

Meaning: This flag makes the exec() method return an array with an additional indices property. This indices property is an array of arrays, where each inner array contains the [start, end] indices for each captured group, including the full match itself (at index 0).

Example:

JavaScript
const str = "hello world";
const regex = /(hello) (world)/d;

const match = regex.exec(str);
console.log(match);
/*
[
  'hello world',
  'hello',
  'world',
  index: 0,
  input: 'hello world',
  groups: undefined,
  indices: [
    [ 0, 11 ], // Full match
    [ 0, 5 ],  // 'hello'
    [ 6, 11 ]  // 'world'
  ]
]
*/

if (match && match.indices) {
    console.log("Full match indices:", match.indices[0]);
    console.log("Group 1 indices:", match.indices[1]);
    console.log("Group 2 indices:", match.indices[2]);
}

v (Unicode sets) - (Introduced in ES2024)
- Meaning: This flag enables "set notation" in regular expressions, allowing you to perform operations like set intersection and difference (subtraction) on character classes, and to use Unicode properties of strings. It's an enhancement over the u flag, specifically for character classes, and it cannot be combined with u.
- Key features with v:
  - Unicode property escapes: \p{Property} and \P{Property} (e.g., \p{Script=Greek}, \p{Emoji}). These already work with u; v additionally supports properties of strings such as \p{RGI_Emoji}.
  - Set operations (an operand is a single character, an escape such as \d or \p{...}, or a nested class in [...]; a bare range such as a-z must be wrapped as [a-z]):
    - Intersection: [A&&B] (matches characters that are in A AND in B)
    - Difference: [A--B] (matches characters that are in A BUT NOT in B)
    - Symmetric difference ([A~~B]) is not supported in ES2024; the ~~ sequence is only reserved.
  - Nested character classes, e.g. [[a-z][0-9]].
- Example (demonstrating Unicode property escapes and set intersection with v):
  JavaScript
  // Example for a character from a specific Unicode script
  const greekChar = "α"; // Alpha
  const regexGreek = /\p{Script=Greek}/v;
  console.log(greekChar.match(regexGreek)); // ["α", index: 0, input: "α", groups: undefined]

  // Example for an emoji character
  const emojiChar = "😊";
  const regexEmoji = /\p{Emoji}/v;
  console.log(emojiChar.match(regexEmoji)); // ["😊", index: 0, input: "😊", groups: undefined]

  // Example demonstrating set intersection (find digits that are also hex characters)
  const hexDigit = "5";
  const regexHexDigit = /[[0-9]&&[0-9a-f]]/v; // Matches a digit that is also a hex char
  console.log(hexDigit.match(regexHexDigit)); // ["5", index: 0, input: "5", groups: undefined]
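
Set difference with the v flag can be sketched like this. The helper name makeConsonantRegex is made up, and the try/catch is an assumption about how to stay safe on runtimes that do not know the v flag yet.

```javascript
// Build the regex with the constructor so that engines without the v flag
// throw a catchable SyntaxError instead of failing to parse the whole file.
function makeConsonantRegex() {
  try {
    // [a-z] minus the vowels: "--" is the difference operator between nested classes.
    return new RegExp("[[a-z]--[aeiou]]", "gv");
  } catch (err) {
    return null; // v flag not supported in this runtime
  }
}

const consonants = makeConsonantRegex();
if (consonants !== null) {
  console.log("regex".match(consonants)); // ["r", "g", "x"]
}
```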

How to use flags:

1. Regex Literal:

JavaScript
const regex1 = /pattern/flags;
const regex2 = /hello/gi; // global and case-insensitive

2. RegExp Constructor:

JavaScript
const regex3 = new RegExp("pattern", "flags");
const regex4 = new RegExp("world", "im"); // case-insensitive and multiline

Summary Table of Regex Flags in JavaScript:

Flag	Name	Description	Introduced In
`g`	Global search	Finds all matches, not just the first.	ES3
`i`	Case-insensitive	Performs a case-insensitive match.	ES3
`m`	Multiline	`^` and `$` match start/end of lines, not just start/end of string.	ES3
`s`	DotAll	`.` matches any character, including newline characters.	ES2018
`u`	Unicode	Enables full Unicode support (e.g., astral plane characters, Unicode code point escapes).	ES6
`y`	Sticky	Matches only from the `lastIndex` of the regex object, requiring contiguous matches.	ES6
`d`	Has indices	Returns `indices` property in the match array, providing `[start, end]` for full match and groups.	ES2022
`v`	Unicode sets	Enables set notation (intersection, difference) and extended Unicode property escapes in character classes.	ES2024

4. Character Classes in Regex

Character classes (also known as character sets) allow you to define a set of characters, and the regex engine will match any one character from that set. They are denoted by square brackets [].

Defining sets of characters to match [abc]
- Meaning: This is the most basic form of a character class. It matches any single character that is literally listed inside the square brackets.
- Explanation: If you have [abc], the regex will match 'a', 'b', or 'c'. It will match only one of them at any given position in the string.
- Example:
  JavaScript
  const text = "apple banana cherry";
  const regex = /[abc]/g; // 'g' flag for global match
  console.log(text.match(regex)); // Output: ["a", "b", "a", "a", "a", "c"]
  In this example, it finds every instance of 'a', 'b', or 'c' in the string.
Ranges within character classes [a-z] or [0-9]
- Meaning: Instead of listing every character, you can specify a range of characters using a hyphen -. This is a shorthand for commonly used character sets.
- Explanation:
  - [a-z] matches any lowercase letter from 'a' to 'z'.
  - [A-Z] matches any uppercase letter from 'A' to 'Z'.
  - [0-9] matches any digit from '0' to '9'.
- Important Note: The range is based on the character's Unicode (or ASCII) value. For example, [A-z] would include some non-alphabetic characters between 'Z' and 'a' in the ASCII table. Stick to well-defined ranges like [a-z], [A-Z], [0-9].
- Example:
  JavaScript
  const text = "Item A: 123, Item B: 456";
  const lettersRegex = /[A-Z]/g;
  const digitsRegex = /[0-9]/g;
  console.log(text.match(lettersRegex)); // Output: ["I", "A", "I", "B"]
  console.log(text.match(digitsRegex)); // Output: ["1", "2", "3", "4", "5", "6"]
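
The [A-z] pitfall mentioned in the note above can be checked directly. A small sketch (the variable names are made up):

```javascript
const loose = /[A-z]/;     // also covers [ \ ] ^ _ ` (character codes 91 to 96)
const strict = /[A-Za-z]/; // letters only

console.log(loose.test("_"));  // true
console.log(loose.test("^"));  // true
console.log(strict.test("_")); // false
console.log(strict.test("^")); // false
console.log(strict.test("Q")); // true
```
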
Combined ranges within character classes [a-z0-9]
- Meaning: You can combine multiple characters and ranges within a single character class.
- Explanation: [a-z0-9] would match any lowercase letter OR any digit. The order of ranges and individual characters within the [] doesn't change what is matched, but a consistent order is good practice for readability.
- Example:
  JavaScript
  const text = "Version 1.0.0 Alpha_2";
  // Matches any lowercase letter, uppercase letter, or digit
  const alphanumericRegex = /[a-zA-Z0-9]/g;
  console.log(text.match(alphanumericRegex)); // Output: ["V", "e", "r", "s", "i", "o", "n", "1", "0", "0", "A", "l", "p", "h", "a", "2"]
Negated character classes [^abc]
- Meaning: When a caret ^ is the first character inside a character class [], it negates the class. It means "match any single character that is not in this set."
- Explanation: [^abc] matches any character except 'a', 'b', or 'c'.
- Important: If ^ is not the first character inside [], it's treated as a literal caret. E.g., [ab^c] matches 'a', 'b', 'c', or ^.
- Example:
  JavaScript
  const text = "Hello, World! 123";
  // Matches any character that is NOT a lowercase letter
  const notLowercaseRegex = /[^a-z]/g;
  console.log(text.match(notLowercaseRegex)); // Output: ["H", ",", " ", "W", "!", " ", "1", "2", "3"]

Additional Points on Character Classes:

Special Characters Inside []: Most metacharacters lose their special meaning when placed inside a character class, except for:
- ^ (only at the beginning, for negation)
- - (when used for a range; otherwise, it's literal)
- \ (for escaping)
- ] (closes the class; in JavaScript, escape it as \] to match it literally. Some other regex flavors treat ] as literal when it is the first character, but JavaScript does not)
- Example: [.+*?] would match a literal dot, plus, asterisk, or question mark.
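
A short sketch of how these characters behave inside a class in JavaScript (the sample strings are invented):

```javascript
const expr = "2+2*3? Maybe 12.5";
// Inside [], the dot, plus, asterisk and question mark are plain characters.
console.log(expr.match(/[.+*?]/g)); // ["+", "*", "?", "."]

// A hyphen is literal when it is placed first or last (or escaped).
console.log("a-b_c".match(/[-_]/g)); // ["-", "_"]

// A closing bracket is escaped to be matched literally.
console.log("arr[0]".match(/[[\]]/g)); // ["[", "]"]
```
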
Shorthand Character Classes: There are also pre-defined character classes that are very useful:
- \d: Matches any digit (same as [0-9])
- \D: Matches any non-digit (same as [^0-9])
- \w: Matches any "word" character (alphanumeric + underscore, same as [a-zA-Z0-9_])
- \W: Matches any non-word character (same as [^a-zA-Z0-9_])
- \s: Matches any whitespace character
- \S: Matches any non-whitespace character

Character classes are fundamental building blocks for creating robust and flexible regular expressions, allowing you to easily match sets of characters based on specific criteria.

5. Anchors

Anchors are metacharacters that do not match actual characters but rather positions within the string. They assert that a certain position exists.

^ (Caret):
- Matches the beginning of the input string.
- If the m (multiline) flag is used, it also matches the beginning of each line (immediately after a newline character \n or \r).
- Example: /^start/ matches "start" only if it appears at the very beginning of the string.
$ (Dollar Sign):
- Matches the end of the input string.
- If the m (multiline) flag is used, it also matches the end of each line (immediately before a newline character \n or \r).
- Example: /end$/ matches "end" only if it appears at the very end of the string.
\b (Word Boundary):
- Matches a position that is a "word boundary." This occurs at the transition between a word character (\w) and a non-word character (\W), or at the beginning/end of the string if it's followed/preceded by a word character.
- Example: /\bcat\b/ will match "cat" in "The cat sat." but not in "category" or "tomcat".
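\B (Non-Word Boundary):
- Matches a position that is not a word boundary. This means it matches within a word.
- Example: /\Bcat\B/ would match "cat" within "concatenate" but not "cat" as a standalone word. It does not match in "wildcat" either, because the end of the string after "t" is a word boundary.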
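
The four anchors can be tried out with test(). This is just a quick sketch with made-up sample strings:

```javascript
// ^ and $: start and end of the string
console.log(/^start/.test("start here"));    // true
console.log(/^start/.test("a fresh start")); // false
console.log(/end$/.test("the end"));         // true
console.log(/end$/.test("endless"));         // false

// \b: "cat" as a whole word only
console.log(/\bcat\b/.test("The cat sat.")); // true
console.log(/\bcat\b/.test("category"));     // false
console.log(/\bcat\b/.test("tomcat"));       // false

// \B: "cat" with word characters on both sides
console.log(/\Bcat\B/.test("concatenate"));  // true
console.log(/\Bcat\B/.test("cat"));          // false
```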

6. Shorthand Character Classes

Shorthand character classes are special escape sequences that provide a concise way to match commonly used sets of characters. They are essentially equivalent to explicitly defined character sets using [].

\d (Digit):
- Matches any digit character (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).
- Equivalent to [0-9].
- Example: /\d+/ matches one or more consecutive digits (e.g., "123", "45").
\D (Non-Digit):
- Matches any character that is NOT a digit.
- Equivalent to [^0-9].
- Example: /\D/ matches any non-digit character (e.g., "a", "!", " ").
\w (Word Character):
- Matches any alphanumeric character (a-z, A-Z, 0-9) or an underscore (_).
- Equivalent to [A-Za-z0-9_].
- Example: /\w+/ matches a "word" (e.g., "hello", "user_name", "ID123").
\W (Non-Word Character):
- Matches any character that is NOT a word character.
- Equivalent to [^A-Za-z0-9_].
- Example: /\W/ matches spaces, punctuation, symbols (e.g., " ", "!", "@").
\s (Whitespace Character):
- Matches any whitespace character, including space, tab (\t), newline (\n), carriage return (\r), form feed (\f), and vertical tab (\v).
- Example: /\s+/ matches one or more consecutive whitespace characters.
\S (Non-Whitespace Character):
- Matches any character that is NOT a whitespace character.
- Example: /\S/ matches any non-whitespace character.
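
To see the shorthand classes side by side, here is a small sketch that runs them against one invented log entry:

```javascript
const entry = "user_42   logged in at 09:15";

console.log(entry.match(/\d+/g));        // ["42", "09", "15"]
console.log(entry.match(/\w+/g));        // ["user_42", "logged", "in", "at", "09", "15"]
console.log(entry.split(/\s+/));         // ["user_42", "logged", "in", "at", "09:15"]
console.log(entry.replace(/\D/g, ""));   // "420915"
console.log(entry.match(/\S+/g).length); // 5
```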

These two categories, Anchors and Shorthand Character Classes, are fundamental tools for building powerful and efficient regular expressions to locate and manipulate text patterns.

thunder_coding