RegEx (Regular Expressions) NOTES
A regular expression (RegEx) is a sequence of characters, arranged according to specific rules, that defines a search pattern. It helps us find specific patterns in a string.
Suppose we have a big string and we want to find the numbers (or some other pattern) in it; we can use RegEx to do that.
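As a small illustration of that idea, the sketch below pulls every run of digits out of a longer string. The sample text and variable names are made up for this example; \d+ and the g flag are explained later in these notes.

```javascript
// Sample text invented for this illustration.
const text = "Order 66 shipped on day 12 with 3 items";

// \d+ matches one or more digits; the g flag asks for every match.
const numbers = text.match(/\d+/g);

console.log(numbers); // ["66", "12", "3"]
```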
---------------------------------------------------- IMPORTANT ---------------------------------------------------
1. You can use RegEx in VS Code: press CTRL + F (the Find shortcut) and you will see three buttons in the search box. Select the third one (Use Regular Expression); then you can type any regex you want in that box.
2. Note: When you use regex in VS Code through CTRL + F, you don't need to write the delimiters; the search box treats whatever you type as the pattern itself.
1. Delimiters
A delimiter marks the limits or boundaries of something.
For example: an HTML comment has boundaries like <!-- comment -->, and a div runs from its opening tag to its closing tag, like <div> boundaries </div>.
The concept of a "delimiter" for a regular expression (regex) itself isn't universal across all programming languages, as some languages integrate regex directly into string methods or dedicated regex objects without explicit delimiters. However, in many languages, especially those influenced by Perl, you'll find clear delimiters used to define the regex pattern.
Languages with Explicit Delimiters:
- Perl: This is where the concept of regex delimiters is most prominent.
  - Common: Forward slashes /pattern/ are the most common.
  - Alternatives: You can use almost any non-alphanumeric, non-backslash, non-whitespace character as a delimiter. This is useful if your pattern contains forward slashes and you want to avoid escaping them. Examples include m{pattern}, m(pattern), m#pattern#, m~pattern~, etc. The same applies to substitution (e.g., s/pattern/replacement/) and transliteration (e.g., tr/searchlist/replacementlist/).
- PHP: PHP's PCRE (Perl Compatible Regular Expressions) functions require delimiters.
  - Common: Forward slashes /pattern/ are widely used.
  - Alternatives: Similar to Perl, you can use other non-alphanumeric, non-backslash, non-whitespace characters like #, ~, %, {}, (), [], <>.
- JavaScript: When using regular expression literals, delimiters are required.
  - Common: Forward slashes /pattern/ are the standard way to define a regex literal.
  - No other literal delimiters: Unlike Perl or PHP, you cannot use other characters as literal delimiters. If you need to define a regex dynamically (e.g., from a string variable), you use the RegExp constructor, where the pattern is a string: new RegExp("pattern").
- Ruby: Ruby's regex literals typically use slashes.
  - Common: Forward slashes /pattern/ are the most common.
  - Alternatives: You can also use %r{pattern} (or %r with (), [], <>) to define a regex, which can be useful if your pattern contains slashes.
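To make the JavaScript case concrete, here is a minimal sketch comparing a regex literal with the RegExp constructor when the pattern itself contains slashes. The URL and variable names are invented for the example.

```javascript
// In a literal, "/" is the delimiter, so each slash in the pattern is escaped.
const literal = /https:\/\/example\.com\/docs/;

// The constructor takes a plain string, so "/" needs no escaping,
// but the backslash before the dot is doubled inside the string literal.
const constructed = new RegExp("https://example\\.com/docs");

// A pattern assembled at runtime from a variable (only possible with the constructor).
const keyword = "docs";
const dynamic = new RegExp("/" + keyword);

const url = "See https://example.com/docs for details";
console.log(literal.test(url));     // true
console.log(constructed.test(url)); // true
console.log(dynamic.test(url));     // true
```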
Languages without Explicit Delimiters (or where they are implied by method calls):
- Python: Python's re module doesn't use explicit delimiters like slashes in its function calls. You pass the regex pattern as a string.
  - Example: re.search('pattern', 'string') or re.split(r'pattern', 'string').
  - Raw strings: It's common practice to use raw strings (prefixed with r, e.g., r'\d+') for regex patterns in Python to avoid issues with backslash escaping.
- Java: Java's java.util.regex package also uses strings to define regex patterns.
  - Example: Pattern.compile("pattern") or string.split("pattern").
  - Double backslashes: Because string literals in Java interpret backslashes, you often need to double escape them in regex patterns (e.g., "\\d+" to match one or more digits).
- C# (.NET): The System.Text.RegularExpressions namespace in C# takes regex patterns as strings.
  - Example: Regex.Match("string", "pattern") or Regex.Split("string", "pattern").
  - Verbatim string literals: C# has verbatim string literals (prefixed with @, e.g., @"pattern") that can be helpful for regex to avoid double escaping backslashes, similar to Python's raw strings.
- Go: Go's regexp package uses string patterns.
  - Example: regexp.Compile("pattern") or regexp.MustCompile("pattern").Split(text, -1).
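The same backslash issue shows up in JavaScript whenever a pattern is written as a string for the RegExp constructor. The sketch below (sample string and names are invented) shows the doubled backslash, the String.raw alternative, and what happens when the doubling is forgotten.

```javascript
// Inside an ordinary string literal "\d" collapses to "d",
// so the backslash must be doubled to reach the regex engine.
const doubled = new RegExp("\\d+");

// String.raw keeps backslashes as written, much like Python's r'...'
// or C#'s @"..." strings.
const raw = new RegExp(String.raw`\d+`);

// Forgetting to double the backslash silently produces the pattern "d+".
const forgotten = new RegExp("\d+");

const sample = "room 404";
console.log(sample.match(doubled)[0]);      // "404"
console.log(sample.match(raw)[0]);          // "404"
console.log(doubled.source === raw.source); // true
console.log(forgotten.source);              // "d+"
console.log(forgotten.test(sample));        // false
```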
2. Literals (or Literal Characters)
Literal characters are characters that match themselves directly. If you put a character in your regex that isn't a metacharacter, it will be treated as a literal.
Examples:
- /a/ will match the character "a".
- /hello/ will match the exact string "hello".
- /123/ will match the exact string "123".
- /-_=/ will match the exact string "-_=".
Escaping Metacharacters:
If you want to match a metacharacter literally, you need to "escape" it by placing a backslash (\) before it. This tells the regex engine to treat the special character as a regular character.
Example:
- If you want to match a literal dot (.), you'd use /\./. (Otherwise, . is a metacharacter that matches any character except newline.)
- If you want to match a literal asterisk (*), you'd use /\*/. (Otherwise, * is a metacharacter for zero or more repetitions.)
- If you want to match a literal backslash (\), you'd use /\\/.
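The difference between an escaped and an unescaped metacharacter is easy to see in a quick test. In the sketch below the sample strings are invented, and escapeRegExp is a hypothetical helper written for this example (not a built-in) that escapes every metacharacter in user-supplied text.

```javascript
const price = "Total: 3.50 (was 3x50)";

// Unescaped dot: "." matches any character, so "3x50" also matches.
console.log(price.match(/3.50/g));  // ["3.50", "3x50"]

// Escaped dot: only the literal "3.50" matches.
console.log(price.match(/3\.50/g)); // ["3.50"]

// Hypothetical helper: put a backslash in front of every metacharacter
// so arbitrary text can be used as a literal pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const userInput = "1+1=2?";
const unsafe = new RegExp(userInput);             // "+" and "?" act as quantifiers
const safe = new RegExp(escapeRegExp(userInput)); // every character is literal

console.log(unsafe.test("Is 1+1=2? Yes")); // false
console.log(safe.test("Is 1+1=2? Yes"));   // true
```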
3. Regex Flags in JavaScript
- g (Global search)
  - Meaning: This flag ensures that the regular expression will search for all matches in the string, rather than stopping after the first match.
  - Behavior without g: If g is not used, methods like String.prototype.match() return only the first match, and RegExp.prototype.exec() returns that same first match on every call, because lastIndex is not advanced. With g, each exec() call on the same regex object resumes from lastIndex and finds the next match.
  - Example (JavaScript):
    const str = "apple banana apple orange";
    const regexWithoutG = /apple/;
    const regexWithG = /apple/g;
    console.log(str.match(regexWithoutG)); // ["apple", index: 0, input: "apple banana apple orange", groups: undefined]
    console.log(str.match(regexWithG)); // ["apple", "apple"]
    let match;
    while ((match = regexWithG.exec(str)) !== null) {
      console.log(`Found ${match[0]} at index ${match.index}`);
    }
    // Output:
    // Found apple at index 0
    // Found apple at index 13
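As an alternative to the exec() loop, String.prototype.matchAll() returns an iterator of full match objects. A minimal sketch, reusing the same sample sentence under a different variable name:

```javascript
const fruit = "apple banana apple orange";

// matchAll yields one match object per occurrence (each with its index),
// so there is no need to manage lastIndex by hand.
const positions = [...fruit.matchAll(/apple/g)].map((m) => m.index);
console.log(positions); // [0, 13]

// matchAll insists on the g flag: a non-global regex throws a TypeError.
let threw = false;
try {
  fruit.matchAll(/apple/);
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true
```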
- i (Case-insensitive search)
  - Meaning: This flag makes the regular expression perform a case-insensitive match. It ignores the difference between uppercase and lowercase letters.
  - Example (JavaScript):
    const str = "Hello World";
    const regexWithoutI = /hello/;
    const regexWithI = /hello/i;
    console.log(str.match(regexWithoutI)); // null
    console.log(str.match(regexWithI)); // ["Hello", index: 0, input: "Hello World", groups: undefined]
- m (Multiline search)
  - Meaning: This flag changes the behavior of the ^ (start of string) and $ (end of string) anchors.
    - Without m: ^ matches only the very beginning of the entire input string, and $ matches only the very end of the entire input string.
    - With m: ^ matches the beginning of the entire input string and the beginning of each line (after a newline character \n or \r). Similarly, $ matches the end of the entire input string and the end of each line.
  - Example (JavaScript):
    const str = "Line 1\nLine 2\nLine 3";
    const regexWithoutM = /^Line/g; // Note: 'g' is used here to find all occurrences within the string
    const regexWithM = /^Line/gm;
    console.log(str.match(regexWithoutM)); // ["Line"]
    console.log(str.match(regexWithM)); // ["Line", "Line", "Line"]
    const regexEndWithoutM = /2$/g;
    const regexEndWithM = /2$/gm;
    console.log(str.match(regexEndWithoutM)); // null ("2" is not at the end of the whole string)
    console.log(str.match(regexEndWithM)); // ["2"] ("2" is at the end of the second line)
- s (DotAll mode) - (Introduced in ES2018)
  - Meaning: This flag changes the behavior of the . (dot) special character.
    - Without s: The . matches any character except newline characters (\n, \r, \u2028, \u2029).
    - With s: The . matches any character, including newline characters.
  - Example (JavaScript):
    const str = "First line\nSecond line";
    const regexWithoutS = /line.Second/;
    const regexWithS = /line.Second/s;
    console.log(str.match(regexWithoutS)); // null
    console.log(str.match(regexWithS)); // ["line\nSecond", index: 6, input: "First line\nSecond line", groups: undefined]
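In code that cannot rely on the s flag, a common workaround is the class [\s\S], which matches any character at all. A short sketch (variable names are invented):

```javascript
const twoLines = "First line\nSecond line";

// [\s\S] means "whitespace or non-whitespace", i.e. any character,
// so it crosses the line break even without the s flag.
const anyChar = /line[\s\S]Second/;
const crossed = twoLines.match(anyChar);
console.log(crossed.index); // 6

// The dotAll property reports whether the s flag is set.
console.log(/line.Second/s.dotAll); // true
console.log(anyChar.dotAll);        // false
```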
- u (Unicode support) - (Introduced in ES6)
  - Meaning: This flag enables full Unicode support for the regular expression. It's crucial when working with Unicode characters beyond the basic Latin set (e.g., emojis, characters from different languages).
  - Key impacts of u:
    - Unicode code point escapes: Allows you to use \u{xxxx} in the pattern, including for code points greater than 0xFFFF.
    - Proper handling of astral plane characters: Characters like emojis (which occupy two JavaScript "characters" because they're represented by surrogate pairs) are treated as single characters.
    - Unicode property escapes (\p{...}): Available with the u flag since ES2018. The v flag (see below) is not required for them; it only extends what they can do.
  - Example (JavaScript):
    // Example with astral plane character
    const emoji = "👍"; // Unicode code point U+1F44D
    const regexWithoutU = /./; // Matches one JavaScript "character" (UTF-16 code unit)
    const regexWithU = /./u; // Matches one Unicode code point
    console.log(emoji.match(regexWithoutU)[0].length); // 1 (only the first surrogate is matched)
    console.log(emoji.match(regexWithU)[0].length); // 2 (the whole emoji, both surrogates, is matched as one code point)
    console.log("Match of regexWithoutU: ", emoji.match(regexWithoutU)); // ["\uD83D", index: 0, input: "👍", groups: undefined] (a lone surrogate, often displayed as a replacement character)
    console.log("Match of regexWithU: ", emoji.match(regexWithU)); // ["👍", index: 0, input: "👍", groups: undefined]
    // Example with Unicode code point escape
    console.log("a\u{00F1}b".match(/a\u{F1}b/u)); // ["añb", index: 0, input: "añb", groups: undefined]
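A short sketch of the two effects described above: a Unicode property escape used with only the u flag, and the difference between counting code units and code points. The sample strings are invented.

```javascript
// \p{...} is recognized as a property escape because the u flag is set.
const greek = /\p{Script=Greek}+/u;
console.log("alpha = αβγ".match(greek)[0]); // "αβγ"

// Each emoji is one code point but two UTF-16 code units.
const thumbs = "👍👍";
console.log(thumbs.length);               // 4
console.log(thumbs.match(/./g).length);   // 4 (one match per code unit)
console.log(thumbs.match(/./gu).length);  // 2 (one match per code point)
```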
- y (Sticky search) - (Introduced in ES6)
  - Meaning: This flag makes the regex match only at the position given by the lastIndex property of the regex object. It ensures that subsequent matches are "sticky" to the position where the previous match ended, or to the position specified by lastIndex.
  - Behavior:
    - If a match is found, lastIndex is updated to the end of the match.
    - If no match is found at lastIndex, the exec method returns null, and lastIndex is reset to 0.
  - Important: This flag is primarily useful with RegExp.prototype.exec() (or test()), where you set and read lastIndex yourself. String.prototype.match() with a sticky, non-global regex also starts at lastIndex, which is easy to overlook.
  - Example (JavaScript):
    const str = "foo bar baz";
    const regexY = /bar/y;
    regexY.lastIndex = 4; // Set lastIndex to the start of "bar"
    console.log(regexY.exec(str)); // ["bar", index: 4, input: "foo bar baz", groups: undefined]
    console.log(regexY.lastIndex); // 7 (lastIndex is updated)
    // Try to match again from the new lastIndex (7). The character there is a space, not the start of "bar", so it won't match.
    console.log(regexY.exec(str)); // null
    console.log(regexY.lastIndex); // 0 (reset to 0 because no match was found)
    // Without 'y', the regex scans forward and still finds "bar" wherever it is
    const regexWithoutY = /bar/;
    regexWithoutY.lastIndex = 0;
    console.log(regexWithoutY.exec(str)); // ["bar", index: 4, input: "foo bar baz", groups: undefined]
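A typical use of sticky matching is a tokenizer that must consume the input position by position. The following is only a sketch for a toy arithmetic language; the token names, the tokenPatterns table, and the tokenize function are all invented for this example.

```javascript
// Hypothetical token patterns; each one is sticky so it can only match
// exactly at the current position.
const tokenPatterns = [
  ["number", /\d+/y],
  ["operator", /[+\-*\/]/y],
  ["space", /\s+/y],
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, pattern] of tokenPatterns) {
      pattern.lastIndex = pos;       // the match must start exactly here
      const m = pattern.exec(input);
      if (m) {
        if (type !== "space") tokens.push({ type, value: m[0] });
        pos = pattern.lastIndex;     // continue right after the match
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError(`Unexpected character at index ${pos}`);
  }
  return tokens;
}

const tokens = tokenize("12 + 34*5");
console.log(tokens.map((t) => t.value)); // ["12", "+", "34", "*", "5"]
```

Without the y flag each pattern would scan ahead and could skip over characters that no pattern recognizes; with it, an unrecognized character is reported at its exact position.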
- d (Has indices) - (Introduced in ES2022)
  - Meaning: This flag makes the exec() method return an array with an additional indices property. This indices property is an array of arrays, where each inner array contains the [start, end] indices for each captured group, including the full match itself (at index 0).
  - Example (JavaScript):
    const str = "hello world";
    const regex = /(hello) (world)/d;
    const match = regex.exec(str);
    console.log(match);
    /*
    [
      'hello world',
      'hello',
      'world',
      index: 0,
      input: 'hello world',
      groups: undefined,
      indices: [
        [ 0, 11 ], // Full match
        [ 0, 5 ],  // 'hello'
        [ 6, 11 ]  // 'world'
      ]
    ]
    */
    if (match && match.indices) {
      console.log("Full match indices:", match.indices[0]);
      console.log("Group 1 indices:", match.indices[1]);
      console.log("Group 2 indices:", match.indices[2]);
    }
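The d flag also works with named groups: the indices array then carries a groups object keyed by group name. A brief sketch with an invented sample string:

```javascript
const dateText = "Released on 2024-05-17";
const dateRegex = /(?<year>\d{4})-(?<month>\d{2})/d;
const dateMatch = dateRegex.exec(dateText);

// indices.groups mirrors match.groups, but holds [start, end] pairs.
const [start, end] = dateMatch.indices.groups.year;
console.log(start, end);                 // 12 16
console.log(dateText.slice(start, end)); // "2024"

// The hasIndices property reports whether the d flag is set.
console.log(dateRegex.hasIndices);       // true
```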
- v (Unicode sets) - (Introduced in ES2024)
  - Meaning: This flag enables "set notation" in character classes, allowing operations like set intersection and difference, nested classes, and Unicode properties of strings. It's an enhanced replacement for the u flag, specifically for character classes; the two flags cannot be combined in one regex.
  - Key features with v:
    - Unicode property escapes: \p{Property} and \P{Property} (e.g., \p{Script=Greek}, \p{Emoji}). These already work with u; v additionally supports properties of strings such as \p{RGI_Emoji}.
    - Set operations:
      - Intersection: [A&&B] (matches characters that are in A AND in B)
      - Difference: [A--B] (matches characters that are in A BUT NOT in B)
      - There is no symmetric-difference operator. Each operand must be a single character, a nested class such as [a-z], or an escape such as \d or \p{...}; a bare range like 0-9 cannot be an operand.
    - Nested character classes, e.g., [[a-z][0-9]].
  - Example (demonstrating Unicode property escapes and set intersection with v), JavaScript:
    // Example for a character from a specific Unicode script
    const greekChar = "α"; // Alpha
    const regexGreek = /\p{Script=Greek}/v;
    console.log(greekChar.match(regexGreek)); // ["α", index: 0, input: "α", groups: undefined]
    // Example for an emoji character
    const emojiChar = "😊";
    const regexEmoji = /\p{Emoji}/v;
    console.log(emojiChar.match(regexEmoji)); // ["😊", index: 0, input: "😊", groups: undefined]
    // Example demonstrating set intersection (hex digits that are also decimal digits)
    const hexDigit = "5";
    const regexHexDigit = /[[0-9a-f]&&[0-9]]/v;
    console.log(hexDigit.match(regexHexDigit)); // ["5", index: 0, input: "5", groups: undefined]
How to use flags:
1. Regex Literal:
const regex1 = /pattern/flags;
const regex2 = /hello/gi; // global and case-insensitive
2. RegExp Constructor:
const regex3 = new RegExp("pattern", "flags");
const regex4 = new RegExp("world", "im"); // case-insensitive and multiline
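Summary Table of Regex Flags in JavaScript:
Flag | Name | Effect
g | Global search | Finds all matches instead of stopping after the first
i | Case-insensitive search | Ignores the difference between uppercase and lowercase letters
m | Multiline search | ^ and $ also match at the start and end of each line
s | DotAll mode (ES2018) | . also matches newline characters
u | Unicode support (ES6) | Treats the pattern and input as Unicode code points
y | Sticky search (ES6) | Matches only at the position given by lastIndex
d | Has indices (ES2022) | Adds [start, end] indices for the match and each group
v | Unicode sets (ES2024) | Enables set notation in character classes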
4. Character Classes in Regex
Character classes (also known as character sets) allow you to define a set of characters, and the regex engine will match any one character from that set. They are denoted by square brackets [].
- Defining sets of characters to match [abc]
  - Meaning: This is the most basic form of a character class. It matches any single character that is literally listed inside the square brackets.
  - Explanation: If you have [abc], the regex will match 'a', 'b', or 'c'. It will match only one of them at any given position in the string.
  - Example (JavaScript):
    const text = "apple banana cherry";
    const regex = /[abc]/g; // 'g' flag for global match
    console.log(text.match(regex)); // Output: ["a", "b", "a", "a", "a", "c"]
    In this example, it finds every instance of 'a', 'b', or 'c' in the string.
- Ranges within character classes [a-z] or [0-9]
  - Meaning: Instead of listing every character, you can specify a range of characters using a hyphen -. This is a shorthand for commonly used character sets.
  - Explanation:
    - [a-z] matches any lowercase letter from 'a' to 'z'.
    - [A-Z] matches any uppercase letter from 'A' to 'Z'.
    - [0-9] matches any digit from '0' to '9'.
  - Important Note: The range is based on the character's Unicode (or ASCII) value. For example, [A-z] would also include the six non-alphabetic characters that sit between 'Z' and 'a' in the ASCII table: [ \ ] ^ _ and `. Stick to well-defined ranges like [a-z], [A-Z], [0-9].
  - Example (JavaScript):
    const text = "Item A: 123, Item B: 456";
    const lettersRegex = /[A-Z]/g;
    const digitsRegex = /[0-9]/g;
    console.log(text.match(lettersRegex)); // Output: ["I", "A", "I", "B"]
    console.log(text.match(digitsRegex)); // Output: ["1", "2", "3", "4", "5", "6"]
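The [A-z] pitfall from the note above is easy to verify. In this sketch the probe string (an invented name) contains every ASCII character from 'Z' through 'a'.

```javascript
// "Z", the six punctuation characters that follow it in ASCII, then "a".
const probe = "Z[\\]^_`a";

// [A-z] spans code points 65 to 122, so all eight characters match.
console.log(probe.match(/[A-z]/g).length); // 8

// Two explicit ranges match only the letters.
console.log(probe.match(/[A-Za-z]/g));     // ["Z", "a"]
```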
- Combined ranges within character classes [a-z0-9]
  - Meaning: You can combine multiple characters and ranges within a single character class.
  - Explanation: [a-z0-9] would match any lowercase letter OR any digit. The order of ranges and individual characters within the [] generally doesn't matter, but a consistent order is good practice for readability.
  - Example (JavaScript):
    const text = "Version 1.0.0 Alpha_2";
    // Matches any lowercase letter, uppercase letter, or digit
    const alphanumericRegex = /[a-zA-Z0-9]/g;
    console.log(text.match(alphanumericRegex)); // Output: ["V", "e", "r", "s", "i", "o", "n", "1", "0", "0", "A", "l", "p", "h", "a", "2"]
- Negated character classes [^abc]
  - Meaning: When a caret ^ is the first character inside a character class [], it negates the class. It means "match any single character that is not in this set."
  - Explanation: [^abc] matches any character except 'a', 'b', or 'c'.
  - Important: If ^ is not the first character inside [], it's treated as a literal caret. E.g., [ab^c] matches 'a', 'b', 'c', or ^.
  - Example (JavaScript):
    const text = "Hello, World! 123";
    // Matches any character that is NOT a lowercase letter
    const notLowercaseRegex = /[^a-z]/g;
    console.log(text.match(notLowercaseRegex)); // Output: ["H", ",", " ", "W", "!", " ", "1", "2", "3"]
Additional Points on Character Classes:
- Special Characters Inside []: Most metacharacters lose their special meaning when placed inside a character class, except for:
  - ^ (only at the beginning, for negation)
  - - (when used between two characters to form a range; at the start or end of the class, it's literal)
  - \ (for escaping)
  - ] (closes the class; in JavaScript, write \] to match a literal closing bracket. Some other engines, such as PCRE and POSIX, treat ] as literal when it is the first character.)
  - Example: [.+*?] would match a literal dot, plus, asterisk, or question mark.
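These rules can be checked directly. The sketch below uses invented sample strings and assumes a regex without the v flag (which adds further escaping requirements inside classes).

```javascript
// Inside [], the dot, plus, asterisk and question mark are literal.
console.log("a.b+c*d?".match(/[.+*?]/g)); // [".", "+", "*", "?"]

// A hyphen at the end of the class is literal; between two characters it forms a range.
console.log("a-c b".match(/[ac-]/g)); // ["a", "-", "c"]
console.log("a-c b".match(/[a-c]/g)); // ["a", "c", "b"]

// Literal square brackets are escaped with a backslash.
console.log("x[1]".match(/[\[\]]/g)); // ["[", "]"]
```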
- Shorthand Character Classes: There are also pre-defined character classes that are very useful:
  - \d: Matches any digit (same as [0-9])
  - \D: Matches any non-digit (same as [^0-9])
  - \w: Matches any "word" character (alphanumeric + underscore, same as [a-zA-Z0-9_])
  - \W: Matches any non-word character (same as [^a-zA-Z0-9_])
  - \s: Matches any whitespace character (space, tab, newline, and so on)
  - \S: Matches any non-whitespace character
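A quick sketch of the shorthand classes at work on one invented log line:

```javascript
const line = "user_42 logged in\tat 09:15";

console.log(line.match(/\d+/g));       // ["42", "09", "15"]
console.log(line.match(/\w+/g));       // ["user_42", "logged", "in", "at", "09", "15"]
console.log(line.split(/\s+/));        // ["user_42", "logged", "in", "at", "09:15"]
console.log(line.match(/\W/g).length); // 5 (three spaces, one tab, one colon)
```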
Character classes are fundamental building blocks for creating robust and flexible regular expressions, allowing you to easily match sets of characters based on specific criteria.
5. Anchors