I need help improving the performance (speed of execution) of the following regex in ECMAScript (JavaScript on Node.js 20):
/[\u0000-\u001f\u0022\u005c\ud800-\udfff]|[\ud800-\udbff](?![\udc00-\udfff])|(?:[^\ud800-\udbff]|^)[\udc00-\udfff]/
This regular expression is designed to match certain Unicode characters in a string. Let’s break it down:
[\u0000-\u001f\u0022\u005c\ud800-\udfff]: This part matches any character in the range \u0000 to \u001f (control characters), the characters \u0022 (quotation mark ") and \u005c (backslash), and any character in the range \ud800 to \udfff. The range \ud800-\udfff covers the surrogate code units, which UTF-16 uses in pairs to encode characters outside of the Basic Multilingual Plane (BMP).
|: OR operator.
[\ud800-\udbff](?![\udc00-\udfff]): This part matches the first part of a surrogate pair (\ud800 to \udbff) but only if it is not followed by the second part of a surrogate pair (\udc00 to \udfff). This catches a high surrogate that is not part of a valid surrogate pair.
|: Again, an OR operator.
(?:[^\ud800-\udbff]|^)[\udc00-\udfff]: This part matches the second part of a surrogate pair (\udc00 to \udfff) but only if it is preceded by a character that is not the first part of a surrogate pair, or if it is at the beginning of the string. The ^ inside the square brackets [^...] denotes negation, meaning any character other than the ones specified. The ^ outside the brackets anchors the match to the start of the string.
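To make the breakdown concrete, here is a small sketch that runs the pattern against a few inputs. The sample strings and labels are my own and purely illustrative. Since the regex has no u flag, it works on UTF-16 code units rather than code points.

```javascript
// Same pattern as above, written with explicit \u escapes.
const REGEX = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]|[\ud800-\udbff](?![\udc00-\udfff])|(?:[^\ud800-\udbff]|^)[\udc00-\udfff]/

// Illustrative inputs only; not taken from a real workload.
const samples = [
  ['plain ASCII', 'hello world'],
  ['quotation mark', 'say "hi"'],
  ['backslash', 'C:\\temp'],
  ['control character', 'line1\nline2'],
  ['lone high surrogate', 'a\ud83db'],
  ['lone low surrogate', '\ude00abc'],
  ['well-formed pair', '\ud83d\ude00'],
]

// The regex has no g or y flag, so test() keeps no state between calls.
const results = samples.map(([label, input]) => [label, REGEX.test(input)])
for (const [label, matched] of results) {
  console.log(label.padEnd(20), matched)
}
```

Only the plain ASCII sample is reported as not matching. The well-formed pair matches too, because the first character class already contains the whole \ud800-\udfff range.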
In summary, this regular expression detects the characters that have to be escaped in a JSON string (control characters, the quotation mark and the backslash) together with UTF-16 surrogate code units.
Speed test over a string of about 10,000 chars:
const REGEX = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]|[\ud800-\udbff](?![\udc00-\udfff])|(?:[^\ud800-\udbff]|^)[\udc00-\udfff]/
const start = performance.now()
console.log(REGEX.test('A'.repeat(10000) + '"')) // true
console.log(performance.now() - start)
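A single timed run also includes the cost of console.log and any JIT warm-up, so the number is noisy. Below is a rough sketch of a repeated measurement. The timeIt helper and its iteration counts are my own assumptions, not an established benchmark harness, and the JSON.stringify() call is only there as a baseline for comparison.

```javascript
const REGEX = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]|[\ud800-\udbff](?![\udc00-\udfff])|(?:[^\ud800-\udbff]|^)[\udc00-\udfff]/

// Hypothetical helper: returns the average time per call in milliseconds.
function timeIt(fn, iterations = 1000) {
  // Warm-up runs so the engine has a chance to optimise before measuring.
  for (let i = 0; i < 100; i++) fn()
  const start = performance.now()
  for (let i = 0; i < iterations; i++) fn()
  return (performance.now() - start) / iterations
}

// Worst case for the regex: the only match is the very last character.
const input = 'A'.repeat(10000) + '"'

const regexMs = timeIt(() => REGEX.test(input))
const stringifyMs = timeIt(() => JSON.stringify(input))

console.log(`REGEX.test:     ${regexMs.toFixed(4)} ms per call`)
console.log(`JSON.stringify: ${stringifyMs.toFixed(4)} ms per call`)
```

The absolute numbers depend on the machine and Node.js version, so only the ratio between the two lines is really meaningful.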
Are there improvements that can be made?
The purpose is to have a very fast check for strings that do not need escaping while not being much slower than JSON.stringify() in case the input requires escaping.
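For context, this is roughly how I intend to use the check. It is a minimal sketch, and the names NEEDS_ESCAPE and quoteString are placeholders of mine, not part of any library.

```javascript
const NEEDS_ESCAPE = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]|[\ud800-\udbff](?![\udc00-\udfff])|(?:[^\ud800-\udbff]|^)[\udc00-\udfff]/

// Hypothetical helper: produce the JSON representation of a string.
function quoteString(str) {
  // Fast path: nothing needs escaping, so wrapping in quotes is enough.
  if (!NEEDS_ESCAPE.test(str)) return '"' + str + '"'
  // Slow path: let the built-in do the escaping.
  return JSON.stringify(str)
}

console.log(quoteString('hello'))
console.log(quoteString('say "hi"'))
```

The fast path is only worth it if the regex test is cheap on clean input, and the slow path pays for both the test and JSON.stringify(), which is why the speed of the regex matters in both cases.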