It seems that when using the \b word boundary in a regex in JavaScript (via the “RegExp” class), the logic of the boundary gets “inverted” if the word to match starts with an umlaut, even though the unicode flag (u) is set.
E.g., I am using the following regex to match the word “äm” (and, in a next step, to replace it in my code):
/(\b)(äm)(\W)/gmu
The regex matches the word “Schäm”, though it should not, as “äm” is just a part of that word. Furthermore, it does not match occurrences of the exact word “äm”.
However, regexes of the above kind that try to match words not starting with an umlaut work as expected, e.g. the following correctly matches “mä” as a single word and won’t match words like “Schmäh”:
/(\b)(mä)(\W)/gmu
The odd thing is that if I use the first regex with the negated version of \b, which is \B, the regex works. This leads me to think that umlauts following the word boundary token \b invert its logic in regexes used in ECMAScript. When I switch to the PCRE flavor, this is not the case.
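To narrow this down, here is a small check that can be run in Node.js or a browser console. It is only a sketch and rests on an assumption: that ECMAScript defines \b in terms of \w, and that \w stays limited to ASCII letters, digits and the underscore even with the u flag. The helper names isWordChar and boundaryPositions are made up for this example.

```javascript
// Is a single character a "word character" for \w (and therefore \b)
// when the u flag is set?
const isWordChar = (ch) => /^\w$/u.test(ch);

// Collect every index at which \b matches in the given text.
// matchAll advances past zero-length matches by itself.
function boundaryPositions(text) {
  return [...text.matchAll(/\b/gu)].map((m) => m.index);
}

console.log(isWordChar('a')); // true
console.log(isWordChar('ä')); // false

// "Schäm": a boundary is reported between "h" and "ä" (index 3).
console.log(boundaryPositions('Schäm')); // [ 0, 3, 4, 5 ]

// " äm ": no boundary is reported between the space and "ä" (index 1).
console.log(boundaryPositions(' äm ')); // [ 2, 3 ]
```

If the assumption holds, “ä” is treated like punctuation, so \b sits exactly where a Unicode-aware boundary would not, and the other way round.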
Has anybody encountered similar problems?
Regex A
/(\b)(äm)(\W)/gmu
Text: “Schäm some text äm ”
Matches: “Schäm”, but not “äm” (-> unexpected behaviour)
Regex B
/(\b)(mä)(\W)/gmu
Text: “Schmäh some text mä ”
Matches: “mä”, but not “Schmäh” (-> expected behaviour)
Regex C
/(\B)(äm)(\W)/gmu
Text: “Schäm some text äm ”
Matches: “äm”, but not “Schäm” (-> unexpected behaviour, but this is my current workaround)
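For reference, the cases above can be reproduced with a short script, using a sample text that contains both “Schäm” and the standalone word “äm”. The script also sketches a possible alternative to the \B workaround: lookarounds built on the Unicode property escapes \p{L} and \p{N}, which need the u flag and an engine that supports lookbehind. This is an untested idea for my real code, not a confirmed fix, and the names matchIndices, escapeRegExp and wholeWord are hypothetical helpers.

```javascript
// Start index of every match of a global regex in the text.
function matchIndices(re, text) {
  return [...text.matchAll(re)].map((m) => m.index);
}

// Escape regex metacharacters so the word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Assumption: a "word character" is any Unicode letter, digit or underscore.
const WORD_CHAR = '[\\p{L}\\p{N}_]';

// Build a regex that matches `word` only when it is not surrounded
// by word characters (a Unicode-aware stand-in for \b...\b).
function wholeWord(word) {
  return new RegExp(
    `(?<!${WORD_CHAR})${escapeRegExp(word)}(?!${WORD_CHAR})`,
    'gu'
  );
}

const text = 'Schäm some text äm ';

// Regex A: matches inside "Schäm" (index 3), misses the standalone word.
console.log(matchIndices(/(\b)(äm)(\W)/gmu, text)); // [ 3 ]

// Regex C: matches the standalone word (index 16), skips "Schäm".
console.log(matchIndices(/(\B)(äm)(\W)/gmu, text)); // [ 16 ]

// Lookaround variant: only the standalone word.
console.log(matchIndices(wholeWord('äm'), text)); // [ 16 ]

// Replacement keeps the trailing space, since lookarounds consume nothing.
console.log(text.replace(wholeWord('äm'), 'XX')); // "Schäm some text XX "
```

Compared with Regex C, the lookaround variant should also match at the very end of the string and should not match after another umlaut (e.g. in “öäm”), where \B would still succeed.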