Regular expressions
JavaScript regular expressions are different from Java regular expressions.
For java.util.regex.Pattern (and its derivatives like scala.util.matching.Regex and the .r method), Scala.js implements the semantics of Java regular expressions, although with some limitations.
The semantics and feature set of JavaScript regular expressions are available through js.RegExp, like any other JavaScript API.
Support
The set of supported features for Pattern depends on the target ECMAScript version, specified in ESFeatures.esVersion.
By default, Scala.js targets ECMAScript 2015.
It is possible to change that target with the following setting:
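A minimal sketch of such a setting in an sbt build, assuming the Scala.js sbt plugin is enabled on the project and that ES2018 is the desired target:

```scala
// build.sbt (sketch; assumes the Scala.js sbt plugin is enabled on this project)
import org.scalajs.linker.interface.ESVersion

// Target ES 2018 to enable more features of regular expressions
scalaJSLinkerConfig ~= { _.withESFeatures(_.withESVersion(ESVersion.ES2018)) }
```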
Attention! While this enables more features of regular expressions, it restricts your application to environments that support recent JavaScript features. If you maintain a library, this restriction applies to all downstream libraries and applications. We therefore recommend avoiding the additional features and preferring additional logic in code where that is possible.
In particular, we recommend avoiding the MULTILINE flag, aka (?m), which requires ES2018.
We give some hints on how to avoid it below.
Not supported
The following features are never supported:
- the CANON_EQ flag,
- the \X, \b{g} and \N{...} expressions,
- \p{In𝘯𝘢𝘮𝘦} character classes representing Unicode blocks,
- the \G boundary matcher, except if it appears at the very beginning of the regex (e.g., \Gfoo),
- embedded flag expressions with inner groups, i.e., constructs of the form (?idmsuxU-idmsuxU:𝑋),
- embedded flag expressions without inner groups, i.e., constructs of the form (?idmsuxU-idmsuxU), except if they appear at the very beginning of the regex (e.g., (?i)abc is accepted, but ab(?i)c is not), and
- numeric “back” references to groups that are defined later in the pattern (note that even Java does not support named back references like that).
Conditionally supported
The following features require esVersion >= ESVersion.ES2015 (which is true by default):
- the UNICODE_CASE flag.
The following features require esVersion >= ESVersion.ES2018 (which is false by default):
- the MULTILINE and UNICODE_CHARACTER_CLASS flags,
- look-behind assertions (?<=𝑋) and (?<!𝑋),
- the \b and \B expressions used together with the UNICODE_CASE flag,
- \p{𝘯𝘢𝘮𝘦} expressions where 𝘯𝘢𝘮𝘦 is not one of the POSIX character classes.
Always supported
It is worth noting that, among others, the following features are supported in all cases, even when no equivalent feature exists in ECMAScript at all, or in the target version of ECMAScript:
- correct handling of surrogate pairs (natively supported in ES 2015+),
- the \G boundary matcher when it is at the beginning of the pattern (corresponding to the “y” flag, natively supported in ES 2015+),
- named groups and named back references (natively supported in ES 2018+),
- the DOTALL flag (natively supported in ES 2018+),
- ASCII case-insensitive matching (CASE_INSENSITIVE on but UNICODE_CASE off),
- comments with the COMMENTS flag,
- POSIX character classes in ASCII mode, or their Unicode variant with UNICODE_CHARACTER_CLASS (if the latter is itself supported, see above),
- complex character classes with unions and intersections (e.g., [a-z&&[^g-p]]),
- atomic groups (?>𝑋),
- possessive quantifiers 𝑋*+, 𝑋++ and 𝑋?+,
- the \A, \Z and \z boundary matchers,
- the \R expression,
- embedded quotations with \Q and \E, both outside and inside character classes.
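As an illustration, the following sketch exercises three of the features listed above (class intersection, a possessive quantifier and a quotation). The object and value names are hypothetical and chosen only for this example.

```scala
import java.util.regex.Pattern

object AlwaysSupportedDemo {
  // Character class intersection: lowercase letters except g through p.
  val intersection: Pattern = Pattern.compile("[a-z&&[^g-p]]+")

  // Possessive quantifier: a++ never gives characters back, so the
  // trailing 'a' of the pattern can never be matched.
  val possessive: Pattern = Pattern.compile("a++a")

  // Quotation: everything between \Q and \E is taken literally,
  // so the dot only matches a dot.
  val quoted: Pattern = Pattern.compile("\\Qa.b\\E")

  def main(args: Array[String]): Unit = {
    println(intersection.matcher("abc").matches()) // true
    println(intersection.matcher("ghi").matches()) // false
    println(possessive.matcher("aaa").find())      // false
    println(quoted.matcher("a.b").matches())       // true
    println(quoted.matcher("axb").matches())       // false
  }
}
```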
All the supported features have the correct semantics from Java. This is even true for features that exist in JavaScript but with different semantics, among which:
- the ^ and $ boundary matchers with the MULTILINE flag (when the latter is supported),
- the predefined character classes \h, \s, \v, \w and their negated variants, respecting the UNICODE_CHARACTER_CLASS flag,
- the \b and \B boundary matchers, respecting the UNICODE_CHARACTER_CLASS flag,
- the internal format of \p{𝘯𝘢𝘮𝘦} character classes, including the \p{java𝘔𝘦𝘵𝘩𝘰𝘥𝘕𝘢𝘮𝘦} classes,
- octal escapes and control escapes.
Guarantees
If a feature is not supported, a PatternSyntaxException is thrown at the time of Pattern.compile().
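Because unsupported features surface as an exception at compile time, code that must run on several platforms can guard the compilation. A minimal sketch, where tryCompile is a hypothetical helper and not part of any library:

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

object SafeCompile {
  /** Hypothetical helper: compiles `regex`, reporting a rejected pattern as a Left. */
  def tryCompile(regex: String): Either[String, Pattern] =
    try Right(Pattern.compile(regex))
    catch {
      case e: PatternSyntaxException => Left(e.getMessage)
    }

  def main(args: Array[String]): Unit = {
    // The embedded flag is at the very beginning of the regex: accepted.
    println(tryCompile("(?i)abc").isRight)
    // An embedded flag in the middle of the regex is listed as unsupported
    // in Scala.js but is accepted by the JVM, so this result is
    // platform-dependent.
    println(tryCompile("ab(?i)c").isRight)
    // An unclosed group is a syntax error.
    println(tryCompile("(abc").isLeft)
  }
}
```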
If Pattern.compile() succeeds, the regex is guaranteed to behave exactly like on the JVM, except for capturing groups within repeated segments (both for their back references and subsequent calls to group, start and end):
- on the JVM, a capturing group always captures whatever substring was successfully matched last by that group during the processing of the regex:
  - even if it was in a previous iteration of a repeated segment and the last iteration did not have a match for that group, or
  - if it was during a later iteration of a repeated segment that was subsequently backtracked;
- in JS and hence in Scala.js, capturing groups within repeated segments always capture what was matched (or not) during the last iteration that was eventually kept.
The behavior of JavaScript is more “functional”, whereas that of the JVM is more “imperative”.
This imperative nature is also reflected in the hitEnd() and requireEnd() methods of Matcher, which are not supported (they do not link).
The behavior of the JVM does not appear to be specified, and is questionable. There are several open issues that argue it is buggy.
Scala.js keeps the JavaScript behavior, and does not try to replicate the JVM behavior (potentially at great cost).
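A small sketch of a pattern where this difference should be observable. The pattern and names are chosen for illustration only: group 2 takes part in the first iteration of the repeated segment but not in the last one.

```scala
import java.util.regex.Pattern

object RepeatedGroupDemo {
  // On the input "ab", the first iteration matches "a" through group 2,
  // the second (and last) iteration matches "b" without using group 2.
  val pattern: Pattern = Pattern.compile("((a)|b)+")

  /** Returns group 1 and, if it captured anything, group 2. */
  def groups(input: String): Option[(String, Option[String])] = {
    val m = pattern.matcher(input)
    if (m.matches()) Some((m.group(1), Option(m.group(2)))) else None
  }

  def main(args: Array[String]): Unit = {
    // Group 1 is "b" in both cases, since the last kept iteration matched "b".
    // Group 2 is expected to be "a" on the JVM (captured in the first
    // iteration) and absent in Scala.js, per the description above.
    println(groups("ab"))
  }
}
```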
Avoiding the MULTILINE flag, aka (?m)
The “m” flag of JavaScript’s RegExp is subtly different from that of Java’s Pattern.
It considers that the position in the middle of a \r\n sequence is both the beginning and end of a line, whereas Pattern considers that neither is true.
The semantics of Pattern correspond to Unicode recommendations.
In general, we cannot implement the Pattern behavior without look-behind assertions ((?<=𝑋)), which are only available in ECMAScript 2018+.
However, in most concrete cases, it is possible to replace the usage of the “m” flag with a combination of a) more complicated patterns and b) some ad hoc logic in the code using the regex.
Consider the following simple example, which matches every foo or bar or empty string on a line and prints them:
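A sketch of what such an example could look like. The pattern and the sample input are assumptions made for illustration; only the names regex and input come from the surrounding text.

```scala
object MultilineDemo {
  // Assumed pattern: a whole line that is "foo", "bar" or empty.
  // In Scala.js, (?m) requires esVersion >= ES2018, as described earlier.
  val regex = """(?m)^(foo|bar|)$""".r

  // Hypothetical sample input, using UNIX newlines only.
  val input = "foo\nbar\n\nbaz\nfoo"

  def matches(s: String): List[String] =
    regex.findAllMatchIn(s).map(_.matched).toList

  def main(args: Array[String]): Unit =
    for (m <- regex.findAllMatchIn(input))
      println(m.matched)
}
```

With this input, the line baz is skipped and the empty third line produces an empty match.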
Assuming that, in the particular use case we are facing, only UNIX newlines can appear in the input string, we can rewrite the regex without the (?m) flag:
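One possible rewrite, given as a sketch rather than a definitive pattern: the start of a line becomes "start of input or a consumed \n", and the end of a line becomes a look-ahead for \n or the end of input. The sample input is hypothetical. Edge cases such as an empty first line or a trailing newline are not handled identically to (?m) by this sketch.

```scala
object Regex2Demo {
  // Assumed rewrite without (?m). The leading newline is consumed,
  // the trailing one is only looked at.
  val regex2 = """(?:^|\n)(foo|bar|)(?=\n|$)""".r

  // Hypothetical sample input, using UNIX newlines only.
  val input = "foo\nbar\n\nbaz\nfoo"

  /** The raw matched substrings, including any consumed leading newline. */
  def rawMatches(s: String): List[String] =
    regex2.findAllMatchIn(s).map(_.matched).toList

  def main(args: Array[String]): Unit =
    println(rawMatches(input).length) // 4 matches for this input
}
```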
regex2 has exactly one match for each match of regex, and can therefore be used instead.
However, the specific string being matched changes, since the newline characters are included in the matched substrings.
The surrounding code can compensate for that discrepancy, using the capturing group in the middle:
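A sketch of that compensation, assuming a newline-consuming pattern of the kind described above (the exact pattern and the sample input are assumptions): instead of the whole match, the code reads capturing group 1, which never contains the newline.

```scala
object Regex2GroupDemo {
  // Assumed pattern: consumes the leading \n, captures the line content in group 1.
  val regex2 = """(?:^|\n)(foo|bar|)(?=\n|$)""".r

  // Hypothetical sample input, using UNIX newlines only.
  val input = "foo\nbar\n\nbaz\nfoo"

  /** The content of each matching line, without newline characters. */
  def lines(s: String): List[String] =
    regex2.findAllMatchIn(s).map(_.group(1)).toList

  def main(args: Array[String]): Unit =
    for (m <- regex2.findAllMatchIn(input))
      println(m.group(1))
}
```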
If other newline characters must be recognized, a more complicated pattern needs to be used.
If it is acceptable to consider the position in the middle of \r\n as the start and end of a line (like JavaScript’s RegExp does), the following regex works:
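A sketch of such a regex, under the assumption that every individual line terminator character recognized by java.util.regex.Pattern (\n, \r, \u0085, \u2028, \u2029) may start or end a line. The names are hypothetical.

```scala
object AnyNewlineDemo {
  // Any single line terminator character; "\r\n" counts as two terminators here.
  private val nl = "[\\n\\r\\u0085\\u2028\\u2029]"

  // Assumed pattern: a line starts at the input start or after a terminator,
  // and ends before a terminator or at the end of the input.
  val regex3 = ("(?:^|" + nl + ")(foo|bar|)(?=" + nl + "|$)").r

  def lines(s: String): List[String] =
    regex3.findAllMatchIn(s).map(_.group(1)).toList

  def main(args: Array[String]): Unit =
    // The empty string in the middle comes from the position between \r and \n.
    println(lines("foo\r\nbar")) // List(foo, , bar)
}
```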
If not, invalid matches must be rejected a posteriori using ad hoc logic:
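A sketch of such ad hoc logic, assuming a pattern that treats every single line terminator character as a boundary: matches whose line would start or end between the \r and the \n of a \r\n sequence are filtered out afterwards. The pattern and the helper names are hypothetical.

```scala
object RejectCrLfMiddle {
  // Any single line terminator character recognized by java.util.regex.Pattern.
  private val nl = "[\\n\\r\\u0085\\u2028\\u2029]"

  // Assumed pattern: group 1 holds the line content.
  val regex3 = ("(?:^|" + nl + ")(foo|bar|)(?=" + nl + "|$)").r

  /** True if `pos` lies between the '\r' and the '\n' of a "\r\n" sequence. */
  def inMiddleOfCrLf(s: String, pos: Int): Boolean =
    pos > 0 && pos < s.length && s.charAt(pos - 1) == '\r' && s.charAt(pos) == '\n'

  /** Line contents, keeping only matches whose boundaries are valid for Pattern. */
  def lines(s: String): List[String] =
    regex3.findAllMatchIn(s)
      .filterNot(m => inMiddleOfCrLf(s, m.start(1)) || inMiddleOfCrLf(s, m.end(1)))
      .map(_.group(1))
      .toList

  def main(args: Array[String]): Unit =
    println(lines("foo\r\nbar")) // List(foo, bar)
}
```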