Changes:
* Remove the "value" property (it is replaced by other properties).
* Add the "parts", "inverted", and "ignoreCase" properties (which
allow more structured access to expectation data).
Changes:
* Rename the "value" property to "text" (because it doesn't contain
the whole value, which also includes the case sensitivity flag).
* Add the "ignoreCase" property (which was missing).
If the described class is case-sensitive, nothing changes.
If the described class is case-insensitive, its description doesn't
indicate that anymore. The indication was awkward and it was meaningful
only for parser users familiar with PEG.js grammar syntax (typically a
minority). For cases where case insensitivity indication is vital, named
rules can be used to customize the reporting.
Note that literal descriptions already ignore the case-sensitivity flag;
this commit only makes things consistent.
Simplify regexps that specify ranges of characters to escape with "\xXX"
and "\uXXXX" in various escaping functions. Until now, these regexps
were (mostly) mutually exclusive with more selective regexps applied
before them, but this became a maintenance headache. I decided to
abandon the exclusivity, which allowed to simplify these regexps (at the
cost of introducing an ordering dependency).
Change how found strings are escaped when building syntax error
messages:
* Do not escape non-ASCII characters (U+0100-U+FFFF). They are
typically more readable in their raw form.
* Escape DEL (U+007F). It is a control character.
* Escape NUL (U+0000) as "\0", not "\x00".
* Do not use less known shortcut escape sequences ("\b", "\f"), only the
well-known ones ("\0", "\t", "\n", "\r").
These changes mirror expectation escaping changes done in
4fe682794d.
Part of work on #428.
Before this commit, descriptions of literals used in error messages were
built by applying JavaScript string escaping to their values, making the
descriptions look like JavaScript strings. Descriptions of character
classes were built using their raw text. These approaches were mutually
inconsistent and lead to descriptions which were over-escaped and not
necessarily human-friendly (in case of literals) or coupled with details
of the grammar (in case of character classes).
This commit changes description building code in both cases and unifies
it. The intent is to generate human-friendly descriptions of matched
expressions which are clean, unambiguous, and which don't escape too
many characters, while handling special characters such as newlines
well.
Fixes#127.
I no longer think that using raw literal texts in error messages is the
right thing to do. The main reason is that it couples error messages
with details of the grammar such as use of single or double quotes in
literals. A better solution is coming in the next commit.
This reverts commit 69a0f769fc.
The wrapping functions are also generated by PEG.js, so the comment
should be above them to mark them as such. This shouldn't cause any
problems technically.
Introduce two ways of specifying parser dependencies: the "dependencies"
option of PEG.buildParser and the -d/--dependency CLI option. Specified
dependencies are translated into AMD dependencies and Node.js's
"require" calls when generating an UMD parser.
Part of work on #362.
Extract "generateWrapper" code which generates code of the intro and the
returned parser object into helper functions. This is pure refactoring,
generated parser code is exactly the same as before.
This change will make it easier to modifiy "generateWrapper" to produce
UMD modules.
Part of work on #362.
Code which was at the toplevel of the "generateJS" function in the code
generator is now split into "generateToplevel" (which genreates parser
toplevel code) and "generateWrapper" (which generates a wrapper around
it). This is pure refactoring, generated parser code is exactly the same
as before.
This change will make it easier to modifiy the code genreator to produce
UMD modules.
Part of work on #362.
Instead of testing arguments.length to see whether an optional parameter
was passed to a function, compare its value to "undefined". This
approach has two advantages:
* It is in line with handling of default parameters in ES6.
* Optional parameters are actually spelled out in the parameter
list.
There is also one important disadvantage, namely that it's impossible to
pass "undefined" as an optional parameter value. This required a small
change in two tests.
Additional notes:
* Default parameter values are set in assignments immediately
after the function header. This reflects the fact that these
assignments really belong to the parameter list (which is where they
are in ES6).
* Parameter values are checked against "void 0" in places where
"undefined" can potentially be redefiend.
Labels in expressions like "(a:"a")" or "(a:"a" b:"b" c:"c")" were
visible to the outside despite being wrapped in parens. This commit
makes them invisible, as they should be.
Note this required introduction of a new "group" AST node, whose purpose
is purely to provide label scope isolation. This was necessary because
"label" and "sequence" nodes don't (and can't!) provide this isolation
themselves.
Part of a fix of #396.
Before this commit, generated parsers considered the following character
sequences as newlines:
Sequence Description
------------------------------
"\n" Unix
"\r" Old Mac
"\r\n" Windows
"\u2028" line separator
"\u2029" paragraph separator
This commit limits the sequences only to "\n" and "\r\n". The reason is
that nobody uses Unicode newlines or "\r" in practice.
A positive side effect of the change is that newline-handling code
became simpler (and likely faster).
The expectation deduplication algorithm called |Array.prototype.splice|
to eliminate each individual duplication, which was slow. This caused
problems with grammar/input combinations that generated a lot of
expecations (see #377 for an example).
This commit replaces the algorithm with much faster one, eliminating the
problem.
The |found| property wasn't very useful as it mostly contained just one
character or |null| (the exception being syntax errors triggered by
|error| or |expected|). Similarly, the "but XXX found" part of the error
message (based on the |found| property) wasn't much useful and was
redundant in presence of location info.
For these reasons, this commit removes the |found| property and
corresponding part of the error message from syntax errors. It also
modifies error location info slightly to cover a range of 0 characters,
not 1 character (except when the error is triggered by |error| or
|expected|). This corresponds more precisely to the actual situation.
Fixes#372.
Report left recursion also in cases where the recursive rule invocation
is not a direct element of a sequence, but is wrapped inside an
expression.
Fixes#359.
Before this commit, the |reportLeftRecursion| pass was written in
functional style, passing the |visitedRules| array around as a parameter
and making a new copy each time a rule was visited. This apparently
caused performance problems in some deeply recursive grammars.
This commit makes it so that there is just one array which is shared
across all the visitor functions via a closure and modified as rules are
visited.
I don't like losing the functional style (it was elegant) but
performance is more important.
Fixes#203.
* In strict mode code, functions can only be declared at top level or
immediately within another function. This means functions defined in
the initializer would throw.
Before this commit, position details (line and column) weren't computed
efficiently from the current parse position. There was a cache but it
held only one item and it was rarely hit in practice. This resulted in
frequent rescanning of the whole input when the |location| function was
used in various places in a grammar.
This commit extends the cache to remember position details for any
position they were ever computed for. In case of a cache miss, the cache
is searched for a value corresponding to the nearest lower position,
which is then used to compute position info for the desired position
(which is then cached). The whole input never needs to be rescanned.
No items are ever evicted from the cache. I think this is fine as the
max number of entries is the length of the input. If this becomes a
problem I can introduce some eviction logic later.
The performance impact of this change is significant. As the benchmark
suite doesn't contain any grammar with |location| calls I just used a
little ad-hoc benchmark script which measured time to parse the grammar
of PEG.js itself (which contains |location| calls):
var fs = require("fs"),
parser = require("./lib/parser");
var grammar = fs.readFileSync("./src/parser.pegjs", "utf-8"),
startTime, endTime;
startTime = (new Date()).getTime();
parser.parse(grammar);
endTime = (new Date()).getTime();
console.log(endTime - startTime);
The measured time went from ~293 ms to ~54 ms on my machine.
Fixes#337.
Replace |line|, |column|, and |offset| properties of tracing events with
the |location| property. It contains an object similar to the one
returned by the |location| function available in action code:
{
start: { offset: 23, line: 5, column: 6 },
end: { offset: 25, line: 5, column: 8 }
}
For the |rule.match| event, |start| refers to the position at the
beginning of the matched input and |end| refers to the position after
the end of the matched input.
For |rule.enter| and |rule.fail| events, both |start| and |end| refer to
the current position at the time the rule was entered.
Replace |line|, |column|, and |offset| properties of |SyntaxError| with
the |location| property. It contains an object similar to the one
returned by the |location| function available in action code:
{
start: { offset: 23, line: 5, column: 6 },
end: { offset: 25, line: 5, column: 8 }
}
For syntax errors produced in the middle of the input, |start| refers to
the first unparsed character and |end| refers to the character behind it
(meaning the span is 1 character). This corresponds to the portion of
the input in the |found| property.
For syntax errors produced the end of the input, both |start| and |end|
refer to a character past the end of the input (meaning the span is 0
characters).
For syntax errors produced by calling |expected| or |error| functions in
action code the location info is the same as the |location| function
would return.
Preform the following renames:
* |reportedPos| -> |savedPos| (abstract machine variable)
* |peg$reportedPos| -> |peg$savedPos| (variable in generated code)
* |REPORT_SAVED_POS| -> |LOAD_SAVED_POS| (instruction)
* |REPORT_CURR_POS| -> |UPDATE_SAVED_POS| (instruction)
The idea is that the name |reportedPos| is no longer accurate after the
|location| change (seea the previous commit) because now both
|reportedPos| and |currPos| are reported to user code. Renaming to
|savedPos| resolves this inaccuracy.
There is probably some better name for the concept than quite generic
|savedPos|, but it doesn't come to me.
Replace |line|, |column|, and |offset| functions with the |location|
function. It returns an object like this:
{
start: { offset: 23, line: 5, column: 6 },
end: { offset: 25, line: 5, column: 8 }
}
In actions, |start| refers to the position at the beginning of action's
expression and |end| refers to the position after the end of action's
expression. This allows one to easily add location info e.g. to AST
nodes created in actions.
In predicates, both |start| and |end| refer to the current position.
Fixes#246.