The idea behind linting lib/parser.js was that it would improve the
quality of code generated by PEG.js in general. However, there are a
couple of problems with it:
1. Code in lib/parser.js is ES5 while the rest of the code is ES2015.
This would mean a separate ESLint configuration and a separate set
of code style rules just for lib/parser.js once code style checks
are added.
2. Code in lib/parser.js is generated. This means that even today it
violates checks like "no-unused-vars", which have to be disabled.
This would get worse once code style checks are added, again
requiring a separate ESLint configuration just for lib/parser.js.
3. Linting lib/parser.js checks only a small portion of possible code
generator output. For example, code generated when optimizing for
size or with tracing enabled is not checked at all. Thus, linting
lib/parser.js gives a false sense of security.
Because of these problems I decided not to lint lib/parser.js at all
and to rely on ad-hoc linting of parser files produced by PEG.js,
ignoring false positives. I consider this more of a pragmatic cost vs.
benefits decision than a principled one.
Part of #407.
The lib/parser.js file is a CommonJS module like all the other files in
lib/, so setting the environment explicitly is not needed. Besides, the
environment set by "eslint-env" had been wrong since the transition from
the AMD format.
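For reference, the removed directive was a per-file comment of this
shape (illustrative; the environment name shown here is an example, not
the one that was actually set):

  /* eslint-env browser */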
This will make it possible to use ES2015 constructs in benchmark code.
The change required introducing a small server which serves both PEG.js
and the benchmark code, passed through Babel and bundled together. This
made it possible to convert the benchmark to regular modules and to get
rid of the hackery that was previously needed to make it run both in
Node.js and in the browser.
Note the benchmark no longer exercises the browser version.
See #442.
This will make it possible to use ES2015 constructs in spec code.
The change required introducing a small server which serves both PEG.js
and the spec code, passed through Babel and bundled together. This made
it possible to convert the specs to regular modules and to get rid of
the hackery that was previously needed to make them run both in Node.js
and in the browser.
Note the specs no longer exercise the browser version. This will make it
possible to write specs for PEG.js internals in the future.
See #442.
Until now, PEG.js was exported in a "PEG" global variable when no module
loader was detected. The same variable name was also conventionally used
when requiring it in Node.js or otherwise referring to it. This was
reflected in various places in the code, documentation, examples, etc.
This commit changes the variable name to "peg" and fixes all relevant
occurrences. The main reason for the change is that in Node.js, modules
are generally referred to by lower-case variable names, so "PEG" was
sticking out when used in Node.js projects.
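For illustration, a conventional require line changes roughly like this
(assuming the npm package name "pegjs"):

  // Before:
  var PEG = require("pegjs");

  // After:
  var peg = require("pegjs");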
Fix the following errors:
31:9 error "parser" is defined but never used no-unused-vars
406:14 error "expected" is defined but never used no-unused-vars
1304:15 error "s1" is defined but never used no-unused-vars
1386:15 error "s1" is defined but never used no-unused-vars
1442:15 error "s1" is defined but never used no-unused-vars
Implement the swap from JSHint to ESLint and change various directives
in the source code. The "make hint" target becomes "make lint".
The change leads to quite a few errors being reported by ESLint. These
will be fixed in subsequent commits.
Note the configuration enables just the recommended rules. Later I plan
to enable more rules to enforce the coding standard. The configuration
also sets the environment to "node", which is far from ideal as the
codebase contains a mix of CommonJS, Node.js and browser code. I hope to
clean this up at some point.
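A minimal sketch of such a configuration, expressed as an
.eslintrc.js-style module (illustrative; the actual file and its
contents may differ):

  // Enables just the recommended rules and sets the environment to
  // "node", as described above.
  module.exports = {
    extends: "eslint:recommended",
    env: { node: true }
  };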
Besides the recursion detector, the visitor will also be used by the
infinite loop detector.
Note the newly created |asts.matchesEmpty| function re-creates the
visitor each time it is called, which makes it slower than necessary.
This could have been worked around in various ways, but I chose to defer
that optimization because the real-world performance impact is small.
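A minimal sketch of the pattern in question (toy node types and
handlers; not the actual |asts.matchesEmpty| code):

  function matchesEmpty(node) {
    // The visitor is re-created on every call; hoisting it out of
    // |matchesEmpty| would avoid the repeated allocation.
    function visit(n) {
      switch (n.type) {
        case "literal":  return n.value === "";   // "" matches empty input
        case "optional": return true;             // e? can always match empty
        case "sequence": return n.elements.every(visit);
        default:         return false;
      }
    }

    return visit(node);
  }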
Split lib/utils.js into multiple files. Some of the functions were
generic; these were moved into files in lib/utils. Other functions were
specific to the compiler; these were moved to files in lib/compiler.
This commit only moves functions around -- no renaming or cleanup is
performed. Both will come later.
Before this commit, the |expected| and |error| functions didn't halt the
parsing immediately, but triggered a regular match failure. After they
were called, the parser could backtrack, try other branches, and only
if no other branch succeeded did it trigger an exception, with
information possibly based on parameters passed to the |expected| or
|error| function (this depended on the positions where failures in other
branches had occurred).
While nice in theory, this solution didn't work well in practice. There
were at least two problems:
1. An action expression could easily trigger a match failure later in
the input than the action itself. This resulted in the
action-triggered failure being shadowed by the expression-triggered
one.
Consider the following example:

  integer = digits:[0-9]+ {
      var result = parseInt(digits.join(""), 10);

      if (result % 2 === 0) {
        error("The number must be an odd integer.");
        return;
      }

      return result;
    }
Given input "2", the |[0-9]+| expression would record a match
failure at position 1 (an unsuccessful attempt to parse yet another
digit after "2"). However, a failure triggered by the |error| call
would occur at position 0.
This problem could have been solved by silencing match failures in
action expressions, but that would lead to severe performance
problems (yes, I tried and measured). Other possible solutions are
hacks which I didn't want to introduce into PEG.js.
2. Triggering a match failure in action code could have led to
unexpected backtracking.
Consider the following example:

  class = "[" (charRange / char)* "]"

  charRange = begin:char "-" end:char {
      if (begin.data.charCodeAt(0) > end.data.charCodeAt(0)) {
        error("Invalid character range: " + begin + "-" + end + ".");
      }

      // ...
    }

  char = [a-zA-Z0-9_\-]
Given input "[b-a]", the |charRange| rule would fail, but the
parser would try the |char| rule and succeed repeatedly, resulting
in "b-a" being parsed as a sequence of three |char|'s, which it is
not.
This problem could have been solved by using negative predicates,
but that would complicate the grammar and still wouldn't get rid of
unintuitive behavior.
Given these problems, I decided to change the semantics of the
|expected| and |error| functions. They no longer interact with the
regular match failure mechanism; instead, they cause an immediate parse
failure by throwing an exception. I think this is more intuitive
behavior with less harmful side effects.
The disadvantage of the new approach is that one can't backtrack from an
action-triggered error. I don't see this as a big deal as I think this
will be rarely needed and one can always use a semantic predicate as a
workaround.
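A minimal sketch of the new mechanism (hypothetical names; not the
actual generated code):

  // |error| no longer records a regular match failure -- it throws,
  // so the parser cannot backtrack past this point.
  function error(message) {
    throw { name: "SyntaxError", message: message };
  }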
Speed impact
------------
Before:     993.84 kB/s
After:      998.05 kB/s
Difference: 0.42%
Size impact
-----------
Before:     1019968 b
After:       975434 b
Difference: -4.37%
(Measured by /tools/impact with Node.js v0.6.18 on x86_64 GNU/Linux.)
The |expected| function allows users to report regular match failures
inside actions.
If the |expected| function is called, and the reported match failure
turns out to be the cause of a parse error, the error message reported
by the parser will be in the usual "Expected ... but found ..." format
with the description specified in the |expected| call used as part of
the message.
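For example, a grammar along these lines (illustrative) could use
|expected| to report a custom expectation:

  integer = digits:[0-9]+ {
      var result = parseInt(digits.join(""), 10);

      if (result % 2 === 0) {
        expected("odd integer");
      }

      return result;
    }

Given input "2", the resulting error message would then follow the
"Expected ... but found ..." format, with "odd integer" used as the
description.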
Implements part of #198.
Speed impact
------------
Before:     1146.82 kB/s
After:      1031.25 kB/s
Difference: -10.08%
Size impact
-----------
Before:     950817 b
After:      973269 b
Difference: 2.36%
(Measured by /tools/impact with Node.js v0.6.18 on x86_64 GNU/Linux.)
This is a complete rewrite of the PEG.js code generator. Its goals are:
1. Allow optimizing the generated parser code for code size as well as
for parsing speed.
2. Prepare ground for future optimizations and big features (like
incremental parsing).
3. Replace the old template-based code-generation system with
something more lightweight and flexible.
4. General code cleanup (structure, style, variable names, ...).
New Architecture
----------------
The new code generator consists of two steps:
* Bytecode generator -- produces bytecode for an abstract virtual
machine
* JavaScript generator -- produces JavaScript code based on the
bytecode
The abstract virtual machine is stack-based. Originally I wanted to make
it register-based, but it turned out that all the related code would be
more complex and the bytecode itself would be longer (because of
explicit register specifications in instructions). The only downsides of
the stack-based approach seem to be a few small inefficiencies (see e.g.
the |NIP| instruction), which appear to be insignificant.
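To illustrate the flavor of the approach, here is a minimal
stack-machine interpreter loop (hypothetical opcodes; not the actual
PEG.js bytecode or interpreter):

  function interpret(bytecode, consts) {
    var stack = [];
    var ip = 0;

    while (ip < bytecode.length) {
      switch (bytecode[ip]) {
        case 0:   // PUSH c -- push the constant with index c
          stack.push(consts[bytecode[ip + 1]]);
          ip += 2;
          break;

        case 1:   // POP -- discard the top of the stack
          stack.pop();
          ip++;
          break;

        case 2:   // NIP -- discard the value below the top
          stack.splice(stack.length - 2, 1);
          ip++;
          break;
      }
    }

    return stack[stack.length - 1];
  }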
The new generator allows optimizing for parsing speed or code size (you
can choose using the |optimize| option of the |PEG.buildParser| method
or the --optimize/-o option on the command-line).
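Usage looks roughly like this (the option values "speed" and "size" are
assumed here):

  // Assumed option values:
  var fastParser  = PEG.buildParser(grammar, { optimize: "speed" });
  var smallParser = PEG.buildParser(grammar, { optimize: "size" });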
When optimizing for size, the JavaScript generator emits the bytecode
together with its constant table and a generic bytecode interpreter.
Because the interpreter is small and the bytecode and constant table
grow only slowly with size of the grammar, the resulting parser is also
small.
When optimizing for speed, the JavaScript generator just compiles the
bytecode into JavaScript. The generated code is relatively efficient, so
the resulting parser is fast.
Internal Identifiers
--------------------
As a small bonus, all internal identifiers visible to user code in the
initializer, actions and predicates are prefixed by |peg$|. This lowers
the chance that identifiers in user code will conflict with the ones
from PEG.js. It also makes using any internals in user code ugly, which
is a good thing. This solves GH-92.
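For illustration, internal state in a generated parser looks something
like this (example names only, not an exhaustive or guaranteed list):

  // Prefixed internal parser state -- unlikely to clash with
  // identifiers in user code:
  var peg$currPos = 0;
  var peg$reportedPos = 0;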
Performance
-----------
The new code generator improved parsing speed and parser code size
significantly. The generated parsers are now:
* 39% faster when optimizing for speed
* 69% smaller when optimizing for size (without minification)
* 31% smaller when optimizing for size (with minification)
(Parsing speed was measured using the |benchmark/run| script. Code size
was measured by generating parsers for examples in the |examples|
directory and adding up the file sizes. Minification was done by |uglify
--ascii| in version 1.3.4.)
Final Note
----------
This is just a beginning! The new code generator lays a foundation upon
which many optimizations and improvements can (and will) be made.
Stay tuned :-)
Includes:
* Moving the source code from /src to /lib.
* Adding an explicit file list to package.json.
* Updating the Makefile.
* Updating the spec and benchmark suites and their READMEs.
Part of a fix for GH-32.
PEG.js source code becomes a set of Node.js modules that include each
other as needed. The distribution version is built by bundling these
modules together, wrapping them inside a bit of boilerplate code that
makes |module.exports| and |require| work.
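A minimal sketch of such boilerplate (illustrative; the actual build
output differs):

  var modules = {};

  // Register a module under a name without executing it yet.
  function define(name, factory) {
    modules[name] = { factory: factory, exports: null };
  }

  // Execute a module's factory on first use and cache its exports.
  function require(name) {
    var module = modules[name];

    if (module.exports === null) {
      module.exports = {};
      module.factory(module, module.exports, require);
    }

    return module.exports;
  }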
Part of a fix for GH-32.
When the Git repository becomes an npm package, there will be no
preprocessing step and thus no @VERSION substitution. Let's get rid of
it.
Part of a fix for GH-32.
Before this commit, package.json in the project root directory was
preprocessed in order to insert correct version into it. This made it
invalid JSON and thus unusable for npm purposes.
This commit makes package.json valid JSON by hardcoding the version into
it. I think this small duplication is outweighed by being able to use
npm in the project root directory. For example, it is now possible to
make the "npm test" command work and to introduce Travis CI integration.
This is the first of many commits that gradually convert PEG.js's test
suite from QUnit to Jasmine, cleaning it up along the way.
The main reason for the change is that Jasmine allows nested contexts,
making it possible to structure the tests better than QUnit does.
Moreover, the tests needed to be cleaned up a bit.