83 Commits (4fe682794d15d37d79b9f0037ebfcd5ad46d76ff)

Author SHA1 Message Date
David Majda 4fe682794d Improve expression descriptions in error messages
Before this commit, descriptions of literals used in error messages were
built by applying JavaScript string escaping to their values, making the
descriptions look like JavaScript strings. Descriptions of character
classes were built using their raw text. These approaches were mutually
inconsistent and lead to descriptions which were over-escaped and not
necessarily human-friendly (in case of literals) or coupled with details
of the grammar (in case of character classes).

This commit changes description building code in both cases and unifies
it. The intent is to generate human-friendly descriptions of matched
expressions which are clean, unambiguous, and which don't escape too
many characters, while handling special characters such as newlines
well.

Fixes #127.
8 years ago
David Majda 2fd77b96fc Revert "Use literal raw text in error messages"
I no longer think that using raw literal texts in error messages is the
right thing to do. The main reason is that it couples error messages
with details of the grammar such as use of single or double quotes in
literals. A better solution is coming in the next commit.

This reverts commit 69a0f769fc.
8 years ago
David Majda 057a93fbc7 Move the "Generated by ..." comment out of wrapping functions
The wrapping functions are also generated by PEG.js, so the comment
should be above them to mark them as such. This shouldn't cause any
problems technically.
8 years ago
David Majda fe6ce38d08 UMD parsers: Regenerate lib/parser.js
Part of work on #362.
8 years ago
David Majda ce44c62f14 Support passing custom location info to "error" and "expected"
Based on a pull request by Konstantin (@YemSalat):

  https://github.com/pegjs/pegjs/pull/391

Resolves #390.
8 years ago
David Majda da2378d887 Rewrite handling of optional parameters
Instead of testing arguments.length to see whether an optional parameter
was passed to a function, compare its value to "undefined". This
approach has two advantages:

  * It is in line with handling of default parameters in ES6.

  * Optional parameters are actually spelled out in the parameter
    list.

There is also one important disadvantage, namely that it's impossible to
pass "undefined" as an optional parameter value. This required a small
change in two tests.

Additional notes:

  * Default parameter values are set in assignments immediately
    after the function header. This reflects the fact that these
    assignments really belong to the parameter list (which is where they
    are in ES6).

  * Parameter values are checked against "void 0" in places where
    "undefined" can potentially be redefiend.
8 years ago
David Majda 0c39f1cf86 Fix labels leaking to outer scope
Labels in expressions like "(a:"a")" or "(a:"a" b:"b" c:"c")" were
visible to the outside despite being wrapped in parens. This commit
makes them invisible, as they should be.

Note this required introduction of a new "group" AST node, whose purpose
is purely to provide label scope isolation. This was necessary because
"label" and "sequence" nodes don't (and can't!) provide this isolation
themselves.

Part of a fix of #396.
8 years ago
David Majda 88f1d1369b Detect newlines using charCodeAt instead of charCode
In generated parsers, detect newlines using charCodeAt instead of
charCode. This should be slightly faster.
8 years ago
David Majda 18d266be67 Remove support for newlines other than "\n" and "\r\n"
Before this commit, generated parsers considered the following character
sequences as newlines:

  Sequence   Description
  ------------------------------
  "\n"       Unix
  "\r"       Old Mac
  "\r\n"     Windows
  "\u2028"   line separator
  "\u2029"   paragraph separator

This commit limits the sequences only to "\n" and "\r\n". The reason is
that nobody uses Unicode newlines or "\r" in practice.

A positive side effect of the change is that newline-handling code
became simpler (and likely faster).
8 years ago
David Majda ddd4ac3787 Fix ESLint errors in lib/parser.js
Fix the following errors:

    31:9   error  "parser" is defined but never used    no-unused-vars
   406:14  error  "expected" is defined but never used  no-unused-vars
  1304:15  error  "s1" is defined but never used        no-unused-vars
  1386:15  error  "s1" is defined but never used        no-unused-vars
  1442:15  error  "s1" is defined but never used        no-unused-vars
8 years ago
David Majda d34faba59e Speed up deduplication of expectations
The expectation deduplication algorithm called |Array.prototype.splice|
to eliminate each individual duplication, which was slow. This caused
problems with grammar/input combinations that generated a lot of
expecations (see #377 for an example).

This commit replaces the algorithm with much faster one, eliminating the
problem.
9 years ago
David Majda a4a66a2e5b Switch from first/rest to head/tail in the PEG.js grammar
In the past year I worked on various grammars where first/rest or
head/tail were used as labels for parts of lists. I found I associate
head/tail with a list immediately, while in case of first/rest I have to
"parse" grammar rules for a while before understanding their structure.

Moreover, I tend to assume that rest is a list of the same thigs as
first, but I don't have such assumption in case of head/tail. This
assumption was in conflict with the grammar structure.

I'm not sure how much these observations are applicable to others, but I
decided to act on them and switch from first/rest to head/tail.
9 years ago
David Majda 69a0f769fc Use literal raw text in error messages
Fixes #127.
9 years ago
David Majda 25ab98027d Remove info about found string from syntax errors
The |found| property wasn't very useful as it mostly contained just one
character or |null| (the exception being syntax errors triggered by
|error| or |expected|). Similarly, the "but XXX found" part of the error
message (based on the |found| property) wasn't much useful and was
redundant in presence of location info.

For these reasons, this commit removes the |found| property and
corresponding part of the error message from syntax errors. It also
modifies error location info slightly to cover a range of 0 characters,
not 1 character (except when the error is triggered by |error| or
|expected|). This corresponds more precisely to the actual situation.

Fixes #372.
9 years ago
David Majda 20a4fb2e7f Update version to 0.9.0 9 years ago
David Majda 4b154e177f Update character categories in grammars to Unicode 8.0.0 9 years ago
David Majda 5e6b5da4e9 Merge pull request #347 from mbaumgartl/errorstack
Add stack trace in engines based on V8
9 years ago
Marco Baumgartl 940a66fb38 Add stack trace in engines based on V8. Fixes #331 9 years ago
Arlo Breault 45e39c3ac8 Make generated parsers use strict mode
* Issue #324

 * JSHint complains about two possible strict violations. But are valid
   uses of `this`, so we suppress the warnings.
9 years ago
Arlo Breault 7285ccfd4e Remove block around initialize code
* In strict mode code, functions can only be declared at top level or
   immediately within another function.  This means functions defined in
   the initializer would throw.
9 years ago
Arlo Breault 7695e5e3c5 Fix complaints from `make hint` 9 years ago
David Majda f2200e48af Optimize location info computation
Before this commit, position details (line and column) weren't computed
efficiently from the current parse position. There was a cache but it
held only one item and it was rarely hit in practice. This resulted in
frequent rescanning of the whole input when the |location| function was
used in various places in a grammar.

This commit extends the cache to remember position details for any
position they were ever computed for. In case of a cache miss, the cache
is searched for a value corresponding to the nearest lower position,
which is then used to compute position info for the desired position
(which is then cached). The whole input never needs to be rescanned.

No items are ever evicted from the cache. I think this is fine as the
max number of entries is the length of the input. If this becomes a
problem I can introduce some eviction logic later.

The performance impact of this change is significant. As the benchmark
suite doesn't contain any grammar with |location| calls I just used a
little ad-hoc benchmark script which measured time to parse the grammar
of PEG.js itself (which contains |location| calls):

  var fs     = require("fs"),
      parser = require("./lib/parser");

  var grammar = fs.readFileSync("./src/parser.pegjs", "utf-8"),
      startTime, endTime;

  startTime = (new Date()).getTime();
  parser.parse(grammar);
  endTime = (new Date()).getTime();

  console.log(endTime - startTime);

The measured time went from ~293 ms to ~54 ms on my machine.

Fixes #337.
9 years ago
David Majda 89146915ce Add location information to AST nodes
This will allow to add location information to |GrammarError| exceptions
thrown in various passes.
9 years ago
David Majda 065f4e1b75 Improve location info in syntax errors
Replace |line|, |column|, and |offset| properties of |SyntaxError| with
the |location| property. It contains an object similar to the one
returned by the |location| function available in action code:

  {
    start: { offset: 23, line: 5, column: 6 },
    end:   { offset: 25, line: 5, column: 8 }
  }

For syntax errors produced in the middle of the input, |start| refers to
the first unparsed character and |end| refers to the character behind it
(meaning the span is 1 character). This corresponds to the portion of
the input in the |found| property.

For syntax errors produced the end of the input, both |start| and |end|
refer to a character past the end of the input (meaning the span is 0
characters).

For syntax errors produced by calling |expected| or |error| functions in
action code the location info is the same as the |location| function
would return.
9 years ago
David Majda b1ad2a1f61 Rename |reportedPos| to |savedPos|
Preform the following renames:

  * |reportedPos| -> |savedPos| (abstract machine variable)
  * |peg$reportedPos| -> |peg$savedPos| (variable in generated code)
  * |REPORT_SAVED_POS| -> |LOAD_SAVED_POS| (instruction)
  * |REPORT_CURR_POS| -> |UPDATE_SAVED_POS| (instruction)

The idea is that the name |reportedPos| is no longer accurate after the
|location| change (seea the previous commit) because now both
|reportedPos| and |currPos| are reported to user code. Renaming to
|savedPos| resolves this inaccuracy.

There is probably some better name for the concept than quite generic
|savedPos|, but it doesn't come to me.
9 years ago
David Majda 4f7145e360 Improve location info available in action code
Replace |line|, |column|, and |offset| functions with the |location|
function. It returns an object like this:

  {
    start: { offset: 23, line: 5, column: 6 },
    end:   { offset: 25, line: 5, column: 8 }
  }

In actions, |start| refers to the position at the beginning of action's
expression and |end| refers to the position after the end of action's
expression. This allows one to easily add location info e.g. to AST
nodes created in actions.

In predicates, both |start| and |end| refer to the current position.

Fixes #246.
9 years ago
David Majda fb7de36051 Update website URL
PEG.js website was moved from http://pegjs.majda.cz/ to http://pegjs.org/.
10 years ago
David Majda 2b06476c69 Regenerate lib/parser.js after bytecode changes 10 years ago
David Majda 57f7fae684 Fix a bug in |stringEscape|
The |stringEscape| function both in lib/compiler/javascript.js and in
generated parsers didn't escape characters in the U+0100..U+107F and
U+1000..U+107F ranges.
10 years ago
David Majda 46ac1bf171 Wrap initializer code in generated parsers into |{...}|
Initializer code is usually indented and this indentation is carried
over to generated code. This resulted in a piece of indented code in the
middle of the parser.

This commit wraps initializer code in |{...}|, which makes indentation
in generated parsers look a bit more natural.
10 years ago
David Majda 39084496ca Expose the parser object in action/predicate code
The action/predicate code didn't have access to the parser object. This
was mostly a side effect actions/predicates being implemented as nested
functions, in which |this| is a reference to the global object (an ugly
JavaScript quirk). The initializer, being implemented differently, had
access to the parser object via |this|, but this was not documented.

Because having access to the parser object can be useful, this commits
introduces a new |parser| variable which holds a reference to it, is
visible in action/predicate/initializer code, and is properly
documented.

See also:

  https://groups.google.com/forum/#!topic/pegjs/Na7YWnz6Bmg
10 years ago
David Majda c7521fb868 Mark |parse| and |SyntaxError| as internal identifiers
The |parse| function and the |SyntaxError| exception were meant as
internal, so let's mark them as such.
10 years ago
David Majda 7e3b4ec4f8 PEG.js grammar: Remove reserved word detection
This is mostly done for consistency with the JavaScript example grammar,
from which the |Identifier| rule is taken from. See the previous commit
for details.
10 years ago
David Majda e78ffbba9c PEG.js grammar: Improve the |Code| rule a bit
Instead of matching segments between blocks character by character,
match them as a whole. Also align the style with other similar rules
(e.g. the comment ones).
10 years ago
David Majda 64eb5faf54 PEG.js grammar: Fix line continuation handling in character classes
Before this commit, line continuations in character classes contributed
an empty string to the list of characters and character ranges matched
by a class. While this didn't lead to a buggy behavior with the current
code generator, the AST was wrong and the problem could have caused bugs
later.

This commit fixes the problem.
10 years ago
David Majda 0678bd8a0c PEG.js grammar: Add missing semicolon
Fixes the following JSHint error:

  lib/parser.js: line 108, col 54, Missing semicolon.
10 years ago
David Majda 0459ab6b37 PEG.js grammar: Formatting & comments 10 years ago
David Majda 6f2510e49e PEG.js grammar: Make rules with operators more generic 10 years ago
David Majda 45c29a886f PEG.js grammar: Extract the |SemanticPredicateExpression| rule
Semantic predicates are kind of |PrimaryExpression|, not kind of
|PrefixedExpression|. Therefore I extracted a rule for them and
referenced it from the |PrimaryExpression|.
10 years ago
David Majda da18f6a729 PEG.js grammar: Extract the |RuleReferenceExpression| rule
This makes the |Primary| rule a bit more tidy.
10 years ago
David Majda 8e6f98e45c PEG.js grammar: Extract the |ActionExpression| rule
Having it separated from the |SequenceExpression| rule is cleaner and
more logical.
10 years ago
David Majda 5c6f4dd38b PEG.js grammar: Append |Expression| to expression rule names
Makes the rule names a bit longer but also clearer.
10 years ago
David Majda 27c2d26203 PEG.js grammar: More JavaScript-like initializer/rule separation
Initializer and rules are now separated in a similar way as JavaScript
statements -- either by a semicolon or a line terminator, possibly with
whitespace and comments mixed in.

One consequence is that the grammars like this are now illegal:

  foo = "a" bar = "b"

A semicolon needs to be inserted between the rules:

  foo = "a";bar = "b"

I consider this a good change as the now-illegal syntax was somewhat
confusing.
10 years ago
David Majda 4ce7593f5f PEG.js grammar: Extract the |AnyMatcher| rule
This makes the |Primary| rule a bit more tidy. Also, matching the |.|
character really belongs to the lexical part of the grammar, next to
literals and character classes.
10 years ago
David Majda c0df01b092 PEG.js grammar: Improve code block handling
* Rename the |Action| rule to |CodeBlock| (it better describes what
    the rule matches).

  * Implement the rule in a simpler way and move it after more basic
    lexical elements.
10 years ago
David Majda 13f72bb19d PEG.js grammar: More JavaScript-like rules for identifiers
This change has two side effects:

  * Label names can no longer be JavaScript reserved words.

  * |$| is allowed again in label names. However, because of the
    preference rules, names starting with it will be usually parsed as a
    text operator followed by another identifier (denoting a rule
    reference or label name).
10 years ago
David Majda 0d6b91cb20 PEG.js grammar: More JavaScript-like rules for strings/literals/classes 10 years ago
David Majda bcb5271649 PEG.js grammar: More JavaScript-like rules for skipped elements 10 years ago
David Majda a5a0609505 PEG.js grammar: Inline trivial character rules 10 years ago
David Majda ae89f5e469 PEG.js grammar: Change whitespace handling
Before this commit, whitespace was handled at the lexical level by
making tokens consume any whitespace coming after them. This was
accomplished by appending |__| to every token rule.

This commit changes whitespace handling to be more explicit. Tokens no
longer consume whitespace coming after them and syntactic rules have to
cope with it. While this slightly complicates the syntactic grammar, I
think it's a cleaner way. Moreover, it is what JavaScript example
grammar does.

One small side-effect of thich change is that the grammar is now
stand-alone (it doesn't require utils.js anymore).
10 years ago