SBP: the Scannerless Boolean Parser

News

15-Jan-2009: Mailing List Created
There is now a mailing list for SBP. You can subscribe here or read it on gmane as gmane.comp.parsers.sbp.user

11-Jan-2009: SBP 1.0 Released
This release isn't happening so much as a result of something occurring as a result of something not occurring – it's been over a year since I've come across a correctness bug, so I figure now is a good time for a “1.0” release. See the releases section below.

10-Jan-2009
The repository has been converted from darcs to git. The git repo can be checked out with this command:

git clone http://research.cs.berkeley.edu/project/sbp/edu.berkeley.sbp.git

There is also a gitweb interface.

The old darcs repository, to which patches will no longer be pushed, can be found here:

http://research.cs.berkeley.edu/project/sbp/darcsrepo-no-longer-updated/

15-Jul-2008
I've resolved to start documenting the metagrammar. Here are the first steps towards that goal.

09-Jun-2008
Preliminary support for serialization has been checked in. Note that this is still being developed; right now deserializing a parse table actually takes longer than rebuilding it from scratch. See RegressionTests.java for an example of how to use this feature.

08-Apr-2008
Another checkin with some minor bugfixes.

03-Mar-2008
Just got around to checking in a few changes that had been lying around; the only user-visible change is that grammar files may now include tab characters as whitespace in the grammar as well as “\t”, which matches the tab character in the input.

09-Sep-2007
I've finally gotten around to fixing the GSS node-sharing algorithm; it should now be at least as good as Rekers'. This gives a major performance improvement for some grammars, and none for others. Also added a test case to check for regressions on this.

08-Sep-2007
Parse tables (class Parser) are now serializable (implement Serializable), which means that the time to construct a parse table should no longer be a major factor in overall performance. This should make it possible to trade off more intensive analysis at parse-table-construction-time for better parse-time performance.

27-May-2007
Another checkin; the tokens [>>] and [<<] now represent increasing and decreasing indentation (respectively) and you can lift (backtick) on arbitrary components of a sequence. I've also taken big steps towards the near-future goal of being able to serialize parse tables.

24-May-2007
WIX has been released. This is the major project that I'm currently using SBP for.

19-Apr-2007
Just checked in a whole bunch of changes from the last three months of development (everything since mid-February). The complete release notes are here; notable changes include:

  • Major performance improvements

  • SBP is now able to “garbage collect” stacks corresponding to secondary conjuncts; parse errors are now reported immediately.

  • Added the “...” symbol (pronounced “anything”) and “generalized negation” operator (~~) to the metagrammar.

  • Added the IndentingReader class

  • Added the sbp.ansi=off and sbp.verbose=true system properties

  • Removed the reflective bindings and annotations APIs.

  • Minor changes to the rules for tagging in metagrammars; as a result, parse trees are now much more predictable.

17-Apr-2007
A preview of WIX is now available. WIX was one of the main motivations for writing SBP.

16-Aug-2006
A new snapshot is here.

29-July-2006
Error handling has been massively improved. Here's an example parsing from a substantial portion of the Java 1.5 grammar. The input is missing a closing angle-bracket on a generic type definition. Type make java15 after a checkout to try it yourself.

22-July-2006
The API has been finalized and includes a decent example/mini-tutorial.

17-July-2006
There is now a mailing list.

05-July-2006
The reflective grammar-to-java bindings are complete, so SBP is now vastly easier to use. You can find example code here and the companion grammar here.

What is it?

The Scannerless Boolean Parser (SBP) is a scannerless parser for boolean grammars (a superset of context-free grammars). It is written in Java and emits Java source code.

What is interesting about it?

SBP deliberately sacrifices performance in favor of ease of extensibility.

Since it is an implementation of the (modified) Lang-Tomita GLR algorithm, SBP supports all context-free languages.

It is scannerless (does not require a lexer). This allows it to easily handle languages which have non-regular lexical structure or lack a clear lexer-parser distinction, such as TeX, XML, RFC1738 (URLs), ASN.1, SMTP headers, and Wiki markup. For example, when using SBP as a parser one can make URLs first-class lexenes.

In addition to the juxtaposition and union operators provided in context-free languages, SBP supports grammars which use the intersection operator (conjunctive grammars) and the complement operator (boolean grammars).

What features does it have?

All features below are fully implemented

  • An implementation of the Lang-Tomita GLR parsing algorithm

    • Including Johnstone & Scott's RNGLR algorithm for epsilon-productions

      • Visser's extensions for scannerless parsing

        • Follow, Avoid, Prefer, Reject constraints

        • Character ranges

        • Automatic insertion of whitespace/comments

      • Any topological space can be used as an alphabet (need not be discrete). Unicode/ASCII is currently implemented; parsing trees should be fairly straightforward.

      • Associativity constraints on n-ary operators

  • Ability to parse a wide variety of grammars in O(n3 log n) time:

    • all context-free grammars

    • epsilon productions, included in the parse forest

    • circularities, (not included in the parse forest).

    • Regular expression operators (*, ?, + )

    • boolean grammars (intersection, intersect-with-complement, and generalized-complement)

  • Facilitates experimenting with grammars

    • Interpreted mode, in which the parse table is interpreted directly, eliminating the need for a compiler and making it easier for grammars to operate on grammars.

    • Simple API makes it easy to generate, analyze, and modify grammars programmatically.

      • Components of a grammar (nonterminals, productions, etc) represented as objects

      • composite elements implement Iterable

What is it deliberately missing?

  • Semantic actions; the only option is to return a parse forest.

    • This keeps the grammar specification language-neutral.

    • A grammar can, however, indicate that certain parts of the parse tree should be dropped.

What features would be nice to have?

  • Compiled mode, in which Java source code is emitted; compiling this code yields a parser. The resulting parser is much faster.

  • Drop Farshi's algorithm and use GRMLR Done!

  • An implementation of the McPeak-Necula optimization for bounded-depth determinism.

  • Lazy parse trees, to decrease the space requirements from o(n) to o(1) but still O(n).

  • Consider implementing  Aycock-Horspool unrolling. Improves performance with only highly localized increase in algorithmic complexity. Subsumes many other optimizations.

  • Substring parsing. This determines if a given input string is a substring of some string accepted by the grammar. This is very useful for error recovery; when an error is encountered, the substring parser is started on the next symbol to discover additional errors.

  • “Rule node sharing” and “symbol node sharing” from page 30 or Rekers' thesis.

  • Semantic actions.

  • Ability to read in PEG grammar definitions.

What are the long term goals?

As we come to a more mature understanding of the pragmatic aspects of boolean grammars, a long-term goal is to migrate support for these features to existing high-performance GLR implementations (Elkhound, bison-glr).

Where can I read more about it?

Where can I get it?

The color coding above accurately reflects the state of the implementation (27-May-2007).

SBP is available under the BSD license.

Git Repository

git clone http://research.cs.berkeley.edu/project/sbp/

There is also a gitweb interface.

Releases

The latest release is sbp-1.00.jar and sbp-1.00.tgz.

You can find all the releases here.