SBP: the Scannerless Boolean Parser
15-Jan-2009: Mailing List Created
There is now a mailing list for SBP. You can subscribe here or read it on gmane as gmane.comp.parsers.sbp.user
11-Jan-2009: SBP 1.0 Released
This release isn't happening so much as a result of something occurring as a result of something not occurring – it's been over a year since I've come across a correctness bug, so I figure now is a good time for a “1.0” release. See the releases section below.
The repository has been converted from darcs to git. The git repo can be checked out with this command:
git clone http://research.cs.berkeley.edu/project/sbp/edu.berkeley.sbp.git
There is also a gitweb interface.
The old darcs repository, to which patches will no longer be pushed, can be found here:
I've resolved to start documenting the metagrammar. Here are the first steps towards that goal.
Preliminary support for serialization has been checked in. Note that this is still being developed; right now deserializing a parse table actually takes longer than rebuilding it from scratch. See RegressionTests.java for an example of how to use this feature.
Another checkin with some minor bugfixes.
Just got around to checking in a few changes that had been lying around; the only user-visible change is that grammar files may now include tab characters as whitespace in the grammar as well as “\t”, which matches the tab character in the input.
I've finally gotten around to fixing the GSS node-sharing algorithm; it should now be at least as good as Rekers'. This gives a major performance improvement for some grammars, and none for others. Also added a test case to check for regressions on this.
Parse tables (class Parser) are now serializable (implement Serializable), which means that the time to construct a parse table should no longer be a major factor in overall performance. This should make it possible to trade off more intensive analysis at parse-table-construction-time for better parse-time performance.
Another checkin; the tokens [>>] and [<<] now represent increasing and decreasing indentation (respectively) and you can lift (backtick) on arbitrary components of a sequence. I've also taken big steps towards the near-future goal of being able to serialize parse tables.
WIX has been released. This is the major project that I'm currently using SBP for.
Just checked in a whole bunch of changes from the last three months of development (everything since mid-February). The complete release notes are here; notable changes include:
Major performance improvements
SBP is now able to “garbage collect” stacks corresponding to secondary conjuncts; parse errors are now reported immediately.
Added the “...” symbol (pronounced “anything”) and “generalized negation” operator (~~) to the metagrammar.
Added the IndentingReader class
Added the sbp.ansi=off and sbp.verbose=true system properties
Removed the reflective bindings and annotations APIs.
Minor changes to the rules for tagging in metagrammars; as a result, parse trees are now much more predictable.
A preview of WIX is now available. WIX was one of the main motivations for writing SBP.
A new snapshot is here.
Error handling has been massively improved. Here's an example parsing from a substantial portion of the Java 1.5 grammar. The input is missing a closing angle-bracket on a generic type definition. Type make java15 after a checkout to try it yourself.
The API has been finalized and includes a decent example/mini-tutorial.
There is now a mailing list.
The reflective grammar-to-java bindings are complete, so SBP is now vastly easier to use. You can find example code here and the companion grammar here.
What is it?
The Scannerless Boolean Parser (SBP) is a scannerless parser for boolean grammars (a superset of context-free grammars). It is written in Java and emits Java source code.
What is interesting about it?
SBP deliberately sacrifices performance in favor of ease of extensibility.
Since it is an implementation of the (modified) Lang-Tomita GLR algorithm, SBP supports all context-free languages.
It is scannerless (does not require a lexer). This allows it to easily handle languages which have non-regular lexical structure or lack a clear lexer-parser distinction, such as TeX, XML, RFC1738 (URLs), ASN.1, SMTP headers, and Wiki markup. For example, when using SBP as a parser one can make URLs first-class lexenes.
In addition to the juxtaposition and union operators provided in context-free languages, SBP supports grammars which use the intersection operator (conjunctive grammars) and the complement operator (boolean grammars).
What features does it have?
All features below are fully implemented
An implementation of the Lang-Tomita GLR parsing algorithm
Ability to parse a wide variety of grammars in O(n3 log n) time:
all context-free grammars
epsilon productions, included in the parse forest
circularities, (not included in the parse forest).
Regular expression operators (*, ?, + )
boolean grammars (intersection, intersect-with-complement, and generalized-complement)
Facilitates experimenting with grammars
Interpreted mode, in which the parse table is interpreted directly, eliminating the need for a compiler and making it easier for grammars to operate on grammars.
Simple API makes it easy to generate, analyze, and modify grammars programmatically.
Components of a grammar (nonterminals, productions, etc) represented as objects
composite elements implement Iterable
What is it deliberately missing?
What features would be nice to have?
Compiled mode, in which Java source code is emitted; compiling this code yields a parser. The resulting parser is much faster.
Drop Farshi's algorithm and use GRMLR Done!
An implementation of the McPeak-Necula optimization for bounded-depth determinism.
Lazy parse trees, to decrease the space requirements from o(n) to o(1) but still O(n).
Consider implementing Aycock-Horspool unrolling. Improves performance with only highly localized increase in algorithmic complexity. Subsumes many other optimizations.
Substring parsing. This determines if a given input string is a substring of some string accepted by the grammar. This is very useful for error recovery; when an error is encountered, the substring parser is started on the next symbol to discover additional errors.
“Rule node sharing” and “symbol node sharing” from page 30 or Rekers' thesis.
Ability to read in PEG grammar definitions.
What are the long term goals?
As we come to a more mature understanding of the pragmatic aspects of boolean grammars, a long-term goal is to migrate support for these features to existing high-performance GLR implementations (Elkhound, bison-glr).
Where can I read more about it?
Where can I get it?
The color coding above accurately reflects the state of the implementation (27-May-2007).
SBP is available under the BSD license.
git clone http://research.cs.berkeley.edu/project/sbp/
There is also a gitweb interface.
The latest release is sbp-1.00.jar and sbp-1.00.tgz.
You can find all the releases here.