GPoSTTL is an enhanced version of Eric Brill's rule-based
Parts-of-Speech Tagger, with a built-in Tokenizer and
Lemmatizer. It reads from FILE or STDIN and writes to
STDOUT. It is based on the LPost package by Jimmy Lin
(jimmylin at umd.edu). LPost itself is based on Benjamin
Han's ePost package, which is a cleaned-up version of
Eric Brill's original code. The primary lemma list was taken
from e_lemma.txt (Ver.1), compiled by Prof. Yasumasa Someya
(someya at someya-net.com), with permission.
The list has since been extended with hundreds of
additional entries, and this work is ongoing.
Motivations:
- GPoSTTL has been developed as an open-source alternative
to TreeTagger,
a Penn Treebank tagger that was used as a crucial component of
Anubadok: A GPL'ed machine translator for Bengali.
GPoSTTL is now used as the default tagger in the Anubadok system.
The default mode of GPoSTTL uses an enhanced Penn tagset that makes
its output compatible with the output of TreeTagger,
so GPoSTTL can be used as a drop-in substitute for TreeTagger.
New Features:
- GPoSTTL contains several new features compared to the original Brill tagger.
The main improvements are the following.
- GPoSTTL has a built-in tokenizer, so raw English text can be fed
directly into the tagger. This eliminates the need to pre-tokenize English text.
- An initialization method for unknown numerals has been added. It ensures that unknown
cardinal numbers receive the tag "CD" from the start-state tagger.
- The Brill tagger's original lexicon has been replaced by a lemmatized lexicon. This
allows GPoSTTL's lemmatizer to print lemma information for each token.
- The default mode of GPoSTTL uses an enhanced Penn tagset
to make its output compatible with the output of TreeTagger.
In particular, the second letter of each verb tag distinguishes
between "be" verbs (B), "have" verbs (H) and other verbs
(V). The enhancement is done at the last step of the tagging
procedure, because the lexicon contains the original Penn
tagset.
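The numeral initialization and the final verb-tag enhancement described above can be sketched as follows. This is a minimal Python illustration, not GPoSTTL's actual code: the function names, the numeral pattern, the "NN" fallback for unknown words, and the use of the lemma to choose between B, H and V are all assumptions made for the example.

```python
import re

# Assumed pattern for cardinal numbers such as 42, 3.14 or 1,000.
# GPoSTTL's real rule for unknown numerals may differ.
NUMERAL = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")

# The six verb tags of the original Penn tagset.
PENN_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def initial_tag(token, lexicon, default="NN"):
    """Start-state tag: lexicon lookup first, then the numeral check.

    `lexicon` is a hypothetical dict mapping known tokens to Penn tags;
    `default` is a simplified fallback for all other unknown words.
    """
    if token in lexicon:
        return lexicon[token]
    if NUMERAL.match(token):
        return "CD"  # unknown cardinal number
    return default


def enhance_verb_tag(penn_tag, lemma):
    """Final step: rewrite the second letter of a Penn verb tag.

    B marks "be" verbs, H marks "have" verbs, V marks all other verbs.
    Non-verb tags are returned unchanged.
    """
    if penn_tag not in PENN_VERB_TAGS:
        return penn_tag
    if lemma == "be":
        kind = "B"
    elif lemma == "have":
        kind = "H"
    else:
        kind = "V"
    # Keep the leading "V" and the inflection suffix (D, G, N, P, Z or none).
    return "V" + kind + penn_tag[2:]
```

Under these assumptions, enhance_verb_tag("VBZ", "have") gives "VHZ", while a non-verb tag such as "NN" passes through unchanged. Doing the rewrite as a last pass means the lexicon and the tagging rules can keep working with the original Penn tags.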
Installation:
- It has a standard autoconf-based installation procedure,
i.e. "configure (--prefix=$HOME for non-root installations),
make, make install". Please see the
manpage for usage details. It is known to work on GNU/Linux and
FreeBSD.
Download:
- The latest official release (0.9.3)
of GPoSTTL is available for download
under the Historical Permission Notice and Disclaimer license,
as it is named by the Open Source Initiative.
- To download the latest source code from the SVN repository, please use the
following command:
- svn checkout https://gposttl.svn.sourceforge.net/svnroot/gposttl/trunk gposttl
To browse GPoSTTL's SVN repository, please
click here.