DELPH-IN members share a commitment to re-usable, multi-purpose resources and
active exchange.
Based on contributions from several members and joint development over many
years, an open-source repository of software and linguistic resources has been
created that is widely used in education, research, and application building.
At the core of the DELPH-IN repository is agreement among partners on a shared
set of linguistic assumptions (grounded in
HPSG and Minimal Recursion Semantics)
and on a common formalism (i.e. logic) for linguistic description in typed
feature structures.
The formalism is implemented in several development and processing environments
(serving different purposes) and enables the exchange of grammars and
lexicons across platforms.
This formalism continuity, in turn, has allowed DELPH-IN researchers to
develop several comprehensive, wide-coverage grammars of diverse languages that
can be processed by a variety of software tools.
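To give a concrete flavour of the shared formalism, the sketch below shows what
grammar definitions look like in TDL, the type description language read by the
DELPH-IN platforms. All type and feature names in this fragment are invented
for illustration and are not taken from any released grammar.

    ; Illustrative TDL fragment: a type names its supertype(s) and may add
    ; constraints stated as a typed feature structure; subtypes inherit and
    ; further specialize these constraints.  (*top* is the conventional root
    ; of the hierarchy; all other names here are hypothetical.)
    person := *top*.
    number := *top*.
    sg := number.
    pl := number.

    agr := *top* &
      [ PERSON person,
        NUMBER number ].

    head := *top*.
    noun := head.
    verb := head.

    category := *top* &
      [ HEAD head,
        AGR agr ].

    sg-noun-category := category &
      [ HEAD noun,
        AGR.NUMBER sg ].

Because all platforms interpret the same definitions, a fragment like this can
be loaded unchanged into the LKB for development and into PET for efficient
processing.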
Over time, the following set of core components has emerged as a typical
grammar engineering configuration, commonly used by DELPH-IN members and other
research initiatives alike.
- The Linguistic
Knowledge Builder (LKB) provides an interactive grammar development
environment for typed feature structure grammars. The LKB includes a
parser and generator, visualization tools for all relevant data structures
(including trees, feature structures, MRSs, hierarchies, parse and
generation charts), and a set of specialized debugging facilities (like
‘interactive unification’) and well-formedness tests for
grammar and lexicon.
The LKB is implemented in ANSI Common-Lisp and available in full source
code or as precompiled binaries for common platforms, including Linux,
Solaris, and MS Windows.
- The PET System for the
high-efficiency processing of typed feature structure grammars complements
the LKB as a run-time and application delivery component. PET interprets
the same logical formalism (in fact reads the exact same grammar source
files) and provides a parser that is (much) less resource-demanding than
the LKB, more robust, portable, and available as a library that can be
embedded into NLP applications. Unlike the LKB, PET includes only very
limited debugging facilities.
The PET System is implemented in ANSI C++ (with critical parts in pure ANSI
C to improve run-time efficiency) and has been ported to several Unix
flavours and MS Windows. Its industrial-strength code quality has already
been confirmed in a commercial product built on top of PET. Full source
code and pre-compiled binaries for (currently) Linux are available.
- The [incr tsdb()] Competence and
Performance Profiler provides an evaluation and benchmarking tool to
grammar writers and system developers alike. [incr tsdb()] (‘tee
ess dee bee plus plus’) acts as an umbrella application for a range of
processing systems for typed feature structure grammars, including
the LKB and PET, and defines a common format for the organization of test
suites or corpora and the storage of precise and fine-grained measures of
grammar
and processor behavior. [incr tsdb()] profiles abstract over the
idiosyncrasies of individual platforms and, thus, facilitate
contrastive cross-platform comparison as well as in-depth analysis.
The [incr tsdb()] environment is implemented in ANSI C (for a simple DBMS),
ANSI Common-Lisp (core functionality), and Tcl/Tk (GUI) and has been used
successfully in various Un*x flavours. Besides a distribution in full
source code, pre-compiled object files are available that can be loaded on
top of common LKB run-time binaries.
Linguistic resources that are available as part of the DELPH-IN open-source
repository include broad-coverage grammars for English, German, and Japanese,
as well as a set of ‘emerging’ grammars for French, Korean,
Modern Greek, Norwegian, Portuguese, and Spanish.
Additionally, a proprietary grammar for Italian
(developed by CELI s.r.l. in Torino) uses the exact same DELPH-IN formalism
(and many of the Matrix assumptions) and is available for licensing.
Some more background information on selected grammars follows:
- The LinGO English
Resource Grammar (ERG) has been under development at the Center for the
Study of Language and Information (CSLI) at Stanford University since 1993.
The ERG was originally developed within the Verbmobil machine translation
effort, but over the past few years it has been ported to additional domains
(most notably an e-commerce and financial services self-help product marketed
by a CSLI industrial affiliate) and significantly extended.
The grammar includes a hand-built lexicon of around ten thousand lexemes
and allows interfacing to external lexical resources (like COMLEX). The
main grammar developer is Dan Flickinger, with contributions by (among
others) Emily Bender, Rob Malouf, and Jeff Smith.
- La Grenouille,
the French Resource Grammar, was originally designed as a tool
for modeling selected linguistic phenomena by incorporating
insights from ongoing research into the formal analysis of
French in HPSG (Abeillé, Bonami, Boyé, Desmets, Godard, Miller,
Sag, Tseng). In addition to basic clausal structures, the
grammar provides a treatment of (for example) complex predicate
constructions (compound tenses, causatives) and
morpho-syntactic and phono-syntactic effects (clitic climbing,
contraction, vowel elision, consonant liaison). La Grenouille,
currently in its tadpole stage, is undergoing metamorphosis; a
generation-enabled version was made available for public
distribution in mid-2006. Further inquiries can be addressed
to Jesse Tseng, the primary developer at Loria (Nancy, France).
- The
JaCY
Grammar of
Japanese has been under development at the German Research Center for
Artificial Intelligence (DFKI GmbH) and Saarland University (both in
Saarbrücken, Germany) since about 1996. Like the ERG, the JaCY grammar had
its origins in Verbmobil and has since been extended in numerous respects,
including use in an email auto-response product (through a cooperation with
YY Technologies of Mountain View, CA), the analysis of newspaper texts (at
Saarbrücken), and lately knowledge extraction from dictionary entries
(at the NTT Communication Science Laboratories).
The grammar is comparable in scope and size to the LinGO ERG and builds on
the ChaSen package for
word segmentation, morphological analysis, and a treatment of unknown
words.
Melanie Siegel (Saarbrücken) is the main JaCY developer; Emily Bender
(formerly YY Technologies) has made significant contributions,
and Atsuko Shimada (Saarbrücken) and Francis Bond and colleagues (NTT)
continue to add to the lexicon and grammar proper.
- The
Korean
Resource Grammar is a computational grammar for Korean
currently under development by Jong-Bok Kim at Kyung Hee
University and Jaehyung Yang at Kangnam University. The grammar
adopts the formalisms of HPSG and Minimal Recursion Semantics and
aims to provide an open-source grammar of Korean; it relies on the
MACH morphological analyzer. The development team cooperates
closely with the LinGO Laboratory at CSLI, Stanford, and with the
JaCY developers. The
current grammar covers basic sentence types, relative clauses,
light verb constructions, case phenomena, auxiliary constructions,
and so forth.
- The Modern Greek Resource
Grammar
is a computational grammar for Modern Greek currently being developed at
the Department of
Computational Linguistics of Saarland University.
The grammar includes, among other things, analyses
of basic clause syntax, word order and cliticization
phenomena in Modern Greek, valence-alternating and ditransitive
constructions, subject-verb inversion, subordinate clauses, relative
clauses, unbounded dependency constructions (UDCs), raising and control,
and politeness constructions, as well as implementations of the syntax of
noun phrases, passives, and coordination phenomena.
Valia Kordoni and Julia Neu are the main developers of the Modern Greek
Resource Grammar.
- The NorSource
Grammar of Norwegian is under development at the Norwegian University
of Science and Technology (NTNU) in Trondheim.
Similar in spirit to the other resource grammars, NorSource aims to provide a
re-usable and precise grammar of Norwegian, adapting the theory of HPSG and
Minimal Recursion Semantics to a language (family) that arguably presents
some novel challenges to existing work within the HPSG framework.
Grammar development is partially funded by the EU
DeepThought initiative
and currently focuses on core syntactic constructions, argument structure
and the syntax–semantics interface, and interfacing to an existing
computational lexicon for Norwegian.
Lars Hellan and Petter Haugereid at NTNU are the main NorSource developers,
working with a team of other researchers and students.
- The
Spanish Resource Grammar (SRG) is a computational
grammar for Spanish currently being developed at the Institut Universitari de
Lingüística Aplicada of Universitat Pompeu Fabra.
SRG development is currently funded by the Juan de la Cierva program
(MEC, Spain) within the TEXTERM-II project (BFF2003-2111).
Montserrat Marimon and Núria Bel are the main developers of the SRG.
As several HPSG implementations evolved within the same common formalism, it
became clear that homogeneity among existing grammars could be increased and
development cost for new grammars greatly reduced by compiling an inventory of
cross-linguistically valid (or at least useful) types and constructions. The
LinGO Grammar Matrix provides a
starter
kit to grammar engineers, facilitating not only efficient bootstrapping but
also rapid growth towards the wide coverage necessary for robust natural
language processing and the precision parses and semantic representations that
the ‘deep’ processing paradigm has to offer.
The Matrix (in its current release, version 0.4) comprises (a) type definitions
for the basic feature geometry and technical devices, (b) the representation
and composition machinery for Minimal Recursion Semantics in a typed feature
structure grammar, (c) general classes of rules, including derivational
and inflectional (lexical) rules, unary and binary phrase structure rules,
headed and non-headed rules, and head-initial and head-final rules, and (d)
types for basic constructions such as head-complement, head-specifier,
head-subject, head-filler, and head-modifier rules and coordination, as well
as more specialized classes of constructions.
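As an illustration of how a grammar builds on such an inventory, the
hypothetical fragment below combines Matrix-style supertypes into
language-specific phrase structure rules; the supertype names follow the
general Matrix naming scheme but may not correspond exactly to those in the
0.4 release.

    ; Hypothetical grammar-side definitions on top of Matrix-style supertypes
    ; (names are illustrative and may differ from the actual 0.4 release).
    ; A head-initial language combines the generic head-complement
    ; construction with the head-initial ordering type ...
    head-comp-phrase := basic-head-comp-phrase & head-initial.

    ; ... while subjects here precede the head, so the head-subject rule
    ; uses the head-final ordering type instead.
    subj-head-phrase := basic-head-subj-phrase & head-final.

The grammar writer then only adds language-particular constraints (agreement,
case requirements, and the like) on top of what is inherited from the Matrix.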
Finally, as processing efficiency and grammatical coverage have become less
pressing concerns for ‘deep’ NLP applications, the research focus of
several DELPH-IN members has shifted to combinations of ‘deep’
processing with stochastic approaches to NLP, on the one hand, and to building
hybrid NLP systems that integrate ‘deep’ and ‘shallow’
techniques in novel ways, on the other.
More specifically, the transfer of DELPH-IN resources into industry has
amplified the need for improved parse ranking, disambiguation, and robust
recovery techniques, and there is now broad consensus that applications of
broad-coverage linguistic grammars for analysis or generation require the use
of sophisticated stochastic models.
The LinGO Redwoods initiative provides the methodology and tools for a novel
type of treebank, far richer in the granularity of available linguistic
information and dynamic in both the access to treebank information and its
evolution over time. Redwoods has completed two sets of treebanks, each of
around 7,000 sentences, for transcribed Verbmobil dialogues and customer
emails from an e-commerce domain. On-going research by the Redwoods group at
Stanford (and partners in Edinburgh and Saarbrücken) investigates generative
and conditional probabilistic models for parse disambiguation in conjunction
with the LinGO ERG (and other DELPH-IN grammars).
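For illustration, the conditional models in this line of work are typically
log-linear (maximum entropy) models over the analyses licensed by the grammar;
one common formulation, sketched here under the usual assumptions rather than
as a description of any specific Redwoods model, is

\[
P(t \mid s) \;=\;
\frac{\exp\bigl(\sum_i \lambda_i\, f_i(t)\bigr)}
     {\sum_{t' \in T(s)} \exp\bigl(\sum_i \lambda_i\, f_i(t')\bigr)}
\]

where T(s) is the set of analyses the grammar assigns to a sentence s, the
feature functions f_i are defined over (partial) derivation trees, and the
weights λ_i are estimated from the disambiguated treebank.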
The Heart of Gold environment is
an XML-based middleware for the integration of deep and shallow natural
language processing components, with a focus on robust, multilingual,
application-oriented HPSG parsing assisted by, for example, shallow
part-of-speech taggers, chunkers, and named entity recognizers.
The Heart of Gold provides a uniform infrastructure for building applications
that use RMRS-based and/or XML-based natural language processing
components.
The middleware itself has been developed at DFKI and Saarland
University within the
DeepThought and
Quetal projects, and is published under
LGPL.
However, many of the NLP components for which adapters (‘Modules’)
are provided come with different licenses, most of them free for research
purposes. The deep
component
that is currently integrated is PET, with all deep HPSG grammars mentioned on
the DELPH-IN site. Additional deep and shallow NLP components can be integrated
easily by providing a simple Java class or an XML-RPC interface.