REPP (the Regular Expression Pre-Processor) uses a set of rules to tokenise and normalise text, while maintaining character offset information for each token. For details, see http://moin.delph-in.net/ReppTop and http://aclweb.org/anthology-new/P/P12/P12-2074.bib. INSTALL In the top level directory, run autoreconf -i ./configure make DEPENDENCIES a C++ compiler (g++, clang, etc) Boost Program_options Boost Filesystems Boost Regex ICU USAGE repp [options] [input-file] Options: -h [ --help ] This usage information. -c [ --config ] arg Configuration file (REQUIRED). --format arg Token format: string, line, triple (default string). -r [ --rpp ] arg Specify non-default location of directory containing repp modules. (Default locations are relative to the config file. See README.) A configuration file containing required options (see CONFIGURATION) must be given. Input text will be taken from input-file if given, else from STDIN. The --format option is optional (and can be set in the config file). The currently implemented output formats are: - string: Tokens are delimited by spaces, line breaks as in the input. - line: One token per line, double newline for each line in the input. - triple: As for line, but tokens are represented as a start, end, token triple, where the start and end positions represent line-based inter-character positions in the input text. The -rpp option is also optional and settable in the config file. It specifies where to look for the repp modules. If not specified, it will look in three standard places, in the following order: * the directory where the configuration file is * an `rpp' subdirectory of the directory where the configuration file is * an `rpp' sibling directory of the directory where the configuration file is To reduce complication and confusion, all modules are expected to be in the same directory. EXAMPLE cat test.txt|src/repp -c etc/repp.set --format triple CONFIGURATION The format for options in the config file is option := val1 val2 val3. Multiple values are white space separated. Each option can be written over more than one line, but must end with `.'. Only the last instantiation of an option is retained. Lines beginning with a semicolon are comments. Two example configuration files are included in the etc directory. The rules used along with the English Resource Grammar for tokenising various styles of English are included in the etc/rpp directory. The configuration file must define: - repp-modules: the list of modules that are read on initialisation - repp-tokenizer: the top level rule file Other options: - repp-calls: the subset of repp modules to be used at any one time. If this is not set, only the repp-tokenizer module will run. This is set separately from repp-modules to allow more fine-grained realtime control. - format: token output format - repp-directory: the directory where the repp rule files can be found