[223000240010] |I just finished com.aliasi.spell.JaroWinklerDistance (JaroWinklerDistance.java), which implements com.aliasi.spell.StringDistance.
[223000290190] |I implemented all of Winkler's tests as unit tests along with a number of pure boundary condition tests.
[223000290200] |It'll be out in the next release of LingPipe, along with some TF/IDF-based distances.
[223000290210] |Here's the class javadoc, which contains links to all of the relevant papers and test cases.
[223000290220] |----------------------------------------
[223000290230] |The JaroWinklerDistance class implements the original Jaro string comparison as well as Winkler's modifications.
[223000290240] |As a distance measure, Jaro-Winkler returns values between 0 (exact string match) and 1 (no matching characters).
[223000290250] |Note that this is reversed from the original definitions of Jaro and Winkler in order to produce a distance-like ordering.
[223000290260] |The original Jaro-Winkler string comparator returned 1 for a perfect match and 0 for a complete mismatch; our method returns one minus the Jaro-Winkler measure.
[223000290270] |The Jaro-Winkler distance measure was developed for name comparison in the U.S. Census.
[223000290280] |It is designed to compare surnames to surnames and given names to given names, not whole names to whole names.
[223000290290] |There is no character-specific information in this implementation, but assumptions are made about typical lengths and the significance of initial matches that may not apply to all languages.
[223000290300] |The easiest way to understand the Jaro measure and the Winkler variants is procedurally.
[223000290310] |The Jaro measure involves two steps, first to compute the number of "matches" and second to compute the number of "transpositions".
[223000290320] |The Winkler adjustment involves a final rescoring based on an exact match score for the initial characters of both strings.
[223000290330] |Throughout, suppose we are comparing two character sequences, cs1 and cs2.
[223000290350] |The Jaro-Winkler distance is defined by the following steps.
[223000290360] |After the definitions, we consider some examples.
[223000290370] |Step 1: Matches: The match phase is a greedy alignment step of characters in one string against the characters in another string.
[223000290380] |The maximum distance (measured by array index) at which characters may be matched is defined by: matchRange = max(length(cs1), length(cs2)) / 2 - 1, rounding down.
[223000290390] |The match phase is a greedy alignment that proceeds character by character through the first string, though the distance metric is symmetric (that is, reversing the order of arguments does not affect the result).
[223000290400] |For each character encountered in the first string, it is matched to the first unaligned character in the second string that is an exact character match.
[223000290410] |If there is no such character within the match range window, the character is left unaligned.
[223000290420] |Step 2: Transpositions: After matching, the subsequence of characters actually matched in both strings is extracted.
[223000290430] |These subsequences will be the same length.
[223000290440] |The number of characters in one string that do not line up (by index in the matched subsequence) with identical characters in the other string is the number of "half transpositions".
[223000290450] |The total number of transpositions is the number of half transpositions divided by two, rounding down.
[223000290460] |The Jaro distance is then defined in terms of the number of matching characters, matches, and the number of transpositions, transposes: the Jaro measure is (matches/length(cs1) + matches/length(cs2) + (matches - transposes)/matches) / 3, and the Jaro distance is one minus that measure.
[223000290470] |In words, the measure is the average of three values: (a) the percentage of the first string matched, (b) the percentage of the second string matched, and (c) the percentage of matches that were not transposed.
[223000290480] |Step 3: Winkler Modification: The Winkler modification to the Jaro comparison, resulting in the Jaro-Winkler comparison, boosts scores for strings that match character for character initially.
[223000290490] |Let boostThreshold be the minimum score for a string that gets boosted.
[223000290500] |This value was set to 0.7 in Winkler's papers (see references below).
[223000290510] |If the Jaro score is below the boost threshold, the Jaro score is returned unadjusted.
[223000290520] |The second parameter for the Winkler modification is the size of the initial prefix considered, prefixSize.
[223000290530] |The prefix size was set to 4 in Winkler's papers.
[223000290540] |Next, let prefixMatch(cs1,cs2,prefixSize) be the number of characters in the prefix of cs1 and cs2 that exactly match (by original index), up to a maximum of prefixSize.
[223000290550] |The modified distance is then defined to be: jaroWinklerDistance(cs1,cs2) = jaroDistance(cs1,cs2) * (1 - 0.1 * prefixMatch(cs1,cs2,prefixSize)), where 0.1 is Winkler's standard prefix weight.
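To make the three steps concrete, here is a minimal from-scratch sketch in Java. It is not LingPipe's implementation; the names are illustrative, the boundary conditions are simplified, and the match range is clamped at zero so that single characters can still match.

    public class JaroWinklerSketch {

        // Returns the Jaro-Winkler distance between two character sequences.
        public static double distance(CharSequence cs1, CharSequence cs2) {
            int len1 = cs1.length();
            int len2 = cs2.length();
            if (len1 == 0 && len2 == 0) return 0.0; // two empty strings match perfectly

            // Step 1: greedy matching within the match range.
            int matchRange = Math.max(0, Math.max(len1, len2) / 2 - 1);
            boolean[] matched1 = new boolean[len1];
            boolean[] matched2 = new boolean[len2];
            int matches = 0;
            for (int i = 0; i < len1; ++i) {
                int lo = Math.max(0, i - matchRange);
                int hi = Math.min(len2 - 1, i + matchRange);
                for (int j = lo; j <= hi; ++j) {
                    if (!matched2[j] && cs1.charAt(i) == cs2.charAt(j)) {
                        matched1[i] = true;
                        matched2[j] = true;
                        ++matches;
                        break; // first unaligned exact match wins
                    }
                }
            }
            if (matches == 0) return 1.0; // no matching characters

            // Step 2: extract matched subsequences, count half transpositions.
            char[] sub1 = new char[matches];
            char[] sub2 = new char[matches];
            for (int i = 0, k = 0; i < len1; ++i)
                if (matched1[i]) sub1[k++] = cs1.charAt(i);
            for (int j = 0, k = 0; j < len2; ++j)
                if (matched2[j]) sub2[k++] = cs2.charAt(j);
            int halfTransposes = 0;
            for (int k = 0; k < matches; ++k)
                if (sub1[k] != sub2[k]) ++halfTransposes;
            int transposes = halfTransposes / 2; // round down

            double jaroMeasure = (matches / (double) len1
                                  + matches / (double) len2
                                  + (matches - transposes) / (double) matches) / 3.0;

            // Step 3: Winkler modification, with Winkler's parameters.
            double boostThreshold = 0.7;
            int prefixSize = 4;
            if (jaroMeasure < boostThreshold) return 1.0 - jaroMeasure;
            int maxPrefix = Math.min(prefixSize, Math.min(len1, len2));
            int prefixMatch = 0;
            while (prefixMatch < maxPrefix
                   && cs1.charAt(prefixMatch) == cs2.charAt(prefixMatch))
                ++prefixMatch;
            double jaroWinklerMeasure
                = jaroMeasure + prefixMatch * 0.1 * (1.0 - jaroMeasure);
            return 1.0 - jaroWinklerMeasure;
        }
    }

For MARTHA versus MARHTA this returns roughly 0.039, agreeing with the worked example below.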
[223000290560] |Examples: We will present the alignment steps in the form of tables, with offsets in the second string below the first string positions that match.
[223000290570] |For a simple example, consider comparing the given (nick)name AL to itself.
[223000290580] |Both strings are of length 2.
[223000290590] |Thus the maximum match distance is max(2,2)/2 - 1 = 0, meaning all matches must be exact.
[223000290600] |The matches are illustrated in the following table:
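    cs1:           A L
    index:         0 1
    match in cs2:  0 1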
[223000290610] |The notation in the matches row is meant to indicate that the A at index 0 in cs1 is matched to the A at index 0 in cs2.
[223000290620] |Similarly for the L at index 1 in cs1, which matches the L at index 1 in cs2.
[223000290630] |This results in matches(AL,AL) = 2.
[223000290640] |There are no transpositions, so the Jaro measure is just (2/2 + 2/2 + (2-0)/2)/3 = 1, for a Jaro distance of 0.
[223000290650] |Applying the Winkler modification yields the same result, a distance of 0.
[223000290660] |Next consider a more complex case, matching MARTHA and MARHTA.
[223000290670] |Here the match distance is max(6,6)/2 - 1 = 2, allowing matching characters to be up to two characters away.
[223000290680] |This yields the following alignment.
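    cs1:           M A R T H A
    index:         0 1 2 3 4 5
    match in cs2:  0 1 2 4 3 5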
[223000290690] |Note that the T at index 3 in the first string aligns with the T at index 4 in the second string, whereas the H at index 4 in the first string aligns with the H at index 3 in the second string.
[223000290700] |The characters that do not directly align in the table (the T and the H) are out of order.
[223000290710] |This is an instance of a transposition.
[223000290720] |The number of half transpositions is determined by comparing the subsequences of the first and second string matched, namely MARTHA and MARHTA.
[223000290730] |There are two positions with mismatched characters, 3 and 4.
[223000290740] |This results in two half transpositions, or a single transposition, for a Jaro distance of: 1 - (6/6 + 6/6 + (6-1)/6)/3 = 1 - 17/18 ≈ 0.056.
[223000290750] |Three initial characters match, MAR, for a Jaro-Winkler distance of: 0.056 * (1 - 0.1*3) ≈ 0.039.
[223000290760] |Next, consider matching strings of different lengths, such as JONES and JOHNSON:
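    cs1:           J O N E S
    index:         0 1 2 3 4
    match in cs2:  0 1 3 - 4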
[223000290770] |The unmatched characters are the E in JONES (marked with a dash) and the H and final O and N in JOHNSON.
[223000290780] |Here the subsequences of matched characters for the two strings are JONS and JONS, so there are no transpositions.
[223000290790] |Thus the Jaro distance is: 1 - (4/5 + 4/7 + 4/4)/3 ≈ 1 - 0.790 = 0.210.
[223000290800] |The strings JONES and JOHNSON only match on their first two characters, JO, so the Jaro-Winkler distance is: 0.210 * (1 - 0.1*2) ≈ 0.168.
[223000290810] |We will now consider some artificial examples not drawn from (Winkler 2006).
[223000290820] |First, compare ABCVWXYZ and CABVWXYZ, which are of length 8, allowing alignments up to max(8,8)/2 - 1 = 3 positions away.
[223000290830] |This leads to the following alignment:
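    cs1:           A B C V W X Y Z
    index:         0 1 2 3 4 5 6 7
    match in cs2:  1 2 0 3 4 5 6 7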
[223000290840] |Here, there are 8/8 matches in both strings.
[223000290850] |There are only three half-transpositions, in the first three characters, because no position of CAB has an identical character to ABC.
[223000290860] |This yields a total of one transposition, for a Jaro distance of: 1 - (8/8 + 8/8 + (8-1)/8)/3 = 1 - 23/24 ≈ 0.042.
[223000290870] |There is no initial prefix match, so the Jaro-Winkler comparison produces the same result.
[223000290880] |Now consider matching ABCVWXYZ to CBAWXYZ.
[223000290890] |Here, the initial alignment is 2, 1, 0, which yields only two half transpositions.
[223000290900] |Thus under the Jaro distance, ABC is closer to CBA than to CAB, though due to integer rounding in computing the number of transpositions, this will only affect the final result if there is a further transposition in the strings.
[223000290910] |Now consider the 10-character string ABCDUVWXYZ.
[223000290920] |This allows matches up to max(10,10)/2 - 1 = 4 positions away.
[223000290930] |If matched against DABCUVWXYZ, the result is 10 matches, and 4 half transposes, or 2 transposes.
[223000290940] |Now consider matching ABCDUVWXYZ against DBCAUVWXYZ.
[223000290950] |Here, index 0 in the first string (A) maps to index 3 in the second string, and index 3 in the first string (D) maps to index 0 in the second string, but positions 1 and 2 (B and C) map to themselves.
[223000290960] |Thus when comparing the matched subsequences, there are only two half transpositions, making the second example DBCAUVWXYZ closer than DABCUVWXYZ to the first string ABCDUVWXYZ.
[223000290970] |Note that the transposition count cannot be determined solely by the mapping.
[223000290980] |For instance, the string ABBBUVWXYZ matches BBBAUVWXYZ with alignment 3, 0, 1, 2, 4, 5, 6, 7, 8, 9.
[223000290990] |But there are only two half-transpositions, because only index 0 and index 3 mismatch in the subsequences of matching characters.
[223000291000] |Contrast this with ABCDUVWXYZ matching BCDAUVWXYZ, which has the same alignment, but four half transpositions.
[223000291010] |This class's distance(CharSequence,CharSequence) and compare(CharSequence,CharSequence) methods do not require their inputs to be padded with spaces.
[223000291080] |In addition, spaces are treated just like any other characters within the algorithm itself.
[223000291090] |There is also no case normalization in this class's version.
[223000291100] |Furthermore, the boundary conditions are changed so that two empty strings return a score of 1.0 rather than zero, as in the original algorithm.
[223000291110] |The algorithm, along with applications in record linkage, is sketched in the following highly readable survey article:
[223000291120] |William E. Winkler. 2006. Overview of Record Linkage and Current Research Directions. Statistical Research Division, U.S. Census Bureau. (Note that some of the values reported in the papers do not match the output of strcmp95.c or the results of this class, which matches strcmp95.c.)
[223000291160] |The description of the matching procedure above is based on the actual strcmp95 code, the boundary conditions of which are not obvious from the text descriptions in the literature.
[223000291170] |An additional difference is that strcmp95, but not the algorithms in Winkler's papers nor the algorithm in this class, provides the possibility of partial matches with similar-sounding characters (e.g. c and k).
[223000291180] |The greedy nature of the alignment phase in the Jaro-Winkler algorithm actually prevents the optimal alignments from being found in some cases.
[223000291190] |Consider aligning ABCAWXYZ with BCAWXYZ:
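    cs1:           A B C A W X Y Z
    index:         0 1 2 3 4 5 6 7
    match in cs2:  2 0 1 - 3 4 5 6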
[223000291200] |Here the first pair of A characters are matched, leading to three half transposes (the first three matched characters).
[223000291210] |A better scoring, though illegal, alignment would be the following, because it has the same number of matches, but no transposes:
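    cs1:           A B C A W X Y Z
    index:         0 1 2 3 4 5 6 7
    match in cs2:  - 0 1 2 3 4 5 6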
[223000291220] |The illegal link is the match of the A at index 3 in the first string: the greedy procedure rules it out because the A at index 0 has already consumed the only A in the second string.
[223000291230] |Note that neither alignment matches in the initial character, so the Winkler adjustments do not apply.
[223000291240] |We used the strcmp95.c code as a reference implementation.
[223000300010] |lm.LanguageModel.Dynamic was also changed to support sequence-level training with counts.
[223000320130] |New Class spell.TfIdfDistance: This matches strings based on token frequency and inverse document frequency (TF/IDF).
[223000320140] |Any tokenizer may be used, including character n-grams.
[223000320150] |New Class spell.JaroWinklerDistance: This matches strings based on the Jaro-Winkler distance used to deduplicate name data at the U.S. Census Bureau.
[223000330010] |HMM decoding is O(k²n), where n is input length and k is the number of tags.
[223000350070] |The square comes from considering all pairs of tags for the previous token and current token.
[223000350080] |Our pruning attacks both sizes, reducing the k² factor to c1*c2, where c1 is the number of forward hypotheses for the last tag surviving pruning and c2 is the number of tags surviving the emission pruning.
[223000350090] |In practice, for our Brown corpus-based part-of-speech tagger, there are 90 tags, and therefore 90*90=8100 tag pairs.
[223000350100] |Pruning reduces this number roughly ten-fold on average without loss of first-best accuracy on held-out data.
[223000350110] |The beam-configurable version should be out soon in LingPipe 2.4.1.
[223000350120] |Spelling
[223000350130] |We’ve been working on developing our second large-scale spelling corrector, this time trained with 600MB of movie and TV-related data versus our last round training with 50GB of legal case law data (all English with some light Spanish in titles in the TV case).
[223000350140] |In both cases, the out-of-the-box speed wasn’t good enough — around 10 queries/second.
[223000350150] |All of the time is spent exploring insertions and substitutions.
[223000350160] |But these are constrained to continue within a known token in the token-sensitive case.
[223000350170] |This means we never correct to a token we didn’t see in training data, which is generally good for accuracy on languages with light morphology like English.
[223000350180] |But it also means we have a huge lever to pull for increasing speed.
[223000350190] |As the number of terms is reduced, speed goes way up.
[223000350200] |The second parameter to tune is the do-not-edit tokens.
[223000350210] |We generally don’t edit user tokens of 1 or 2 characters, nor do we edit longer tokens with high corpus counts.
[223000350220] |Combining the do-not-edit list with a pruned known-token list, we increased speed ten-fold.
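In code, the do-not-edit test amounts to something like the following sketch; the method name and both thresholds are illustrative, not LingPipe settings.

    // Hypothetical do-not-edit test for the token-sensitive spelling corrector.
    static boolean doNotEdit(String token, long corpusCount) {
        return token.length() <= 2     // never edit 1- or 2-character user tokens
            || corpusCount >= 10000L;  // never edit longer tokens with high corpus counts
    }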
[223000350230] |Unfortunately, to hit the uncached rate of 100 queries/second cost us about 5% in accuracy overall, mostly in false negatives (failure to suggest corrections for errors).
[223000350240] |You may not realize this from the demo on trivial amounts of data, but our spelling corrector is surprisingly accurate when you train it on a gigabyte of domain specific data.
[223000350250] |We’re seeing correction rates of 60-70% (percentage of user errors for which our first-best suggestion is correct) with false positive rates around 1%.
[223000350260] |Of course, if 10% of the queries contain errors, the final breakdown is: 89% user correct, no system suggestion; 7% user error, system correct suggestion; 3% user error, system wrong or no suggestion; 1% user correct, system spurious suggestion.
[223000350270] |Many of the errors are fairly unrecoverable by local edits.
[223000350280] |Keep in mind that this is a devilishly difficult domain for which to annotate truth, because user intentions are implicit.
[223000350290] |For instance, is “toomestone” the name of a band on YouTube (152 hits on Google) or a misspelling of “tombstone” (6.5 million hits)?
[223000350300] |Google suggests “tombstone” and that’s a true-positive in the second case, but a false-positive in the first case.
[223000350310] |Of course, if a user with the same IP address types “tombstone”, you can at least guess it was the misspelling case.
[223000350320] |Google will suggest “tomb stone” might mean “tombstone”, but clearly “tomb” and “stone” could make up a query that’s not related to tombstones (for instance, for a survey of tomb construction).
[223000350330] |At best, we can hope the statistics play out to let us guess right most of the time.
[223000350340] |After all, if the band Toomestone had more fans and more people searching for it, there’d be more hits for it online.
[223000350350] |This is the whole point of building the source side (modeling what to expect) of the noisy channel model.
[223000350360] |This time around we also built some spelling-specific weighted edit distances.
[223000350370] |We found that users simply don’t know when something is one word or two words, especially for names of products (e.g. they’d type “ling pipe” for “lingpipe” or “postagger” for “pos tagger”), so space insertion/deletion needed to be relatively cheap.
[223000350380] |Second, they made all the usually noted errors of reduced vowel confusion (“memoribilia” vs. “memorabilia”).
[223000350390] |Unfortunately, our distance model is single character and context-free, so we can’t easily model the confusion around when consonants are duplicated (“parallel” vs. “parralell” vs. “parallell”).
[223000350400] |Of course, you’re going to add your own cache, aren’t you?
[223000350410] |In both cases, query logs show a huge number of repetitions (cache hits of over 2/3 in the TV-related queries after 20K queries).
[223000350420] |That’s one of the advantages Google has — virtually infinite cache space for things they can cast as indexing/search/lookup problems.
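If you do roll your own cache, a minimal LRU version takes only a few lines on top of java.util.LinkedHashMap; the capacity bound here is illustrative (LingPipe’s util.FastCache is a thread-safe alternative):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache from query to first-best correction.
    public class CorrectionCache extends LinkedHashMap<String,String> {
        private final int mMaxEntries;
        public CorrectionCache(int maxEntries) {
            super(16, 0.75f, true); // access order, so eldest = least recently used
            mMaxEntries = maxEntries;
        }
        protected boolean removeEldestEntry(Map.Entry<String,String> eldest) {
            return size() > mMaxEntries; // evict once over capacity
        }
    }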
[223000350430] |Caching and Pruning
[223000350440] |Combined, these two general techniques have taken our very slow HMM and spelling correctors up to usable speeds.
[223000360010] |I started with com.aliasi.util; the Iterators class and SmallSet were rather challenging for a rookie like me.
[223000410040] |My major complaint about Java when I first saw it was the lack of parametric typing.
[223000410050] |Many, if not most, of my programming errors are from assigning something to the wrong argument.
[223000410060] |I’m hopeless in Perl, for instance.
[223000410070] |As always, be careful what you wish for.
[223000410080] |Little prepared me for the complexity of Java 1.5 generics.
[223000410090] |My hat’s off to the architects and developers of this amazing retro-fit.
[223000410100] |The addition of generics to a non-generic language is a heroic feat of backward compatibility engineering.
[223000410110] |What I should’ve realized before I started, but which only dawned on me about 10 classes in, is that just adding types doesn’t break backward compatibility anywhere.
[223000410120] |That’s why it’s possible to run 1.4-style programs in a 1.5 or 1.6 compiler or JVM.
[223000410130] |Very very cool.
[223000410140] |Luckily, using generic classes is a whole lot easier than writing them.
[223000410150] |Even writing them is easy if you’re only worried about type safety and saving some work, and not on making something general and extensible.
[223000410160] |My main gripe has to do with arrays.
[223000410170] |As in pre-generics Java, they’re the poor cousins of types.
[223000410180] |There’s no way to create arrays of generic types without having a handle on the java.lang.Class object for the type being created.
[223000410190] |For instance, there are two toArray() methods in java.util.Collection.
[223000410200] |Suppose we have a Collection<E>.
[223000410210] |The first array creation method has the signature Object[] toArray(), not E[] toArray() as one might expect, because it’s simply not possible to define the latter.
[223000410220] |The second array creation method has the signature T[] toArray(T[] a), which is problematic in two ways.
[223000410230] |First, you need to pass in an array to fill.
[223000410240] |Second, you can still generate type errors if you do something like this:
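Something like this, with illustrative literals:

    import java.util.ArrayList;
    import java.util.List;

    public class ToArrayTrap {
        public static void main(String[] args) {
            List<Object> list = new ArrayList<Object>();
            list.add("hello");
            list.add(new Integer(42));   // the second line referred to below
            String[] array = list.toArray(new String[0]); // runtime exception here
            System.out.println(array.length);
        }
    }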
[223000410250] |If you omit the second line (in bold), the program compiles and runs with no complaint.
[223000410260] |If you add the second line, the program compiles, but throws a runtime exception during the underlying call to System.arraycopy().
[223000410270] |In LingPipe, I often used arrays exactly because they were type safe compared to returning collections.
[223000410280] |They also tend to be smaller than collections in memory, as there’s no overhead for wrappers, buckets, tree balancing counts, etc.
[223000410290] |Now I’ve painted myself into a corner with generics.
[223000410300] |I’ve had to replace some return types that would’ve been E[] for a type variable E, with List<E>, because it’s the only way I could ensure type safety without requiring a class object (or array) be passed in. Luckily, none of that stuff’s in deep loops.
[223000410310] |That’s all arrays of primitive types, which nobody ever expected to play nicely with generics or anything else.
[223000410320] |As always, my advice to someone wanting to learn Java types is to study the java.util collections framework.
[223000410330] |That, and the Kernighan and Ritchie of Java:
[223000410340] |Ken Arnold, James Gosling and David Holmes. 2006.
[223000410350] |The Java Programming Language, Fourth Edition.
[223000410360] |Addison-Wesley.
[223000410370] |I had to read Chapter 11 (Generic Types) many many many many times.
[223000410380] |Check out Section 16.10.1 (Genericity and Dynamic Arrays) for a deep explanation of what’s going on in the above example.
[223000410390] |Here are some links to complement the book:
[223000410400] |One suggests T[] toArray(Class), which makes me think he didn’t read his own book!
[223000410460] |Update: April 11, 2007
[223000410470] |It occurred to me yesterday that java.util.ArrayList must allocate generic arrays, because it implements List based on arrays.
[223000410480] |Checking out Java’s source shows you how to “cheat” the type system:
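Roughly, simplified from the 1.5-era source (details vary by version):

    // The "cheat": allocate an Object[] and cast it to the generic array
    // type; erasure makes the cast a no-op at runtime.
    public class MyArrayList<E> {
        private E[] elementData;

        public MyArrayList(int initialCapacity) {
            elementData = (E[]) new Object[initialCapacity]; // unchecked warning here
        }

        public E get(int index) {
            return elementData[index];
        }
    }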
[223000410490] |It’s easy to forget that types are simply erased in the implementation.
[223000410500] |This kind of operation is safe if it’s under your control, as it is in java.util.ArrayList and many of the LingPipe internals.
[223000410510] |It’ll still cause the compiler to gripe.
[223000410520] |I think I can live with that.
[223000410530] |Update April 26, 2007: I’m done.
[223000410540] |We’re about to release 3.0.
[223000410550] |But there was one final bug that Breck caught compiling on 1.6.
[223000410560] |Turns out that it’s a known issue with Java 1.6 throwing an assertion error from the compiler.
[223000410570] |It’s already in Sun’s DB of bugs, as bug 6500207.
[223000410580] |It arises with conjunctions of types in generic restriction.
[223000410590] |The bug report has a simple example; it arose for us in a class that wasn’t going out with 3.0 anyway, an EM/soft clusterer with Dirichlet priors, so hopefully it’ll be fixed soon.
[223000420010] |LingPipe 3.0 Released
[223000420020] |We’re happy to announce the release of LingPipe 3.0.
[223000420030] |As usual, you may view it or download it from the LingPipe Home Page.
[223000420040] |Major Release
[223000420050] |The latest release of LingPipe is LingPipe 3.0.0.
[223000420060] |This major release replaces LingPipe 2.4.1.
[223000420070] |Although 3.0 maintains all of the functionality of 2.4, it is not 100% backward compatible.
[223000420080] |All demos and tutorials have been updated to the new API.
[223000420090] |Generics
[223000420100] |The major change that will be evident is that the API has been modified to use generic types.
[223000420110] |This has allowed us to remove many redundant classes and generalize other common behavior.
[223000420120] |Many of the classes now implement java.lang.Iterable
, allowing them to be used in the Java 1.5 foreach construct.
[223000420130] |Most of the type-specific utility methods have been replaced with generic alternatives.
[223000420140] |Keep in mind that like in Java 1.5 itself, you don’t need to use the generics.
[223000420150] |You may continue to write non-generic code against our API in the same way as for the Java 1.5 collections framework.
[223000420160] |Clustering
[223000420170] |The com.aliasi.cluster package was completely rewritten from the ground up.
[223000420180] |Now, rather than clustering the rows of a labeled matrix, a clusterer clusters a set of objects under a distance measure.
[223000420190] |The dendrogram classes for hierarchical clustering results have been cleaned of their indexing behavior, which was only necessary for the previous implementations.
[223000420200] |For the new API, there’s a completely new clustering tutorial, which among other things, uses linguistic examples such as clustering documents by topic or entity mentioned.
[223000420210] |We’ve included Bagga and Baldwin’s John Smith data (197 New York Times articles annotated for which of 35 different John Smiths is mentioned); it’s available as the tarball demos/data/johnSmith.tar.gz.
[223000420220] |LingPipe in Eclipse
[223000420230] |We added a tutorial on LingPipe in Eclipse, which explains how to get started building LingPipe in the Eclipse Integrated Development Environment (IDE).
[223000420240] |Distance and Proximity
[223000420250] |Two generic classes were added to the utility package, Distance and Proximity.
[223000420260] |These are not only used in clustering, but also in the distance functions in the com.aliasi.spell package.
[223000420270] |Matrices and Vectors
[223000420280] |The com.aliasi.matrix package was simplified to remove the complexities of labeling.
[223000420290] |In the future, we plan to build this package out with sparse and memory-mapped matrices.
[223000420300] |Iterators
[223000420310] |Iterators that were formerly in util.Arrays, namely ArrayIterator and ArraySliceIterator, may now be found in the unified Iterators utility class.
[223000420320] |A new Iterators.Empty class was added in order to support genericity; it replaces the overloaded constant.
[223000420330] |Finally, util.SequenceIterator was rolled into util.Iterators along with the others, as util.Iterators.Sequence.
[223000420340] |MEDLINE Parsing Standardized
[223000420350] |The medline.MedlineParser class was modified to implement corpus.Parser.
[223000420360] |At the same time, the class medline.MedlineHandler was modified to implement corpus.Handler.
[223000420370] |The unused corpus.MedlineCitationHandler interface was removed.
[223000420380] |ObjectToCounter Simplified
[223000420390] |The util.ObjectToCounter interface was removed; we only ever used the util.ObjectToCounterMap implementation, a generic version of which remains.
[223000420400] |Unused Classes Removed
[223000420410] |In the code review for generics, we found unused classes in the com.aliasi.coref package, Entity and EntityFactory.
[223000420420] |The class util.SmallArray was removed.
[223000420430] |The interface util.StringDistance was removed; it is replaced with the generic util.Distance interface.
[223000420440] |Finally, the util.Visitor interface was removed; the corpus.Handler interface is doing its job.
[223000430010] |TF/IDF, Scaling and Symmetry
[223000430020] |Search engines typically use TF/IDF scaling only on the query, because it’s too hard to compute dynamically on the documents, leading to an asymmetry in scoring.
[223000430030] |I’m building some discriminative classifiers for LingPipe.
[223000430040] |I just finished with voted kernel perceptrons, which are way cool, and though lighter than SVMs, still awfully heavy.
[223000430050] |A simpler discriminative classifier is based on inverse-document frequency weighting, as is commonly used in search engines.
[223000430060] |In this situation, the weight applied to a search term is inversely related to the number of documents in which it appears.
[223000430070] |For instance, a term appearing in docFrequency documents, where there is a total number of numDocs documents, is typically weighted by a damped inverse ratio like: idf = log(numDocs / docFrequency).
[223000430080] |For base 2 logs, and numDocs=1024, we have idf = 10 for a term appearing in a single document, idf = 5 for a term appearing in 32 documents, and idf = 0 for a term appearing in all 1024 documents.
[223000430090] |Basically, this approach upweights terms that don’t show up in many documents.
[223000430100] |If we are using TF/IDF ranking for classification, IDF weighting provides a parametric form of discriminative feature weighting.
[223000430110] |Term frequencies are typically dampened as well, often with a square root penalty: tf = sqrt(frequency).
[223000430120] |This has the effect of making score rise linearly with the number of overlapping counts, rather than quadratically, as it’d do in the un-dampened raw frequency counts.
[223000430130] |Scoring is typically done by cosine.
[223000430140] |That is, a length-normalized dot product of query and document vectors.
[223000430150] |What I didn’t notice until I dug into the internals of the Apache Lucene search engine and really paid attention to the scoring section in Witten et al.’s Managing Gigabytes is that IDF normalization is only applied to the query vector.
[223000430160] |That is: score(query,doc) = cosine(tfIdf(query), tf(doc)), with IDF weights applied to the query vector only.
[223000430170] |The reason for this is that if doc frequencies and the number of docs continually change, so would the length of a document vector if it were normalized by IDF.
[223000430180] |In this case, every document in the result set would have to be renormalized for every query, requiring multiplications and additions for each term in the document.
[223000430190] |Even if we could afford the arithmetic overhead, we find that reverse indexes don’t even support looking up all of the terms in a document easily, because they’re maps from terms to documents, not the other way around.
[223000430200] |Instead, in dynamic search engines, IDF is not applied to doc vectors, allowing their lengths to be computed as soon as they are added to the index.
[223000430210] |The upshot of only IDF-weighting queries is a surprising asymmetry.
[223000430220] |If we index a document by its text and then create a query out of the same text, the score will be less than 1 (recall that if all entries are positive counts, cosines range between 0 and 1 inclusive).
[223000430230] |To score maximally against a document, what is required is a query with weights that will invert the IDF weighting applied to queries so that the query vector actually matches the document vector.
[223000430240] |That is, the best scoring vector against a document is the vector created from the document term frequencies with inverse IDF weighting.
[223000430250] |For the TF/IDF classifier I’m building into LingPipe, the IDFs are computed for document vectors at compile time, thus ensuring that the only query that will score 1.0 against a document is the query constructed out of the document itself.
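Here’s a toy illustration of that asymmetry, with made-up term weights: score a document against a query built from the document’s own text, applying IDF to the query side only, and the cosine comes out well below 1.0.

    import java.util.HashMap;
    import java.util.Map;

    public class TfIdfAsymmetry {

        // Cosine with IDF weights applied to the query vector only.
        static double score(Map<String,Double> query,
                            Map<String,Double> doc,
                            Map<String,Double> idf) {
            double dot = 0.0, qLen2 = 0.0, dLen2 = 0.0;
            for (Map.Entry<String,Double> e : query.entrySet()) {
                double q = e.getValue() * idf.get(e.getKey());
                Double d = doc.get(e.getKey());
                if (d != null) dot += q * d;
                qLen2 += q * q;
            }
            for (double d : doc.values()) dLen2 += d * d;
            return dot / Math.sqrt(qLen2 * dLen2);
        }

        public static void main(String[] args) {
            Map<String,Double> text = new HashMap<String,Double>(); // damped term frequencies
            text.put("jaro", 1.0);
            text.put("winkler", 1.0);
            Map<String,Double> idf = new HashMap<String,Double>();  // made-up IDF weights
            idf.put("jaro", 10.0);   // rare term
            idf.put("winkler", 1.0); // common term
            // Same text as query and as document still scores below 1.0:
            System.out.println(score(text, text, idf)); // roughly 0.77
        }
    }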
[223000430260] |Of course if you’re a major search engine, and thus matching by boolean match and sorting by some social network metric of influence with some fiddling, none of this really matters.
[223000430270] |I actually miss Excite, which did TF/IDF search.
[223000430280] |Perhaps that’s not surprising given that Doug Cutting, the founder and original developer of Lucene, was their lead search engine developer.
[223000440010] |LingPipe 3.1.0 Released
[223000440020] |LingPipe 3.1.0 has some important bug fixes: small set generics, dictionary chunker case sensitivity, and dictionary training of chunkers.
[223000440030] |There are new classifiers (Perceptron, K-nearest-neighbor and TF/IDF) and a new clusterer (K-means) based on a new feature extraction interface and vector distances.
[223000440040] |There are also some new tutorial sections: chunking text into noun phrases and verb phrases, and dictionary-based entity extraction.
[223000440050] |Read more about and download the latest version from the LingPipe Home Page.
[223000450010] |Is 90% Entity Detection Good Enough?
[223000450020] |Ryan McDonald made the following comment on the NLPers blog: Tracking the State of the Art:
[223000450030] |Discriminative models, rich feature sets and other developments have led to named-entity taggers with accuracies above 90%. This is not only for simple categories like people names and places, but also for complex entity types like genes and chemical compounds.
[223000450040] |Entity taggers are so robust today, that they can often be (and are) used out-of-the-box for many real world applications.
[223000450050] |Could someone point me to 90% accuracy gene taggers?
[223000450060] |Ideally, boxed ones that could be used for real-world applications.
[223000450070] |This is better than the best result from the last Biocreative on a single category, GENE (not that anyone would know, because they haven’t made results public), and that evaluation used soft boundary decisions by allowing multiple “correct” boundaries.
[223000450080] |Results for the GENIA corpus, with the kind of structured categories Ryan’s talking about have been much lower.
[223000450090] |Most of the 90% accuracy of high-accuracy taggers is derived from tagging the same thing right again and again (e.g. “the” and “be” for English POS taggers, “George Bush” and “IBM” for English entity taggers trained on WSJ).
[223000450100] |Entity mentions have a power law (Zipfian) distribution, so most of them aren’t mentioned very frequently.
[223000450110] |Thus 90% per-token accuracy translates into terrible per-type accuracy.
[223000450120] |Especially when one moves away temporally and topically from the training data (e.g. from well-edited, conventionally structured Wall St. Journal news articles to movie or music reviews).
[223000450130] |The situation gets worse if we’re looking for simple relations through collocations.
[223000450140] |Errors in sentences tend to be correlated, not independent, as many folks have noticed over the years.
[223000450150] |Thus we’re likely to get higher than .9 * .9 recall of common binary relations, but much lower than .9 * .9 recall of rarer binary relations.
[223000450160] |For applications, we simply don’t need the correlation between “George Bush” and “White House”, or between “p53″ and “apoptosis”; everyone knows those relations.
[223000450170] |Anyway, this problem is what’s led us to focus on high-recall approaches.
[223000450180] |Check out our named entity recognition tutorial for more information and even examples that run out of the box.
[223000450190] |Ryan McDonald was at the talk at Columbia Uni I gave on entity extraction with high recall, and his point was that often the errors are one of boundaries or of the type assignment, and that it’s really no big deal for an app if the phrase is off by a token or two on the boundaries.
[223000450200] |I suppose that depends on the application.
[223000450210] |Boundary words are often discriminative in gene names or person names, and we’ve had a world of trouble historically in coreference by clustering when the names get off by one.
[223000470010] |Release: LingPipe GUI Named Entity and Chunk Annotator
[223000470020] |We’re soft-releasing our new GUI tool for annotating corpora with named entities or other chunks.
[223000470030] |It’s set up a little differently than other such tools with which you might be familiar, such as Callisto or WordFreak.
[223000470040] |It’s basically token-oriented with combo-box controls, making it look a lot like the CoNLL named entity data in the GUI view.
[223000470050] |The goal was to make it easy to drive from the keyboard rather than working as a text editor with highlight and menu select.
[223000470060] |It does tag-a-little, learn-a-little style semi-automatic annotation (a la MITRE’s seminal Alembic Workbench) and tracks progress through a corpus annotation project.
[223000470070] |It assumes inputs are in a specified XML format.
[223000470080] |For now, the annotator’s in the sandbox, under project name citationEntities.
[223000470090] |The name arose because the project started as something to annotate a bunch of bibliographic data.
[223000470100] |Even though it’s in the sandbox, I’ve used it to annotate a ton of data already and I’ve ironed out all the bugs I’ve found.
[223000470110] |You can find our general instructions for checking projects out of the sandbox at:
[223000470120] |LingPipe Sandbox
[223000470130] |You can also use the following command to download the entire project into the current working directory:
[223000470140] |Just make sure to put it all on one line; it’s on two here because of the blog formatting.
[223000470150] |The place to start is the top-level read-me file:
[223000470160] |Let us know if you have any comments at lingpipe@alias-i.com
[223000480010] |Read/Write Locks for Multi-Threaded Training and Execution
[223000480020] |I finally built an application that exercised the write half of the read/write lock pattern used for most of LingPipe’s classifiers and chunkers (see the recent blog entry describing our chunk annotation GUI).
[223000480030] |Perhaps not coincidentally, we’re getting customer requests and mailing-list requests for the same functionality: the ability to have multiple users (perhaps over the web), add new training data dynamically, while having the server stay up to date.
[223000480040] |It’s an issue for everything from significant phrase extraction to chunking.
[223000480050] |This post will explain how LingPipe’s models need to be synchronized so that they are thread safe under concurrent training and decoding (where decoding may be significant phrase extraction, language model estimation, chunking, classification or tagging).
[223000480060] |Throughout, I’ve implemented our models to be thread-safe for multiple readers.
[223000480070] |That’s not an issue, and if that’s all you want to do, have at it.
[223000480080] |We have demos that run on servlets this way.
[223000480090] |A model is a member object of a servlet and multiple threads access it in the default configuration.
[223000480100] |The problem comes in for writes, which means training in this case, including resetting training and decoding parameters.
[223000480110] |When a thread is writing to a model, no other thread may read or write from the model.
[223000480120] |Luckily, this is a very common pattern of synchronization, known as the read-write lock pattern.
[223000480130] |In addition to generics, the 1.5 JDK finally brought Doug Lea’s wonderful util.concurrent library into the language as java.util.concurrent.
[223000480140] |If you don’t know anything about threads, but like detailed coding books, I can recommend Brian Goetz’s master class textbook, Java Concurrency in Practice, to which Doug Lea, Josh Bloch and other Java API design and textbook gurus contributed.
[223000480150] |In code, read-write locks are almost trivial.
[223000480160] |Here’s what you do.
[223000480170] |Let’s say you have a dynamic classifier:
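(The member names in these sketches are mine, and any trainable LingPipe model works the same way.)

    import com.aliasi.classify.DynamicLMClassifier;

    private DynamicLMClassifier mClassifier; // initialized elsewhere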
[223000480180] |We’ll also need a read-write lock:
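    import java.util.concurrent.locks.ReentrantReadWriteLock;

    private final ReentrantReadWriteLock mRwLock
        = new ReentrantReadWriteLock(true); // true = fair scheduling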
[223000480190] |The parameter true makes the lock fair in that it schedules threads for the lock in the order of requests.
[223000480210] |Then, if we have a piece of training data on a thread, we simply grab the write lock:
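    mRwLock.writeLock().lock();
    try {
        mClassifier.train(category, text); // illustrative training call
    } finally {
        mRwLock.writeLock().unlock();
    }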
[223000480220] |Then do the training inside the scope of the lock, making sure to release the lock in a finally block in case the handler throws an exception (this may be unchecked, such as an underlying out-of-memory exception, or may be a checked exception, such as an I/O exception in some cases).
[223000480230] |As with training, decoding requires acquiring a lock, doing some work, and then freeing the lock in a finally block:
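    Classification c;
    mRwLock.readLock().lock();
    try {
        c = mClassifier.classify(text); // illustrative decoding call
    } finally {
        mRwLock.readLock().unlock();
    }
    // act on c here, outside the locked region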
[223000480240] |Note that only the action requiring synchronization is locked.
[223000480250] |Whatever action the thread needs to take with the resulting chunking will happen after the try/finally block.
[223000480260] |That’s it.
[223000480270] |And it works for any of the LingPipe classes that allow training and execution to be interleaved.
[223000480280] |Just remember that anything that modifies the object itself is a write operation.
[223000480290] |One final note.
[223000480300] |LingPipe’s cache class, com.aliasi.util.FastCache, which may be used, for example, with HMMs, is already thread safe.
[223000480310] |Thus even though it does writes, it doesn’t need the class’s write lock.
[223000480320] |It may be used in models without requiring any modification to the above read/write synchronization pattern.
[223000480330] |For instance, we’ve used it in servlets that do spelling correction to handle caches in that multi-threaded environment.
[223000490010] |Genes are Generic, People are Specific
[223000490020] |Although many reasons are cited for genes and product names being more difficult to find than person names in text, the main linguistic difference is explained by the syntactic and semantic distinction between proper names and common nouns.
[223000490030] |Named entity recognizers for people look for names, like “Pervez Musharraf”.
[223000490040] |Syntactically, person named entity mentions take the form of proper names.
[223000490050] |Semantically, these names refer to a single person.
[223000490060] |By convention in English, they’re capitalized.
[223000490070] |Note that names may be ambiguous in that the name "John Smith" may be used to refer to many different people.
[223000490080] |But each use as a named entity will refer to a single individual.
[223000490090] |Of course, a name may be mentioned as a string without referring to an individual; I can say "There are many John Smiths." without referring to one.
[223000490100] |This is an example of the use-mention distinction.
[223000490110] |Named entity recognizers for genes and proteins look for names of genes or proteins, like P53 or McrBC.
[223000490120] |Syntactically, gene mention names take the form of proper names.
[223000490130] |Semantically, they refer to a kind, and are thus more like common nouns.
[223000490140] |There are many instances of P53 and the use of "P53" doesn’t typically refer to a single sample, much less a single molecule.
[223000490150] |Another distinction is that “p53″ can be used as a mass term, like “water”, or a countable term like “glass(es)” (compare “much water” vs. “many glasses”, or “water is everywhere” with “there are glasses everywhere”).
[223000490160] |In addition to not being conventionally capitalized in English, a fundamental problem with common nouns is that they can be subclassed or superclassed.
[223000490170] |For instance, we see "P53-wt" and "P53-m" for a wild type (natural) or mutated form of P53.
[223000490180] |P53 is also found in everything from mice to zebrafish, so the name may be qualified further for species.
[223000490190] |Similarly, there is more than one known human mutation of p53, so the particular mutant strain may be identified (often by a catalog number).
[223000490200] |The genericity in a gene or protein name is different than the ambiguity in a proper name like "Clinton", which may refer to Bill, Hilary, the fort in New York’s Central Park, or any number of other people, towns, monuments or manufacturers.
[223000490210] |It’s not like these are all types of Clintons the way a cricket is a type of insect.
[223000490220] |Linguistically, you can’t use the word "Clinton" to refer to all of these things the way you can use P53 to refer to all of its variants.
[223000490230] |Because gene “names” act like common nouns, it is particularly difficult to draw the line between plain old common descriptive nouns and more proper-name-like gene names.
[223000490240] |For instance, consider the common noun “tumor suppressor gene”.
[223000490250] |This is a noun like “starting first-baseman”, not a name such as “Jason Giambi”.
[223000490260] |Functional descriptors may also be added to gene names, as in “p53 tumor suppressor protein”.
[223000490270] |Semantically, “tumor suppressor protein” is a non-restrictive modifier, with the whole meaning “protein that suppresses tumors”.
[223000490280] |All by itself, the name “p53″ picks out the type of protein.
[223000490290] |Syntactically, “protein” is the head (root noun) of the phrase (that is, what’s being referred to is a protein, not a tumor).
[223000490300] |With proper names, English syntax requires appositives, such as “Albany, capital of New York State”, or modifiers as in “New York State capital Albany”; with people these may be honorifics, as in the use of “Senator” in “Senator Hilary Clinton”, or appositives, as in “Hilary Clinton, Senator from New York.”
[223000490310] |Most of the other problems with gene-name finding are shared with person-name finding.
[223000490320] |For instance, metonymy is a huge issue in both domains.
[223000490330] |The phrase “New York” may refer to the city, the government, or any of the many sports franchises calling the city home.
[223000490340] |So you find less ambiguous forms, like “New York Yankees”, used in the same way you find “p53 gene”.
[223000490350] |The main difference is that the former is still proper name, whereas the latter is still a common noun syntactically.
[223000490360] |As another example, string variation is highly prevalent in company names (e.g. “IBM”, “I.B.M.”, “International Business Machines, Inc”, “Big Blue”, etc.) which also have multiple subsidiaries (e.g. “IBM Europe”).
[223000490370] |Names transliterated from foreign languages are particularly problematic (e.g. “Al Jazirah”, “Al Jazeera”, “El Jazira”, etc.).
[223000490380] |Nicknames are also popular for celebrities (e.g. “Sean Combs”, “P. Diddy”, “Puff Daddy”; or “Reggie Jackson”, “Mr. October”).
[223000490390] |Product names, by the way, are just like gene names.
[223000490400] |A term such as “Canon EOS” refers to a whole range of different cameras such as “Canon EOS Rebel Xti”.
[223000490410] |There are multiple instances of these cameras in the world, just like P53.
[223000490420] |And that’s why sites like ShopWiki and Froogle have a hard time sorting out a search for “canon eos”.
[223000490430] |In contrast, there’s only one New York City, one New York Yankees, and one Senator Hilary Rodham Clinton.
[223000500010] |LingPipe 3.1.2 Released
[223000500020] |LingPipe 3.1.2 is a minor revision of 3.1.1.
[223000500030] |Here’s the change list:
[223000500040] |Medline Citation Parser
[223000500050] |Properly handle reprinted-in elements.
[223000500060] |TF/IDF Classifier Unknown Features
[223000500070] |Added code to ignore unknown features at classification time.
[223000500080] |Cached Mention Identity Conditions
[223000500090] |The identity conditions have been converted back to those for an Object and the unit tests changed.
[223000500100] |This was to make the code compatible with our forthcoming cross-document coreference code.
[223000500110] |In particular, this will not change the behavior of within-document coreference resolution.
[223000500120] |Dynamic Classifier Handler Interfaces
[223000500130] |Modified classifier evaluators and the dynamic LM classifiers to implement the classification handler interface.
[223000500140] |LM Handler Interfaces
[223000500150] |The trainable LM classes now implement the text handler interface.
[223000500160] |Static Converter in Chunk Handler Adapter
[223000500170] |Added a static method to convert chunkings to BIO-taggings.
[223000500180] |Spelling Sizing and Tuning
[223000500190] |There’s an added section on sizing and tuning the spelling corrector.
[223000510010] |Batch vs. Online Learning : Handler, Parser, and Corpus APIs
[223000510020] |Online learning assumes that training cases are presented one at a time in a stream and the model may be output at any point.
[223000510030] |Batch learning assumes the training cases are all available at once.
[223000510040] |Note that this distinction is independent of the generative/discriminative distinction.
[223000510050] |Simple frequency-based discriminative models could easily be online whereas latent mixture generative models may require batch processing for estimation.
[223000510060] |Most of LingPipe’s online trainable models are also online executable (e.g. language models, LM-based classifiers and taggers, etc.).
[223000510070] |This means you can interleave training and executing without compiling in between.
[223000510080] |Some models may require compilation after being given online training (e.g. the spelling trainer), but that’s only a matter of efficient execution — the decoder is written against the compiled LM implementation, not the generic LM interface.
[223000510090] |For online learning, we use the Handler marker interface.
[223000510100] |Sub-interfaces define actual handlers, such as the ChunkHandler, which defines a single method handle(Chunking).
[223000510110] |The idea is that training data is incrementally provided through the handle() methods of the various handlers.
[223000510120] |Although the handle methods of handlers may be called directly, they are typically called by parsers.
[223000510130] |The Parser<H> (where H extends Handler) interface is generic, requiring the type of its handler to be specified.
[223000510140] |The setHandler(H) and getHandler() methods manage the handlers.
[223000510150] |A wide range of parse() methods accept arguments for the input data to be parsed, which may be presented as a file, input source, character array slice or character sequence.
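For instance, pushing a file of training data through a chunk parser into a trainable chunker looks roughly like this sketch (the parser class, handler variable and file name are illustrative):

    import java.io.File;
    import com.aliasi.corpus.ChunkHandler;
    import com.aliasi.corpus.Parser;
    import com.aliasi.corpus.parsers.Muc6ChunkParser;

    Parser<ChunkHandler> parser = new Muc6ChunkParser();
    parser.setHandler(chunkerTrainer);           // chunkerTrainer implements ChunkHandler
    parser.parse(new File("training-data.xml")); // invokes chunkerTrainer.handle(Chunking)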
[223000510160] |This setup is a classic instance of the visitor pattern.
[223000510170] |The individual models with handler methods don’t need to know anything about looping over data.
[223000510180] |Without the pattern baggage, this is simply what’s known as a callback.
[223000510190] |In Java, the executable code is encapsulated as an object, which usually implements some interface for abstraction from implementation.
[223000510200] |In C, you just have a function pointer used as a callback, and there’s no type checking.
[223000510210] |The setup should be familiar from XML parsing using SAX.
[223000510220] |A SAX parser must implement the XMLReader interface, whereas a SAX (content) handler implements the ContentHandler interface.
[223000510230] |This is sometimes called an event-based model, with the callbacks from the reader to the handler being called "events".
[223000510240] |The nice thing about LingPipe’s implementation is that many of the gory details are implemented in two abstract classes, InputSourceParser and StringParser.
[223000510250] |The input source parser implements the string parsing methods in terms of input source parsing, and the string parser does the reverse.
[223000510260] |These are instances of the adapter pattern, which is used elsewhere in LingPipe.
[223000510270] |There is also an adapter for the case where the input’s XML, namely the XMLParser subclass of InputSourceParser.
[223000510280] |By allowing plug-and-play with parsers, we do not require our data to be in any standard on-disk representation.
[223000510290] |We need merely write parsers for representations as they arise.
[223000510300] |We prefer not to munge underlying data on disk if we can help it, as it almost invariably becomes a version management nightmare.
[223000510310] |So everything’s simple and standard for online parsers.
[223000510320] |But how do we handle batch processing?
[223000510330] |If you’re used to thinking of files, the idea’s that you’d just run your loops multiple times over the files.
[223000510340] |But our parser/handler setup isn’t quite so simple, as there’s no guarantee the data’s ever in files in a simple way.
[223000510350] |Instead, we introduce a higher-order abstraction, the Corpus<H> interface, where H extends Handler.
[223000510360] |This interface specifies a method visitCorpus(H), where the argument is a handler.
[223000510370] |The idea is that the handler is passed the entire corpus, one argument at a time.
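For example, a small in-memory corpus of classified texts might look like this sketch (TEXTS and CATS are illustrative parallel arrays):

    import com.aliasi.classify.Classification;
    import com.aliasi.classify.ClassificationHandler;
    import com.aliasi.corpus.Corpus;

    Corpus<ClassificationHandler<CharSequence,Classification>> corpus
        = new Corpus<ClassificationHandler<CharSequence,Classification>>() {
            public void visitCorpus(ClassificationHandler<CharSequence,Classification> handler) {
                // feed every training case to the handler, one at a time
                for (int i = 0; i < TEXTS.length; ++i)
                    handler.handle(TEXTS[i], new Classification(CATS[i]));
            }
        };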
[223000510380] |There’s also a convenience abstract implementation for on-disk corpora, DiskCorpus, which is constructed with a parser and disk directory.
[223000510390] |Our batch learning methods don’t define handlers; they require corpora to be given to them wholesale.
[223000510400] |For instance, the PerceptronClassifier takes a corpus argument in its constructor.
[223000510410] |It is then able to call the corpus’s visitCorpus() method as many times as is necessary to achieve convergence or max out the number of allowed visits.
[223000510420] |This looks more complicated than it is in practice.
[223000510430] |If you’re not convinced, you’ll find further grist for your mill by reading Peter Norvig’s take on "patterns" (pitting Lisp against Java).
[223000530010] |But does it do Greek? Unicode, Java Chars and the BMP
[223000530020] |Yes, Extended Greek will work just fine in LingPipe.
[223000530030] |LingPipe processes at the char level, not at the byte level.
[223000530040] |Extended Greek involves characters in the 0x1F00 to 0x1FFF range.
[223000530050] |That’s in the basic multilingual plane (BMP) of unicode.
[223000530060] |The Ancient Greek numbers at 0x10140 to 0x1018F could be problematic if you need them, depending on what you need to do.
[223000530070] |Even these should work OK in most of LingPipe.
[223000530080] |The significance of being in the BMP is that the UTF-16 coding and hence Java’s char encoding is transparent.
[223000530090] |Specifically, for a code point in the BMP, the UTF-16 is just the two-byte sequence making up the unsigned integer representation of the code point.
[223000530100] |For any such code points, Java’s char representation is just the unsigned integer corresponding to the code point.
[223000530110] |The significance of all this for LingPipe is that we treat char values in java.lang.String and java.lang.CharSequence as characters.
[223000530120] |Thus if we have a 5-gram language model, we treat that as 5 chars.
[223000530130] |The problem for code points beyond the BMP is that they take more than one char in Java to represent.
[223000530140] |For these code points, Java’s String.length() is not the length of the string in terms of number of code points, but in terms of number of chars.
[223000530150] |Thus it actually overstates the length.
[223000530160] |Non-BMP code points can be problematic depending on the tokenizers used, which we define at the char level, not at the code point level.
[223000530170] |So you could write a nefarious tokenizer that split surrogate pairs of chars representing non-BMP code points into different tokens.
[223000530180] |Our character n-gram tokenizers are nefarious in this way, but even they won’t necessarily be problematic in their intended applications (like classification).
[223000530190] |As long as you’re not nefarious, and don’t mind a little inaccuracy in things like per-char cross-entropy rates due to the length, everything should actually work OK.
[223000530200] |If you still tokenize on whitespace, you should be OK with non-BMP code points in our classifiers, POS taggers and chunkers.
[223000530210] |Places where they won’t make sense is in things like edit distance, where our API is defined in terms of chars, not code points.
[223000530220] |And string comparison like TF/IDF over character n-grams will be off in terms of using chars, not code points, so the dimensionality winds up being a little higher.
[223000530230] |For applications like classification, this shouldn’t matter.
[223000530240] |Of course, it’d be possible to write a proper n-gram code point (vs. char) tokenizer by taking the surrogate pairs into account.
[223000530250] |Here’s a little test program that doesn’t involve LingPipe so you (and I) can see what’s up with the encodings:
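Something like the following (the specific code points are illustrative; 0x1F00 is in extended Greek, and 0x10142 is one of the Ancient Greek numbers beyond the BMP):

    public class BmpTest {
        public static void main(String[] args) throws Exception {
            String bmp = new String(Character.toChars(0x1F00));     // extended Greek, BMP
            String nonBmp = new String(Character.toChars(0x10142)); // beyond the BMP

            System.out.println("0x1F00  chars=" + bmp.length()              // 1
                + " codePoints=" + bmp.codePointCount(0, bmp.length()));    // 1
            System.out.println("0x10142 chars=" + nonBmp.length()           // 2
                + " codePoints=" + nonBmp.codePointCount(0, nonBmp.length())); // 1
            System.out.println("0x10142 UTF-16BE bytes="
                + nonBmp.getBytes("UTF-16BE").length); // 4: a surrogate pair
        }
    }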
[223000540010] |Why Didn’t You Win the Netflix Prize?
[223000540020] |Netflix awarded its first progress prize in the Netflix Prize competition.
[223000540030] |The winner was a group from AT&T Research consisting of Robert Bell, Yehuda Koren and Chris Volinsky.
[223000540040] |According to Netflix’s press release on the award, the AT&T team spent 2000 hours working on this project.
[223000540050] |That’s a full-time, don’t go to any meetings, don’t take any vacation, person-year.
[223000540060] |And it pretty much sums up why we weren’t even in the running.
[223000540070] |How’d AT&T Do It?
[223000540080] |Here’s all the juicy details, in AT&T’s Disclosure Tech Report.
[223000540090] |(Props to AT&T for letting them release this stuff.)
[223000540100] |I’m not surprised by how they did it.
[223000540110] |I doubt anyone who’s studied machine learning in the past ten years will be surprised.
[223000540120] |The goal was to reduce prediction variance. So they built an ensemble (aka committee).
[223000540130] |Of course they built an ensemble.
[223000540140] |It’s the most widely practiced method for reducing the variance of a predictor.
[223000540150] |AT&T used a really big ensemble.
[223000540160] |One with 107 reasonably tuned learners, drawn from a half dozen classes or so.
[223000540170] |They implemented a rich K nearest neighbors approach, Toronto’s Boltzman machine approach, singular value decompositions (standard L2 and Lasso-style L1 norms with regularization), some specialized factor models borrowed from Arek Paterek (another Netflix prize participant), and several flavors of regression model.
[223000540180] |How Good is a 10% Reduction in Error?
[223000540190] |The US$1M prize is for a 10% reduction in Netflix’s current system’s error.
[223000540200] |The current Netflix system has an (average) error of around 0.951, and the US$1M prize would go to a system with error of around 0.856.
[223000540210] |That’s on a 1-5 scale.
[223000540220] |That’s an improvement of only 1/10th of a star.
[223000540230] |The subsequent error, +/- 0.86 stars, is relatively large.
[223000540240] |Because there are only 4 stars of range (5-1=4), the progress prize is for a system with 21.4% estimation error overall (.85/4), down from the 23.8% or so of Netflix’s (stripped down) Cinematch algorithm.
[223000540250] |That’s pretty much the difference between 3 stars (liked it) and 2 stars (didn’t like it).
[223000540260] |Put another way, if Netflix used a 0-100 scale, these algorithms would be more than 20 points off in their predictions on average.
[223000540270] |(It’s actually a little more subtle than that: the contest measures root mean square error, not mean absolute error, which shrinks the effect of small errors (less than 1 star) and inflates the effect of larger ones, but the point still applies.)
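To see the difference with some made-up numbers: if a system is off by 0.5 stars on one prediction and 1.5 stars on another, the mean absolute error is (0.5 + 1.5)/2 = 1.0, but the root mean square error is sqrt((0.25 + 2.25)/2) ≈ 1.12, because the single large error dominates.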
[223000540280] |The naive system that predicts the average rating for a user and average rating for the movie has something like a 1.1 error.
[223000540290] |It’s very difficult to squeak out performance on such noisy data.
[223000540300] |It may simply not be possible to win the grand prize due to inherent variability of people’s ratings.
[223000540310] |Will Netflix Use It?
[223000540320] |The AT&T approach is unlikely to be practical for Netflix.
[223000540330] |Their approach involved combining hundreds of predictors, each of which is very slow to train or run, and most of which can only be run as-is on static data collections.
[223000540340] |Why’d AT&T (or Anyone) Participate?
[223000540350] |In the machine learning community, it’s common to participate in bakeoffs, even if there are no prizes at all.
[223000540360] |For instance, the 2007 KDD Cup used the Netflix data to run a bakeoff to predict which movies a user would rent and how many times a given title would be rented.
[223000540370] |These are perhaps even more pressing problems for Netflix.
[223000540380] |The research community’s happy to play along because we’re starved for real data on which to test our models.
[223000540390] |The researchers even published their results.
[223000540400] |The AT&T system was only possible, as they themselves acknowledge, because several other teams published their results, which were then incorporated as predictors into AT&T’s system.
[223000540410] |Would LingPipe Help?
[223000540420] |We’re about to roll out the next release of LingPipe, which contains an implementation of one of the factor models: (ridge) regularized singular value decomposition.
[223000540430] |This model alone gets you to 0.91 error on Netflix, but like the others, is both static and expensive to compute (though runtime predictions are fast).
[223000540440] |I finally sorted out the regularized least squares stochastic gradient descent algorithm by following an implementation published through the Netflix Prize forums (full citations forthcoming in the javadoc and tutorial) and a stack of books with chapters on regression (e.g., Hastie et al., Bishop, Gelman et al.).
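The heart of the algorithm is a simple per-rating update. Here’s a minimal sketch of my own (not LingPipe’s matrix.SvdMatrix class itself; the learning rate, regularization parameter, factor count and array layout are all placeholders you’d tune):

    // Sketch of stochastic gradient descent for regularized (ridge)
    // SVD-style matrix factorization over observed ratings only.
    public class SvdSgdSketch {
        public static void train(int[] users, int[] items, double[] ratings,
                                 double[][] u, double[][] v,
                                 double learningRate, double regularization,
                                 int epochs) {
            int numFactors = u[0].length;
            for (int epoch = 0; epoch < epochs; ++epoch) {
                for (int n = 0; n < ratings.length; ++n) {
                    int i = users[n];
                    int j = items[n];
                    // prediction is the dot product of user and item factor vectors
                    double pred = 0.0;
                    for (int k = 0; k < numFactors; ++k)
                        pred += u[i][k] * v[j][k];
                    double err = ratings[n] - pred;
                    // gradient step with L2 shrinkage on both factor vectors
                    for (int k = 0; k < numFactors; ++k) {
                        double uik = u[i][k];
                        u[i][k] += learningRate * (err * v[j][k] - regularization * uik);
                        v[j][k] += learningRate * (err * uik - regularization * v[j][k]);
                    }
                }
            }
        }
    }

Note that only the observed (user, movie, rating) triples are visited, which is what makes the approach feasible for partial matrices like Netflix’s.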
[223000540450] |You’ll also see that AT&T used some text features in comparing films in their KNN classifiers.
[223000540460] |There’s a wide range of other features (like actors, directors, rating, length, country of origin, etc.) that could be extracted, aligned with the film, and used.
[223000540470] |I also implemented an EM-estimated variant of probabilistic latent semantic analysis (pLSA), which only achieved a 0.95 error.
[223000540480] |That may go into LingPipe at some time, as we’re working on general factor models (such as latent Dirichlet allocation and EM clustering) for text and document clustering.
[223000540490] |I think it’d work better with Gibbs sampling; the EM really gets stuck.
[223000550010] |LingPipe 3.2.0 Released
[223000550020] |The latest release of LingPipe is LingPipe 3.2.0.
[223000550030] |This release replaces LingPipe 3.1.2, with which it is backward compatible with one exception (see next section).
[223000550040] |Backward Incompatibility
[223000550050] |The p-value methods in stats.BinomialDistribution and stats.MultinomialDistribution have been removed.
[223000550060] |The javadoc for the classes includes the code for the removed implementations, which were based on the Jakarta Commons Math library.
[223000550070] |Zero External Dependencies
[223000550080] |The reason we removed the p-value methods is that they were the last remaining piece of functionality in LingPipe that required external dependencies.
[223000550090] |As of this release, there is no longer a dependency on the Jakarta Commons Math library or the Apache Xerces XML libraries.
[223000550100] |The XML functionality has been folded into Java itself.
[223000550110] |New Features
[223000550120] |Singular Value Decomposition
[223000550130] |We’ve included an implementation of singular value decomposition.
[223000550140] |It uses stochastic gradient descent, so is scalable to large matrices and allows partial matrix input (matrices with some unknown values).
[223000550150] |The implementation is encapsulated in a single class, matrix.SvdMatrix, the javadoc of which explains the mathematics of SVD and our implementation.
[223000550160] |SGML Normalizer
[223000550170] |We were tired of writing one-off SGML entity normalizers, so we imported a class representing the entire standard, util.Sgml.
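Entity normalization itself is simple in outline; here’s a toy version of my own (util.Sgml’s actual table covers the full standard, with hundreds of entities):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy SGML entity normalizer: maps named entities to their characters.
    public class SgmlSketch {
        private static final Map<String,String> ENTITIES = new HashMap<>();
        static {
            ENTITIES.put("amp", "&");
            ENTITIES.put("lt", "<");
            ENTITIES.put("gt", ">");
            ENTITIES.put("quot", "\"");
            ENTITIES.put("eacute", "\u00E9");
            // ... the real standard defines hundreds more
        }
        private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

        public static String normalize(String in) {
            Matcher m = ENTITY.matcher(in);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                // unknown entities are left untouched
                String replacement = ENTITIES.getOrDefault(m.group(1), m.group());
                m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(sb);
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(normalize("caf&eacute; &amp; bar")); // café & bar
        }
    }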
[223000550180] |Line Tokenizer Factory
[223000550190] |To support our work on document parsing (text extraction, bibliography extraction, e-mail signature extraction, etc.), we’ve included a new tokenizer, tokenizer.LineTokenizerFactory, that produces tokens based on lines of text.
[223000550200] |It is used in the sandbox project citationEntities.
[223000550210] |Soundex Tokenizer Filter
[223000550220] |We’ve provided an implementation of Soundex as a tokenizer filter in tokenizer.SoundexFilterTokenizer.
[223000550230] |Soundex reduces words to a simplified representation based on their (English) pronunciation.
[223000550240] |We were mainly motivated by exploring features in our feature-based classifiers.
[223000550250] |The javadoc contains a complete description of the Soundex algorithm.
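For the impatient, here’s the classic American Soundex algorithm in a minimal stand-alone sketch of my own (not the LingPipe filter itself):

    // Minimal American Soundex: first letter plus three digits coding
    // the following consonants, e.g. "Robert" and "Rupert" both -> R163.
    public class SoundexSketch {
        private static int code(char c) {
            switch (Character.toUpperCase(c)) {
                case 'B': case 'F': case 'P': case 'V': return 1;
                case 'C': case 'G': case 'J': case 'K':
                case 'Q': case 'S': case 'X': case 'Z': return 2;
                case 'D': case 'T': return 3;
                case 'L': return 4;
                case 'M': case 'N': return 5;
                case 'R': return 6;
                default: return 0; // vowels, H, W and Y are not coded
            }
        }

        public static String soundex(String word) {
            if (word == null || word.length() == 0) return "";
            StringBuilder sb = new StringBuilder();
            sb.append(Character.toUpperCase(word.charAt(0)));
            int prev = code(word.charAt(0));
            for (int i = 1; i < word.length() && sb.length() < 4; ++i) {
                char c = Character.toUpperCase(word.charAt(i));
                int cur = code(c);
                if (cur != 0 && cur != prev)
                    sb.append(cur); // skip repeats of the same code
                if (c != 'H' && c != 'W')
                    prev = cur; // H and W are transparent to adjacency
            }
            while (sb.length() < 4)
                sb.append('0'); // pad to letter + three digits
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(soundex("Robert"));   // R163
            System.out.println(soundex("Rupert"));   // R163
            System.out.println(soundex("Ashcraft")); // A261
        }
    }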
[223000550260] |Distances and Proximities
[223000550270] |We made sure that if one of the interfaces util.Distance or util.Proximity was implemented by an object, so was the other.
[223000550280] |This allows them all to be plugged into distance-based clusterers and classifiers (e.g. k nearest neighbors).
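The usual pattern is to define one in terms of the other. Here’s a sketch with stand-in interfaces (each of LingPipe’s real interfaces declares a single two-argument method; the 1/(1+d) conversion is just one common choice, not necessarily ours):

    // Stand-in interfaces mirroring the shape of util.Distance and util.Proximity.
    interface Distance<E> { double distance(E e1, E e2); }
    interface Proximity<E> { double proximity(E e1, E e2); }

    // A toy measure implementing both: proximity shrinks as distance grows.
    class LengthDifference implements Distance<String>, Proximity<String> {
        public double distance(String s1, String s2) {
            return Math.abs(s1.length() - s2.length());
        }
        public double proximity(String s1, String s2) {
            return 1.0 / (1.0 + distance(s1, s2)); // in (0,1], 1 for distance 0
        }
    }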
[223000550290] |Tutorials
[223000550300] |Singular Value Decomposition
[223000550310] |The SVD Tutorial walks through applications to latent semantic indexing and basic n-gram smoothing.
[223000550320] |It also covers all the tuning parameters (learning rate and annealing, initial values, early stopping, regularization, etc.).
[223000550330] |Word Sense Disambiguation Tutorial
[223000550340] |The Word Sense Tutorial provides details on creating a complete Senseval word sense disambiguation system using contextual classification.
[223000550350] |Word sense disambiguation is the problem of determining which dictionary entry (for a given dictionary) corresponds to a given token of a word in a text.
[223000560010] |Alias-i is Hiring
[223000560020] |Alias-i, the makers of the LingPipe suite of NLP tools, is hiring one or two developers to work on:
[223000560030] |Natural language processing API development
[223000560040] |Natural language processing application development
[223000560050] |Enterprise architecture design and integration
[223000560060] |User interface design and coding
[223000560070] |We need one full time equivalent position devoted to our Phase II Small Business Innovation Research (SBIR) grant from the U.S. National Institutes of Health (NIH).
[223000560080] |Our goal is a topic- and entity-faceted search engine making heavy use of query refinement and topic discovery, and suited to queries derived from high throughput lab experiments.
[223000560090] |We are collaborating with researchers at the Harvard University Medical School, who have real biology problems to solve.
[223000560100] |We need part of a full time position devoted to improving our “source released” commercial product, LingPipe, a suite of Java NLP tools.
[223000560110] |This will involve expanding current packages as well as developing new functionality.
[223000560120] |A strong background in machine learning and statistical methods is a big plus.
[223000560130] |This project overlaps with the NIH project.
[223000560140] |We need at least half of a full time equivalent position devoted to enterprise application development and integration.
[223000560150] |Our customers range from small startups to the Fortune 500.
[223000560160] |In 2007, we have delivered product licenses and consulting for structured database deduplication, part-of-speech tagging, named entity extraction and database coreference, classification, and spell checking.
[223000560170] |We’re located in a sunny loft office in Brooklyn, New York, just across the East River from downtown Manhattan, and convenient to public transportation.
[223000560180] |We are not considering full-time telecommuters.
[223000560190] |If you’re interested, please send us an electronic copy of your resume, a cover note briefly explaining how your skills and interests match our requirements, and a description of your visa status for working in the United States.
[223000560200] |No headhunters, please.
[223000560210] |Contact: Breck Baldwin
[223000560220] |E-mail: jobs@alias-i.com
[223000560230] |Home: http://www.alias-i.com/
[223000560240] |LingPipe: http://www.alias-i.com/lingpipe
[223000560250] |About: http://www.alias-i.com/lingpipe/web/about.html
[223000570010] |(Junior) Varsity Interview: String Reverse
[223000570020] |We’re hiring, so interviews have been on my mind.
[223000570030] |I’m cross-posting what was mostly a response to reading Stephans Blog [sic], which was itself a riff on Joel on Software’s Guerilla Guide to Interviewing.
[223000570040] |Both ask us to look at the string reverse problem.
[223000570050] |Apparently, if you are interviewing at a tech company, at some point along the way you’re going to be asked to reverse a string.
[223000570060] |Or a sequence of tokens.
[223000570070] |Really.
[223000570080] |No, really, I’m not kidding.
[223000570090] |All of the answers I’ve seen are pure junior varsity when it comes to string processing.
[223000570100] |Before reversing a “string”, we need to get clear on what we want.
[223000570110] |At one level, a Java String object is just an array of chars.
[223000570120] |At another level, it represents a sequence of unicode code points.
[223000570130] |The tension arises from the fact that String(char[]) lets you construct a string with a sequence of chars that don’t correspond to valid unicode code points.
[223000570140] |This tension was ratcheted up in the 1.5 JDK when they moved to Unicode 4.0.
[223000570150] |In the 1.5 JDK, Sun changed the behavior of the StringBuffer.reverse() method in a way that was not backward compatible with 1.4.
[223000570160] |That is, there are StringBuffer instances for which reverse() in 1.4 returns a different value than in 1.5.
[223000570170] |The 1.5 version of reverse() is sensitive to surrogate pairs (unicode code points requiring more than 16 bits, and hence more than one char, in UTF-16).
[223000570180] |In 1.5, both java.lang.StringBuilder and java.lang.StringBuffer use the implementation of reverse() from their common superclass, java.lang.AbstractStringBuilder.
[223000570190] |Here’s the first paragraph of doc for java.lang.StringBuffer.reverse() from JDK 1.5:
[223000570200] |“Causes this character sequence to be replaced by the reverse of the sequence.
[223000570210] |If there are any surrogate pairs included in the sequence, these are treated as single characters for the reverse operation.
[223000570220] |Thus, the order of the high-low surrogates is never reversed.”
[223000570230] |And here’s the first paragraph of doc from java.lang.StringBuffer.reverse() from JDK 1.4:
[223000570240] |“The character sequence contained in this string buffer is replaced by the reverse of the sequence.”
[223000570250] |Following Stephan’s suggestion to use the built-in thus has either a good or a bad side effect, depending on how you look at it.
[223000570260] |Moving from 1.4 to 1.5 either breaks backward compatibility for the string-as-char-sequence representation, or appropriately handles Unicode 4.0’s supplementary code points in the string-as-code-point-sequence representation.
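To make the difference concrete, here’s one way to write both reverses yourself (my sketch, not the JDK’s code; repairing flipped pairs after a blind reverse is one way to get the 1.5 behavior):

    // Naive char reverse vs. surrogate-pair-aware reverse.
    public class ReverseDemo {
        // Reverses chars blindly; splits surrogate pairs (the 1.4 behavior).
        static String reverseChars(String s) {
            char[] cs = s.toCharArray();
            for (int i = 0, j = cs.length - 1; i < j; ++i, --j) {
                char tmp = cs[i]; cs[i] = cs[j]; cs[j] = tmp;
            }
            return new String(cs);
        }

        // Reverses chars, then re-swaps any surrogate pairs the blind
        // reverse flipped, so pairs stay in high-low order (the 1.5 behavior).
        static String reverseCodePoints(String s) {
            char[] cs = reverseChars(s).toCharArray();
            for (int i = 0; i + 1 < cs.length; ++i) {
                if (Character.isLowSurrogate(cs[i])
                        && Character.isHighSurrogate(cs[i + 1])) {
                    char tmp = cs[i]; cs[i] = cs[i + 1]; cs[i + 1] = tmp;
                    ++i; // skip past the repaired pair
                }
            }
            return new String(cs);
        }

        public static void main(String[] args) {
            // 'a', U+1D11E (a non-BMP code point), 'b'
            String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
            System.out.println(reverseCodePoints(s).equals(
                new StringBuilder(s).reverse().toString())); // true on 1.5+
        }
    }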
[223000570270] |Extra credit 1: Recursion won’t work because it’ll blow out the stack if we’re using Sun’s JDKs, which (at least so far) don’t perform tail recursion optimization (a kind of last call optimization).
[223000570280] |Extra credit 2: The exception thrown when trying to reverse a null string should be a null pointer exception.
[223000570290] |That’s how Sun codes the JDK itself (see, e.g., the java.lang.String.String(String) constructor).
[223000570300] |It’s a runtime exception because it’s a coding error to send reverse(String) a null string (assuming you want behavior like your call to StringBuffer.reverse()).
[223000570310] |It should be a null pointer exception, as that’ll lead you right to the problem while debugging.
[223000570320] |Do I get a callback for a second interview?