[223001260010] |Writing a threaded unit test reminded me to use the java.util.concurrent libraries for execution rather than managing my own threads.
[223001260040] |It also reminded me that without variables being synchronized on both sides or being declared volatile, changes are not guaranteed to be visible across threads.
[223001260050] |There aren’t member variables changing here, but there are values in arrays that change.
[223001260060] |So I wondered about the liveness of LingPipe’s util.FastCache thread-safe cache.
[223001260070] |The existing tests were for safety, not liveness.
[223001260080] |What if every thread had to recreate its own cache entries?
[223001260090] |First, a batch of imports:
[223001260100] |Then the definition of the java.lang.Runnable, instances of which will be run in multiple threads simultaneously.
[223001260110] |The first part just sets up the member variables, which store the cache, number of entries, and number of cache hits:
[223001260120] |An instance of FastCache implements java.util.Map; in this case, the cache maps integers to integers.
[223001260130] |In our HMM decoder implementation, the cache maps strings (tokens) to arrays of doubles (emission probabilities indexed by tag/category ID).
[223001260140] |The meat is in the run() method:
[223001260150] |Basically, it looks up each entry in the cache, and if it finds it, increments the number of hits, and if it doesn’t find it, it adds it to the cache.
[223001260160] |It sleeps a small random number of milliseconds for each entry; without the sleep, the first thread populates the cache before the others even get going!
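A minimal sketch of the kind of Runnable described (class and member names are mine; the cache is handled through the java.util.Map interface that FastCache implements):

import java.util.Map;
import java.util.Random;

// Sketch of a test worker: look up each entry, count hits, recreate misses,
// and sleep briefly so other threads get a chance to interleave.
class CacheTestRunnable implements Runnable {
    private final Map<Integer,Integer> mCache;
    private final int mNumEntries;
    private int mHits = 0;
    private final Random mRandom = new Random();

    CacheTestRunnable(Map<Integer,Integer> cache, int numEntries) {
        mCache = cache;
        mNumEntries = numEntries;
    }

    public int hits() {
        return mHits;
    }

    public void run() {
        for (int i = 0; i < mNumEntries; ++i) {
            Integer value = mCache.get(i);
            if (value != null)
                ++mHits;            // found it: count a cache hit
            else
                mCache.put(i, i);   // missed: recreate the entry
            try {
                Thread.sleep(mRandom.nextInt(5));  // let other threads in
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}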
[223001260170] |The actual test harness is run through JUnit (3.8 —I haven’t gone over to the annotation version —that’s coming up, too).
[223001260180] |Here’s the method:
[223001260190] |It’s very straightforward: set up the cache, then create an array for all of the test class instances, then create the executor, execute the test classes, then shut down the executor, wait for all the threads it’s working on to terminate, validate the results, then calculate the number of cache hits to print. Note that the values are chosen so the cache’s size and load factor guarantee it’ll have to expire entries.
[223001260200] |Here’s what the run produces on my dual quad-core Xeon Windows Vista 64 workstation:
[223001260210] |Overall, I’m happy with the way this works.
[223001260220] |Of course, any of you thread gurus out there who want to help make it better, I’m all ears.
[223001260230] |I’m still a relative novice.
[223001270010] |Lucene queries are built out of the index.Term class, which is constructed with a field and a textual value for that field.
[223001270070] |The simplest possible query is a search.TermQuery, which consists of a single term.
[223001270080] |So far, no problem.
[223001270090] |But what about a search.RangeQuery?
[223001270100] |It’s constructed using a pair of terms.
[223001270110] |There’s a note in the constructor that both terms must be from the same field.
[223001270120] |What’s the problem?
[223001270130] |Range queries shouldn’t be built out of two field/text pairs, but rather out of one field and two text objects.
[223001270140] |Phrase queries (search.PhraseQuery) have the same problem as range queries, in that they should only apply to terms with the same field.
[223001270150] |Although the add() method checks that added terms have the same field and throws an exception if they don’t, there is no warning in the javadoc.
[223001270160] |The solution to the problem is simple.
[223001270170] |Split the structure of queries into two types, one for the text part of a query and one for restricting it to fields.
[223001270180] |It’s easiest to see in BNF, though what I’m really talking about is object structure, not the query syntax:
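As a hypothetical object-structure sketch (my own types, not Lucene’s), the split might look like this:

// Hypothetical query structure: text-level queries know nothing about fields;
// a fielded query pairs a single field name with a text-level query.
interface TextQuery { }

class TermTextQuery implements TextQuery {
    final String text;
    TermTextQuery(String text) { this.text = text; }
}

class RangeTextQuery implements TextQuery {
    final String low;
    final String high;   // two text bounds; the single field is supplied by the wrapper
    RangeTextQuery(String low, String high) { this.low = low; this.high = high; }
}

class PhraseTextQuery implements TextQuery {
    final String[] terms;  // all terms implicitly share the enclosing field
    PhraseTextQuery(String... terms) { this.terms = terms; }
}

class FieldedQuery {
    final String field;
    final TextQuery textQuery;
    FieldedQuery(String field, TextQuery textQuery) {
        this.field = field;
        this.textQuery = textQuery;
    }
}

With this structure a range or phrase spanning two fields simply can’t be constructed, so the single-field constraint becomes a property of the types rather than a runtime check.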
[223001270190] |The logical operations all distribute through fields, so a query like [AUTHOR: (Smith OR Jones)] is equivalent to ([AUTHOR:Smith] OR [AUTHOR:Jones]).
[223001270200] |An advantage the query object representation has over the query syntax is that there are no default fields.
[223001270210] |Lucene’s query syntax, on the other hand, allows a fielded query to consist of a term query.
[223001270220] |The use of an analyzer in constructing an actual query parser then fills in the missing field.
[223001270230] |The class search.WildcardQuery is equally problematic in that it takes a term as an argument.
[223001270240] |It’s overloading the textual value of the term to represent not a term, but a string with structure, including special characters for the multiple-character (*) and single-character (?) wildcards.
[223001270250] |But what if I want a question mark or asterisk in my term itself?
[223001270260] |The query syntax handles this problem, but not the object representation.
[223001270270] |The classes search.PrefixQuery and search.FuzzyQuery have the same problem with conflating terms and a syntax for search.
[223001270280] |Phrase queries and boolean queries have the additional problem of being mutable, which makes their use in collections problematic (just as using mutable collections within other collections is problematic).
[223001270290] |For instance, if you construct a phrase query, its notion of equality and hash code change as more terms are added, because equality is defined to require having an equal list of terms.
[223001280010] |LingPipe implements LDA in cluster.LatentDirichletAllocation, which is explained with simple and real-world examples in our clustering tutorial.
[223001280130] |If you’re not familiar with LDA and its collapsed Gibbs sampler, I’m not going to have time to explain it in this note!
[223001280140] |Collapsed Gibbs sampling for LDA was introduced in (Griffiths and Steyvers 2004).
[223001280150] |The collapsing is of the multinomials, so that a Gibbs sample is represented by an assignment of topics to each word in each document.
[223001280160] |The Dirichlet priors are hyperparameters (that is, set to constants, not fitted) in the collapsed sampler.
[223001280170] |Porteous et al. present a method for speeding up the sampling step of the inner loop of collapsed LDA, but their method is actually much more general.
[223001280180] |The approach is pretty straightforward, and reminds me of arithmetic coding.
[223001280190] |It relies on being able to visit outcomes in roughly decreasing order of probability, compute their unnormalized probabilities efficiently, and compute an upper bound on the normalization constant given only the unnormalized probabilities of the items seen so far (plus other knowledge if it’s easy to compute).
[223001280200] |Suppose we have K topics and W words and we’re looking at word w in document j.
[223001280210] |We can compute unnormalized probabilities q[k] for each topic k, so that the probability of topic k is p[k] = q[k]/Z, where the normalization term is Z = q[1] + ... + q[K].
[223001280220] |If we can create a sequence of upper bounds Z[1],...,Z[K] such that Z[i] >= Z[i+1] and Z[K] = Z, then q[k]/Z[1],...,q[k]/Z[K] is an increasing sequence of lower bounds on p[k], with the final bound exact: q[k]/Z[K] = q[k]/Z = p[k].
[223001280230] |The point of the algorithm is to commit to a sampled outcome as soon as the probability is bounded.
[223001280240] |We start by generating a random number u in Uniform(0,1).
[223001280250] |If it is lower than q[1]/Z[1], we can return topic 1 without computing any of the other q[k] or Z[k].
[223001280260] |If it is greater, we have to compute q[2] and Z[2], at which point we may or may not have bounded the result (we have if u < (q[1]+q[2])/Z[2]).
[223001280270] |If the bounds are fairly tight and we consider the topics with high values of q[k] first, we can potentially return answers with much less work than computing all normalized probabilities.
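Here’s a sketch of that early-commit loop, following the simplified description above; the q[k] are assumed to arrive in roughly decreasing order, and zBound[k] is assumed to be an upper bound on Z that tightens down to Z itself at the last topic (the full Porteous et al. algorithm also re-checks earlier topics as the bound tightens, which this sketch omits):

import java.util.Random;

// Early-exit sampling: commit to topic k as soon as the cumulative sum of
// unnormalized probabilities, divided by a conservative upper bound on Z,
// exceeds the uniform draw u.
class EarlyExitSampler {
    private final Random random = new Random();

    int sampleTopic(double[] q, double[] zBound) {
        double u = random.nextDouble();
        double cumulative = 0.0;
        for (int k = 0; k < q.length; ++k) {
            cumulative += q[k];
            if (u < cumulative / zBound[k])
                return k;   // committed without touching topics k+1 onward
        }
        return q.length - 1;  // unreachable when zBound[K-1] == Z exactly
    }
}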
[223001280280] |The amount of speedup is clearly related to the amount of skew in the distributions and the tightness of the upper bounds on the normalization constants.
[223001280290] |LDA topic distribution per word is highly skewed, especially as the Dirichlet parameter gets smaller, as is typically the case with large topic models.
[223001280300] |In cases where the distribution is uniform, the sort is poor, or the upper bounds are very loose, Porteous et al.’s optimization can take longer than Griffiths and Steyvers’ direct sampling.
[223001280310] |For LDA, Porteous et al. sort topics by document probability, which is a good proxy for overall topic probability.
[223001280320] |They compute upper bounds Z[k] using a nifty inequality discovered in the late 1800s called Hölder’s inequality (which is based on vector norms).
[223001280330] |They consider two parameterizations, but this is not the only way to bound probabilities.
[223001280340] |Porteous et al.’s technique can be applied in many situations in graphical models.
[223001280350] |The situation in LDA has us estimate p(topic=k|word,document,θ,φ), which is proportional to p(topic|θ[doc]) * p(word|φ[k]).
[223001280360] |This is a typical setup if we’re Gibbs sampling an interior node in a graphical model.
[223001280370] |It involves the parent node (θ[doc]), the daughter node (w), and the daughter’s mother (aka an aunt) φ[k].
[223001280380] |I wouldn’t be surprised if general Gibbs sampling packages like HBC could incorporate Porteous et al.’s algorithm as a compiler optimization.
[223001280390] |It’d be even better if we could make like Java’s Just in Time (JIT) compiler and do run-time profiling to decide when to apply Porteous et al.’s algorithm.
[223001280400] |Finally, a big raspberry for KDD for making their proceedings closed source through the pay-to-access ACM portal.
[223001290010] |Lacks a Convincing Comparison with Simple Non-Bayesian Approaches
[223001290020] |That was just one comment from my 2009 NAACL rejection letter (reprinted in full with a link to the paper below).
[223001290030] |As Andrew says, I’m glad to hear they had so many great submissions.
[223001290040] |This was for a paper on gold-standard inference that I’ve been blogging about.
[223001290050] |Here’s a link to the submission:
[223001290060] |Bob Carpenter. 2008.
[223001290070] |A Multilevel Bayesian Model of Categorical Data Annotation.
[223001290080] |Rejected by NAACL 2009.
[223001290090] |The only thing it adds to the white paper and graphs on the blog is another Mechanical Turk experiment that recreated the MUC 6 PERSON entity annotations.
[223001290100] |As with the Snow et al. work, the Turkers did better than the gold standard.
[223001290110] |I should’ve paid more attention to Simon Peyton Jones’s most excellent advice.
[223001290120] |Ironically, I used to give most of that advice to our own students.
[223001290130] |Always easier medicine to give than to take.
[223001290140] |In particular, I needed to make the use cases much clearer to those who weren’t knee deep in Bayesian analysis and inter-annotator agreement statistics.
[223001290150] |The trap I fell into has a name.
[223001290160] |It’s called “the curse of knowledge”.
[223001290170] |The best general study of this phenomenon I know is Elizabeth Newton’s research, which is described in Heath and Heath’s great book Made to Stick (see the tappers and listeners section of the excerpt).
[223001290180] |Good psych experiments are so compelling they give me goosebumps.
[223001290190] |It’s hard to compare the Bayesian posterior intervals with non-Bayesian estimates.
[223001290200] |The whole point of the Bayesian analysis is to analyze the posteriors.
[223001290210] |But if you’re not a Bayesian, what do you care?
[223001290220] |The standard in NLP is first-best analyses on held-out data sets with a simple accuracy measure.
[223001290230] |No one ever talks about log loss except in model fitting and language modeling evaluation.
[223001290240] |So even a probabilistic model that’s not Bayesian can’t get evaluated on its own terms.
[223001290250] |This goes for simple posterior predictive inferences in Bayesian analyses.
[223001290260] |I guess the reviewer missed the comparison with simple voting, which is the de facto standard (coupled with censorship of uncertain cases).
[223001290270] |What I really should’ve addressed is two sides of the issue of what happens with fewer annotators.
[223001290280] |The answer is that the Bayesian approach has an even stronger advantage in agreeing with the gold standard annotations than simple voting.
[223001290290] |I only did that analysis after the submission, in a “Doh!” moment, anticipating the problem reviewers might have with 10 annotators.
[223001290300] |The second aspect is to push harder on the fool’s gold standard point: the state of the art produces very poor corpora in terms of consistency.
[223001290310] |It is possible to see if the informative priors help with cross-validation.
[223001290320] |But even without cross-validation, it’s obvious with 20 samples when the annotator made no mistakes —they’re unlikely to have 100% accuracy on the rest of the corpus.
[223001290330] |You don’t estimate a player’s batting average (in baseball) to be .500 after he goes 10 for 20 in his first 20 at bats.
[223001290340] |Everyone in NLP knows low count estimates need to be smoothed.
[223001290350] |So now that we’ve decided we need a prior (call it regularization or smoothing if you like), we’re just haggling about price.
[223001290360] |So I just optimized that, too.
[223001290370] |It’s what any good Bayesian would do.
[223001290380] |Here’s the full set of comments and ratings (on a 1-5 scale, with 5 being the best).
[223001300010] |Java Quiz: Iterating over Binary Tree Elements
[223001300020] |I was thinking this’d make a good Java quiz question.
[223001300030] |It’s something I have to do all the time in order to make collections accessible in a streaming fashion.
[223001300040] |Getting an iterator is the easiest way for a client to visit the elements of a collection one at a time.
[223001300050] |The usual reason is to apply some operation to them.
[223001300060] |If the collection’s large (sometimes offline, even), it’s less memory intensive to iterate.
[223001300070] |Problem: Suppose you have a standard binary tree implementation (see below) and want to implement an instance of java.util.Iterator that iterates over the elements.
[223001300080] |To satisfy the spirit of the exercise, the implementation must take constant time per item iterated and may take at most an amount of memory proportional to the depth of the tree.
[223001300090] |Here’s a standard CS 101 binary tree implementation, with the add() and toString() methods for convenience:
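A minimal version of such a tree class (member names are mine) looks like this:

// A minimal int-valued binary search tree node with left and right daughters,
// plus add() and toString() for convenience.
public class Tree {
    int mValue;
    Tree mLeft;
    Tree mRight;

    public Tree(int value) {
        mValue = value;
    }

    // iterative binary-search-tree insertion (a while loop rather than recursion)
    public void add(int value) {
        Tree node = this;
        while (true) {
            if (value < node.mValue) {
                if (node.mLeft == null) { node.mLeft = new Tree(value); return; }
                node = node.mLeft;
            } else {
                if (node.mRight == null) { node.mRight = new Tree(value); return; }
                node = node.mRight;
            }
        }
    }

    // parenthesized in-order rendering, e.g. ((1) 2 (3))
    @Override
    public String toString() {
        return "("
            + (mLeft == null ? "" : mLeft + " ")
            + mValue
            + (mRight == null ? "" : " " + mRight)
            + ")";
    }
}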
[223001300100] |To see how a tree gets built, here’s an example:
[223001300110] |If I compile and run, here’s what I get:
[223001300120] |Hint: It’s easy to walk over the nodes in order using a recursive program.
[223001300130] |Here’s a simple string builder:
[223001300140] |If I add the following to my main():
[223001300150] |I get the following line of output:
[223001310010] |Quiz Answer: Continuation Passing Style for Converting Visitors to Iterators
[223001310020] |Here’s the answer to the question raised in my previous post, Java Quiz: Iterating over Binary Tree Elements.
[223001310030] |The answer is surprisingly easy once you see it.
[223001310040] |What’s really cool is that it’s an instance of a general programming technique called continuation passing style (CPS).
[223001310050] |In the form discussed here, CPS is a general technique to convert recursion into iteration.
[223001310060] |Enough of the generalities, here’s my answer:
[223001310070] |I now add the following to the main() from the last post:
[223001310080] |and here’s what I see:
[223001310090] |Conceptually, the way the program works is very simple.
[223001310100] |The main idea is to keep track of the information that would be on the call stack in the recursive implementation in an explicit programmatic stack data structure.
[223001310110] |The elements on the stack represent the continuations.
[223001310120] |They essentially tell the program where to pick up processing when called next, just like the stack does in recursion.
[223001310130] |Envious Aside: Continuation passing (literally, not just in style) is what lets Python implement the super-cool yield
construction.
[223001310140] |Yields in Python are a general way to turn visitors into iterators.
[223001310150] |They’re possible because continuations in Python are first-class objects.
[223001310160] |You can even write completely stackless Python.
[223001310170] |Back to Java.
[223001310180] |Our code maintains a stack of the nodes whose values are left to be printed, in order of how they should be printed.
[223001310190] |Because you print the left daughters before the root, you always queue up the entire set of left daughters for a node.
[223001310200] |After initialization, the stack consists from bottom to top of the root, the root’s left daughter, the root’s left daughter’s left daughter, and so on, down to the lower “left corner” of the tree.
[223001310210] |The iterator has more elements if the stack’s not empty.
[223001310220] |The next element is always the value on the top of the stack.
[223001310230] |After returning it, you need to push its right daughter, as the descendants of the current node’s right daughter all have to be output before the other elements on the stack.
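Putting that together, an iterator along these lines looks roughly like this (a sketch over the tree class above, sitting in the same package so it can reach the daughters directly; the details of the full implementation may vary):

import java.util.Iterator;
import java.util.LinkedList;
import java.util.NoSuchElementException;

// In-order iterator: the stack always holds a path of left descendants, so
// memory is bounded by the depth of the tree.
public class TreeIterator implements Iterator<Integer> {
    private final LinkedList<Tree> mStack = new LinkedList<Tree>();

    public TreeIterator(Tree root) {
        pushLeftPath(root);
    }

    public boolean hasNext() {
        return !mStack.isEmpty();
    }

    public Integer next() {
        if (!hasNext())
            throw new NoSuchElementException();
        Tree node = mStack.removeFirst();   // smallest remaining value
        pushLeftPath(node.mRight);          // its right subtree comes next
        return node.mValue;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // push a node and all of its left daughters, down to the "left corner"
    private void pushLeftPath(Tree node) {
        for (Tree t = node; t != null; t = t.mLeft)
            mStack.addFirst(t);
    }
}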
[223001310240] |One thing to look at is how I implemented the recursive add using iteration (I probably would’ve used a for loop, but thought the while loop easier to understand).
[223001310250] |You could’ve also done this recursively:
[223001310260] |The recursive version’s clearer (at least to me), but programming languages tend to be much better at iteration than recursion (even languages designed for recursion, like Lisp and Prolog).
[223001310270] |Java has relatively small stack sizes (though they’re configurable), and minuscule maximum depths (usually in the thousands).
[223001310280] |Of course, if your binary tree is 1000 nodes deep, you have bigger problems than Java’s stack!
[223001310290] |Also check out the little details.
[223001310300] |I made variables final that could be final, kept the class’s privates to itself, and implemented the proper exceptions.
[223001310310] |To see that it satisfies the performance constraints, first note that the amount of memory used is never greater than the depth of the tree, because the elements of the stack always lie on a single path from the bottom of a tree up.
[223001310320] |Constant time per element is a bit tricky, and you might want to call me on it, because it requires an amortized analysis.
[223001310330] |For each element returned, its node was pushed onto the stack once and popped off the stack once.
[223001310340] |That’s all the work that happens during next().
[223001310350] |You also do tests on each returned value’s daughters, which is still only a constant amount of work per element.
[223001310360] |Extra Credit: Use java.util.AbstractSet to implement SortedSet.
[223001310370] |The only methods that need to be implemented are size() and iterator(), so you’re mostly there.
[223001310380] |Extra Extra Credit: Discuss the problem of concurrent modification and implement a solution for the iterator like Java’s.
[223001310390] |You might also want to note that the tree should be balanced somehow, but that seems to me like a problem orthogonal to the iterator issue.
[223001310400] |Check out Java’s own java.util.TreeSet implementation.
[223001310410] |The code for the collections framework is very useful as a study tool.
[223001320010] |Matthew Wilkins Evaluates POS Taggers
[223001320020] |There’s a fascinating thread on Matthew Wilkins‘s blog Work Product.
[223001320030] |Matthew’s a humanities postdoc studying literature.
[223001320040] |His thread starts in a 25 October 2008 entry:
[223001320050] | Work Product: POS Taggers
[223001320060] |with this quote:
[223001320070] |I surely just need to test a bunch of them [part of speech taggers] in some semi-systematic way, but is there any existing consensus about what works best for literary material?
[223001320080] |I would highly recommend reading from the first post in the thread forward.
[223001320090] |It’s a great fly-on-the-wall view of a non-specialist coming to grips with natural language processing.
[223001320100] |Over the past three months, Matthew’s evaluated a whole bunch of different part-of-speech taggers looking for something that’ll satisfy his accuracy, speed, and licensing needs.
[223001320110] |The data is literary English, for now, mostly culled from Project Gutenberg.
[223001320120] |The current entry, Evaluating POS Taggers: Conclusions, dated 27 January 2009, starts with:
[223001320130] |OK, I’m as done as I care to be with the evaluation stage of this tagging business, which has taken the better part of three months of intermittent work.
[223001320140] |This for a project that I thought would take a week or two.
[223001320150] |There’s a lesson here, surely, …
[223001320160] |Amen, brother.
[223001320170] |That’s one reason why Kevin Cohen’s organizing the software workshops at ACL (this year Marc Light co-chairs, but I’m still on the PC).
[223001320180] |So I suggested to Matthew that he submit this diary of his work to the NAACL 2009 Software Workshop, which is explicitly calling for just such case studies.
[223001330010] |Denis, Gilleron &Tommasi (2002) Text Classification from Positive and Unlabeled Examples
[223001330020] |I’m finally getting back to the problem of how to learn with only positively labeled data (and lots of unlabeled data).
[223001330030] |I discussed this last in the entry How Can I Build a Classifier with No Negative Data?.
[223001330040] |This paper takes an interesting (and, in retrospect, obvious) approach:
[223001330050] | Denis, François, Rémi Gilleron, and Marc Tommasi. 2002.
[223001330060] |Text Classification from Positive and Unlabeled Examples.
[223001330070] |In Proceedings of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002).
[223001330080] |The foundation is naive Bayes with a bag-of-words representation for a binary classification task with two categories {0,1}.
[223001330090] |They remind us in their equation (4) that naive Bayes yields a mixture-of-unigrams language model for words w:
[223001330100] |Denis et al. then (well, actually before) observe that we can estimate p(w|cat=1) from positive examples.
[223001330110] |And we can estimate p(w) over the entire corpus without labels.
[223001330120] |We can probably make a pretty good guess at p(cat=0), which gives us p(cat=1) = 1.0 - p(cat=0).
[223001330130] |If the documents are at all long and the category probabilities are not horrendously skewed, the cat probs usually don’t contribute much to the category estimates in naive Bayes.
[223001330140] |Aha!
[223001330150] |Now we can solve for p(w|cat=0), which they do for us in their equation (6):
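Writing out the mixture, p(w) = p(cat=1) * p(w|cat=1) + p(cat=0) * p(w|cat=0), so rearranging gives (roughly what their equation (6) amounts to): p(w|cat=0) = (p(w) - p(cat=1) * p(w|cat=1)) / p(cat=0).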
[223001330160] |Denis et al. note that for this to make sense, the document length distributions have to be the same.
[223001330170] |This not being the case may not matter much in practice.
[223001330180] |Recall also that the independence assumptions of naive Bayes need to be met, which they almost never are for bags of words: actual counts are almost always overdispersed relative to the expectation under a multinomial.
[223001330190] |The paper includes some eval over WebKB Course Dataset, a set of 1000 or so academic department web pages from four universities classified into types.
[223001330200] |Denis et al. took the course pages (22% of total) to be category 1 and all other pages to be category 0.
[223001330210] |They got quite decent results with both 20 and 50 positively-labeled documents and up to 200 unlabeled docs.
[223001330220] |They showed that classification performance was robust to errors in the category distribution specification p(cat=0).
[223001330230] |One good idea’s all a paper (or a thesis for that matter) needs.
[223001330240] |They did more eval and added cotraining in:
[223001330250] |François Denis, Anne Laurent, Rémi Gilleron, Marc Tommasi. 2003.
[223001330260] |Text classification and co-training from positive and unlabeled examples.
[223001330270] |In Proceedings of ICML.
[223001340010] |Unit Testing Exceptions with JUnit 4.5 Annotations
[223001340020] |Not only did I decide to clean up all the unchecked warnings from the compiler, I also decided to upgrade our unit testing framework to JUnit 4.5.
[223001340030] |There’s a very nice introduction to JUnit 4.5 at:
[223001340040] | cavdar.net’s JUnit 4 in 60 Seconds
[223001340050] |I just wrote a script with a bunch of regexes to transform our current tests into the new form.
[223001340060] |That meant taking every public void method whose name started with “test” and adding the annotation @Test.
[223001340070] |That is, my test methods now all meet this year’s look and feel for JUnit:
[223001340080] |I’m still not a big fan of the static import, which adds the method to the local namespace, but that’s the recommended practice, and I believe in following recommended practice more than I believe in innovating, at least for syntactic details like this.
[223001340090] |Now if you want to test that a class throws an exception, what I used to do was this:
[223001340100] |The pattern’s simple: set up a try/catch block, with the thing you expect to throw an exception in the try block.
[223001340110] |If the statement succeeds without an exception, you hit the fail method, which registers a test failure with JUnit.
[223001340120] |If you throw the exception, it gets caught and registers a test success.
[223001340130] |I had to implement the success method myself to register successes.
[223001340140] |The recommended style for this test now is:
[223001340150] |This certainly looks much cleaner.
[223001340160] |The test succeeds if and only if the exception is thrown.
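For instance, a test expecting an exception now looks roughly like this (the list here is just a stand-in for the class under test):

import org.junit.Test;

public class ExceptionTest {

    // The test passes if and only if the body throws the expected exception.
    @Test(expected = IndexOutOfBoundsException.class)
    public void testOutOfBounds() {
        new java.util.ArrayList<String>().get(0);  // throws IndexOutOfBoundsException
    }
}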
[223001340170] |Now consider this test, which has the same form as the testGamma() method in unit.io.BitInputTest:
[223001340180] |The test checks that the first call to next succeeds, that the second throws an exception, and that the has-next method then returns false.
[223001340190] |I can’t see how to do this in the new style.
[223001340200] |The first problem is that I don’t want the test to succeed if the first call to next throws an exception.
[223001340210] |So I’d have to write a separate test to make sure at least that part succeeds.
[223001340220] |But the real problem is the post test for has-next to return false, which I can’t figure out how to finagle into a test with an expected exception.
[223001350010] |Doubling Speed and Halving Memory in Binomial Naive Bayes
[223001350020] |It’s easy to convert naive Bayes into a linear classifier, then reduce the number of coefficient vectors to the number of categories minus 1.
[223001350030] |We’ll work through the binomial (two category) case here, where we can get away with a single parameter vector.
[223001350040] |The general case is the same with more subscripts.
[223001350050] |First, let’s express naive Bayes as a linear classifier.
[223001350060] |Suppose we have a binary naive Bayes text classifier over tokens:
[223001350070] |The typical parameterization is with a coefficient vector for each category:
[223001350080] |Naive Bayes represents documents as bags of words, which provide a count for each word in the document.
[223001350090] |Specifically, an input document is represented as a vector of token counts, with a constant dimension zero set to one for an intercept:
[223001350100] |Then the joint probability of a category and a document is given by:
[223001350110] |Thus it’s essentially a multinomial classifier.
[223001350120] |Now suppose we only have two categories, 0 and 1.
[223001350130] |The conditional probabilities are computed in the usual way by marginalization:
[223001350140] |Now suppose we define a new feature vector:
[223001350150] |and look at its behavior when multiplied by a document token vector:
[223001350160] |we see that it produces the log odds.
[223001350170] |Noting that the two conditionals add to 1:
[223001350180] |we continue the expansion to derive:
[223001350190] |so that:
[223001350200] |Therefore, we only need to store the single parameter vector and multiply it by our document count vector.
[223001350210] |That neatly cuts our storage and the number of arithmetic operations in half.
[223001350220] |If we have K categories, we can get away with K-1 coefficient vectors by subtracting the last coefficient vector from all the previous ones.
[223001350230] |The time and space savings are 1/K.
[223001350240] |In practice, there’s no need to allocate a vector of token counts.
[223001350250] |Instead, as each token is streamed out of the tokenizer, its coefficient is looked up and added to the running total, which is initialized with the category log probability difference (aka the intercept).
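A sketch of that streaming computation (the names and the map-based representation are mine):

import java.util.Map;

// Single-vector binomial naive Bayes: beta maps each token to
// log p(token|cat1) - log p(token|cat0), and the intercept is
// log p(cat1) - log p(cat0). Positive log odds means cat1.
class BinaryNaiveBayesScorer {
    private final Map<String,Double> mBeta;
    private final double mIntercept;

    BinaryNaiveBayesScorer(Map<String,Double> beta, double intercept) {
        mBeta = beta;
        mIntercept = intercept;
    }

    double logOdds(Iterable<String> tokens) {
        double total = mIntercept;       // start with the category log odds
        for (String token : tokens) {
            Double coefficient = mBeta.get(token);
            if (coefficient != null)     // tokens without coefficients skipped in this sketch
                total += coefficient;
        }
        return total;                    // log p(cat1|doc) - log p(cat0|doc)
    }

    double conditionalProbCat1(Iterable<String> tokens) {
        return 1.0 / (1.0 + Math.exp(-logOdds(tokens)));  // inverse logit
    }
}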
[223001350260] |This recasting of naive Bayes as a linear classifier is not novel.
[223001350270] |The question becomes how best to choose the coefficient vector.
[223001350280] |Naive Bayes tends to be outperformed in terms of accuracy by discriminative methods like logistic regression or linear support vector machines, but these methods are orders of magnitude more expensive to train.
[223001350290] |There’s a good discussion of discriminative versus generative estimation of linear classifier coefficients in:
[223001350300] |Klein, Dan and Christopher D. Manning. 2002.
[223001350310] |Conditional structure versus conditional estimation in NLP models.
[223001350320] |In Proceedings of EMNLP.
[223001350330] |Note: Klein and Manning use a vector per category, as is traditional in natural language and machine learning approaches to both naive Bayes and logistic regression.
[223001360010] |Java Floating Point: strictfp, StrictMath vs. LingPipe’s Forward-Backward Algorithm
[223001360020] |(Note: there are two questions I have, highlighted in bold.)
[223001360030] |I’ve been wrestling with varying floating point behavior across platforms, a bug feature of modern programming languages over multiple platforms and compilers.
[223001360040] |Here’s a nice intro from Sun that pre-dates Java, but discusses the IEEE floating point standard, the main platform-based variations, and their effects on the C language:
[223001360050] |David Goldberg. 1991.
[223001360060] |What Every Computer Scientist Should Know About Floating-Point Arithmetic.
[223001360070] |ACM Computing Surveys.
[223001360080] |Here’s the whole book as a PDF.
[223001360090] |It’s just the appendix; the rest of the doc goes into gory detail about floating point.
[223001360100] |Much more so than in my numerical optimization (Nocedal and Wright) or matrix computation (Golub and van Loan) books.
[223001360110] |One of the features of LingPipe’s implementation of hidden Markov models (HMMs) is a full implementation of the forward-backward algorithm (also see Bishop’s pattern recognition textbook).
[223001360120] |Forward-backward is just a special case of the general sum-product algorithm for graphical models.
[223001360130] |It’s also used in conditional random fields (CRFs).
[223001360140] |Forward-backward is what allows us to calculate the probability of a tag for a token given the whole sequence of input tokens.
[223001360150] |It’s also what’s behind the same calculation for named entities in our HMM named-entity extractor.
[223001360160] |As I mentioned previously, in a post on POS tagger eval, Matthew Wilkins has a corpus of 1.6M tokens of literary text with POS tags.
[223001360170] |During the evaluation, our HMM decoder threw an exception in fold 347 of a 500-fold cross-validation run (it was so many folds so as not to blow out memory storing cases —I’ll make storage optional in the next release).
[223001360180] |I fixed the problem by adding a strictfp declaration to every class, and replacing all calls to java.lang.Math with the corresponding calls to java.lang.StrictMath.
[223001360190] |It wasn’t noticeably slower to decode on my desktop machine (dual quad-core Xeon, Vista64), in either 1.5 or 1.6.
[223001360200] |Is this what others have found with strict math?
[223001360210] |What I read on the web led me to believe it’d be much slower.
[223001360220] |The bug was in the middle of the forward portion of the forward-backward algorithm.
[223001360230] |Forward-backward’s a nasty algorithm numerical-computation-wise because it involves a sequence of multiply-adds (basically vector dot products, where the output of one set of computations makes up the input to the next).
[223001360240] |Here’s how the forward values (alphas) are defined for a given token position (n) and category/tag (t):
[223001360250] |For each input token (n), there are multiply-adds accumulated over all of the tags (t), each involving the output of the previous (n-1) calculation!
[223001360260] |The algorithm’s not as bad as it looks because we store the previous alphas and use dynamic programming.
[223001360270] |Because the forward values (alphas) are joint probabilities of the input tokens and the tag assigned at a position, they can easily underflow in practice on a linear scale.
[223001360280] |That’s why Viterbi (first best algorithm) uses log scales (that, and it converts multiplication to addition).
[223001360290] |In practice, we really only care about the conditional probabilities.
[223001360300] |So we can rescale:
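A sketch of a rescaled forward pass (not the actual LingPipe implementation; the emission and transition tables here are assumed to be plain arrays indexed by tag, with emit[n][t] = p(token[n]|tag=t) and trans[t2][t] = p(tag=t|prevTag=t2)):

// Forward pass with per-position rescaling: each alpha vector is renormalized
// to sum to 1, so what's stored are conditional values p(tag[n]=t|tokens[0..n])
// rather than joint probabilities, keeping everything away from underflow.
class ForwardPass {

    static double[][] forward(double[][] emit, double[][] trans, double[] init) {
        int numTokens = emit.length;
        int numTags = init.length;
        double[][] alpha = new double[numTokens][numTags];
        if (numTokens == 0)
            return alpha;

        for (int t = 0; t < numTags; ++t)
            alpha[0][t] = init[t] * emit[0][t];
        rescale(alpha[0]);

        for (int n = 1; n < numTokens; ++n) {
            for (int t = 0; t < numTags; ++t) {
                double sum = 0.0;
                for (int t2 = 0; t2 < numTags; ++t2)
                    sum += alpha[n - 1][t2] * trans[t2][t];   // multiply-add chain
                alpha[n][t] = sum * emit[n][t];
            }
            rescale(alpha[n]);   // renormalize to sum to 1
        }
        return alpha;
    }

    static void rescale(double[] xs) {
        double total = 0.0;
        for (double x : xs) total += x;
        for (int i = 0; i < xs.length; ++i) xs[i] /= total;
    }
}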
[223001360310] |As those of you who follow floating point know, this is all risky business.
[223001360320] |What happened in practice was that the rescaled values summed to 1.09 at one point (I only rescaled up, so this is a problem in the forward calculations at some point).
[223001360330] |Finally: Does anyone know any references about calculating forward-backward in a way that’s sensitive to how floating point arithmetic actually works?
[223001370010] |Elkan and Noto (2008): Learning Classifiers from Only Positive and Unlabeled Data
[223001370020] |Charles Elkan sent along links to these two papers in response to my previous post, Denis et al.’s Text Classification from Positive and Unlabeled Examples.
[223001370030] |The title overlap indicates just how much we need auto-alerting using tech like Hal’s WhatToSee:
[223001370040] | K. Noto, M. H. Saier Jr., and C. Elkan.
[223001370050] |Learning to Find Relevant Biological Articles Without Negative Training Examples.
[223001370060] |In Procs. of the Australasian Joint Conference on Artificial Intelligence (AI ’08).
[223001370070] |C. Elkan and K. Noto.
[223001370080] |Learning Classifiers from Only Positive and Unlabeled Data.
[223001370090] |In Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining (KDD ’08 ).
[223001370100] |Because they’ve supplied the data, I can use it in my upcoming tutorial on semi-supervised learning (I’m also doing standard EM).
[223001370110] |The Setup
[223001370120] |The basic setup is the same as for Denis et al.: a two-category (positive/negative) classification problem with a relatively tiny data set.
[223001370130] |In this case, a total of 7500 or so SwissProt text records, 2500 or so of which are positive instances that mention transport proteins.
[223001370140] |Then a labeled training set consisting of a randomly selected subset of the positive examples.
[223001370150] |The goal is to estimate a classifier p(positive|text) given only the collection of 7500 texts and 300 labeled positive examples.
[223001370160] |Lemma 1
[223001370170] |Elkan and Noto show that if you can estimate the probability of a positive item being labeled, p(lab|pos), and you can estimate the probability of an item being labeled, p(lab|text), then you can estimate:
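In this notation, their Lemma 1 amounts to p(pos|text) = p(lab|text) / p(lab|pos), which follows because only positive items get labeled and positives are labeled at random, so p(lab|text) = p(lab|pos) * p(pos|text).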
[223001370180] |Given that p(lab|pos) is a constant, we also see that p(pos|text) is proportional to p(lab|text).
[223001370190] |They lean on this ratio to train a binary classifier to estimate p(lab|text), which provides the same ranking by probability as the estimand of interest, p(pos|text).
[223001370200] |The Evaluation
[223001370210] |They evaluate by precision at 300 and the results look very good.
[223001370220] |The KDD paper cuts off the ROC curve at .95 recall saying the only range of interest is precision-at-300; the AI’08 paper shows the whole thing, where they hit 100% recall at a respectable 70% sensitivity, which almost seems too good to be true.
[223001370230] |Their application of ranking articles is more like traditional search in that it does not require a properly calibrated p(pos|text), only a ranking.
[223001370240] |They’re interested in supplying a list of 300 papers to a human curator to review them for inclusion in a database of transport proteins.
[223001370250] |That’s a bit different than our concern.
[223001370260] |We’re working on the much larger scale problem of linking Entrez-Gene (hundreds of thousands of genes with between a handful and thousands positive examples plus metadata) and MEDLINE (tens of millions of citations with titles, authors, abstracts, dates, and other meta-data like MeSH terms).
[223001370270] |For each of the genes in Entrez, we have positive-only data consisting of MEDLINE citations to papers.
[223001370280] |We also have lots of metadata and attached descriptive text.
[223001370290] |We’d like to be able to calibrate p(gene|text) across genes so that for each paper, we can rank genes in order of likelihood they’re mentioned in the text.
[223001370300] |This is very different than ranking texts in order of likelihood of mentioning a gene, which we also want to do, but is easier with traditional “more like this” methodology.
[223001370310] |Estimation
[223001370320] |In order to derive estimates of p(pos|text), the authors use a 1D logistic regression to estimate probabilities p(lab|text) from SVM classifier margins.
[223001370330] |Of course, the sigmoid is monotonic, so this preserves the same ranking as the SVMs.
[223001370340] |The authors estimate p(lab|pos) by averaging p(lab|x) over a cross-validated sample of labeled positives.
[223001370350] |Does Negative Data Help?
[223001370360] |In the AI’08 paper, they consider adding negative data gleaned from early review rounds where they gave 300 papers to curators.
[223001370370] |Surprisingly, their results showed the negative data didn’t help much at all.
[223001370380] |They were surprised about variance in their data samples, which showed from 1.18% to 4.25% relevant articles in 300 samples.
[223001370390] |I’m not surprised — the 95% posterior interval for Binomial(n|300,0.025)/300 is (0.010, 0.043).
[223001370400] |Underestimation Bias (I think)
[223001370410] |Their figure shows the result of a (mis-specified, in the statistical sense) two-dimensional logistic regression on two categories generated by 2D normal distributions (from the figure, it looks like zero covariance), using an expanded basis with intercept and interaction terms: 1, x, y, x*x, y*y.
[223001370420] |The blue pluses are positive cases, about 20% of which are labeled (not separated from the unlabeled positive cases); the red circles are negative cases.
[223001370430] |The smaller blue ellipse is the 50% contour for the labeled-vs-unlabeled classifier, and the larger blue ellipse is the 50% contour for a classifier trained on positive-vs-negative.
[223001370440] |This neatly shows why the classifier trained on the labeled versus unlabeled distinction has high precision.
[223001370450] |It also neatly shows a case where Lemma 1 fails in practice, because the probabilities are not in the ratio dictated by Lemma 1.
[223001370460] |My main concern with this paper is that training a classifier on labeled versus unlabeled data is going to systematically underestimate p(lab|text), because the unlabeled data contains many points drawn from the same set of underlying positive examples (blue pluses) as the labeled examples.
[223001370470] |Their diagram shows this very neatly, though we can’t prove anything with a mis-specified model anecdote.
[223001370480] |What about Recall?
[223001370490] |It’s hard to evaluate recall in this setup, and the AI’08 paper provides some heuristics that require evaluating some negative data in the end from a sample.
[223001370500] |For the gene linkage problem of generating classifiers for every gene in Entrez-Gene, it’s hard to even get a sample where there’s enough positive density to allow estimation.
[223001370510] |Questions and Quibbles
[223001370520] |A question I had is why they didn’t iterate.
[223001370530] |Given what they’re doing, it’d seem natural to combine this with EM-like iterations.
[223001370540] |Or Gibbs samples.
[223001370550] |Quibble 1: After all the estimation, they don’t measure how well calibrated the probabilities are —we only see the ROC curves.
[223001370560] |Although every reviewer wants more graphs, I really want to see how well calibrated the probabilities are.
[223001370570] |Quibble 2: The authors really shouldn’t use four decimal places (0.xxxx) for reporting results on 6000 data points in their tables, but rounding makes them look like less of an improvement on “prior art”.
[223001370580] |Quibble 3: I didn’t see any justification for their claim in the AI’08 paper that SVMs have a “small and useful boost in accuracy compared to methods like maximum entropy [aka logistic regression]“.
[223001370590] |What’s the gain from their use of a quadratic kernel over the linear kernel?
[223001370600] |Did they regularize?
[223001370610] |I suppose SVMs are like IBM and you can’t go wrong these days buying into them.
[223001370620] |I’d have thought for estimating the probabilities (Quibble 1), logistic regression might have an advantage because training maximizes the sum of the log estimates.
[223001370630] |And even if SVMs do have a small and useful boost in accuracy, we’re talking about ranking on the one hand and probability estimate on the other, not first-best classifier accuracy.
[223001380010] |Linear Models for N-Gram Probabilities
[223001380020] |I’ve been thinking about linear classifiers, and while it’s well known how to convert naive Bayes (multinomial) classifiers into linear models (just treat log p(cat) as intercept, and log p(word|cat) as dimensions), it’s less well known how to do this with higher-order n-grams.
[223001380030] |One use is an n-gram language-model based search engine implemented on top of a vector-based retrieval engine (I’m not sure Lucene’s scoring’s flexible enough for this).
[223001380040] |Well, it turns out to be easy if you use interpolated smoothing like Witten-Bell.
[223001380050] |In the bigram case, the smoothed estimate is:
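Assuming the usual nested interpolation, that works out to: p(a|b) = lam(b) * p'(a|b) + (1 - lam(b)) * (lam() * p'(a) + (1 - lam()) * unif)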
[223001380060] |where lam(b) in [0,1] is the interpolation coefficient for context b and lam() is the interpolation coefficient for the empty n-gram; p'(a|b) and p'(a) are maximum likelihood estimates, and unif is the uniform probability of a token.
[223001380070] |For Witten-Bell smoothing, interpolation is defined as:
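In the standard Witten-Bell parameterization this is roughly lam(b) = count(b,_) / (count(b,_) + numFollowers(b)), where numFollowers(b) is the number of distinct tokens seen following b (a name I’m using here for convenience).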
[223001380080] |where count(b,_) is the number of times the token b is seen before another token in training, making lam(b) increase with the number of training instances.
[223001380090] |Let count[X] be the count for n-gram X (where the length may be 0, 1 or 2).
[223001380100] |For instance, the vector “John likes John” has counts for the following dimensions:
[223001380110] |Then define a coefficient vector beta, with dimensions for all of the n-grams seen in training (after pruning):
[223001380120] |If we look at the contributions from terms in our example and sum, we see how it works, assuming all bigrams other than [likes,John] have been seen in training.
[223001380130] |Doing the sums leaves us with the log estimate:
[223001380140] |It’s no problem extending this to higher order n-grams.
[223001380150] |Many other kinds of interpolated smoothing will work, too.
[223001390010] |Document-Length-Normalized Naive Bayes
[223001390020] |I just finished re-reading the Nigam, McCallum and Mitchell paper Semi-Supervised Text Classification Using EM before implementing an EM tutorial and new naive Bayes implementation.
[223001390030] |I was struck by:
[223001390040] |For pre-processing, stopwords are removed and word counts of each document are scaled such that each document has constant length, with potentially fractional word counts.
[223001390050] |(Nigam et al. p. 9; my bold)
[223001390060] |I figured they meant an L1 norm, which amounts to dividing by the sum of (the absolute value of the) counts.
[223001390070] |But they could’ve meant the L2 norm that’s used in vector-based information retrieval, which amounts to dividing by vector length, which is just the square root of the sum of the squared counts.
[223001390080] |In fact, (Rennie et al. 2003) used L2 norms (after log-scale TF/IDF weighting!).
[223001390090] |I also didn’t know what length they normed to, but figured it’d be something small so that the conditional probability estimates of naive Bayes would have less variance (more on this below).
[223001390100] |So I wrote off to Kamal Nigam, who kindly responded with answers.
[223001390110] |Yes, they used an L1 norm, but they used a surprisingly long length, on the order of hundreds of words, which will cause most naive Bayes estimates to tend toward 0 or 1 due to the faulty independence assumption underlying naive Bayes.
[223001390120] |They used the stoplist from Andrew McCallum’s BOW toolkit, and Kamal was pretty sure they case normalized.
[223001390130] |(20 Newsgroups also works a smidgen better for classification if you strip out the headers, though I’m not sure whether they did.)
[223001390140] |It’s pretty clear how to generalize naive Bayes to handle fractional counts — just use floating point accumulators (better make that a double) and carry out all the other calculations in exactly the same way.
[223001390150] |Let’s say I have two categories with the following estimates after training:
[223001390160] |Now what happens when I get the document “a a b”?
[223001390170] |so that:
[223001390180] |That’s without length normalization.
[223001390190] |What if we make the length norm 1?
[223001390200] |Then we have to raise all of our token probabilities (not the category probabilities) to the 1/3 power, yielding:
.
[223001390220] |Suppose we take the norm to be 100; then we raise the conditional probs to the power of (100/3), yielding p(cat1|a a b) = 1.0 (well, actually 0.999999925).
[223001390230] |In general, length norms lower than the length of the document drive the estimates p(cat|text) closer to the category distribution p(cat), in this case 0.9, and length norms greater than the document length drive the estimates closer to 0 or 1.
[223001390240] |Thus with a balanced category distribution, the posterior category estimates are driven to the uniform distribution.
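In log scale, the whole normalization boils down to scaling the summed token log probabilities by lengthNorm/docLength; a sketch (names are mine):

import java.util.Map;

// L1 length normalization in log space: scaling token counts so they sum to
// lengthNorm is the same as multiplying the summed token log probabilities by
// lengthNorm / docLength, leaving the category log prior alone.
class LengthNormalizedNaiveBayes {

    static double logJoint(Map<String,Integer> tokenCounts,
                           Map<String,Double> logTokenProbs,  // log p(w|cat)
                           double logCatProb,                 // log p(cat)
                           double lengthNorm) {
        double docLength = 0.0;
        double tokenLogProb = 0.0;
        for (Map.Entry<String,Integer> entry : tokenCounts.entrySet()) {
            Double logProb = logTokenProbs.get(entry.getKey());
            if (logProb == null)
                continue;  // assumes smoothing has already handled unseen tokens
            docLength += entry.getValue();
            tokenLogProb += entry.getValue() * logProb;
        }
        if (docLength == 0.0)
            return logCatProb;
        return logCatProb + (lengthNorm / docLength) * tokenLogProb;
    }
}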
[223001390250] |One side effect of this normalization is that the joint probabilities are no longer coherent.
[223001390260] |It’s easy to see that the infinite chain of inputs “a”, “a a”, “a a a”, …, each of which has the same “joint probability” after length normalization, is going to drive our total probability above 1.0.
[223001390270] |Nigam et al. just went ahead and carried out EM in the usual way in an attempt to maximize joint probability estimates of the data and the models (simple Dirichlet prior and MAP estimate).
[223001400010] |Rennie, Shih, Teevan, and Karger (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers
[223001400020] |It turns out Nigam et al. weren’t the only ones fiddling with naive Bayes.
[223001400030] |I just finished:
[223001400040] |Rennie, Shih, Teevan, Karger (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers.
[223001400050] |In ICML.
[223001400060] |Rennie et al. introduce four distinct modifications to Naive Bayes:
[223001400070] |log term frequency normalization, to cope with overdispersion,
[223001400080] | length normalization, to dampen inter-term correlation,
[223001400090] | inverse doc frequency normalization, to cope with noise from common features, and
[223001400100] | complementation, to cope with low-count bias.
[223001400110] |Log Term Frequency (TF)
[223001400120] |Word counts tend to be overdispersed relative to what is predicted by a multinomial.
[223001400130] |Simply put, two or three occurrences of a word in a document are much more common than predicted by multinomials.
[223001400140] |Ken Church called this “burstiness”.
[223001400150] |The two principled ways to address this are (1) mixture models, including the gamma/Poisson [negative binomial], the Dirichlet/multinomial, and zero-inflated models, and (2) latent topic models, where the latent topics model correlations among terms.
[223001400160] |Rennie et al. motivate log word frequency transforms on the basis that modeling log counts with a multinomial is overdispersed compared to standard multinomial models of linear counts.
[223001400170] |Rather than training on raw document counts, as in standard naive Bayes, the count vectors count(w,d) of words in documents are first transformed based on term frequency (TF) and inverse document frequency (IDF):
[223001400180] |where
[223001400190] |count(w,d) is the count of word w in document d.
[223001400200] |Inverse Document Frequency (IDF)
[223001400210] |To help eliminate noise caused by common words (defined as those that show up in many documents), word counts are next scaled by inverse document frequency:
[223001400220] |where numDocs is the total number of documents and numDocs(w) is the total number of documents containing word w.
[223001400230] |Length Normalization (L2)
[223001400240] |Length normalization dampens the correlations among words in documents.
[223001400250] |It also scales probabilities for different lengths, which is not important for Rennie et al.’s first-best classifiers.
[223001400260] |To L2 (Euclidean) length normalize, they divide the vector of TF/IDF counts by its length:
[223001400270] |where the L2 (Euclidean) length of the term vector for document d is defined by:
[223001400280] |I’ve been doing classification like this ever since I started doing classification, which is why the result so far looks like LingPipe’s classify.TfIdfClassifierTrainer.
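A sketch of the three transforms in sequence (one common variant of each; names are mine):

import java.util.HashMap;
import java.util.Map;

// Log term frequency, then inverse document frequency, then L2 length
// normalization, applied to a single document's count vector.
class TfIdfTransform {

    static Map<String,Double> transform(Map<String,Integer> counts,
                                        Map<String,Integer> docFrequency,
                                        int numDocs) {
        Map<String,Double> weights = new HashMap<String,Double>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String,Integer> entry : counts.entrySet()) {
            String word = entry.getKey();
            double tf = Math.log(1.0 + entry.getValue());   // dampen bursty counts
            Integer df = docFrequency.get(word);
            double idf = (df == null || df == 0)
                ? 0.0
                : Math.log(numDocs / (double) df);          // downweight common words
            double weight = tf * idf;
            weights.put(word, weight);
            sumOfSquares += weight * weight;
        }
        double length = Math.sqrt(sumOfSquares);            // Euclidean length
        if (length > 0.0)
            for (Map.Entry<String,Double> entry : weights.entrySet())
                entry.setValue(entry.getValue() / length);  // L2 normalize
        return weights;
    }
}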
[223001400290] |Complementation
[223001400300] |What I’d never seen done is complementation.
[223001400310] |Complementation is introduced to deal with the bias seen with imbalanced training data.
[223001400320] |What’s really nifty is their comparison to one-versus-all formulations of linear classifiers.
[223001400330] |I’ll talk about both of these issues in a future blog post.
[223001400340] |Instead of training a category based on its own counts and then taking the most likely category, Rennie et al. train a category based on all other category counts, then take the lowest scoring category.
[223001400350] |That is, we count all the instances (after all the TF/IDF and length norms) of the word in categories other than c.
[223001400360] |Finally, the discrete distribution underlying category c is defined as:
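With add-one smoothing this is roughly p(w|c) proportional to (1 + the transformed count of w summed over all documents not in category c) / (numWords + the total transformed count of all words in documents not in category c), where numWords is the number of distinct words.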
[223001400370] |Classification
[223001400380] |The output is a set of weights that can be used as the basis of a regular old linear classifier (with intercept feature of the category probabilities).
[223001400390] |As such, this looks like search-engine TF/IDF computations for information retrieval (see my post TF/IDF, Scaling and Symmetry for gory details), for which there are typically no normalizations applied to the queries.
[223001400400] |It’s possible to roll most of the norms into the training, so this is no big deal, and the final length normalization doesn’t affect first-best answers.
[223001400410] |First-Best, Conditional Probabilities and EM
[223001400420] |Recall that (Nigam, McCallum and Mitchell 2003) length-normalized (L1) the documents being classified, to better fit semi-supervised models with expectation maximization (EM).
[223001400430] |Although this has no effect on first-best classification, it does change the category probability estimates, and thus affects expectations, and hence EM.
[223001400440] |Ironically, Rennie et al. list not being able to run EM as a drawback to their ad-hoc normalizations.
[223001400450] |As Nigam et al. showed, you really only need the conditional estimates for computing the E step, and these may be computed with Rennie et al.’s modifications as well as with Nigam et al.’s.
[223001400460] |Then the M step can be any old model fitting, including Rennie et al.’s.
[223001410010] |Lucene 2.4 in 60 seconds
[223001410020] |[Update: 17 Nov 2010.
[223001410030] |We now have an extensive (20 pages in textbook format) tutorial for Lucene 3.0 as an appendix for the LingPipe book.
[223001410040] |The PDF for the current draft and code samples in a tarball are available from the following site:
[223001410050] | Text Analysis in LingPipe
[223001410060] |As a bonus, the tokenization chapter provides adapter implementations for converting LingPipe tokenizers to Lucene analyzers and vice-versa.]
[223001410070] |This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods.
[223001410080] |In particular it shows how to process search results without using Hits objects, as this class is scheduled for removal in Lucene 3.0.
[223001410090] |The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible.
[223001410100] |In the best “Hello World” tradition I have written a class called HelloLucene which builds a Lucene index over a set of 3 texts: { “hello world”, “hello sailor”, “goodnight moon” }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these 3 texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal.
[223001410110] |Here is buildIndex and its required import statements:
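A sketch of buildIndex under the Lucene 2.4 API, following the description below (details of the original listing may differ):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;

// member of the HelloLucene class
public static void buildIndex(String[] texts) throws IOException {
    IndexWriter indexWriter
        = new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                          new StandardAnalyzer(),
                          IndexWriter.MaxFieldLength.LIMITED);
    for (String text : texts) {
        Document doc = new Document();
        doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);   // index one document per text
    }
    indexWriter.close();
}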
[223001410120] |A Lucene data store is called an Index.
[223001410130] |It is a collection of Document objects.
[223001410140] |A Document consists of one or more named Field objects.
[223001410150] |Documents are indexed on a per-field basis.
[223001410160] |An IndexWriter object is used to create this index.
[223001410170] |Let’s go over the call to its constructor method:
[223001410180] |The first argument is the index that the IndexWriter operates on.
[223001410190] |This example keeps the index on disk, so the FSDirectory class is used to get a handle to this index.
[223001410200] |The second argument is a Lucene Analyzer.
[223001410210] |This object controls how the input text is tokenized into terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English language texts.
[223001410220] |It ignores punctuation, removes common function words such as "if", "and", and "but", and converts words to lowercase.
[223001410230] |The third argument determines the amount of text that is indexed.
[223001410240] |The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 terms.
[223001410250] |The for loop maps texts into Document objects, which contain a single Field with name "text".
[223001410260] |The last 2 arguments to the Field constructor method specify that the contents of the field are stored in the index, and that they are analyzed by the IndexWriter's Analyzer.
[223001410270] |The IndexWriter.addDocument() method adds each document to the index.
[223001410280] |After all texts have been processed the IndexWriter is closed.
[223001410290] |Both indexing and search are operations over Document objects.
[223001410300] |Searches over the index are specified on a per-field basis (just like indexing).
[223001410310] |Lucene computes a similarity score between each search query and all documents in the index and the search results consist of the set of best-scoring documents for that query.
[223001410320] |Here is searchIndex and its required import statements:
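A sketch of searchIndex, again following the description below (details of the original listing may differ):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;

import java.io.IOException;

// member of the HelloLucene class
public static void searchIndex(String[] queryStrings)
    throws IOException, ParseException {

    Searcher searcher = new IndexSearcher("helloLuceneIndex");
    QueryParser queryParser = new QueryParser("text", new StandardAnalyzer());
    for (String queryString : queryStrings) {
        Query query = queryParser.parse(queryString);
        TopDocs results = searcher.search(query, null, 10);  // null filter, top 10 hits
        System.out.println("found " + results.totalHits
                           + " documents for query: " + queryString);
        for (ScoreDoc scoreDoc : results.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);        // retrieve stored document
            System.out.println("    text: " + doc.get("text")
                               + " score: " + scoreDoc.score);
        }
    }
    searcher.close();
}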
[223001410330] |Access to a Lucene index (or indexes) is provided by a Searcher
object. searchIndex
instantiates a IndexSearcher
over the directory "helloLuceneIndex".
[223001410340] |Search happens via a call to the Searcher.search()
method.
[223001410350] |Before calling the search()
method the search string must be processed into a Lucene Query
.
[223001410360] |To prepare the query we create a QueryParser and call its parse() method on the search string.
[223001410370] |The QueryParser is constructed using the same Analyzer as was used for indexing: the QueryParser processes the search string into a set of search terms which are matched against the terms in the index, so the indexed text and the search text must be tokenized in the same way.
[223001410380] |Here are the statements which process the results:
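From the sketch above, those are the lines from the search() call through the inner loop:

    TopDocs results = searcher.search(query, null, 10);
    System.out.println(results.totalHits + " results for query: " + queryString);
    for (ScoreDoc scoreDoc : results.scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.println("  " + doc.get("text") + " (score " + scoreDoc.score + ")");
    }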
[223001410390] |The search() method returns a TopDocs object, which has 2 public fields: scoreDocs and totalHits.
[223001410400] |The latter is the number of documents that matched the query, and the former is the array of results in the form of ScoreDoc objects, where a ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value); and score, a float value calculated by Lucene (using a similarity metric best described as Unexplainable Greek kung fu). searchIndex reports the TopDocs.totalHits, then iterates over the TopDocs.scoreDocs array.
[223001410410] |The ScoreDoc.doc is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field with name "text".
[223001410420] |In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar.
[223001410430] |Either add this to your classpath, or simply put it in the same directory as HelloLucene, and compile with the following command:
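Assuming the jar sits in the same directory as HelloLucene.java, a command along these lines should do it:

    javac -cp lucene-core-2.4.0.jar HelloLucene.java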
[223001410440] |Invoke the program with the following command:
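Again only as a sketch, with the queries passed as arguments (this is the Windows form; on Unix the classpath separator is ':' rather than ';'):

    java -cp ".;lucene-core-2.4.0.jar" HelloLucene "hello world" "hello" "moon" "foo"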
[223001410450] |It produces these results:
[223001410460] |The first query "hello world" matches exactly against our first text, but it also partially matches the text "hello sailor".
[223001410470] |The second query "hello" matches both texts equally well.
[223001410480] |The third query "moon" matches against the only document which contains that word.
[223001410490] |The fourth query "foo" doesn’t match anything in the index.
[223001410500] |Exercise for the reader
[223001410510] |Refactor HelloLucene.java into two programs: BuildHelloLucene.java and SearchHelloLucene.java, where BuildHelloLucene uses its command line arguments to build the index, instead of the strings supplied by String[] texts.
[223001410520] |Running BuildHelloLucene with inputs "hello world", "hello sailor", and "goodnight moon" and then running SearchHelloLucene with inputs "hello world", "hello", "moon" and "foo" should give the same results as above.
[223001410530] |Different input texts and searches will expose the behaviour of the StandardAnalyzer in particular, and Lucene in general.
[223001420010] |javax.datamining.
[223001420100] |The number of interfaces defined is truly remarkable.
[223001420110] |For example, javax.datamining.algorithm includes
[223001420120] |svm.classification.SVMClassificationSettingsFactory, while javax.datamining.supervised contains:
[223001420150] |classification.ClassificationTestMetricsTaskFactory, and so on throughout the javax.datamining package.
[223001440010] |A priority queue is a collection with an ordering (such as a java.util.Comparator) that supports operations to add elements, and to pop (or peek at) the largest element.
[223001450060] |A reasonable implementation is with a resizable heap, which makes the add and pop operations logarithmic on the number of entries and the peek operation constant time.
[223001450070] |For search algorithms, the beam heuristic (in one incarnation) restricts the queue to the best K elements (it’s a heuristic because it can result in search errors).
[223001450080] |This means a bounded priority queue, which has a fixed upper bound on size.
[223001450090] |The problem is that you can’t use a regular heap-based priority queue for this, because there’s no way to pop off the minimal element in less than linear time in the size of the heap.
[223001450100] |You can use a fully sorted implementation, like java.util.TreeSet, and that’s exactly what LingPipe’s util.BoundedPriorityQueue does.
[223001450110] |But it’s inefficient, because the red-black tree underlying the set implementation does a whole lot of work, including allocating nodes.
[223001450120] |Today’s review paper has the “right” algorithm to do bounded priority queues of fixed maximum sizes, which I implemented way back in LingPipe 2.4 in util.MinMaxHeap:
[223001450130] |java.util.SortedSet requires.
[223001450250] |I’ve just extended LingPipe’s priority queues to implement SortedSet, but I had to punt on the dynamic views returned by the headSet(), tailSet() and subSet() methods.
[223001450260] |For auto-complete, I also just added a small bounded priority queue for accumulating results, because bubble-sort‘s much faster for small sets than all the tree-fiddling in tree sets.
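For concreteness, here's a rough sketch of the sorted-set-backed approach (not LingPipe's actual implementation, and ignoring the wrinkle that a TreeSet silently drops elements its comparator considers equal):

    import java.util.Comparator;
    import java.util.TreeSet;

    // A fixed-capacity priority queue backed by a TreeSet: adds are logarithmic,
    // and once the queue is full a new element must beat the current worst to get in.
    public class SimpleBoundedQueue<E> {
        private final TreeSet<E> mElements;
        private final Comparator<? super E> mComparator;
        private final int mMaxSize;

        public SimpleBoundedQueue(Comparator<? super E> comparator, int maxSize) {
            mComparator = comparator;
            mElements = new TreeSet<E>(comparator);
            mMaxSize = maxSize;
        }

        public boolean offer(E e) {
            if (mElements.size() < mMaxSize)
                return mElements.add(e);
            E worst = mElements.first();          // smallest under the comparator
            if (mComparator.compare(e, worst) <= 0)
                return false;                     // not good enough to displace anything
            mElements.remove(worst);              // evict the worst to make room
            return mElements.add(e);
        }

        public E peekBest() {
            return mElements.isEmpty() ? null : mElements.last();
        }

        public E pollBest() {
            if (mElements.isEmpty()) return null;
            E best = mElements.last();
            mElements.remove(best);
            return best;
        }

        public int size() {
            return mElements.size();
        }
    }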
[223001460010] |English Aliases: [ "ultra thin flat panel screen", "ultra thin flat panel screens", "ultra thin screen", "ultra thin screens", "ultra-thin display", "ultra-thin displays", "ultra-thin flat panel displays", "ultra-thin screen", "ultra-thin screens" ]
[223001460060] |Lists like these also seem arbitrary.
[223001460070] |Why provide “ultra-thin” versus “ultra thin” variations in only some of the aliases?
[223001460080] |Is “ultra thin display”, which is not listed, an inferior alias?
[223001460090] |For some reason, the title isn’t included as an alias.
[223001460100] |The temptation is to use a regular expression, but often there are dependencies that are hard to capture with a regular expression.
[223001460110] |For instance, if I want to be able to say “tall white coffee” in all six orders, it’s pretty much impossible to characterize with a regex; the tightest I could come up with was (tall (white coffee|coffee white))|(white (tall coffee|coffee tall))|(coffee (tall white|white tall)).
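As a quick sanity check, that pattern really does cover all six orders:

    import java.util.regex.Pattern;

    public class PermutationCheck {
        public static void main(String[] args) {
            // the six-way permutation regex from above
            Pattern p = Pattern.compile(
                "(tall (white coffee|coffee white))"
                + "|(white (tall coffee|coffee tall))"
                + "|(coffee (tall white|white tall))");
            String[] orders = {
                "tall white coffee", "tall coffee white",
                "white tall coffee", "white coffee tall",
                "coffee tall white", "coffee white tall" };
            for (String s : orders)
                System.out.println(s + " : " + p.matcher(s).matches()); // true for all six
        }
    }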
[223001460120] |So what we wind up doing instead is taking a set of aliases and then looking at some kind of approximate matching.
[223001460130] |Our goal right now’s to get a high recall extraction over gene names.
[223001460140] |We know we can get the mentions, but what about the Entrez-Gene-ID/mention pairs?
[223001470010] |http://download.eclipse.org/eclipse/downloads/
[223001480050] |From there you need to click through for a specific release.
[223001480060] |The most recent release is at the top of the first table; right now it points to:
[223001480070] |http link.
[223001480090] |That’ll then take you to the download page.
[223001480100] |For 64-bit Windows (x86_64) use:
[223001480110] |The Windows (x86_64) builds run on Vista on both an Intel Core-2-based laptop and a dual-Xeon-based workstation.
[223001480130] |It works with the 1.5 JDKs from Sun.
[223001480140] |We have instructions on getting it running in our LingPipe Eclipse Tutorial.
[223001480150] |I’m about to add that you need to close the funny start page they have now, which wasn’t obvious.
[223001480160] |This took some finding.
[223001480170] |Actually, it took a hint from Mike, because you certainly won’t find it by navigating, which will take you to:
[223001480180] |http://www.eclipse.org/downloads/
[223001480190] |which doesn’t have links to 64-bit versions of Eclipse for Windows.
[223001480200] |If you search for on their site, you get old bboard questions without answers.
[223001480210] |If you search for on Google, Yahoo, or MSN, only MSN has anything close to a helpful answer in the first 10 hits.
[223001490010] |I recently changed com.aliasi.util’s BoundedPriorityQueue to have it implement java.util’s SortedSet.
[223001490030] |That way I could use more generic returns for things like the auto-completer.
[223001490040] |Turns out I built a tighter bubble-sort based sorted set for that in the end, but that’s a different story.
[223001490050] |So I replaced the extends AbstractCollection with extends AbstractSet.
[223001490060] |Everything’s hunky-dory, and it’s passing all of our unit tests.
[223001490070] |Move on a few days, and I’m doing massive refactorings to remove compiler warnings, so I want to use Eclipse, which has even more aggressive compiler warnings than the JDK.
[223001490080] |I couldn’t get it to work on 64-bit Windows, so I go home and fire up our antique ThinkPad.
[223001490090] |Eclipse only does JDK 1.5, so I plug that in.
[223001490100] |I do a little refactoring, then run unit tests.
[223001490110] |Failing in cluster.CompleteLinkClusterer.
[223001490120] |WTF?
[223001490130] |I shouldn’t have changed anything recently that has an effect on clustering.
[223001490140] |I trace through the code by dumping out dendrograms as they get created (agglomerative clusterers are easy to trace).
[223001490150] |It’s getting the distances wrong.
[223001490160] |The only way it could do that is if the priority queue of inter-cluster distances is messed up.
[223001490170] |Uh oh, that’s my new BoundedPriorityQueue.
[223001490180] |How’d I mess it up?
[223001490190] |Next, I back all the changes I’ve made with Eclipse and still it’s failing.
[223001490200] |But hang on, it worked on my machine!
[223001490210] |So I switch the JDK back to 1.6 and voilà, it works again.
[223001490220] |How can it work in 1.6 and not 1.5?
[223001490230] |Unfortunately, I still don’t know the answer.
[223001490240] |But I did, through more fine-grained tracing, figure out where it was going wrong.
[223001490250] |The culprit is the inconsistent behavior of TreeSet’s removeAll() method, which is only javadoc-ed on the superclass, AbstractSet, which points to the source of the problem:
[223001490260] |This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each.
[223001490270] |If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection.
[223001490280] |If it is so contained, it is removed from this set with the iterator’s remove method.
[223001490290] |If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method.
[223001490300] |The problem here is that iterating over the tree set itself (what happens when it’s smaller) uses a test with contains() and then calls the remove method on the iterator, whereas iterating over the argument collection calls remove() on the tree set.
[223001490310] |These differ with respect to whether they use equality as defined by equals() or the tree set’s comparator.
[223001490320] |Thus tree sets whose comparators are not compatible with natural equality run into problems.
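Here's a tiny illustration of that size-dependence, using a comparator that isn't consistent with equals(); the outputs shown are what AbstractSet's documented implementation produces:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class RemoveAllQuirk {
        public static void main(String[] args) {
            // case-insensitive ordering is not consistent with String.equals()
            TreeSet<String> set = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
            set.add("Foo");
            set.add("Bar");

            // Argument smaller than the set: removeAll() iterates over the argument
            // and calls the tree set's remove(), which uses the comparator,
            // so "foo" knocks out "Foo".
            List<String> small = new ArrayList<String>();
            small.add("foo");
            TreeSet<String> copy1 = new TreeSet<String>(set);
            copy1.removeAll(small);
            System.out.println(copy1);   // [Bar]

            // Argument larger than the set: removeAll() iterates over the set
            // and tests membership with the argument's contains(), which uses
            // equals(), so "foo" does not match "Foo" and nothing is removed.
            List<String> big = new ArrayList<String>();
            big.add("foo");
            big.add("quux");
            big.add("baz");
            TreeSet<String> copy2 = new TreeSet<String>(set);
            copy2.removeAll(big);
            System.out.println(copy2);   // [Bar, Foo]
        }
    }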
[223001490330] |Needless to say, there’ve been lots of “bugs” filed against this behavior, every one of which Sun dismisses as “obeying the letter of the spec”, but I happen to agree with user “gouldina” on this one:
[223001490340] |The fix was to implement removeAll() for bounded priority queues using the iterator-based approach (first case above).
[223001490380] |That’s how AbstractCollection implements it, and why the program worked before.
[223001490390] |Problem solved.
[223001490400] |Of course, it’s ridiculously inefficient for large queues.
[223001490410] |I need a better data structure here.
[223001490420] |The root cause (pun intended) is a trickier question.
[223001490430] |Both JDK 1.5 and 1.6 have the same definition of removeAll(), but it’s a maze of twisty little passages, all different, out from there through all the indirection of the remove() and contains() calls through the superclasses and the contained map’s definitions.
[223001490440] |Something must be different in 1.5 and 1.6, because I’m getting different answers.
[223001490450] |But I have better things to do than trace down what it is.
[223001510010] |visitTrain().
[223001510300] |On the same topic, suppose we have a text corpus and we want to restrict it to only texts of length 30 to 50 (yes, that just came up this week in a client project).
[223001510310] |We just apply the same trick twice, filtering the corpus by filtering the handler:
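Sketched with made-up minimal interfaces (the real corpus and handler types have more going on, but the shape of the trick is the same):

    // Hypothetical minimal interfaces standing in for the corpus/handler pattern;
    // the names here are illustrative, not LingPipe's exact API.
    interface TextHandler {
        void handle(String text);
    }

    interface TextCorpus {
        void visitTrain(TextHandler handler);
    }

    // A corpus filter implemented by filtering the handler it passes along.
    class LengthFilteredCorpus implements TextCorpus {
        private final TextCorpus mCorpus;
        private final int mMinLength;
        private final int mMaxLength;

        LengthFilteredCorpus(TextCorpus corpus, int minLength, int maxLength) {
            mCorpus = corpus;
            mMinLength = minLength;
            mMaxLength = maxLength;
        }

        public void visitTrain(final TextHandler handler) {
            // Wrap the supplied handler so only texts in the length range get through.
            mCorpus.visitTrain(new TextHandler() {
                public void handle(String text) {
                    if (text.length() >= mMinLength && text.length() <= mMaxLength)
                        handler.handle(text);
                }
            });
        }
    }

Usage is then just new LengthFilteredCorpus(corpus, 30, 50) wherever the original corpus was expected.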
[223001510320] |Basically, these approaches all put something in between the implementation you have and the implementation that’s needed, usually acting as a filter.
[223001510330] |While it’s not quite Lisp, it’s getting close in terms of parenthesis nesting, and is a handy tool for your Java toolbox.
[223001520010] |cluster.KMeansClusterer
[223001530050] |We’re working on a medium-size clustering problem for a customer with a few hundred thousand short-ish text messages being clustered.
[223001530060] |It was taking roughly forever (OK, several hours/epoch).
[223001530070] |I knew it was slow when I was writing it.
[223001530080] |You can tell by the apologetic javadoc implementation note:
[223001530090] |Implementation Note: The current implementation is an inefficient brute-force approach that computes the distance between every element and every centroid on every iteration.
[223001530100] |These distances are computed between maps, which itself is not very efficient.
[223001530110] |In the future, we may substitute a more efficient version of k-means implemented to this same interface.
[223001530120] |More efficient implementations could use kd-trees to index the points, could cache scores of input feature vectors against clusters, and would only compute differential updates when elements change cluster.
[223001530130] |Threading could also be used to compute the distances.
[223001530140] |All true, but there’s a simpler approach than kd-trees or multiple threads or caching that I just implemented and now the epochs are taking roughly no time at all (OK, about a minute).
[223001530150] |K-means clustering is very simple.
[223001530160] |You start with a set of items and a feature extractor to convert them to feature vectors.
[223001530170] |These feature vectors are maps from strings to numbers.
[223001530180] |In the first step, items are assigned to clusters somehow, for instance randomly (though there are better ways to do this for some purposes).
[223001530190] |Then, for each iteration until convergence or a maximum number of iterations, the mean (aka centroid) of each cluster is computed by averaging all the vectors (centroids are calculated dimensionwise), and each item’s Euclidean distance to each centroid is computed.
[223001530200] |After all these computations, each item is reassigned to the closest centroid at the end of an epoch.
[223001530210] |Typically, when I do heavy lifting with features, I extract the features once and then re-use the vectors.
[223001530220] |I did that with k-means, but didn’t consider the downstream computations closely enough.
[223001530230] |When you have a 100,000-feature centroid vector and a 100-feature item vector, it’s rather inefficient to compute Euclidean distance, which is defined as:
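Writing elt[i] for the item's value and c[i] for the centroid's value in dimension i, that's:

    dist(elt, c) = \sqrt{ \sum_i (elt[i] - c[i])^2 }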
[223001530240] |Of course, we don’t care about distance per se, so we can save the square root operation and work with squared distances
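In squared form:

    dist^2(elt, c) = \sum_i (elt[i] - c[i])^2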
[223001530250] |There are a couple of problems.
[223001530260] |Hash maps in Java involving numbers are wasteful in size, and converting back and forth to primitives for arithmetic costs time.
[223001530270] |They’re also slow to iterate (even linked hash maps).
[223001530280] |That’s an obvious bottleneck.
[223001530290] |But the real bottleneck is iterating at all.
[223001530300] |Remember, the fastest code is the code that doesn’t execute at all (as I noted in a comment about point 6 of John Langford’s excellent blog entry on machine learning code optimization).
[223001530310] |In our case, the code that doesn’t need to execute is the iteration over the dimensions where the item’s elt[i] values are zero.
[223001530330] |So what we do is compute the squared length of each centroid and store it at the beginning of each epoch:
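    length^2(c) = \sum_i c[i]^2 = \sum_i (0 - c[i])^2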
[223001530340] |The second part of this equation is key, as it shows how the length of the centroid is related to distances to dimensions with zero values.
[223001530350] |Here’s the formula:
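    dist^2(elt, c) = \sum_i (elt[i] - c[i])^2 = \sum_i ( elt[i]^2 - 2 elt[i] c[i] + c[i]^2 )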
[223001530360] |Now note that when elt[i]=0, the terms inside the sum cancel.
[223001530370] |So we keep going with:
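    dist^2(elt, c) = length^2(c) + \sum_{i : elt[i] != 0} ( elt[i]^2 - 2 elt[i] c[i] )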
[223001530380] |This simple algebra means that after computing the length of each centroid, each distance computation only requires a number of operations proportional to the number of non-zero elements in the item’s feature vector.
[223001530390] |Combined with sparse vector representations for the items (parallel int[] index and double[] value arrays) and reasonable dense vector representations for the centroids (i.e. double[]), K-means is now hundreds of times faster for the kinds of problems we care about: large sets of sparse feature vectors.
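A sketch of the per-item distance computation under this scheme (illustrative names, not the actual implementation):

    // Squared distance between a sparse item (parallel index/value arrays)
    // and a dense centroid, given the centroid's precomputed squared length.
    static double squaredDistance(int[] indexes, double[] values,
                                  double[] centroid, double centroidSquaredLength) {
        // Start from ||c||^2, which already accounts for every dimension
        // where the item's value is zero.
        double dist = centroidSquaredLength;
        for (int k = 0; k < indexes.length; ++k) {
            double x = values[k];
            double c = centroid[indexes[k]];
            // For non-zero dimensions, replace the c^2 contribution with (x - c)^2:
            // (x - c)^2 - c^2 = x^2 - 2xc
            dist += x * x - 2.0 * x * c;
        }
        return dist;
    }

    // Computed once per centroid at the start of each epoch.
    static double squaredLength(double[] centroid) {
        double sum = 0.0;
        for (double c : centroid)
            sum += c * c;
        return sum;
    }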
[223001530410] |I know I can squeeze out at least another factor of 2 or 3 on my dual quad-core machine by multi-threading; k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent.
[223001530420] |The first is in constructing a classify.JointClassification using joint log probabilities, which requires properly scaled linear probabilities in order to construct its superclass classify.ConditionalClassification.
[223001540080] |The second is inside multinomial logistic regression, where it’s used to convert the linear predictors into probabilities.
[223001540090] |Suppose we have a vector of finite numbers:
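    x = (x_1, \ldots, x_n)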
[223001540100] |The softmax operation converts it into a vector whose entries are finite, non-negative, and sum to 1:
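    softmax(x)_i = \exp(x_i) / Z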
[223001540110] |where the normalizing term is the usual sum:
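    Z = \sum_{j=1}^{n} \exp(x_j)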
[223001540120] |so that by construction:
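    \sum_{i=1}^{n} softmax(x)_i = 1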
[223001540130] |The goal is to actually compute the terms in the softmax vector.
[223001540140] |The problem?
[223001540150] |Overflow.
[223001540160] |The terms in x might be large.
[223001540170] |So large that Math.exp(x[i]) overflows.
[223001540180] |That doesn’t even need to be that big, just Math.log(Double.MAX_VALUE), or about 710.
[223001540190] |It’s especially easy to do in logistic regression under many conditions (e.g. large absolute feature values, large variance within a dimension of features, large predictive differences, weak priors with separable features).
[223001540200] |It’s algebra month here at Alias-i, so you shouldn’t be surprised when I note that the vector x can be shifted dimensionwise without changing the softmax result:
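    softmax((x_1 + c, \ldots, x_n + c)) = softmax((x_1, \ldots, x_n))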
[223001540210] |where the additive term factors through the normalizing constant:
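    \exp(x_i + c) / \sum_j \exp(x_j + c) = ( \exp(c) \exp(x_i) ) / ( \exp(c) \sum_j \exp(x_j) )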
[223001540220] |and thus drops out to establish the equivalence.
[223001540230] |What we do is rescale so there’s no overflow by subtracting the maximum of the x values from each x[i].
[223001540240] |That converts overflow into underflow.
[223001540250] |Underflow is no problem, because that rounds off to zero, which is a well-behaved floating point number.
[223001540260] |The code is roughly as follows:
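Something along these lines (variable names illustrative):

    // A stripped-down rescaled softmax.
    static double[] softmax(double[] xs) {
        // find the maximum input
        double a = Double.NEGATIVE_INFINITY;
        for (double x : xs)
            if (x > a)
                a = x;

        // normalizer over the shifted values; the largest term is exp(0) = 1,
        // so nothing overflows, and very small terms just round down toward 0
        double z = 0.0;
        for (double x : xs)
            z += Math.exp(x - a);

        // the softmax entries themselves
        double[] ps = new double[xs.length];
        for (int i = 0; i < xs.length; ++i)
            ps[i] = Math.exp(xs[i] - a) / z;
        return ps;
    }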
[223001540270] |The real code’s less clear because it caches the exponent computation Math.exp(xs[i] - a) in ps while computing the sum for Z and then just rescales ps by Z.
[223001540280] |In practice, suppose x = (305, 300, 150).
[223001540290] |So we subtract the max (305) from each entry to get x' = (0, -5, -155).
[223001540300] |We then calculate the normalizer Z = exp(0) + exp(-5) + exp(-155) ~ 1 + 0.0067 + 0 = 1.0067.
[223001540310] |Note how the exponentiation of the big negative value (-155) is so tiny it vanishes in the sum; with an even more negative value it would simply underflow to zero.
[223001540320] |We now compute the linear probabilities as p' = (exp(0)/1.0067, exp(-5)/1.0067, exp(-155)/1.0067) ~ (0.993, 0.007, 0).
[223001540330] |The old classifier implementation used the max rather than the min to rescale, which isn’t aggressive enough in that it still allows overflow when the inputs may be large and positive.
[223001540340] |That wasn’t a problem for classifiers when the numbers were log probabilities, because they’re already all negative; so there the scaling prevents catastrophic underflow, and we already had that right.
[223001540350] |This method is guaranteed not to have any overflow.
[223001540360] |And it’s guaranteed not to have all zero values, too.
[223001540370] |I’m afraid it’s simply not possible to represent two numbers more than exp(711) or so apart using IEEE double-precision floating point arithmetic.
[223001540380] |And frankly, once events are relatively this unlikely, we’re not too worried about rounding to zero.
[223001540390] |Look for the new implementation in LingPipe 4.0’s logistic regression classification function.
[223001540400] |Coming soon to a web site near you.
[223001550010] |