[224001710010] |What Makes a Fair Comparison
[224001710020] |This issue seems to come up implicitly or explicitly in a large variety of circumstances.
[224001710040] |The general setup is this.
[224001710040] |Suppose that for some task, we have the "de facto" approach "A."
[224001710050] |I come up with a new approach "B" that I want to argue is better than "A."
[224001710060] |The catch is that for whatever reasons, I have to limit B in some way (perhaps in terms of the amount of data I train it on).
[224001710070] |So, is the fair comparison to a similarly limited version of A or to the full-blown A?
[224001710080] |Let's instantiate this with two examples.
[224001710090] |(Yes, these are examples of papers that have been published; you can see if you can figure out which ones they are.
[224001710100] |I'm just not sufficiently creative to make something up right now.)
[224001710110] |Let's say I'm doing syntactic language modeling.
[224001710120] |In this case, the baseline would be (say) a trigram language model.
[224001710130] |My model requires parse trees to train.
[224001710140] |So I train my language model on the Penn Treebank.
[224001710150] |Now, when I compare to an ngram model, is the fair comparison to an ngram model trained on the Treebank, or one trained on, say, ten years of WSJ text?
[224001710160] |Now let's say I'm doing MT.
[224001710170] |My baseline is Moses and I build some fancy MT system that can only train on sentences of length 10 or less.
[224001710180] |Should I compare to Moses trained on sentences of length 10 or less, or all sentences?
[224001710190] |The obvious response is "report both."
[224001710200] |But this is just deferring the problem.
[224001710210] |Instead of you deciding which is the appropriate comparison, you are leaving the decision up to someone else.
[224001710220] |Ultimately, someone has to decide.
[224001710230] |The advantage to comparing to a gimped version of model A is that it tells you: if this is the only data available to me, this is how much better I do than A.
[224001710240] |Of course, in no real world situation (for many problems, including the two I listed above) will this really be the only data you have.
[224001710250] |Plus, if you compare to a non-gimped A, you'll almost always lose by a ridiculously large margin.
[224001710260] |On the other hand, comparing against a non-gimped A is a bit unfair.
[224001710270] |Chances are that quite some time has gone into optimizing (say, for speed) algorithms for solving A. Should I, as a developer of new models, also have to be a hard-core optimizer in order to make things work?
[224001710280] |I'm thinking back to the introduction of SVMs twenty years ago.
[224001710290] |Back then, SVM training would take thousands of times longer than naive Bayes.
[224001710300] |Today, (linear) SVM training really isn't that much slower (maybe a small constant times longer).
[224001710310] |Yet from the perspective of a consumer, there's something fundamentally unpleasant about having to gimp an existing system.... you have to ask yourself: why should I care?
[224001710320] |I suppose this is where human judgement comes in.
[224001710330] |If I can reasonably imagine that the new system B might possibly be scaled up and, if so, I think it would continue to do well, then I'm not unhappy with a gimped comparison.
[224001710340] |For example, I can probably buy the syntactic language modeling example above (and to a lesser degree the MT example).
[224001710350] |I have a harder time with grammar induction on 10 word sentences because my prior beliefs state that 10 word sentences are syntactically really different from real sentences.
[224001710360] |(Incidentally, although this issue didn't come up in my recent reviewing, I suppose that if I were reviewing a paper that had this issue, it probably wouldn't hurt if the authors were to ease me through this imagination process.
[224001710370] |For instance, in grammar induction, maybe you can show statistics that say that the distribution of context free rules is not so dissimilar between all sentences and short sentences.
[224001710380] |This almost never happens, but I think it would be useful both during reviewing as well as just for posterity.)
[224001730010] |Co-training, 10 years later
[224001730020] |At this year's ICML, they gave out a "10 year" award to a paper published in an ICML-related venue from 1998.
[224001730030] |This year it went to a COLT 1998 paper by Avrim Blum and Tom Mitchell: Combining Labeled and Unlabeled Data with Co-Training.
[224001730040] |While I'm not super familiar with what the other contenders might have been, I have to say that I definitely think this is a good choice.
[224001730050] |For those unfamiliar with the idea of co-training, you should really read the paper.
[224001730060] |There's also a Wikipedia entry that describes it as:
[224001730070] |Co-training is a semi-supervised learning technique that requires two views of the data.
[224001730080] |It was introduced by Avrim Blum and Tom Mitchell.
[224001730090] |It assumes that each example is described using two different feature sets that provide different, complementary information about the instance.
[224001730100] |Ideally, the two views are conditionally independent (i.e., the two feature sets of each instance are conditionally independent given the class) and each view is sufficient (i.e., the class of an instance can be accurately predicted from each view alone).
[224001730110] |Co-training first learns a separate classifier for each view using any labeled examples.
[224001730120] |The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data.
[224001730130] |This is a good summary of the algorithm, but what is left off is that---as far as I know---co-training was one of the first (if not the first) method for which theoretical analysis showed that semi-supervised learning might help.
[224001730140] |My history is a bit rough, so anyone should feel free to correct me if I'm wrong.
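For concreteness, the basic loop in that summary might look something like the sketch below. This is only an illustration: the choice of logistic regression as the per-view learner, the fixed number of rounds, the per-round growth, and the plain confidence ranking are all my assumptions, not the exact pool-based setup from the Blum and Mitchell paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def cotrain(X1, X2, y, U1, U2, rounds=10, per_round=5):
        # X1/X2: the two views of the labeled examples (parallel lists of
        # feature vectors); U1/U2: the two views of the unlabeled examples.
        X1, X2, y = list(X1), list(X2), list(y)
        U1, U2 = list(U1), list(U2)
        h1, h2 = LogisticRegression(), LogisticRegression()
        for _ in range(rounds):
            h1.fit(np.array(X1), np.array(y))   # one classifier per view
            h2.fit(np.array(X2), np.array(y))
            for h, U in ((h1, U1), (h2, U2)):
                if not U:
                    break
                probs = h.predict_proba(np.array(U))
                conf = probs.max(axis=1)
                # move this classifier's most confident unlabeled points,
                # with their predicted labels, into the labeled pool
                for i in sorted(np.argsort(-conf)[:per_round], reverse=True):
                    y.append(h.classes_[probs[i].argmax()])
                    X1.append(U1.pop(i)); X2.append(U2.pop(i))
        return h1, h2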
[224001730150] |Another aspect of co-training that is cool for readers of this blog is that to a reasonable degree, it has its roots in a 1995 ACL paper by David Yarowsky: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, which, as far as I know, was really the first paper to introduce the notion of having two views of data (although I don't think David described it as such).
[224001730160] |All in all, the co-training paper is great.
[224001730170] |In fact, if you don't believe me that I think it's great, check out my EMNLP 2008 paper.
[224001730180] |My analysis (and, to some degree, algorithm) are based heavily on the co-training analysis.
[224001730190] |Which brings me to what I really want to discuss.
[224001730200] |That is, I have a strong feeling that if the co-training paper were reviewed today, it would be blasted for the theoretical analysis.
[224001730210] |(Indeed, I had the same fear for my EMNLP paper; though since it was EMNLP and not, say, COLT, I don't think the problem was as severe.)
[224001730220] |The problem with the co-training paper is that the theoretical result is for an algorithm that is only superficially related to the actual algorithm they implement.
[224001730230] |In particular, the actual algorithm they implement uses notions of confidence, and steadily increasing training set size, and incremental additions and so on.
[224001730240] |It's vastly more complicated than the algorithm they analyze.
[224001730250] |My recent experience as both an author and reviewer at places like NIPS and ICML is that this is pretty much a non-starter these days.
[224001730260] |In fact, the algorithm is so different that it took three years for an analysis of something even remotely close to come out.
[224001730270] |In NIPS 2001, Sanjoy Dasgupta, Michael Littman and David McAllester published a paper that actually tries to analyze something closer to the "real" co-training algorithm.
[224001730280] |They get pretty close.
[224001730290] |And this analysis paper is a full NIPS paper that basically just proves one (main) theorem.
[224001730300] |(A similar set of events happened with David Yarowsky's paper.
[224001730310] |He didn't include any theoretical analysis, but there has been subsequent work, for instance by Steve Abney to try to understand the Yarowsky algorithm theoretically.
[224001730320] |And again we see that an analysis of the exact original algorithm is a bit out of grasp.)
[224001730330] |I'm sure other people will disagree--which is fine--but my feeling about this matter is that there's nothing wrong with proving something interesting about an algorithm that is not quite exactly what you implement.
[224001730340] |The danger, of course, is if you get an incorrect intuition.
[224001730350] |For instance, in the case of co-training, maybe it really was all these "additions" that made the algorithm work, and the whole notion of having two views was useless.
[224001730360] |This seems to have turned out not to be the case, but it would be hard to tell.
[224001730370] |For instance, the co-training paper doesn't report results on the actual algorithm analyzed: presumably it doesn't work very well or there would be no need for the more complex variant (I've never tried to implement it).
[224001730380] |On the other hand, if it had taken Avrim and Tom three extra years to prove something stronger before publishing, then the world would have had to wait three extra years for this great paper.
[224001730390] |The approach I took in my EMNLP paper, which, at least as of now, I think is reasonable, is to just flat out acknowledge that the theory doesn't really apply to the algorithm that was implemented.
[224001730400] |(Actually, in the case of the EMNLP paper, I did implement both the simple and the complex and the simple wasn't too much worse, but the difference was enough to make it worth--IMO--using the more complex one.)
[224001740010] |Statistical Machine Translation Papers at COLING
[224001740020] |(guest post) It's not often that a major conference is just a short train ride away, so I went down to Manchester even though I had no real business being at COLING this year.
[224001740030] |Liang Huang gave an interesting tutorial where he tied together a number of natural language problems and types of dynamic programming solutions for them.
[224001740040] |A nice treatment of this area, with examples from tree-based statistical machine translation, of course.
[224001740050] |There were also a lot of very strong papers, so let's take a look at them.
[224001740060] |Syntax-Based SMT is alive: Currently one of the most exciting areas of statistical machine translation is the development of tree-based models, with all the open questions: Real syntax in the source, in the target?
[224001740070] |Which grammar formalism?
[224001740080] |How do we parameterize the models?
[224001740090] |And all of this efficiently, please!
[224001740100] |When using real syntax on both sides, many of the rules in standard phrase-based models no longer match the syntactic annotations, so they have to be dropped.
[224001740110] |So, to accommodate more rules, Zhang et al. propose to extend the synchronous grammar formalism to allow multi-headed rules.
[224001740120] |He et al. build maximum entropy models for rule selection in hierarchical phrase models, similar to recent work by Carpuat and Wu in phrase-based models.
[224001740130] |Additional source-side context may of course be a syntactic parse tree.
[224001740140] |Xiong et al. use features about syntactic spans in their translation model and maximum entropy reordering model in a BTG framework.
[224001740150] |Zhang et al. present an algorithm to extract rules from a word-aligned corpus in linear time.
[224001740160] |The trick is to build up a tree structure that compactly encodes all possible rules.
[224001740170] |Storing all these rules is a real problem, too -- it may take up terabytes -- but Lopez presents a handy solution: suffix arrays.
[224001740180] |He also shows that rules that cover longer spans do help in hierarchical models, even if long phrases do not matter much in phrase-based models.
[224001740190] |Finally, Zollmann et al. present detailed results from their time at Google where they compared their syntax-augmented model against a hierarchical and standard phrase model -- so, syntax has arrived at Google at last.
[224001740200] |Back to the Basics: Statistical machine translation systems have evolved into a fairly long pipeline of steps, and many of them deserve more attention.
[224001740210] |For instance the phrase segmentation of the input.
[224001740220] |Does it matter?
[224001740230] |The standard approach uses uniform segmentation, and maybe a phrase count feature.
[224001740240] |Blackwood et al. suggest building an explicit phrasal segmentation model, which takes the form of a smoothed bigram phrase model.
[224001740250] |Moore and Quirk examine everybody's favorite: minimum error rate training (MERT), the step used to fine-tune weights given to various model components.
[224001740260] |This is typically done with a random starting point, and then hillclimbing to a better weight setting to optimize BLEU.
[224001740270] |Since this method easily gets stuck in local minima, this is repeated several times.
[224001740280] |Moore and Quirk promise better merting (yes, that is a verb!) with fewer random restarts and better selection of the starting points through a random walk.
[224001740290] |The n-best list of candidate translations used during merting or re-ranking may also be improved.
[224001740300] |Chen et al. suggest adding additional entries using a confusion network or by concatenating n-grams found in candidate translations.
[224001740310] |Speaking of building confusion networks, they are also used in combining different system outputs (which is the current key method to win competitions, even though bad-mouthed as intellectually bankrupt at the recent NIST evaluation meeting).
[224001740320] |Ayan et al. suggest that better confusion networks and better system combination results may be achieved when considering synonyms found in WordNet for matching up words when aligning the different outputs.
[224001740330] |Zwarts and Dras show that system combination could also be done by building a classifier that chooses one of the outputs for each sentence.
[224001740340] |Reordering: Two papers on reordering, one of the big unsolved problems in statistical machine translation.
[224001740350] |Elming shows some advances in the first-reorder-into-lattice-then-decode approach, specifically how this lattice should be scored: he argues for scoring the final output to check which rules were implicitly followed (either by applying the rule or using a phrase translation that has it internalized).
[224001740360] |Zhang et al. propose that different reordering models should be built for different types of sentences, such as statements vs. questions.
[224001740370] |Neat and surprising: We typically train our system on corpora that we find on web sites of multilingual institutions, such as the UN or the EU.
[224001740380] |When using such data, does it matter what the original source language was?
[224001740390] |I doubt it, but then van Halteren shows that he can detect the original source language in English Europarl documents with over 96 percent accuracy.
[224001740400] |Connecting with our rule-based friends: An effective but simple method to combine a rule-based system like Systran with a statistical model is to first translate with Systran and then learn a statistical model to translate Systran English into real English.
[224001740410] |Based on your biases, you can call this rule-based preprocessing, serial combination, or statistical post-editing.
[224001740420] |Ueffing et al. show that some information from the rule-based system may help the statistical component, such as annotation of names or other reliable markup.
[224001740430] |Aligning and building up dictionaries: What can you do with a large monolingual source language corpus, or a target language corpus, or a conventional bilingual dictionary?
[224001740440] |Wu et al. present various methods and compare them.
[224001740450] |Tsunakawa et al. use a pivot language (English) to build a Japanese-Chinese dictionary.
[224001740460] |Given a syntactically parsed parallel corpus, Zhechev and Way compare different methods for extracting subtrees.
[224001740470] |Macken et al. extract technical term translations from a domain-specific parallel corpus.
[224001740480] |Lardilleux and Lepage propose an iterative phrase alignment method that first matches short sentences and phrases, and then subtracts the known alignments from longer phrases to extract the remainder, basically pigeon-holing.
[224001740480] |Also: a paper on a Bayesian method for Chinese word segmentation by Xu et al.; a paper on transliteration by Malik et al.; a paper on evaluation, especially whether the quality of reference translations matters (it does not), by Hamon and Mostefa; and a new grammar formalism for translation by Søgaard.
[224001740500] |What's missing?
[224001740510] |No paper on word alignment or large-scale discriminative training, but there is always Hawaii.
[224001750010] |Using machine learning to answer scientific questions
[224001750020] |I tend to prefer research (and the papers that result from it) that attempt to answer some sort of question about natural language.
[224001750030] |And "can I build a system that beats your system" does not count as a question.
[224001750040] |(Of course, I don't always obey this model myself.)
[224001750050] |For instance, the original IBM MT papers were essentially trying to answer the question: can we build a model of translation directly from parallel data?
[224001750060] |Later work in phrase-based and syntax-based translation basically started out by asking the question: is syntactic structure (or local lexical structure) a useful abstraction for enabling translation?
[224001750070] |It seems the answer to all these questions is "yes."
[224001750080] |A less broad type of question that one can ask might be categorized as: is some type of feature relevant for some task.
[224001750090] |For instance, are lexical semantics features derived via a hand-crafted resource (such as WordNet) useful for the task of coreference resolution?
[224001750100] |What about lexical semantics features derived automatically from the web?
[224001750110] |This type of question is the sort of thing that I looked at in an EMNLP 2005 paper.
[224001750120] |The answers seemed to be "no" and "yes" respectively.
[224001750130] |There are several issues with trying to answer such questions.
[224001750140] |One issue is that typically what you're actually looking at is the question: can I figure out a way to turn WordNet into a feature vector that is useful for the task of coreference resolution?
[224001750150] |This is probably a partial explanation for why our community seems not to like negative results to such questions: maybe you just weren't sufficiently clever in encoding WordNet as features.
[224001750160] |Or maybe WordNet features are useful but there is some other set of features that's more useful that's just swamping the benefits of WordNet.
[224001750170] |My first question is whether we're even going about this the right way.
[224001750180] |The usual approach is to take some baseline system, add WordNet features, and see if predictive performance goes up (i.e., performance on test data).
[224001750190] |This seems like a bit of a round-about way to attack the problem.
[224001750200] |After all, this problem of "does some feature have an influence on the target concept" is a classical and very well studied area of statistics: most people have probably seen ANOVA (analysis of variance), but there are many many more ways to try to address this question.
[224001750210] |And, importantly, they don't hinge on the notions of predictive performance.
[224001750220] |(Which almost immediately ties us in to "my system is better than yours.")
[224001750230] |Now, I'm not claiming that ANOVA (or some variant thereof) is the right thing to do.
[224001750240] |Our problems are often quite different from those encountered in classical statistics.
[224001750250] |We have millions of covariates (features) and often complex structured outputs.
[224001750260] |I just wonder if there's some hint of a good idea in those decades of statistical research.
[224001750270] |The other way to go about looking at this question, which at the time I wrote the EMNLP paper I thought was pretty good, is the following.
[224001750280] |The question is: why do I care if WordNet features are useful for prediction.
[224001750290] |Very rarely will they actually hurt performance, so who cares if they don't help?
[224001750300] |(Besides those people who actually want to know something about how language works.... though in this case perhaps it tells us more about how WordNet works.)
[224001750310] |The perspective we took was that someone out there is going to implement a coreference system from scratch (perhaps for a new language, perhaps for a new domain) and they want to know what features they should spend time writing and which ones they can ignore.
[224001750320] |This now is a question about predictive performance, and we did a form of backwards feature selection to try to answer it.
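Mechanically, that kind of backwards ablation over feature groups is simple; here is a rough sketch, with a plain logistic regression standing in for the actual coreference model and a dict of feature-group column indices as an assumed interface (the real paper's setup differs):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def backward_group_ablation(groups, X_tr, y_tr, X_dev, y_dev):
        # groups: dict mapping a feature-group name (e.g. 'string-match',
        # 'lexical', 'wordnet') to an array of column indices into X.
        active = dict(groups)
        order = []                                # groups in the order they get dropped
        while len(active) > 1:
            scores = {}
            for name in active:
                keep = np.concatenate([cols for g, cols in active.items() if g != name])
                clf = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)
                scores[name] = clf.score(X_dev[:, keep], y_dev)
            drop = max(scores, key=scores.get)    # removing it hurts dev accuracy least
            order.append((drop, scores[drop]))
            del active[drop]
        # the groups dropped last (plus the lone survivor) are the ones
        # worth implementing first
        return order, list(active)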
[224001750330] |If you look at Figure 2 in the paper, basically what you see is that you'd better implement string-match features, lexical features, count-based features, and knowledge-based features.
[224001750340] |On top of this, you can get another half point with inference features and another half again using gazetteers.
[224001750350] |But beyond that, you're probably wasting your time.
[224001750360] |(At the time of the EMNLP paper, one of the big "sells" was the count-based features, which are impossible in the standard pairwise-decision models that most people apply to this problem.
[224001750370] |I still think this is an interesting result.)
[224001750380] |At any rate, I still think this ("what should I spend time implementing") is a reasonable way to look at the problem.
[224001750390] |But if you're moving to a new domain or new language, most likely all of your time/money is going to be spent annotating data, not implementing features.
[224001750400] |So maybe you should just go and implement everything you can think of anyway.
[224001750410] |And besides, if you want to make an argument about moving to different languages or different domains or even different amounts of training data, you should make those arguments specifically and do the experiments to back them up.
[224001750420] |For instance, how does Figure 2 in the EMNLP paper change when you have half the data, or a tenth the data?
[224001750430] |Maybe the lexical features don't matter as much and perhaps here WordNet might kick in?
[224001760010] |Supplanting vs Augmenting Human Language Capabilities
[224001760020] |With an analogy to robotics, I've seen two different approaches.
[224001760030] |The first is to develop humanoid robots.
[224001760040] |The second is to develop robots that enhance human performance.
[224001760050] |The former supplants a human (eg., the long awaited robot butler); the latter augments a human.
[224001760060] |There are parallels in many AI fields.
[224001760070] |What about NLP?
[224001760080] |I would say that most NLP research aims to supplant humans.
[224001760090] |Machine translation puts translators out of work.
[224001760100] |Summarization puts summarizers out of work (though there aren't as many of these).
[224001760110] |Information extraction puts (one form of) information analysts out of work.
[224001760120] |Parsing puts, well... hrm...
[224001760130] |There seems actually to be quite little in the way of trying to augment human capabilities.
[224001760140] |Web search might be one such area, though this is only tenuously an "NLP" endeavor.
[224001760150] |Certainly there is a reasonable amount of work in translation assistance: essentially fancy auto-completion for translators.
[224001760160] |Some forms of IE might look like this: find all the names, coreferenceify them and then present "interesting" results to a real human analyst who just doesn't have time to look through all the documents... though this looks to me like a fancy version of some IR task that happens to use IE at the bottom.
[224001760170] |Where else might NLP technology be used to augment, rather than supplant?
[224001760180] |A student here recently suggested the following.
[224001760190] |When learning a new language, you often encounter unknown words while reading.
[224001760200] |These words could be looked up in a dictionary and described in a little pop-up window.
[224001760210] |Of course, if the definition (hopefully sense-disambiguated) itself contains unknown words, you'd need to recurse.
[224001760220] |He then suggested a similar model for reading Wikipedia pages: tell Wikipedia everything you know and then have it regenerate the "variational EM" page explaining things that you don't know about (again, recursively).
[224001760230] |This could either be interactive or not.
[224001760240] |Wikipedia is nice here because you can probably look up most things that a reader might not know via internal Wikipedia links.
[224001760250] |Although really a field of IR, there's the whole interactive track at TREC that essentially aims for an interactive search experience, complete with suggestions, refinements, etc.
[224001760260] |I can imagine electronic tutorials that automatically follow your progress in some task (eg., learning to use photoshop) and auto-generate text explaining parts where you seem to be stuck, rather than just providing you with random, consistent advice.
[224001760270] |(Okay, this starts to sound a bit like our mutual enemy clippy... but I suspect it could actually be done well, especially if it were really in the context of learning.)
[224001760280] |Speaking of learning, someone (I don't even remember anymore! Sorry!) suggested the following to me a while ago.
[224001760290] |When trying to learn a foreign language, there could be some proxy server you go through that monitors when you are reading pages in this language that you want to learn.
[224001760300] |It can keep track of what you know and offer mouseover suggestions for words you don't know.
[224001760310] |This is a bit like the first suggestion above.
[224001760320] |That's all I can come up with now.
[224001760330] |One big "problem" with working on such problems is that you then cannot avoid actually doing user studies, and we all know how much we love doing this in NLP these days.
[224001770010] |Where did my (linear) model go wrong?
[224001770020] |Back to research-related posts for a while :).
[224001770030] |Let's say we're learning a predictor f and it doesn't get 100% test accuracy.
[224001770040] |We can ask ourselves: why not.
[224001770050] |Typically, the reason I ask myself this question is because I want to get closer to 100% test accuracy and want to know how I should go about it.
[224001770060] |Should I add more labeled data (what should it look like) or should I add more features (what should they look like) or am I just hosed?
[224001770070] |In order to try to get a handle on this, we can ask ourselves: where can error come from:
[224001770080] |1. Noise in the training data (which we have fit and is now screwing us up at test time)
[224001770090] |2. Noise in the test data (which is causing our otherwise correct predictions to look wrong)
[224001770100] |3. Insufficient representation (we just don't have the right/enough features)
[224001770110] |4. Insufficient examples (our training data wasn't sampled densely enough in some region)
[224001770120] |These are not mutually exclusive.
[224001770130] |For instance, maybe by reducing/changing the feature set (3), it would turn out that we are sampled densely enough in a region of interest (4).
[224001770140] |I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever growing feature sets, sparse feature representations, etc.).
[224001770150] |So let's say we train a model (perhaps linear) on a bag of words feature set and apply it on test data.
[224001770160] |Say we get 90% accuracy on dev data.
[224001770170] |Now what can we do?
[224001770180] |Let's take a single example that we misclassified.
[224001770190] |Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters.
[224001770200] |Which of the four reasons above tells us why this example was misclassified?
[224001770210] |By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data).
[224001770220] |We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this.
[224001770230] |Let's say we decide that the problem isn't due to test set noise.
[224001770240] |Now what?
[224001770250] |Let's consider the following approach.
[224001770260] |We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later).
[224001770270] |What we're going to do is add this misclassified dev point into the training data with a varying cost.
[224001770280] |The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero.
[224001770290] |We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C).
[224001770300] |Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.
[224001770310] |Now, we're going to look at two of these fs.
[224001770320] |We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified.
[224001770330] |Let's say we now run these two fs on the training data.
[224001770340] |We know that f(0) will misclassify our "selected" test point and that f(c') will not.
[224001770350] |The big question is what do the fs do on the other points.
[224001770360] |Suppose that f(c') doesn't make any (many?) more mistakes than f(0).
[224001770370] |That is, they're basically the same, just now classifying our selected point correctly.
[224001770380] |This suggests that the problem is (3) or (4) above (features or examples).
[224001770390] |Suppose that f(c') makes many more mistakes than f(0).
[224001770400] |Now, we see that in order to get this selected test point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).
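Here is a minimal sketch of that loop, assuming a linear SVM with per-example weights (scikit-learn's LinearSVC) and a crude geometric grid of costs; it's meant to show the bookkeeping, not to be the definitive recipe:

    import numpy as np
    from sklearn.svm import LinearSVC

    def diagnose_point(X_tr, y_tr, x_dev, y_dev, costs=(0.5, 1, 2, 4, 8, 16, 32, 64)):
        # f(0): the dev point has cost zero, i.e. it is simply left out
        f0 = LinearSVC().fit(X_tr, y_tr)
        errs0 = (f0.predict(X_tr) != y_tr).sum()

        X_aug = np.vstack([X_tr, x_dev])
        y_aug = np.append(y_tr, y_dev)
        base_w = np.ones(len(y_tr))
        for c in costs:                           # increase c until the dev point is fixed
            fc = LinearSVC().fit(X_aug, y_aug, sample_weight=np.append(base_w, c))
            if fc.predict([x_dev])[0] == y_dev:
                errs_c = (fc.predict(X_tr) != y_tr).sum()
                # errs_c close to errs0  -> looks like (3) or (4): features or examples
                # errs_c much larger     -> looks like (1) or (3): noise or features
                return c, errs0, errs_c
        return None, errs0, None                  # no cost in the grid fixed it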
[224001770410] |At this point, we have separated causes (3 or 4) from (1 or 3), but haven't separated them completely.
[224001770420] |Now, let's go back and do the same process of all of the dev points that were misclassified.
[224001770430] |What can happen?
[224001770440] |Almost all of the f(c')s make no more errors on other training points.
[224001770450] |Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), this is almost certainly a feature issue (3).
[224001770460] |Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same.
[224001770470] |Then this is almost certainly noisy training data (or noisy test data).
[224001770480] |Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time.
[224001770490] |Then this is almost certainly a feature issue (3).
[224001770500] |Mixed results: some f(c')s make more errors on training, some don't.
[224001770510] |This is harder to say, but my gut tells me that this is probably a (4: example) issue.
[224001770520] |There's a lot of hedging going on here: "almost", "many", "different", etc.
[224001770530] |Maybe you could formalize these things, but I'd rather get the intuition right first.
[224001770540] |(If you're using a margin based classifier, you might not have to exactly retrain each time.
[224001770550] |Koby Crammer's passive aggressive algorithm will essentially give you a closed form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)
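For reference, here's the closed form I mean, written from memory (so double-check against the passive-aggressive paper) and assuming binary labels in {-1, +1} with no aggressiveness cap:

    import numpy as np

    def pa_update(w, x, y):
        # closest (in L2) weight vector to w that classifies (x, y) correctly
        # with margin at least 1; tau is the hinge loss divided by ||x||^2
        tau = max(0.0, 1.0 - y * np.dot(w, x)) / np.dot(x, x)
        return w + tau * y * x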
[224001770560] |Note that I haven't actually tried this.
[224001770570] |I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.
[224001780010] |Two Reviewer Comments that Stuck with Me
[224001780020] |Over the years I've gotten a number of positive reviews and negative reviews.
[224001780030] |I usually remember few of the details of any of them more than a few months after the good or bad news.
[224001780040] |Two reviewer quotes, however, have really stuck with me.
[224001780050] |They were both in overall positive reviews, though one of them is more negative sounding.
[224001780060] |Obviously I don't know who wrote them, but they've actually had a strong impact on most of my research and my own reviewing since:
[224001780070] |"Other approaches don't have to be bad in order for your approach to be good."
[224001780080] |"If I were working in this area, I would want to know about these results."
[224001780090] |You can guess what I might have written that had caused this reviewer to make the first comment.
[224001780100] |And actually I've come to think of these two comments as basically saying the same thing.
[224001780110] |I feel like we too often see research in a competitive light.
[224001780120] |There are two things that can cause this.
[224001780130] |The first is funding.
[224001780140] |Short-term funding (in the US) is essentially a zero-sum game, which means that I can win only if you lose.
[224001780150] |(There are models where this is less true, but usually that has other -- not necessarily desirable -- outcomes.)
[224001780160] |The second is the scooping effect: many times when we have (what we think is) a cool idea, we want to "beat" everyone else to the punch.
[224001780170] |I recall a comment John Langford made on his blog (which I cannot seem to find now because I can't figure out what search terms to use) quite some time ago along these lines... saying that for many problems he doesn't care who finds a solution so long as the solution is found.
[224001780180] |I usually have mixed feelings.
[224001780190] |For some topics, I definitely feel this way.
[224001780200] |Indeed, for some topics I actually don't want to be the person who figures it out, either because I don't feel I have the necessary skills or because I don't have the necessary time.
[224001780210] |For some topics, I do get a feeling of ownership and really want to be the person who does it.
[224001780220] |My thesis work was like that, as was my document/abstract alignment work.
[224001780230] |This is definitely highly personal: I know plenty of people who care a lot more about ownership than I do, and many who care a lot less.
[224001780240] |What I took away from this comment is essentially the realization that we are all working toward some vague future goal, which has to do with computationalizing language processing (or some other topic, for the non-NLP audience).
[224001780250] |Progress is good.
[224001780260] |If I've done work that has something interesting and novel to say about this goal, then it's not bad -- and is often good -- that this builds on and improves on your work.
[224001790010] |Interesting NIPS papers, take 1
[224001790020] |I just got back from NIPS.
[224001790030] |Kevin Duh was nice enough to forward his "top N" list of NIPS papers; I'll post my own shortly.
[224001790040] |Thanks Kevin!
[224001790050] |"Large Margin Taxonomy Embedding for Document Categorization" - Kilian Weinberger, Olivier Chapelle (Yahoo)
[224001790060] |- Suppose you have a multi-class classification problem where you need to assign documents to different nodes in the topic hierarchy.
[224001790070] |While there are hierarchical classifiers for solving this problem, the authors instead propose to embed the taxonomy in a continuous space and use regression.
[224001790080] |The idea is as follows: from a taxonomy, we can compute distances between nodes that characterize the loss of classifying one node when the true class is the other.
[224001790090] |This pairwise distance matrix is used by multidimensional scaling to create a set of prototype vectors in Euclidean space, one for each class.
[224001790100] |Then, we train a (multivariate) regression that maps training samples to these prototype vectors.
[224001790110] |However, the problem with this two-stage approach is that the prototypes are computed without regard to the training data, so the solution may be suboptimal for classification.
[224001790120] |The paper then introduces an objective function that combines both steps: essentially, we constrain the mapping of the training samples and the prototypes to have a large margin.
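As I read it, the two-stage baseline that the joint objective improves upon looks roughly like this (scikit-learn MDS plus ridge regression are my stand-ins; the embedding dimension and nearest-prototype decoding are also assumptions):

    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.linear_model import Ridge

    def two_stage_taxonomy_embedding(D_class, X_tr, y_tr, n_dims=10):
        # D_class: (k x k) matrix of taxonomy-derived distances between the k classes;
        # y_tr: integer class indices for the training documents X_tr.
        protos = MDS(n_components=n_dims, dissimilarity='precomputed').fit_transform(D_class)
        reg = Ridge().fit(X_tr, protos[y_tr])        # regress documents onto their class prototype
        def classify(X):
            Z = reg.predict(X)                        # map documents into the embedding space
            d = ((Z[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
            return d.argmin(axis=1)                   # nearest prototype = predicted class
        return classify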
[224001790130] |"Learning taxonomies by Dependence Maximization" - Matthew Blaschko, Arthur Gretton (MPI)
[224001790140] |- Our goal is to cluster a dataset and provide a taxonomy that shows the relationship between clusters.
[224001790150] |The standard solutions are agglomerative/divisive hierarchical clustering.
[224001790160] |This paper proposes an alternative solution which allows us to use kernels (and is thus related to spectral clustering).
[224001790170] |The idea is based on a kernel measure of dependence: roughly speaking, if K is the kernel matrix of the original data and L is the kernel matrix of the resulting clustering, the objective max_{L} trace(K*L) measures the dependence between samples and clusters and is thus a viable clustering objective.
[224001790180] |The method gets a taxonomy by formulating L=PYP' where P is a partition matrix (maps clusters to samples) and Y is a positive semi-definite matrix that encodes relationships between clusters.
[224001790190] |"Fast Prediction on a Tree" - Mark Herbster, Massimiliano Pontil, Sergio Rojas (UCL)
[224001790200] |- Graph-based semi-supervised learning needs to scale well with the number of unlabeled samples in order to be truly useful in large data scenarios.
[224001790210] |This paper presents a method to improve the computational scalability of Laplacian-based methods: First, convert the data graph to a tree (using, e.g. a maximum spanning tree algorithm).
[224001790220] |Second, they show a fast way to compute the pseudo-inverse of the graph/tree Laplacian in O(m^2 + mS), where m is the number of labeled samples and S is the tree diameter.
[224001790230] |This Laplacian pseudo-inverse corresponds to a kernel, and so one can use, say, a kernel perceptron to predict on test points.
[224001790240] |Experiments show that tree approximations to the graph did not degrade accuracy, while drastically increasing speed.
[224001790250] |"Unlabeled data: Now it helps, now it doesn't" - Aarti Singh, Rob Nowak, Jerry Zhu (Wisconsin)
[224001790260] |- This is an interesting theoretical paper that analyzes when unlabeled data helps under the cluster assumption.
[224001790270] |First, the authors argue that asymptotic analysis is unsuitable for analyzing the difference between supervised learning and SSL, and instead use finite-sample analysis and minimax bounds.
[224001790280] |Let n be the number of labeled samples, m the number of unlabeled samples, d the feature dimension, and g the margin between two classes (can be positive or negative).
[224001790290] |The proof is of the form: suppose a clairvoyant supervised learner with full knowledge of the underlying density p(x) has error less than e2(n), and a supervised learner has error greater than e1(n).
[224001790300] |Then, the error of SSL is no more than e2(n) + O(some function of m).
[224001790310] |Thus, if O(some function of m) is negligible (and this depends on the exact values of m,d,g,n), then SSL will improve over supervised learning; otherwise, no.
[224001790320] |In words, the cases where SSL helps are as follows: if the margin g is relatively large compared to the average spacing between labeled points (n^{-1/d}), then supervised learning can discover p(x) accurately and works just as well as SSL.
[224001790330] |However, if g is small relative to the spacing between labeled points, but large relative to the spacing between unlabeled points (m^{-1/d}), then SSL will beat any supervised learner.
[224001790340] |In the case that the margin is negative, if -g is larger than (m^{-1/d}), then SSL also wins.
[224001790350] |"DiscLDA: Discriminative learning for dimensionality reduction and classification" - Simon Lacoste-Julien, Fei Sha, Michael Jordan (Berkeley/USC)
[224001790360] |- Unsupervised topic models have become popular methods for finding latent structures in text documents.
[224001790370] |These are usually trained by max likelihood, but this may be suboptimal if our final goal is classification.
[224001790380] |This paper considers the problem of introducing labeled data (e.g. topic labels) into topic models.
[224001790390] |Recall that in Latent Dirichlet Allocation (LDA), for each document, we first draw a (k-dimensional) topic mixture from a Dirichlet prior.
[224001790400] |Then we draw words according to p(word|topic)p(topic|topic-mixture).
[224001790410] |We can view each document as a point on the topic simplex.
[224001790420] |The idea here is to introduce a transformation T on the topic simplex, so that documents with the same label will be mapped close together.
[224001790430] |"Modeling the effects of memory on human online sentence processing with particle filters" - Roger Levy (UCSD), Florencia Realia, Tom Griffiths (Berkeley)
[224001790440] |- Humans comprehend sentences in an online manner: it is believed that we do incremental parsing as we hear words one at a time.
[224001790450] |Thus, garden-path sentences are able to catch us off-guard.
[224001790460] |Moreover, the longer the sentence is before a disambiguation point is reached, the harder it is for humans to recover (digging-in effect).
[224001790470] |This is a psycholinguistics paper that seeks to explain garden-path and digging-in by a novel particle-filter based PCFG parser: essentially, whenever a word is received, a partial parse is sampled.
[224001790480] |The number of "incorrect" particles increase with sentence length (modeling digging-in), and the number of particles used correlates with the memory constraints of the brain.
[224001790490] |"Tighter bounds for structured estimation" - Olivier Chapelle, et. al. (Yahoo/Stanford/NICTA)
[224001790500] |- A common approach in optimizing difficult loss functions is to minimize a convex upper bound instead (e.g. hinge loss in SVM's).
[224001790510] |However, these losses are often loose.
[224001790520] |In particular, outliers often suffer large loss, so the general classifier accuracy may be sacrificed since the optimizer focuses on these extremely difficult points.
[224001790530] |The idea here is to use a non-convex, but tighter upper bound.
[224001790540] |They adopt a ramp-loss for the structured prediction problem and use the convex-concave procedure to solve it.
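In the binary case the ramp loss is just a clipped hinge on the margin z = y*f(x); writing it as a difference of two convex pieces is exactly what makes the convex-concave procedure applicable (the structured version in the paper is more involved than this sketch):

    import numpy as np

    def hinge(z):                                 # convex: max(0, 1 - z)
        return np.maximum(0.0, 1.0 - z)

    def ramp(z):                                  # = min(1, hinge(z)): an outlier's loss is capped at 1
        return hinge(z) - np.maximum(0.0, -z)     # difference of two convex functions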
[224001800010] |Featuritis in NLP
[224001800020] |Featuritis (term from John Langford) is (in my mind) the process of throwing in a ton of features to a learning system without thinking about it.
[224001800030] |Long gone are the days when one had to select a small number of useful features for supervised learning problems.
[224001800040] |Now that maxent has replaced naive Bayes and CRFs have replaced HMMs, we are free to throw a gigantic number of features into our learning problems with little to no repercussions (beyond a constant computation factor).
[224001800050] |The one key exception to this is MT, where the number of features is typically kept small because the algorithms that we currently use for feature weight tuning (eg., MERT) scale badly in the number of features.
[224001800060] |I know there is lots of work to get around this, but I think it's fair to say that this is still not the de facto approach.
[224001800070] |I think this is related to the fact that we cherish linear models.
[224001800080] |That is, I think that a significant reason that featuritis has flourished is because linear models are pretty good at coping with it; and a reason that linear models have flourished is because they are computationally cheap and can always be "fixed up" by taking a featuritis approach.
[224001800090] |I think a great point of contrast is the work that's been done in the machine learning community on using neural networks for solving NLP tasks.
[224001800100] |This work basically shows that if you're willing to give your machine learning algorithm much more power, you can kind of forget about representation.
[224001800110] |That is, just give yourself a feature for every word in your vocabulary (as you might, for instance, in language modeling), throw these through a convolution, then through a multilayer neural network and train it in a multitask fashion, making use of (essentially) Ando and Zhang-style auxiliary problems (from, eg., Wikipedia text) to do semi-supervised learning.
[224001800120] |And you do as well as a featuritis approach.
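To make that concrete, the shape of such a model is roughly the following: an embedding lookup instead of hand-built features, a shared hidden layer, and (in the real systems) one output per task, all trained jointly with backprop. Everything below -- the sizes, the plain numpy forward pass, the missing convolution and training loop -- is just an illustration, not the actual architecture from that line of work.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, win, h = 20000, 50, 5, 100          # vocab size, embedding dim, window, hidden units

    E  = rng.normal(0, 0.1, (V, d))           # the "feature per word": a learned lookup table
    W1 = rng.normal(0, 0.1, (win * d, h))     # hidden layer, shared across tasks
    W2 = rng.normal(0, 0.1, (h, 1))           # task-specific output layer

    def score(window_word_ids):
        x = E[window_word_ids].reshape(-1)    # look up and concatenate the window's embeddings
        return np.tanh(x @ W1) @ W2           # nonlinearity, then linear output

    print(score(np.array([17, 4052, 993, 12, 7])))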
[224001800130] |I suppose this is the standard "prior knowledge versus data" issue that comes up over and over again.
[224001800140] |Either I can put more prior knowledge into my system (cf., adding more features that I think are going to be useful) or put more data into my system.
[224001800150] |The nuance seems to be that I cannot make only this trade-off.
[224001800160] |When I add more data to my system, I also have to change my learning model: a simple linear approach no longer cuts it.
[224001800170] |The linear model on a simple feature space just doesn't have the representational power to learn what I would like it to learn.
[224001800180] |So I have to go to a more complex function class and therefore need more data to reliably estimate parameters.
[224001800190] |So why isn't everyone using neural nets?
[224001800200] |Well, to some degree we've been conditioned to not like them.
[224001800210] |Seeing cool positive results makes it a bit enticing to forget why we were conditioned not to like them in the first place.
[224001800220] |To me, there are basically three advantages that linear models have over multilayer neural nets.
[224001800230] |The first is that there is very little model selection to do: in a neural net, since I have little experience, I have no idea how to choose the network architecture.
[224001800240] |The second is training efficiency.
[224001800250] |Linear models are just ridiculously fast to train, and neural nets (despite all the progress over the past 20 years) are still darn slow.
[224001800260] |(Although, at least neural nets are fast to predict with; unlike, say, kernel machines.)
[224001800270] |The third is non-convexity.
[224001800280] |This means that we probably have to do lots of random restarts.
[224001800290] |I doubt the third issue (non-convexity) carries much weight in the NLP community.
[224001800300] |We're such fans of algorithms like EM (also non-convex) and Gibbs sampling (atrociously not even comparable to notions of convexity) that I can't imagine that this is the thing that's stopping us.
[224001800310] |The first issue (choosing network structure) is roughly analogous to choosing a feature representation.
[224001800320] |I think the difference is that when I say "I add a feature that pulls out a two-character suffix of the word," I can see exactly how this might affect learning and why it might be useful.
[224001800330] |When I say that I add a new node in a hidden layer of a network, I have no idea really what's going to happen.
[224001800340] |The second issue (speed) is actually probably non-trivial.
[224001800350] |When I'm training a relatively simple classifier or sequence labeler, I kind of expect it to be able to train in a matter of minutes or hours, not days or weeks.
[224001800360] |The primary issue here doesn't seem to be the representation that's making things so much slower to train, but the fact that it seems (from experimental results) that you really have to do the multitask learning (with tons of auxiliary problems) to make this work.
[224001800370] |This suggests that maybe what should be done is just to fix an input representation (eg., the word identities) and then have someone train some giant multitask network on this (perhaps a few of varying sizes) and then just share them in a common format.
[224001800380] |Then, when I want to learn my specific task, I don't have to do the whole multitask thing and can just use that learned network structure and weights as an initial configuration for my network.
[224001800390] |At the end of the day, you're going to still have to futz with something.
[224001800400] |You'll either stick with your friendly linear model and futz with features, or you'll switch over to the neural networks side and futz with network structure and/or auxiliary problem representation.
[224001800410] |It seems that at least as of now, futzing is unavoidable.
[224001800420] |At least network structure futzing is somewhat automatable (see lots of work in the 80s and 90s), but this isn't the whole package.
[224001800430] |(p.s., I don't mean to imply that there isn't other modern work in NLP that uses neural networks; see, for instance, Titov and Henderson, ACL 2007.)
[224001810010] |Boosting, neural networks and deep learning
[224001810020] |I'm on a neural networks kick of late, apparently.
[224001810030] |The following was pointed out to me recently.
[224001810040] |Maybe other people think it's obvious, and it's obvious once you hear it, but I'd never seen it that way.
[224001810050] |Suppose we're doing boosting and the weak learner that we choose is a thresholded linear model (eg., perceptron, SVM, logistic regression).
[224001810060] |Then what we get out of boosting is basically a neural network with a single hidden layer.
[224001810070] |The hidden units themselves are the weak learners, and the "alphas" that arise as part of the boosting procedure (eg., AdaBoost) are the weights connecting the hidden units to the output.
[224001810080] |(Thanks to Antonio Paiva for pointing this out to me.)
[224001810090] |So, in a sense, boosting a linear model can be seen as a kind-of-unusual method for training a two layer network.
[224001810100] |Essentially the number of boosting iterations that you run determines the number of hidden units (i.e., the network architecture).
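If you want to see the correspondence spelled out, here's a minimal sketch of discrete AdaBoost with a thresholded linear weak learner (logistic regression's predict plays the role of the threshold; labels in {-1, +1} and the number of rounds are my assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def boost_linear(X, y, T=10):
        # y in {-1, +1}; the learned model is f(x) = sign(sum_t alpha_t * h_t(x)),
        # i.e. a one-hidden-layer network whose hidden units are the weak learners
        # and whose output weights are the alphas.
        D = np.ones(len(y)) / len(y)                 # example weights
        hidden, alphas = [], []
        for _ in range(T):
            h = LogisticRegression().fit(X, y, sample_weight=D)
            pred = h.predict(X)
            eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - eps) / eps)
            D *= np.exp(-alpha * y * pred)           # upweight the mistakes
            D /= D.sum()
            hidden.append(h); alphas.append(alpha)
        def f(X_new):
            H = np.column_stack([h.predict(X_new) for h in hidden])  # hidden "activations" in {-1, +1}
            return np.sign(H @ np.array(alphas))
        return f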
[224001810110] |Okay, that's great.
[224001810120] |But let's say we don't care about two layer networks, we care about deep learning.
[224001810130] |For a 3-layer network you could simply boost boosted-linear-models.
[224001810140] |For a 4-layer network you could boost boosted-boosted-linear models.
[224001810150] |And so on.
[224001810160] |Is there an alternative?
[224001810170] |Thinking purely procedurally, let's say my weak learner is a linear model.
[224001810180] |I start boosting and I've got 5 linear models trained in the standard AdaBoost manner.
[224001810190] |Now I have a choice.
[224001810200] |Should I train a 6th linear model to throw in to the standard boosting set?
[224001810210] |Or should I treat the 5 boosted linear models as a new base classifier and boost against the combination?
[224001810220] |If I choose the latter, I've now gone from two layers to three layers.
[224001810230] |Why might it be a good idea to boost against the 5 collectively?
[224001810240] |Well, if you believe the whole deep learning propaganda, then it's a good idea because deep = good.
[224001810250] |From a more theoretical perspective, you might hope that the extra level of recursion might get you an increased rate of improvement in the error rate.
[224001810260] |I.e., the recursion could potentially lead to stronger boosting results than the standard linear boosting.
[224001810270] |Of course, this is just a hunch: I haven't at all looked to try to figure out if it would actually work in theory.
[224001810280] |But it seems plausible.
[224001810290] |For instance, in neural networks theory, we know that a 2 layer network can approximate any (reasonable) function, but you might need an exponential number of hidden units; the number of required hidden units goes down if you make deeper networks (under assumptions).
[224001810300] |Of course, maybe someone's already done this, maybe it doesn't work, and maybe it's a stupid idea :).
[224001820010] |Beating the state of the art, big-O style
[224001820020] |I used to know a fair amount about math; in fact, I even applied to a few grad schools to do logic (long story, now is not the time).
[224001820030] |I was never involved in actual math research (primarily because of the necessary ramp-up time -- thank goodness this doesn't exist as much in CS!), but I did get a sense from my unofficial undergrad advisor how things worked.
[224001820040] |The reason I started thinking of this is because I recently made my way about halfway through a quite dense book on long games by a prof from grad school (I cross-enrolled at UCLA for a few semesters).
[224001820050] |The basic idea of a countable game is that there is a fixed subset A of [0,1] (subset of the reals) and two players alternate play.
[224001820060] |In each move, they play an integer from 0 to 9.
[224001820070] |They play for a countably infinite number of moves, essentially writing down (in decimal form) a real number.
[224001820080] |If, at the "end", this number is in A, then player 1 wins; otherwise, player 2 wins.
[224001820090] |Both players know A.
[224001820100] |The set A is said to be determined if there is a strategy that will force a win; it is undetermined otherwise.
[224001820110] |A long game is the obvious generalization where you play for longer than countable time.
[224001820120] |The details don't matter.
[224001820130] |This led me to think, as someone who's moved over to working a lot on machine learning: is there an analogous question for online learning?
[224001820140] |There are several ways to set this up and I won't bore you with the details (I doubt any reader here really cares), but depending on how you set it up, you can prove several relatively trivial, but kind of cute, results (I say trivial because they took me on the order of hours, which means that someone who knows what they're doing probably would see them immediately).
[224001820150] |I basically did this as a mental exercise, not for any real reason.
[224001820160] |But it got me thinking: obviously machine learning people wouldn't care about this because it's too esoteric and not at all a realistic setting (even for COLTers!).
[224001820170] |I strongly doubt that logicians would care either, but for a totally different reason.
[224001820180] |From my interaction, they would be interested if and only if two things were satisfied: (a) the result showed some interesting connection between a new model and existing models; (b) the proofs were non-trivial and required some new insight that could be then applied to other problems.
[224001820190] |Obviously this is not my goal in life, so I've dropped it.
[224001820200] |This led me to introspect: what is it that we as a community need in order to find some result interesting?
[224001820210] |What about other fields that I claim to know a bit about?
[224001820220] |Let's take algorithms for a minute.
[224001820230] |Everything here is about big-O.
[224001820240] |Like the math types, a result without an interesting proof is much less interesting than a result with an interesting proof, though if you start reading CS theory blogs, you'll find that there's a bit of divide in the community on whether this is good or not.
[224001820250] |But my sense (which could be totally broken) is that if you have a result with a relatively uninteresting proof that gets you the same big-O running time as the current state of the art, you're in trouble.
[224001820260] |I think it's interesting to contrast this with what happens in both NLP and ML.
[224001820270] |Big-O works a bit differently here.
[224001820280] |My non-technical description of big-O to someone who knows nothing is that it measures "order of magnitude" improvements.
[224001820290] |(Okay, O(n log n) versus O(n log log n) is hard to call an order of magnitude, but you get the idea.)
[224001820300] |An equivalent on the experimental side would seem to be something like: you cut the remaining error on a problem by half or more.
[224001820310] |In other words, if state of the art is 60% accuracy, then an order of magnitude improvement would be 80% accuracy or better.
[224001820320] |At 80% it would be 90%.
[224001820330] |At 90% it would be 95% and so on.
[224001820340] |90.5% to 91.2% is not order of magnitude.
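Just to make the arithmetic concrete, here is a tiny sketch (in Python, with made-up accuracies) of what I mean by an order-of-magnitude improvement: you cut the remaining error at least in half.

    def is_order_of_magnitude(old_acc, new_acc):
        """True if the new system cuts the remaining error at least in half."""
        old_err, new_err = 1.0 - old_acc, 1.0 - new_acc
        return new_err <= old_err / 2.0

    assert is_order_of_magnitude(0.60, 0.80)        # 40% error -> 20% error
    assert is_order_of_magnitude(0.90, 0.95)        # 10% error -> 5% error
    assert not is_order_of_magnitude(0.905, 0.912)  # 9.5% error -> 8.8% error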
[224001820350] |I actually like this model for looking at experimental results.
[224001820360] |Note that this has absolutely nothing to do with statistical significance.
[224001820370] |It's kind of like reading results graphs with a pair of thick glasses on (for those who don't wear glasses) or no glasses on (for those who wear thick glasses).
[224001820380] |I think the justification is that for less than order of magnitude improvement, it's really just hard to say whether the improvement is due to better tweaking or engineering or dumb luck in how some feature was implemented or what.
[224001820390] |For order of magnitude improvement, there almost has to be something interesting going on.
[224001820400] |Now, I'm not proposing that a paper isn't publishable if it doesn't have order of magnitude improvement.
[224001820410] |Very few papers would be published this way.
[224001820420] |I'm just suggesting that improving the state of the art not be -- by itself -- a reason for acceptance unless it's an order of magnitude improvement.
[224001820430] |That is, you'd better either have a cool idea, be solving a new problem, be analyzing the effect of some important aspect of a problem, etc., or work on a well-trodden task and get a big-O improvement.
[224001820440] |What I'm saying isn't novel, of course... the various exec boards at the ACL conferences have been trying to find ways to get more "interesting" papers into the conferences for (at least) a few years.
[224001820450] |This is just a concrete proposal.
[224001820460] |Obviously it requires buy-in at least of area chairs and probably reviewers.
[224001820470] |And there are definitely issues with it.
[224001820480] |Like any attempt to make reviewing non-subjective, there are obviously corner cases (eg., you have a sort-of-interesting idea and an almost-order-of-magnitude improvement).
[224001820490] |You can't mechanize the reviewing process.
[224001820500] |But frankly, when I see paper reviews that gush over tiny improvements in the state of the art in an otherwise drab paper, I just get annoyed :).
[224001830010] |Mixing of Gibbs samplers for topic (and mixture) models
[224001830020] |This is just a quick note because it's something that had somehow never occurred to me until a few days ago.
[224001830030] |If you've ever talked to me in person about the standard (collapsed) Gibbs samplers for topic models, you know that I get concerned that these things don't mix.
[224001830040] |And it's not just the generic "how do we know" feeling that can get levied at (almost) any sampler, but something very specific: we know for certain that our Gibbs samplers for topic models aren't mixing.
[224001830050] |How do we know?
[224001830060] |Because, for the most part, they don't mode switch.
[224001830070] |You can figure this out quite easily by simply watching the topic assignments.
[224001830080] |(The standard collapsed samplers for Gaussian or Multinomial mixture models exhibit the same problem.)
[224001830090] |At least if you have a reasonable amount of data.
[224001830100] |The reason this always bugged me is because we now have definitive evidence that these things aren't mixing in the sense of mode hopping, which leads me to worry that they might also not be mixing in other, less easy to detect ways.
[224001830110] |Well, worry no more.
[224001830120] |Or at least worry less.
[224001830130] |The mode hopping is a red herring.
[224001830140] |Maybe the following observation is obvious, but I'll say it anyway.
[224001830150] |Let's take our topic model Gibbs sampler and introduce a new Metropolis-Hastings step.
[224001830160] |This MH step simply takes two topic indices (say topics i and j) and swaps them.
[224001830170] |It picks i and j uniformly at random from the (K choose 2) possibilities.
[224001830180] |It's easy to verify that the acceptance probability for this move will be one (the qs will cancel and the ps are identical), which means that it's really more like a Gibbs step than an MH step (in the sense that Gibbs steps are MH steps that are always accepted).
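To make the move concrete, here is a minimal sketch (my own illustration, not anyone's actual sampler code), assuming the topic assignments live in a flat numpy array z with values in {0, ..., K-1}:

    import numpy as np

    def swap_topics(z, K, rng=None):
        """Relabeling move: pick two topic indices uniformly from the
        (K choose 2) pairs and swap them everywhere in z. The joint
        probability is invariant to relabeling topics, so the MH acceptance
        ratio is exactly 1 and the move is always accepted."""
        if rng is None:
            rng = np.random.default_rng()
        i, j = rng.choice(K, size=2, replace=False)
        z = z.copy()
        mask_i, mask_j = (z == i), (z == j)
        z[mask_i], z[mask_j] = j, i
        return z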
[224001830190] |The observation is that (a) this doesn't actually change what the sampler is doing in any real, meaningful way -- that is, re-indexing things is irrelevant; but (b) you now cannot claim that the sampler isn't mode switching.
[224001830200] |It's mode switching a lot.
[224001830210] |Sure, it might still have poor mixing properties for other reasons, but at least now there isn't this elephant-in-the-room reason it can't possibly be mixing.
[224001830220] |So this is a totally "useless" practical observation (sometimes we like to try to exploit the fact that it's not mixing), but from a theoretical perspective it might be interesting.
[224001830230] |For instance, if you want to prove something about a fast mixing rate for these samplers (if you believe they are actually mixing fast, which I'm somewhat keen to believe), then you're not going to be able to prove anything if you don't make this trivial change to the sampler (because without it they're not really mixing fast).
[224001830240] |So it might have interesting theoretical consequences.
[224001850010] |Ordering words with an n-gram language model
[224001850020] |N-gram language models have been fairly successful at the task of distinguishing homophones, in the context of speech recognition.
[224001850030] |In machine translation (and other tasks, such as summarization, headline generation, etc.), this is not their job.
[224001850040] |Their job is to select fluent/grammatical sentences, typically ones which have undergone significant reordering.
[224001850050] |In a sense, they have to order words.
[224001850060] |A large part of the thesis of my academic sibling, Radu Soricut, had to do with exploring how well ngram language models can reorder sentences.
[224001850070] |Briefly, they don't do very well.
[224001850080] |This is something that our advisor, Daniel Marcu, likes to talk about when he gives invited talks; he shows a 15 word sentence and the preferred reorderings according to an ngram LM, and they're total hogwash, even though audience members can fairly quickly solve the exponential time problem of reordering the words to make a good sounding sentence.
[224001850090] |(As an aside, Radu found that if you add in a syntactic LM, things get better... if you don't want to read the whole thesis, just skip forward to section 8.4.2.)
[224001850100] |Let's say we like ngram models.
[224001850110] |They're friendly for many reasons.
[224001850120] |What could we do to make them more word-order sensitive?
[224001850130] |I'm not claiming that none of these things have been tried; just that I'm not aware of them having been tried :).
[224001850140] |Discriminative training.
[224001850150] |There's lots of work on discriminative training of language models, but, from what I've seen, it usually has to do with trying to discriminate true sentences from fake sentences, where the fake sentences are generated by some process (eg., an existing MT or speech system, a trigram LM, etc.).
[224001850160] |The alternative is to directly train a language model to order words.
[224001850170] |Essentially, think of it as a structured prediction problem and try to predict the 8th word based on (say) the two previous words.
[224001850180] |The correct answer is the actual 8th word; the incorrect answer is any other word in the sentence.
[224001850190] |Words that don't appear in the sentence are "ignored."
[224001850200] |This is easy to implement and seems to do something reasonable (on a small set of test data); a toy sketch of the setup appears right after the second idea below.
[224001850210] |Add syntactic features to words, eg., via cluster-based language models.
[224001850220] |My thought here is to look at syntactic features of words (for instance, CCG-style lexicon information) and use these to create descriptors of the words; these can then be clustered (eg., use tree-kernel-style-features) to give a cluster LM.
[224001850230] |This is similar to how people have added CCG/supertag information to phrase-based MT, although they don't usually do the clustering step.
[224001850240] |The advantage to clustering is then you (a) get generalization to new words and (b) it fits in nicely with the cluster LM framework.
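Here is the toy sketch of the first idea referenced above. This is just my own reconstruction of the setup, with a plain multiclass perceptron standing in for whatever learner you'd actually use, and with candidate words restricted to the current sentence:

    def ordering_examples(sentence, context=2):
        """Yield (history, gold_word, negative_words) for each position."""
        for t in range(context, len(sentence)):
            history = tuple(sentence[t - context:t])
            gold = sentence[t]
            negatives = set(sentence) - {gold}  # only words from this sentence
            yield history, gold, negatives

    def perceptron_step(weights, history, gold, negatives):
        """One multiclass perceptron update with sentence-restricted candidates."""
        score = lambda w: weights.get((history, w), 0.0)
        predicted = max(negatives | {gold}, key=score)
        if predicted != gold:
            weights[(history, gold)] = score(gold) + 1.0
            weights[(history, predicted)] = score(predicted) - 1.0

    weights = {}
    for sent in [["the", "cat", "sat", "on", "the", "mat"]]:
        for h, g, negs in ordering_examples(sent):
            perceptron_step(weights, h, g, negs)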
[224001850250] |These both seem like such obvious ideas that they must have been tried... maybe they didn't work?
[224001850260] |Or maybe I just couldn't dig up papers.
[224001850270] |Or maybe they're just not good ideas so everyone else dismissed them :).
[224001860010] |Programming Language of Choice
[224001860020] |Some of you know that I am (or at least used to be) a bit of a programming language snob.
[224001860030] |In fact, on several occasions, I've met (in NLP or ML land) someone who recognizes my name from PL land and is surprised that I'm not actually a PL person.
[224001860040] |My favorite story is after teaching machine learning for the second time, I had Ken Shan, a friend from my PL days, visit.
[224001860050] |I announced his visit and got an email from a student who had taken ML from me saying:
[224001860060] |I _knew_ your name was familiar!
[224001860070] |I learned a ton about Haskell from your tutorial, for what's worth..
[224001860080] |Great read back in my freshman year in college.
[224001860090] |(Belatedly) Thanks for writing it!
[224001860100] |And it's not like my name is particularly common!
[224001860110] |At any rate (and, admittedly, this is somewhat an HBC-related question), I'm curious what programming language(s) other NLP folks tend to use.
[224001860120] |I've tried to include a subset of the programming language shootout list here that I think are most likely to be used, but if you need to write-in, feel free to do so in a comment.
[224001860130] |You can select as many as you would like, but please just try to vote for those that you actually use regularly, and that you actually use for large projects.
[224001860140] |Eg., I use Perl a lot, but only for o(100) line programs... so I wouldn't select Perl.
[224001870010] |NAACL program up
[224001870020] |See here.
[224001870030] |I see a bunch that I reviewed and really liked, so I'm overall pretty happy.
[224001870040] |(Though I also see one or two in the "other" category :P.)
[224001870050] |At any rate, here are the top stemmed/stopped words with frequencies:
[224001870060] |model (30)
[224001870070] |base (19)
[224001870080] |translat (17)
[224001870090] |languag (16)
[224001870100] |speech (14)
[224001870110] |improv (14)
[224001870120] |statist (12)
[224001870130] |machin (12)
[224001870140] |unsupervis (11)
[224001870150] |recognit (10)
[224001870160] |pars (10)
[224001870170] |system (9)
[224001870180] |learn (9)
[224001870190] |word (8)
[224001870200] |depend (8)
[224001870210] |approach (8)
[224001870220] |spoken (7)
[224001870230] |semant (7)
[224001870240] |search (7)
[224001870250] |linear (7)
[224001870260] |extract (7)
[224001870270] |Looks like machine translation and unsupervised learning are the winners.
[224001880010] |...and here it is for ACL
[224001880020] |Papers are here.
[224001880030] |Words:
[224001880040] |translat (19)
[224001880050] |model (19)
[224001880060] |base (19)
[224001880070] |learn (16)
[224001880080] |semant (12)
[224001880090] |supervis (10)
[224001880100] |machin (10)
[224001880110] |depend (10)
[224001880120] |automat (10)
[224001880130] |word (9)
[224001880140] |pars (9)
[224001880150] |approach (8)
[224001880160] |system (7)
[224001880170] |relat (7)
[224001880180] |gener (7)
[224001880190] |web (6)
[224001880200] |unsupervis (6)
[224001880210] |train (6)
[224001880220] |languag (6)
[224001880230] |label (6)
[224001880240] |decod (6)
[224001880250] |align (6)
[224001890010] |Viterbi search, minimum Bayes risk and laplacian eigenmaps
[224001890020] |I've been slow at posting because I wanted to post on this current topic for a while, but wanted to work out some more details before doing so.
[224001890030] |Well, it didn't happen.
[224001890040] |So I'm writing sans details.
[224001890050] |Let's think for a minute about non-linear dimensionality reduction, aka manifold learning.
[224001890060] |If we compare what, say, ISOMAP does with what laplacian eigenmaps (LE) does, they're really quite similar.
[224001890070] |In both cases, we construct a graph over our data points, usually a kNN graph or something, sometimes with edge weights, sometimes not.
[224001890080] |We then attempt to embed the points in a low dimensional space to minimize some sort of distortion.
[224001890090] |In ISOMAP, the distortion is based on the shortest path distance between the two points in our constructed graph.
[224001890100] |In LE, the distance is measured according to the Laplacian of the graph.
[224001890110] |The notion of a Laplacian of a graph is basically a discrete version of the more standard notion of the differential operator by the same name (that comes up in most standard analysis courses).
[224001890120] |In continuous land, it is the divergence of the gradient, which essentially means it measures some form of diffusion (and has its original applications in physics).
[224001890130] |In discrete land, where we live, it can be thought of as a measure of flow on a graph.
[224001890140] |The key difference, then, between ISOMAP and LE is whether you measure distances according to shortest path or to flow, which has a very "average path" feel to it.
[224001890150] |The advantage to LE is that it tends to be less susceptible to noise, which is easy to understand when viewed this way.
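For reference, here is the textbook construction of the (unnormalized) graph Laplacian from a kNN graph. This is just a sketch of the object being discussed, nothing specific to any particular paper; Laplacian eigenmaps then embeds points using the bottom nontrivial eigenvectors of this matrix.

    import numpy as np
    from scipy.spatial.distance import cdist

    def knn_laplacian(X, k=5):
        """Unnormalized graph Laplacian L = D - W of a symmetrized kNN graph."""
        dists = cdist(X, X)                       # pairwise Euclidean distances
        n = len(X)
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(dists[i])[1:k + 1]  # skip self at distance 0
            W[i, nbrs] = 1.0
        W = np.maximum(W, W.T)                    # symmetrize the kNN graph
        return np.diag(W.sum(axis=1)) - W         # L = D - W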
[224001890160] |Now, getting back to NLP stuff, we often find ourselves doing some sort of shortest path search.
[224001890170] |It's well known that the much loved Viterbi algorithm is exactly an application of dynamic programming to search in the lattice defined by an HMM.
[224001890180] |This extends in well known ways to other structures.
[224001890190] |Of course, Viterbi search isn't the only game in town.
[224001890200] |There are two other popular approaches to "decoding" in HMMs.
[224001890210] |One is marginal decoding, where we individually maximize the probability of each node.
[224001890220] |The other is minimum Bayes risk decoding.
[224001890230] |Here, we take a user-supplied risk (aka loss) function and try to find the output that minimizes the expected risk, where the probabilities are given by our current model.
[224001890240] |MBR has been shown to outperform Viterbi in many applications (speech, MT, tagging, etc.). If your risk (loss) function is 0/1 loss over entire output sequences, MBR recovers the Viterbi solution; if it is Hamming loss over individual positions, it recovers marginal decoding.
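As a concrete (if simplified) sketch: with position-wise posteriors from forward-backward, MBR under Hamming loss is just marginal decoding, and over an explicit candidate list (say an n-best list) the general recipe looks like the following. Real systems do this over lattices, which is where the pain comes in.

    import numpy as np

    def mbr_hamming(marginals):
        """MBR under Hamming loss: marginals has shape (T, K), where
        marginals[t, k] = p(label at position t is k). Minimizing expected
        Hamming loss just picks the most probable label at each position."""
        return marginals.argmax(axis=1)

    def mbr_nbest(candidates, probs, loss):
        """Generic MBR over an explicit candidate list:
        argmin_y sum_{y'} p(y') * loss(y, y')."""
        expected = [sum(p * loss(y, yp) for yp, p in zip(candidates, probs))
                    for y in candidates]
        return candidates[int(np.argmin(expected))]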
[224001890250] |What I'm curious about is whether there is a connection here.
[224001890260] |I'm not exactly sure how the construction would go -- I'm thinking something like a graph defined over the lattice vertices with edges that reflect the loss function -- but there definitely seems to be a similarity between MBR and average path, which is approximately equal to a Laplacian operation (aka spectral operation).
[224001890270] |The reason I think this would be interesting is because a lot is known about spectral computations and we might be able to use this knowledge to our advantage (coming up with MBR algorithms is sometimes a bit painful).
[224001890280] |An additional complication (or bit of fun, if you see it that way) is that there are at least three standard ways to generalize the notion of a Laplacian to a hypergraph, which is what we would really need to do.
[224001890290] |Perhaps we can help pick out the "right" one through the MBR connection.
[224001900010] |How to reduce reviewing overhead?
[224001900020] |It's past most reviewing time (for the year), so somehow conversations I have with folks I visit tend to gravitate toward the awfulness of reviewing.
[224001900030] |That is, there is too much to review and too much garbage among it (of course, garbage can be slightly subjective).
[224001900040] |Reviewing plays a very important role but is a very fallible system, as everyone knows, both in terms of precision and recall.
[224001900050] |Sometimes there even seems to be evidence of abuse.
[224001900060] |But this post isn't about good reviewing and bad reviewing.
[224001900070] |This is about whether it's possible to cut down on the sheer volume of reviewing.
[224001900080] |The key aspect of cutting down reviewing, to me, is that in order to be effective, the reduction has to be significant.
[224001900090] |I'll demonstrate by discussing a few ideas that have come up, and some notes about why I think they would or wouldn't work:
[224001900100] |Tiered reviewing (this was done at ICML this year).
[224001900110] |The model at ICML was that everyone was guaranteed two reviews, and only a third if your paper was "good enough."
[224001900120] |I applaud ICML for trying this, but as a reviewer I found it useless.
[224001900130] |It means that at most 1/3 of reviews are getting cut (assuming all papers are bad), but in practice it's probably more like 1/6 that get cut.
[224001900140] |This means that if on average a reviewer would have gotten six papers to review, he will now get five.
[224001900150] |First, this is a very small decrease.
[224001900160] |Second, it comes with an additional swapping overhead: effectively I now have to review for ICML twice, which makes scheduling a much bigger pain.
[224001900170] |It's also harder for me to be self-consistent in my evaluations.
[224001900180] |Reject without review (this was suggested to me at dinner last night: if you'd like to de-anonymize yourself, feel free in comments).
[224001900190] |Give area chairs the power that editors of journals have to reject papers out of hand.
[224001900200] |This gives area chairs much more power (I actually think this is a good thing: area chairs are too often too lazy in my experience, but that's another post), so perhaps there would be a cap on the number of reject without reviews.
[224001900210] |If this number is less than about 20%, then my reviewing load will drop in expectation from 5 to 4, which, again, is not a big deal for me.
[224001900220] |Cap on submissions (again, a suggestion from dinner last night): authors may only submit one paper to any conference on which their name comes first.
[224001900230] |(Yes, I know, this doesn't work in theory land where authorship is alphabetical, but I'm trying to address our issues, not someone else's.) I've only twice in my life had two papers at a conference where my name came first, and maybe there was a third where I submitted two and one was rejected (I really can't remember).
[224001900240] |At NAACL this year, there are four such papers; at ACL there are two.
[224001900250] |If you assume these are equally distributed (which is probably a bad assumption, since the people who submit multiple first author papers at a conference probably submit stronger papers), then this is about 16 submissions to NAACL and 8 submissions to ACL.
[224001900260] |Again, which is maybe 1-4% of submitted papers: again, something that won't really affect me as a reviewer (this, even less than the above two).
[224001900270] |Strong encouragement of short papers (my idea, finally, but with tweaks from others): right now I think short papers are underutilized, perhaps partially because they're seen (rightly or wrongly) as less significant than "full" papers.
[224001900280] |I don't think this need be the case.
[224001900290] |Short papers definitely take less time to review.
[224001900300] |A great "short paper tweak" that was suggested to me is to allow only 3 pages of text, but essentially arbitrarily many pages of tables/figures (probably not arbitrarily, but at least a few... plus, maybe just make data online).
[224001900310] |This would encourage experimental evaluation papers to be submitted as shorts (currently these papers typically just get rejected as being longs because they don't introduce really new ideas, and rejected as shorts because it's hard to fit lots of experiments in four pages).
[224001900320] |Many long papers that appear in ACL could easily be short papers (I would guesstimate somewhere around 50%), especially ones that have the flavor of "I took method X and problem Y and solved Y with X (where both are known)" or "I took known system X, did new tweak Y and got better results."
[224001900330] |One way to start encouraging short papers is to just have an option that reviews can say something like "I will accept this paper as a short paper but not a long paper -- please rewrite" and then just have it accepted (with some area chair supervision) without another round of reviewing.
[224001900340] |The understanding would have to be that it would be poor form as an author to pull your paper out just because it got accepted short rather than accepted long, and so authors might be encouraged just to submit short versions.
[224001900350] |(This is something that would take a few years to have an effect, since it would be partially social.)
[224001900360] |Multiple reviewer types (an idea that's been in the ether for a while).
[224001900370] |The idea would be that you have three reviewers for each paper, but each serves a specific role.
[224001900380] |For instance, one would exclusively check technical details.
[224001900390] |The other two could then ignore these.
[224001900400] |Or maybe one would be tasked with "does this problem/solution make sense."
[224001900410] |This would enable area chairs (yes, again, more work for area chairs) to assign reviewers to do things that they're good at.
[224001900420] |You'd still have to review as many papers, but you wouldn't have to do the same detailed level of review for all of them.
[224001900430] |Require non-student authors on papers to review 3 times as many papers as they submit to any given conference, no exceptions ("three" because that's how many reviews they will get for any given paper).
[224001900440] |I know several faculty who follow the model of "if there is a deadline, we will submit."
[224001900450] |I don't know how widespread this is.
[224001900460] |The idea is that even half-baked ideas will garner reviews that can help direct the research.
[224001900470] |I try to avoid offending people here, but that's what colleagues are for: please stop wasting my time as a reviewer by submitting papers like this.
[224001900480] |If they get rejected, you've wasted my time; if they get accepted, it's embarrassing for you (unless you take time by camera ready to make them good, which happens only some of the time).
[224001900490] |Equating "last author" = "senior person", there were two people at NAACL who have three papers and nine who have two.
[224001900500] |This means that these two people (who in expectation submitted 12 papers -- probably not true, probably more like 4 or 5) should have reviewed 12-15 papers.
[224001900510] |The nine should probably have reviewed 9-12 papers.
[224001900520] |I doubt they did.
[224001900530] |(I hope these two people know that I'm not trying to say they're evil in any way :P.)
[224001900540] |At ACL, there are four people with three papers (one is a dupe with a three from NAACL -- you know who you are!) and eight people with two.
[224001900550] |This would have the added benefit of having lots of reviews done by senior people (i.e., no crummy grad student reviews) with the potential downside that these people would gain more control over the community (which could be good or bad -- it's not a priori obvious that being able to do work that leads to many publications is highly correlated with being able to identify good work done by others).
[224001900560] |Make the job of the reviewer more clear.
[224001900570] |Right now, most reviews I read have a schizophrenic feel.
[224001900580] |On the one hand, the reviewer justifies his rating to the area chair.
[224001900590] |On the other, he provides (what he sees as) useful feedback to the authors.
[224001900600] |I know that in my own reviewing, I have cut down on the latter.
[224001900610] |This is largely in reaction to the "submit anything and everything" model that some people have.
[224001900620] |I'll typically give (what I hope is) useful feedback to papers I rate highly, largely because I have questions whose answers I am curious about, but for lower ranked papers (1-3), I tend to say things like "You claim X but your experiments only demonstrate Y," rather than "[that] ... and in order to show Y you should do Z." Perhaps I would revert to my old ways if I had less to review, but this was a choice I made about a year ago.
[224001900630] |I'd be interested to hear if others have additional "small changes" ideas.
[224001900640] |There are "large delta" ideas, such as Fernando's "everything is a journal" model, which I actually like, but is likely to be hard to implement because it's hard to make sweeping changes to a system (though VLDB -- or was it SIGMOD -- managed to do it).
[224001900650] |I actually think that together, some of these ideas could have a significant impact.
[224001900660] |For instance, I would imagine 2 and 4 together would probably cut a 5-6 paper review down to a 3-4 paper review, and doing 6 on top of this would probably take the average person's review load down maybe one more.
[224001900670] |Overall, perhaps a 50% reduction in number of papers to review, unless you're one of the types who submits lots of papers.
[224001900680] |I'd personally like to see it done!
[224001910010] |Semi-supervised or Semi-unsupervised? (SSL-NLP invited papers)
[224001910020] |Kevin Duh not-so-recently asked me to write a "position piece" for the workshop he's co-organizing on semi-supervised learning in NLP.
[224001910030] |I not-so-recently agreed.
[224001910040] |And recently I actually wrote said position piece.
[224001910050] |You can also find a link off the workshop page.
[224001910060] |I hope people recognize that it's intentionally a bit tongue-in-cheek.
[224001910070] |If you want to discuss this stuff or related things in general, come to the panel at NAACL from 4:25 to 5:25 on 4 June at the workshop!
[224001910080] |You can read the paper for more information, but my basic point is that we can typically divide semi-supervised approaches into one lump (semi-supervised) of methods that work reasonably well with only labeled data and are merely improved by unlabeled data, and another lump (semi-unsupervised) of methods that work reasonably well with only unlabeled data and are merely improved by labeled data.
[224001910090] |The former typically encode lots of prior information; the latter do not.
[224001910100] |Let's combine!
[224001910110] |(Okay, my claim is more nuanced than that, but that's the high-order bit.)
[224001920010] |NAACL-HLT 2006 has been WhatToSee-d!
[224001920020] |See here.
[224001930010] |The importance of input representations
[224001930020] |As some of you know, I run a (machine learning) reading group every semester.
[224001930030] |This summer we're doing "assorted" topics, which basically means students pick a few papers from the past 24 months that are related and present on them.
[224001930040] |The week before I went out of town, we read two papers about inferring features from raw data; one was a deep learning approach; the other was more Bayesian.
[224001930050] |(As a total aside, I found it funny that in the latter paper they talk a lot about trying to find independent features, but in all cog sci papers I've seen where humans list features of objects/categories, they're highly dependent: eg., "has fur" and "barks" are reasonable features that humans like to produce that are very much not independent.
[224001930060] |In general, I tend to think that modeling things as explicitly dependent is a good idea.)
[224001930070] |Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex works (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated).
[224001930080] |They also love to start with pixel representations; perhaps this is neurologically motivated: I don't really know.
[224001930090] |But I find it kind of funny, primarily because there's a ton of information hard wired into the pixel representation.
[224001930100] |Why not feed .jpg and .png files directly into your system?
[224001930110] |On the language side, an analogy is the bag of words representation.
[224001930120] |Yes, it's simple.
[224001930130] |But only simple if you know the language.
[224001930140] |If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do?
[224001930150] |What about Chinese?
[224001930160] |There, it's well known that word segmentation is hard.
[224001930170] |There's already a huge amount of information in a bag of words format.
[224001930180] |The question is: does it matter?
[224001930190] |Here's an experiment I did.
[224001930200] |I took the twenty newsgroups data (standard train/test split) and made classification data.
[224001930210] |To make the classification data, I took a posting, fed it through a module "X".
[224001930220] |"X" produced a sequence of tokens.
[224001930230] |I then extract n-gram features over these tokens and throw out anything that appears less than ten times.
[224001930240] |I then train a multiclass SVM on these (using libsvm).
[224001930250] |The only thing that varies in this setup is what "X" does.
[224001930260] |Here are four "X"s that I tried:
[224001930270] |Extract words.
[224001930280] |When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.
[224001930290] |Extract characters.
[224001930300] |This leads to character unigrams, character bigrams, etc.
[224001930310] |Extract bits from characters.
[224001930320] |That is, represent each character in its 8 bit ascii form and extract a sequence of zeros and ones.
[224001930330] |Extract bits from a gzipped version of the posting.
[224001930340] |This is the same as (3), but before extracting the data, we gzip the file.
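Here is a rough reconstruction of the setup (my sketch, not the actual code; I'm using scikit-learn rather than libsvm for brevity, and the four "X" modules are the ones listed above):

    import gzip
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    def x_words(text):
        return text.split()

    def x_chars(text):
        return list(text)

    def x_bits(text):
        return [b for ch in text.encode("ascii", "ignore") for b in format(ch, "08b")]

    def x_gzip_bits(text):
        return [b for byte in gzip.compress(text.encode()) for b in format(byte, "08b")]

    def ngrams(tokens, n):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def build_model(x_fn, n):
        # n-gram features over whatever sequence X produces; drop features seen < 10 times
        vec = CountVectorizer(analyzer=lambda doc: ngrams(x_fn(doc), n), min_df=10)
        return make_pipeline(vec, LinearSVC())

    # e.g., on 20 newsgroups (standard train/test split):
    #   from sklearn.datasets import fetch_20newsgroups
    #   train = fetch_20newsgroups(subset="train")
    #   build_model(x_chars, 4).fit(train.data, train.target)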
[224001930350] |The average word length for me is 3.55 characters, so a character ngram with length 4.5 is approximately equivalent to a bag of words model.
[224001930360] |I've plotted the results below for everything except words (words were boring: BOW got 79% accuracy, going to higher ngram length hurt by 2-3%).
[224001930370] |The x-axis is number of bits, so the unigram character model starts out at eight bits.
[224001930380] |The y-axis is accuracy: [figure omitted] As we can see, characters do well, even at the same bit sizes.
[224001930390] |Basically you get a ton of binary sequence features from raw bits that are just confusing the classifier.
[224001930400] |Zipped bits do markedly worse than raw bits.
[224001930410] |The reason the bit-based models don't extend further is because it started taking gigantic amounts of memory (more than my poor 32gb machine could handle) to process and train on those files.
[224001930420] |But 40 bits is about five characters, which is just over a word, so in theory the 40 bit models have the same information that the bag of words model (at 79% accuracy) has.
[224001930430] |So yes, it does seem that the input representation matters.
[224001930440] |This isn't shocking, but I've never seen anyone actually try something like this before.
[224001940010] |Navigating ICML
[224001940020] |I couldn't find a schedule for ICML that had paper titles/authors written in, so I joined the existing schedule with the abstracts to create a printable schedule with titles.
[224001940030] |You can find organizer-created schedules for UAI and COLT already.
[224001940040] |In addition, I've put ICML 2009 up on WhatToSee, so have fun!
[224001940050] |(I haven't done UAI and/or COLT because their papers haven't appeared online yet.)
[224001950010] |NAACL-HLT 2009 Retrospective
[224001950020] |I hope this post will be a small impetus to get other people to post comments about papers they saw at NAACL (and associated workshops) that they really liked.
[224001950030] |As usual, I stayed for the whole conference, plus workshops.
[224001950040] |As usual, I also hit that day -- about halfway through the first workshop day -- where I was totally burnt out and was wondering why I always stick around for the entire week.
[224001950050] |That's not to say anything bad about the workshops specifically (there definitely were good ones going on, in fact, see some comments below), but I was just wiped.
[224001950060] |Anyway, I saw a bunch of papers and missed even more.
[224001950070] |I don't think I saw any papers that I actively didn't like (or wondered how they got in), including short papers, which I think is fantastic.
[224001950080] |Many thanks to all the organizers (Mari Ostendorf for organizing everything, Mike Collins, Lucy Vanderwende, Doug Oard and Shri Narayanan for putting together a great program, James Martin and Martha Palmer for local arrangements -- which were fantastic -- and all the other organizers who sadly we -- i.e., the NAACL board -- didn't get a chance to thank publicly).
[224001950090] |Here are some things I thought were interesting:
[224001950100] |Classifier Combination Techniques Applied to Coreference Resolution (Vemulapalli, Luo, Pitrelli and Zitouni).
[224001950110] |This was a student research workshop paper; in fact, it was the one that I was moderating (together with Claire Cardie).
[224001950120] |The student author, Smita, performed this work while at IBM; though her main research is on similar techniques applied to really cool sounding problems in recognizing stuff that happens in the classroom.
[224001950130] |Somehow classifier combination, and general system combination, issues came up a lot at this conference (mostly in the hallways where someone was begrudgingly admitting to working on something as dirty as system combination).
[224001950140] |I used to think system combination was yucky, but I don't really feel that way anymore.
[224001950150] |Yes, it would be nice to have one huge monolithic system that does everything, but that's often infeasible.
[224001950160] |My main complaint with system combination stuff is that in many cases I don't really understand why it's helping, which means that unless it's applied to a problem I really care about (of which there are few), it's hard for me to take anything away.
[224001950170] |But I think it's interesting.
[224001950180] |Getting back to Smita's paper, the key thing she did to make this work is introduce the notion of alignments between different clusterings, which seemed like a good idea.
[224001950190] |The results probably weren't as good as they were hoping for, but still interesting.
[224001950200] |My only major pointers as a panelist were to try using different systems, rather than bootstrapped versions of the same system, and to take a look at the literature on consensus clustering, which is fairly relevant for this problem.
[224001950210] |Graph-based Learning for Statistical Machine Translation (Alexandrescu and Kirchhoff).
[224001950220] |I'd heard of some of this work before in small group meetings with Andrei and Kathrin, but this is the first time I'd seen the results they presented.
[224001950230] |This is an MT paper, but really it's about how to do graph-based semi-supervised learning in a structured prediction context, when you have some wacky metric (read: BLEU) on which you're evaluating.
[224001950240] |Computation is a problem, but we should just hire some silly algorithms people to figure this out for us.
[224001950250] |(Besides, there was a paper last year at ICML -- I'm too lazy to dig it up -- that showed how to do graph-based stuff on billions of examples.)
[224001950260] |Intersecting Multilingual Data for Faster and Better Statistical Translations (Chen, Kay and Eisele).
[224001950270] |This is a very simple idea that works shockingly well.
[224001950280] |Had I written this paper, "Frustrating" would probably have made it into the title.
[224001950290] |Let's say we want an English to French phrase table.
[224001950300] |Well, we do phrase table extraction and we get something giant and ridiculous (have you ever looked at those phrase pairs?) that takes tons of disk space and memory, and makes translation slow (it's like the "grammar constant" in parsing that means that O(n^3) for n=40 is impractical).
[224001950310] |Well, just make two more phrase tables, English to German and German to French and intersect.
[224001950320] |And voila, you have tiny phrase tables and even slightly better performance.
[224001950330] |The only big caveat seems to be that they estimate all these things on Europarl.
[224001950340] |What if your data sets are disjoint: I'd be worried that you'd end up with nothing in the resulting phrase table except the/le and sometimes/quelquefois (okay, I just used that example because I love that word).
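For what it's worth, here is a toy sketch of the pivoting idea as I understand it. This is my loose reading, not necessarily the paper's exact recipe: keep a direct English-French phrase pair only if it can also be reached by pivoting through some German phrase.

    def intersect_via_pivot(en_fr, en_de, de_fr):
        """Each argument is a dict: source phrase -> set of target phrases."""
        # phrases reachable English -> German -> French
        pivoted = {en: {fr for de in des for fr in de_fr.get(de, set())}
                   for en, des in en_de.items()}
        # keep only the directly-extracted pairs that survive the pivot
        return {en: frs & pivoted.get(en, set())
                for en, frs in en_fr.items()
                if frs & pivoted.get(en, set())}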
[224001950350] |Quadratic Features and Deep Architectures for Chunking (Turian, Bergstra and Bengio).
[224001950360] |I definitely have not drunk the deep architectures kool-aid, but I still think this sort of stuff is interesting.
[224001950370] |The basic idea here stems from some work Bergstra did for modeling vision, where they replaced a linear classifier (y = w'x) with a low rank approximation to a quadratic classifier (y = w'x + sqrt[(a'x)^2 + (b'x)^2 + ... ]).
[224001950380] |Here, the a,b,... vectors are all estimated as part of the learning process (eg., by stochastic gradient descent).
[224001950390] |If you use a dozen of them, you get some quadratic style features, but without the expense of doing, say, an implicit (or worse, explicit) quadratic kernel.
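In code, the scoring function they describe looks roughly like this (my paraphrase of the formula, with the rows of A playing the role of the a, b, ... vectors):

    import numpy as np

    def low_rank_quadratic_score(x, w, A):
        """y = w'x + sqrt(sum_k (a_k'x)^2), i.e. w'x + ||Ax||_2.
        A has shape (K, dim); using a dozen or so rows gives quadratic-style
        interactions without an explicit (or implicit) quadratic kernel."""
        return w @ x + np.sqrt(np.sum((A @ x) ** 2))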
[224001950400] |My worry (that I asked about during the talk) is that you obviously can't initialize these things to zero or else you're in a local minimum, so you have to do some randomization and maybe that makes training these things a nightmare.
[224001950410] |Joseph reassured me that they have initialization methods that make my worries go away.
[224001950420] |If I have enough time, maybe I'll give it a whirl.
[224001950430] |Exploring Content Models for Multi-Document Summarization (Haghighi and Vanderwende).
[224001950440] |This combines my two favorite things: summarization and topic models.
[224001950450] |My admittedly biased view was they started with something similar to BayeSum and then ran a marathon.
[224001950460] |There are a bunch of really cool ideas in here for content-based summarization.
[224001950470] |Global Models of Document Structure using Latent Permutations (Chen, Branavan, Barzilay and Karger).
[224001950480] |This is a really cool idea (previously mentioned in a comment on this blog) based on using generalized Mallows models for permutation modeling (incidentally, see a just-appeared JMLR paper for some more stuff related to permutations!).
[224001950490] |The idea is that documents on a similar topic (eg., "cities") tend to structure their information in similar ways, which is modeled as a permutation over "things that could be discussed."
[224001950500] |It's really cool looking, and I wonder if something like this could be used in conjunction with the paper I talk about below on summarization for scientific papers (9, below).
[224001950510] |One concern raised during the questions that I also had was how well this would work for things not as standardized as cities, where maybe you want to express preferences of pairwise ordering, not overall permutations.
[224001950520] |(Actually, you can do this, at least theoretically: a recent math visitor here, Mark Huber, has some papers on exact sampling from permutations under such partial order constraints using coupling from the past.)
[224001950530] |The other thing I was thinking during that talk that would be totally awesome would be to do a hierarchical Mallows model.
[224001950540] |Someone else asked this question, and Harr said they're thinking about this.
[224001950550] |Oh, well...
[224001950560] |I guess I'm not the only one :(.
[224001950570] |Dan Jurafsky's invited talk was awesome.
[224001950580] |It appealed to me in three ways: as someone who loves language, as a foodie, and as an NLPer.
[224001950590] |You just had to be there.
[224001950600] |I can't do it justice in a post.
[224001950610] |More than Words: Syntactic Packaging and Implicit Sentiment (Greene and Resnik).
[224001950620] |This might have been one of my favorite papers of the conference.
[224001950630] |The idea is that how you say things can express your point of view as much as what you say.
[224001950640] |They look specifically at effects like passivization in English, where you might say something like "The truck drove into the crowd" rather than "The soldier drove the truck into the crowd."
[224001950650] |The missing piece here seems to be identifying the "whodunnit" in the first sentence.
[224001950660] |This is like figuring out subjects in languages that like to drop subjects (like Japanese).
[224001950670] |Could probably be done; maybe it has been (I know it's been worked on in Japanese; I don't know about English).
[224001950680] |Using Citations to Generate Surveys of Scientific Paradigms (Mohammad, Dorr, Egan, Hassan, Muthukrishan, Qazvinian, Radev and Zajic).
[224001950690] |I really really want these guys to succeed.
[224001950700] |They basically study how humans and machines create summaries of scientific papers when given either the text of the paper, or citation snippets to the paper.
[224001950710] |The idea is to automatically generate survey papers.
[224001950720] |This is actually an area I've toyed with getting in to for a while.
[224001950730] |The summarization aspect appeals to me, and I actually know and understand the customer very well.
[224001950740] |The key issue I would like to see addressed is how these summaries vary across different users.
[224001950750] |I've basically come to the conclusion that in summarization, if you don't pay attention to the user, you're sunk.
[224001950760] |This is especially true here.
[224001950770] |If I ask for a summary of generalization bound stuff, it's going to look very different than if Peter Bartlett asks for it.
[224001950780] |Online EM for Unsupervised Models (Liang and Klein).
[224001950790] |If you want to do online EM read this paper.
[224001950800] |On the other hand, you're going to have to worry about things like learning rate and batch size (think Pegasos).
[224001950810] |I was thinking about stuff like this a year or two ago and was wondering how this would compare to doing SGD on the log likelihood directly and not doing EM at all.
[224001950820] |Percy says that asymptotically they're the same, but who knows what they're like in the real world :).
[224001950830] |I think it's interesting, but I'm probably not going to stop doing vanilla EM.
[224001950840] |I then spent some time at workshops.
[224001950850] |I spent the first morning in the Computational Approaches to Linguistic Creativity workshop, which was just a lot of fun.
[224001950860] |I really liked all of the morning talks: if you love language and want to see stuff somewhat off the beaten path, you should definitely read these.
[224001950870] |I went by the Semantic Evaluation Workshop for a while and learned that the most frequent sense baseline is hard to beat.
[224001950880] |Moreover, there might be something to this discourse thing after all: Marine tells us that translators don't like to use multiple translations when one will do (akin to the one sense per discourse observation).
[224001950890] |The biggest question in my head here is how much the direction of translation matters (eg., when this heuristic is violated, is it violated by the translator, or the original author)?
[224001950900] |Apparently this is under investigation.
[224001950910] |But it's cool because it says that even MT people shouldn't just look at one sentence at a time!
[224001950920] |Andrew McCallum gave a great, million-mile-an-hour invited talk on joint inference in CoNLL.
[224001950930] |I'm pretty interested in this whole joint inference business, which also played a big role in Jason Eisner's invited talk (that I sadly missed) at the semi-supervised learning workshop.
[224001950940] |To me, the big question is: what happens if you don't actually care about some of the tasks.
[224001950950] |In a probabilistic model, I suppose you'd marginalize them out... but how should you train?
[224001950960] |In a sense, since you don't care about them, it doesn't make sense to have a real loss associated with them.
[224001950970] |But if you don't put a loss, what are you doing?
[224001950980] |Again, in probabilistic land you're saved because you're just modeling a distribution, but this doesn't answer the whole question.
[224001950990] |Al Aho gave a fantastically entertaining talk in the machine translation workshop about unnatural language processing.
[224001951000] |How the heck they managed to get Al Aho to give an invited talk is beyond me, but I suspect we owe Dekai some thanks for this.
[224001951010] |He pointed to some interesting work that I wasn't familiar with, both in raw parsing (eg., how to parse errorful strings with a CFG when you want to find the closest in edit distance string that is parseable by a CFG) and natural language/programming language interfaces.
[224001951020] |(In retrospect, the first result is perhaps obvious had I actually thought much about it, though probably not so back in 1972: you can represent edit distance by a lattice and then parse the lattice, which we know is efficient.)
[224001951030] |Anyway, there were other things that were interesting, but those are the ones that stuck in my head somehow (note, of course, that this list is unfairly biased toward my friends... what can I say? :P).
[224001951040] |So, off to ICML on Sunday.
[224001951050] |I hope to see many of you there!
[224001960010] |NL Generation: A new problem
[224001960020] |Those who talk to me a lot over the years know that I really think that generation is a cool and interesting problem, but one that is hampered by a lack of clarity of what it is, or at least what the input/output is.
[224001960030] |It's like the problem with defining summarization, but one hundred times worse.
[224001960040] |I have no idea how it came up.
[224001960050] |I think I was talking to a bunch of people from Microsoft Research and we were talking about translation on the web and what not.
[224001960060] |And how this is not the business model of Language Weaver.
[224001960070] |And how when you're on the web you make money by advertising.
[224001960080] |And voila!
[224001960090] |The most incredible NL Generation task occurred to me!
[224001960100] |(I apologize if you ran in to me at all during NAACL because this was pretty much all I could talk about :P.)
[224001960110] |The initial idea for the task was embedded in MT, though it needn't be.
[224001960120] |But I'll tell it in the MT setting.
[224001960130] |So I go to some web page in some weirdo language (say, French) that I don't understand (because I was a moron and took Latin instead of French or Spanish in middle school and high school).
[224001960140] |So I ask my favorite translation system (Google or Microsoft or Babelfish or whatever) to translate it.
[224001960150] |However, the translation system takes certain liberties with the translation.
[224001960160] |In particular, it might embed a few "product placements" in the text.
[224001960170] |For instance, maybe it's translating "Je suis vraiment soif" into English (if this is incorrect, blame Google).
[224001960180] |And perhaps it decides that instead of translating this as "I'm really thirsty," it will translate it as "I'm really thirsty for a Snapple," or "I'm really thirsty: I could go for a Snapple," perhaps with a link to snapple.com.
[224001960190] |Product placement is all over the place, even in America where it's made fun of and kept a bit at bay.
[224001960200] |Not so in China: any American remotely turned off by the Coca-cola cups from which the judges on American Idol drink twice a week would be appalled by the ridiculous (my sentiment) product placement that goes on over there.
[224001960210] |The presence of the link would probably give away that it's an ad, though of course you could leave this off.
[224001960220] |But why limit this to translation.
[224001960230] |You could put such a piece of technology directly on blogger, or on some proxy server that you can go through to earn some extra cash browsing the web (thanks to someone -- can't remember who -- at NAACL for this latter suggestion).
[224001960240] |I mean, you could just have random ads pop up in the middle of text on any web page, for instance one you could host on webserve.ca!
[224001960250] |(See what I did there?)
[224001960260] |So now here's a real generation problem!
[224001960270] |You're given some text.
[224001960280] |And maybe you're even given adwords or something like that, so you can assume that the "select which thing to advertise" problem has been solved.
[224001960290] |(Yes, I know it's not.)
[224001960300] |Your job is just to insert the ad in the most natural way in the text.
[224001960310] |You could evaluate in at least two ways: click through (as is standard in a lot of this advertising business) and human judgments of naturalness.
[224001960320] |I think the point of product placement is to (a) get your product on the screen more or less constantly, rather than just during commercial breaks which most people skip anyway, and (b) perhaps appeal to people's subconscious.
[224001960330] |I don't know.
[224001960340] |My parents (used to) do advertising like things but I know next to nothing about that world.
[224001960350] |Okay, so this is slightly tongue in cheek, but not entirely.
[224001960360] |And frankly, I wouldn't be surprised if something like it were the norm in five years.
[224001960370] |(If you want to get more fancy, insert product placements into youtube videos!)
[224001970010] |Why I Don't Buy Clustering Axioms
[224001970020] |In NIPS 15, Jon Kleinberg presented some impossibility results for clustering.
[224001970030] |The idea is to specify three axioms that all clustering functions should obey and examine those axioms.
[224001970040] |Let (X,d) be a metric space (so X is a discrete set of points and d is a metric over the points of X).
[224001970050] |A clustering function F takes d as input and produces a clustering of the data.
[224001970060] |The three axioms Jon states that all clustering functions should satisfy are:
[224001970070] |Scale invariance: For all d, for all a>0, F(ad) = F(d).
[224001970080] |In other words, if I transform all my distances by scaling them uniformly, then the output of my clustering function should stay the same.
[224001970090] |Richness: The range of F is the set of all partitions.
[224001970100] |In other words, there isn't any bias that prohibits us from producing some particular partition.
[224001970110] |Consistency: Suppose F(d) gives some clustering, C.
[224001970120] |Now, modify d by shrinking distances within clusters of C and expanding distances between clusters in C. Call the new metric d'.
[224001970130] |Then F(d') = C.
[224001970140] |Kleinberg's result is that there is no function F that simultaneously satisfies all these requirements.
[224001970150] |Functions can satisfy two, but never all three.
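To pin down what consistency actually demands, here is a small check (my own illustration, just to fix notation) of its precondition on a pair of distance matrices; the axiom then requires that F return the same clustering on both.

    import numpy as np

    def is_consistent_transformation(d_old, d_new, labels):
        """labels is the clustering C = F(d_old); d_new must shrink (or keep)
        within-cluster distances and expand (or keep) between-cluster
        distances. Consistency then demands F(d_new) == C."""
        labels = np.asarray(labels)
        same = labels[:, None] == labels[None, :]
        within_ok = np.all(d_new[same] <= d_old[same])
        between_ok = np.all(d_new[~same] >= d_old[~same])
        return bool(within_ok and between_ok)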
[224001970160] |There have been a bunch of follow on papers, including one at NIPS last year and one that I just saw at UAI.
[224001970170] |If you think about these axioms a little bit, they kind of make sense.
[224001970180] |My problem is that if you think about them a lot of bit, you (or at least I) realize that they're broken.
[224001970190] |The biggest broken one is consistency, which becomes even more broken when combined with scale invariance.
[224001970200] |What I'm going to do to convince you that consistency is broken is start with some data in which there is (what I consider) a natural clustering into two clusters.
[224001970210] |I'll then apply consistency a few times to get something that (I think) should yield a different clustering.
[224001970220] |Let's start with some data.
[224001970230] |The colors are my opinion as to how the data should be clustered:
[224001970240] |I hope you agree with my color coding.
[224001970250] |Now, let's apply consistency.
[224001970260] |In particular, let's move some of the red points, only reducing intra-cluster distances.
[224001970270] |Formally, we find the closest pair of points and move things toward those.
[224001970280] |The arrows denote the directions these points will be moved.
[224001970290] |To make the situation more visually appealing, let's move things into lines: [figure omitted] Okay, this is already looking funky.
[224001970300] |Let's make it even worse.
[224001970310] |Let's apply consistency again and start moving some blue points: [figure omitted] Again, let's organize these into a line: [figure omitted] And if I had given you this data to start with, my guess is the clustering you'd have come up with is more like: [figure omitted] This is a violation of consistency.
[224001970320] |So, what I'd like someone to do is to argue to my why consistency is actually a desirable property.
[224001970330] |I can come up with lots of other examples.
[224001970340] |One reason why this invariance is bad is because it renders the notion of "reference sizes" irrelevant.
[224001970350] |This is of course a problem if you have prior knowledge (eg., one thing measured in millimeters, the other in kilometers).
[224001970360] |But even in the case where you don't have prior knowledge, you can do the following.
[224001970370] |Take data generated by thousands of well separated Gaussians, so that the clearly right thing to do is have one cluster per Gaussian.
[224001970380] |Now, for each of these clusters except for one, shrink them down to single points.
[224001970390] |This is possible by consistency.
[224001970400] |Now, your data basically looks like thousands-minus-one clusters with zero intra-cluster distances and then one cluster that's spread out.
[224001970410] |But now it seems that the reasonable thing is to put each data point that was in this old cluster into its own cluster, essentially because I feel like the other data shows you what clusters should look like.
[224001970420] |If you're not happy with this, apply scaling and push these points out super far from each other.
[224001970430] |(I don't think this example is as compelling as the one I drew in pictures, but I still think it's reasonable.)
[224001970440] |Now, in the UAI paper this year, they show that if you fix the number of clusters, these axioms are now consistent.
[224001970450] |(Perhaps this has to do with the fact that all of my "weird" examples change the number of clusters -- though frankly I don't think this is necessary...
[224001970460] |I probably could have arranged it so that the resulting green and blue clusters look like a single line that maybe should just be one cluster by itself.)
[224001970470] |But I still feel like consistency isn't even something we want.
[224001970480] |(Thanks to the algorithms group at Utah for discussions related to some of these issues.)
[224001970490] |UPDATE 20 JUNE 2009, 3:49PM EST
[224001970500] |Here's some data to justify the "bad things happen even when the number of clusters stays the same" claim.
[224001970510] |Start with this data:
[224001970520] |Now, move some points toward the middle (note they have to spread to the side a bit so as not to decrease inter-cluster distances).
[224001970530] |Yielding data like the following:
[224001970540] |Now, I feel like two horizontal clusters are most natural here.
[224001970550] |But you may disagree.
[224001970560] |What if I add some more data (ie., this is data that would have been in the original data set too, where it clearly would have been a third cluster):
[224001970570] |And if you still disagree, well then I guess that's fine.
[224001970580] |But what if there were hundreds of other clusters like that?
[224001970590] |I guess the thing that bugs me is that I seem to like clusters that have similar structures.
[224001970600] |Even if some of these bars were rotated arbitrarily (or, preferably, in an axis-aligned manner), I would still feel like there's some information getting shared across the clusters.
[224001980010] |Should there be a shared task for semi-supervised learning in NLP?
[224001980020] |(Guest post by Kevin Duh.
[224001980030] |Thanks, Kevin!)
[224001980040] |At the NAACL SSL-NLP Workshop recently, we discussed whether there ought to be a "shared task" for semi-supervised learning in NLP.
[224001980050] |The panel consisted of Hal Daume, David McClosky, and Andrew Goldberg, with audience input from Jason Eisner, Tom Mitchell, and many others.
[224001980060] |Here we will briefly summarize the points raised and hopefully solicit some feedback from blog readers.
[224001980070] |Three motivations for a shared task
[224001980080] |A1.
[224001980090] |Fair comparison of methods: A common dataset will allow us to compare different methods in an insightful way.
[224001980100] |Currently, different research papers use different datasets or data-splits, making it difficult to draw general intuitions from the combined body of research.
[224001980110] |A2.
[224001980120] |Establish a methodology for evaluating SSLNLP results: How exactly should a semi-supervised learning method be evaluated?
[224001980130] |Should we evaluate the same method in both low-resource scenarios (few labeled points, many unlabeled points) and high-resource scenarios (many labeled points, even more unlabeled points)?
[224001980140] |Should we evaluate the same method under different ratios of labeled/unlabeled data?
[224001980150] |Currently there is no standard methodology for evaluating SSLNLP results, which means that the completeness/quality of experimental sections in research papers varies considerably.
[224001980160] |A3.
[224001980170] |Encourage more research in the area: A shared task can potentially lower the barrier of entry to SSLNLP, especially if it involves pre-processed data and a community support network.
[224001980180] |This will make it easier for researchers in outside fields, or researchers with smaller budgets, to contribute their expertise to the field.
[224001980190] |Furthermore, a shared task can potentially direct the community research efforts in a particular subfield.
[224001980200] |For example, "online/lifelong learning for SSL" and "SSL as joint inference of multiple tasks and heterogeneous labels" (a la Jason Eisner's keynote) were identified as new, promising areas to focus on in the panel discussion.
[224001980210] |A shared task along those lines may help us rally the community behind these efforts.
[224001980220] |Arguments against the above points
[224001980230] |B1.
[224001980240] |Fair comparison: Nobody really argues against fair comparison of methods.
[224001980250] |The bigger question, however, is whether there exists a *common* dataset or task that everyone is interested in.
[224001980260] |At the SSLNLP Workshop, for example, we had papers in a wide range of areas, from information extraction to parsing to text classification to speech.
[224001980270] |We also had papers where the need for unlabeled data is intimately tied in to particular components of a larger system.
[224001980280] |So, a common dataset is good, but what dataset can we all agree upon?
[224001980290] |B2.
[224001980300] |Evaluation methodology: A consistent standard for evaluating SSLNLP results is nice to have, but this can be done independently of a shared task through, e.g., an influential paper or gradual recognition of its importance by reviewers.
[224001980310] |Further, one may argue: what makes you think that your particular evaluation methodology is the best?
[224001980320] |What makes you think people will adopt it generally, both inside and outside of the shared task?
[224001980330] |B3.
[224001980340] |Encourage more research: It is nice to lower the barriers to entry, especially if we have pre-processed data and scripts.
[224001980350] |However, it has been observed in other shared tasks that often it is the pre-processing and features that matter most (more than the actual training algorithm).
[224001980360] |This presents a dilemma: If the shared task pre-processes the data to make it easy for anyone to join, will we lose the insights that may be gained via domain knowledge?
[224001980370] |On the other hand, if we present the data in raw form, will this actually encourage outside researchers to join the field?
[224001980380] |Rejoinder
[224001980390] |A straw poll at the panel discussion showed that people are generally in favor of looking into the idea of a shared task.
[224001980400] |The important question is how to make it work, and especially how to address counterpoints B1 (what task to choose) and B3 (how to prepare the data).
[224001980410] |We did not have enough time during the panel discussion to go through the details, but here are some suggestions:
[224001980420] |We can view NLP problems as several big "classes" of problems: sequence prediction, tree prediction, multi-class classification, etc.
[224001980430] |In choosing a task, we can pick a representative task in each class, such as named-entity recognition for sequence prediction, dependency parsing for tree prediction, etc.
[224001980440] |This common dataset won't attract everyone in NLP, but at least it will be relevant for a large subset of researchers.
[224001980450] |If participants are allowed to pre-process their own data, the evaluation might require participants to submit a supervised system along with their semi-supervised system, using the same feature set and setup where possible (a rough sketch of this kind of setup appears below).
[224001980460] |This may make it easier to learn from results if there are differences in pre-processing.
[224001980470] |There should also be a standard supervised and semi-supervised baseline (software) provided by the shared task organizer.
[224001980480] |This may lower the barrier of entry for new participants, as well as establish a common baseline result.
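To make that last pair of suggestions concrete, here is one purely illustrative way such an evaluation could be wired up: the same features and base learner for the supervised baseline and the semi-supervised system (a plain self-training loop standing in for whatever SSL method a participant submits), reported at several labeled-set sizes. The toy data, learner, confidence threshold, and sizes below are stand-ins invented for illustration, not a proposal for the actual shared-task setup.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def self_train(X_lab, y_lab, X_unlab, rounds=5, threshold=0.9):
    # Simple self-training: repeatedly add confidently pseudo-labeled points,
    # then refit on labeled + pseudo-labeled data.
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, clf.classes_[proba[confident].argmax(axis=1)]])
        pool = pool[~confident]
    return LogisticRegression(max_iter=1000).fit(X, y)

# Toy stand-in data; in a shared task this would be the provided corpus/features.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for n_labeled in (50, 200, 1000):           # low- vs. high-resource scenarios
    X_lab, y_lab = X_train[:n_labeled], y_train[:n_labeled]
    X_unlab = X_train[n_labeled:]
    sup = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)    # same features/learner
    ssl = self_train(X_lab, y_lab, X_unlab)                      # same features/learner + unlabeled data
    print(n_labeled,
          "supervised:", round(accuracy_score(y_test, sup.predict(X_test)), 3),
          "self-training:", round(accuracy_score(y_test, ssl.predict(X_test)), 3))

The point is only the reporting discipline: every semi-supervised number comes paired with a supervised number computed under identical conditions, so differences in pre-processing or features don't masquerade as differences between learning methods.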
[224001980490] |Any suggestions?
[224001980500] |Thoughts?