[224001990010] |
ICML/COLT/UAI 2009 retrospective
[224001990020] |This will probably be a bit briefer than my corresponding NAACL post because even by day two of ICML, I was a bit burnt out; I was also constantly swapping in other tasks (grants, etc.).
[224001990030] |Note that John has already posted his list of papers.
[224001990040] |#317: Multi-View Clustering via Canonical Correlation Analysis (Chaudhuri, Kakade, Livescu, Sridharan).
[224001990050] |This paper shows a new application of CCA to clustering across multiple views.
[224001990060] |They use some Wikipedia data in experiments and actually prove that (under certain multi-view-like assumptions) CCA does the "right thing."
[224001990070] |#295: Learning Nonlinear Dynamic Models (Langford, Salakhutdinov, Zhang).
[224001990080] |The cool idea here is to cut a deterministic classifier in half and use its internal state as a sort of sufficient statistic.
[224001990090] |Think about what happens if you represent your classifier as a circuit (DAG); then anywhere you cut along the circuit gives you a sufficient representation to predict.
[224001990100] |To avoid making circuits, they use neural nets, which have an obvious "place to cut" -- namely, the internal nodes.
[224001990110] |#364: Online Dictionary Learning for Sparse Coding (Mairal, Bach, Ponce, Sapiro).
[224001990120] |A new approach to sparse coding; the big take-away is that it's online and fast.
[224001990130] |#394: MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification (Zhu, Ahmed, Xing).
[224001990140] |This is a very cute idea for combining objectives across topic models (namely, the variational objective) and classification (the SVM objective) to learn topics that are good for performing a classification task.
[224001990150] |#393: Learning from Measurements in Exponential Families (Liang, Jordan, Klein). Suppose instead of seeing (x,y) pairs, you just see some statistics on (x,y) pairs -- well, you can still learn.
[224001990160] |(In a sense, this formalizes some work out of the UMass group; see also the Bellare, Druck and McCallum paper at UAI this year.)
[224001990170] |#119: Curriculum Learning (Bengio, Louradour, Collobert, Weston).
[224001990180] |The idea is to present examples in a well thought-out order rather than randomly.
[224001990190] |It's a cool idea; I've tried it in the context of unsupervised parsing (the unsearn paper at ICML) and it never helped and often hurt (sadly).
[224001990200] |I curriculum-ified by sentence length, though, which is maybe not a good model, especially when working with WSJ10 -- maybe using vocabulary would help.
[224001990210] |#319: A Stochastic Memoizer for Sequence Data (Wood, Archambeau, Gasthaus, James, Whye Teh).
[224001990220] |If you do anything with Markov models, you should read this paper.
[224001990230] |The take-away is: how can I learn a Markov model with (potentially) infinite memory in a linear amount of time and space, and with good "backoff" properties?
[224001990240] |Plus, there's some cool new technology in there.
[224001990250] |A Uniqueness Theorem for Clustering (Reza Bosagh Zadeh, Shai Ben-David).
[224001990260] |I already talked about this issue a bit, but the idea here is that if you fix k, then the clustering axioms become satisfiable, and are satisfied by two well known algorithms.
[224001990270] |Fixing k is a bit unsatisfactory, but I think this is a good step in the right direction.
[224001990280] |Convex Coding (David Bradley, J. Andrew Bagnell).
[224001990290] |The idea is to make coding convex by making it infinite!
[224001990300] |And then do something like boosting.
[224001990310] |On Smoothing and Inference for Topic Models (Arthur Asuncion, Max Welling, Padhraic Smyth, Yee Whye Teh).
[224001990320] |If you do topic models, read this paper: basically, none of the different inference algorithms do any better than the others (perplexity-wise) if you estimate hyperparameters well.
[224001990330] |Some are, of course, faster, though.
[224001990340] |Correlated Non-Parametric Latent Feature Models (Finale Doshi-Velez, Zoubin Ghahramani).
[224001990350] |This is an indian-buffet-process-like model that allows factors to be correlated.
[224001990360] |It's somewhat in line with our own paper from NIPS last year.
[224001990370] |There's still something a bit unsatisfactory in both our approach and their approach that we can't do this "directly."
[224001990380] |Domain Adaptation: Learning Bounds and Algorithms (Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh).
[224001990400] |Very good work on some learning theory for domain adaptation based on the idea of stability.
[224001990410] |Okay, that's it.
[224001990420] |Well, not really: there's lots more good stuff, but those were the things that caught my eye.
[224001990430] |Feel free to tout your own favorites in the comments.
[224002000010] |Small changes beget good or bad examples?
[224002000020] |If you compare vision research with NLP research, there are a lot of interesting parallels.
[224002000030] |Like we both like linear models.
[224002000040] |And conditional random fields.
[224002000050] |And our problems are a lot harder than binary classification.
[224002000060] |And there are standard data sets that we've been evaluating on for decades and continue to evaluate on (I'm channeling Bob here :P).
[224002000070] |But there's one thing the two fields do differently, and the difference is so striking that I'd like to call it to center stage.
[224002000080] |It has to do with "messing with our inputs."
[224002000090] |I'll spend a bit more time describing the vision approach, since it's probably less familiar to the average reader.
[224002000100] |Suppose I'm trying to do handwriting recognition to identify digits from zero to nine (aka MNIST).
[224002000110] |I get, say, 100 labeled zeros, 100 labeled ones, 100 labeled twos and so on.
[224002000120] |So a total of 1000 data points.
[224002000130] |I can train any off the shelf classifier based on pixel level features and get some reasonable performance (maybe 80s-90s, depending).
[224002000140] |Now, I want to insert knowledge.
[224002000150] |The knowledge that I want to insert is some notion of invariance.
[224002000160] |I.e., if I take an image of a zero and translate it left a little bit, it's still a zero.
[224002000170] |Or up a little bit.
[224002000180] |Or if I scale it up 10%, it's still a zero.
[224002000190] |Or down 10%.
[224002000200] |Or if I rotate it five degrees.
[224002000210] |Or negative five.
[224002000220] |All zeros.
[224002000230] |The same holds for all the other digits.
[224002000240] |One way to insert this knowledge is to muck with the learning algorithm.
[224002000250] |That's too complicated for me: I want something simpler.
[224002000260] |So what I'll do is take my 100 zeros and 100 ones and so on and just manipulate them a bit.
[224002000270] |That is, I'll sample a random zero, and apply some small random transformations to it, and call it another labeled example, also a zero.
[224002000280] |Now I have 100,000 training points.
[224002000290] |I train my off the shelf classifier based on pixel level features and get 99% accuracy or more.
[224002000300] |The same trick works for other vision problems (eg., recognizing animals).
[224002000310] |(This process is so common that it's actually described in Chris Bishop's new-ish PRML book!)
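To make the recipe concrete, here's a minimal sketch of the augmentation step (assuming 28x28 grayscale digit images stored as numpy arrays; the jitter ranges below are my own placeholder choices, not anything from the vision literature):

```python
import numpy as np
from scipy.ndimage import rotate, shift, zoom

def jitter(img, rng):
    """Apply a small random translation, rotation and scaling to one digit image."""
    out = shift(img, shift=rng.uniform(-2, 2, size=2))          # translate a couple of pixels
    out = rotate(out, angle=rng.uniform(-5, 5), reshape=False)  # rotate +/- 5 degrees
    out = zoom(out, rng.uniform(0.9, 1.1))                      # scale +/- 10%
    canvas = np.zeros_like(img)                                 # crop/pad (top-left) back to the original frame
    h = min(img.shape[0], out.shape[0]); w = min(img.shape[1], out.shape[1])
    canvas[:h, :w] = out[:h, :w]
    return canvas

def augment(images, labels, copies, seed=0):
    """Turn N labeled digits into N*(copies+1) by adding jittered duplicates."""
    rng = np.random.default_rng(seed)
    new_x, new_y = list(images), list(labels)
    for img, lab in zip(images, labels):
        for _ in range(copies):
            new_x.append(jitter(img, rng))
            new_y.append(lab)          # a slightly transformed zero is still a zero
    return np.array(new_x), np.array(new_y)
```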
[224002000320] |This is what I mean by small changes (to the input) begetting good examples.
[224002000330] |A slightly transformed zero is still a zero.
[224002000340] |Of course, you have to be careful.
[224002000350] |If you rotate a six by 180 degrees, you get a nine.
[224002000360] |If you rotate a cat by 180 degrees, you get an unhappy cat.
[224002000370] |More seriously, if you're brave, you might start looking at a class of transformations called diffeomorphisms, which are fairly popular around here.
[224002000380] |These are nice because of their nice mathematical properties, but un-nice because they can be slightly too flexible for certain problems.
[224002000390] |Now, let's go over to NLP land.
[224002000400] |Do we ever futz with our inputs?
[224002000410] |Sure!
[224002000420] |In language modeling, we'll sometimes permute words or replace one word with another to get a negative example.
[224002000430] |Noah Smith futzed with his inputs in contrastive estimation to produce negative examples by swapping adjacent words, or deleting words.
[224002000440] |In fact, try as I might, I cannot think of a single case in NLP where we make small changes to an input to get another good input: we always do it to get a bad input!
[224002000450] |In a sense, this means that one thing that vision people have that we don't have is a notion of semantics preserving transformations.
[224002000460] |Sure, linguists (especially those from that C-guy) study transformations.
[224002000470] |And there's a vague sense that work in paraphrasing leads to transformations that maintain semantic equivalence.
[224002000480] |But the thing is that we really don't know any transformations that preserve semantics.
[224002000490] |Moreover, some transformations that seem benign (eg., passivization) actually are not: one of my favorite papers at NAACL this year, by Greene and Resnik, showed that syntactic structure affects sentiment (well, they did, drawing on a lot of psycholinguistics work)!
[224002000500] |I don't have a significant point to this story other than it's kind of weird.
[224002000510] |I mentioned this to some people at ICML and got a reaction that replacing words with synonyms should be fine.
[224002000520] |I remember doing this in high school, when word processors first started coming with thesauri packed in.
[224002000530] |The result seemed to be that if I actually knew the word I was plugging in, life was fine... but if not, it was usually a bad replacement.
[224002000540] |So this seems like something of a mixed bag: depending on how liberal you are with defining "synonym," you might be okay doing this, but you might also not be.
[224002010010] |Non-parametric as memorizing, in exactly the wrong way?
[224002010020] |There is a cool view of the whole non-parametric Bayes thing that I think is very instructive.
[224002010030] |It's easiest to see in the case of the Pitman-Yor language modeling work by Frank Wood and Yee Whye Teh.
[224002010040] |The view is "memorize what you can, and back off (to a parametric model) when you can't."
[224002010050] |This is basically the "backoff" view... where NP Bayes comes in is to control the whole "what you can" aspect.
[224002010060] |In other words, if you're doing language modeling, try to memorize bigrams; but if you haven't seen enough to be confident in your memorization, back off to unigrams; and if you're not confident there, back off to a uniform distribution (which is our parametric model -- the base distribution).
[224002010070] |Or, if you think about the state-splitting PCFG work (done both at Berkeley and at Stanford), basically what's going on is that we're memorizing as many differences of tree labels as we can, and then backing off to the "generic" label when memorization fails.
[224002010080] |Or if we look at Trevor Cohn's NP-Bayes answer to DOP, we see a similar thing: memorize big tree chunks, but if you can't, fall back to a simple CFG (our parametric model).
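This isn't the actual Pitman-Yor machinery, but the "memorize what you can, back off when you can't" recursion is easy to sketch with plain absolute discounting, bottoming out in a uniform base distribution (the discount value and toy corpus here are arbitrary):

```python
from collections import Counter, defaultdict

def train_counts(tokens, order):
    """Count n-grams of every length up to `order` (context -> Counter of next words)."""
    counts = defaultdict(Counter)
    for i in range(len(tokens)):
        for n in range(order):
            if i - n >= 0:
                counts[tuple(tokens[i - n:i])][tokens[i]] += 1
    return counts

def prob(word, context, counts, vocab_size, discount=0.5):
    """P(word | context): use the memorized context if we've seen it; part of the
    mass (the discount) always flows down to the shorter context, and the empty
    context backs off to uniform -- the "parametric" base distribution."""
    ctx = tuple(context)
    if ctx and ctx not in counts:                  # never memorized this context at all
        return prob(word, ctx[1:], counts, vocab_size, discount)
    seen = counts[ctx]
    total = sum(seen.values())
    backoff_mass = discount * len(seen) / total
    lower = (1.0 / vocab_size if not ctx
             else prob(word, ctx[1:], counts, vocab_size, discount))
    return max(seen[word] - discount, 0) / total + backoff_mass * lower

toks = "the cat sat on the mat and the cat ate".split()
counts = train_counts(toks, order=3)
print(prob("sat", ("the", "cat"), counts, vocab_size=len(set(toks))))
```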
[224002010090] |Now, the weird thing is that this mode of memorization is kind of backwards from how I (as an outsider) typically interpret cognitive models (any cogsci people out there, feel free to correct me!).
[224002010100] |If you take, for instance, morphology, there's evidence that this is exactly not what humans do.
[224002010110] |We (at least from a popular-science perspective) basically memorize simple rules and then remember exceptions.
[224002010120] |That is, we remember that to make the past tense of a verb, we add "-ed" (the sound, not the characters) but for certain verbs, we don't: go/went, do/did, etc.
[224002010130] |You do little studies where you ask people to inflect fake words and they generally follow the rule, not the exceptions (but see * below).
[224002010140] |If NP Bayes had its way on this problem (or at least if the standard models I'm familiar with had their way), they would memorize "talk" -> "talked" and "look" -> "looked" and so on because they're so familiar.
[224002010150] |Sure, it would still memorize the exceptions, but it would also memorize the really common "rule" cases too... why?
[224002010160] |Because although it could fall back to the parametric model, these cases are so common that the standard models would really like to take advantage of the rich-get-richer phenomenon in things like DPs, thus saving themselves cost by memorizing a new "cluster" for each common word.
[224002010170] |(Okay, this is just my gut feeling about what such models would do, but I think it's at least defensible.)
[224002010180] |Yes, you could turn the DP "alpha" parameter down, but I guess I'm just not convinced this would do the right thing.
[224002010190] |Maybe I should just implement such a beast but, well, this is a blog post, not a *ACL paper :P.
[224002010200] |Take as an alternative example the language modeling stuff.
[224002010210] |Basically what it says is "if you have enough data to substantiate memorizing a 5 gram, you should probably memorize a 5 gram."
[224002010220] |But why?
[224002010230] |If you could get the same effect with a 2 or 3 gram, why waste the storage/time?!
[224002010240] |I guess you could argue "your prior is wrong," which is probably true for most of these models.
[224002010250] |In which case I guess the question is "what prior does what I want?"
[224002010260] |I don't have a problem with rich get richer -- in fact, I think it's good in this case.
[224002010270] |I also don't have a problem with a logarithmic growth rate in the number of exceptions (though I'd be curious how this holds up empirically -- in general, I'm a big fan of checking if your prior makes sense; for instance, Figure 3 (p16) of my supervised clustering paper).
[224002010280] |I just don't like the notion of memorizing when you don't have to.
[224002010290] |(*) I remember back in grad school a linguist from Yale came and gave a talk at USC.
[224002010300] |Sadly, I can't remember who it was: if anyone wants to help me out, I'd really appreciate it!
[224002010310] |The basic claim of the talk is that humans actually memorize a lot more than we give them credit for.
[224002010320] |The argument was in favor of humans basically memorizing all morphology and not backing off to rules like "add -ed."
[224002010330] |One piece of evidence in favor of this was timing information for asking people to inflect words: the timing seemed to indicate a linear search through a long list of words that could possibly be inflected.
[224002010340] |I won't say much more about this because I'm probably already misrepresenting it, but it's an interesting idea.
[224002010350] |And, if true, maybe the NP models are doing exactly what they should be doing!
[224002030010] |ACS: Machine Translation Papers at EMNLP
[224002030020] |[Guest Post by Adam Lopez... thanks, Adam!
[224002030030] |Hal's comment: you may remember that a while ago I proposed the idea of conference area chairs posting summaries of their areas; well, Adam is the first to take me up on this idea...
[224002030040] |I still think it's a good idea, so anyone else who wants to do so in the future, let me know!]
[224002030050] |Conferences can be exhausting, and back-to-back conferences can be really exhausting, so I want to convince you to pace yourself and save some energy for EMNLP at the end of the week, because we have some really interesting MT papers.
[224002030060] |I'll focus mainly on oral presentations, because unlike poster sessions, the parallel format of the oral sessions entails a hard choice between mutually exclusive options, and part of my motivation is to help you make that choice.
[224002030070] |That being said, there are many interesting papers at the poster session, so do take a look at them!
[224002030080] |MT is a busy research area, and we have a really diverse set of papers covering the whole spectrum of ideas: from blue sky research on novel models, formalisms, and algorithms, to the hard engineering problems of wringing higher accuracy and speed out of a mature, top-scoring NIST system.
[224002030090] |I occasionally feel that my colleagues on far reaches of either side of this spectrum are too dismissive of work on the other side; we need both if we're going to improve translation.
[224002030100] |Outside the Box
[224002030110] |Before giving you a guided tour through that spectrum, I want to highlight one paper that I found thought-provoking, but hard to classify.
[224002030120] |Zaidan & Callison-Burch question a basic assumption underlying most machine learning approaches to NLP: that we must optimize on an easily computable approximation to the true loss function.
[224002030130] |They ask: why not optimize for human judgement?
[224002030140] |They design a metric that uses judgements on small snippets of a target sentence (defined by a spanning nonterminal in a parse tree of the aligned source sentence) and figure out how many judgements they would need to collect (using Amazon Mechanical Turk) to cover an iteration of MERT, exploiting the fact that these snippets recur repeatedly during optimization.
[224002030150] |How hard is this exactly?
[224002030160] |I would say, in terms of this scale of loss functions, that their metric is a 2.
[224002030170] |Yet, it turns out to be cheap and fast to compute.
[224002030180] |The paper doesn't report results of an actual optimization run, but it's in the works... hopefully you'll learn more at the conference.
[224002030190] |Connecting Theory and Practice
[224002030200] |A few papers combine deep theoretical insight with convincing empirical results.
[224002030210] |Hopkins & Langmead improve on cube pruning, a popular approximate search technique for structured models with non-local features (i.e. translation with an integrated language model).
[224002030220] |They move cube pruning from its ad hoc roots to a firm theoretical basis by constructing a reduction to A* search, connecting it to classical AI search literature.
[224002030230] |This informs the derivation of new heuristics for a syntax-based translation model, including an admissible heuristic to perform exact cube pruning.
[224002030240] |It's still globally approximate, but exact for the local prediction problem that cube pruning solves (i.e., what are the n-best state splits of an item, given the n-best input states from previous deductions?).
[224002030250] |Amazingly, this is only slightly slower than the inexact version and improves the accuracy of a strong baseline on a large-scale Arabic-English task.
[224002030260] |Li & Eisner show how to compute a huge number of statistics efficiently over a combinatorially large number of hypotheses represented in a hypergraph.
[224002030270] |The statistics include expected hypothesis length, feature expectation, entropy, cross-entropy, KL divergence, Bayes risk, variance of hypothesis length, gradient of entropy and Bayes risk, covariance and Hessian matrix.
[224002030280] |It's beautifully simple: they recast the quantities of interest as semirings and run the inside (or inside-outside) algorithm.
[224002030290] |As an example application, they perform minimum risk training on a small Chinese-English task, reporting gains in accuracy.
[224002030300] |For a related paper on minimum risk techniques, see the poster by Pauls et al.
[224002030310] |Novel Modeling and Learning Approaches
[224002030320] |Tromble & Eisner also connect translation to theory by way of a novel model, framing reordering as an instance of the linear ordering problem: given a matrix of pairwise ordering preferences between all words in a sentence, can we find a permutation that optimizes the global score?
[224002030330] |This is NP-hard, but they give a reasonable approximation based on ITG, with some clever dynamic programming tricks to make it work.
[224002030340] |Then they show how to learn the matrix and use it to reorder test sentences prior to translation, improving over the lexicalized reordering model of Moses on German-English.
[224002030350] |However, most of the new models at EMNLP are syntax-based.
[224002030360] |In the last few years, syntax-based modeling has focused primarily on variants of synchronous context-free grammar (SCFG).
[224002030370] |This year there's a lot of work investigating more expressive formalisms.
[224002030380] |Two papers model translation with restricted variants of synchronous tree-adjoining grammar (STAG).
[224002030390] |Carreras & Collins model syntax atop phrase pairs with a parser using sister adjunction (as in their 2008 parser).
[224002030400] |The model resembles a synchronous version of Markov grammar, which also connects it to recent dependency models of translation (e.g. Shen et al. 2008, Galley et al. 2009, Gimpel & Smith below, and Hassan et al. in the poster session).
[224002030410] |Decoding is NP-complete, and devising efficient beam search is a key point in the paper.
[224002030420] |The resulting system outperforms Pharaoh on German-English.
[224002030430] |DeNeefe & Knight model target-side syntax via synchronous tree insertion grammar (STIG).
[224002030440] |It's similar to synchronous tree substitution grammar (STSG; previously realized in MT as GHKM) with added left- and right-adjunction operations to model optional arguments.
[224002030450] |They show how to reuse a lot of the STSG machinery via a grammar transformation from STIG to STSG, and the results improve on a strong Arabic-English baseline.
[224002030460] |Cohn & Blunsom reformulate tree-to-string STSG induction as a problem in non-parametric Bayesian inference, extending their TSG model for monolingual parsing, and removing the dependence on heuristics over noisy GIZA++ word alignments.
[224002030480] |The model produces more compact rules, and outperforms GHKM on a Chinese-English task.
[224002030490] |This is a hot topic: check out Liu & Gildea's poster for an alternative Bayesian formulation of the same problem and language pair.
[224002030500] |Galron et al. look at tree-to-tree STSG (from a Data-Oriented Parsing perspective), with an eye towards discriminatively learning STSG rules to optimize for translation accuracy.
[224002030510] |Bayesian inference also figures in the model of Chung
[224002030520] |this requires a lot of computation over sparse vectors.
[224002030530] |They instead represent the features implicitly using a tree convolution kernel, showing nice gains in Chinese-English.
[224002030540] |On the algorithmic side, Levenberg
[224002030550] |the resulting grammar is reliably better under a variety of conditions on a Chinese-English task.
[224002030560] |Meanwhile, Zhang et al. engineer more efficient STSG decoding for the case in which the source is a parse forest and source units are tree fragments.
[224002030570] |The trick is to encode translation rules in the tree path equivalent of a prefix tree.
[224002030580] |On Chinese-English this improves decoding speed and ultimately translation accuracy, because the decoder can consider larger fragments much more efficiently.
[224002030590] |Finally, see Finch
[224002040010] |Classifier performance: alternative metrics of success
[224002040020] |I really enjoyed Mark Dredze's talk at EMNLP on multiclass confidence weighted algorithms, where they take their CW binary predictors and extend them in two (basically equivalent) ways to a multiclass/structured setting (warning: I haven't read the paper!).
[224002040030] |Mark did a great job presenting, as always, and dryly suggested that we should all throw away our perceptrons and our MIRAs and SVMs and just switch to CW permanently.
[224002040040] |It was a pretty compelling case.
[224002040050] |Now, I'm going to pick on basically every "yet another classifier" paper I've read in the past ten years (read: ever).
[224002040060] |I'm not trying to point fingers, but just try to better understand why I, personally, haven't yet switched to using these things and continue to use either logistic regression or averaged perceptron for all of my classification needs (aside from the fact that I am rather fond of a particular software package for doing these things -- note, though, that it does support PA and maybe soon CW if I decide to spend 10 minutes implementing it!).
[224002040070] |Here's the deal.
[224002040080] |Let's look at SVM versus logreg.
[224002040090] |Whether this is actually true or not, I have this gut feeling that logreg is much less sensitive to hyperparameter selection than are SVMs.
[224002040100] |This is not at all based on any science, and the experience it's based on is somewhat unfair (comparing megam to libSVM, for instance: they use very different optimization methods, and libSVM doesn't do early stopping while megam does).
[224002040110] |However, I've heard from at least two other people that they have the same internal feelings.
[224002040120] |In other words, here's a caricature of how I believe logreg and SVM behave (picture a plot of accuracy against the regularization parameter):
[224002040130] |That is, if you really tune the regularizer (lambda) well, then SVMs will win out.
[224002040140] |But for the majority of the settings, they're either the same or logreg is a bit better.
[224002040150] |As a result, what do I do?
[224002040160] |I use logreg with lambda=1.
[224002040170] |That's it.
[224002040180] |No tuning, no nothing.
[224002040190] |(Note that, as I said before, I haven't ever run experiments to verify this.
[224002040200] |I think it would be a moderately interesting thing to try to see if it really holds up when all else -- eg., the optimization algorithm, early stopping, implementation, choice of regularizer (L1, L2, KL, etc.), and so on -- are held constant... maybe it's not true.
[224002040210] |But if it is, then it's an interesting theoretical question: hinge loss and log loss don't look that different, despite the fact that John seems to not like how log loss diverges: why should this be true?)
[224002040220] |This is also why I use averaged perceptron: there aren't any hyperparameters to select.
[224002040230] |It just runs.
[224002040240] |What I'd really like to see in future "yet another classifier" papers is an analysis of sensitivity to hyperparameter selection.
[224002040250] |You could provide graphs and stuff, but these get hard to read.
[224002040260] |I like numbers.
[224002040270] |I'd like a single number that I can look at.
[224002040280] |Here are two concrete proposals for what such a number could be (note: I'm assuming you're also going to provide performance numbers at the best possible selection of hyperparameters from development data or cross validation...
[224002040290] |I'm talking about something in addition):
[224002040300] |Performance at a default setting of the hyperparameter.
[224002040310] |For instance, SVM-light uses something like average inverse norm of the data vectors as the C parameter.
[224002040320] |Or you could just use 1, like I do for logreg.
[224002040330] |In particular, suppose you're testing your algorithm on 20 data sets from UCI.
[224002040340] |Pick a single regularization parameter (or parameter selection scheme, ala SVM-light) to use for all of them and report results using that value.
[224002040350] |If this is about the same as the "I carefully tuned" setting, I'm happy.
[224002040360] |If it's way worse, I'm not so happy.
[224002040370] |Performance within a range.
[224002040380] |Let's say that if I do careful hyperparameter selection then I get an accuracy of X. How large is the range of hyperparameters for which my accuracy is at least X*0.95?
[224002040390] |I.e., if I'm willing to suffer 5% multiplicative loss, how lazy can I be about hp selection?
[224002040400] |For this, you'll probably need to grid out your performance and then do empirical integration to approximate this.
[224002040410] |Of course, you'll need to choose a bounded range for your hp (usually zero will be a lower bound, but you'll have to pick an upper bound, too -- but this is fine: as a practitioner, if you don't give me an upper bound, I'm going to be somewhat unhappy).
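For the second proposal in particular, here's a rough sketch of the number I have in mind; the grid, the log scale, the bounds and the 0.95 threshold are all arbitrary choices on my part:

```python
import numpy as np

def hp_robustness(lambdas, accuracies, tolerance=0.95):
    """Fraction of the (log-scale) hyperparameter range whose accuracy is within
    `tolerance` of the best, approximated by integrating an indicator over log(lambda)."""
    lam = np.log(np.asarray(lambdas, dtype=float))
    acc = np.asarray(accuracies, dtype=float)
    ok = (acc >= tolerance * acc.max()).astype(float)
    return np.trapz(ok, lam) / (lam[-1] - lam[0])

# accuracies measured on dev data at a grid of regularization values
lams = [1e-3, 1e-2, 1e-1, 1, 10, 100]
accs = [0.81, 0.84, 0.86, 0.86, 0.85, 0.79]
print(hp_robustness(lams, accs))   # close to 1.0 means lazy hp selection is fine
```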
[224002040420] |Neither of these is totally ideal, but I think they'd be a lot better than the current situation of really having no idea!
[224002040430] |Maybe there are other proposals out there that I don't know about, or maybe other readers have good ideas.
[224002040440] |But for me, if you're going to convince me to switch to your algorithm, this is something that I really really want to know.
[224002040450] |(As an aside, Mark, if you're reading this, I can imagine the whole CW thing getting a bit confused if you're using feature hashing: have you tried this?
[224002040460] |Or has someone else?)
[224002060010] |Where did you Apply to Grad School?
[224002060020] |Ellen and I are interested (for obvious reasons) in how people choose what schools to apply to for grad school.
[224002060030] |Note that this is not the question of how you chose where to go.
[224002060040] |This is about what made the list of where you actually applied.
[224002060050] |We'd really appreciate if you'd fill out our 10-15 minute survey and pass it along to your friends (and enemies).
[224002060060] |If you're willing, please go here.
[224002070010] |Some notes on job search
[224002070020] |There are tons of "how to apply for academic jobs" write-ups out there; this is not one of them.
[224002070030] |It's been four years (egads!) since I began my job search and there are lots of things I think I did well and lots of things I wish I had done differently.
[224002070040] |When I entered grad school, I was fairly sure that I eventually wanted a university job.
[224002070050] |During high school, my career goal was to be a high school math teacher.
[224002070060] |Then I went to college and realized that, no, I wanted to teach math to undergraduates.
[224002070070] |Then I was an advanced undergraduate and realized that I wanted to teach grads and do research.
[224002070080] |Teaching was always very important to me, though of course I fell in love with research later.
[224002070090] |It was unfortunate that it took so long for me to actually get involved in research, but my excuse was that I wasn't in CS, where REU-style positions are plentiful and relatively easy to come by (system development, anyone?).
[224002070100] |However, the more time I spent in grad school, including an internship at MSR with Eric Brill (during which I befriended many in the NLP group at MSR, a group that I still love), the more I realized that industry labs were a totally great place to go, too.
[224002070110] |I ended up applying to basically everything under the sun, provided they had a non-zero number of faculty in either NLP or ML.
[224002070120] |I talked (mostly off the record) with a few people about post-doc positions (I heard later that simultaneously exploring post-docs and academic positions is not a good idea: hiring committees don't like to "reconsider" people; I don't know how true this is, but I heard it too late myself to make any decisions based on it), applied for some (okay, many) tenure-track positions, some research-track positions (okay, few) and to the big three industry labs.
[224002070130] |I wrote three cover letters, one more tailored to NLP, one more to ML and one more combined, three research statements (ditto) and one teaching statement.
[224002070140] |In retrospect, they were pretty reasonable, I think, though not fantastic.
[224002070150] |I don't think I did enough to make my future research plans not sound like "more of the same."
[224002070160] |I suppose my biggest piece of advice for applying is (to the extent possible) find someone you know and trust at the institution and try to figure out exactly what they're looking for.
[224002070170] |Obviously you can't change who you are and the work you've done, but you definitely can sell it in slightly different ways.
[224002070180] |This is why I essentially had three application packages -- the material was the same, the focus was different.
[224002070190] |But, importantly, they were all true.
[224002070200] |The more this person trusts you, the more of the inside scoop they can give you.
[224002070210] |For instance, we had a robotics/ML position open (which, sadly, we had to close due to budget issues), but in talking to several ML people, they felt that they weren't "robotics" enough; I think I was able to disabuse them of this notion and we ended up getting a lot of excellent applicants before we shut down the slot.
[224002070220] |Related, it's hard to sell yourself across two fields.
[224002070230] |At the time I graduated, I saw myself as basically straddling NLP and ML.
[224002070240] |This can be a hard sell to make.
[224002070250] |I feel in retrospect that you're often better off picking something and really selling that aspect.
[224002070260] |From the other side of the curtain, what often happens is that you need an advocate (or two) in the department to which you're applying.
[224002070270] |If you sell yourself as an X person, you can get faculty in X behind you; if you sell yourself as a Y person, you can get faculty in Y behind you.
[224002070280] |However, if you sell yourself as a mix, the X faculty might prefer a pure X and the Y faculty might prefer a pure Y. Of course, this isn't always true: Maryland is basically looking for a combined NLP/ML person this year to complement their existing strengths.
[224002070290] |Of course, this doesn't always hold: this is something that you should try to find out from friends at the places to which you're applying.
[224002070300] |For the application process itself, my experience here and what I've heard from most (but not all) universities is that interview decisions (who to call in) get made by a topic-specific hiring committee.
[224002070310] |This means that to get in the door, you have to appeal to the hiring committee, which is typically people in your area, if it's an area-specific call for applications.
[224002070320] |Typically your application will go to an admin, first, who will filter based on your cover letter to put you in the right basket (if there are multiple open slots) or the waste basket (for instance, if you don't have a PhD).
[224002070330] |It then goes to the hiring committee.
[224002070340] |Again, if you have a friend in the department, it's not a bad idea to let them know by email that you've applied after everything has been submitted (including letters) to make sure that you don't end up in the waste bin.
[224002070350] |Once your application gets to the hiring committee, the hope is that they've already heard of you.
[224002070360] |But if they haven't, hopefully they've heard of at least one of your letter writers.
[224002070370] |When we get applications, I typically first sort by whether I've heard of the applicant, then by the number of letter writers they have that I've heard of, then loosely by the reputation of their university.
[224002070380] |And I make my way down the list, not always all the way to the bottom.
[224002070390] |(Okay, I've only done this once, and I think I got about 90% of the way through.)
[224002070400] |In my experience, what we've looked for in applications is (a) a good research statement, including where you're going so as to distinguish yourself from your advisor, (b) a not-bad teaching statement (it's hard to get a job at a research university on a great teaching statement, but it's easy to lose an offer on a bad one... my feeling here is just to be concrete and not to pad it with BS -- if you don't have much to say, don't say much), (c) great letters, and (d) an impressive CV.
[224002070410] |You should expect that the hiring committee will read some of your papers before interviewing you.
[224002070420] |This means that if you have dozens, you should highlight somewhere (probably the research statement) which are the best ones for them to read.
[224002070430] |Otherwise they'll choose essentially randomly, and (depending on your publishing style) this could hurt.
[224002070440] |As always, put your best foot forward and make it easy for the hiring committee to find out what's so great about you.
[224002070450] |Anyway, that's basically it.
[224002070460] |There's lots more at interview stage, but these are my feelings for application stage.
[224002070470] |I'd be interested to hear if my characterization of the hiring process is vastly different than at other universities; plus, if there are other openings that might be relevant to NLP/ML folks, I'm sure people would be very pleased to see them in the comments section.
[224002070480] |Good luck, all you graduating folks!
[224002080010] |Convex or not, my schizophrenic personality
[224002080020] |Machine learning as a field has been very convex-happy for the past decade or so.
[224002080030] |So much so that when I saw a tutorial on submodular optimization in ML (one of the best tutorials I've seen), they said something along the lines of "submodularity will be for this decade what convexity was for the last decade."
[224002080040] |(Submodularity is cool and I'll post about it more in the future, but it's kind of a discrete analog of convexity.
[224002080050] |There's a NIPS workshop on the topic coming up.)
[224002080060] |This gives a sense of how important convexity has been.
[224002080070] |There's also a bit of an undercurrent of "convexity isn't so great" from other sides of the ML community (roughly from the neural nets folks); see, for instance, Yann LeCun's talk Who's Afraid of Non-convex Loss Functions, a great and entertaining talk.
[224002080080] |There's a part of me that loves convexity.
[224002080090] |Not having to do random restarts, being assured of global convergence, etc., all sounds very nice.
[224002080100] |I use logistic regression/maxent for almost all of my classification needs, have never run a neural network, and have only occasionally used SVMs (though of course they are convex, too). When I teach ML (as I'm doing now), I make a big deal about convexity: it makes life easy in many ways.
[224002080110] |That said, almost none of my recent papers reflect this.
[224002080120] |In fact, in the structure compilation paper, we flat out say that non-linearity in the model (which leads to a non-convex loss function) is the major reason why CRFs outperform independent classifiers in structured prediction tasks!
[224002080130] |Moreover, whenever I start doing Bayesian stuff, usually solved with some form of MCMC, I've completely punted on everything convex.
[224002080140] |In a "voting with my feet" world, I could care less about convexity!
[224002080150] |For the most part, if you're using EM or sampling or whatever, you don't care much about it either.
[224002080160] |Somehow we (fairly easily!) tolerate whatever negative effects there are of non-convex optimization.
[224002080170] |I think one reason why such things don't bother us, as NLPers, as much as they bother the average machine learning person is that we are willing to invest some energy in intelligent initialization.
[224002080180] |This already puts us in a good part of the optimization landscape, and doing local hillclimbing from there is not such a big deal.
[224002080190] |A classic example is the "Klein and Manning" smart initializer for unsupervised parsing, where a small amount of human knowledge goes a long way above a random initializer.
[224002080200] |Another style of initialization is the IBM alignment model style.
[224002080210] |IBM model 4 is, of course, highly non-convex and ridiculously difficult to optimize (the E step is intractable).
[224002080220] |So they do a smart initialization, using the output of model 3.
[224002080230] |Model 3, too, is highly non-convex (but not quite so much so), so they initialize with model 2.
[224002080240] |And so on, down to model 1, which is actually convex and fairly easy to optimize.
[224002080250] |This sequencing of simple models to complex models also happens in some statistical analysis, where you first fit first order effects and then later fit higher order effects.
[224002080260] |The danger, of course, is that you end up on a bad hill to climb, but overall this generally appears to be a bigger win than starting somewhere in the middle of a random swamp.
[224002080270] |(Of course, later, Bob Moore had this cute argument that even though model 1 is convex, we don't actually ever optimize it to the global optimum, so doing clever initialization for model 1 is also a good idea!)
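The IBM pipeline is a big production, but the same sequencing trick shows up in much smaller settings. Here's a hedged sketch (using scikit-learn purely for brevity) of fitting a simple model first and using it to initialize a richer, less convex one: k-means means seeding a full-covariance GMM. (As it happens, scikit-learn's GaussianMixture initializes with k-means by default, which says something about how standard the trick has become.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [3, 0], [0, 3])])   # toy data with three blobs

# Stage 1: the "simple" model -- k-means, cheap and hard to mess up.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Stage 2: the "complex" model -- a full-covariance GMM with a non-convex likelihood,
# started from the k-means solution instead of somewhere in a random swamp.
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      means_init=km.cluster_centers_, random_state=0).fit(X)
print(gmm.means_)
```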
[224002080280] |These two ideas: clever initialization, and sequential initialization, seem like powerful ideas that I would like to see make their way into more complex models.
[224002080290] |For instance, in the original LDA paper, Dave Blei used an initialization where they pick some random documents as seeds for topics.
[224002080300] |As far as I know, no one really does this anymore (does anyone know why: does it really not matter?), but as we keep building more and more complex models, and lose hope that our off the shelf optimizer (or sampler) is going to do anything reasonable, we're probably going to need to get back to this habit, perhaps trying to formalize it in the meantime.
[224002090010] |Getting Started In: Bayesian NLP
[224002090020] |This isn't so much a post in the "GSI" series, but just two links that recently came out.
[224002090030] |Kevin Knight and Philip Resnik both just came out with tutorials for Bayesian NLP.
[224002090040] |They're both excellent, and almost entirely non-redundant.
[224002090050] |I highly recommend reading both.
[224002090060] |And I thank Kevin and Philip from the bottom of my heart, since I'd been toying with the idea of writing such a thing (for a few years!) and they've saved me the effort.
[224002090070] |I'd probably start with Kevin's and then move on to Philip's (which is more technically meaty), but either order is really fine.
[224002090080] |Bayesian Inference with Tears by Kevin
[224002090090] |Gibbs Sampling for the Uninitiated by Philip
[224002090100] |Thanks again to both of them.
[224002090110] |(And if you haven't read Kevin's previous workbook on SMT -- which promises free beer! -- I highly recommend that, too.)
[224002100010] |NLP as a study of representations
[224002100020] |Ellen Riloff and I run an NLP reading group pretty much every semester.
[224002100030] |Last semester we covered "old school NLP."
[224002100040] |We independently came up with lists of what we consider some of the most important ideas (idea = paper) from pre-1990 (most are much earlier) and let students select which to present.
[224002100050] |There was a lot of overlap between Ellen's list and mine (not surprisingly).
[224002100060] |If people are interested, I can provide the whole list (just post a comment and I'll dig it up).
[224002100070] |The whole list of topics is posted as a comment.
[224002100080] |The topics that were actually selected are here.
[224002100090] |I hope the students have found this exercise useful.
[224002100100] |It gets you thinking about language in a way that papers from the 2000s typically do not.
[224002100110] |It brings up a bunch of issues that we no longer think about frequently.
[224002100120] |Like language.
[224002100130] |(Joking.)
[224002100140] |(Sort of.)
[224002100150] |One thing that's really stuck out for me is how much "old school" NLP comes across essentially as a study of representations.
[224002100160] |Perhaps this is a result of the fact that AI -- as a field -- was (and, to some degree, still is) enamored with knowledge representation problems.
[224002100170] |To be more concrete, let's look at a few examples.
[224002100180] |It's already been a while since I read these last (I had meant to write this post during the spring when things were fresh in my head), so please forgive me if I goof a few things up.
[224002100190] |I'll start with one I know well: Mann and Thompson's rhetorical structure theory paper from 1988.
[224002100200] |This is basically "the" RST paper.
[224002100210] |I think that when many people think of RST, they think of it as a list of ways that sentences can be organized into hierarchies.
[224002100220] |Eg., this sentence provides background for that one, and together they argue in favor of yet a third.
[224002100230] |But this isn't really where RST begins.
[224002100240] |It begins by trying to understand the communicative role of text structure.
[224002100250] |That is, when I write, I am trying to communicate something.
[224002100260] |Everything that I write (if I'm writing "well") is toward that end.
[224002100270] |For instance, in this post, I'm trying to communicate that old school NLP views representation as the heart of the issue.
[224002100280] |This current paragraph is supporting that claim by providing a concrete example, which I am using to try to convince you of my claim.
[224002100290] |As a more detailed example, take the "Evidence" relation from RST.
[224002100300] |M+T have the following characterization of "Evidence."
[224002100310] |Herein, "N" is the nucleus of the relation, "S" is the satellite (think of these as sentences), "R" is the reader and "W" is the writer:
[224002100320] |relation name: Evidence
constraints on N: R might not believe N to a degree satisfactory to W
constraints on S: R believes S or will find it credible
constraints on N+S: R's comprehending S increases R's belief of N
the effect: R's belief of N is increased
locus of effect: N
[224002100330] |This is a totally different way from thinking about things than I think we see nowadays.
[224002100340] |I kind of liken it to how I tell students not to program.
[224002100350] |If you're implementing something moderately complex (say, forward/backward algorithm), first write down all the math, then start implementing.
[224002100360] |Don't start implementing first.
[224002100370] |I think nowadays (and sure, I'm guilty!) we see a lot of implementing without the math.
[224002100380] |Or rather, with plenty of math, but without a representational model of what it is that we're studying.
[224002100390] |The central claim of the RST paper is that one can think of texts as being organized into elementary discourse units, and these are connected into a tree structure by relations like the one above.
[224002100400] |(Or at least this is my reading of it.)
[224002100410] |That is, they have laid out a representation of text and claimed that this is how texts get put together.
[224002100420] |As a second example (this will be sorter), take Wendy Lehnert's 1982 paper, "Plot units and narrative summarization."
[224002100430] |Here, the story is about how stories get put together.
[224002100440] |The most interesting thing about the plot units model to me is that it breaks from how one might naturally think about stories.
[224002100450] |That is, I would naively think of a story as a series of events.
[224002100460] |The claim that Lehnert makes is that this is not the right way to think about it.
[224002100470] |Rather, we should think about stories as sequences of affect states. Effectively, an affect state is how a character is feeling at any time.
[224002100480] |(This isn't quite right, but it's close enough.)
[224002100490] |For example, Lehnert presents the following story:
[224002100500] |When John tried to start his car this morning, it wouldn't turn over.
[224002100510] |He asked his neighbor Paul for help.
[224002100520] |Paul did something to the carburetor and got it going.
[224002100530] |John thanked Paul and drove to work.
[224002100540] |The representation put forward for this story is something like: (1) negative-for-John (the car won't start), which leads to (2) motivation-for-John (to get it started), which leads to (3) positive-for-John (it's started), which then links back and resolves (1).
[224002100550] |You can also analyze the story from Paul's perspective, and then add links that go between the two characters showing how things interact.
[224002100560] |The rest of the paper describes how these relations work, and how they can be put together into more complex event sequences (such as "promised request bungled").
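Just to show how little machinery the representation itself needs, here's a sketch of that analysis as data; the state and link names are my paraphrase of the text above, not Lehnert's exact notation:

```python
from dataclasses import dataclass

@dataclass
class AffectState:
    character: str
    kind: str          # "positive", "negative", or "motivation" (a mental state)
    description: str

states = [
    AffectState("John", "negative",   "car won't start"),      # (1)
    AffectState("John", "motivation", "wants the car fixed"),  # (2)
    AffectState("John", "positive",   "car is running"),       # (3)
    AffectState("Paul", "motivation", "agrees to help"),
    AffectState("Paul", "positive",   "request fulfilled"),
]

# Links between states: (from_index, to_index, label).  The last John link is the
# one that "links back and resolves (1)" in the analysis above.
links = [
    (0, 1, "motivates"),
    (1, 2, "actualizes"),
    (2, 0, "resolves"),
    (1, 3, "cross-character request"),   # John's goal becomes Paul's request
    (3, 4, "actualizes"),
]
```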
[224002100570] |Again, a high level representation of how stories work from the perspective of the characters.
[224002100580] |So now I, W, hope that you, R, have an increased belief in the title of the post.
[224002100590] |Why do I think this is interesting?
[224002100600] |Because at this point, we know a lot about how to deal with structure in language.
[224002100610] |From a machine learning perspective, if you give me a structure and some data (and some features!), I will learn something.
[224002100620] |It can even be unsupervised if it makes you feel better.
[224002100630] |So in a sense, I think we're getting to a point where we can go back, look at some really hard problems, use the deep linguistic insights from two decades (or more) ago, and start taking a crack at things that are really deep.
[224002100640] |Of course, features are a big problem; as a very wise man once said to me: "Language is hard.
[224002100650] |The fact that statistical association mining at the word level made it appear easy for the past decade doesn't alter the basic truth. :-)."
[224002100660] |We've got many of the ingredients to start making progress, but it's not going to be easy!
[224002110010] |K-means vs GMM, sum-product vs max-product
[224002110020] |I finished K-means and Gaussian mixture models in class last week or maybe the week before.
[224002110030] |I've previously discussed the fact that these two are really solving different problems (despite being structurally so similar), but today's post is about something different.
[224002110040] |There are two primary differences between the typical presentation of K-means and the typical presentation of GMMs.
[224002110050] |(I say "typical" because you can modify these algorithms fairly extensively as you see fit.)
[224002110060] |The first difference is that GMMs just have more parameters.
[224002110070] |The parameters of K-means are typically the cluster assignments ("z") and the means ("mu").
[224002110080] |The parameters of a GMM are typically these (z and mu) as well as the class prior probabilities ("pi") and cluster covariances ("Sigma").
[224002110090] |The GMM model is just richer.
[224002110100] |Of course, you can restrict it so all clusters are isotropic and all prior probabilities are even, in which case you've effectively removed this difference (or you can add these things into K-means).
[224002110110] |The second difference is that GMMs operate under the regime of "soft assignments," meaning that points aren't wed to clusters: they only prefer (in a probabilistic sense) some clusters to others.
[224002110120] |This falls out naturally from the EM formalization, where the soft assignments are simply the expectations of the assignments under the current model.
[224002110130] |One can get rid of the second difference by running "hard EM" (also called "Viterbi EM" in NLP land), where the expectations are clamped at their most likely value.
[224002110140] |This leads to something that has much more of a K-means feel.
[224002110150] |This "real EM" versus "hard EM" distinction comes up a lot in NLP, where computing exact expectations is often really difficult.
[224002110160] |(Sometimes you get complex variants, like the "pegging" approaches in the IBM machine translation models, but my understanding from people who run in this circle is that pegging is much ado about nothing.)
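The difference is easy to see in code. Here's a minimal sketch of the E-step for a mixture of Gaussians (isotropic, unit variance, uniform priors, to keep it short), with the one line that turns "real EM" into hard/Viterbi EM:

```python
import numpy as np

def e_step(X, means, hard=False):
    """Responsibilities r[i, k] = p(cluster k | point i) under unit-variance,
    equal-prior Gaussians.  With hard=True, each row is clamped to its argmax."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # squared distances, shape (n, k)
    logp = -0.5 * d2
    logp -= logp.max(axis=1, keepdims=True)                   # numerical stability
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)                         # soft assignments (real EM)
    if hard:
        r = np.eye(means.shape[0])[r.argmax(axis=1)]          # hard/Viterbi EM: winner takes all
    return r

def m_step(X, r):
    """New means are responsibility-weighted averages; with hard assignments
    this is exactly the k-means mean update."""
    return (r.T @ X) / r.sum(axis=0)[:, None]
```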
[224002110170] |My general feeling has always been "if you don't have much data, do real EM; if you have tons of data, hard EM is probably okay."
[224002110180] |(This is purely from a practical perspective.)
[224002110190] |The idea is that when you have tons and tons of data, you can approximate expectations reasonably well by averaging over many data points.
[224002110200] |(Yes, this is hand-wavy and it's easy to construct examples where it fails.
[224002110210] |But it seems to work many times.)
[224002110220] |Of course, you can get pedantic and say "hard EM sucks: it's maximizing p(x,z) but I really want to maximize p(x)" to which I say: ho hum, who cares, you don't actually care about p(x), you care about some extrinsic evaluation metric which, crossing your fingers, you hope correlates with p(x), but for all I know it correlates better with p(x,z).
[224002110230] |Nevertheless, a particular trusted friend has told me he's always remiss when he can't do full EM and has to do hard EM: he's never seen a case where it doesn't help.
[224002110240] |(Or maybe "rarely" would be more fair.)
[224002110250] |Of course, this comes at a price: for many models, maximization (search) can be done in polynomial time, but computing expectations can be #P-hard (basically because you have to enumerate -- or count -- over every possible assignment).
[224002110260] |Now let's think about approximate inference in graphical models.
[224002110270] |Let's say I have a graphical model with some nodes I want to maximize over (call them "X") and some nodes I want to marginalize out (call them "Z").
[224002110280] |For instance, in GMMs, the X nodes would be the means, covariances and cluster priors; the Z nodes would be the assignments.
[224002110290] |(Note that this is departing slightly from typical notation for EM.)
[224002110300] |Suppose I want to do inference in such a model.
[224002110310] |Here are three things I can do:
[224002110320] |Just run max-product.
[224002110330] |That is, maximize p(X,Z) rather than p(X).
[224002110340] |Just run sum-product.
[224002110350] |That is, compute expectations over X and Z, rather than just over Z.
[224002110360] |Run EM, by alternating something like sum-product on Z and something like max-product on X.
[224002110370] |Of these, only (3) is really doing the "right thing."
[224002110380] |Further, let's get away from the notion of p(X) not correlating with some extrinsic evaluation by just measuring ourselves against exact inference.
[224002110390] |(Yes, this limits us to relatively small models with 10 or 20 binary nodes.)
[224002110400] |What do you think happens?
[224002110410] |Well, first, things vary as a function of the number of X nodes versus Z nodes in the graph.
[224002110420] |When most of the nodes are X (maximization) nodes, then max-product does best and EM basically does the same.
[224002110430] |When most of the nodes are Z (marginalization) nodes, then EM does best and sum-product does almost the same.
[224002110440] |But max product also does almost the same.
[224002110450] |This is an effect that we've been seeing regularly, regardless of what the models look like (chains or lattices), what the potentials look like (high temperature or low temperature) and how you initialize these models (eg., in the chain case, EM converges to different places depending on initialization, while sum- and max-product do not).
[224002110460] |Max product is just unbeatable.
[224002110470] |In a sense, from a practical perspective, this is nice.
[224002110480] |It says: if you have a mixed model, just run max product and you'll do just as well as if you had done something more complicated (like EM).
[224002110490] |But it's also frustrating: we should be getting some leverage out of marginalizing over the nodes that we should marginalize over.
[224002110500] |Especially in the high temperature case, where there is lots of uncertainty in the model, max product should start doing worse and worse (note that when we evaluate, we only measure performance on the "X" nodes -- the "Z" nodes are ignored).
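For models this small you can check the claim by brute force. Here's a hedged sketch of the comparison: a random joint over a handful of binary X and Z nodes stands in for a real graphical model (the real experiments obviously use structured potentials and message passing, not enumeration):

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nZ = 3, 3                                    # tiny on purpose: we can enumerate everything
joint = rng.random((2,) * (nX + nZ))             # unnormalized p(X, Z); first nX axes are X nodes
joint /= joint.sum()

# The "right thing": marginalize out Z, then maximize over X.
pX = joint.sum(axis=tuple(range(nX, nX + nZ)))
exact_X = np.unravel_index(pX.argmax(), pX.shape)

# Max-product analog: maximize over X and Z jointly, then keep only the X part.
maxprod_X = np.unravel_index(joint.argmax(), joint.shape)[:nX]

# Sum-product analog: take each X node's marginal argmax independently.
sumprod_X = tuple(int(joint.sum(axis=tuple(j for j in range(nX + nZ) if j != i)).argmax())
                  for i in range(nX))

print("exact:", exact_X, " max-product:", maxprod_X, " sum-product:", sumprod_X)
```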
[224002110510] |Likening this back to K-means versus GMM, for the case where the models are the same (GMM restricted to not have priors or covariances), the analogy is that as far as the means go, it doesn't matter which one you use.
[224002110520] |Even if there's lots of uncertainty in the data.
[224002110530] |Of course, you may get much better assignments from GMM (or you may not, I don't really know).
[224002110540] |But if all you really care about at the end of the day are the Xs (the means), then our experience with max-product suggests that it just won't matter.
[224002110550] |At all.
[224002110560] |Ever.
[224002110570] |Part of me finds this hard to believe, and note that I haven't actually run experiments with K-means and GMM, but the results in the graphical model cases are sufficiently strong and reproducible that I'm beginning to trust them.
[224002110580] |Shouldn't someone have noticed this before, though?
[224002110590] |For all the effort that's gone into various inference algorithms for graphical models, why haven't we ever noticed that you just can't beat max-product?
[224002110600] |(Yes, I'm aware of some theoretical results, eg., the Wainwright result that sum-product + randomized rounding is a provably good approximation to the MAP assignment, but this result actually goes the other way, and contradicts many of our experimental studies where sum-product + rounding just flat out sucks.
[224002110610] |Maybe there are other results out there that we just haven't been able to dig up.)
[224002120010] |From Kivinen/Warmuth and EG to CW learning and Adaptive Regularization
[224002120020] |This post is a bit of a historical retrospective, because it's only been recently that these things have aligned themselves in my head.
[224002120030] |It all goes back to Jyrki Kivinen and Manfred Warmuth's STOC 1995 paper on exponentiated gradient descent.
[224002120040] |For those who haven't read this paper, or haven't read it recently, it's a great read (although it tends to repeat itself a lot).
[224002120050] |It's particularly interesting because they derive gradient descent and exponentiated gradient descent (GD and EG) as a consequence of other assumptions.
[224002120060] |In particular, suppose we have an online learning problem, where at each time step we receive an example x, make a linear prediction (w'x) and then suffer a loss.
[224002120070] |The idea is that if we suffer no loss, then we leave w as is; if we do suffer a loss, then we want to balance two goals:
[224002120080] |Change w enough so that we wouldn't make this error again
[224002120090] |Don't change w too much
[224002120100] |The key question is how to define "too much."
[224002120110] |Suppose that we measure changes in w by looking at Euclidean distance between the updated w and the old w.
[224002120120] |If we work through the math for enforcing 1 while minimizing 2, we derive the gradient descent update rule that's been used for optimizing, eg., perceptrons for squared loss for ages.
[224002120130] |The magic is what happens if we use something other than Euclidean distance.
[224002120140] |If, instead, we assume that the ws are all positive, we can use an (unnormalized) KL divergence to measure differences between weight vectors.
[224002120150] |Doing this leads to multiplicative updates, or the exponentiated gradient algorithm.
[224002120160] |(Obvious (maybe?) open question: what happens if you replace the distance with some other divergence, say a Bregman, or alpha or phi-divergence?)
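To make the contrast concrete, here is a minimal sketch of the two resulting updates for online squared loss (the step size, the data, and the positivity assumption for EG are all placeholders for illustration):

import numpy as np

def gd_update(w, x, y, eta=0.1):
    # Euclidean "don't move too far" penalty -> additive update
    grad = (w @ x - y) * x
    return w - eta * grad

def eg_update(w, x, y, eta=0.1):
    # unnormalized KL penalty (assumes w > 0) -> multiplicative update
    grad = (w @ x - y) * x
    return w * np.exp(-eta * grad)

x, y = np.array([1.0, 0.5, -0.2]), 0.7
print(gd_update(np.ones(3), x, y))
print(eg_update(np.ones(3), x, y))

The only difference is which penalty you used to define "too much": Euclidean gives you the additive update, unnormalized KL gives you the multiplicative one.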
[224002120170] |This line of thinking leads naturally to Crammer et al.'s work on Online Passive Aggressive algorithms, from JMLR 2006.
[224002120180] |Here, the idea remains the same, but instead of simply ensuring that we make a correct classification, ala rule (1) above, we ensure that we make a correct classification with a margin of at least 1.
[224002120190] |They use Euclidean distance to measure the difference in weight vectors, and, for many cases, can get closed-form updates that look GD-like, but not exactly GD.
[224002120200] |(Aside: what happens if you use, eg., KL instead of Euclidean?)
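For the basic hinge-loss/Euclidean case, the closed-form PA update is short enough to write out; a sketch (ignoring the slack-penalized variants, which I believe the paper calls PA-I and PA-II):

import numpy as np

def pa_update(w, x, y):
    # enforce margin >= 1 while moving w as little as possible (Euclidean)
    loss = max(0.0, 1.0 - y * (w @ x))
    tau = loss / (x @ x)          # closed-form step size
    return w + tau * y * x

w = pa_update(np.zeros(4), np.array([1.0, 0.0, 2.0, -1.0]), +1)
print(w)                          # new w achieves margin exactly 1 on this example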
[224002120210] |Two years later, Mark Dredze, Koby Crammer and Fernando Pereira presented Confidence-Weighted Linear Classification.
[224002120220] |The idea here is the same: don't change the weight vectors too much, but achieve good classification.
[224002120230] |The insight here is to represent weight vectors by distributions over weight vectors, and the goal is to change these distributions enough, but not too much.
[224002120240] |Here, we go back to KL, because KL makes more sense for distributions, and make a Gaussian assumption on the weight vector distribution.
[224002120250] |(This has close connections both to PAC-Bayes and, if I'm wearing my Bayesian hat, Kalman filtering when you make a Gaussian assumption on the posterior, even though it's not really Gaussian... it would be interesting to see how these similarities play out.)
[224002120260] |The cool thing here is that you effectively get variable learning rates on different parameters, where confident parameters get moved less.
[224002120270] |(In practice, one really awesome effect is that you tend to only need one pass over your training data to do a good job!)
[224002120280] |If you're interested in the Bayesian connection, you can get a very similar style algorithm if you do EP on a Bayesian classification algorithm (by Stern, Herbrich and Graepel), which is what Microsoft Bing uses for online ads.
[224002120290] |This finally brings us to NIPS this year, where Koby Crammer, Alex Kulesza and Mark Dredze presented work on Adaptive Regularization of Weight Vectors.
[224002120300] |Here, they take Confidence Weighted classification and turn the constraints into pieces of the regularizer (somewhat akin to doing a Lagrangian trick).
[224002120310] |Doing so allows them to derive a representer theorem.
[224002120320] |But again, the intuition is exactly the same: don't change the classifier too much, but enough.
[224002120330] |All in all, this is a very interesting line of work.
[224002120340] |The reason I'm posting about it is because I think seeing the connections makes it easier to sort these different ideas into bins in my head, depending on what your loss is (squared versus hinge), what your classifier looks like (linear versus distribution over linear) and what your notion of "similar classifiers" is (Euclidean or KL).
[224002120350] |(Aside: Tong Zhang has a paper on regularized winnow methods, which fits in here somewhere, but not quite as cleanly.)
[224002130010] |Some random NIPS thoughts...
[224002130020] |I missed the first two days of NIPS due to teaching.
[224002130030] |Which is sad -- I heard there were great things on the first day.
[224002130040] |I did end up seeing a lot that was nice.
[224002130050] |But since I missed stuff, I'll instead post some paper suggestions from one of my students, Piyush Rai, who was there.
[224002130060] |You can tell his biases from his selections, but that's life :).
[224002130070] |More of my thoughts after his notes...
[224002130080] |Says Piyush:
[224002130090] |There was an interesting tutorial by Gunnar Martinsson on using randomization to speed-up matrix factorization (SVD, PCA etc) of really really large matrices (by "large", I mean something like 10^6 x 10^6).
[224002130100] |People typically use Krylov subspace methods (e.g., the Lanczos algo) but these require multiple passes over the data.
[224002130110] |It turns out that with the randomized approach, you can do it in a single pass or a small number of passes (so it can be useful in a streaming setting).
[224002130120] |The idea is quite simple.
[224002130130] |Let's assume you want the top K evals/evecs of a large matrix A.
[224002130140] |The randomized method draws K *random* vectors from a Gaussian and uses them in some way (details here) to get a "smaller version" B of A on which doing SVD can be very cheap.
[224002130150] |Having got the evals/evecs of B, a simple transformation will give you the same for the original matrix A.
[224002130160] |The success of many matrix factorization methods (e.g., Lanczos) also depends on how quickly the spectrum (the eigenvalues) decays, and they also suggest ways of dealing with cases where the spectrum doesn't decay that rapidly.
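Roughly, the recipe looks like the following sketch (this is my recollection of the Halko/Martinsson-style algorithm, with the oversampling and power-iteration refinements left out; as written it touches A twice, so the true single-pass variants need a bit more bookkeeping):

import numpy as np

def randomized_svd(A, k):
    rng = np.random.default_rng(0)
    m, n = A.shape
    Omega = rng.normal(size=(n, k))        # K random Gaussian vectors
    Y = A @ Omega                          # sample the range of A
    Q, _ = np.linalg.qr(Y)                 # orthonormal basis for that range
    B = Q.T @ A                            # the "smaller version" of A (k x n)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub, s, Vt                   # lift back to the original space

A = np.random.default_rng(1).normal(size=(200, 100))
U, s, Vt = randomized_svd(A, 10)
print(s[:5])                               # approximate top singular values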
[224002130170] |Some papers from the main conference that I found interesting:
[224002130180] |Distribution Matching for Transduction (Alex Smola and 2 other guys): They use maximum mean discrepancy (MMD) to do predictions in a transduction setting (i.e., when you also have the test data at training time).
[224002130190] |The idea is to use the fact that we expect the output functions f(X) and f(X') to be the same or close to each other (X are training and X' are test inputs).
[224002130200] |So instead of using the standard regularized objective used in the inductive setting, they use the distribution discrepancy (measured by say D) of f(X) and f(X') as a regularizer.
[224002130210] |D actually decomposes over pairs of training and test examples so one can use a stochastic approximation of D (D_i for the i-th pair of training and test inputs) and do something like an SGD.
[224002130220] |Semi-supervised Learning using Sparse Eigenfunction Bases (Sinha and Belkin from Ohio): This paper uses the cluster assumption of semi-supervised learning.
[224002130230] |They use unlabeled data to construct a set of basis functions and then use labeled data in the LASSO framework to select a sparse combination of basis functions to learn the final classifier.
[224002130240] |Streaming k-means approximation (Nir Ailon et al.): This paper does an online optimization of the k-means objective function.
[224002130250] |The algo is based on the previously proposed kmeans++ algorithm.
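For reference, the kmeans++ seeding it builds on is just D^2 sampling; a quick sketch of that seeding step (the data and k here are arbitrary):

import numpy as np

def kmeanspp_seeds(X, k):
    rng = np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]           # first center uniformly at random
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        p = d2 / d2.sum()                         # sample proportional to squared distance
        centers.append(X[rng.choice(len(X), p=p)])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(1000, 2))
print(kmeanspp_seeds(X, 5))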
[224002130260] |The Wisdom of Crowds in the Recollection of Order Information.
[224002130270] |It's about aggregating rank information from various individuals to reconstruct the global ordering.
[224002130280] |Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora (by some folks at gatech): The problem setting is interesting here.
[224002130290] |Here the "multi-instance" is a bit of a misnomer.
[224002130300] |It means that each example in turn can consist of several sub-examples (which they call instances).
[224002130310] |E.g., a document consists of several paragraphs, or a webpage consists of text, images, videos.
[224002130320] |Construction of Nonparametric Bayesian Models from Parametric Bayes Equations (Peter Orbanz): If you care about Bayesian nonparametrics. :) It basically builds on the Kolmogorov consistency theorem to formalize, and sort of give a recipe for, the construction of nonparametric Bayesian models from their parametric counterparts.
[224002130330] |Seemed to be a good step in the right direction.
[224002130340] |Indian Buffet Processes with Power-law Behavior (YWT and Dilan Gorur): This paper actually does the exact opposite of what I had thought of doing for IBP.
[224002130350] |The IBP (much like the Dirichlet process) encourages the "rich-gets-richer" phenomenon in the sense that a dish that has already been selected by a lot of customers is highly likely to be selected by future customers as well.
[224002130360] |This leads the expected number of dishes (and thus latent features) to be something like O(alpha * log n).
[224002130370] |This paper tries to be even more aggressive and makes the relationship have a power-law behavior.
[224002130380] |What I wanted to do was a reverse behavior -- maybe more like a "socialist IBP" :) where the customers in IBP are sort of evenly distributed across the dishes.
[224002130390] |The rest of this post are random thoughts that occurred to me at NIPS.
[224002130400] |Maybe some of them will get other people's wheels turning?
[224002130410] |This was originally an email I sent to my students, but I figured I might as well post it for the world.
[224002130420] |But forgive the lack of capitalization :):
[224002130430] |persi diaconis' invited talk about reinforcing random walks... that is, you take a random walk, but every time you cross an edge, you increase the probability that you re-cross that edge (see coppersmith + diaconis, rolles + diaconis).... this relates to a post i had a while ago: nlpers.blogspot.com/2007/04/multinomial-on-graph.html ... i'm thinking that you could set up a reinforcing random walk on a graph to achieve this. the key problem is how to compute things -- basically want you want is to know for two nodes i,j in a graph and some n >= 0, whether there exists a walk from i to j that takes exactly n steps. seems like you could craft a clever data structure to answer this question, then set up a graph multinomial based on this, with reinforcement (the reinforcement basically looks like the additive counts you get from normal multinomials)... if you force n=1 and have a fully connected graph, you should recover a multinomial/dirichlet pair.
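a tiny sketch of the edge-reinforcement mechanism itself (toy graph, unit initial weights, +1 reinforcement per crossing -- all made up for illustration):

import random
from collections import defaultdict

# undirected toy graph; edge weights start at 1 and get reinforced when crossed
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
counts = defaultdict(lambda: 1.0)   # pseudo-count per (undirected) edge

def step(node):
    nbrs = graph[node]
    weights = [counts[frozenset((node, n))] for n in nbrs]
    nxt = random.choices(nbrs, weights=weights)[0]
    counts[frozenset((node, nxt))] += 1.0      # reinforce the crossed edge
    return nxt

node = 0
for _ in range(1000):
    node = step(node)
print(sorted(((tuple(e), c) for e, c in counts.items()), key=lambda t: -t[1]))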
[224002130440] |also from persi's talk, persi and some guy sergei (sergey?) have a paper on variable length markov chains that might be interesting to look at, perhaps related to frank wood's sequence memoizer paper from icml last year.
[224002130450] |finally, also from persi's talk, steve mc_something from ohio has a paper on using common gamma distributions in different rows to set dependencies among markov chains... this is related to something i was thinking about a while ago where you want to set up transition matrices with stick-breaking processes, and to have a common, global, set of sticks that you draw from... looks like this steve mc_something guy has already done this (or something like it).
[224002130460] |not sure what made me think of this, but related to a talk we had here a few weeks ago about unit tests in scheme, where they basically randomly sample programs to "hope" to find bugs... what about setting this up as an RL problem where your reward is high if you're able to find a bug with a "simple" program... something like 0 if you don't find a bug, or 1/|P| if you find a bug with program P. (i think this came up when i was talking to percy -- liang, the other one -- about some semantics stuff he's been looking at.) afaik, no one in PL land has tried ANYTHING remotely like this... it's a little tricky because of the infinite but discrete state space (of programs), but something like an NN-backed Q-learning might do something reasonable :P.
[224002130470] |i also saw a very cool "survey of vision" talk by bill freeman... one of the big problems they talked about was that no one has a good p(image) prior model. the example given was that you usually have de-noising models like p(image)*p(noisy image|image) and you can weight p(image) by ^alpha... as alpha goes to zero, you should just get a copy of your noisy image... as alpha goes to infinity, you should end up getting a good image, maybe not the one you *want*, but an image nonetheless. this doesn't happen.
[224002130480] |one way you can see that this doesn't happen is in the following task. take two images and overlay them. now try to separate the two. you *clearly* need a good prior p(image) to do this, since you've lost half your information.
[224002130490] |i was thinking about what this would look like in language land. one option would be to take two sentences and randomly interleave their words, and try to separate them out. i actually think that we could solve this tasks pretty well. you could probably formulate it as a FST problem, backed by a big n-gram language model. alternatively, you could take two DOCUMENTS and randomly interleave their sentences, and try to separate them out. i think we would fail MISERABLY on this task, since it requires actually knowing what discourse structure looks like. a sentence n-gram model wouldn't work, i don't think. (although maybe it would? who knows.) anyway, i thought it was an interesting thought experiment. i'm trying to think if this is actually a real world problem... it reminds me a bit of a paper a year or so ago where they try to do something similar on IRC logs, where you try to track who is speaking when... you could also do something similar on movie transcripts.
[224002130500] |hierarchical topic models with latent hierarchies drawn from the coalescent, kind of like hdp, but not quite. (yeah yeah i know i'm like a parrot with the coalescent, but it's pretty freaking awesome :P.)
[224002130510] |That's it!
[224002130520] |Hope you all had a great holiday season, and enjoy your New Years (I know I'm going skiing. A lot. So there, Fernando! :)).
[224002140010] |ArXiV and NLP, ML and Computer Science
[224002140020] |Arxiv is something of an underutilized resource in computer science.
[224002140030] |Indeed, many computer scientists seem not to even know it exists, despite it having been around for two decades now!
[224002140040] |On the other hand, it is immensely popular among (some branches of) mathematics and physics.
[224002140050] |This used to strike me as odd: arxiv is a computer service, so why haven't computer scientists jumped on it?
[224002140060] |Indeed, I spent a solid day a few months ago putting all my (well almost all my) papers on arxiv.
[224002140070] |One can always point to "culture" for such things, but I suspect there are more rational reasons why it hasn't affected us as much as it has others.
[224002140080] |I ran in to arxiv first when I was in math land.
[224002140090] |The following is a cartoon view of how (some branches of) math research gets published:
[224002140100] |Authors write a paper
[224002140110] |Authors submit paper to a journal
[224002140120] |Authors simultaneously post paper on arxiv
[224002140130] |Journal publishes (or doesn't publish) paper
[224002140140] |We can contrast this with how life goes in CS land:
[224002140150] |Conference announces deadline
[224002140160] |One day before deadline, authors write a paper
[224002140170] |Conference publishes (or rejects) paper
[224002140180] |I think there are a few key differences that matter.
[224002140190] |Going up to the mathematician model, we can ask ourselves, why do they do #3?
[224002140200] |It's a way to get the results out without having to wait for a journal to come back with a go/no-go response.
[224002140210] |Basically in the mathematician model, arxiv is used for advertising while a journal is used for a stamp of approval (or correctness).
[224002140220] |So then why don't we do arxiv too?
[224002140230] |I think there are two reasons.
[224002140240] |First, we think that conference turn around is good enough -- we don't need anything faster.
[224002140250] |Second, it completely screws up our notions of blind review.
[224002140260] |If everyone simultaneously posted a paper on arxiv when submitting to a conference, we could no longer claim, at all, to be blind.
[224002140270] |(Please, I beg of you, do not start commenting about blind review versus non-blind review -- I hate this topic of conversation and it never goes anywhere!)
[224002140280] |Basically, we rely on our conferences to do both advertising and stamp of approval.
[224002140290] |Of course, the speed of conferences is mitigated by the fact that you sometimes have to go through two or three before your paper gets in, which can make it as slow as, or slower than, journals.
[224002140300] |In a sense, I think that largely because of the blind thing, and partially because conferences tend to be faster than journals, the classic usage of arxiv is not really going to happen in CS.
[224002140310] |(There's one other potential use for arxiv, which I'll refer to as the tech-report effect.
[224002140320] |I've many times seen short papers posted on people's web pages either as tech-reports or as unpublished documents.
[224002140330] |I don't mean tutorial like things, like I have, but rather real semi-research papers.
[224002140340] |These are papers that contain a nugget of an idea, but for which the authors seem unwilling to go all the way to "make it work."
[224002140350] |One could imagine posting such things on arxiv.
[224002140360] |Unfortunately, I really dislike such papers.
[224002140370] |It's very much a "flag planting" move in my opinion, and it makes life difficult for people who follow.
[224002140380] |That is, if I have an idea that's in someone else's semi-research paper, do I need to cite them?
[224002140390] |Ideas are a dime a dozen: making it work is often the hard part.
[224002140400] |I don't think you should get to flag plant without going through the effort of making it work.
[224002140410] |But that's just me.)
[224002140420] |However, there is one prospect that arxiv could serve that I think would be quite valuable: literally, as an archive.
[224002140430] |Right now, ACL has the ACL anthology.
[224002140440] |UAI has its own repository.
[224002140450] |ICML has a rather sad state of affairs where, from what I can tell, papers from ICML #### are just on the ICML #### web page and if that happens to go down, oh well.
[224002140460] |All of these things could equally well be hosted on arxiv, which has strong government support to be sustained, is open access, blah blah blah.
[224002140470] |This brings me to a question for you all: how would you feel if all (or nearly all) ICML papers were to be published on arxiv?
[224002140480] |That is, if your paper is accepted, instead of uploading a camera-ready PDF to the ICML conference manager website, you instead uploaded to arxiv and then sent your arxiv DOI link to the ICML folks?
[224002150010] |Interviewing Follies
[224002150020] |Continuing on the theme of applying for jobs, I thought I'd share some interviewing follies that have happened to me, that I've observed others do, and that I've heard about.
[224002150030] |There is a moral to this story; if you want to skip the stories and get to the moral, scroll to past the bullet points.
[224002150040] |Missing your plane.
[224002150050] |I had an interview in a place that was about a 1-2 hour flight away.
[224002150060] |I flew out first thing in the morning and back last thing at night.
[224002150070] |Except I didn't fly out first thing in the morning: I missed my flight.
[224002150080] |Why?
[224002150090] |Because I cut flights close (someone once said "if you've never missed a flight, you're spending too much time in the airport") and the particular flight I was on left not out of a normal gate, but out of one of those that you have to take a shuttle bus to.
[224002150100] |I didn't know that, didn't factor in the extra 5 minutes, and missed the flight.
[224002150110] |I called the places I was interviewing at, re-arranged meetings and the day proceeded with a small hiccup.
[224002150120] |I ended up getting an offer from this place.
[224002150130] |Missing a meeting.
[224002150140] |I was interviewing at a different place, going through my daily meetings, got really tired and misread my schedule.
[224002150150] |I thought I was done when in fact I had one meeting to go.
[224002150160] |I caught a cab to the airport, flew back home, and noticed a few frantic emails trying to figure out where I was (this is before I had an email-capable cell phone).
[224002150170] |(As an aside, someone once told me that they would intentionally skip meetings on interview days with people outside their area, figuring that neither the candidate nor the interviewer really wanted such a meeting.
[224002150180] |They would hang out in the restroom or something, and blame a previous meeting running long on the miss.
[224002150190] |This was not the case in my setting.)
[224002150200] |I did not end up getting an offer from this place.
[224002150210] |Someone interviewing here a long time ago was scheduled to give their talk using transparencies.
[224002150220] |Two minutes before the talk they realized that they had left them in the hotel room.
[224002150230] |The already-assembled audience was asked to stay put, the speaker was quickly driven to the hotel and back, and proceeded to give one of the best interview talks on record here.
[224002150240] |This person ended up getting a job offer.
[224002150250] |Someone interviewing somewhere I know left their laptop in their hotel, just like number 3.
[224002150260] |But instead of having their host drive them back to the hotel, they borrowed someone's car to drive back to the hotel.
[224002150270] |They crashed the car, but managed to get their laptop, and gave a great talk.
[224002150280] |This person ended up getting a job offer.
[224002150290] |I flew in late to an interview, getting to my hotel around midnight.
[224002150300] |I woke up the next morning at seven for an 8:30 breakfast meeting.
[224002150310] |I unpacked my suit, tie, belt, socks and shoes.
[224002150320] |And realized I had forgotten to pack a dress shirt.
[224002150330] |All I had was the shirt I had worn on the plane: a red graphic tee shirt.
[224002150340] |My mind quickly raced to figure out if there was a place I could buy a dress shirt in the middle of nowhere at 7am.
[224002150350] |I quickly realized that it was impossible, wore my tee shirt under my suit jacket, and went through the day as if that was how I had planned it.
[224002150360] |I ended up getting a job offer.
[224002150370] |The moral of this story is that bad things happen during interviews.
[224002150380] |I can't compare any of my stories to the crash-the-car story, but we've all been there, done stupid things, and gotten through it unscathed.
[224002150390] |I think the trick is to pretend like it was intentional, or at least not get flustered.
[224002150400] |Yes I missed my flight, yes I forgot my shirt, yes I crashed your car.
[224002150410] |But it doesn't affect the rest of my day.
[224002150420] |You have to be able to relax and forgive yourself minor mistakes: the interviewers really are looking at the bigger picture.
[224002150430] |That said, there are definitely things you can do to botch an interview.
[224002150440] |They have to do with things like giving a bad talk (my experience is that a huge weight is placed on the quality of your job talk) and not having a clear vision for where you're going in research life.
[224002150450] |Don't be afraid to disagree with people you're talking to: we usually aren't trying to hire yes-men or yes-women.
[224002150460] |Once you get a job, especially a faculty job, you are the one who is going to make things happen for you.
[224002150470] |You have a world view and that's part of what we're hiring.
[224002150480] |Let us see it.
[224002150490] |We might not always agree, but if you have reasons for your view that you can articulate, we'll listen.
[224002150500] |But don't focus on the little things, and don't get flustered.
[224002160010] |A machine learner's apology
[224002160020] |Andrew Gelman recently announced an upcoming talk by John Lafferty.
[224002160030] |This reminded me of a post I've been meaning to write for ages (years, really) but haven't gotten around to.
[224002160040] |Well, now I'm getting around to it.
[224002160050] |A colleague from Utah (not in ML) went on a trip and spent some time talking to a computational statistician, who will remain anonymous.
[224002160060] |But let's call this person Alice.
[224002160070] |The two were talking about various topics and at one point machine learning came up.
[224002160080] |Alice commented:
[224002160090] |"Machine learning is just non-rigorous computational statistics."
[224002160100] |Or something to that effect.
[224002160110] |A first reaction is to get defensive: no it's not!
[224002160120] |But Alice has a point.
[224002160130] |Some subset of machine learning, in particular the more Bayesian side, tends to overlap quite a bit with compstats, so much so that in some cases they're probably not really distinguishable.
[224002160140] |(That is to say, there's a subset of ML that's very very similar to a subset of compstats... you could probably fairly easily find some antipoles that are amazingly different.)
[224002160150] |And there's a clear intended interpretation to the comment: it's not that we're not rigorous, it's that we're not rigorous in the way that computational statisticians are.
[224002160160] |To that end, let me offer a glib retort:
[224002160170] |Computational statistics is just machine learning where you don't care about the computation.
[224002160180] |In much the same way that I think Alice's claim is true, I think this claim is also true.
[224002160190] |The part of machine learning that's really strong on the CS side, cares a lot about computation: how long, how much space, how many samples, etc., will it take to learn something.
[224002160200] |This is something that I've rarely seen in compstats, where the big questions really have to do with things like: is this distribution well defined, can I sample from it, etc., now let me run Metropolis-Hastings.
[224002160210] |(Okay, I'm still being glib.)
[224002160220] |I saw a discussion on a theory blog recently that STOC/FOCS is about "THEORY of algorithms" while SODA is about "theory of ALGORITHMS" or something like that.
[224002160230] |(Given the capitalization, perhaps it was Bill Gasarch :)?)
[224002160240] |Likewise, I think it's fair to say that classic ML is "MACHINE learning" or "COMPUTATIONAL statistics" and classic compstats is "machine LEARNING" or "computational STATISTICS."
[224002160250] |We're really working on very similar problems, but the things that we value tend to be different.
[224002160260] |Due to that, I've always found it odd that there's not more interaction between compstats and ML.
[224002160270] |Sure, there's some... but not very much.
[224002160280] |Maybe it's cultural, maybe it's institutional (conferences versus journals), maybe we really know everything we need to know about the other side and talking wouldn't really get us anywhere.
[224002160290] |But if it's just a sense of "I don't like you because you're treading on my field," then that's not productive (either direction), even if it is an initial gut reaction.
[224002170010] |Coordinate ascent and inverted indices...
[224002170020] |Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices.
[224002170030] |(Yes, I'm vaguely aware of discussions in DB/Web search land about whether you should store your inverted indices in a database or whether you should handroll your own.
[224002170040] |This is tangential to the point of this post.)
[224002170050] |For those of you who don't remember your IR 101, here's the deal with inverted indices.
[224002170060] |We're a search engine and want to be able to quickly find pages that contain query terms.
[224002170070] |One way of storing our set of documents (eg., the web) is to store a list of documents, each of which is a list of words appearing in that document.
[224002170080] |If there are N documents of length L, then answering a query is O(N*L) since we have to look over each document to see if it contains the word we care about.
[224002170090] |The alternative is to store an inverted index, where we have a list of words and for each word, we store the list of documents it appears in. Answering a query here is something like O(1) if we hash them, O(log |V|) if we do binary search (V = vocabulary), etc.
[224002170100] |Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook.
[224002170110] |And the computation difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany" versus going to the index in the back of the book.
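The whole idea fits in a few lines; here is a toy sketch (real indices obviously store postings far more cleverly than Python lists):

from collections import defaultdict

docs = ["germany beat argentina", "argentina beat mexico", "the index of a textbook"]

# forward storage: doc -> words; answering a query is O(N*L)
def slow_query(word):
    return [i for i, d in enumerate(docs) if word in d.split()]

# inverted index: word -> list of docs containing it; answering a query is ~O(1) with a hash
index = defaultdict(list)
for i, d in enumerate(docs):
    for w in set(d.split()):
        index[w].append(i)

print(slow_query("argentina"), index["argentina"])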
[224002170120] |Now, let's say we have an inverted index for, say, the web.
[224002170130] |It's pretty big (and in all honesty, probably distributed across multiple storage devices or multiple databases or whatever).
[224002170140] |But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; here's word v and its documents.
[224002170150] |Suppose that, outside of the index, we have a classification task over the documents on the web.
[224002170160] |That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document.
[224002170170] |It's either +1, -1 or ? (? == unknown, or unlabeled).
[224002170180] |My argument is that this is a very plausible set up for a very large scale problem.
[224002170190] |Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over documents rather than iterating over words (where I'm assuming words == features, for now...).
[224002170200] |That would be ridiculously expensive.
[224002170210] |The alternative is to do some sort of coordinate ascent algorithm.
[224002170220] |These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a coordinate ascent algorithm for maxent models that apparently worked quite well.
[224002170230] |(In searching for that paper, I just came across a 2009 paper on roughly the same topic that I hadn't seen before.)
[224002170240] |Some other algorithms have a coordinate ascent feel, for instance the LASSO (and relatives, including the Dantzig selector+LASSO = DASSO), but they wouldn't really scale well in this problem because it would require a single pass over the entire index to make one update.
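To make the contrast concrete, here is a rough sketch of coordinate-style updates for squared loss when the data is stored feature-major (word -> postings list). The toy postings, labels and number of passes are all made up; the point is just that each update touches only the documents containing that one feature, plus a residual vector you keep around:

import numpy as np

# toy feature-major data: feature j -> list of (doc_id, value); one label per doc
postings = {0: [(0, 1.0), (1, 1.0)], 1: [(1, 2.0), (2, 1.0)], 2: [(0, 1.0), (2, 3.0)]}
y = np.array([1.0, -1.0, 1.0])

w = np.zeros(len(postings))
resid = y.copy()                      # residual = y - Xw, maintained incrementally

for it in range(20):
    for j, plist in postings.items():              # one linear scan of the index
        docs = np.array([d for d, _ in plist])
        vals = np.array([v for _, v in plist])
        # exact coordinate minimizer of the squared loss along w_j
        delta = (vals * resid[docs]).sum() / (vals * vals).sum()
        w[j] += delta
        resid[docs] -= delta * vals                # only docs containing feature j are touched

print(w, resid)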
[224002170250] |Other approaches, such as boosting, etc., would fare very poorly in this setting.
[224002170260] |This observation first led me to wonder if we can do something LASSO or boosting like in this setting.
[224002170270] |But then that made me wonder if this is a special case, or if there are other cases in the "real world" where your data is naturally laid out as features * data points rather than data points * features.
[224002170280] |Sadly, I cannot think of any.
[224002170290] |But perhaps that's not because there aren't any.
[224002170300] |(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)
[224002180010] |Blog heads east
[224002180020] |I started this blog ages ago while still in grad school in California at USC/ISI.
[224002180030] |It came with me three point five years ago when I started as an Assistant Professor at the University of Utah.
[224002180040] |Starting some time this coming summer, I will take it even further east: to CS at the University of Maryland where I have just accepted a faculty offer.
[224002180050] |These past (almost) four years at Utah have been fantastic for me, which has made this decision to move very difficult.
[224002180060] |I feel very lucky to have been able to come here.
[224002180070] |I've had enormous freedom to work in directions that interest me, great teaching opportunities (which have taught me a lot), and great colleagues.
[224002180080] |Although I know that moving doesn't mean forgetting one's friends, it does mean that I won't run in to them in hallways, or grab lunch, or afternoon coffee or whatnot anymore.
[224002180090] |Ellen, Suresh, Erik, John, Tom, Tolga, Ross, and everyone else here have made my time wonderful.
[224002180100] |I will miss all of them.
[224002180110] |The University here has been incredibly supportive in every way, and I've thoroughly enjoyed my time here.
[224002180120] |Plus, having world-class skiing a half hour drive from my house isn't too shabby either.
[224002180130] |(Though my # of days skiing per year declined geometrically since I started: 30-something the first year, then 18, then 10... so far only a handful this year.
[224002180140] |Sigh.)
[224002180150] |Looking back, my time here has been great and I'm glad I had the opportunity to come.
[224002180160] |That said, I'm of course looking forward to moving to Maryland also, otherwise I would not have done it!
[224002180170] |There are a number of great people there in natural language processing, machine learning and related fields.
[224002180180] |I'd like to think that UMD should be and is one of the go-to places for these topics, and am excited to be a part of it.
[224002180190] |Between Bonnie, Philip, Lise, Mary, Doug, Judith, Jimmy and the other folks in CLIP and related groups, I think it will be a fantastic place for me to be, and a fantastic place for all those PhD-hungry students out there to go!
[224002180200] |Plus, having all the great folks at JHU's CLSP a 45 minute drive away will be quite convenient.
[224002180210] |A part of me is sad to be leaving, but another part of me is excited at new opportunities.
[224002180220] |The move will take place some time over the summer (carefully avoiding conferences), so if I blog less then, you'll know why.
[224002180230] |Thanks again to everyone who has made my life here fantastic.
[224002190010] |Senses versus metaphors
[224002190020] |I come from a tradition of not really believing in word senses.
[224002190030] |I fondly remember a talk Ed Hovy gave when I was a grad student.
[224002190040] |He listed the following example sentences and asked each audience member to group them in to senses:
[224002190050] |John drove his car to work.
[224002190060] |We drove to the university every morning.
[224002190070] |She drove me to school every day.
[224002190080] |He drives me crazy.
[224002190090] |She is driven by her passion.
[224002190100] |He drove the enemy back.
[224002190110] |She finally drove him to change jobs.
[224002190120] |He drove a nail into the wall.
[224002190130] |Bill drove the ball far out into the field.
[224002190140] |My students are driving away at ACL papers.
[224002190150] |What are you driving at?
[224002190160] |My new truck drives well.
[224002190170] |He drives a taxi in New York.
[224002190180] |The car drove around the corner.
[224002190190] |The farmer drove the cows into the barn.
[224002190200] |We drive the turnpike to work.
[224002190210] |Sally drove a golf ball clear across the green.
[224002190220] |Mary drove the baseball with the bat.
[224002190230] |We drove a tunnel through the hill.
[224002190240] |The steam drives the engine in the train.
[224002190250] |We drove the forest looking for game.
[224002190260] |Joe drove the game from their hiding holes.
[224002190270] |Most people in the audience came up with 5 or 6 senses.
[224002190280] |One came up with two (basically the physical versus mental distinction).
[224002190290] |According to wordnet, each of these is a separate sense.
[224002190300] |(And this is only for the verb form!)
[224002190310] |A common "mistake" people made was to group 1, 2, 3, 13 and 14, all of which seem to have to do with driving cars.
[224002190320] |The key distinction is that 1 expresses the operation of the vehicle, 2 expresses being transported, 3 expresses being caused to move and 13 expresses driving for a job.
[224002190330] |You can read the full WordNet descriptions if you don't believe me.
[224002190340] |Now, the point of this isn't to try to argue that WordNet is wacky in any way.
[224002190350] |The people who put it together really know what they're talking about.
[224002190360] |After all, these senses are all really different, in the sense that there really is a deep interpretive difference between 1, 2, 3 and 13.
[224002190370] |It's just sufficiently subtle that unless it's pointed out to you, it's not obvious.
[224002190380] |There's been a lot of work recently from Ed and others on "consolidating" senses in the OntoNotes project: in fact, they have exactly the same example (how convenient) where they've grouped the verb drive in to seven senses, rather than 22.
[224002190390] |These are:
[224002190400] |operating or traveling via a vehicle (WN 1, 2, 3, 12, 13, 14, 16)
[224002190410] |force to a position or stance (WN 4, 6, 7, 8, 15, 22)
[224002190420] |exert energy on behalf of something (WN 5, 10)
[224002190430] |cause object to move rapidly by striking it (WN 9, 17, 18)
[224002190440] |a directed course of conversation (WN 11)
[224002190450] |excavate horizontally, as in mining (WN 19)
[224002190460] |cause to function or operate (WN 20)
[224002190470] |Now, again, I'm not here to argue that these are better or worse or anything in comparison to WordNet.
[224002190480] |The point is that there are (at least) two ways of explaining the wide senses of a word like "drive."
[224002190490] |One is through senses and this is the typical approach, at least in NLP land.
[224002190500] |The other is metaphor (and yes, that is a different Mark Johnson).
[224002190510] |I'm not going to go so far as to claim that everything is a metaphor, but I do think it provides an alternative perspective on this issue.
[224002190520] |And IMO, alternative perspectives, if plausible, are always worth looking at.
[224002190530] |Let's take a really simple "off the top of my head" example based on "drive."
[224002190540] |Let's unrepentantly claim that there is exactly one sense of drive.
[224002190550] |Which one?
[224002190560] |It seems like the most reasonable is probably OntoNotes' sense 2; Merriam-Webster claims that drive derives from Old-High-German "triban" which, from what I can tell in about a five minute search, has more to do with driving cattle than anything else.
[224002190570] |(But even if I'm wrong, this is just a silly example.)
[224002190580] |Obviously we don't drive cars like we drive cattle.
[224002190590] |For one, we're actually inside the cars.
[224002190600] |But the whole point of driving cattle is to get them to go somewhere.
[224002190610] |If we think of cars as metaphorical cattle, then by operating them, we are "driving" them (in the drive-car sense).
[224002190620] |These are mostly literal.
[224002190630] |However, for "drive a nail", we need to think of the nail as like a cow that we're trying to get into a pen (the wall).
[224002190640] |This is, I think, the most clear metaphorical usage.
[224002190650] |"He is driving away at his thesis" really means that he's trying to get his thesis to go somewhere (where == to completion).
[224002190660] |Driving balls is like driving cattle, except you have to work harder to do it because they aren't self-propelled.
[224002190670] |This is somewhat like driving nails.
[224002190680] |"What are you driving at" is analogous to driving-at-thesis to me.
[224002190690] |"Drive a tunnel through the mountain" is less clear to me.
[224002190700] |But it's also not a sense of this word that I think I have ever used or would ever use.
[224002190710] |So I can't quite figure it out.
[224002190720] |"Steam drives an engine" is sort of a double metaphor.
[224002190730] |Engine is standing in for cow and steam is standing in for cowboy.
[224002190740] |But otherwise it's basically the same as driving cattle.
[224002190750] |Maybe this isn't the greatest example, but hopefully at least it's a bit thought-worthy.
[224002190760] |(And yes, I know I'm departing from Lakoff... in a Lakoff style, there's always a concrete thing and a non-concrete thing in the Lakoff setup from what I understand.)
[224002190770] |This reminds me of the annoying thing my comrades and I used to do as children.
[224002190780] |"I come from a tradition..."
[224002190790] |Yields "You literally come from a tradition?"
[224002190800] |(No, I was educated in such a tradition.... although even that you could ask whether I was really inside a tradition.)
[224002190810] |"A talk Ed Hovy gave..."
[224002190820] |Yields "Ed literally gave a talk?"
[224002190830] |(No, he spoke to an audience.)
[224002190840] |"I drove the golf ball across the field" Yields "You got in the golf ball and drove it across the field?"
[224002190850] |Sigh.
[224002190860] |Kids are annoying.
[224002190870] |Why should I care which analysis I use (senses or metaphor)?
[224002190880] |I'm not sure.
[224002190890] |It's very rare that I actually feel like I'm being seriously hurt by the word sense issue, and it seems that if you want to use sense to do a real task like translation, you have to depart from human-constructed sense inventories anyway.
[224002190900] |But I can imagine a system roughly like the following.
[224002190910] |First, find the verb and its frame and true literal meaning (maybe it actually does have more than one).
[224002190920] |This verb frame will impose some restrictions on its arguments (for instance, drive might say that both the agent and theme have to be animate).
[224002190930] |If you encounter something where this is not true (eg., a "car" as a theme or "passion" as an agent), you know that this must be a metaphorical usage.
[224002190940] |At this point, you have to deduce what it must mean.
[224002190950] |That is, if we have some semantics associated with the literal interpretation, we have to figure out how to munge it to work in the metaphorical case.
[224002190960] |For instance, for drive, we might say that the semantics are roughly "E = theme moves & E' = theme executes E & agent causes E'." If the patient cannot actually execute things (it's a nail), then we have to figure that something else (eg., in this case, the agent) did the actual executing.
[224002190970] |Etc.
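As a cartoon of the detection step, something like the following (the tiny frame inventory and the "animacy" test are, of course, completely made up):

# hypothetical frame inventory: verb -> required property of each argument
FRAMES = {"drive": {"agent": "animate", "theme": "animate"}}
ANIMATE = {"farmer", "cow", "cattle", "he", "she"}

def properties(word):
    return {"animate"} if word in ANIMATE else {"inanimate"}

def is_metaphorical(verb, agent, theme):
    frame = FRAMES[verb]
    # a literal reading requires the arguments to satisfy the frame's restrictions
    return (frame["agent"] not in properties(agent)
            or frame["theme"] not in properties(theme))

print(is_metaphorical("drive", "farmer", "cow"))    # False: literal
print(is_metaphorical("drive", "he", "nail"))       # True: metaphorical, go munge the semantics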
[224002190980] |So it seems like the options are: come up with semantics and frames for every sense (this is what's done, eg., in VerbNet).
[224002190990] |Or, have a single (or small number) of semantics and frames and have some generic rules (hopefully generic!) for how to derive metaphorical uses from them.
[224002200010] |Google 5gram corpus has unreasonable 5grams
[224002200020] |In the context of something completely unrelated, I was looking for a fairly general pattern in the Google 1TB corpus.
[224002200030] |In particular, I was looking for verbs that are sort of transitive.
[224002200040] |I did a quick grep for 5grams of the form "the SOMETHING BLAHed the SOMETHING."
[224002200050] |Or, more specifically:
[224002200060] |I then took these, lower cased them, and then merged the counts.
[224002200070] |Here are the top 25, sorted and with counts:
[224002200080] |What the heck?!
[224002200090] |First of all, the first one is shocking, but maybe you could convince me.
[224002200100] |How about numbers 4 and 5?
[224002200110] |"The trolls ambushed the dwarfs" (and vice versa)?
[224002200120] |These things are the fourth and fifth most common five grams matching my pattern on the web?
[224002200130] |"The poet wicked the woman"?
[224002200140] |What does "wicked" even mean?
[224002200150] |And yet these all beat out "The bill passed the house" and "The court instructed the jury".
[224002200160] |But then #23: "The prince compiled the Mishna"??? (#30 is also funny: "the matrix reloaded the matrix" is an amusing segmentation issue.)
[224002200170] |If we do a vanilla google search for the counts of some of these, we get:
[224002200180] |This just flabbergasts me.
[224002200190] |I'm told that lots of people have expressed worries over the Google 1TB corpus, but have never actually heard anything myself...
[224002200200] |And never seen anything myself.
[224002200210] |Does anyone have an explanation for these effects?
[224002200220] |How can I expect to get anything done with such ridiculous data!
[224002210010] |Classification weirdness, regression simplicity
[224002210020] |In the context of some work on multitask learning, we came to realize that classification is kind of weird.
[224002210030] |Or at least linear classification.
[224002210040] |It's not that it's weird in a way that we didn't already know: it's just sort of a law of unexpected consequences.
[224002210050] |If we're doing linear (binary) classification, we all know that changing the magnitude of the weight vector doesn't change the predictions.
[224002210060] |A standard exercise in a machine learning class might be to show that if your data is linearly separable, then for some models (for instance, unregularized models), the best solution is usually an infinite norm weight vector that's pointing in the right direction.
[224002210070] |This is definitely not true of (linear) regression. Taking a good (or even perfect) linear regressor and blowing up the weights by some constant will kill your performance.
[224002210080] |By adding a regularizer, what you're basically doing is just saying how big you want that norm to be.
[224002210090] |Of course, by regression I simply mean minimizing something like squared error and by classification I mean something like 0/1 loss or hinge loss or logistic loss or whatever.
[224002210100] |I think this is stuff that we all know.
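Here is a two-minute demonstration of the asymmetry (random data; the weight vector defines both the labels and the regression targets, so unscaled it is perfect for both):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
y_class = np.sign(X @ w)          # labels come from w itself, so w is a perfect classifier
y_reg = X @ w                     # regression targets, so w is a perfect regressor

for c in [1.0, 5.0, 100.0]:
    acc = np.mean(np.sign(X @ (c * w)) == y_class)   # identical for every positive c
    mse = np.mean((X @ (c * w) - y_reg) ** 2)        # zero at c=1, huge at c=100
    print(c, acc, mse)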
[224002210110] |Where this can bite you in unexpected ways is the following.
[224002210120] |In lots of problems, like domain adaptation and multitask learning, you end up making assumptions roughly of the form "my weight vector for domain A should look like my weight vector for domain B" where "look like" is really the place where you get to be creative and define things how you feel best.
[224002210130] |This is all well and good in the regression setting.
[224002210140] |A magnitude 5 weight in regression means a magnitude 5 weight in regression.
[224002210150] |But not so in classification.
[224002210160] |Since you can arbitrarily scale your weight vectors and still get the same decision boundaries, a magnitude 5 weight kind of means nothing.
[224002210170] |Or at least it means something that has to do more with the difficulty of the problem and how you chose to set your regularization parameter, rather than something to do with the task itself.
[224002210180] |Perhaps we should be looking for definitions of "look like" that are insensitive to things like magnitude.
[224002210190] |Sure you can always normalize all your weight vectors to unit norm before you co-regularize them, but that loses information as well.
[224002210200] |Perhaps this is a partial explanation of some negative transfer.
[224002210210] |One thing that you see, when looking at the literature in DA and MTL, is that all the tasks are typically of about the same difficulty.
[224002210220] |My expectation is that if you have two tasks that are highly related, but one is way harder than the other, this is going to lead to negative transfer.
[224002210230] |Why?
[224002210240] |Because the easy task will get low norm weights, and the hard task will get high norm weights.
[224002210250] |The high norm weights will pull the low norm weights toward them too much, leading to worse performance on the "easy" task.
[224002210260] |In a sense, we actually want the opposite to happen: if you have a really hard task, it shouldn't screw up everyone else that's easy!
[224002210270] |(Yes, I know that being Bayesian might help here since you'd get a lot of uncertainty around those high norm weight vectors!)
[224002220010] |When Maximum Likelihood Doesn't Make Sense
[224002220020] |One of the most fun AHA moments in ML or stats is when you see that for multinomial distributions, your naive idea of relative frequency corresponds (through a bunch of calculus) to maximum likelihood estimation.
[224002220030] |Ditto means of Gaussians, though that's never quite as amazing because Gaussians seems sort of odd beasts anyway.
[224002220040] |It's nice when our intuitions match what stats tells us.
[224002220050] |I'll often watch a DVD, and then leave it in the DVD player until I want to watch something else.
[224002220060] |Usually it will return to the menu screen and the timer display will go through 0:00, 0:01, 0:02, ..., until it hits the length of the title screen loop, and then go back to 0:00.
[224002220070] |What I always wonder when I glance up at it is "what is the actual length of the title screen loop?"
[224002220080] |That is, what is the highest value it'll count up to?
[224002220090] |Being a good statistician, I set up a model.
[224002220100] |First I play the frequentist game.
[224002220110] |Since I glance up and pretty arbitrary times, my observations can be modeled as a uniform random variable from 0 to some upper bound U, which is the quantity I want to infer.
[224002220120] |Supposing that I only make one observation, say 0:27, I want a maximum likelihood estimate of U. It's easy to see that U=27 is the maximum likelihood solution.
[224002220130] |Anything less than 27 will render my observation impossible.
[224002220140] |For any U>=27, my observation has probability 1/(U+1), so the ML solution is precisely U=27.
[224002220150] |Note that even if I observe five observations a <= b <= c <= d <= e, the ML solution is still U=e. Huh. The math works. The model is as correct as I could really imagine it. But this answer doesn't really seem reasonable to me. Somehow I expect a number greater than my observation to come about.
[224002220160] |Perhaps I need a prior.
[224002220170] |That'll solve it.
[224002220180] |I'll add a prior p(U) and then do maximum a posteriori.
[224002220190] |Well, it's easy to see that if p(U) is unimodal, and its mode is less than my (highest) observation, then the MAP solution will still be the (highest) observation. If its mode is more than my (highest) observation, then the MAP solution will be the mode of p(U).
[224002220200] |If I think about this a bit, it's hard for me to justify a multi-modal p(U), and also hard for me to be happy with a system in which my prior essentially completely determines my solution.
[224002220210] |Or I could be fully Bayesian and look at a posterior distribution p(U | data).
[224002220220] |This will just be a left-truncated version of my prior, again, not very satisfying.
[224002220230] |(Note that the "Maximum Likelihood Sets" idea, which I also like quite a bit, doesn't fix this problem either.)
[224002220240] |It also really bugs me that only my largest observation really affects the answer.
[224002220250] |If I get one hundred observations, most of them centered around 0:10, and then one at 0:30, then I'd guess that 0:30 or 0:31 or 0:32 is probably the right answer.
[224002220260] |But if I observe a bunch of stuff spread roughly uniformly between 0:00 and 0:30, then I'd be more willing to believe something larger.
[224002220270] |I don't really have a solution to this dilemma.
[224002220280] |Perhaps you could argue that my difficulties arise from the use of a Uniform distribution -- namely, that were I to use another distribution, these problems wouldn't really arise.
[224002220290] |I don't think this is quite true, as described below:
[224002220300] |We actually run in to this problem in NLP fairly often.
[224002220310] |I observe a billion documents and see, in these 1b documents, 100k unique words, many of which are singletons.
[224002220320] |But I'm not willing to believe that if I see document 1,000,000,001, there won't be a new unseen word.
[224002220330] |Sure it's less likely that I see a new unique word than in document 1001, where it is almost guaranteed, but there's still a relatively large probability that some new word will show up in that next document.
[224002220340] |The whole point is that somehow we expect random samples to be representatives of the underlying distribution.
[224002220350] |We don't expect them to define the underlying distribution.
[224002220360] |I don't actually expect to ever see the corner cases, unless I've seen everything else.
[224002220370] |UPDATE: The Bayesian approach actually does do something reasonable.
[224002220380] |Here is some Matlab code for computing posteriors with a uniform prior, and results with an upper bound of 100 and various observations are plotted below:
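A rough sketch of that computation (here in Python rather than Matlab; uniform prior on U in {0, ..., 100}, i.i.d. uniform observations):

import numpy as np

def posterior(obs, u_max=100):
    U = np.arange(u_max + 1)
    prior = np.ones(u_max + 1)                        # uniform prior on U
    # likelihood: each observation is uniform on {0, ..., U}, impossible if U < max(obs)
    like = np.where(U >= max(obs), (U + 1.0) ** (-len(obs)), 0.0)
    post = prior * like
    return U, post / post.sum()

U, post = posterior([27])
print("posterior mean given one obs of 27:", (U * post).sum())
U, post = posterior([10, 9, 12, 8, 30])
print("posterior mean given mostly-small obs:", (U * post).sum())

In both cases the posterior mean lands above the largest observation, which is the "something reasonable" referred to above.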
[224002240010] |((A => B) and (not B)) => (not A)
[224002240020] |I remember back in middle school or high school -- added uncertainty so as not to date myself too much -- I first learned of the existence of file compression software.
[224002240030] |We used pkzip, mostly.
[224002240040] |I remember one of the first things I did was: compress myfile.exe to myfile.exe.zip.
[224002240050] |Then compress that to myfile.exe.zip.zip.
[224002240060] |Then myfile.exe.zip.zip.zip.
[224002240070] |I cannot tell you how disappointed I was when, at one stage, the file size went up after "compression."
[224002240080] |We read papers all the time that demonstrate something roughly of the form "if A then B." This happens most obviously the closer you get to theory (when such statements can be made precise), but happens also in non-theoretical work.
[224002240090] |The point of this post is: if you believe A => B, then you have to ask yourself: which do I believe more?
[224002240100] |A, or not B?
[224002240110] |The simplest case is the compression case.
[224002240120] |Let's say a weak compressor is one that always reduces a (non-empty) file's size by one bit.
[224002240130] |A strong compressor is one that cuts the file down to one bit.
[224002240140] |I can easily prove to you that if you give me a weak compressor, I can turn it into a strong compressor by running it N-1 times on files of size N. Trivial, right?
[224002240150] |But what do you conclude from this?
[224002240160] |You're certainly not happy, I don't think. For what I've proved really is that weak compressors don't exist, not that strong compressors do.
[224002240170] |That is, you believe so strongly that a strong compressor is impossible, that you must conclude from (weak) => (strong) that (weak) cannot possibly exist.
[224002240180] |Let's take the most obvious example from machine learning: boosting.
[224002240190] |The basic result of boosting is that I can turn a "weak classifier" into a "strong classifier."
[224002240200] |A strong classifier is one that (almost always) does a really good job at classification.
[224002240210] |A weak classifier is one that (almost always) does a not completely crappy job at classification.
[224002240220] |That is, a strong classifier will get you 99.9% accuracy (most of the time) while a weak classifier will only get you 50.1% accuracy (most of the time).
[224002240230] |Boosting works by repeatedly applying the weak classifier to modified data sets in order to get a strong classifier out.
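For concreteness, here is a minimal sketch of AdaBoost, the canonical instance of this recipe (my own toy version, not any particular paper's): fit a weak learner on reweighted data, upweight its mistakes, repeat, and take a weighted vote.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    y = 2 * y - 1                               # labels in {-1, +1}

    n, T = len(y), 50
    w = np.ones(n) / n                          # example weights, initially uniform
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                # weighted training error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
        w *= np.exp(-alpha * y * pred)          # upweight the examples this stump got wrong
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)

    # The "strong" classifier is a weighted vote over the weak ones.
    strong = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
    print("training accuracy:", (strong == y).mean())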
[224002240240] |In all the boosting papers, the results are presented as positive.
[224002240250] |That is: look, obviously we want strong classifiers.
[224002240260] |But weak classifiers look much more attainable.
[224002240270] |And voila, by boosting magic, we can turn the weak into the strong.
[224002240280] |This is an A => B setting: (existence of weak classifiers) => (existence of strong classifiers).
[224002240290] |Sounds great, right?
[224002240300] |But let me ask you: do you believe more strongly that (weak classifiers exist) or more strongly that (strong classifiers do not exist)?
[224002240310] |For me, it's the latter.
[224002240320] |To some degree, no free lunch theorems apply here.
[224002240330] |This yields a totally different read of boosting results, namely: weak classifiers don't even exist!!!
[224002240340] |More on the language side, one of the more classic experimental results we have is that if you give me a really good semantic representation, I can do an awesome job doing natural language generation from those semantics.
[224002240350] |In order to do translation, for instance, I just have to generate the semantics of a French sentence and then I can generate a near-perfect English translation.
[224002240360] |(French semantics) => (Perfect translations).
[224002240370] |But do I believe more strongly that we can get perfect French semantics, or that we cannot get perfect translations?
[224002240380] |Right now, almost certainly the latter.
[224002240390] |My point is really one about critical reading.
[224002240400] |When you read a paper, things will be presented in one light.
[224002240410] |But that's often not the only light in which they can be interpreted.
[224002240420] |Apply your own interpretation.
[224002260010] |NAACL 2010 Retrospective
[224002260020] |I just returned from NAACL 2010, which was simultaneously located in my home town of Los Angeles and located nowhere near my home town of Los Angeles.
[224002260030] |(That's me trying to deride downtown LA as being nothing like real LA.)
[224002260040] |Overall I was pleased with the program.
[224002260050] |I saw a few talks that changed (a bit) how I think about some problems.
[224002260060] |There were only one or two talks I saw that made me wonder how "that paper" got in, which I think is an acceptable level.
[224002260070] |Of course I spent a great deal of time not at talks, but I no longer feel bad about doing so.
[224002260080] |On tutorials day, I saw Hoifung Poon's tutorial on Markov Logic Networks.
[224002260090] |I think Hoifung did a great job of targeting the tutorial at just the right audience, which probably wasn't exactly me (though I still quite enjoyed it myself).
[224002260100] |I won't try to describe MLNs, but my very brief summary is "language for compactly expressing complex factor graphs (or CRFs, if you prefer)."
[224002260110] |That's not exactly right, but I think it's pretty close.
[224002260120] |You can check back in a few months and see if there are going to be any upcoming "X, Y and Daume, 2011" papers using MLNs :P.
[224002260130] |At any rate, I think it's a topic worth knowing about, especially if you really just want to get a system up and running quickly.
[224002260140] |(I'm also interested in trying Andrew McCallum's Factorie system, which, to some degree, trades ease of use for added functionality.
[224002260150] |But honestly, I don't really have time to just try things these days: students have to do that for me.)
[224002260160] |One of my favorite papers of the conference was one that I hadn't even planned to go see!
[224002260170] |It is Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables by Tetsuji Nakagawa, Kentaro Inui and Sadao Kurohashi.
[224002260180] |(I saw it basically because by the end of the conference, I was too lazy to switch rooms after the previous talk.)
[224002260190] |There are two things I really like about this paper.
[224002260200] |The first is that the type of sentiment they're going after is really broad.
[224002260210] |Example sentences included things I'd love to look up (apparently they were only in the slides), but they were definitely more than just "I love this movie."
[224002260220] |The example in the paper is "Tylenol prevents cancer," which is a nice positive case.
[224002260230] |The basic idea is that some words give you sentiment.
[224002260240] |For instance, by itself, "cancer" is probably negative.
[224002260250] |But then some words flip polarity.
[224002260260] |Like "prevents."
[224002260270] |Or negation.
[224002260280] |Or other things.
[224002260290] |They set up a model based on sentence level annotations with latent variables for the "polarity" words and for the "flipping" words.
[224002260300] |The flipping words are allowed to flip any sentiment below them in the dependency tree.
[224002260310] |Cool idea!
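Here's a toy sketch of just the polarity-propagation intuition (not their CRF with hidden variables; the lexicon and flip list are made up): words carry a polarity, and a flip word reverses the polarity of everything below it in the dependency tree.

    # Toy lexicon and flip set (hypothetical entries, for illustration only).
    POLARITY = {"cancer": -1, "love": +1}
    FLIP = {"prevents", "not"}

    def subtree_polarity(tree):
        """tree is (word, [child subtrees]); returns the summed polarity of the subtree."""
        word, children = tree
        total = POLARITY.get(word, 0) + sum(subtree_polarity(c) for c in children)
        return -total if word in FLIP else total

    # "Tylenol prevents cancer": 'cancer' is negative, but 'prevents' flips it.
    tree = ("prevents", [("Tylenol", []), ("cancer", [])])
    print(subtree_polarity(tree))               # +1, i.e., positive overall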
[224002260320] |Of course, I have to nit-pick the paper a bit.
[224002260330] |It probably would be better to allow arguments/adjuncts to flip polarity, too.
[224002260340] |Otherwise, negation (which is usually a leaf) will never flip anything.
[224002260350] |And adjectives/adverbs can't flip either (eg., going from "happy" to "barely happy").
[224002260360] |But overall I liked the paper.
[224002260370] |A second thing I learned is that XOR problems do exist in real life, which I had previously questioned.
[224002260380] |The answer came (pretty much unintentionally) from the paper The viability of web-derived polarity lexicons by Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan and Ryan McDonald.
[224002260390] |I won't talk much about this paper other than to say that if you have 4 billion web pages, you can get some pretty good sentimenty words, if you're careful to not blindly apply graph propagation.
[224002260400] |But at the end, they throw a meta classifier on the polarity classification task, whose features include things like (1) how many positive terms are in the text, (2) how many negative terms are in the text, (3) how many negations are in the text.
[224002260410] |Voila!
[224002260420] |XOR!
[224002260430] |(Because negation XORs terms.)
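A tiny illustration of why those features are XOR-ish (toy data, not theirs): with features like "has a positive term" and "has a negation," the label flips whenever negation is present, so no linear classifier on those features alone can get all the cases right; add the interaction and it can.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: (has positive term, has negation);
    # rows stand for e.g. "great", "not great", "awful", "not awful".
    X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
    y = np.array([1, 0, 0, 1])                  # 1 = positive sentiment

    print(LogisticRegression().fit(X, y).score(X, y))               # stuck below 1.0: XOR

    X_int = np.hstack([X, X[:, :1] * X[:, 1:]])                     # add the interaction feature
    print(LogisticRegression(C=1e3).fit(X_int, y).score(X_int, y))  # now linearly separable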
[224002260440] |I truly enjoyed Owen Rambow's poster on The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece.
[224002260450] |If you've ever taken a class in mathematical logic, it is very easy for me to summarize this paper: parse trees (dependency or phrase structure) are your language, but unless you have a theory of that language (in the model-theoretic sense), whatever you do is meaningless.
[224002260460] |In more lay terms: you can always push symbols around, but unless you tie a semantics to those symbols, you're really not doing anything.
[224002260470] |Take home message: pay attention to the meaning of your symbols!
[224002260480] |In the category of "things everyone should know about", there was Painless unsupervised learning with features by Taylor Berg-Kirkpatrick, Alexandre Bouchard Côté, John DeNero and Dan Klein.
[224002260490] |The idea is that you can replace your multinomials in an HMM (or other graphical model) with little maxent models.
[224002260500] |Do EM in this for unsupervised learning and you can throw in a bunch of extra features.
[224002260510] |I would have liked to have seen a comparison against naive Bayes with the extra features, but my prior belief is sufficiently strong that I'm willing to believe that it's helpful.
[224002260520] |The only sucky thing about this training regime is that training maxent models with (tens of) thousands of classes is pretty painful.
[224002260530] |Perhaps a reduction like tournaments or SECOC would help bring it down to a log factor.
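As a minimal sketch of the core idea (feature names are made up, and this is only the emission model, not the full EM loop): replace the multinomial emission p(x | z) with a locally normalized log-linear model, so arbitrary features of the word can be thrown in.

    import numpy as np

    vocab = ["walk", "walked", "dog", "dogs"]

    def features(word):                          # hypothetical feature set
        return np.array([1.0,                    # bias
                         float(word.endswith("ed")),   # suffix features
                         float(word.endswith("s"))])

    F = np.stack([features(w) for w in vocab])   # |V| x num_features

    def emission_probs(w_z):
        """p(x | z) proportional to exp(w_z . f(x)), normalized over the vocabulary."""
        scores = F @ w_z
        scores -= scores.max()                   # numerical stability
        p = np.exp(scores)
        return p / p.sum()

    w_z = np.array([0.0, 2.0, 0.0])              # a latent state that prefers -ed words
    print(dict(zip(vocab, emission_probs(w_z).round(3))))
    # In EM, the M-step fits each w_z to match expected feature counts from the
    # E-step (a convex maxent problem), instead of just renormalizing counts.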
[224002260540] |I didn't see the presentation for From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing by Valentin Spitkovsky, Hiyan Alshawi and Dan Jurafsky, but I read it.
[224002260550] |The idea is that you can do better unsupervised dependency parsing by giving your learner progressively harder examples.
[224002260560] |I really really really tried to get something like this to work for unsearn, but nothing helped and most things hurt.
[224002260570] |(I only tried adding progressively longer sentences: other ideas, based on conversations with other folks, include looking at vocabulary size, part of speech (eg., human babies learn words in a particular order), etc.)
[224002260580] |I'm thrilled it actually works.
[224002260590] |Again, I didn't see Discriminative Learning over Constrained Latent Representations by Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, but I learned about the work when I visited UIUC recently (thanks again for the invitation, Dan R.!).
[224002260600] |This paper does exactly what you would guess from the title: learns good discriminative models when you have complex latent structures that you know something about a priori.
[224002260610] |I usually ask people at the end of conferences what papers they liked.
[224002260620] |Here are some papers that were spoken highly of by my fellow NAACLers.
[224002260630] |(This list is almost unadulterated: one person actually nominated one of the papers I thought shouldn't have gotten in, so I've left it off the list.
[224002260640] |Other than that, I think I've included everything that was specifically mentioned to me.)
[224002260650] |Optimal Parsing Strategies for Linear Context-Free Rewriting Systems by Daniel Gildea.
[224002260660] |Products of Random Latent Variable Grammars by Slav Petrov.
[224002260670] |Joint Parsing and Alignment with Weakly Synchronized Grammars by David Burkett, John Blitzer and Dan Klein.
[224002260680] |For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia by Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee.
[224002260690] |Type-Based MCMC by Percy Liang, Michael I. Jordan and Dan Klein.
[224002260700] |I think I probably have two high level "complaints" about the program this year.
[224002260710] |First, I feel like we're seeing more and more "I downloaded blah blah blah data and trained a model using entirely standard features to predict something and it kind of worked" papers.
[224002260720] |I apologize if I've just described your paper, but these papers really rub me the wrong way.
[224002260730] |I feel like I just don't learn anything from them: we already know that machine learning works surprisingly well and I don't really need more evidence of that.
[224002260740] |Now, if my sentence described your paper, but your paper additionally had a really interesting analysis that helps us understand something about language, then you rock!
[224002260750] |Second, I saw a lot of presentations where speakers were somewhat embarrassingly unaware of very prominent, very relevant prior work.
[224002260760] |(In none of these cases was the prior work my own: it was work that's much more famous.)
[224002260770] |Sometimes the papers were cited (and it was more of a "why didn't you compare against that" issue) but very frequently they were not.
[224002260780] |Obviously not everyone knows about all papers, but I recognized this even for papers that aren't even close to my area.
[224002260790] |Okay, I just ranted, so let's end on a positive note.
[224002260800] |I'm leaving the conference knowing more than when I went, and I had fun at the same time.
[224002260810] |Often we complain about the obvious type I errors and not-so-obvious type II errors, but overall I found the program strong.
[224002260820] |Many thanks to the entire program committee for putting together an on-average very good set of papers, and many thanks to all of you for writing these papers!