The study of social cognition ("people thinking about people") and social neuroscience has exploded in the last few years. Much of that energy -- but by no means all of it -- has focused on Theory of Mind.

"Theory of Mind" is something we are all assumed to have -- that is, we all have a theory that other people's actions are best explained by the fact that they have minds which contain wants, beliefs and desires. (One good reason for calling this a "theory" is that while we have evidence that other people have minds and that this governs their behavior, none of us actually has proof. And, in fact, some researchers have been claiming that, although we all have minds, those minds do not necessarily govern our behavior.)

Non-human animals and children under the age of 4 do not appear to have theory of mind, except in perhaps a very limited sense. This leads to the obvious question: what is different about human brains over the age of 4 that allows us to think about other people's thoughts, beliefs and desires?

It might seem like Theory of Mind is such a complex concept that it would be represented diffusely throughout the brain. However, in the last half-decade or so, neuroimaging studies have locked in on two different areas of the brain. One, explored by Jason Mitchell of Harvard, among others, is the medial prefrontal cortex (the prefrontal cortex is, essentially, at the front of your brain; "medial" means it is on the interior surface, where the two hemispheres face each other, rather than on the exterior surface, facing your skull). The other is the temporoparietal junction (where your parietal and temporal lobes meet), first described in neuroimaging studies by Rebecca Saxe of MIT and colleagues.

Not surprisingly, there is some debate about which of these brain areas is more important (this breaks down in the rather obvious way) and also what the two areas do. Mitchell and colleagues tend to favor some version of "simulation theory" -- the idea that people (at least in some situations) guess what somebody else might be thinking by implicitly putting themselves in the other person's shoes. Saxe does not.

Modulo that controversy, theory of mind has been tied to a couple of fairly small and distinct brain regions. These results have been replicated a number of times now and seem to be robust. This opens up the possibility, among other things, of studying cross-species variation in theory of mind, as well as the development of theory of mind as children reach their fourth birthdays.

Publication bias

There is an excellent article on publication bias in Slate today. There is no question that a number of biases affect what gets published and what doesn't. Some are good (not publishing bad studies), some are bad (not publishing studies that disprove a pet theory), and some are ambiguous (not publishing papers that "aren't interesting"). The big questions are which biases have the biggest impact on what makes its way into print, and how you take that into account when evaluating the literature.

Read the Slate article here.

Try this at home: Make your own stereogram

Have you ever wanted to make your own 3D movie? Your own Magic Eye Stereogram? This post will teach you to create (and see) your own 3D images.

Magic Eye Stereograms are a relatively new technology, but they grew out of the classic stereograms created in 1838 by Charles Wheatstone. For those of you who don't know what a stereogram is, the word broadly refers to a 3D-like image produced by presenting different images to each eye.

The theory is pretty straightforward. Focus on some object in your room (such as your computer). Now close one eye, then the other. The objects in your field of vision should shift relative to one another. The closer or farther from you they are (relative to the object you are focusing on), the more they should shift. When you look at a normal photograph (or the text on this screen), this difference is largely lost. The objects in the picture are in the same position relative to one another regardless of which eye you are looking through. However, if a clever engineer rigs up a device so as to show different images to each eye in a way that mimics what happens when you look at natural scenes, you will see the illusion of depth.

For instance, she might present the drawing below on the left to your right eye, and the drawing on the right to your left eye:


If the device is set up so that each picture is lined up perfectly with the other (for instance, if each is in the center of the field of vision of the appropriate eye), you would see the colored Xs in the center at different depths relative to one another. Why? The green X shifts the most between the two images, so you know it is either the closest or the farthest away. Importantly, because it's farther to the left in the image shown to the right eye, it must be closer than the blue or red Xs.

You can demonstrate this to yourself using a pencil. Hold a pencil perfectly vertical a foot or two in front of your face. It should still look vertical even if you look with only one eye. Now, tilt the pencil so that the bottom part points towards your chest (at about a 45 degree angle from the floor). Close your right eye and move the pencil to the right or the left until the pencil appears to be perfectly vertical. Now look at the pencil with your right eye instead. It should appear to slope down diagonally to the left. That is exactly what is happening in the pictures above.

A device that would fuse these two images for you isn't hard to make, but it's even easier to learn how to fuse them simply by crossing your eyes. There are two ways of crossing your eyes -- making them point inwards towards your nose, and making them point outwards. One way will make the green X closer; one will make it farther away. I'll describe how to use the first method, because it's the one I typically use.

Look at the two images and cross your eyes towards your nose. This should cause each of the images to double. What you want to do is turn those four images into three by causing the middle two to overlap. This takes some practice. Try focusing on the Xs that form the rectangular frames of the images. Make each of those Xs line up exactly with the corresponding X from the frame of the other image. If you do this, eventually the two images should fuse into a single image, and you will see the colored Xs in depth. One tip: I find this harder to do on a computer screen than in print, so you might try printing this out.

That is the basic technique. You should be able to make your own and play around with it to see what you can do. For instance, this example has a bar pointing up out of the page, but you can also make a bar point into the page. You also might try creating more complicated objects. If you want, you can send me any images you make (coglanglab_AT_gmail_DOT_com), and I will post them (you can try including them as comments, but that is tricky).

One final tip -- you'll need to use a font that has uniform spacing. Courier will work. Times will not.
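If you'd rather generate the image pair programmatically, here is a minimal sketch (my own, in Java; the characters, column positions, and sizes are arbitrary choices) that prints a left/right pair in which a short bar is shifted one column between the two images, so it should appear at a different depth once the images are fused:

    public class StereogramPair {
        public static void main(String[] args) {
            String frame = "XXXXXXXXXXXXXXXXXXXX";
            StringBuilder left = new StringBuilder("X..................X");
            StringBuilder right = new StringBuilder("X..................X");
            // The bar sits at column 8 in the left-eye image and column 7 in the
            // right-eye image; the one-column disparity is what creates the depth.
            left.replace(8, 12, "####");
            right.replace(7, 11, "####");
            System.out.println(frame + "   " + frame);
            for (int row = 0; row < 5; row++)
                System.out.println(left + "   " + right);
            System.out.println(frame + "   " + frame);
        }
    }

Print the output in a monospaced font and fuse the two halves exactly as described above.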

Finally, here's another stereogram that uses a completely different principle. If you can fuse these images, you should see an illusory white box floating in front of a background of Xs. In a future post, I'll explain how to make these.

Sure, that's plausible

I am happy to say that the results from the recently revived Video Experiment have been excellent, and while we're still collecting some data just in case, the revised paper should be submitted for publication shortly. That is one month after we got the reviewers' comments back on the original manuscript, which is a faster turn-around than I've ever managed before.

In the meantime, a lab-mate is running a new online survey called, "How Likely? A Plausibility Study."

The idea goes like this. We use lots of different types of information to understand what people are saying: Word order, general knowledge, intonation, emotion... and plausibility. If you hear a restaurant employee ask, "Can I bake your order?" you know that the resulting interpretation is implausible. It would be much more plausible to ask, "Can I take your order?"

That sounds like common sense, but we still don't have a good idea of how and when plausibility is used in comprehension. To do research in this area, the first thing we need is some sentences that are more or less plausible than others. The easy way to do it might be to decide for ourselves what we consider to be plausible and implausible sentences.

However, being people who study language all day, we probably aren't very typical. The point of this study is to get a range of people to say how plausible they think different sentences are. Then, these sentences and those ratings can be used in further research.

The survey contains 48 sentences and should take about 10 minutes to do. You can participate in it by clicking here.

Why it doesn't matter if America falls behind in Science

Earlier this year, an article in the New York Times argued that it doesn't matter that the US is losing its edge in science and research. In fact, the country can save considerable money by letting other countries do the hard work. The article went on to explain how this can be viewed as outsourcing: let other, cheaper countries do the basic research, and try to turn that research into products in the US.

Really?

The article quoted research published by the National Academy of Sciences, and while I fully recognize that they know more about this topic, have thought more about this topic, and are no doubt considerably smarter, I'm skeptical.

There are two problems I see. The first is that just because other countries are picking up the slack doesn't mean there isn't slack. The second is that I'm not convinced that, in the long term, allowing all the best, most cutting-edge research to take place in other countries is really economically sound.

Being a Good Citizen

The article in question seems to imply that there is some amount of research, X, that needs to be done. If other countries are willing to do it, then no more is needed.

To make this concrete: on the article's logic, as long as one new disease is cured, say, every five years, there's simply no reason to invest any additional energy into curing diseases. That's enough. And people who have some other disease that hasn't been cured can wait their turn.

The concept is most clear when it comes to disease, but I think the same argument applies everywhere else. Basic science is what gives us new technology, and technology has been humanity's method of improving our quality of life for at least a few million years. Perhaps some people think quality of life is improving fast enough -- or too fast, thank you -- but I, at least, would like my Internet connection to be a bit faster now rather than later.

The fact that China, Taiwan, Singapore & co. are stepping up to the plate is not a reason for us to go on vacation.

Can We Really be Competitive as a Backwater?

The article casts "outsourcing" science as good business by noting that America is still the best at turning science into products. So let other countries do the expensive investment into research -- we'll just do the lucrative part that comes later.

Do they think other countries won't catch on?

I have to imagine that Singapore and similar countries are investing in research because they want to make money. Which means they will want their share of the lucrative research-to-product business. So America's business plan, then, would have to be to try to keep our advantage on that front while losing our advantage on basic research.

This may well be possible. But it has some challenges. It's no accident that the neighborhood around MIT is packed with tech start-ups. I'm not a sociologist, but I can speculate on why that is. First, many of those tech start-ups are founded by MIT graduates. They aren't necessarily Boston natives, but having been drawn to one of the world's great research universities, they end up settling there.

Second, Flat World or not, there are advantages to being close to the action. Many non-scientists don't realize that by the time "cutting-edge" research is published, it is often a year or even several years old. The way to stay truly current is to chat with the researchers over coffee about what they are doing right now, not about what they are writing right now.

Third, science benefits from community. Harvard's biggest advantage, as far as I can tell, is the existence of MIT two miles down the road, and vice versa. Waxing poetic about the free exchange of ideas may sound a bit abstract, but it has a real impact. I have multiple opportunities each week to discuss my current projects with some of the best minds in the field, and I do better work for it.

In short, I think any country that maintains the world's premier scientific community is going to have impressive structural advantages when it comes to converting ideas into money.

That Said...

That said, I think there are two really useful ideas that come out of that article. The first is the challenge against the orthodoxy that strong science = strong economy. Without challenges like these, we can't home in on what exactly is important about funding basic research (not saying I've been successful here, but it is a start, at least). The second is that even if the US maintains its lead in science, that lead is going to shrink no matter what we do, so it's important to think about how to capitalize on discoveries coming in from overseas.

Political Note

Those who are concerned about basic research in the US should note that while John McCain does not list science funding as a priority on his website -- unless you count non-specific support of NASA -- and did not mention it in his convention speech, Barack Obama did both (he supports doubling basic science funding).

Folks in Eastern Washington may be interested to know that a clinical psychologist is running for Congress against an incumbent. Though Mark Mays has been professionally more involved in treatment than in research, research is among his top priorities.
Science's Call to Arms

In case anyone was wondering, I am far from alone in my call for a new science policy in the coming administration. It is the topic of the editorial in the latest issue of Science Magazine, America's premier scientific journal:
For the past 7 years, the United States has had a presidential administration where science has had little place at the table. We have had a president opposed to embryonic stem cell research and in favor of teaching intelligent design. We have had an administration that at times has suppressed, rewritten, ignored, or abused scientific research. At a time when scientific opportunity has never been greater, we have had five straight years of inadequate increases for U.S. research agencies, which for some like the National Institutes of Health (NIH) means decreases after inflation.

All of this has been devastating for the scientific community; has undermined the future of our economy, which depends on innovation; and has slowed progress toward better health and greater longevity for people around the world.
Dr. Porter, the editorialist, goes on to ask

So if you are a U.S. scientist, what should you do now?
He offers a number of ideas, most of which are probably not practical for a graduate student like myself ("volunteer to advise ... candidates on science matters and issues.").

The one that is most practical and which anybody can do is to promote ScienceDebate2008.com. He acknowledges that the program's goal -- a presidential debate dedicated to science -- will not be accomplished in 2008, but the hope is to signal to the media and to politicians that people care about science and science policy.

And who knows? Maybe there will be a science debate in 2012?

Androids Run Amok at the New York Times?

I have been reading Steve Pinker's excellent essay in the New York Times about the advent of personal genetics. Reading it, though, I noticed something odd. The Times includes hyperlinks in most of its articles, usually linking to searches for key terms within its own archive. I used to think this linking was done by hand, as I do in my own posts. Lately, I think it's done by an android (and not a very smart one).

Often the links are helpful in the obvious way. Pinker mentions Kareem Abdul-Jabbar, and the Times helpfully links to a list of recent articles that mention him. Presumably this is for the people who don't know who he is (though a link to the Abdul-Jabbar Wikipedia entry might be more useful).

Some links are less obvious. In a sentence that begins "Though health and nutrition can affect stature..." the Times sticks in a hyperlink for articles related to nutrition. I guess that's in case the word stirs me into wondering what else the Times has written about nutrition. That can't explain the following sentence, though:

Another kind of headache for geneticists comes from gene variants that do have large effects but that are unique to you or to some tiny fraction of humanity.

There is just no way any human thought that readers would want a list of articles from the medical section about headaches. This suggests that the Times simply has a list of keywords that are automatically tagged in every article...or perhaps it is slightly more sophisticated and the keywords vary based on the section of the paper.

I'm not sure how useful this is even in the best of circumstances. Has anyone ever actually clicked on one of these links and read any of the articles listed? If so, comment away!

(picture from Weeklyreader.com)

Games with Words: New Web lab launched

The new Lab is launched (finally). I was a long way from the first to start running experiments on the Web. Nonetheless, when I got started in late 2006, the Web had mostly been used for surveys, and there were only a few examples of really successful Web laboratories (like the Moral Sense Test, FaceResearch and Project Implicit). There were many examples of failed attempts. So I wasn't really sure what a Web laboratory should look like, how it could best be utilized, or what would make it attractive and useful for participants.

I put together a website known as Visual Cognition Online for the lab I was working at. I was intrigued by the possibility of running one-trial experiments. Testing people involves a lot of noise, so we usually try to get many measurements (sometimes hundreds) from each participant, in order to get a good estimate of what we're trying to measure. Sometimes this isn't practical. The best analogy that comes to mind is football. A lot of luck and random variation goes into each game, so ideally, we'd like each team to play each other several times (like happens in baseball). However, the physics of football makes this impractical (it'd kill the players).

Running a study on the Web makes it possible to test more participants, which means we don't need as many trials from each. A few studies worked well enough, and I got other good data along the way (like this project), so when the lab moved to MN and I moved to graduate school, I started the Cognition and Language Lab along the same model.

Web Research blooms

In the last two years, Web research has really taken off, and we've all gotten a better sense of what it was useful for. The projects that make me most excited are those run by the likes of TestMyBrain.org, Games with a Purpose, and Phrase Detectives. These sites harness the massive size of the Internet to do work that wasn't just impossible before -- it was frankly inconceivable.

As I understand it, the folks behind Games with a Purpose are mainly interested in machine learning. They train computer programs to do things, like tag photographs according to content. To train their computer programs, they need a whole bunch of photographs tagged for content; you can't test a computer -- or a person -- if you don't know what the correct answer is. Their games are focused around doing things like tagging photographs. Phrase Detectives does something similar, but with language.

The most exciting results from TestMyBrain.org (full disclosure: the owner is a friend of mine, a classmate at Harvard, and also a collaborator) have focused on the development and aging of various skills. Normally, when we look at development, we test a few different age groups. An extraordinarily ambitious project would test some 5 year olds, some 20 year olds, some 50 year olds, and some 80 year olds. By testing on the Web, they have been able to look at development and aging from the early teenage years through retirement age (I'll blog about some of my own similar work in the near future).

Enter: GamesWithWords.org

This Fall, I started renovating coglanglab.org in order to incorporate some of the things I liked about those other sites. The project quickly grew, and in the end I decided that the old name (Cognition and Language Lab) just didn't fit anymore. GamesWithWords.org was born.

I've incorporated many aspects of the other sites that I like. One is simply to make the site more engaging (reflected, I hope, in the new name). It's always been my goal to make the Lab interesting and fun for participants (the primary goal of this blog is to explain the research and disseminate results), and I've tried to adopt the best ideas I've seen elsewhere.

Ultimately, of course, the purpose of any experiment is not just to produce data, but to produce good data that tests hypotheses and furthers theory. This sometimes limits what I can do with experiments (for instance, while I'd love to give individualized feedback to each participant for every experiment, sometimes the design just doesn't lend itself to feedback). Of the two experiments that are currently live, one offers feedback and one doesn't.

I'll be writing more about the new experiments over the upcoming days.

Sounds of Silence

My lament that, with regard to discussion of education reform, any trace of small liberal arts colleges has disappeared into the ether appears to have, itself, disappeared into the ether. Seriously, readers, I expected some response to that one. There are parts of my post even I disagree with.

Making data public

Lately, there have been a lot of voices (e.g., this one) calling for scientists to make raw data immediately available to the general public. In the interest of answering that call, here's some of my raw data:

Do you feel enlightened? Probably not. Raw data isn't all that useful if you don't know how it was collected, what the different numbers refer to, etc. Even if I told you this is data from this experiment, that probably wouldn't help much. Even showing you the header rows for these data will help only so much:

Some things are straightforward. Some are not. It's important to know that I record data with a separate row for every trial, so each participant has multiple rows. Also, I record all data, even data from participants who did not complete the experiment; if you're unaware of that, your data analyses will come out very wrong. I also have some codes I use to mark that the participant is an experimenter checking to make sure everything is running correctly. You'd need to know those. And it's key to know how responses are coded (it's not simply "right" or "wrong" -- in fact, the column called totalCorrect does not record whether the participant got anything correct).
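To make that concrete, here's a minimal sketch (hypothetical file name, column positions, codes, and trial count; not my actual analysis script) of the bookkeeping a reanalysis would need before computing anything: skip experimenter test runs, drop participants who never finished, and group the remaining rows by participant.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.*;

    public class CleanRawData {
        static final int EXPECTED_TRIALS = 48;   // assumed number of trials per participant

        public static void main(String[] args) throws Exception {
            Map<String, List<String[]>> trialsByParticipant = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("rawdata.tsv"))) {
                in.readLine();                               // skip the header row
                String line;
                while ((line = in.readLine()) != null) {
                    String[] row = line.split("\t");
                    String participant = row[0];             // assumed: first column is the participant ID
                    if (participant.startsWith("TEST"))      // assumed code for experimenter check runs
                        continue;
                    trialsByParticipant
                        .computeIfAbsent(participant, k -> new ArrayList<>())
                        .add(row);
                }
            }
            // keep only participants who completed the full set of trials
            trialsByParticipant.values().removeIf(trials -> trials.size() < EXPECTED_TRIALS);
            System.out.println("usable participants: " + trialsByParticipant.size());
        }
    }

None of those conventions is visible in the spreadsheet itself, which is exactly the point.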

The truth is, even though I designed this study myself and wrote the program that outputs the data, every time I go back to data from a study I haven't worked with in a while, it takes me a few hours to orient myself -- and I'm actually relatively good about documenting my data.

So if a law were passed -- as some have advocated -- requiring that data be made public, one of two things would happen: either people would post uninterpretable data like my mini-spreadsheet above, or they would spend huge amounts of time preparing their data for others' consumption. The former will help no one. The latter is expensive, and someone has to pay for it. And all this has to be balanced against the fact that there are very few data sets anyone would want to reanalyze.

There are important datasets that should be made available. And in fact there are already mechanisms for doing this (in my field, CHILDES is a good example). This kind of sharing should be encouraged, but mandated sharing is likely to cause more problems than it solves.

Apply to Graduate School?

Each year around this time, I try to post more information that would be of use to prospective graduate students, just in case any such are reading this blog (BTW, are there any undergraduates reading this blog? Post in the comments!).

This year, I've been swamped. I've been focusing on getting a few papers published, and most of my time for blogging has gone to the Scientific-American-Mind-article-that-will-not-die, which, should I ever finish it, will probably come out early next year.

Luckily, Female Science Professor has written a comprehensive essay in The Chronicle of Higher Education about one of the most confusing parts of the application process: the pre-application email to a potential advisor.

Everyone tells applicants to send such emails, but nobody gives much information about what should be in them. Find the essay here.

I would add one comment to what she wrote. She points out that you should check the website to see what kind of research the professor does rather than just asking, "Can you tell me more about your research," which comes across as lazy. She also suggests that you should put in your email whether you are interested in a terminal master's. Read the website before you do that, though, since not all programs offer terminal master's (none of the programs I applied to do). Do your homework. Professors are much, much busier than you are; if you demonstrate that you are too lazy to look things up on the Web, why should they spend time answering your email?

---
For past posts on graduate school and applying to graduate school, click here.

Welcome to the LingPipe blog

In anticipation of the 2.2 release of LingPipe, we decided to get with the program and create a blog.

Confidence-Based Gene Mentions for all of MEDLINE

I ran LingPipe’s new confidence-based named-entity extractor over every title and abstract body in MEDLINE. The model is the one distributed on our site built from the NLM GeneTag corpus (a refined version of the first BioCreative corpus) — that’s a compiled . There’s just a single category, .

The 2006 MEDLINE baseline contains 10.2 billion characters in titles and abstracts (with brackets for translations cut out of titles and truncation messages removed from abstracts). I extracted the text using LingPipe’s MEDLINE parser and wrote the output in a gzipped form almost identical to that used in NLM’s GeneTag corpus (also used for BioCreative).

I set the minimum confidence to be 0.001. I set the caches to be 10M entries each, but then capped the JVM memory at 2GB, so the soft references in the cache are getting collected when necessary. I should try it with a smaller cache that won’t get GC-ed and see if the cache is better at managing itself than the GC is.
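For what it’s worth, here is roughly what invoking a confidence-based chunker looks like; this is a hedged sketch rather than the batch code I actually ran (the model file path is whatever compiled GeneTag chunker you have on disk, and I’m treating the chunk score as the confidence estimate):

    import com.aliasi.chunk.Chunk;
    import com.aliasi.chunk.ConfidenceChunker;
    import com.aliasi.util.AbstractExternalizable;
    import java.io.File;
    import java.util.Iterator;

    public class GeneMentions {
        public static void main(String[] args) throws Exception {
            ConfidenceChunker chunker = (ConfidenceChunker)
                AbstractExternalizable.readObject(new File(args[0]));   // compiled GeneTag model
            String text = "We studied the expression of p53 in tumor cells.";
            char[] cs = text.toCharArray();
            Iterator it = chunker.nBestChunks(cs, 0, cs.length, 64);    // up to 64 candidate chunks
            while (it.hasNext()) {
                Chunk chunk = (Chunk) it.next();
                double conf = chunk.score();
                if (conf >= 0.001)                                      // same minimum confidence as above
                    System.out.println(text.substring(chunk.start(), chunk.end())
                                       + "\t" + chunk.type() + "\t" + conf);
            }
        }
    }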

Including the I/O, XML parsing, gzipping and unzipping, and MEDLINE DOM construction, it all ran over all of MEDLINE in just under 9 hours. That’s 330,000 characters/second!!! That’s on a fairly modest machine (1.8GHz dual Opteron, 8GB PC2700 ECC memory, Windows64, 1.5 64-bit JDK in server mode), in a single analysis thread (of course, the 1.5 server JVM uses a separate thread for GC).

All I can say is woot!

Aho-Corasick Wikipedia Entry

I edited my first substantial Wikipedia page today:

Wikipedia: Aho-Corasick Algorithm
There’s been discussion of Wikipedia’s accuracy ever since the Nature article on Wikipedia.

There was even a New Yorker article this week. Of course, The Onion said it best, in their article Wikipedia Celebrates 750 Years Of American Independence.

I wanted to link in some doc for the exact dictionary matching I just built for LingPipe 2.4 ( ), but I couldn’t find a good intro to the Aho-Corasick algorithm online.

The former Wikipedia article was confusing in its description of the data structure and wrong about the complexity bound (read in terms of number of entries versus size of dictionary strings). I restated it in the usual manner (e.g., that used by Dan Gusfield in Algorithms on Strings, Trees and Sequences), and I provided an example like the one I’ve been using for unit testing.

The usual statement of Aho-Corasick is that it’s linear in dictionary size plus input size plus number of outputs, as there may be quadratically many outputs. Thinking a bit more deeply, the output can’t really be quadratic without a quadratically sized dictionary. For instance, with the dictionary {a, aa, aaa, aaaa, aaaaa}, there are quadratically many outputs for the input aaaaa, but the dictionary itself is quadratic in the length of aaaaa (the sum of the first n integers: 5+4+3+2+1). With a fixed dictionary, runtime is always linear in the input, though there may be outputs proportional to the number of dictionary entries for each input symbol.
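Here’s a trivial way to check the counting argument (a naive matcher, not Aho-Corasick): on the input aaaaa with dictionary {a, aa, aaa, aaaa, aaaaa}, the number of matches comes out to 15, exactly the total number of characters in the dictionary strings (5+4+3+2+1).

    public class MatchCount {
        public static void main(String[] args) {
            String[] dict = { "a", "aa", "aaa", "aaaa", "aaaaa" };
            String input = "aaaaa";
            int matches = 0, dictChars = 0;
            for (String entry : dict) {
                dictChars += entry.length();
                for (int i = 0; i + entry.length() <= input.length(); i++)
                    if (input.startsWith(entry, i))
                        matches++;                    // one output per (entry, position) match
            }
            System.out.println("matches=" + matches + ", dictionary chars=" + dictChars);  // 15, 15
        }
    }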

I also restated it in terms of suffix trees rather than finite-state automata, as that matches the usual presentation.

Intro to IR book online

The following book promises to become the book on information retrieval. Although there’s less on index compression than Witten et al.’s Managing Gigabytes, it is much broader and more up to date. Chapters 13-18 aren’t really IR-specific at all, covering topics such as classification, clustering and latent semantic indexing. That means this is a great place for an introduction to the way LingPipe does all these things, as we’ve followed standard practice in all of our models.

Here’s the reference, with a hotlink:

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. (forthcoming) Introduction to Information Retrieval. Cambridge University Press.

From what I’ve read, the treatments of various topics are better thought out and contain much more practical advice than the corresponding sections in Manning and Schütze’s previous book. I don’t know how they’ve broken up the writing, but Prabhakar Raghavan, the third author, not only works at Yahoo! but is the editor-in-chief of the CS journal, the Journal of the ACM.

There’s still plenty of time to send the authors feedback and earn a coveted spot in the acknowledgements of a book destined to be widely read.

Feature Hash Code Collisions in Linear Classifiers

I (Bob) am starting to feel like a participant in an early 20th century epistolary academic exchange (e.g. that between Mr. Russell and Mr. Strawson).

In a comment to John Langford’s response to my blog entry recapitulating his comments after his talk, Kuzman Ganchev points out that he and Mark Dredze did the empirical legwork in their 2008 paper:

To summarize, we’re considering the effect on classification accuracy of using hash codes (modulo some fixed n) of feature representations as dimensional identifiers, rather than requiring a unique identifier for each of m underlying features. The reason this is interesting is that it requires much less memory and far less computational effort to compute features this way. The reason it might be problematic for accuracy is that there’s an increasing likelihood of collisions as the number of parameters n decreases.
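In case the setup isn’t clear, here’s a minimal sketch (mine, not Ganchev and Dredze’s code) of a linear classifier over hashed features: the weight vector has a fixed n slots, and each feature string is mapped to a slot by its hash code modulo n, collisions and all.

    import java.util.Map;

    public class HashedLinearClassifier {
        private final double[] weights;                  // n parameters, fixed up front

        public HashedLinearClassifier(int numParameters) {
            weights = new double[numParameters];
        }

        private int index(String feature) {
            int h = feature.hashCode() % weights.length;
            return h < 0 ? h + weights.length : h;       // hashCode may be negative in Java
        }

        public double score(Map<String, Double> featureCounts) {
            double sum = 0.0;
            for (Map.Entry<String, Double> e : featureCounts.entrySet())
                sum += weights[index(e.getKey())] * e.getValue();
            return sum;
        }

        // perceptron-style update, just to show training touches the same hashed slots
        public void update(Map<String, Double> featureCounts, double error, double learningRate) {
            for (Map.Entry<String, Double> e : featureCounts.entrySet())
                weights[index(e.getKey())] += learningRate * error * e.getValue();
        }
    }

The m underlying feature strings never need to be stored in a symbol table; only the n-slot weight vector persists.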

For instance, character n-gram hash codes may be computed at a cost of only a couple assignments and arithmetic operations per character by the Karp-Rabin algorithm, which only implicitly represent the n-grams themselves. For Latin1 (ISO-8859-1) encoded text, Karp-Rabin can even be computed online with binary input streams ( ) without the expense of decoding Unicode characters. With n-gram-based features, input streams are usually effective with other character encodings. More complex features may be handled by simple modifications of the Karp-Rabin algorithm.
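Here’s a rough sketch of that kind of rolling hash over character 5-grams (illustrative constants and text, not LingPipe’s implementation): each position costs a few arithmetic operations, the n-grams themselves are never materialized, and the resulting hash can index a fixed-size parameter vector as in the sketch above.

    public class RollingNGramHash {
        public static void main(String[] args) {
            String text = "feature hashing without materializing n-grams";
            int n = 5;
            long base = 257L, mod = (1L << 31) - 1;       // illustrative constants
            long highPower = 1;                           // base^(n-1) mod mod
            for (int i = 0; i < n - 1; i++)
                highPower = (highPower * base) % mod;

            long h = 0;
            for (int i = 0; i < text.length(); i++) {
                if (i >= n)                               // drop the character leaving the window
                    h = ((h - text.charAt(i - n) * highPower % mod) % mod + mod) % mod;
                h = (h * base + text.charAt(i)) % mod;    // shift in the new character
                if (i >= n - 1)                           // h now hashes the n-gram ending at i
                    System.out.println("n-gram ending at " + i + " -> hash " + h);
            }
        }
    }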

Ganchev and Dredze showed that for many NLP problems (e.g. spam filtering, 20 newsgroups, Reuters topics, appliance reviews), there is very little loss from drastic reductions in n relative to m. This is very good news indeed.

Getting back to John Langford’s last post, I would like to answer my own question about how having multiple hash codes per feature helps maintain discriminative power in the face of collisions. It may seem counterintuitive, as having more features (2 * m) for the same number of parameters (n) seems like it will simply produce more collisions.

Let’s consider a very simple case where there is a feature f which is split into two features f0 and f1 without collision. With maximum likelihood, if the original weight for f is β, then the weights for f0 and f1 will each be β/2 (or any linear interpolation). With Laplace priors (L1 regularization), the behavior is the same because the penalty is unchanged: abs(β) = abs(β/2) + abs(β/2). But with Gaussian priors (L2 regularization), the penalty is no longer equivalent, because for β != 0, β² > (β/2)² + (β/2)²; for example, with β = 1, the L2 penalty drops from 1 to 1/2 when the weight is split.

Setting aside regularization, let’s work through an example with two topics with the following generative model (adapted from Griffiths’ and Steyvers’ LDA paper):

So topic 0 is about geography and topic 1 about finance. In topic 0, there is a 50% chance of generating the word "river", a 50% chance of generating the word "bank", and no chance of generating the word "loan". It is easy to identify a set of regression parameters that has perfect classification behavior: β = (1,0,-1) [the maximum likelihood solution will not actually be identified; with priors, the scale or variance parameter of the prior determines the scale of the coefficients].

Now what happens when we blow out each feature to two features and allow collisions? The generative model is the same, but each feature is replicated. If there is a collision between the first code for "river" and the first code for "loan", the resulting coefficients look like:

The resulting coefficients again produce a perfect classifier. The collision at code 0 is simply a non-discriminative feature and the split versions pick up the slack.
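A toy check (my own numbers, standing in for the coefficient table that belongs here) makes the point concrete. With features (river, bank, loan) and original weights β = (1, 0, -1), suppose every word fires two hash codes and the first codes for "river" and "loan" collide at slot 0. Giving slot 0 weight 0 and the second codes for "river" and "loan" weights +1 and -1 reproduces the original scores exactly:

    import java.util.Map;

    public class CollisionToy {
        // slot 0: collided code shared by "river" and "loan"; slot 1: both of "bank"'s codes;
        // slot 2: "river"'s second code; slot 3: "loan"'s second code
        static final double[] WEIGHTS = { 0.0, 0.0, 1.0, -1.0 };

        static double score(Map<String, Integer> wordCounts) {
            double sum = 0.0;
            for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
                int count = e.getValue();
                switch (e.getKey()) {
                    case "river": sum += count * (WEIGHTS[0] + WEIGHTS[2]); break;
                    case "loan":  sum += count * (WEIGHTS[0] + WEIGHTS[3]); break;
                    case "bank":  sum += count * (WEIGHTS[1] + WEIGHTS[1]); break;
                    default: break;   // other words carry no weight in this toy
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            System.out.println(score(Map.of("river", 3, "bank", 2)));   // positive: topic 0
            System.out.println(score(Map.of("loan", 3, "bank", 2)));    // negative: topic 1
        }
    }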

If the collision is between a discriminative and non-discriminative feature code, there is still a perfect set of coefficients:

Of course, real problems aren’t quite as neat as the examples, and as we pointed out, regularization is non-linear except for maximum likelihood and Laplace priors.

In the real world, we typically find "new" features at runtime (e.g. a character 8-gram that was never seen during training). There is a very long tail for most linguistic processes. Luckily, there is also a lot of redundancy in most classification problems.

This problem doesn’t occur in the hermetically sealed world of a fixed training and test set with feature pruning (as in the ECML KDD 2006 Spam Detection Challenge, which distributed data as bags of words with words occurring only if they occurred 4 or more times in the training data).

Scientific Innovator’s Dilemma

After some e-mail exchange with Mark Johnson about how to stimulate some far-out research that might be fun to read about, I was sitting at the dinner table with Alice Mitzi, ranting about the sociology of science.

My particular beef is low acceptance rates and the conservative nature of tenure committees, program committees, and grant review panels. It makes it hard to get off the ground with a new idea, while making it far too easy to provide a minor, often useless, improvement on something already well known. Part of the problem is that the known is just a lot easier to recognize and review. I don’t spend days on reviews like I did as a grad student — if the writer can’t explain the main idea in the abstract/intro, into the reject pile it goes without my trying to work through all the math.

Mitzi listened patiently and after I eventually tailed off, said “Isn’t that just like the innovator’s dilemma, only for science?”. Hmm, I thought, “hmm”, I mumbled, then my brain caught up and I finally let out an “a-ha”. Then I said, “I should blog about this!”.

I learned about the problem in the title of one of the best business books I’ve ever read, The Innovator’s Dilemma, by Clayton M. Christensen. It’s full of case studies about why players with the dominant positions in their industries fail. You can read the first chapter and disk drives case study online, or cut to the drier Wikipedia presentation of disruptive technology.

The basic dilemma is that an existing business generates so much revenue, at such a high margin, that any new activity not directly related to this existing business can’t be justified. My favorite case study is of earth-movers. Back in the day (but not too far back) we had steam shovels that used cables to move their enormous shovels. They were big, and they moved lots of earth. If you needed to do strip mining, foundation digging for skyscrapers, or needed to lay out a city’s road system, these were just what you wanted. The more dirt they moved the better. So along comes the gasoline-powered engine. The steam shovel companies looked at the new technology and quickly adopted it; swapping out steam for gasoline meant you could move more dirt with more or less the same set of cables. It’s what we in the software business call a “no brainer”.

A few years later, an enterprising inventor figured out how to replace the cable actuators with hydraulics. When first introduced, hydraulics were relatively weak compared to cables, so you couldn’t build a shovel big enough to compete with a cable-actuated gasoline-powered shovel. The big shovel companies looked at hydraulics, but couldn’t figure out how to make money with them. The first hydraulic shovels were tiny, being good only for jobs like digging the foundation for a house or digging a trench from a house to sewer mains. Even more importantly, there was no existing market for small earth movers compared to the much more lucrative market for big earth movers, and even if you could capture all the little stuff, it still wouldn’t affect the big company’s bottom line.

So new companies sprang up in a new market to sell hydraulic shovels that could fit on a small truck. As hydraulic technology continued to improve in strength, more and more markets opened up that took slightly more power. Even so, nothing that’d make a dent in the bottom line of the big cable-actuated shovel companies.

Eventually, hydraulics got powerful enough that they could compete with cable-actuated shovels. At this point, the cable-actuated shovel companies mostly went out of business. Up until just before the capabilities crossed, it still didn’t make sense in terms of the big companies’ bottom lines to move to hydraulics. There just wasn’t enough income in it. Until it was too late.

Christensen’s book is loaded with case studies, and it’s easy to think of more once you have the pattern down. The business types prefer generic, unscaled graphs like this one to illustrate what’s going on:

How disruptive technology gains a hold over time, by gradually moving into more lucrative markets (source: Wikipedia)

Smaller disks disrupted larger disks for the same reason; sure, they could fit into a minicomputer (or microcomputer), but they cost a lot per byte. At every stage of disk diameter downsizing, the dominant players mostly went bankrupt or left the business in the face of the up-and-coming smaller disk manufacturers, who always managed to overtake their big-disk competitors in terms of capacity, price (and reliability, if I recall correctly). You’d think the big companies would have learned their lesson after the third iteration, but that just shows how strong a problem the innovator’s dilemma remains.

In computational linguistics and machine learning research, the big company on top is whatever technique has the best performance on some task. I thought I’d never see the end of minor variants on three-state acoustic HMMs for context-dependent triphones in speech recognition when we knew they could never sort ‘d’ from ‘t’ (here’s some disruptive “landmark-based” speech recognition). Disruptive technologies might not have state of the art performance or might not scale, but they should have some redeeming features. One could view statistical NLP as being disruptive itself; in the beginning, it only did sequences with frequency-based estimators. But remember, just because a technique performs below the best published results doesn’t make it disruptive.

The remaining dilemma is that none of the follow-on books by Christensen or others provide a good read, much less a solution to the innovator’s dilemma.

Artists Ship, or the Best is the Enemy of the Good

Artists Ship

I try not to just link to other articles in this blog, but I was extremely taken with Paul Graham’s blog post The other half of “artists ship”, because it rang so true to my experience as both a scientist and a software engineer. The expression is attributed to Steve Jobs in Steve Levy’s book on the Mac, Insanely Great, which I’ve cut and pasted from Alex Golub’s blog:

… REAL ARTISTS SHIP … One’s creation, quite simply, did not exist as art if it was not out there, available for consumption, doing well. Was [Douglas] Engelbart an artist? A prima donna — he didn’t ship. What were the wizards of PARC? Haughty aristocrats — they didn’t ship. The final step of an artist — the single validating act — was getting his or her work into boxes … to make a difference in the world and a dent in the universe, you had to ship.

A while back, I commented on the artist-programmer connection in the blog post Industrialist or Auteur?, which pinned the blame on “obsessive pedantry”.

The Best is the Enemy of the Good

Two centuries and a decade earlier, in 1772, Voltaire got to the bottom of why many of us have trouble finishing projects, writing “Le mieux est l’ennemi du bien” (roughly, “the best [better/perfect] is the enemy of the good”).

I remember seeing this first in the introduction to Barwise and Perry’s book Situations and Attitudes, where they thank their editor for reminding them that among the good qualities a book may possess, existence is a quite important one. Barwise and Perry copped to falling prey to the temptation to keep tweaking something to make it better while never delivering anything. That’s one reason why deadlines help, be they real (this morning’s NAACL/HLT submission deadline) or imaginary (if I don’t get this blog entry done today, no supper).

Build and Release

If you had to wait for my ultimate NLP API, LingPipe wouldn’t exist. Almost every method in every class could use improvement in everything from unit tests to documentation to efficiency. I have reasonably high standards (as the fans of extreme programming like to say, there are only two quality settings of interest to programmers, great and lives-depend-on-it), but I’m a pushover compared to what even a small (200-400 person, 30 or so of whom were core product coders) company like SpeechWorks required in the way of QA. At SpeechWorks, I had proposed an API for language segmentation (breaking a document down into spans by language, primarily for European text-to-speech from e-mail apps), which was informed by a long (and good) marketing document and read by no fewer than a dozen commentators, who also read later drafts. The code was written and reviewed by two of us, and we interfaced with the testing group who had to put the code through functional tests on a host of platforms. And then there was release engineering. Oh, and did I mention the documentation department? My first pass at API design was a little too Ph.D.-centric (no, my commentators said, the users wouldn’t want to tune interpolation parameters for character language models at run time — just guess a good value for them, please); LingPipe is what happens when there’s no one from marketing commenting on the API!

If you increase your team size to something like Microsoft’s (see How MS Builds Software), you won’t even get through the hierarchical build process in the month it took SpeechWorks to roll out software, and then you’ll be waiting on the internationalization team to translate your deathless dialog box prose into dozens of languages. Perhaps that’s why Mark Chu-Carroll finds enough to hold his interest working on Google’s builds!

A Dissertation’s Just Practice

The worst case of the not-good-enough problem I’ve seen is in academia. Students somehow try to pack as much as they can into a thesis. I sure did. I had seven chapters outlined, and after five, my advisor (Ewan Klein) told me to stop, I had enough. Of course, some advisors never think their students have done enough work — that’s just the same problem from management’s perspective. My own advice to students was to save their life’s work for the rest of their life — a dissertation’s just practice. As evidence of that, they’re graded mostly on form, not content.

Revise and Resubmit

The Computational Linguistics journal editorial board faces the same problem. Robert Dale (the editor) found that most authors who were asked to revise and resubmit their paper (that is, not rejected or accepted outright) never got around to it. Robert tracked some of the authors down, and they said they simply didn’t have enough time to run all the extra experiments and analyses proposed by reviewers. Robert asked us to rethink the way we came to conclusions, and instead of asking “could this be better?” to ask “is it good enough to be interesting?”. I couldn’t agree more.

Provost, Fawcett & Kohavi (1998) The Case Against Accuracy Estimation for Comparing Induction Algorithms

I couldn’t agree more with the first conclusion of this paper:

First, the justifications for using accuracy to compare classifiers are questionable at best.

In fact, I’d extend it to micro-averaged and macro-averaged F-measures, AUC, BEP, etc. Foster and crew’s argument is simple. They evaluate naive Bayes, decision trees, boosted decision trees, and k-nearest neighbor algorithms on a handful of UCI machine learning repository problems. They show that there aren’t what they call dominating ROC curves for any of the classifiers on any of the problems. For example, here’s their figure 1 (they later apply smoothing to better estimate ROC curves):

The upshot is that depending on whether you need high recall or high precision, the “best” classifier is different. As I’ve said before, it’s horses for courses.

To be a little more specific, they plot receiver operating characteristic (ROC) curves for the classifiers, which show (1-specificity) versus sensitivity.

  • sensitivity = truePos / (truePos + falseNeg)
  • specificity = trueNeg / (trueNeg + falsePos)
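Concretely, here’s a small sketch (mine, not LingPipe’s evaluator) of how those two quantities trace out an ROC curve: sort the examples by classifier score, sweep the threshold down, and emit one (1-specificity, sensitivity) point per example. The scores and labels are made up for illustration.

    import java.util.*;

    public class RocSketch {
        public static void main(String[] args) {
            double[] scores = { 0.95, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2 };            // made-up classifier scores
            boolean[] isPositive = { true, true, false, true, false, false, false };

            Integer[] order = new Integer[scores.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));   // descending by score

            int totalPos = 0, totalNeg = 0;
            for (boolean p : isPositive) { if (p) totalPos++; else totalNeg++; }

            int truePos = 0, falsePos = 0;
            for (int idx : order) {
                if (isPositive[idx]) truePos++; else falsePos++;
                double sensitivity = truePos / (double) totalPos;                 // truePos / (truePos + falseNeg)
                double oneMinusSpecificity = falsePos / (double) totalNeg;        // falsePos / (trueNeg + falsePos)
                System.out.printf("(%.2f, %.2f)%n", oneMinusSpecificity, sensitivity);
            }
        }
    }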

In LingPipe, any ranked classifier can be evaluated for an ROC curve using the method:

It’d be nice to see this work extended to today’s most popular classifiers: SVMs and logistic regression.

The Long Road to CRFs

CRFs are Done

The first bit of good news is that LingPipe 3.9 is within days of release. CRFs are coded, documented, unit tested, and I’ve even written a long-ish tutorial with hello-world examples for tagging and chunking, and a longer example of chunking with complex features evaluated over:

And They’re Speedy

The second bit of good news is that it looks like we have near state-of-the-art performance in terms of speed. It’s always hard to compare systems without exactly recreating the feature extractors, requirements for convergence, hardware setup and load, and so on. I was looking at

for comparison. Okazaki also evaluated first-order chain CRFs, though on the CoNLL 2000 English phrase chunking data, which has fewer tags than the CoNLL 2003 English named entity data.

While my estimator may be a tad slower (it took about 10s/epoch during stochastic gradient), I’m applying priors, and I think the tag set is a bit bigger. (Of course, if you use IO encoding rather than BIO encoding, like the Stanford named entity effort, then there’d be even fewer tags; on the other hand, if I followed Turian et al. (ref below), or the way we handle HMM encoding, there’d be more.)

It also looks like our run time is faster than the other systems benchmarked if you don’t consider feature extraction time (and I don’t think they did in the eval, but I may be wrong). It’s running at 70K tokens/second for full forward-backward decoding; first-best would be faster.

All the Interfaces, Please

Like for HMMs, I implemented first-best, n-best with conditional probabilities, and a full forward-backward confidence evaluation. For taggers, confidence is per tag per token; for chunkers, it’s per chunk.

Final Improvements

Yesterday, I was despairing a bit over how slow my approach was. Then I looked at my code, instrumented the time spent in each component, and had my D’oh! moment(s).

Blocked Prior Updates

The first problem was that I was doing dense, stochastic prior updates. That is, for every instance, I walked over the entire set of dense coefficient vectors, calculated the gradient, and applied it. This was dominating estimation time.

So I applied a blocking strategy whereby the prior gradient is only applied every so often (say, every 100 instances). This is the strategy discussed in Langford, Li and Zhang’s truncated gradient paper.

I can vouch for the fact that result vectors didn’t change much, and speed was hugely improved to the point where the priors weren’t taking much of the estimation time at all.
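Schematically, the update loop now looks something like this (a sketch with made-up variable names, not LingPipe’s estimator; the per-instance likelihood gradient is assumed to be computed elsewhere):

    public class BlockedPriorSgd {
        static final int BLOCK = 100;   // apply the prior only every 100 instances

        static void epoch(double[] weights, double[][] featureVectors, double[] instanceErrors,
                          double learningRate, double priorVariance, int numTrainingInstances) {
            for (int i = 0; i < featureVectors.length; i++) {
                // sparse likelihood update: only touches features active in this instance
                for (int j = 0; j < weights.length; j++)
                    if (featureVectors[i][j] != 0.0)
                        weights[j] += learningRate * instanceErrors[i] * featureVectors[i][j];

                // dense Gaussian-prior update, applied once per block and scaled to
                // cover the BLOCK instances it stands in for, divided by corpus size
                if ((i + 1) % BLOCK == 0) {
                    double scale = learningRate * BLOCK / (double) numTrainingInstances;
                    for (int j = 0; j < weights.length; j++)
                        weights[j] -= scale * weights[j] / priorVariance;
                }
            }
        }
    }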

Caching Features

The other shortcoming of my initial implementation was that I was extracting features online rather than extracting them all into a cache. After cleaning up the prior, feature extraction was taking orders of magnitude longer than everything else. So I built a cache, and added yet another parameter to control whether to use it or not. With the cache and rich feature extractors, the estimator needs 2GB; with online feature extraction, it’s about 20 times slower, but only requires around 300MB of memory or less.

Bug Fixes

There were several subtle and not-so-subtle bugs that needed to be fixed along the way.

One problem was that I forgot to scale the priors based on the number of training instances during estimation. This led to huge over-regularization. When I fixed this problem, the results started looking way better.

Structural Zeros

Another bug-like problem I had is that when decoding CRFs for chunkers, I needed to rule out certain illegal tag sequences. The codec I abstracted to handle the encoding of chunkers and taggers and subsequent decoding could compute legal tag sequences and consistency with tokenizers, but the CRF itself couldn’t. So I was getting illegal tag sequences out that caused the codec to crash.

So I built in structural zeros. The simplest way to do it seemed to be to add a flag that only allowed tag transitions seen in the training data. This way, I could enforce legal start tags, legal end tags, and legal transitions. This had the nice side benefit of speeding things up, because I could skip calculations for structural zeros. (This is one of the reasons Thorsten Brants’ TnT is so fast — it also applies this strategy to tags, only allowing tags seen in training data for given tokens.)
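The transition table itself is nothing fancy; a sketch of the bookkeeping (my own simplification, not the actual codec code) looks like this:

    import java.util.*;

    public class AllowedTransitions {
        private final Map<String, Set<String>> followers = new HashMap<>();

        // record every tag bigram seen in a training sequence
        public void observe(List<String> trainingTags) {
            for (int i = 1; i < trainingTags.size(); i++)
                followers
                    .computeIfAbsent(trainingTags.get(i - 1), k -> new HashSet<>())
                    .add(trainingTags.get(i));
        }

        // the decoder skips (treats as a structural zero) any transition never seen in training
        public boolean allowed(String previousTag, String nextTag) {
            Set<String> seen = followers.get(previousTag);
            return seen != null && seen.contains(nextTag);
        }
    }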

Feature Extraction Encapsulation

I was almost ready to go a couple of days ago. But then I tried to build a richer feature extractor for the CoNLL entity data that used part-of-speech tags, token shape features, contextual features, prefixes and suffixes, etc. Basically the “baseline” features suggested in Turian, Ratinov, Bengio and Roth’s survey of cluster features (more to come on that paper).

It turns out that the basic node and edge feature extractors, as I proposed almost six months ago, weren’t quite up to the job.

So I raised the abstraction level so that the features for a whole input were encapsulated in a features object that was called lazily by the decoders and/or estimator. This allowed things like part-of-speech taggings to be computed once and then cached.

This will also allow online document features (like previous tagging decisions) to be rolled into the feature extractor, though it’ll take some work.
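In outline, the features object looks something like this (a sketch of the idea, not the released API):

    import java.util.List;
    import java.util.Map;

    public abstract class ChainFeatures {
        private final List<String> tokens;
        private List<String> posTags;   // computed lazily, shared by every position

        protected ChainFeatures(List<String> tokens) {
            this.tokens = tokens;
        }

        protected List<String> posTags() {
            if (posTags == null)
                posTags = tagPartsOfSpeech(tokens);   // runs at most once per input
            return posTags;
        }

        // node features for position n, called lazily by the estimator or decoder
        public abstract Map<String, Double> nodeFeatures(int n);

        // edge features for position n given the previous tag
        public abstract Map<String, Double> edgeFeatures(int n, String previousTag);

        protected abstract List<String> tagPartsOfSpeech(List<String> tokens);
    }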

And a Whole Lotta’ Interfaces and Retrofitting

I added a whole new package, , to characterize the output of a first-best, n-best, and marginal tag probability tagger. I then implemented these with CRFs and retrofitted them for HMMs. I also pulled out the evaluator and generalized it.

Along the way, I deprecated a few interfaces, like , which is no longer needed given .

Still No Templated Feature Extraction

Looking at other CRF implementations, and talking to others who’d used them, I see that designing a language to specify feature extractions is popular. Like other decisions in LingPipe, I’ve stuck to code-based solutions. The problem with this is that it limits our users to Java developers.

Search-based Structured Prediction

I normally would not make a post such as this one (my blog is not my advertisementsphere), but given that it is unlikely this paper will appear in a conference in the near future (and conferences are just ads), I decided to include a link. John, Daniel and I have been working on an algorithm called Searn for solving structured prediction problems. I believe that it will be useful to NLP people, so I hope this post deserves the small space it takes up.

Approximating the Problem or the Solution

A while back I came across a paper that (in a completely separate context) argues for approximating problems in lieu of approximating solutions. This idea has a nice analogue in NLP: should we (A) choose a simple model for which we can do exact inference or (B) choose a complex model that is closer to the truth for which exact inference is not tractable. (A) is approximating the problem while (B) is approximating the solution.

It seems that all signs point to (B). In almost every interesting case I know of, it helps (or at the very least doesn't hurt) to move to more complex models that are more expressive, even if this renders learning or search intractable. This story is well known in word alignment (eg, GIZA) and MT (eg, model 4 decoding), but also has simpler examples in parsing (cf, McDonald), sequence labeling (cf, Sutton), relation extraction (cf, Culotta), as well as pretty much any area in which "joint inference" has been shown to be helpful.

One sobering example here is the story in word alignment, where one cannot go and directly use, say, model 4 for computing alignments, but must first follow a strict recipe: run a few iterations of model 1, followed by a few of model 2, followed by some HMM, then model 4 (skipping model 3 altogether). The problem here is that learning model 4 parameters directly falls into local minima too easily, so one must initialize intelligently, using the outputs of previous iterations. My guess is that this result will continue to hold for training (though perhaps not predicting) with more and more complex models. This is unfortunate, and there may be ways of coming up with learning algorithms that automatically initialize themselves by some mechanism for simplifying their own structure (seems like a fun open question, somewhat related to recent work by Smith).

Aside from a strong suggestion as to how to design models and inference procedure (i.e., ignore tractability in favor of expressiveness), there may be something interesting to say here about human language processing.  If it is indeed true that, for the most part, we can computationally move to more complex models, forgoing tractable search, then it is not implausible to imagine that perhaps humans do the same thing.  My knowledge in this area is sparse, but my general understanding is that various models of human language processing are disfavored because they would be too computationally difficult.  But if, as in old-school AI, we believe that humans just have a really good innate search algorithm, then this observation might lead us to believe that we have, ourselves, very complex, intractable "models" in our heads.

Reproducible Results

In an ideal world, it would be possible to read a paper, go out and implement the proposed algorithm, and obtain the same results. In the real world, this isn't possible. For one, if by "paper" we mean "conference paper," there's often just not enough space to spell out all the details. Even how you do tokenization can make a big difference! It seems reasonable that there should be sufficient detail in a journal paper to achieve essentially the same results, since there's (at least officially) not a space issue. On the other hand, no one really publishes in journals in our subfamily of CS.

The next thing one can do is to release the software associated with a paper. I've tried to do this in a handful of cases, but it can be a non-trivial exercise. There are a few problems. First, there's the question of how polished the software you put out should be. Probably my most polished is megam (for learning classifiers) and the least polished is DPsearch (code from my AI stats paper). It was a very nontrivial amount of effort to write up all the docs for megam and so on. As a result, I hope that people can use it. I have less hope for DPsearch --- you'd really have to know what you're doing to rip the guts out of it.

Nevertheless, I have occasionally received copies of code like my DPsearch from other people (i.e., unpolished code) and have still been able to use them successfully, albeit only for ML stuff, not for NLP stuff. ML stuff is nice because, for the most part, it's self-contained. NLP stuff often isn't: first you run a parser, then you have to have wordnet installed, then you have to have 100MB of data files, then you have to run scripts X, Y and Z before you can finally run the program. The work I did for my thesis is a perfect example of this: instead of building all the important features into the main body of code I wrote, about half of them were implemented as Perl scripts that would essentially add "columns" to a CoNLL-style input format. At the end, the input was like 25-30 columns wide, and if any were missing or out of order, bad things would happen. As a result, it's a completely nontrivial exercise for me to release this beast. The only real conceivable option would be to remove the non-important scripts, get the important ones back into the real code, and then release that. But then there's no way the results would match exactly those from the paper/thesis.

I don't know of a solution to this problem. I suppose it depends on what your goal is. One goal is just to figure out some implementation details so that you can use them yourself. For this, it would be perfectly acceptable in, say, my thesis situation, to just put up the code (perhaps the scripts too) and leave it at that. There would be an implicit contract that you couldn't really expect too much from it (i.e., you shouldn't expect to run it).

A second goal is to use someone else's code as a baseline system to compare against. This goal is lessened when common data is available, because you can compare to published results. But often you don't care about the common data and really want to see how it works on other data. Or you want to qualitatively compare your output to a baseline. This seems harder to deal with. If code goes up to solve this problem, it needs to be runnable. And it needs to achieve pretty much the same results as published, otherwise funny things happen ("so and so reported scores of X but we were only able to achieve Y using their code", where Y is noticeably lower than X).

ICML/UAI/COLT Workshops Posted

See here for the current list. They include: Nonparametric Bayes (woohoo!), machine learning and music, Bayesian modeling applications, prior knowledge for text and language processing, sparse optimization and variable selection, as well as stand-alone workshops on the reinforcement learning competition and mining and learning with graphs.

Because I'm one of the organizers, I'd like to call attention to the Prior knowledge for text and language processing workshop. We'd definitely like submissions on any of the following topics:

  • Prior knowledge for language modeling, parsing, translation
  • Topic modeling for document analysis and retrieval
  • Parametric and non-parametric Bayesian models in NLP
  • Graphical models embodying structural knowledge of texts
  • Complex features/kernels that incorporate linguistic knowledge; kernels built from generative models
  • Limitations of purely data-driven learning techniques for text and language applications; performance gains due to incorporation of prior knowledge
  • Typology of different forms of prior knowledge for NLP (knowledge embodied in generative Bayesian models, in MDL models, in ILP/logical models, in linguistic features, in representational frameworks, in grammatical rules…)
  • Formal principles for combining rule-based and data-based approaches to NLP
  • Linguistic science and cognitive models as sources of prior knowledge

Yes, I know that's a shameless plug, but do you really expect better from me?!

More complaining about automatic evaluation

I remember that, a few years ago, complaining about automatic evaluation at conferences was the thing to do. (Ironically, so was writing papers about automatic evaluation!) Things are saner now on both sides. While what I'm writing here is interpretable as a gripe, it's really intended as a "did anyone else notice this," because it's somewhat subtle.

The evaluation metric I care about is Rouge, designed for summarization. The primary difference between Rouge and Bleu is that Rouge is recall-oriented while Bleu is precision-oriented. The way Rouge works is as follows. Pick an ngram size. Get a single system summary H and a single reference summary R (we'll get to multiple references shortly). Let |H| denote the size of the bag defined by H and let |H^R| denote the size of the bag intersection. Namely, the number of times some ngram is allowed to appear in H^R is the min of the number of times it appears in H and R. Take this number and divide by |R|. This is the ngram recall for our system on this one example.

To extend this to more than one summary, we simply average the Rouge scores over the individual summaries.

Now, suppose we have multiple references, R_1, R_2, ..., R_K. In the original Rouge papers and implementation, we compute the score for a single sentence as the max over the references of the Rouge on that individual reference. In other words, our score is the score against a single reference, where that reference is chosen optimistically.

In later Rouge papers and implementations, this changed. In the single-reference case, our score was |H^R|/|R|. In the multiple reference setting, it is |H^(R_1 + R_2 + ... + R_K)|/|R_1 + R_2 + ... + R_K|, where + denotes bag union. Apparently this makes the evaluation more stable.

(As an aside, there is no notion of a "too long" penalty because all system output is capped at some fixed length, eg., 100 words.)
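To make the two variants concrete, here's a small sketch of the clipped-count recall described above, in both the single-reference and multiple-reference forms. This is not the official Rouge script (no stemming, stopword handling, or jackknifing), the toy inputs are made up, and I'm reading the bag union additively (summed counts), since that's the reading under which the argument about redundancy later in this post goes through:

```python
from collections import Counter

def ngrams(text, n=2):
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_recall(system, reference, n=2):
    """Single reference: |H ^ R| / |R|, with the clipped (bag) intersection."""
    H, R = ngrams(system, n), ngrams(reference, n)
    overlap = sum(min(H[g], c) for g, c in R.items())
    return overlap / max(1, sum(R.values()))

def rouge_recall_multi(system, references, n=2, style="union"):
    """Multiple references: either the max over single-reference scores (the
    early formulation) or the bag union of all references (the later one);
    "union" here adds counts across references."""
    if style == "max":
        return max(rouge_recall(system, r, n) for r in references)
    H, U = ngrams(system, n), Counter()
    for r in references:
        U += ngrams(r, n)                     # R_1 + R_2 + ... + R_K
    overlap = sum(min(H[g], c) for g, c in U.items())
    return overlap / max(1, sum(U.values()))

if __name__ == "__main__":
    sys_sum = "the cat sat on the mat"
    refs = ["the cat sat on a mat", "a cat was sitting on the mat"]
    print(rouge_recall_multi(sys_sum, refs, n=2, style="max"))
    print(rouge_recall_multi(sys_sum, refs, n=2, style="union"))
```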

Enough about how Rouge works. Let's talk about how my DUC summarization system worked back in 2006. First, we run BayeSum to get a score for each sentence. Then, based on the score and about 10 other features, we perform sentence extraction, optimized against Rouge. Many of these features are simple patterns; the most interesting (for this post) is my "MMR-like" feature.

MMR (Maximal Marginal Relevance) is a now-standard technique in summarization that allows your sentence extractor to avoid picking sentences that are wholly redundant. The way it works is as follows. We score each sentence. We pick as our first sentence the one with the highest score. We then rescore each remaining sentence as a weighted linear combination of its original score and (minus) its similarity to the sentences already selected. Essentially, we want to punish redundancy, weighted by some parameter a.
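Roughly, a greedy MMR-style loop looks like the sketch below; the toy "sentences," the length-based scores and the Jaccard similarity are placeholders, not the actual DUC system, and the exact form of the linear combination may differ from what I used:

```python
def mmr_extract(sentences, base_score, similarity, a=0.5, budget=3):
    """Greedy MMR-style extraction: repeatedly pick the sentence maximizing
    (base score) - a * (similarity to what's already been picked).
    A negative a rewards redundancy instead of punishing it."""
    selected, remaining = [], list(sentences)
    while remaining and len(selected) < budget:
        def rescored(s):
            if not selected:
                return base_score(s)
            return base_score(s) - a * max(similarity(s, t) for t in selected)
        best = max(remaining, key=rescored)
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    # Toy "sentences" as bags of words; score = number of distinct words, sim = Jaccard.
    sents = ["a b c d", "a b c e", "x y z", "a b"]
    score = lambda s: len(set(s.split()))
    jac = lambda s, t: len(set(s.split()) & set(t.split())) / len(set(s.split()) | set(t.split()))
    print(mmr_extract(sents, score, jac, a=2.0, budget=2))    # punishes redundancy
    print(mmr_extract(sents, score, jac, a=-2.0, budget=2))   # favors redundancy
```

With a sufficiently positive a the second pick is the diverse sentence; with a negative a it happily picks the near-duplicate.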

This parameter is something that I tune in max-Rouge training. What I found was that, at the end of the day, the value of a found by the system is always negative, which means that instead of disfavoring redundancy, we're actually favoring it. I always took this as a sign that human summaries really aren't that diverse.

The take-home message is that if you can opportunistically pick one good sentence to go in your summary, the remaining sentences you choose should be as similar to that one as possible. It's sort of an exploitation (not exploration) issue.

The problem is that I don't think this is true. I think it's an artifact, and probably a pretty bad one, of the "new" version of Rouge with multiple references. In particular, suppose I opportunistically choose one good sentence. It will match a bunch of ngrams in, say, reference 1. Now, suppose as my second sentence I choose something that is actually diverse. Sure, maybe it matches something diverse in one of the references. But maybe not. Suppose instead that I pick (roughly) the same sentence that I chose for sentence 1. It won't re-match against ngrams from reference 1, but if it's really an important sentence, it will match the equivalent sentence in reference 2. And so on.

So this is all nice, but does it happen? It seems so. Below, I've taken all of the systems from DUC 2006 and plotted (on X) their human-graded Non-Redundancy scores (higher means less redundant) against (on Y) their Rouge-2 scores.



Here, we clearly see (though there aren't even many data points) that high non-redundancy means low Rouge-2. Below is Rouge-SU4, which is another version of the metric:

Again, we see the same trend. If you want high Rouge scores, you had better be redundant.

The point here is not to gripe about the metric, but to point out something that people may not be aware of. I certainly wasn't until I actually started looking at what my system was learning. Perhaps this is something that deserves some attention.

Parallel Sampling

I've been thinking a lot recently about how to do MCMC on massively parallel architectures, for instance in a (massively) multi-core setup (either with or without shared memory).

There are several ways to approach this problem.

The first is the "brain dead" approach. If you have N-many cores, just run N-many parallel (independent) samplers. Done. The problem here is that if N is really like (say 1000 or greater), then this is probably a complete waste of space/time.

The next approach works if you're doing something like (uncollapsed) Gibbs sampling. Here, the Markov blankets usually separate in a fairly reasonable way, so you can literally distribute the work precisely as specified by the Markov blankets. With a bit of engineering, you can probably do this in a pretty effective manner. The problem, of course, is if you have strongly overlapping Markov blankets (i.e., if you don't have good separation in the network). This can happen either due to model structure, or due to collapsing certain variables. In this case, this approach just doesn't work at all.
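As an illustration of this second approach (and only that; it's a toy model, and for something this small the process overhead swamps any gain), here's a sketch where the latent variables x_i are conditionally independent given a shared mean mu, so the x-updates within a single Gibbs sweep can be farmed out to a process pool while the mu-update stays serial:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Toy hierarchical model:  mu ~ N(0, 10),  x_i | mu ~ N(mu, 1),  y_i | x_i ~ N(x_i, 1).
# Given mu, the x_i have disjoint Markov blankets, so their Gibbs updates can run in parallel.

def sample_x(args):
    mu, y_i, seed = args
    rng = np.random.default_rng(seed)
    post_var = 1.0 / (1.0 + 1.0)              # precisions from the N(mu, 1) prior and N(x_i, 1) likelihood
    post_mean = post_var * (mu + y_i)
    return rng.normal(post_mean, np.sqrt(post_var))

def gibbs(y, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, mu = len(y), 0.0
    with ProcessPoolExecutor() as pool:
        for _ in range(iters):
            # Parallel block: each x_i update only needs (mu, y_i).
            seeds = rng.integers(0, 2**31, size=n)
            x = np.array(list(pool.map(sample_x,
                                       [(mu, float(y[i]), int(seeds[i])) for i in range(n)])))
            # Serial step: mu's Markov blanket touches every x_i.
            post_prec = 1.0 / 10.0 + n
            mu = rng.normal(x.sum() / post_prec, np.sqrt(1.0 / post_prec))
    return mu, x

if __name__ == "__main__":
    y = np.random.default_rng(1).normal(3.0, 1.0, size=20)
    mu, x = gibbs(y)
    print("a posterior draw of mu:", mu)
```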

The third approach---and the only one that really seems plausible---would be to construct sampling schemes exclusively for a massively parallel architecture. For instance, if you can divide your variables in some reasonable way, you can probably run semi-independent samplers that communicate on an as-needed basis. The form of this communication might, for instance, look something like an MH-step, or perhaps something more complex.

At any rate, I've done a bit of a literature survey to find examples of systems that work like this, but have turned up surprisingly little. I can't imagine that there's that little work on this problem, though.

Destination: Singapore

Welcome, everyone, to ACL! It's pretty rare for me to end up conferencing in a country I've been to before, largely because I try to avoid it. When I was here last time, I stayed with Yee Whye, who was here at the time as a postdoc at NUS, and who had lived here previously in his youth. As a result, he was an excellent "tour guide." With his help, here's a list of mostly food-related stuff that you should definitely try while here (see also the ACL blog):

  • Pepper crab. The easiest to find are the "No Signboard" restaurant chain. Don't wear a nice shirt unless you plan on doing laundry.
  • Chicken rice. This sounds lame. Sure, chicken is kind of tasty. Rice is kind of tasty. But the key is that the rice is cooked in or with melted chicken fat. It's probably the most amazingly simple and delicious dish I've ever had. "Yet Kun" (or something like that) is along Purvis street.
  • Especially for dessert, there's Ah Chew, a Chinese place around Liang Seah street in the Bugis area (lots of other stuff there too).
  • Hotpot is another local specialty: there is very good spicy Szechuan hotpot around Liang Seah street.
  • For real Chinese tea, here. (Funny aside: when I did this, they first asked "have you had tea before?" Clearly the meaning is "have you had real Chinese tea prepared traditionally and tasted akin to a wine tasting?" But I don't think I would ever ask someone "have you had wine before?" But I also can't really think of a better way to ask this!)
  • Good late night snacks can be found at Prata stalls (eg., indian roti with curry).
  • The food court at Vivocity, despite being a food court, is very good. You should have some hand-pressed sugar cane juice -- very sweet, but very tasty (goes well with some spicy hotpot).
  • Chinatown has good Chinese dessert (eg., bean stuff) and frog leg porridge.

Okay, so this list is all food. But frankly, what else are you going to do here? Go to malls? :). There's definitely nice architecture to be seen; I would recommend the Mosque off of Arab street; of course you have to go to the Esplanade (the durian-looking building); etc. You can see a few photos from my last trip here.

Now, I realize that most of the above list is not particularly friendly to my happy cow friends. Here's a list of restaurants that happy cow provides. There are quite a few vegetarian options, probably partially because of the large Muslim population here. There aren't as many vegan places, but certainly enough. For the vegan minded, there is a good blog about being vegan in Singapore (first post is about a recent local talk by Campbell, the author of The China Study, which I recommend everyone at least reads). I can't vouch for the quality of these places, but here's a short list drawn from Living Vegan:

  • Mushroom hotpot at Ling Zhi
  • Fried fake meat udon noodles (though frankly I'm not a big fan of fake meat)
  • Green Pasture cafe; looks like you probably have to be a bit careful here in terms of what you order
  • Yes Natural; seems like it has a bunch of good options
  • Lotus Veg restaurant, seems to have a bunch of dim sum (see here, too)
  • If you must, there's pizza
  • And oh-my-gosh, there's actually veggie chicken rice, though it doesn't seem like it holds up to the same standards as real chicken rice (if it did, that would be impressive)

Okay, you can find more for yourself if you go through their links :).

Enjoy your time here!

Quick update: Totally forgot about coffee.

If you need your espresso kick, Highlander coffee (49 Kampong Bahru Road) comes the most recommended, but is a bit of a hike from the conference area. Of course, you could also try the local specialty: burnt crap with condensed milk (lots and lots of discussion especially on page two here).

Parsing with Transformations

I remember when I took my first "real" Syntax class, where by "real" I mean "Chomskyan." It was at USC in Fall 2001, taught by Roumyana Pancheva. It was hard as hell but I loved it. However, as a computationally minded guy, I remember snickering to myself the whole time we were talking about movements that get you from deep structure to surface structure. This stuff was all computationally ridiculous.

But why was it computationally ridiculous? It was ridiculous because my mindset, and I think the mindset of most computational folks at the time, was that of n^3 CKY or Earley style parsing. Namely exact parsing in a context free manner. This whole idea of transformations would kill anything like that in a very bad way.

However, there's been a recent shift in attitudes. Sure, people still do their n^3 parsing, but of course none of it is exact anyway (due to pruning). But more than that, things like linear time parsing algorithms as popularized by people like Joakim Nivre and Kenji Sagae and Brian Roark and Joseph Turian, have proved very useful. They work well, are incredibly efficient, and are easy to implement. They're also a bit more psychologically plausible (as Eugene Charniak said recently "we don't know what people are doing, but they're definitely not doing CKY.").

So I'm led to wonder: could we actually do parsing in a transformational grammar using all the new stuff we know about (for instance) left-to-right parsing?

One thing that stands in our way, of course, is the stupid Penn Treebank, which was annotated only with very simple transformations (mostly noun phrase movements) and not really "deep" transformations as most Chomskyan linguists would recognize them.

But I think you could still do it. It would end up as being partially unsupervised, but at least from a minimum description length perspective, I can either spend weights learning more special cases, or I can learn general transformational rules. It would take some thought and effort to write it out and figure out how to actually optimize such a thing, but I bet it could be done in a semester.

So then the question is: aside from smaller models (potentially), is there any other reason to do it?

I can think of at least one: parsing non-declarative sentences. Since almost all sentences in the Treebank are declarative, parsers do pretty crappy when tested on other things. Slav Petrov had a paper at EMNLP 2010 on parsing questions. Here is the abstract, which says pretty much everything:

... We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers ... drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.

Now, at least in principle, if you can parse declarative sentences, you should be able to parse questions. At least if you know about some basic syntactic transformations in English. (As an aside, the "uptraining" idea is almost exactly the same as the structure compilation idea that Percy, Dan and I had at ICML 2008, though Slav and colleagues apply it to a domain adaptation problem, while we just did simple semi-supervised learning.)

We have observed similar effects in the parsing of commands, such as "Put your head in a noose" where parsers -- even constituency ones -- really really want "Put" to be a noun! Again, if you know simple transformations -- like subject dropping -- you should be able to parse commands if you can already parse declarations.

As with any generalization, the hope is that by realizing the generalization, you don't need to store so many specific cases. So if you can learn that commands and questions are simple transformations of declarative sentences, and you can learn to parse declaratives, you should be able to handle the other cases.

(Anticipating comments: yes, I know you could try to pre-transform your data, like they do in MT, but that's quite inelegant. And yes, I know you could probably take the treebank and turn a lot of the sentences into commands or questions to create a new data set. But that's kind of missing the point: I don't want to just handle commands or questions... I want to handle anything, even things that I might not have anticipated.)

Some thoughts on supplementary materials

Having the option of authors submitting supplementary materials is becoming popular in NLP/ML land. NIPS was one of the first conferences I submitted to that allowed this; I think ACL allowed it this past year, at least for specific types of materials (code, data), and EMNLP is thinking of allowing it at some point in the near future.

Here is a snippet of the NIPS call for papers (see section 5) that describes the role of supplementary materials:

In addition to the submitted PDF paper, authors can additionally submit supplementary material for their paper... Such extra material may include long technical proofs that do not fit into the paper, image, audio or video sample outputs from your algorithm, animations that describe your algorithm, details of experimental results, or even source code for running experiments. Note that the reviewers and the program committee reserve the right to judge the paper solely on the basis of the 8 pages, 9 pages including citations, of the paper; looking at any extra material is up to the discretion of the reviewers and is not required.

(Emphasis mine.) Now, before everyone goes misinterpreting what I'm about to say, let me make it clear that in general I like the idea of supplementary materials, given our current publishing model.

You can think of the emphasized part of the call as a form of reviewer protection. It basically says: look, we know that reviewers are overloaded; if your paper isn't very interesting, the reviewers aren't required to read the supplement. (As an aside, I feel the same thing happens with pages 2-8 given page 1 in a lot of cases :P.)

I think it's good to have such a form of reviewer protection. What I wonder is whether it also makes sense to add a form of author protection. In other words, the current policy -- which seems only explicitly stated in the case of NIPS, but seems to be generally understood elsewhere, too -- is that reviewers are protected from overzealous authors. I think we need to have additional clauses that protect authors from overzealous reviewers.

Why? Already I get annoyed with reviewers who seem to think that extra experiments, discussion, proofs or whatever can somehow magically fit into an already crammed 8-page paper. A general suggestion to reviewers is that if you're suggesting things to add, you should also suggest things to cut.

This situation is exacerbated infinity-fold with the "option" of supplementary material. There now is no length-limit reason why an author couldn't include everything under the sun. And it's too easy for a reviewer just to say that XYZ should have been included because, well, it could just have gone in the supplementary material!

So what I'm proposing is that supplementary material clauses should have two forms of protection. The first being the existing one, protecting reviewers from overzealous authors. The second being the reverse, something like:

Authors are not obligated to include supplementary materials. The paper should stand on its own, excluding any supplement. Reviewers must take into account the strict 8 page limit when evaluating papers.

Or something like that: the wording isn't quite right. But without this, I fear that supplementary materials will, in the limit, simply turn into an arms race.

Intro to CL Books ...

Bob Carpenter has blogged about a new Intro to IR book online here. I'm looking forward to skimming it this weekend. I would also recommend the Python based NLTK Toolkit.

Books and resources like these are generally geared towards people with existing programming background. If a linguist with no programming skills is interested in learning some computational linguistics, Mike Hammond has written a couple of novice's intro books called Programming For Linguists. A novice would be wise to start with Hammond's books, move to the NLTK tutorials, then move on to a more serious book like Manning et al.

And if you're at all curious about what a linguist might DO once she has worked through all that wonderful material, you might go to my own most wonderful List of Companies That Hire Computational Linguists page here.

And if you're not challenged by any of that above, I dare you to read Bob's Type-Logical Semantics. Go on, you think yer all smart and such. I dare ya! I read it the summer of 1999 with a semanticist, a logician, and a computer scientist and it made all of our heads hurt. I still have Chapter 10 nightmares.

I may have misstated Chatham's beliefs below. It's not clear that he agrees with the claim I complained about. But his blog makes it clear that he believes this:

experience can be coded in a non-linguistic form, and that recoding into language is possible, at least over short delays

First, I didn’t realize it was at all controversial that experience can be coded in non-linguistic form. Of course it can. Does anyone doubt this? Second, I have no clue what Chatham means by recoding into language. Certainly thoughts and memories can be expressed by language, that should go without saying; but, Chatham seems to believe that at least some thoughts and memories are STORED in language form. This sounds like the old “we think in language” argument.

I am not convinced that we think in language. In fact, I seriously doubt we think in language. I think language is always a post-thought process.

casting a wide net

For the first time, I used Sitemeter to view the site hits for this blog. I only set that up a week ago, so the hits are recent only, but the range of locales is surprising. I'm bigger in India than I ever would have imagined. I can guess by some of the locations which of my friends are the likely culprits (Eric, you are spending wayyyyyyy too much time reading this blog). But some of these just have no explanation, other than Blogger's "Next Blog" button.

Here's a list of hit locations (excluding hits that lasted 0.00 seconds, of which there were unfortunately many).

Bombay, India
Brooklyn, NY (USA)
Cambridge, UK
Haifa, Israel
Honolulu, Hawaii (USA)
Hyderabad, India
Kinards, SC (USA)
Kraków, Poland
Leuven, Belgium
Mamers, NC (USA)
Melbourne, Australia
New York, NY (USA)
Pittsburgh, PA (USA)
Saint Paul, MN (USA)
San Diego, California (USA)
Seattle, Washington (USA)
Sunnyvale, CA (USA)
Tokyo, Japan
Tulsa, OK (USA)
Woking, UK
Data, Datum, Dati, Datillium, Datsun

The folks over at Cognitive Daily have blogged about the number properties of the word "data", or rather, they have blogged about the nitpicky prescriptivist grammar complaints that inevitably attend comments on academic paper submissions.

Predictably, the comments section is filled with people ignoring the main point, and instead making the same prescriptivist claims about the alleged plurality of "data". My 2 cents (in their comments) was simply that the word "data" has evolved into a word like "deer" or "moose" which can be either singular or plural.

Buffalo Buffalo Bayes

The (somewhat) famous Buffalo sentence below seems to say something about frequency and meaning, I'm just not sure what:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

The conditional probability of "buffalo" in the context of "buffalo" is exactly 1 (hey, I ain't no math genius and I didn't actually walk through Bayes theorem for this so whaddoo I know; I'm just sayin', it seems pretty obvious, even to The Lousy Linguist).

Also, there is no conditional probability of any item in the sentence that is not 1; so from where does structure emerge? Perhaps the (obvious) point is that a sentence like this could not be used to learn language. One needs to know the structures first in order to interpret. Regardless of your pet theory of learning, this sentence will crash your learner.
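A quick, informal check of the conditional-probability claim (treating "Buffalo" and "buffalo" as the same word, i.e., throwing away the orthographic cue discussed next):

```python
from collections import Counter

sentence = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
tokens = sentence.lower().split()          # collapse the capitalization cue

bigrams = Counter(zip(tokens, tokens[1:]))
contexts = Counter(tokens[:-1])

# P(w2 | w1) = count(w1 w2) / count(w1 as a context)
for (w1, w2), c in bigrams.items():
    print(f"P({w2} | {w1}) = {c / contexts[w1]:.2f}")
# -> a single bigram type, with conditional probability 1.00; nothing in the
#    string statistics distinguishes the city, the animal, or the verb.
```

If you keep the capitalization, the conditionals are no longer all 1, which is exactly the (weak) orthographic cue mentioned below.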

There are only two sets of cues that could help: orthographic and prosodic. There are three capitalized words, so that indicates some differentiation, but not enough by itself. A learner would have to have some suprasegmental prosodic information to help identify constituents. But how much would be enough?

Imagine a corpus of English sentences along these polysemic lines (with prosodic phrases annotated). Would prosodic phrase boundaries be enough for a learner to make some fair predictions about syntactic structure?

UPDATE (Nov 16, 2009): It only now occurs to me, years later, that the very first Buffalo has no preceding context "buffalo". Better late than never??

YIKES! or The New Information Extraction

The term information extraction may be taking on a whole new meaning to the greater world than computational linguists would have it mean. As someone working in the field of NLP, I think of information extraction as in line with the Wikipedia definition:

information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents.

But my colleague pointed out a whole new meaning to me a couple weeks ago, the day after an episode of the NBC sitcom My Name Is Earl aired (11/1/2007: Our Other Cops Is On!). Thanks to the wonders of The Internets, I managed to find a reference to the sitcom’s usage at TV Fodder.com:

Information extraction in a post-9/11 world involves delving into the nether regions of suspected terrorists....

In other words: TORTURE! The law of unintended consequences has brought the world of NLP and the so called War on Terror into sudden intersection (yes, there are "other" intersections... shhhhhhh, we don't talk about those). Perhaps the term IE is obsolete in CL anyway. Wikipedia described it as a subfield of IR. Manning & Schütze’s new book on the topic is called Introduction to Information Retrieval , not Introduction to Information Extraction. They define IR, on the link above, essentially as finding material that satisfies information needs (note: I'm not quoting directly because the book is not yet out).

Quibbling over names and labels of subfields is often entertaining, but it's ultimately a fruitless endeavor. I defer to Manning & Schütze on all things NLP. Information Retrieval it is.

Andrew Sullivan, Please Take a Cog Sci Class!!!!

Even though he blogs at a mere undergrad level (I'm slightly higher, heehee) I basically respect Andrew Sullivan as a blogger. He blogs about a diverse set of topics and has thoughtful and intelligent (even if controversial) comments and analysis. And he's prolific, to say the least (surely the advantage of being a professional blogger, rather than stealing the spare moment at work while your test suite runs its course). That said, he can sometimes really come across as a snobbish little twit. Like yesterday, when he linked to an article about Shakespearean language which talks about a psycholinguistics study initiated by an English professor, Philip Davis; as is so often the case, the professor has wildly exaggerated the meaning of the study. Please see Language Log's post Distracted By The Brain for related discussion. Here's a crucial quote from that post:

The neuroscience information had a particularly striking effect on non-experts’ judgments of bad explanations, masking otherwise salient problems in these explanations.

My claim: the neuroscience study discussed in the Davis article distracts the reader from Davis’s essentially absurd interpretations, and Andrew Sullivan takes the bait, hook, line and sinker (and looks like a twit in the end).

The article does not go into the crucial details of the study, but it says that it involves EEG (electroencephalogram) and MEG (magnetoencephalography) and fMRI (Functional Magnetic Resonance Imaging), noting that only the EEG portion has been completed. A pretty impressive array of tools for a single psycholinguistics study, I must say. Most published articles in the field would involve one or maybe two of these, but all three for a single study? Wow, impressive.

It’s not clear to me if this was a well designed study or not (my hunch is, no, it is a poorly designed study, but without the crucial details, I really don’t know). However, it is undeniable that professor Davis has gone off the deep end of interpretation. The study does not even involve Shakespearean English!!! It involves Modern English! Then Davis makes the following claims (false, all of them, regardless of the study):

["word class conversion"] is an economically compressed form of speech, as from an age when the language was at its most dynamically fluid and formatively mobile; an age in which a word could move quickly from one sense to another… (underlines added)

This is the classic English professor bullshit. I don't even know what "economically compressed" means (Davis gives no definition); it has no meaning in linguistics that I know of. The quote also suggests Shakespeare's English had some sort of magical linguistic qualities that today's English does not possess. FALSE! Modern English allows tremendous productivity of constructions, neologisms, and ambiguity. A nice introduction to ambiguity can be found here: Ambiguous Words by George A. Miller.

Davis ends with a flourish of artistic bullshit hypothesizing: "For my guess, more broadly, remains this: that Shakespeare's syntax, its shifts and movements, can lock into the existing pathways of the brain and actually move and change them—away from old and aging mental habits and easy long-established sequences." Neuroplasticity is only just now being studied in depth and it's far from well understood, but the study in question says NOTHING about plasticity!!! There's also no reason to believe that Shakespeare's language does anything that other smart, well crafted language does not do. And we're a generation at least away from having the tools to study any of this.

I’m accustomed to simply letting these all too common chunks of silliness go without comment, but then Andrew had to slip in his unfortunate bit of snooty arrogance. After pasting a chunk of the obvious linguistics bullshit on his site (then follow-up comments), he has to finish with "I knew all that already". Exactly what did you know, Andrew?

Since all of the major claims Davis makes are obvious bullshit, what exactly do you claim to have had prior knowledge of? What did Andrew know, and when did he know it? Really, Andrew, did you never take so much as a single linguistics course during all your years at Harvard and Oxford? The University of Maryland has excellent psycholinguists, as does Georgetown. Please, consider sitting in on a course, won't you?

"filibuster"

Some words just make me giggle.

Linguistics Forum

I just discovered this forum called Linguistics Forum. I only looked at a few of the posts and I was underwhelmed, but I've never been a forum-kind-of-guy, so my opinion should be of minimal interest to those of you who utilize these resources. Just thought I'd pass it along.

Tigrigna Blog and Resources

I just discovered a blog by a student of the language Tigrinya Qeyḥ bāḥrī.

From his site,

Being from a small city in Canada (Halifax, Nova Scotia) I found it very difficult to learn the mother tongue of my parents, as there are few resources availible from which I can learn. So, I decided to create a resource for myself, somewhere I could collect everything I know about the language and use it at my leisure. I thought about using my limited knowledge on HTML to create a webpage, that way I could have easy access to my work wherever I go.

And from Ethnologue

Tigrigna -- A language of Ethiopia

Population -- 3,224,875 in Ethiopia (1998 census). 2,819,755 monolinguals.
Region Tigray Province. Also spoken in Eritrea, Germany, Israel.

Alternate names -- Tigrinya, Tigray
Classification -- Afro-Asiatic, Semitic, South, Ethiopian, North
Language use -- National language. 146,933 second-language speakers.
Language development -- Literacy rate in first language: 1% to 10%.
Literacy rate in second language: 26.5%. Ethiopic script. Radio programs. Grammar. Bible: 1956.
Comments -- Speakers are called 'Tigrai'.

May 13, 2008

(screen grab from Psycholinguistics Arena)

What the hell happened on May 13, 2008?

Pundit Plays Linguist. Fails.
(screen shot of a guest at McCain's BBQ. Video here)
Political pundits almost pathologically believe they have greater influence than they really do. Case in point, Talking Points Memo's editor, publisher, and chief blogger Josh Marshall has been trying to promote the use of the phrase "ride the swing" as a metaphor for the case when "a reporter who has gotten way too cozy with a politician and has had their supposed objectivity affected" (original explanation here). The phrase refers to a posh BBQ that McCain hosted at one of his Arizona ranches where journalists were treated to a very comfy social experience that bordered on bribery (click on "video here" below the pic). As far as I can tell, Marshall is the primary pusher of the phrase and its most frequent user (a couple other examples here and here).
I suspect Marshall's linguistic campaign will fail. Attempts by a single person to explicitly promote the use of a new metaphor are rarely successful. This is not how language works. Successful new coinages are generally adopted less self-consciously. The process is not well understood, but examples like Marshall's are few and far between. Additionally, there are already several good metaphors for related frames, such as "drank the Kool-Aid" (which has equally obscure origins involving jungles and religious cults). Not sure we need a new one just for journalists.
(HT to my colleague CC for bringing this to my attention. At first, we had no clue what this metaphor referred to, and as such we literally couldn't understand what it was meant to evoke. CC did some blogger detective work and discovered its origin.)

Obama's Tango Conspiracy?
(screen shot from MSNBC's video)

Having nothing whatever to do with linguistics, nonetheless I feel compelled to report what seems like an entirely unreported snub by US President Barack Obama of the President of Argentina, Cristina Kirchner. Watch MSNBC's video of the second photo shoot and you'll see Obama walk across the entire group to shake hands with Canada's PM Stephen Harper (who missed the original shoot). He passed right in front of Kirchner, who reached out her hand to shake Obama's, but he ignored her entirely (creating a somewhat awkward moment), shook Harper's hand, then refused to make eye contact with Kirchner afterwards. I count that as two snubs.

Watch the video at Olbermann's "Countdown" site and at about 40 seconds in you'll see the moments I'm talking about. MSNBC's footage seems to be the only one with a wide enough angle to show the snubs.

The relevant footage is here:
April 2, 2009; #5 "Obama meets the world press" (psssst, this has nothing to do with anything; just random rumor mongering...which is fun, ya know...)

Taco Verbs
(screen shot of this blog's Sitemeter data)

A reader apparently was interested in "verbs that describes tacos." Since the IP address shows the Indiana Department of Education, I got 20 bucks says this was done by a lunch lady writing out next week's menu.

As for the "linguistics aspect", well, verbs don't describe nouns (like "tacos"), adjectives do. Verbs represent events. Rather, adjectives describe nouns. So, in the interest of serving my readers, exactly what kind of of adjectives describe tacos? Let's go to the experts:

Taco Bell:
  • crunchy taco
  • soft taco
  • taco supreme (bonus points for the postnominal adjective)
  • double decker taco.
Lip Reading Response

This is a response to Liberman's Saturday morning goofiness here:

Regex Dictionary

Nice one! A web-based dictionary you can search with regular expressions (HT MetaFilter). From the site's introduction page:

The Regex Dictionary is a searchable online dictionary, based on The American Heritage Dictionary of the English Language, 4th edition, that returns matches based on strings —defined here as a series of characters and metacharacters— rather than on whole words, while optionally grouping results by their part of speech. For example, a search for "cat" will return any words that include the string "cat", optionally grouped according to gramatical category:

  • Adjectives: catastrophic, delicate, eye-catching, etc.
  • Adverbs: marcato, staccato, etc.
  • Nouns: scat, category, vacation, etc.
  • Verbs: cater, complicate, etc.

In other words, the Regex Dictionary searches for words based on how they are spelled; it can find:

  • adjectives ending in ly (197; ex.: homely)
  • words ending in the suffix ship (89)
      • Adjectives (1, midship)
      • Nouns (80; ex.: membership)
      • Suffixes (1, -ship)
      • Verbs (6; ex.: worship)
  • words, not counting proper nouns, that have six consecutive consonants, including y (79; ex.: strychnine)
  • words, not counting proper nouns, that have six consecutive consonants, not counting y (2; ex.: latchstring)
  • words of 12 or more letters that consist entirely of alternate consonants and vowels (45; ex.: legitimatize)
  • All of them
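You can approximate a few of these searches yourself (minus the part-of-speech grouping, which needs a tagged dictionary) over any plain word list; the sketch below assumes a Unix-style /usr/share/dict/words file, but any one-word-per-line file will do, and the counts will of course differ from the American Heritage-based ones above:

```python
import re

# A plain word list; /usr/share/dict/words is a common location on Unix-like
# systems, but any one-word-per-line file will do.
with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip()]

def grep(pattern):
    rx = re.compile(pattern)
    return [w for w in words if rx.search(w)]

print(len(grep(r"ly$")))       # words ending in "ly" (no POS filtering here)
print(len(grep(r"ship$")))     # words ending in "ship"
# six consecutive consonants, not counting y (lowercase only, so proper nouns are skipped)
print(grep(r"^[a-z]*[bcdfghjklmnpqrstvwxz]{6}[a-z]*$")[:5])
```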
Chinese Without Tone?

("please, wait a moment", image from braille.ch)

Vivian Aldridge has a nice website devoted to explaining braille systems for different languages (HT Boing Boing). If I understand correctly, tone is rarely represented in Chinese braille (let's forgive for the moment that "Chinese" is the name of a language family, not a particular language):

In the few examples of Chinese braille that I have come across, the signs for the tones were not used except in the following cases:
  • with the syllable yi, for which a good Chinese-German dictionary lists almost 50 different inkprint characters. In this case the indication of the tone helps to limit the number of possible meanings.
  • in words where a syllable with a suppressed vowel comes before syllable without a consonant, for example the word sh'yong (try out, test) in which the braille sign for the fourth tone is used instead of the apostrope. In this case the tone sign seems to be used to separate the two syllables.
Tone is a non-trivial feature of Chinese languages. Omniglot has a nice page with the system displayed (fyi, it cites braille.ch as one of its sources). The interesting point is that tone has the ability to be represented, but according to Vivian, it normally is not (however, she notes that she has only seen a few examples). I spent two years in college struggling in Mandarin courses. I would have liked to have dispensed with tone.

Correct All Grammar Errors And Plagiarism!

I was stupid enough to click through to Huffington Post's colossally stupid and fundamentally mistaken Worst Grammar Mistakes Ever post (I refuse to link to it). Of course, the 11 items had virtually nothing to do with grammar (the vast majority were punctuation and spelling errors). I must agree with Zwicky's pessimism regarding National Grammar Day: "It seems to me that the day is especially unlikely to provide a receptive audience for what linguists have to say."

But what prompted this post was the ad at the bottom for Grammarly, a free online "proofreader and grammar coach" which promised to Correct All Grammar Errors And Plagiarism.

A bold claim, indeed. I doubt a team of ten trained linguists could felicitously make this claim. But the boldness does not stop there (it never does on the innerwebz). Click through to the online tool and wow, the bold claims just start stacking up like flapjacks at a Sunday fundraiser.

Just paste in your text and bam! you get:
  • 150+ Grammar Checks -- Get detailed error explanations.
  • Plagiarism Detection -- Find unoriginal text.
  • Text Enhancement -- Use better words.
  • Contextual Spell Check -- Spot misused words.

Dang! Them fancy computers, they sure is smart. Just for funnin, I pasted the text of Zwicky's NGD again post into the window and ran the check. Here's his report:

Not bad for a professor at one of the lesser linguistics departments. (pssst, btw, did ya spot that odd little grey balloon at the top of the second screen shot? Yeah, me too. It says "click allow if presented with a browser security message." Suspect, no doubt. Nonetheless, I trusted Chrome to protect me and plowed ahead.)

more on language death

Razib continues his thoughtful discussion of the interplay of linguistic diversity/homogeneity and socio-economic disparity/prosperity.

Money quote:

If you have a casual knowledge of history or geography you know that languages are fault-lines around which intergroup conflict emerges. But more concretely I’ll dig into the literature or do a statistical analysis. I’ll have to correct for the fact that Africa and South Asia are among the most linguistically diverse regions in the world, and they kind of really suck on Human Development Indices. And I do have to add that the arrow of causality here is complex; not only do I believe linguistic homogeneity fosters integration and economies of scale, but I believe political and economic development foster linguistic homogeneity. So it might be what economists might term a “virtual circle.” (emphasis in original)

I have a long history of discussing language death on this blog and my position can be summed up by this Q&A I had with myself:

Q: Is language death a separate phenomenon from language change?
A: In terms of linguistic effect, I suspect not

Q: Are there any favorable outcomes of language death?
A: I suspect, yes (Razib proposes one)

Q: How do current rates of language death compare with historical rates?
A: Nearly impossible to tell

Q: What is the role of linguists wrt language death?
A: One might ask: what is the role of mechanics wrt global warming?
verb valencies

A new online version of the 2004 book A Valency Dictionary of English has recently gone live. I haven't had a chance to play with it, but it looks like it has some good data about verb patterns. If you're into that kinda thing, I mean.

son of of bitch, the weather is pickled

This is an awesome video of a Korean language professional teaching Korean speakers how to use swear words in English. It's so good, it's pickled.

(HT kottke)

what a PhD looks like... a pimple...

...and I remain happily ABD...

See The illustrated guide to a Ph.D. for full set of images and discussion. habits of mind A high school math teacher helps us all understand critical thinking. Personal fav: Looks at statements that are generally false to see when they are true.

(via kottke)

what's wrong with citing a paper?

Hadas Shema, an Information Science PhD student, discusses some of the politics and problems with academic citations in The citation game. She got her facts from Bornmann, L., & Daniel, H. (2008). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation, 64(1).

This one jumped out at me:

Journal-dependent factors: Getting published in a high-factor journal doesn't necessarily mean your paper is the best thing since sliced bread, but it means more people are likely to think so. Also, the first paper in a journal usually gets cited more often (I wonder if that's still relevant, given how wide-spread electronic access is these days) (emphasis added).

Lots of crap has been published in major journals. And the corollary deserves to be mentioned: lots of good articles are published in minor journals and fail to get the respect or notice they deserve.

a brief history of stanford linguistics dissertations

The above image comes from the Stanford Dissertation Browser and is centered on Linguistics. This tool performs some kind of textual analysis of Stanford dissertations: every dissertation is taken as a weighted mixture of a unigram language model associated with every Stanford department. This lets us infer that, say, dissertation X is 60% computer science, 20% physics, and so on... Essentially, the visualization shows word overlap between departments, measured by letting the dissertations in one department borrow words from another department.

Thus, the image above suggests that Linguistics borrows more words from Computer Science, Education, and Psychology than it does from other disciplines. What was most interesting was using the Back button to create a moving picture of dissertation language over the last 15 years. You'll see a lot of bouncing back and forth. Stats makes a couple of jumps here and there.
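I don't know exactly what the browser does under the hood, but a very rough sketch of the "dissertation as a mixture of department unigram models" idea, with made-up toy vocabularies and counts and a few EM steps to estimate the mixture weights, might look like this:

```python
import numpy as np

# Made-up department unigram models over a tiny shared vocabulary (rows sum to 1).
vocab = ["parse", "syntax", "neuron", "reward", "classifier"]
departments = {
    "Linguistics":      np.array([0.30, 0.50, 0.05, 0.05, 0.10]),
    "Computer Science": np.array([0.25, 0.05, 0.05, 0.15, 0.50]),
    "Psychology":       np.array([0.05, 0.10, 0.50, 0.30, 0.05]),
}
names = list(departments)
P = np.stack([departments[d] for d in names])      # shape: (departments, vocabulary)

# A "dissertation" is just a bag of word counts over the same vocabulary.
counts = np.array([12.0, 20.0, 3.0, 2.0, 8.0])

# EM for the mixture weights pi: which blend of department models best explains the counts?
pi = np.full(len(names), 1.0 / len(names))
for _ in range(50):
    resp = pi[:, None] * P                         # E-step: responsibility of each department per word
    resp /= resp.sum(axis=0, keepdims=True)
    pi = (resp * counts).sum(axis=1)               # M-step: expected word count credited to each department
    pi /= pi.sum()

for name, w in zip(names, pi):
    print(f"{name}: {w:.2f}")
```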

HT Razib Khan

Choose Your Own Career in Linguistics

Trey Jones* at Speculative Grammarian invites y'all to play his cute, and yet somewhat depressing, game: Choose Your Own Career in Linguistics.

As a service to our young and impressionable readers who are considering pursuing a career in linguistics, Speculative Grammarian is pleased to provide the following Gedankenexperiment to help you understand the possibilities and consequences of doing so. For our old and bitter readers who are too far along in their careers to have any real hope of changing the eventual outcome, we provide the following as a cruel reminder of what might have been.

Let the adventure begin...

*hehe, he used to work at Cycorp, hehe...

do you despise eReaders and have tons of extra cash

...then this is for you: The Penguin Classics Complete Library is a massive box set consisting of nearly every Penguin Classics book ever published and is available on Amazon for only (only!) $13,413.30.

A rundown:
  • 1,082 titles
  • laid end to end they would hit the 52-mile mark
  • 700 pounds in weight
  • 828 feet if you stacked them
  • They arrived in 25 boxes


My only complaint would be that Penguin Classics tend to be crappy books physically.

HT Kottke.