The biggest known database of sarcasm is here. Great

Reddit mascots are displayed at the company’s headquarters in San Francisco, California, April 15, 2014. Reddit, a website with a retro-’90s look and space-alien mascot that tracks everything from online news to celebrity Q&As, is going after more eyeballs, and advertising, by allowing members of its passionate community to post their own news more quickly and easily.
Image: Reuters/Robert Galbraith

In spoken English, the difference between a genuine “Can’t wait” and a sarcastic “Can’t wait” is obvious. Online and in written form, not so much.

Sarcasm is one of the hardest language concepts for a machine to detect, but to improve natural-language processing (which could, for example, greatly advance the abilities of chatbots) we’re going to have to train computers to do just that. To make that happen, computer scientists need vast amounts of sarcasm data. Last week, computer science graduate students from Princeton University added their efforts to a growing body of research, with what they think is the largest database of sarcasm to date.

Their database of 1.4 million sarcastic remarks draws on Reddit comments posted from 2009 to 2016, in which users mark sarcasm with two simple keystrokes: /s.

It makes the difference between:

Yeah, Obama was the best at that.

and

Yeah, Obama was the best at that. /s
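For readers curious how such a marker could be picked up automatically, here is a minimal sketch in Python. It is not the researchers' actual pipeline. It assumes the /s sits at the very end of a comment, and the `label_comment` function and `SARCASM_MARKER` pattern are hypothetical names invented for this illustration.

```python
import re

# Assumption for this sketch: the marker is "/s" at the very end of a comment,
# preceded by whitespace or the start of the text. Real comments may vary.
SARCASM_MARKER = re.compile(r"(?:^|\s)/s\s*$")


def label_comment(text):
    """Return (comment_without_marker, is_sarcastic) for one comment."""
    match = SARCASM_MARKER.search(text)
    if match is None:
        # No trailing marker: treat the comment as not self-labeled.
        return text.strip(), False
    # Drop the marker so a model cannot simply learn to look for "/s".
    return text[:match.start()].strip(), True


comments = [
    "Yeah, Obama was the best at that.",
    "Yeah, Obama was the best at that. /s",
]
labeled = [label_comment(comment) for comment in comments]
for text, is_sarcastic in labeled:
    print(is_sarcastic, text)
```

Requiring whitespace before the marker keeps a link that happens to end in "/s" from being counted as sarcasm, though a real filter would need to handle many more edge cases.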

The scientists ran the /s-labeled comments through a computer program that applied filters designed to “remove noisy and uninformative comments,” according to a pre-publication paper (pdf) they posted on arXiv. They had humans spot-check the dataset and found a rate of around 2% false positives and 3% false negatives. According to the researchers, the false positive rate was reasonable, but the false negative rate, which is primarily the result of a “large variation in the working definition of sarcasm,” signals the need for better filtering methods.

Still, the database is an improvement on previous projects, and 10 times bigger than the next-largest dataset (pdf) of sarcasm, according to the researchers. Beyond scale, the collection departs from previous efforts in two important ways. It relies on users’ own assertions (“I’m being sarcastic”) rather than on outside evaluators guessing on their behalf (“I think this is sarcasm”). And it uses Reddit comments rather than tweets; the researchers say tweets are lower-quality language samples because they are snippets, not complete sentences.

The researchers created a separate database just from the r/politics subreddit, since they calculated that it has a high proportion of sarcasm compared to the average subreddit. (Another high-percentage sarcasm subreddit is r/MensRights, while users seem comparatively sincere in, for example, r/games and r/science.)

With each comment, the researchers include its context, which is usually needed, even by humans, to detect written sarcasm. It makes the difference between:

I wish there was some way to prove these statements. [no context]

I wish there was some way to prove these statements. [context: “Eric Trump: My father has paid ‘a tremendous amount of tax’”]