Auto-Generating Clickbait with Recurrent Neural Networks

thenomad · on Oct 13, 2015

If I could feed this an article and have it generate headlines based on the text of that article (and they were any good), there is a solid chance I would pay real money for that service.

Headlines are an absolute pain, and as the article says, they're decidedly unoriginal most of the time. I can't see an obvious reason that an AI would be much worse at creating them than a human.

SixSigma · on Oct 13, 2015

Instead of generating random articles

  1. Generate click bait headlines
  2. Write suitable copy for them
  3. ???
  4. Profit

Where 3, of course, is "build ad network"

limelight · on Oct 13, 2015

I know of at least one company which actually does this (tests headlines before writing the actual articles).

bcohen5055 · on Oct 13, 2015

I've always wondered if sites do this the other way around. A site invests a lot in the content it creates, so you would think that, to get the most out of it, they would serve it with different headlines to different demographics. Knowing from Facebook which headlines you have clicked through in the past can indicate how they should write future headlines to get you again.

limelight · on Oct 13, 2015

Oh, they definitely do it the other way around extensively.

petra · on Oct 13, 2015

In the testing phase, what happens when you click the headline?

limelight · on Oct 13, 2015

Sometimes it just loads a random article, sometimes it gives an error page.

vidarh · on Oct 14, 2015

And part of 2. involves automatically farming out the writing to places like Fiverr.

thenomad · on Oct 13, 2015

1a. Split-test. A lot.

Aldo_MX · on Oct 13, 2015

That's absolutely easy:

<Random Unexpected Person> did <random unexpected action>, you wouldn't believe it!

thenomad · on Oct 13, 2015

There are many, many headline formulae in the world ( some of my favourites were written in the mid 1920s, in John Caples' book "Tested Advertising Methods" ) but they still take time to iterate through and hone for each article.

I'd like a robot to do that for me, please :)

blisterpeanuts · on Oct 13, 2015

I like the notion of swamping the Internet with fake click-bait headlines, to dilute the attractiveness of this (to me, odious) form.

Give me sincere, honest news and discussion, or else shut up.

Unfortunately, someone out there must really have a craving for "weird old tricks" and "shocking conclusions".

It's a sort of race-to-the-bottom, least common denominator effect.

Maybe someone will write a browser extension that filters out obvious click-bait headlines. Now that would be clever!

billmalarky · on Oct 13, 2015

People don't have a craving for this kind of crap, that is they don't actively search for it. It works by exploiting the brain. It's the publishing equivalent to junk food. We know it's awful. We know it's bad for us. But we struggle to not consume it because it's cheap and it pings our reward systems.

AnimalMuppet · on Oct 13, 2015

Actually, I think that the clickbait junk makes us think that it will ping our reward systems. For me, at least, it doesn't really reward me very well (even in a junk food way).

Maybe this means that the real clickbait trash is training me not to click on it, so I don't need the fake to do so?

TeMPOraL · on Oct 14, 2015

I cured myself of clickbait headlines after I clicked on a few and learned to expect no content on the other side. It's a simple association, really. You click on something X-y, you get no reward, you learn not to waste time on X-looking things.

billmalarky · on Oct 14, 2015

For you, certainly. But the proof of the pudding is in the tasting. Clickbait drives an insane amount of traffic and it shows no sign of slowing down. Cosmo has been successfully using "clickbait" titles for decades on the cover of their magazines.

TeMPOraL · on Oct 15, 2015

Of course. I'm not denying the effectiveness of this technique, just providing an n=1 datapoint. Maybe my personal idiosyncrasies make me immune to that particular type of traffic-driving technique (I have no doubts I'm vulnerable to other methods).

vidarh · on Oct 14, 2015

Part of our reward systems is what was initially named the "pleasure centre".

When the "pleasure centre" in the brain was first identified and named, it was named because it was thought that stimulating it caused pleasure, because rodents given the choice to stimulate it vs. other activity would stimulate the pleasure centre even over eating.

But as it turns out, the main function of stimulating this area is strong cravings and compulsion. You may get some pleasure from giving in to the cravings, but the cravings are independent of whether or not there's a "real" reward at the end of it.

billmalarky · on Oct 14, 2015

Reminds me of a comment I read a few weeks back when an ex-drug addict was describing how the anticipation of using drugs was often more rewarding than the use itself. Which explains the pleasures in drug use rituals.

Natsu · on Oct 14, 2015

I've just banned myself from visiting many of the worst offenders. Though it's getting really hard, any more, to find sites that won't sink to that level.

rcthompson · on Oct 13, 2015

Couldn't this trained RNN also be used to evaluate the "clickbait-ness" of article titles (rather than generate new ones)?

null000 · on Oct 14, 2015

Tl;dr: No. Wrong output format, wrong training set, wrong input.

To create a classifier that does that, you'd need a labeled set - i.e. someone would have to go through and say "this headline is 3 clickbaits. This other headline is 8 clickbaits". You could also sort between clickbaity and non-clickbaity, but that would still require manual work.

You could get that programmatically through a few different means, but you'd need a lot more than just headlines.

It also probably wouldn't be a good idea to use an RNN - it doesn't suit the data format well. It'd be better to use a neural network (non-recurrent) or logistic regression with the entire headline as input.

Fortunately, it'll converge on a good solution a LOT faster - fewer parameters to tune + simpler output = fewer examples needed to figure out what's going on - so you might be able to get something that has plausible levels of accuracy with a day or two of set labeling (estimate brought to you by my ass).
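As a rough illustration of the "logistic regression with the entire headline as input" suggestion, here is a minimal sketch in plain Python. The six labeled headlines, the bag-of-words features, and the learning rate are all assumptions made up for the example; a real classifier would need the day or two of labeling mentioned above.

```python
import math
from collections import defaultdict

# Tiny hand-labeled set, invented for illustration only (1 = clickbait, 0 = not).
LABELED = [
    ("you won't believe what happens next", 1),
    ("this one weird old trick will shock you", 1),
    ("10 things you won't believe about cats", 1),
    ("senate passes budget bill after long debate", 0),
    ("central bank holds interest rates steady", 0),
    ("city council approves new transit plan", 0),
]


def features(headline):
    """Bag of words: the whole headline goes in, one count per word comes out."""
    counts = defaultdict(float)
    for word in headline.lower().split():
        counts[word] += 1.0
    return counts


def predict(weights, bias, headline):
    """Estimated probability that the headline is clickbait."""
    z = bias + sum(weights[w] * c for w, c in features(headline).items())
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=200, lr=0.5):
    """Plain stochastic gradient descent on the log loss."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for headline, label in data:
            # Gradient of the log loss with respect to z is (prediction - label).
            error = predict(weights, bias, headline) - label
            bias -= lr * error
            for w, c in features(headline).items():
                weights[w] -= lr * error * c
    return weights, bias


weights, bias = train(LABELED)
```

Calling `predict(weights, bias, "some headline")` then gives a score between 0 and 1. Words the model has never seen contribute nothing, which is one reason a set this small would not generalize.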

RevRal · on Oct 14, 2015

>Unfortunately, someone out there must really have a craving for "weird old tricks" and "shocking conclusions".

This problem seems concurrent to the old mystery of Viruses Spontaneously Self-Constructing On People's Computers. "How did you get all these viruses on your computer?" "I didn't do anything it just happened." "Okay, well be really careful what you click on." "I am careful!"

cryowaffle · on Oct 13, 2015

> Give me sincere, honest news and discussion, or else shut up.

There are plenty of sources for what you desire; it just isn't what's popular... is that a problem?

blisterpeanuts · on Oct 14, 2015

Yes, but even respectable news and information websites now include clickbait (Outbrain and other "sponsored" content). I've seen it on WSJ, NYT, and other sites even when I'm paying $10-$15 a month.

rndn · on Oct 13, 2015

Could this RNN model perhaps be used to filter click bait headlines from HN automatically? Perhaps one could perform some sort of backward beam search to figure out how likely it is that a particular headline would've been produced by it. If there are words in a headline that the model doesn't know, one could perhaps just let it replace them with ones that it knows.
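A sketch of the scoring idea, with a bigram model standing in for the RNN (a deliberate simplification, not the article's model). The three-line corpus, the `<unk>` replacement token, and add-one smoothing are assumptions for illustration; the point is only that a generative model assigns a likelihood to any headline, and that likelihood could be thresholded.

```python
import math
from collections import Counter

# Toy clickbait corpus, invented for illustration; a real filter needs far more data.
CORPUS = [
    "you won't believe what happens next",
    "you won't believe these photos",
    "what happens next will shock you",
]
UNK = "<unk>"


def tokenize(headline, vocab=None):
    """Lowercase, split, add start/end markers, and map unknown words to UNK."""
    words = ["<s>"] + headline.lower().split() + ["</s>"]
    if vocab is not None:
        words = [w if w in vocab else UNK for w in words]
    return words


vocab = {w for line in CORPUS for w in tokenize(line)} | {UNK}
bigrams = Counter()
contexts = Counter()
for line in CORPUS:
    words = tokenize(line)
    for prev, cur in zip(words, words[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1


def avg_log_prob(headline):
    """Average per-token log-probability under the model, with add-one smoothing."""
    words = tokenize(headline, vocab)
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        total += math.log(p)
    return total / (len(words) - 1)
```

Averaging per token keeps long headlines from being penalized just for their length. A headline that scores close to the training headlines looks like something the model would have produced; where to put the cutoff would have to be tuned on real data.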

oneJob · on Oct 13, 2015

Now if we can just teach AI to get sidetracked reading all this content we'd also prevent Judgement Day.

SkyNet: (speaking to self?) "Unleash hell on humans. Launch all missiles."

SkyNet: (responding to self?) "Not now, not now. Let me finish this article on John Stamos's belly button."

ChuckMcM · on Oct 13, 2015

https://xkcd.com/1283/

I really find RNNs to be pretty cool. When they are combined with a natural human tendency to see patterns they are hilarious. So perhaps we need to update our million monkeys hypothesis to a million RNNs with typewriters coming up with all the works of Shakespeare.

notahacker · on Oct 13, 2015

Shakespearean RNN http://cs.stanford.edu/people/karpathy/char-rnn/shakespear.t...

Surprisingly convincing if viewed as excerpts rather than a play.

Now to find some English teachers to try to interpret what Shakespeare meant by some of those lines!

mey · on Oct 13, 2015

https://xkcd.com/356/

clickok · on Oct 13, 2015

Nice! I've wanted to do something like this for a while, too, but haven't had the time yet.

What's interesting to me, from a research point of view, is the degree of nuance the network uncovers for the clickbait. We all know that <person> is going to be doing <intriguing action>, but for each person these actions are slightly different. The sentence completions for "Barack Obama Says..." are mainly politics related while "Kim Kardashian Says..." involve Kim commenting on herself.

So it might not really understand what it's saying, but it captures the fact those two people will tend to produce different headlines.

Neat Idea: what if we tried the same thing with headlines from the New York Times (or maybe a basket of newspapers)? We would likely find that the Clickbait RNN's vision of Obama is a lot different from the Newspaper RNN's Obama. Teasing apart the differences would likely give you a lot more insight into how the two readerships view the president than any number of polls would.

mikkom · on Oct 13, 2015

What surprises me most is that the headlines don't seem to be much better than your average Markov chain output.

IanCal · on Oct 13, 2015

I think this is for three main reasons:

1. You can do really well with a simple grammar

2. You only need short output

3. Lack of training data

There's not an incredibly rich structure to extract, and with short outputs the weirdness doesn't compound and cycles aren't as likely. A common small dataset for playing with RNNs is all of Shakespeare which is somewhere in the region of 1M words.

However, this is still fun and interesting!

mikkom · on Oct 13, 2015

> 3. Lack of training data

> [...]

> There's not an incredibly rich structure to extract, and with short outputs the weirdness doesn't compound and cycles aren't as likely. A common small dataset for playing with RNNs is all of Shakespeare which is somewhere in the region of 1M words.

He does state that the network is trained with 2M headlines, meaning ~5-20M words. That should be enough.

I would have thought that RNN would somehow work better. It would be interesting to see direct comparison of fake hacker news headlines generated with Markov chains versus RNN.
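For reference, the Markov-chain baseline in that comparison is only a few lines. This is a minimal first-order, word-level sketch; the four training headlines are invented for the example, and a fair comparison against the RNN would train on the same 2M headlines.

```python
import random
from collections import defaultdict

# Invented training headlines, for illustration only.
HEADLINES = [
    "Show HN: A Tiny Compiler Written In Rust",
    "Show HN: A Tiny Database Written In Go",
    "Why I Stopped Using A Database",
    "Why Rust Is Eating The World",
]


def build_chain(headlines):
    """Map each word to the list of words that followed it in training."""
    chain = defaultdict(list)
    for headline in headlines:
        words = ["<s>"] + headline.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            chain[prev].append(cur)
    return chain


def generate(chain, rng, max_words=20):
    """Walk the chain from the start marker until the end marker or the cap."""
    word, out = "<s>", []
    while len(out) < max_words:
        word = rng.choice(chain[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)


chain = build_chain(HEADLINES)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(chain, rng) for _ in range(5)]
```

Every adjacent word pair in the output was seen in training, which is why the results read locally plausible. The chain remembers only one word of context, so nothing ties the end of a headline to its beginning; that is the property an RNN is supposed to improve on.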

IanCal · on Oct 13, 2015

True, I had managed to miss that, although it's working on 200 dimensional vectors rather than single letters as in the small shakespeare dataset. That feels like it might make it harder to train. I've personally found more problems dealing with Glove vectors compared to the word2vec ones, but I don't have any hard data for that.

seiji · on Oct 13, 2015

Also see http://www.headlinesmasher.com

VLM · on Oct 13, 2015

This was an enjoyable article. There is an obvious extension which is to mturk the results and feed the mturk data back into the net. Just give the turkers 5 headlines and ask them which they would click first, repeat a hundred times per a thousand turkers or whatever.

Years ago I considered applying for DoD grant money to implement something reminiscent of all this for military propaganda. That went approximately nowhere, not even past the first steps. Someone else should try this (insert obvious famous news network joke here, although I was serious about the proposal). To save time I'll point out I never got beyond the earliest steps because there is a vaguely infinite pool of clickbaitable English speakers on the turk, but the pool of bilingual Arabic (or whatever) speakers with good taste in pro-usa propaganda is extremely small, so the tech side was easy to scale but the mandatory human side simply couldn't scale enough to make the output realistically anything but a joke.

rlu · on Oct 13, 2015

> The training converges after a few days of number crunching on a GTX980 GPU. Let’s take a look at the results.

Stupid question: why is the GPU important here? I would have thought this was more of a CPU task..??

(then again, as I typed this I remembered that bitcoin farming is supposed to be GPU intensive so I'm guessing the "why" for that is the same as this)

unoti · on Oct 13, 2015

A lot of this kind of work ends up being repetitive, like multiplying two matrices together that have a few thousand entries each. These are the sorts of things that GPUs do very well, because they can do them on a massively parallel scale. GPUs also tend to have more memory bandwidth, which helps with the kinds of things where a CPU would get bogged down in its memory cache.

soggypretzels · on Oct 13, 2015

GPUs are really good at parallel tasks such as calculating the color of every pixel on the screen, or doing the same operation on a large dataset. According to Newegg, the GTX980 has 2048 CUDA cores (parallel processing cores) that run at ~1266 MHz, as opposed to a nice CPU which might have 4 cores that run at 4 GHz. In other words, if you want to manipulate a whole bunch of things in one way in parallel, you can program it to use the GPU effectively; if you want to manipulate one thing a whole bunch of ways in series, the CPU is your best bet.

(note: this is massively oversimplified)
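To make the two points above concrete, here is a small sketch. The first part shows why matrix multiplication parallelizes: each output cell depends only on one row and one column, so the cells can be computed in any order (or all at once). The second part is just the arithmetic on the core counts and clock speeds quoted above; it compares raw core-cycles, not real-world speed, and the 2x2 matrices are made up for the example.

```python
import random


def cell(a, b, i, j):
    """One output entry: depends only on row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_any_order(a, b, rng):
    """Fill in the product one cell at a time, in a shuffled order."""
    rows, cols = len(a), len(b[0])
    order = [(i, j) for i in range(rows) for j in range(cols)]
    rng.shuffle(order)  # no cell needs another cell's result, so order is irrelevant
    out = [[0] * cols for _ in range(rows)]
    for i, j in order:
        out[i][j] = cell(a, b, i, j)
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul_any_order(a, b, random.Random(0))  # [[19, 22], [43, 50]]

# Raw core-cycle arithmetic from the figures quoted in the comment above.
gpu_core_mhz = 2048 * 1266  # 2,592,768 core-MHz
cpu_core_mhz = 4 * 4000     # 16,000 core-MHz
ratio = gpu_core_mhz / cpu_core_mhz  # about 162
```

The ratio of roughly 162 is an upper bound on paper only. A GPU core does far less per cycle than a CPU core, and memory transfers and problems that don't parallelize eat into it quickly.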

semi-extrinsic · on Oct 13, 2015

Coarse rule-of-thumb: running on GeForce class GPUs you can get up to 5x, maaaybe 10x the performance per dollar as compared to a top-line CPU, assuming your problem scales well on GPUs; many problems don't. The GTX980 is actually a great performer. For Tesla class systems like the K40 it's a lot closer to equal with the CPU on performance/$ (they're not much faster than the GTX980 but a lot more expensive). But you can get an edge with the Teslas when you start comparing multi-GPU clusters to multi-CPU clusters, since with GPUs you need less of the super-expensive interconnect hardware. (You're not going to put GTX cards in a cluster, you'd have massive reliability problems.)

IMHO, the guys showing 100x speedups on GPUs are Doing It Wrong; they use a poor implementation on the CPU, use just one CPU core, consider a very synthetic benchmark, or a bunch of other tricks.

imaginenore · on Oct 13, 2015

Getting this error:

    Error: 500 Internal Server Error

    Sorry, the requested URL 'http://clickotron.com/' caused an error:

    Internal Server Error
    Exception:

    IOError(24, 'Too many open files')
    Traceback:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 862, in _handle
        return route.call(**args)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 1732, in wrapper
        rv = callback(*a, **ka)
      File "server.py", line 69, in index
        return template('index', left_articles=left_articles, right_articles=right_articles)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3595, in template
        return TEMPLATES[tplid].render(kwargs)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3399, in render
        self.execute(stdout, env)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3386, in execute
        eval(self.co, env)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 189, in __get__
        value = obj.__dict__[self.func.__name__] = self.func(obj)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3344, in co
        return compile(self.code, self.filename or '<string>', 'exec')
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 189, in __get__
        value = obj.__dict__[self.func.__name__] = self.func(obj)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3350, in code
        with open(self.filename, 'rb') as f:
    IOError: [Errno 24] Too many open files: '/home/ubuntu/clickotron/views/index.tpl'

striking · on Oct 13, 2015

That'll teach you to do disk I/O on every page render.

DanielBMarkham · on Oct 13, 2015

Yep. Have a separate process cache a few up and simply cp over the active one to be served as a big bag of immutable bits. Bonus points for using a CDN.
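A minimal sketch of that pre-render-and-swap approach, assuming a background job calls it periodically. `render_index` is a hypothetical stand-in for the site's real template rendering, and the file layout is made up; the web server (or a CDN) then serves the finished file as static content, so no template is opened per request.

```python
import html
import os
import tempfile


def render_index(headlines):
    """Hypothetical stand-in for the site's real template rendering."""
    items = "\n".join(f"<li>{html.escape(h)}</li>" for h in headlines)
    return f"<ul>\n{items}\n</ul>\n"


def publish(page, live_path):
    """Write the new page next to the live one, then swap it in with a rename."""
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(page)
        # os.replace overwrites the destination in a single step, so readers
        # see either the old file or the new one, never a half-written page.
        os.replace(tmp_path, live_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temporary file is created in the same directory as the live file on purpose: a rename is only a cheap, atomic swap when both paths are on the same filesystem.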

juddlyon · on Oct 13, 2015

I can't stop laughing at these. Check out the Click-o-tron site: http://clickotron.com/

tylerpachal · on Oct 14, 2015

My favorite: "residents can't remember if they lost their wine at the same time." [1]

[1] http://clickotron.com/article/5588/residents-cant-remember-i...

flashman · on Oct 14, 2015

I used a simpler technique (character level language modelling) to come up with an Australian real estate listing generator: http://electronsoup.net/realtybot

This is pre-generated, not live, for performance reasons. There are a few hundred thousand items though, so the effect is similar.

The data source is several tens of thousands of real estate listings that I scraped and parsed.

OhHeyItsE · on Oct 13, 2015

This is simply brilliant.

(Ranking algorithm baked into a stored procedure notwithstanding. [ducks])

neikos · on Oct 13, 2015

I am not sure how much I would give credit to the idea that the neural network 'gets' anything as it is written in the article.

> Yet, the network knows that the Romney Camp criticizing the president is a plausible headline.

I am pretty certain that the network does not know any of this and instead just happens to be understood by us as making sense.

notahacker · on Oct 13, 2015

Life Is About A Giant White House Close To A Body In These Red Carpet Looks From Prince William’s Epic ‘Dinner With Johnny'

from the article would be a good counterexample of the neural network "getting" anything.

If you're an algorithm "White House", "Prince William" and "Dinner With Johnny" is to "Red Carpet" as "Romney" is to "Camp" and "Bad President".

andrewtbham · on Oct 13, 2015

tldr; guy uses rnn lstm to create link bait site.

hopes crowd sourcing will filter out non-sense.

http://clickotron.com/

eb0la · on Oct 13, 2015

Site down. Did HN readers crash the server? Everything old is new again (slashdot effect)?

chipgap98 · on Oct 13, 2015

"Tips From Two And A Half Men : Getting Real" is great. Some of the generated titles are incredible.

billconan · on Oct 14, 2015

I can't understand the first two-layer RNN, which according to the author optimized the word vectors.

it says:

During training, we can follow the gradient down into these word vectors and fine-tune the vector representations specifically for the task of generating clickbait, thus further improving the generalization accuracy of the complete model.

how do you follow the gradient down into these word vectors?

if word vectors are the input of the network, don't we only train the weight of the network? how come the input vectors get optimized during the process?
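One way to picture it, as a toy sketch rather than the article's actual network: the word vectors live in a lookup table that is itself a set of parameters, so backpropagation can continue one step past the network weights into the row that was looked up. The two-word table, the single logistic unit, and all the numbers below are assumptions made up for illustration.

```python
import math

# Lookup table of word vectors. In this sketch these are trainable parameters,
# initialized from made-up values (in practice, from pretrained vectors).
embeddings = {
    "obama": [0.1, -0.2, 0.3],
    "kardashian": [0.0, 0.4, -0.1],
}
# "Network" weights: a single logistic unit, standing in for the whole RNN.
weights = [0.5, -0.3, 0.8]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(word, label, lr=0.1):
    """One gradient step that updates both the weights and the word's vector."""
    e = embeddings[word]
    p = sigmoid(sum(w_k * e_k for w_k, e_k in zip(weights, e)))
    dz = p - label  # gradient of the log loss with respect to the pre-activation
    grad_w = [dz * e_k for e_k in e]        # usual gradient for the weights
    grad_e = [dz * w_k for w_k in weights]  # same chain rule, taken w.r.t. the input
    for k in range(len(weights)):
        weights[k] -= lr * grad_w[k]
        e[k] -= lr * grad_e[k]  # changes the row in the lookup table in place
    return p
```

The gradient with respect to the input has the same form as the gradient with respect to the weights, with the roles swapped. Only the rows for words that actually appeared in the training example get updated; every other word vector is left alone.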

alkonaut · on Oct 13, 2015

Missed opportunity for HN headline.

This program generates random clickbait headlines. You won't believe what happens next. You'll love #7.

indiv0 · on Oct 14, 2015

Reminds me of Headline Smasher [0].

Some pretty fun ones there but it doesn't use RNNs. It just merges existing headlines.

[0]: http://www.headlinesmasher.com/best/all

kidgorgeous · on Oct 13, 2015

Great tutorial. Been looking to do something like this for a while. Bookmarked!

petrey · on Oct 13, 2015

I think this one is my favorite:

Life Is About — Or Still Didn’t Know Me

JorgeGT · on Oct 13, 2015

The "top" article in "clickotron.com" is "New President Is 'Hours Away' From Royal Pregnancy" :)

CephalopodMD · on Oct 13, 2015

Your main site is down. Bottle can't handle serving files scalably or something? Point is, it broke.

lars · on Oct 13, 2015

That was exactly the problem, bottle+gevent serving static files. It's moved behind nginx now. (But you might have to wait for a DNS propagation before you get to the new server.)

hilti · on Oct 13, 2015

Interesting blog post, but site is down. How much traffic do you get from HN?

joshdance · on Oct 13, 2015

500 Internal Server Error on the site where you could upvote em.

lars · on Oct 13, 2015

Working on it:) It's getting a bit more traffic than expected at the moment.