A coder found “Brunch Point”—”the exact time of day in which brunch maximally occurs”—using statistics

Dig in.
Dig in.
Image: Erin_Can_Spell/Flickr via Creative Commons 2.0
We may earn a commission from links on this page.

Background

I created isitbrunchtimeyet.com a few years ago as a joke—one very much inspired by isitchristmas.com and isitlunchtimeyet.com. It has since been used mostly as a funny and/or slightly passive aggressive way to ask people to brunch. When I created it I arbitrarily set the brunch time to be 10:15am to 11:45am. It is that decision which leads us here—I want to do better than arbitrary.

Hypothesis

Twitter is a platform that allows users to “get real-time updates about what matters to you.” If we assume people generally tweet about things while they are doing those things then we can assume tweets containing “brunch” are generally happening while that person is actually having brunch. Therefore, if we collect enough tweets over a long enough period of time and analyze the time at which they were tweeted we could infer the specific time range in which brunch falls.

The data

We begin by using the Twitter Streaming API. This API allows us to subscribe to search terms, for example “brunch,” and get any tweet matching that term sent to our program in real-time. Not only did we collected “brunch” tweets but we also collected tweets containing “breakfast,” “lunch,” and “dinner” to use as controls (which we will review later). We allowed the program to run from 6/1/2015 to 5/31/2016 which yielded 100 million+ tweets for analysis. Twitter is global platform, so we have to do some additional work to understand the time of day a specific tweet is occurring. At the time of tweet we analyzed the timezone of the person tweeting, and any attached geolocation data (when available). Using this data we then made an informed estimate of the localized hour for each tweet.

As a result, we are able to break down a count of tweets by localized hour:

tweets per hour
Tweets containing the term brunch with the hour localized based on inferred timezone of the tweeter.
Image: Ben Jacobson

Solution 1: The naive solution

Simply looking at the histogram above one will quickly notice that 11am is the most popular hour to tweet about brunch—which therefore means 11am to 12pm must be brunch time, right?

naive solution graph
Image: Ben Jacobson

Well, that is a perfectly reasonable conclusion to draw. However, we think there is room for improvement. First, this solution lacks any real fidelity. It would seem unlikely that brunch starts exactly at 11am and ends exactly at 12pm. Second, if we look at the histogram again you can see that activity in 12pm is very close to 11am. This informs us that a lot of brunch activity takes place in the hours following 11am that this solution doesn’t take into account.

Solution 2: Probability density function

Since we are visualizing a histogram, it is logical for us to jump to a distribution function for further analysis. And since our data has a positive skew we will look to a probability density function (PDF). To start, we calculate the lognormal distribution PDF (the red line below) and locate the maximum point of that function (the mode). We chose the lognormal over the normal distribution because it appeared to fit the data better. We then identify the range in which a large portion, let’s say 1/4 (25%), of all tweets occur centered around that maximum:

Probability Density Function graph
Image: Ben Jacobson

This strategy gives us a brunch time of 9:25am to 12:24pm. However, this strategy has a few drawbacks. First, the 25% occurrence is an arbitrary value. It’s no better than 30% or 20% or any other number. Second, the lognormal PDF heavily depends on how we think about the x-axis. In this example we start at 3am, which we only chose because it generally had the lowest tweet count of our data set (otherwise it would like slightly bimodal). If we instead started the x-axis at, say 12am, we would get a different curve. Third, looking at the graph it feels like it has a similar issue with our first solution in that it isn’t really capturing the long tail of activity in the early afternoon. Each of these arbitrary assumptions are unacceptable tradeoffs in our quest for an objective answer.

Solution 3: Spline interpolation

Splines allow us to create a smooth curve through many different points. Let’s make a curve through each hour and graph it:

spline graph
Image: Ben Jacobson

If we look at this curve we can see the maximum aligns very nicely with where we would intuitively assume brunch is occurring based on the histogram. Let’s give this maximum a name, how about Brunch Point? Brunch Point can be defined as: the exact time of day in which brunch maximally occurs. Based on our data that time is 11:56am.

Brunch Point — the exact time of day in which brunch maximally occurs

brunch point graph
Image: Ben Jacobson

Now, that appears to be an excellent answer! However, we are looking for a time range, not just a point in time. So how can we calculate a range from the spline? Remember calculus from high school? This is the real-life-use case for calculus that was promised by your professors! Let’s look at the derivative:

brunch graph showing derivative
Image: Ben Jacobson

Don’t have your calculus books handy? What is a derivative? The derivative of a function will give us the rate of change (a.k.a. the slope, a.k.a. the m in y = mx + b) of a function at any given point in time. It helps us understand the acceleration of a function. Let’s find the tangent lines with the highest and lowest slope:

tangent line graph
Image: Ben Jacobson
other tangent line graph
Image: Ben Jacobson

At that maximum slope, the number of brunch tweets is experiencing its highest positive rate of change: people are speeding up their tweets about brunch—pushing hardest on the gas pedal. At the minimum slope, the number of brunch tweets is experiencing its highest negative rate of change: people are slowing down their tweets—pushing hardest on the brake pedal. Using these two points we get a brunch time of: 10:01am to 1:40pm.

We like this method the most for several reasons:

  • It appears to capture the “gut feeling” we had in our first solution—it feels correct when you visualize it.
  • It uses a spline which will remain very consistent even if the x-axis is translated.
  • It is based on concepts that we are very familiar with—acceleration and deceleration.

But how do we know if it’s correct? Let’s apply the solution to other terms and observe how it performs.

Comparing data

For comparison let’s look at the splines for “breakfast” and “lunch” tweets:

breakfast vs lunch graph
Image: Ben Jacobson

Since “breakfast” and “lunch” are more popular terms on twitter than “brunch” let’s modify these curves to be based on frequency—which makes it easier to see the distinctions between them. Then let’s also apply the solution above to get a sense of how they compare:

graph comparisons
Image: Ben Jacobson

Finally, we have some results:

  • Breakfast: 7:50am to 11:55am.
  • Brunch: 10:01am to 1:40pm.
  • Lunch: 11:40am to 2:25pm.

So that’s it—brunch is officially from 10:01am to 1:40pm. It is interesting that the end of breakfast and the beginning of lunch are within 15 minutes of each other. It’s also interesting that Brunch Point falls in that same time frame as well. Both of which give this solution at least a little bit of affirmation.

Now go eat some brunch!

tl;dr—Brunch is officially from 10:01am to 1:40pm, based on 100 million+ tweets and analysis.

All analysis and charts were done in python using pandas/matplotlib. Source code: https://github.com/bjacobso/brunch

This post originally appeared at Medium.