# Tag Archives: regression

This uses the same data set I obtained from my NU Data Mining final project [summary].

Recently, @MLBcathedrals tweeted a photo I submitted to them:

I got a bunch of retweet/favorite notifications at first then I got fewer as the day went on. Now a month later, I’ll get a favorite or retweet [RT] notification every so often. The process of getting a retweet follows a Poisson process, where there is a discrete and somewhat small outcome that can be thought of as count data — you can count retweets per minute.

I used the tweets I just had lying around from my project and pulled out several collections of native RTs that had their first RT in the data set and a high volume of retweets. This was to ensure I had the first part of the tweet’s life and not just had captured it in the middle. Time is measured in seconds from the first retweet event. This simplifies things by giving each collection of a RTed tweets a time base relative to when RTing started.

The common time base enabled me to make this comparison chart of different RTed tweets:

Not every RT pattern is the same. Some have many more RTs, some take a little while to get momentum, but generally they start off strong, then slowly die out taking the shape of a logarithmic function. The total number of RTs over time is interesting, but this problem works better if we look at the rate of RTing. The reason why the RT pattern flattens out is that there is steadily decreasing RTing rate over time. This makes intuitive sense if you have ever used Twitter, people react to things as they happen then rarely go back to it.

It turns out you can mathematically model the RT rate with a Poisson generalized linear model reasonably well. The following three graphs show the actual RT rate data points as red dots, the expected value regression as the black line, and a probability range as blue bands.

The model for this particular RTed tweet is described by the equation:

$ln(E[Y|t]) = 2.980154 - 0.0017236*t$

$Y$ is the number of RT per minute. $t$ is time. And $E[Y|t]$ is read as the expected value of $Y$ given $t$, or what is the most likely number of RTs per minute at a given time. The constant [2.980154] represents the rate at t=1, and the negative regression coefficient [-0.0017236] indicates that rate will decline. This regression line represents the expected value, which is essentially an average of possible outcomes. Using the Poisson distribution and the expected value, I constructed a probability distribution showing a band where 50% of all data points should be located, and another band that should encompass 90% of them.

The bands, in my opinion, are more important than the regression line, because we are dealing with count data. So having an expected value of say 2.5342 doesn’t mean much if you don’t know the probability for getting a value of 0, 1, 2, 3, 4, etc. For this reason the last graph in the series of three has only the actual data points in the probable area bands. For each minute, the data point has a 50% chance to be in the dark blue region, a 40% chance to be in the light blue region [90%-50%], and a 10% chance to be in the white.

This is all predicated on RTing being a random process with a fixed audience. This described most of the RTs I looked at fairly well, but there will always be other factors such as viral growth and time of day. Viral growth means it starts off slow then grows large. If this were to happen, it would not follow this pattern; it would look more like an S. For better or worse, most RTs come from accounts with a large number of followers, so they aren’t actually viral, they are propagations of already popular tweets.

This specific regression by itself won’t predict how many RTs a tweet might get before it’s tweeted, but it describes what happens after people have begun retweeting.