November 22, 2010

Hashtag Analytics: Part 1 = Data Acquisition

We're going to try something new here, something that is trendy in analytics these days.

We're going to analyze hashtag behavior on Twitter.

Our analysis will focus on the #blogchat community, a Sunday evening social media discussion hosted by Mack Collier on Twitter.  Every Sunday night, the community discusses various topics of interest to its members.  Folks communicate via the #blogchat hashtag, so that everybody can follow what folks are saying to each other.

I collected data from about eighteen weeks of #blogchat events.  After analyzing the behavior of the community, I reduced the dataset to five weeks of behavior.  I used four weeks as the segmentation period, with one week as a prediction period.  Later in the analysis, we'll review eight weeks of data, using four weeks to predict the next four weeks.

Allow me to explain the variables I am tracking in my dataset.

The first variable is called "statement".  Here's what a statement looks like:
  • MineThatData:  I am really looking forward to #blogchat tonight!
In other words, the individual is communicating a statement to the entire audience.

The second variable is called "re-tweet".  This is a huge form of social currency.  The person issuing the re-tweet is giving another individual credit for saying something clever, or is trying to gain attention in some way.
  • MineThatData:  RT @mackcollier Bloggers really do a nice job of sharing interesting topics #blogchat.
The third variable is called "amplify".  This happens when a user adds to the statement offered by another individual.
  • MineThatData:  And they have unique opinions. RT @mackcollier Bloggers really do a nice job of sharing interesting topics #blogchat.
The fourth variable is called "converse".  Here, one user is having a conversation with another user.
  • MineThatData: @mackcollier Don't you think that bloggers could do a better job of being objective? #blogchat
The fifth variable is called "link".  Often, when a person makes a statement, the person links to another article to back up the statement.
  • MineThatData: We covered this topic on my blog last month: #blogchat
We count every tweet a user issues to create a sixth variable, called "tweets".
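
For readers who want to try this at home, here is a minimal sketch of how the five action variables might be assigned in code.  This is not the code I used.  The function name classify_tweet, the regular expressions, and the rule that a link is a flag that can sit on top of any of the other four categories are all assumptions made for illustration, and the example.com address is a placeholder.

```python
import re

# Assumed patterns: "RT @name" marks a re-tweet, "http(s)://" marks a link.
RT_PATTERN = re.compile(r"\bRT @\w+")
LINK_PATTERN = re.compile(r"https?://\S+")


def classify_tweet(text):
    """Return (action, has_link) for one tweet.

    action is one of "statement", "re-tweet", "amplify", "converse".
    has_link is True when the tweet contains a URL (assumed to be
    counted in addition to the action, not instead of it).
    """
    text = text.strip()
    match = RT_PATTERN.search(text)
    if match and match.start() == 0:
        action = "re-tweet"      # tweet opens with "RT @name"
    elif match:
        action = "amplify"       # user's own words come before "RT @name"
    elif text.startswith("@"):
        action = "converse"      # tweet is addressed to another user
    else:
        action = "statement"     # tweet is addressed to everybody
    return action, bool(LINK_PATTERN.search(text))


examples = [
    "I am really looking forward to #blogchat tonight!",
    "RT @mackcollier Bloggers really do a nice job of sharing interesting topics #blogchat.",
    "And they have unique opinions. RT @mackcollier Bloggers really do a nice job of sharing interesting topics #blogchat.",
    "@mackcollier Don't you think that bloggers could do a better job of being objective? #blogchat",
    "We covered this topic on my blog last month: http://example.com/post #blogchat",
]
results = [classify_tweet(t) for t in examples]
```

Real tweets are messier than this (old-style "via @name" credits, multiple mentions, shortened links), so treat the rules above as a starting point.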

These first six variables describe the actions a user might take.

The next two variables are very important.  These two variables account for feedback from other users in the community.

The seventh variable is called "RT".  Each time a user is "re-tweeted", I tally one for the user in the "RT" column.  In my earlier example, @minethatdata gets a value of "1" in the "re-tweet" variable.  @mackcollier gets a value of "1" in the "RT" variable, because his statement is re-tweeted.
  • MineThatData:  RT @mackcollier Bloggers really do a nice job of sharing interesting topics #blogchat.
The eighth variable is called "ANSW".  This is an important variable, because it means that the user is engaged in a conversation: another person in the community elected to answer the user directly.  In our earlier example, @minethatdata gets a value of "1" in the "converse" variable, while @mackcollier gets a value of "1" in the "ANSW" variable.
  • MineThatData: @mackcollier Don't you think that bloggers could do a better job of being objective? #blogchat
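
The two feedback variables can be sketched the same way.  The snippet below is an illustration under assumptions, not the code I used: the function name tally_feedback is hypothetical, user names are lower-cased so that @MackCollier and @mackcollier count as one person, and only tweets that open with "RT @name" or "@name" are credited.  The post does not say whether an amplified tweet also earns the original author an "RT", so that case is left uncounted here.

```python
import re
from collections import Counter

# Assumed patterns: credit goes to the name at the very start of the tweet.
LEADING_RT = re.compile(r"^RT @(\w+)")
LEADING_MENTION = re.compile(r"^@(\w+)")


def tally_feedback(tweets):
    """Tally RT and ANSW credits from (author, text) pairs.

    Returns two Counters keyed by lower-cased user name:
    rt[user]   = times the user was re-tweeted
    answ[user] = times another user addressed a tweet to the user
    """
    rt = Counter()
    answ = Counter()
    for _author, text in tweets:
        text = text.strip()
        match = LEADING_RT.match(text)
        if match:
            rt[match.group(1).lower()] += 1
            continue
        match = LEADING_MENTION.match(text)
        if match:
            answ[match.group(1).lower()] += 1
    return rt, answ


tweets = [
    ("MineThatData", "RT @mackcollier Bloggers really do a nice job of sharing interesting topics #blogchat."),
    ("MineThatData", "@mackcollier Don't you think that bloggers could do a better job of being objective? #blogchat"),
    ("MineThatData", "I am really looking forward to #blogchat tonight!"),
]
rt, answ = tally_feedback(tweets)
```

With the two example tweets from this post, @mackcollier ends up with one "RT" and one "ANSW", and @minethatdata ends up with neither.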
My dataset is generated on a weekly basis.  Each week, I categorize all #blogchat activity for each user participating in the community.

Let's assume that a user, called @user, issued the following tweets.
  • @user:  @person But don't you think that brands should be "all-in" in Social Media? #blogchat 
  • @user:  If big brands don't join the conversation, they're finished. #blogchat
  • @user:  What do you think are the three most important things a large brand should do first? #blogchat.
  • @user:  @person This link is very helpful. #blogchat. 
If this is all of the activity I can find for @user, then @user has the following profile for this week:
  • Statement = 2.
  • Re-Tweet = 0.
  • Amplify = 0.
  • Converse = 2.
  • Link = 2.
  • Tweets = 4.
  • RT = 0.
  • ANSW = 0.
The data set has one row per @user / week combination.
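
Here is one way the one-row-per-user-per-week layout might be assembled, assuming each tweet has already been labeled with its action and a link flag, and each piece of feedback has been labeled "RT" or "ANSW".  The function name build_rows, the tuple layouts, and the sample records are all hypothetical; the sample data is made up for illustration and is not taken from the #blogchat dataset.

```python
from collections import defaultdict

COLUMNS = ["statement", "re-tweet", "amplify", "converse",
           "link", "tweets", "RT", "ANSW"]


def build_rows(actions, feedback):
    """Aggregate labeled events into one row per (user, week).

    actions:  iterable of (week, user, action, has_link)
    feedback: iterable of (week, user, kind), kind is "RT" or "ANSW"
    """
    table = defaultdict(lambda: dict.fromkeys(COLUMNS, 0))
    for week, user, action, has_link in actions:
        row = table[(user, week)]
        row[action] += 1       # statement / re-tweet / amplify / converse
        row["tweets"] += 1     # every tweet counts once toward the total
        if has_link:
            row["link"] += 1   # link is counted on top of the action
    for week, user, kind in feedback:
        table[(user, week)][kind] += 1
    return [{"user": user, "week": week, **table[(user, week)]}
            for (user, week) in sorted(table)]


# Made-up sample records for two hypothetical users.
actions = [
    (1, "example_user", "converse", False),
    (1, "example_user", "statement", True),
    (1, "example_user", "statement", False),
    (2, "example_user", "re-tweet", False),
]
feedback = [
    (1, "example_user", "RT"),
    (1, "other_user", "ANSW"),
]
rows = build_rows(actions, feedback)
```

Note that a user who only receives feedback in a given week still gets a row, with zeros in all of the action columns.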

And when you have data formatted in this manner, you can make magic happen!  Tomorrow, we begin to explore the magic behind the #blogchat community. 


  1. Hi Kevin,

    Are you assigning your variables in an automated way or through some manual graft?

    Keen to read more, looking forward to part 2

  2. Old school, by hand, with good 'ole fashioned programming code.

