This is an expansion of an analysis I will be writing on machine learning and senators' tweets. I have linked the GitHub page with the code for this project, in case you would like to try it yourself or walk through the code.
To scrape the tweets from the senators, I used this list from the 115th Congress. I then wrote a Python script that verified each Twitter handle was correct by checking for a ‘200’ response. A couple of senators had changed their handles since the list was published, specifically Bernie Sanders of Vermont and Cory Gardner of Colorado. Next, I used this Stack Overflow thread as a guide to collect as many tweets from each senator as possible with my free API access. The collection script created a new CSV for every senator, recording the tweet text, the time of the tweet, and the senator's name. I also paused for 60 seconds between loops to make sure I did not exceed Twitter's API rate limit. After all tweets had been collected, there were 299,200 tweets across 100 CSVs.
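The handle-verification step can be sketched roughly as follows. The function names and the one-second pause here are my own illustration, not the original script (the original pauses 60 seconds between tweet-collection loops):

```python
import time
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def url_for(handle: str) -> str:
    """Build the public profile URL for a Twitter handle."""
    return f"https://twitter.com/{handle.lstrip('@')}"

def verify_handles(handles, pause=1.0):
    """Return the handles whose profile pages answer with HTTP 200."""
    live = []
    for handle in handles:
        req = Request(url_for(handle), headers={"User-Agent": "Mozilla/5.0"})
        try:
            if urlopen(req).status == 200:
                live.append(handle)
        except HTTPError:
            pass  # a 404 here suggests the handle has changed
        time.sleep(pause)  # stay polite and under rate limits
    return live
```

A handle that raises a 404 is exactly the case the author hit with Sanders and Gardner: the listed handle no longer resolves.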
Once the tweets were collected, I started to work on the codebook that would define the sentiment of each tweet. I came up with these rules:
To keep each stage of the machine learning process smooth and quick, I collected ten random tweets from every senator and shuffled them into a training file. This gave me a training file with 1,000 examples to use for the supervised machine learning process. At one point I tried using Yelp/Amazon reviews as training data, but they were not very effective for this format. When the labeling was complete, there were 458 positive tweets, 325 neutral tweets, and 190 negative tweets.
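In pandas, the per-senator sampling described above can be sketched like this (the column names and the seed are assumptions on my part):

```python
import pandas as pd

def build_training_sample(tweets: pd.DataFrame, per_senator: int = 10,
                          seed: int = 42) -> pd.DataFrame:
    """Draw a fixed number of random tweets from each senator, then shuffle."""
    sample = (tweets.groupby("senator", group_keys=False)
                    .sample(n=per_senator, random_state=seed))
    # Shuffle so the training file is not ordered by senator.
    return sample.sample(frac=1, random_state=seed).reset_index(drop=True)
```

With 100 senators and `per_senator=10`, this yields the 1,000-row training file.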
The initial machine learning model was scikit-learn's Bernoulli naive Bayes classifier (BernoulliNB). As a test, I used the full 1,000-tweet labeled file I had created to classify the other 298,200 tweets. This process took about 30 minutes on the Google Colab system I was using. When all was said and done, the model was about 60% accurate. While this isn't the best prediction, keep in mind that most tweets are roughly 70 characters, and by the nature of Twitter many tweets are highly nuanced.
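A minimal sketch of a BernoulliNB text classifier in scikit-learn, using toy labels in place of the real training file. The exact vectorizer settings of the original are not stated; BernoulliNB models word presence/absence, hence the binary counts here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the hand-labeled tweets.
texts = ["great bill for families", "terrible vote today",
         "proud of this bipartisan work", "disappointed in the outcome",
         "honored to serve", "this policy fails workers"] * 20
labels = ["pos", "neg", "pos", "neg", "pos", "neg"] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# Binary bag-of-words features feeding a Bernoulli naive Bayes model.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

On real tweets the held-out accuracy is of course far lower than on this toy data, as the 60% figure above shows.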
To try a different model, I used scikit-learn's LinearSVC. It ran faster than the naive Bayes model and finished the analysis in 20 minutes on the same Google Colab system. However, it only reached 50% accuracy in the sentiment assignment.
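Swapping in LinearSVC keeps the same pipeline shape. Pairing it with a TF-IDF vectorizer is a common scikit-learn recipe, though whether the original used TF-IDF or raw counts is not stated:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the labeled tweets.
texts = ["great bill for families", "terrible vote today",
         "proud of this bipartisan work", "disappointed in the outcome"] * 10
labels = ["pos", "neg", "pos", "neg"] * 10

# TF-IDF features feeding a linear support vector classifier.
svc = make_pipeline(TfidfVectorizer(), LinearSVC())
svc.fit(texts, labels)
```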
Overall, the sentiment of the tweets did not deliver much actionable information, so I chose to run some further, more interesting analyses on the data.
When we look at the Democrat/Republican divide, the machine learning process is much more effective. Creating a training file for this was easy: I took the 299,200 tweets and added demographic information for each senator with a couple of quick Excel VLOOKUPs. This gave me each senator's birth year, state, and political party.
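In pandas, the VLOOKUP step corresponds to a left join; the column names and values below are hypothetical:

```python
import pandas as pd

tweets = pd.DataFrame({"senator": ["Senator A", "Senator B", "Senator A"],
                       "tweet": ["tweet 1", "tweet 2", "tweet 3"]})
demographics = pd.DataFrame({"senator": ["Senator A", "Senator B"],
                             "birth_year": [1952, 1960],
                             "state": ["VT", "CO"],
                             "party": ["I", "R"]})

# Left join: every tweet keeps its row and picks up the senator's demographics.
merged = tweets.merge(demographics, on="senator", how="left")
```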
Using the LinearSVC model, the political party behind a tweet was predicted with 80% accuracy. There was an issue with predicting Independent tweets, since this set had only one official Independent.
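scikit-learn's `classification_report` makes this kind of class imbalance visible. The labels below are fabricated purely to illustrate the failure mode: with almost no Independent examples, the model never predicts "I", so that class gets zero recall even when overall accuracy looks fine.

```python
from sklearn.metrics import classification_report

# Hypothetical true vs. predicted party labels.
y_true = ["D", "R", "D", "R", "I", "D", "R", "D"]
y_pred = ["D", "R", "D", "R", "D", "D", "R", "R"]

# zero_division=0 silences warnings for the never-predicted "I" class.
print(classification_report(y_true, y_pred, zero_division=0))
```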
I then looked at how well the model predicted the generation of a tweet's author. For this, I added a column to the file showing which generation each senator was born into. This analysis with the LinearSVC model was not as impressive as the Democrat/Republican split (71% accuracy), but it was interesting nonetheless.
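Deriving the generation column from birth year might look like this; the boundary years are my assumption, not taken from the original file:

```python
def generation(birth_year: int) -> str:
    """Map a birth year to a named generation (boundaries assumed)."""
    if birth_year <= 1945:
        return "Silent"
    if birth_year <= 1964:
        return "Boomer"
    if birth_year <= 1980:
        return "Gen X"
    return "Millennial"
```

Applied to the merged file, this would be something like `df["generation"] = df["birth_year"].map(generation)`.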
Finally, I looked at how the state affected the tweets. Since there are only two senators per state, this analysis was not the strongest, but I thought it would be interesting. Overall, it predicted correctly about 52% of the time, except for senators from Indiana, Iowa, and North Dakota (~70%).
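Per-state accuracies like the ones quoted above can be computed with a groupby over a correctness flag; the rows here are made up for illustration:

```python
import pandas as pd

# Hypothetical per-tweet results: 1 if the state was predicted correctly.
results = pd.DataFrame({"state": ["IN", "IN", "IA", "ND", "CA", "CA"],
                        "correct": [1, 1, 1, 1, 0, 1]})

# Mean of the 0/1 flag per state is that state's accuracy.
per_state_accuracy = results.groupby("state")["correct"].mean()
```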
All in all, machine learning is not the be-all and end-all for this kind of analysis. There may be potential in training material that is not directly related to Twitter, but more research would be needed on which text source is most accurate for this specific context.