Disclaimer: This blog has nothing to with politics and has been written just out of academic interest.
From quite some time, I wanted to write a blog about using Python to analyze real user data to brush up my concepts and explore new topics.
Coincidentally, last week, 600 million Indians voted to elect Mr. Narender Modi as Prime Minister for second consecutive time, and Mr. Modi is the most followed politician on Twitter in India and has effectively used Twitter to communicate with voters. Hence, his Twitter profile could be a rich source of data ripe for analysis. Plus, analyzing a politician’s statements is fun!
That’s why I decided to analyze his profile for this project. This blog would be helpful for someone who is looking for a simple data analytics project based on real data.
Before diving into details, here are the key parts of the projects:
- Create a developer account on Twitter
- Use Tweepy to scrap the Tweets
- Create a Panda DataFrame
- Statistical analysis of Tweets
- Sentiment analysis
- Word frequency analysis
- Topic Modeling
Create a developer account on Twitter
I just had to register and answer a few questions on the Twitter Developer website. The approval from Twitter came in around 2–3 hours.
We need the following information from your developer account: consumer_key, consumer_secret, access_key, and access_secret.
Use Tweepy to scrap the Tweets
Tweepy is easy-to-use Python library for accessing the Twitter API.
First, let’s import all the libraries we will be using.
import tweepy import pandas as pd import numpy as np from IPython.display import display import matplotlib.pyplot as plt import seaborn as sns from textblob import TextBlob import re import warnings warnings.filterwarnings('ignore') %matplotlib inline
Next, let’s save all the Twitter credentials. I have obviously hidden mine. It’s a good idea to create a separate file to store a credential, but I have used in the same file.
#It's not a good practice to include the keys in the same code, as we have #to display. However, I am lazy consumer_key = "XXXXXXXXXXXXXXXXX" consumer_secret = "XXXXXXXXXXXXXXXX" access_key = "XXXXXXXXXXXXXXXXXX" access_secret = "XXXXXXXXXXXXXXXXXX"
Next, we iterate to extract the Tweets scrapping 200 tweets at a time. In doing this analysis, I have taken inspiration from Rodolfo Ferro blog sentiment analysis of Trump’s tweets.
Create a Panda DataFrame
Let’s create a Panda DataFrame, which will help in analyzing the Twitter data. We also need to understand the structure of the data we have downloaded.
Next, we will add these relevant parameters to the DataFrame.
Statistical Analysis of Tweets
This section is relatively straight forward. We want to figure the simple statistics such as mean length of the tweet, most liked Tweet, and trend of Likes and Retweets over the years.
Next, let’s look at the trend of Likes and Retweets over the course of the last 2 years. I haven’t explored the limit Twitter puts on general access to API, but I believe it limits to 3200 tweets. That’s why we have just last two years of data.
In order to plot the trend, we will create Panda Series and then plot. Even in this short time period, we can observe an upward trend in terms of Likes and Retweets.
Let’s create some fun graphs.
First, we will create a correlation plot to understand characteristics of liked and retweeted tweets.
Some unsurprising learnings are - RTs and Likes are highly correlated. Some more insightful learning include:
- Medium length tweets result in more retweets and likes for Mr. Modi
- This is controversial. Tweets of Mr. Modi with negative sentiment often get more RTs and Likes. However, it may have to do with expressing regret over unfortunate events.
Second, let’s plot the sentiments, Likes, and RTs on a scatter plot to recheck our hypothesis.
This again points that tweets with negative sentiments getting more RTs and Likes. However, we haven’t done extensive cleaning and Indian English (with the usage of Hindi words) could have influenced these results.
For Sentiment Analysis, we will use TextBlob. It’s a Python Library that offers simple API access to perform basic natural language processing tasks.
TextBlob works just like Python Strings and hence can be similarly used to transform data.
Output → TextBlob(“ji”)
We would perform sentiment analysis in 3 steps - cleaning of Tweet’s text, Tokenization, and sentiment analysis.
In analyzing the tweets, we just want to restrict ourselves to tweets posted by Mr. Modi. Although we have no way of identifying the tweets retweeted by Mr. Modi. We have one shortcut. We can ignore the tweets that have zero likes, as Twitter doesn’t designate “Likes” on retweeted Tweets to users who retweeted. Plus, it’s highly unlikely that any tweet from Mr. Modi would have zero “Likes”.
Let’s start with data cleaning step.
Next, we will do tokenization and sentiment analysis.
Word frequency analysis
One of the memorable initiatives of Mr. Modi for me was “Swachh Bharat”- an initiative to keep India clean. Let’s see how many times he talked about “Swachh Bharat” in these tweets.
Let’s first create a word cloud to visualize the relative frequency of usage of different words. I have created a simple word cloud, but you can go fancy and create an image-colored word cloud.
Interestingly, West Bengal and Odisa, where Mr. Modi’s party made surprising gains, have been mentioned more frequently.
I also created a frequency table in the old python way:
- Create a dictionary mapping word and frequency
- Then, store word and its frequency as tuples in a list
- Finally, sort the list and print result
Suppose you have all the speeches of Mahatma Gandhi in text form. You could use topic modeling to figure key themes in these speeches.
I came across a brilliant talk by Christine Doig at PyTexas on Topic Modelling and tried implementing in our case.
Essentially, in topic modeling, we use statistical models to discover abstract topics that occur in a large volume of unorganized text. It’s an unsupervised learning method. The picture below explains what we aim to achieve with Topic Modeling.
There are two ways to conduct topic modeling - LSA (Latent Semantic Analysis) and probabilistic Inference methods such as LDA (Latent Dirichlet Allocation). Below is a pictorial representation of the difference between LSA and LDA. In our case, we will use LDA.
Let’s implement LDA to figure themes in Tweets.
The output is an interactive graph, which I am not sure how to embed. However, you could find the complete code with output here.
I could see two clear themes - politics and gratitude - represented by 4 and 1 respectively. However, topic modeling wasn’t much use here. I will probably apply these on a collection of speeches.
That’s it from my side. I know we can do a lot more analysis here. Please comment to let me know what else I can explore.