With sites like Twitter and Facebook allowing users to share their thoughts, locations, and even their breakfasts at the touch of a button, social media sites are a treasure trove of user data. They allow companies to draw insight from the opinions of their consumer bases and allow researchers to perform real-time analysis of a given population’s reactions to major events happening across the globe.
With the introduction of applications allowing users to ‘geotag’ (or share their location on) their tweets, posts, and photos, social media sites have given analysts the data needed to find spatial patterns in social media data. Companies can now go a step further and group their consumer bases by location, and researchers can focus specifically on how tourists are being affected by the ongoing financial situation in Greece or how Americans reacted to the USA winning the Women’s World Cup.
While this newly available data can help analysts answer their spatially geared questions, many companies and researchers lack the tools to both collect and analyze geotagged tweets or posts. Therefore, I created an open-source Python toolbox[i] to be used in conjunction with ArcGIS to allow users to easily access and analyze geotagged tweets.
The toolbox comes with three built-in functions.
Figure 1: Toolbox Functions
- ‘get_tweets’ -- allows users to collect geotagged tweets on a user-specified topic and in a user-specified location. This function uses the Twitter API[ii] to ingest geotagged tweets, filter them based on the specified search criteria, and save them in a database.
- ‘sentiment’ -- allows users to calculate the sentiment of a tweet table, such as the table created in ‘get_tweets.’ This function uses the R qdap[iii] library to calculate the polarity of a given set of tweets. If the user does not select a grouping variable, then a column containing the sentiment of each tweet will be added to the original tweet table. If the user chooses to calculate sentiment based on factors such as ‘place’ or ‘query,’ then a new table will be created containing the sentiment score for each place or query specified by the user. Multiple grouping variables can be used.
- ‘topics’ -- allows users to perform a content analysis of the common word groupings, or ‘topics,’ that are found in a particular tweet table. This function uses the Python gensim[iv] module to calculate the topics of the collected tweets. By using Latent Dirichlet Allocation (LDA)[v] to group tweets into a set of topics, users can gain insight into the conversations that are happening within their query or location of interest.
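To make the two output modes of ‘sentiment’ concrete, here is a minimal sketch in plain Python. It is not the toolbox's code and it does not reproduce qdap's polarity algorithm: the tiny word lists, the simplified score (positive minus negative words, divided by the square root of the word count), the averaging within groups, and the row layout with ‘text,’ ‘place,’ and ‘query’ keys are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Tiny illustrative word lists (an assumption; qdap uses a far larger dictionary).
POSITIVE = {"good", "great", "proud", "love", "win"}
NEGATIVE = {"bad", "hate", "racist", "shame", "lose"}


def polarity(text):
    """Simplified stand-in score: (positive - negative words) / sqrt(word count)."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / math.sqrt(len(words))


def sentiment(tweets, group_by=None):
    """Score tweets individually, or average the scores within each group.

    Without group_by: return a copy of every row with a 'sentiment' value added.
    With group_by (a list of column names): return {group key: mean score}.
    """
    if not group_by:
        return [dict(t, sentiment=polarity(t["text"])) for t in tweets]
    groups = defaultdict(list)
    for t in tweets:
        key = tuple(t[col] for col in group_by)
        groups[key].append(polarity(t["text"]))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}
```

Calling `sentiment(rows)` mirrors the "new column" case, while `sentiment(rows, group_by=["place", "query"])` mirrors the "new table" case with multiple grouping variables.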
To get started, a user need simply install the toolbox, select the desired function, enter the required parameters, and hit ‘ok.’
On July 6, 2015, the South Carolina legislature (SCLD) met to decide on the future of the Confederate flag that hung on the state capitol’s grounds. Being a researcher of mass political behavior, I naturally was interested in seeing the public’s reaction to both the proceedings and the legislature’s decision, so I used this tool to get a quick snapshot of what people were saying about the event.
Figure 2: Get_Tweets
Figure 2 shows the parameters that were used to collect tweets about the SCLD. As you can see, users are able to include multiple queries and group designations to add further analysis options. For example, with the above queries, I can calculate the number of tweets each query appeared in or compare tweets that included #southcarolina to #confederateflag.
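As a rough illustration of the per-query comparison described above, the sketch below counts how many collected tweets mention each query term. The function name `count_by_query` and the assumption that each row is a dictionary with a ‘text’ key are hypothetical; the toolbox's actual table layout may differ.

```python
def count_by_query(tweets, queries):
    """Count the tweets whose text contains each query term (case-insensitive)."""
    counts = {q: 0 for q in queries}
    for tweet in tweets:
        text = tweet["text"].lower()
        for q in queries:
            if q.lower() in text:
                counts[q] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = [
    {"text": "Debate today #SouthCarolina #confederateflag"},
    {"text": "#confederateflag comes down"},
]
print(count_by_query(sample, ["#southcarolina", "#confederateflag"]))
```

A tweet that mentions several queries is counted once under each of them, so the counts can sum to more than the number of tweets.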
Once ‘ok’ is selected, a pop-up like the one in Figure 3 will appear showing the function’s progress. If there is an error, the error message will be printed here. The pop-up will notify you when the function has successfully completed, and then you can load and view the data in ArcGIS.
Figure 3: Running Pop-up
Figure 4 shows a snippet of data from my collection of tweets on the SCLD. Overall, a couple thousand tweets were collected between 11:16 am on July 6th and 10:45 am on July 7th[vi].
Figure 4: Collected Tweets
Once tweets are collected, a user can choose to either perform sentiment or topic analysis (or both) on the collected tweets.
Figure 5: Sentiment
Figure 6 shows the results of performing sentiment analysis on the collected tweets. The sentiment scores have been separated into five categories, with red being the most negative (anti-flag removal), yellow being neutral, and blue being positive (pro-flag removal). Since the sentiment scores are grouped by location, we can see that most cities within the United States appear to have neutral views, perhaps because media channels were tweeting about the event. Many cities showing positive sentiment appear scattered across the country, while only a handful of cities show overall negative sentiment.
Figure 6: Sentiment of Tweets
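For readers curious how scores might be split into five map categories, here is one possible sketch. The break points behind Figure 6 are not stated in this post, so the equal-width classes over a symmetric range, the `limit` parameter, and the class names are all assumptions; ArcGIS offers other classification methods that would give different breaks.

```python
# Hypothetical class names, ordered from most negative to most positive.
LABELS = ["most negative", "negative", "neutral", "positive", "most positive"]


def classify(score, limit=1.0, labels=LABELS):
    """Assign a score to one of five equal-width classes spanning [-limit, limit].

    Scores outside the range are clamped to the nearest end class.
    """
    clamped = max(-limit, min(limit, score))
    width = 2 * limit / len(labels)
    index = int((clamped + limit) / width)
    # A score exactly at +limit would index past the end, so cap it.
    return labels[min(index, len(labels) - 1)]
```

Using a range centered on zero keeps the middle class neutral, which matches the red/yellow/blue reading of the map.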
Last but not least, topics can be calculated to show the variations in conversations among the collected tweets.
Figure 7: Topics
The table above shows the most common words within the 7 topics. While some topics, like numbers 5 and 7, appear to be more focused on jobs than on the flag issue, most of the topics reflect the different conversations surrounding the Confederate flag debate in South Carolina. For example, topic 2 seems to be about the legislature vote, while topic 3 is more focused on the racism surrounding the Confederate flag. Figure 8 shows the spatial variance of the topics: topics 1, 2, and 3 are the most common and span the country, while other topics, such as number 7, appear localized to a specific region.
I appreciate your feedback on this tool and would love to hear about how you end up using it -- tweet me at @mindynico1e. If you want to find out more about data science at L-3, please tweet (@rheimann) or email Richard Heimann, Chief Data Scientist (email@example.com), and if you're interested in learning more about women in data science, please check out @WomenDataSci. Special thanks to the L-3 Data Tactics Data Science Team, ArcGIS, Python, and R.
[v] The number of topics is chosen by computing the Kullback-Leibler divergence for each value in a specified range of topic counts and selecting the lowest point in the dip after the local maximum as the optimum number of topics for the set of tweets.
[vi] The data collected and the databases created in this example can be found at https://github.com/DataTacticsCorp/ArcGIS_Tweets/tree/master/data