2 Highly Effective Ways to Estimate User Location in Social Media
The 140 characters in a tweet don’t leave much room for context. To understand a tweet, you often need to understand the who, what, and where behind it. The lab’s Soft-Boiled challenge spent some time looking at the "where," taking an automated approach to estimating the location of users or messages on Twitter (a problem referred to as geo-inferencing).
We found two general approaches to inferring locations on Twitter: based on the content of the messages and based on the structure of the social network. Both approaches use messages with known locations to estimate something about messages without a known location.
Soft-Boiled set out to implement one geo-inferencing method from each category in a scalable way. Our algorithms are implemented in Python, using Spark as a distributed computing environment. Python and Spark gave us an easy way to use the same code for building algorithms and for analysis. Our implementation is geared towards Twitter data, which unfortunately can’t be included alongside our code due to Twitter’s terms of service. You can use the Twitter API to pull a sample of data and explore the algorithms. Pull requests are always welcome! Now back to the methods:
One of the most interesting facts from early analysis of online social networks was that friends on these networks tend to be geographically close to each other. Using just this fact, you can start to estimate the location of users based on where their friends are. A survey of many network-based methods is described in this recent paper. One such approach is described below:
Every iteration ends with a list of users and the best estimate of their location. Users cannot be estimated if they don’t have a sufficient number of connections with known or estimated locations.
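To make the iteration concrete, here is a minimal single-machine sketch in plain Python. It is not the Soft-Boiled implementation (which runs on Spark). The function name `propagate_locations`, the `min_friends` threshold, and the use of a coordinate-wise median to combine friends' locations are all assumptions chosen for illustration.

```python
from statistics import median

def propagate_locations(edges, known, min_friends=2, iterations=3):
    """Iteratively estimate (lat, lon) for users from their located friends.

    edges: iterable of (user_a, user_b) friendship pairs (hypothetical input format)
    known: dict mapping user -> (lat, lon) for users with a known location
    """
    # Build an undirected adjacency map from the edge list.
    friends = {}
    for a, b in edges:
        friends.setdefault(a, set()).add(b)
        friends.setdefault(b, set()).add(a)

    estimates = dict(known)  # users with known locations seed the process
    for _ in range(iterations):
        updated = dict(estimates)
        for user, nbrs in friends.items():
            if user in known:
                continue  # never overwrite a known location
            located = [estimates[f] for f in nbrs if f in estimates]
            if len(located) < min_friends:
                continue  # too few located connections to estimate this user
            # Assumption: a coordinate-wise median as a simple,
            # outlier-tolerant way to combine friends' locations.
            updated[user] = (median(p[0] for p in located),
                             median(p[1] for p in located))
        estimates = updated  # each iteration ends with the current best estimates
    return estimates

known = {"a": (40.7, -74.0), "b": (40.8, -73.9), "c": (34.0, -118.2)}
edges = [("a", "d"), ("b", "d"), ("d", "e"), ("a", "e"), ("c", "f")]
estimates = propagate_locations(edges, known)
```

In this toy graph, `d` is located in the first iteration from `a` and `b`, `e` becomes estimable only in the second iteration once `d` has an estimate, and `f` is never estimated because it has a single located connection.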
Content-based methods rely on the words used by the user in the message and their profile rather than the structure of the social network. The simplest method would be to look at the language of the message and predict the most likely country for that language. That simple model would correctly predict the country of origin for a message with an accuracy of 97% for tweets in Japanese. Unfortunately, that simple model is only able to correctly predict the country of origin for 62% of tweets in English.
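The language baseline can be sketched in a few lines: count, for each language, which country its geotagged tweets most often come from, and predict that country. The function name and the toy training data below are hypothetical; the numbers are made up for illustration and are not the figures quoted above.

```python
from collections import Counter, defaultdict

def fit_language_baseline(labeled):
    """Map each language to its most common country and that country's share.

    labeled: iterable of (language_code, country_code) pairs taken from
    tweets with known locations (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for lang, country in labeled:
        counts[lang][country] += 1

    model = {}
    for lang, by_country in counts.items():
        country, n = by_country.most_common(1)[0]
        # The share is the baseline's accuracy on the training data for this language.
        model[lang] = (country, n / sum(by_country.values()))
    return model

# Toy data, invented for illustration only.
training = ([("ja", "JP")] * 9 + [("ja", "US")] * 1 +
            [("en", "US")] * 5 + [("en", "GB")] * 3 + [("en", "AU")] * 2)
baseline = fit_language_baseline(training)
```

A language concentrated in one country gets a high share, while a language spread across many countries gets a low one, which is the same contrast the Japanese and English figures show.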
Content-based methods take a variety of approaches. Some turn geo-inferencing into a classification problem that estimates a city or country, while others treat it as a regression problem and estimate latitude/longitude directly. These methods draw on many different machine learning techniques to make their predictions. We will discuss an approach that estimates latitude/longitude directly using Gaussian Mixture Models (GMMs). GMMs estimate the distribution of a variable as a set of Gaussian distributions. One interesting byproduct of the probabilistic nature of this model is that an estimate of confidence is built into the prediction. The GMM-based approach attempts to estimate the geographic distribution of all the words in a corpus, then locates a message by combining the distributions of its words.
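A rough sketch of the "combine the word distributions" step follows. It assumes the per-word GMMs have already been fit, represents each component as a hypothetical `(weight, x_km, y_km, sigma_km)` tuple in a flat kilometre grid rather than latitude/longitude, gives every known word equal weight, and approximates the most likely location by scoring each component mean. All of those choices are simplifying assumptions, not the method's actual details.

```python
import math

def gaussian_pdf(x, y, mx, my, sigma):
    """Density of an isotropic 2-D Gaussian at (x, y)."""
    d2 = (x - mx) ** 2 + (y - my) ** 2
    return math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

def message_mixture(words, word_gmms):
    """Pool the components of every known word, giving each word equal weight."""
    known = [w for w in words if w in word_gmms]
    return [(wt / len(known), mx, my, s)
            for w in known for (wt, mx, my, s) in word_gmms[w]]

def mixture_density(mix, x, y):
    """Density of the combined message-level mixture at (x, y)."""
    return sum(wt * gaussian_pdf(x, y, mx, my, s) for wt, mx, my, s in mix)

def estimate_location(mix):
    """Approximate the mixture's mode by scoring each component mean."""
    if not mix:
        return None  # no known words, so no estimate
    means = [(mx, my) for _, mx, my, _ in mix]
    return max(means, key=lambda p: mixture_density(mix, p[0], p[1]))

# Hypothetical, hand-written word models (not fit from real data).
word_gmms = {
    "subway": [(0.7, 0, 0, 50), (0.3, 4000, 0, 50)],  # two regional clusters
    "bodega": [(1.0, 10, 5, 30)],                     # one tight cluster
    "the":    [(1.0, 2000, 0, 1500)],                 # used everywhere: very diffuse
}
mix = message_mixture(["the", "subway", "bodega", "unseen"], word_gmms)
location = estimate_location(mix)
```

Geographically diffuse words such as "the" contribute almost no density at any single point, so the estimate is driven by the words with tight spatial distributions.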
Soft-Boiled created code that produces not only an estimated location but also a confidence (step 4 above) that a message is within some radius (e.g., there is an 80% chance the true location is within 100 km).
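One way to turn a mixture into a confidence of this kind is to estimate how much of its probability mass falls inside a circle around the predicted point. The sketch below does this by Monte Carlo sampling, which is an assumption made for illustration and not necessarily how Soft-Boiled computes it. Components are hypothetical `(weight, x_km, y_km, sigma_km)` tuples on a flat kilometre grid.

```python
import math
import random

def sample_mixture(mix, rng):
    """Draw one (x, y) point from a mixture of isotropic 2-D Gaussians."""
    r = rng.random() * sum(wt for wt, *_ in mix)
    for wt, mx, my, sigma in mix:
        r -= wt
        if r <= 0:
            break  # this component was selected
    return rng.gauss(mx, sigma), rng.gauss(my, sigma)

def confidence_within(mix, center, radius_km, n=20000, seed=0):
    """Estimate the probability mass of `mix` within radius_km of `center`."""
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    hits = 0
    for _ in range(n):
        x, y = sample_mixture(mix, rng)
        if math.hypot(x - center[0], y - center[1]) <= radius_km:
            hits += 1
    return hits / n

# Hypothetical mixture: most of the mass near the origin, the rest far away.
mix = [(0.8, 0, 0, 30), (0.2, 3000, 0, 30)]
confidence = confidence_within(mix, center=(0, 0), radius_km=100)
```

For this toy mixture the confidence at 100 km comes out close to 0.8, since nearly all of the nearby component (weight 0.8) lies inside the circle and the distant component lies entirely outside it. Sampling cost grows with the number of samples and components, which is one reason the confidence step can be expensive.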
In our testing, we found that the social network-based methods provided better accuracy than content-based methods. Additionally, network-based methods were much faster to run. Content-based methods were somewhat slower to run but were able to estimate locations for a much larger percentage of users. The primary costs in the content-based methods are building the GMMs and evaluating the probability mass covered by some radius for the confidence estimate.
We were able to create a hybrid algorithm where content-based methods are used for an initial estimate of users’ locations and then refined using a network-based method. This hybrid algorithm gave the ability to tune performance and coverage to match the application.
Ultimately, inferring location using either of the classes of algorithm in the literature—or a hybrid of the two—is feasible and can be performed in a scalable and performant manner.