“Nowcasting” County Unemployment Using Twitter Data

Megha Srivastava
Department of Computer Science
Stanford, CA 94305
[email protected]

Thao Nguyen
Department of Computer Science
Undergraduate Student
Stanford University
Stanford, CA 94305
[email protected]

Abstract

Unemployment has negative effects on both the financial and mental health of the unemployed. Social media and web searches can provide valuable insights into national unemployment conditions. However, nonprofit organizations and policymakers combating local unemployment must rely on less timely county-level statistics. We combine geolocated Twitter data with data from the Bureau of Labor Statistics to “nowcast” (predict the present) unemployment rates and mass layoff events. Because relatively few tweets in a county-month discuss unemployment directly, we use deep learning techniques to uncover hidden relationships between tweet content and local unemployment conditions. We compare different deep learning models and parameters, and achieve a root mean squared error of 1.08 with our best-performing model.

1 Introduction

Although unemployment statistics in the United States are typically accessed and analyzed at the national level, the emotional and communal effects of unemployment are better addressed by investigation at the county level. A study in the American Journal of Public Health demonstrated significant health differences between the unemployed and the employed, including an increase in physician visits and medications among the unemployed, many with diagnoses similar to those of employed men. The increased need for family and community support also demonstrates the importance of timely access to local unemployment statistics. Providing county-level officials with more current information on local unemployment trends could result in more prompt and effective community responses that alleviate the negative social and personal effects of unemployment.

We use Twitter as a source for county-level data.
Twitter is an online social networking platform with over 319 million active users sending “tweets”, messages capped at 140 characters each. Even limiting tweets to those containing the keyword “job” yields a wide range of content, such as:

1. “#Nursing #Job alert: Patient Care Assistant ‐ Cardiology Job | Froedtert Health | #MenomoneeFalls, WI” [Milwaukee, WI, Jan. 2015]
2. “Dear 2015 u've been good to me! 1).Gave me 2 Nephews! 2).1 of my dream jobs(Saks) 3).Introduced me to Love #ThankYou... 2016 will be GREAT” [Chicago, IL, Jan. 2016]

Twitter thus makes it possible to capture information about a community by analyzing tweets from its members. Twitter has been a popular platform for a wide variety of applications in the Natural Language Processing (NLP) community, such as sentiment analysis and real-time event detection. By separating tweets by date and county, we can apply NLP techniques to tweet content to “nowcast” the unemployment rate for a given time and county. Twitter data, which represents nearly real-time conversation and information, thus helps facilitate a more targeted approach to battling unemployment.

2 Background

Twitter is a rich source of data for NLP applications because of its availability, its variety of formats, and the low variance in tweet length. Although most recent work focuses on sentiment analysis, we believe that the methods employed there are also useful for predicting the unemployment rate, another social property of the texts. The key difference between our models and sentiment analysis models of Twitter data is that we perform a regression task instead of classifying tweets into distinct categories. However, the challenges are similar: preprocessing tweets and determining which models suit this short and less grammatical form of text.

Yuan and Zhou used binary dependence tree structures as inputs to a Recurrent Neural Network (RNN) for Twitter sentiment analysis. We adopted their preprocessing methods, based upon the Stanford NLP Twitter preprocessing script, to deal with special tokens such as hashtags (#) and usernames (@). Additionally, Palangi et al. used an RNN with Long Short-Term Memory (LSTM) cells to learn sentence embeddings with applications in information retrieval. Ren et al. separated neural networks formed by inputs and by contextual keywords, each with its own convolution and pooling layers, and combined their outputs again later. Also employing neural networks but with more quantitative features, Xu et al. relied on search engine query data to forecast unemployment rates.

Because of the similarities between tweets and search engine queries (short length, often nonstandard language, and a tendency to discuss trends), we build upon Xu et al. and Palangi et al. by using an RNN to forecast unemployment rates and determining how much the LSTM model can improve performance. Ren et al.’s success with convolution and pooling layers also inspired us to evaluate a Convolutional Neural Network (CNN) on our task, and to test whether the filter/n-gram-based approach of a CNN is a disadvantage compared to an RNN’s ability to represent text sequentially.

3 Approach

We approach our task using deep learning regression models to predict unemployment conditions. Since the potential use case of our work is a tool for local agencies to nowcast county-level unemployment rates, our goal is to infer those figures for tweets in later months after training on tweets from previous months. Therefore, we separate our training and testing data such that no tweet date in one set appears in the other, and all dates in the testing set are later than those in the training set. We then train different deep neural network models to predict a 1-dimensional output (the unemployment rate), with the optimization goal of minimizing the root mean squared error between the predicted and true average unemployment rate in the training set.

3.1 Dataset & Data Preprocessing

Our original dataset contains over 30 million tweets containing the keyword “job”, and a separate table of unemployment rates spanning 940 metro areas and 84 dates from the United States Bureau of Labor Statistics. Each tweet is tagged with a GeoID, Month, and Year. An inner join of the two tables results in a combined dataset of 35,892,820 tweets spanning 873 metro areas, dates from May 2014 through Oct. 2016, and unemployment rates between 1.0 and 28.7. Thus, each tweet is tagged with an unemployment rate. As Figure 1 demonstrates, the data is not distributed evenly, roughly centering on an unemployment rate between 5% and 6%.

Figure 1. Histogram of Tweets by Unemployment Rate

Due to space constraints, we remove tweets from randomly selected GeoIDs until we reach a final dataset of 3 million tweets. We split the data so that no tweets from the same (GeoID, Month, Year) bucket appear in both the training and testing sets. Separating our data into training and testing sets by month/year results in a train/test split of 72%/28%. We preprocess our data using the method in Yuan & Zhou. We define our vocabulary as that of the GloVe 50-dimensional word vectors pretrained on 2 billion tweets, with an additional “unknown” token to represent all words in our dataset that are not in the embedding matrix vocabulary.

3.2 Baseline

We ran two baseline models to predict the unemployment rate. The first assumes that the unemployment rate in a (GeoID, Month, Year) bucket is the same as in the corresponding previous-month bucket, (GeoID, (Month - 1) % 12, Year). One would not expect drastic changes in the unemployment rate between months, so this baseline gives us a sense of how accurate prediction can be when the most recent unemployment data is known.
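The previous-month baseline can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names, the dictionary layout, and the toy rates are all assumptions. The bucket index given above keeps Year fixed, so the sketch handles the January-to-December rollover explicitly, which is also an assumption about the intended behavior.

```python
from math import sqrt


def previous_month(month, year):
    """Return the (month, year) pair one month earlier; months are numbered 1-12."""
    if month == 1:
        return 12, year - 1  # assumed rollover: January looks back to December
    return month - 1, year


def persistence_baseline(rates):
    """Score the 'same as last month' prediction.

    `rates` maps (geo_id, month, year) -> unemployment rate (hypothetical layout).
    Buckets without a previous-month entry are skipped.
    Returns (rmse, number_of_scored_buckets).
    """
    squared_errors = []
    for (geo_id, month, year), rate in rates.items():
        prev_month, prev_year = previous_month(month, year)
        prev_rate = rates.get((geo_id, prev_month, prev_year))
        if prev_rate is not None:
            squared_errors.append((prev_rate - rate) ** 2)
    if not squared_errors:
        raise ValueError("no bucket has a previous-month rate")
    return sqrt(sum(squared_errors) / len(squared_errors)), len(squared_errors)


# Toy values invented for illustration only.
toy_rates = {
    ("A", 12, 2015): 5.0,
    ("A", 1, 2016): 5.5,
    ("A", 2, 2016): 5.5,
    ("B", 1, 2016): 4.0,
}
rmse, scored = persistence_baseline(toy_rates)
print(f"RMSE over {scored} buckets: {rmse:.5f}")
```

Only buckets that have a predecessor contribute to the error, so the first month of each GeoID is left out of the score.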
This baseline results in a root mean squared (RMS) error of .58380, with the average difference between the previous and current unemployment rates equal to .13977 and a Spearman correlation coefficient of .95128 (p = 0.0). However, “nowcasting” is based on the idea that local agencies may not have access to unemployment statistics on a monthly basis, and smaller communities may have to wait longer to observe changes in the unemployment rate. Thus, our goal is to measure how closely models trained on Twitter data can approach the baseline of predicting the unemployment rate from actual previous figures, rather than to surpass it.

Our second baseline is a linear regression model that takes the average of the GloVe vector representations of the words in a tweet as input. With this model, the root mean squared error is 1.41421 per tweet and 2.38125 per bucket, and we hope to improve upon these values with our deep learning models.

3.3 Deep Learning Models

To use deep learning models for our task, we store each tweet as a sequence of index values into our embedding matrix to save space, and pad each tweet with a special token corresponding to the last row of our embedding matrix until it contains 20 tokens. The embedding matrix in all models is initialized with the GloVe 50d word vectors pretrained on 2 billion tweets, extended by one row for the “not found” token. Given word vectors as features, each model outputs a single value as its unemployment rate (UER) prediction. We train the models with the true UER label of each tweet, experimenting with different deep neural architectures and hyperparameters. We test each model on an unseen test set using the following two evaluations, based on the fact that every tweet in the same (GeoID, Month, Year) bucket has the same UER.

Evaluation #1:

$$\text{Mean Squared Error} = \frac{1}{|\#\,\text{tweets}|} \sum_{t=1}^{|\#\,\text{tweets}|} \left(\text{predict}(\text{tweet}_t) - \text{true}(\text{tweet}_t)\right)^2$$

Evaluation #2:

$$\text{Mean Squared Error} = \frac{1}{|\#\,\text{Buckets}|} \sum_{b=1}^{|\#\,\text{Buckets}|} \left(\left(\frac{1}{|\#\,\text{tweets in bucket}|} \sum_{t=1}^{|\#\,\text{tweets in bucket}|} \text{predict}(\text{tweet}_t)\right) - \text{true}(\text{bucket}_b)\right)^2$$

For our experiments, we report the RMS error under both evaluation metrics, along with Spearman’s rank correlation between the model’s predicted ranking of buckets by unemployment rate and the true ranking.

3.3.1 Recurrent Neural Network

Figure 2. Sample RNN Sequence

Recurrent Neural Networks are well suited to short input sequences such as tweets. Each word in the tweet is converted into its embedding vector, and the vectors are fed to the RNN in sequence. Our RNN model consists of a single hidden layer, with an L2 (regression) loss applied to the last output of the sequence.

3.3.2 Long Short-Term Memory

Figure 3. Different update rules for a time step t, for LSTM (left) and GRU (right)

As with the RNN, we feed embedding vectors to our LSTM. Unlike a simple RNN cell, however, each LSTM cell also includes input gates that selectively ignore parts of the current input, forget gates that determine the usefulness of past memory states and preserve part of them, and output gates that separate the final memory from the hidden state. The combination of these gates gives the LSTM more power in selecting which information to pass to the next steps and which to hold back as memory. It also alleviates the vanishing and exploding gradient problems commonly found in RNNs.

3.3.3 Gated Recurrent Unit

Figure 4. Interactions among GRU components in a single cell

The Gated Recurrent Unit (GRU) is a simpler variant of the LSTM, designed to retain memories that capture long dependencies and to vary the influence of inputs at different time steps. Compared to the LSTM, the GRU has only two gates: an update gate and a reset gate. The intuition behind the model is that units with long-term dependencies will have active update gates, while units with short-term dependencies will have active reset gates. The performance of GRU and LSTM is usually similar, but the former is more computationally efficient because it has fewer variables.

3.3.4 Convolutional Neural Network

Figure 5. CNN architecture, with three different filter sizes (3, 4 and 5)

Finally, we implement a Convolutional Neural Network based on the model in Yoon Kim’s “Convolutional Neural Networks for Sentence Classification”, with a sequence of indices corresponding to the positions of the words in the embedding matrix as input. We use these indices to look up rows of the embedding matrix and represent each tweet as a 20x50 matrix. We then perform the convolution operation on the input matrix for different filter sizes. Each layer in the network consists of a convolutional layer followed by max-pooling; the MAX operation has previously shown success in other neural network tasks such as object recognition. Each layer also includes a Rectified Linear Unit for nonlinearity, and our network contains one layer for each filter size. Our initial model contains 128 filters for filter sizes 3, 4, and 5, and we compare performance over different filter sizes.

4 Experiment

We first compare a basic Recurrent Neural Network with RNN cells using Gated Recurrent Units and Long Short-Term Memory. Since GRU is found to perform the best of these three approaches, we go on to investigate GRU’s performance across different hyperparameters.
We also compare these models with Convolutional Neural Networks at different filter sizes.

4.1 Recurrent Neural Networks Comparison

We first compared different recurrent neural network architectures on their performance over an unseen test set. Over our shuffled training set of 3 million tweets, each epoch took 4 hours to complete with a batch size of 64. After roughly 8 hours of training, all three models report statistically significant Spearman correlation coefficients above 0.5, as shown in Table 1. Moreover, each network performs better than our baseline linear regression model.

Table 1: Comparison of Performance with Different Recurrent Neural Network Architectures on Unseen Test Set

| Model Type | RMSE per Tweet | RMSE per Bucket | Spearman Correlation, P-Value |
|---|---|---|---|
| Simple Recurrent Neural Network | 1.17339 | 1.82243 | .59, p = 2.05 × 10^-66 |
| Gated Recurrent Unit | 1.13367 | 1.60382 | .67, p = 5.40 × 10^-94 |
| Long Short-Term Memory | 1.10046 | 1.65442 | .65, p = 2.35 × 10^-86 |

All three models show a lower root mean squared error when calculated on a per-tweet basis than on a per-bucket basis, suggesting that the tweets the network treats as irrelevant may be few but extreme in predicted unemployment rate. Given that our dataset is filtered to focus on jobs, most tweets are likely relevant to the task of predicting the unemployment rate. Moreover, GRU achieves the lowest per-bucket RMSE and the highest Spearman correlation of the three architectures (LSTM has the lowest per-tweet RMSE), despite GRU’s lack of the memory control that LSTM provides. Previous research supports our finding that the more advanced information flow control in GRU and LSTM improves upon simple RNNs. Because GRU not only performs best on these bucket-level measures but is also less computationally intensive than LSTM, we chose GRU for further experimentation on the effect of hyperparameters.

To get a more complete sense of the models’ ranking predictions, we also compare the highest-ranked and lowest-ranked (GeoID, Month, Year) buckets by unemployment rate between each model and the true values, shown in Tables 2 and 3.
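As a side note on the two error columns in Table 1, the per-tweet and per-bucket evaluations from Section 3.3 can be sketched as follows. The function names and the toy values are assumptions made for illustration; the sketch only shows the arithmetic of the two metrics and one way they can diverge (a small bucket with a large error counts as much as a large bucket in the per-bucket score). It is not a claim about what drives the reported numbers.

```python
from collections import defaultdict
from math import sqrt


def rmse_per_tweet(predictions, targets):
    """Evaluation #1: square root of the mean squared error over individual tweets."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sqrt(sum(squared) / len(squared))


def rmse_per_bucket(buckets, predictions, targets):
    """Evaluation #2: average the predictions within each bucket, then compare
    that average with the bucket's single true rate."""
    predictions_by_bucket = defaultdict(list)
    true_rate = {}
    for bucket, p, t in zip(buckets, predictions, targets):
        predictions_by_bucket[bucket].append(p)
        true_rate[bucket] = t  # every tweet in a bucket carries the same label
    squared = [
        (sum(ps) / len(ps) - true_rate[bucket]) ** 2
        for bucket, ps in predictions_by_bucket.items()
    ]
    return sqrt(sum(squared) / len(squared))


# Toy data invented for illustration: bucket "x" has three well-predicted tweets,
# bucket "y" has a single tweet that is off by 2 points.
toy_buckets = ["x", "x", "x", "y"]
toy_predictions = [6.0, 6.0, 6.0, 6.0]
toy_targets = [6.0, 6.0, 6.0, 4.0]

per_tweet = rmse_per_tweet(toy_predictions, toy_targets)
per_bucket = rmse_per_bucket(toy_buckets, toy_predictions, toy_targets)
print(f"per tweet: {per_tweet:.5f}, per bucket: {per_bucket:.5f}")
```

In this toy case the per-tweet error is diluted by the three accurate tweets, while the per-bucket error weights the two buckets equally.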
Table 2: Different RNN Architectures Predicted vs True Highest Unemployment Buckets on Unseen Test Set

| Rank | Simple RNN | Gated Recurrent Unit | LSTM | True |
|---|---|---|---|---|
| 1 | Nogales, AZ Sept. 2016 | Nogales, AZ Sept. 2016 | Nogales, AZ Sept. 2016 | Rio Grande City, TX July 2016 |
| 2 | Taos, NM Sept. 2016 | Nogales, AZ August 2016 | Nogales, AZ August 2016 | Raymondville, TX July 2016 |
| 3 | Nogales, AZ August 2016 | Nogales, AZ June 2016 | Nogales, AZ May 2016 | Rio Grande City, TX August 2016 |
| 4 | Deming, NM August 2016 | Nogales, AZ May 2016 | Nogales, AZ Oct. 2016 | Rio Grande City, TX June 2016 |
| 5 | Truckee-Grass Valley, CA July 2016 | Nogales, AZ Oct. 2016 | Nogales, AZ June 2016 | Raymondville, TX June 2016 |

Table 3: Different RNN Architectures Predicted vs True Lowest Unemployment Buckets on Unseen Test Set

| Rank | Simple RNN | Gated Recurrent Unit | LSTM | True |
|---|---|---|---|---|
| 1 | Huron, SD July 2016 | Huron, SD July 2016 | Huron, SD August 2016 | Huron, SD Sept. 2016 |
| 2 | Alexandria, MN June 2016 | Huron, SD August 2016 | Huron, SD July 2016 | Sioux Falls, SD Sept. 2016 |
| 3 | Alexandria, MN May 2016 | Huron, SD Sept. 2016 | Huron, SD Sept. 2016 | Huron, SD Oct. 2016 |
| 4 | Huron, SD Sept. 2016 | Huron, SD Oct. 2016 | Huron, SD Oct. 2016 | Sioux Falls, SD Oct. 2016 |
| 5 | Worthington, MN May 2016 | Sioux Falls, SD May 2016 | Sioux Falls, SD June 2016 | Sioux Falls, SD July 2016 |

These results show that the models’ five lowest-unemployment predictions match the true buckets more closely than their five highest-unemployment predictions. Additionally, the Simple RNN outputs results that are more varied across regions, while both GRU and LSTM consistently predict Nogales, AZ for the highest unemployment rates and vary over only two locations for the five lowest unemployment rates. In our training dataset (which includes buckets from 2014 and 2015), the buckets with the highest UERs include the GeoIDs Deming, NM and Nogales, AZ, both of which are captured by the Simple RNN.
GRU and LSTM, however, appear largely biased towards Nogales, AZ, which we believe is due to overfitting on the training dataset. Slightly more than 50% of tweets from Nogales, AZ contain the keyword “nogales”, allowing the model to treat “nogales” as an important feature for predicting a high UER. Since neither of the true GeoIDs, Raymondville and Rio Grande City, would likely produce tweets with the keyword “nogales”, their score for this feature would be quite low. And although Raymondville appears in the training dataset in the third-highest unemployment bucket, the number of tweets from Raymondville (347) is much lower than the number from Nogales (1185), likely causing the model to fit towards tweets from Nogales.

The overfitting issue is not as apparent in Table 3, which shows results for the lowest unemployment. This is because Huron, SD and Sioux Falls, SD are among the lowest unemployment rate buckets in our training set, thus matching the test set more closely (interestingly, “siouxfalls” is mapped to the unknown token and is thus unlikely to be a keyword the model overfits to). The Simple RNN shows a greater variety of locations, and is thus generalizing more but performing worse than the other models. We investigate how to prevent overfitting by experimenting with different hyperparameters of the GRU model, as it performed the best.

4.2 Performance of Gated Recurrent Unit with Varying Hyperparameters

We chose two hyperparameters to vary: hidden layer size and dropout. A smaller hidden layer size trades representational capacity for computational efficiency, since the model has fewer parameters. If the unemployment rate relies on many parameters, then decreasing the hidden layer size may decrease performance. However, reducing the number of parameters in the model helps increase the model’s ability to generalize.
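To make the link between hidden layer size and parameter count concrete, the following sketch counts the weights of a single GRU layer for the hidden sizes compared in Table 4. It assumes the textbook GRU formulation (update gate, reset gate, and candidate state, each with an input matrix, a recurrent matrix, and one bias vector) and the 50-dimensional GloVe input described in Section 3.3. The function name is hypothetical, and real frameworks may count biases differently, so the figures are indicative only.

```python
def gru_parameter_count(input_size, hidden_size):
    """Trainable weights in one GRU layer under the textbook formulation.

    Each of the three blocks (update gate, reset gate, candidate state) has:
      - an input weight matrix of shape (hidden_size, input_size)
      - a recurrent weight matrix of shape (hidden_size, hidden_size)
      - one bias vector of length hidden_size (assumed; frameworks vary)
    The final regression layer is not included.
    """
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return 3 * per_block


EMBEDDING_SIZE = 50  # GloVe 50d vectors, as in Section 3.3

counts = {h: gru_parameter_count(EMBEDDING_SIZE, h) for h in (200, 150, 50, 25)}
for hidden_size, n_params in counts.items():
    print(f"hidden size {hidden_size:>3}: {n_params:,} parameters")
```

Because of the recurrent matrix, the count grows roughly quadratically with hidden size: under these assumptions, moving from 200 to 25 hidden units shrinks the layer from 150,600 to 5,700 weights.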
Similarly, dropout randomly drops units in the network during training, thus “thinning” the network, and has shown success in mitigating overfitting. We compare results across different values of the hidden layer size and the dropout “keep” probability of the GRU, and report the highest and lowest predicted vs. true unemployment buckets in each case.

Table 4: Comparing Different Hidden Layer Sizes for GRU

| Model Parameter | RMSE per Tweet | RMSE per Bucket | Spearman Rank Correlation, P-Value |
|---|---|---|---|
| Hidden layer Size = 200 | 1.13367 | 1.60382 | .67, p = 5.40 × 10^-94 |
| Hidden layer Size = 150 | 1.08872 | 1.66821 | .65, p = 5.52 × 10^-86 |
| Hidden layer Size = 50 | 1.08707 | 1.66591 | .65, p = 1.82 × 10^-84 |
| Hidden layer Size = 25 | 1.11446 | 1.64819 | .64, p = 1.69 × 10^-83 |

Table 5: GRU Predicted vs True Highest UER Buckets on Unseen Test Set for Different Hidden Layer Sizes

Rank Hidden layer Size = 200 Hidden layer Size...