Predicting Types of Crime
A Project by Ethan Kim et al.
This is a mapping of our crime categories across various census tracts in the city of Boston. Districts of note:
Downtown Crossing has a high occurrence of all types of crime except for death-related ones. This makes sense given that it is one of the busiest parts of the city.
Lower Roxbury seems to have the highest occurrence of all types of crime, including death-related crimes.
Again, death-related crime stands apart: its occurrences are largely contained within specific stretches of neighborhoods.
“What geographic and socioeconomic factors are associated with which types of crimes? What are the associations?”
Context
A longstanding goal of government and society is to reduce criminal activity. To do so, we must first understand patterns in criminal activity -- where does it happen, and what factors, both geographical and socioeconomic, are associated with these places?
Crime forecasting is the process of predicting future crimes based on historical patterns of crime within a city or neighborhood. Crime forecasting is important because it arms local police departments with information about what types of crime are likely to occur in a given area, the frequency of these crimes, and which areas are more likely to be crime "hot spots" in the future. When combined with data such as time of day, weather, and likely area, police can better plan crime prevention strategies, patrol routes, and educational programs. Beyond improving policing, crime prediction can give communities and community leaders an understanding of the challenges they face and help them better implement policy that may prevent crimes from happening altogether.
Our project attempts to predict the category of crime committed in the Boston area based on location data, temporal data, and data about socioeconomic factors.
Models
After synthesizing and cleaning the data, we ended up with a dataset containing 33 predictors and 191,255 observations. A brief summary of the performance of our baseline model and our improved models is below; a short illustrative sketch of the random-forest setup follows the list.
Baseline model (multiple logistic regression, with all predictors): Test accuracy of 33%
Improved model 1 (Random forest with 100 trees and maximum depth of 18): Test accuracy of 56.3%
Improved model 2 (Neural network with two hidden layers, each with 100 nodes): Test accuracy of 53%
Improved model 3 (Random forest optimized over number of trees, depths, and number of predictors at each node): Test accuracy of 56.6%
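As a rough illustration (not the project's actual code), a model like Improved model 1 could be fit with scikit-learn as sketched below; the file name and the "crime_category" target column are assumptions.

```python
# Minimal sketch of Improved model 1 (random forest, 100 trees, max depth 18).
# The cleaned file and the "crime_category" column are hypothetical names.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

crime_df = pd.read_csv("boston_crime_cleaned.csv")   # hypothetical cleaned dataset
X = crime_df.drop(columns=["crime_category"])        # the 33 predictors
y = crime_df["crime_category"]                       # crime type label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, max_depth=18, random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```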
Summary
Distance to the nearest streetlight (computed separately for daytime and nighttime) appears to be a significant predictor for Force and Property crimes, and possibly for Public crimes. The sign of our t-statistic suggests that Force crimes tend to occur nearer to streetlights, while Property crimes tend to occur farther from them. Public crimes also tend to occur away from streetlights.
Of the seven measures of inequality we began with, median income and total value of property within a 200-meter radius of the crime were not very important; all other predictors (Gini coefficient for the census tract in which the crime was committed, percentage of people in high-income housing, percentage of people with low education, percentage of people with high education, percentage of people in new housing, percentage of people in old housing, and percentage of people in poverty) were important. The occurrences of all types of crimes tended to increase with every measure of inequality, but we did not find anything to suggest that any type of crime increased with inequality more severely than the others.
The importance ranking of the predictors from our random forest models suggested that the predictors associated with time -- such as HOUR, DAY OF WEEK, MONTH, and YEAR -- were most important in predicting which type of crime would occur.
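For context, an importance ranking like this is typically read off a fitted scikit-learn forest as shown below, continuing with the `rf` and `X_train` names assumed in the earlier sketch.

```python
# Rank predictors by impurity-based importance from the fitted forest.
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))  # e.g. HOUR, DAY OF WEEK, ...
```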
Our highest test accuracy was 56.6%. Since tuned random forests and neural networks are among the more powerful models available, this suggests that predicting the type of crime is genuinely difficult: there is a great deal of unpredictable variation in the data, since humans are themselves unpredictable.
How Trump’s Tweets Follow and Move the Stock Market
A Project by Jerry Huang, Roger Zhang, et al.
At first, most of the tweets are from an Android device; then there is a switch to predominantly iPhone, with a mix of Twitter Web Client and, eventually, Twitter Media Studio. The Twitter Media Studio and Web Client tweets are likely from staffers, since these platforms, especially the former, are geared towards press and media teams.
The basic Dow variables (Open, High, Low, Close) tell us only about the strength of the US economy. As the US economy keeps growing, the general trend of the Dow variables is simply to rise. On the other hand, the daily range of the Dow (Max - Min) is essentially the daily volatility, which is probably why the Dow daily range and VIX Open are highly correlated.
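A correlation like this can be checked with a short pandas computation; the merged daily file and its column names below are assumptions for illustration, not the project's data pipeline.

```python
# Sketch: correlation between the Dow's daily range (High - Low) and the VIX Open.
import pandas as pd

market_df = pd.read_csv("dow_vix_daily.csv", parse_dates=["Date"])  # hypothetical merged file
market_df["dow_range"] = market_df["High"] - market_df["Low"]
print(market_df["dow_range"].corr(market_df["VIX_Open"]))
```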
“How much do Donald Trump’s tweets improve our ability to predict the change in the VIX between a given day and the following day?”
Trump's most frequent tweets mention "country" and "people", but also reference his opponents, such as Democrats, China, and Mexico. He also tweets a great deal about "fake news" and the "witch hunt".
Variables of Interest and Intuitions
Financial Predictors:
The difference between the Dow's opening and closing prices for each of the five preceding business days.
This information should tell us something about current trends in the stock market and short-term volatility (i.e., are the tweets on a given day and the subsequent change in the VIX both the result of some "third-variable" event or news item?).
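A minimal sketch of how these lagged open-to-close differences could be constructed with pandas; the file and column names are assumptions.

```python
# Sketch: open-to-close change for each of the five preceding business days,
# built as lagged features on a daily Dow frame with hypothetical column names.
import pandas as pd

dow = pd.read_csv("dow_daily.csv", parse_dates=["Date"]).sort_values("Date")
dow["open_close_diff"] = dow["Close"] - dow["Open"]
for lag in range(1, 6):
    dow[f"open_close_diff_lag{lag}"] = dow["open_close_diff"].shift(lag)
```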
Tweet-based Predictors:
Sentiment analysis of each tweet
This variable is the sum of the positive and negative polarity (valence) across all of Trump's tweets on a given day. We also measure subjectivity: whether his tweets state facts or express his personal emotions.
e.g., the tweet "it is snowing" has a very low absolute polarity and a low subjectivity score, whereas a tweet like "I hate the snow!" would have a high (negative) absolute polarity and a high subjectivity score.
The intuition here is that volatility depends on the magnitude of the polarity and subjectivity since VIX does not account for direction.
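One common way to obtain polarity and subjectivity scores is TextBlob; whether the project used this exact library, and exactly how it aggregated scores by day, are assumptions in the sketch below.

```python
# Sketch: per-tweet polarity/subjectivity and a simple daily aggregate using TextBlob.
from textblob import TextBlob

print(TextBlob("it is snowing").sentiment)     # near-zero polarity, low subjectivity
print(TextBlob("I hate the snow!").sentiment)  # negative polarity, higher subjectivity

def daily_sentiment(tweets):
    """Sum absolute polarity and subjectivity over one day's tweets (assumed aggregation)."""
    scores = [TextBlob(t).sentiment for t in tweets]
    return (sum(abs(s.polarity) for s in scores),
            sum(s.subjectivity for s in scores))
```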
The number of times on a given day that a tweet referenced certain keywords like "China" or "tariff"
The intuition here, since the keywords are economically focused, is that the total number of these keywords mentioned gives a metric of how economically focused the tweets of the day were.
This is not necessarily an exhaustive or refined list.
Full list of our chosen keywords:
"stock", 'market', "agreement", "negotiator", "negotiation", "trade", "china", "economy", "jobs", "tariff", "employ", "s&p", "auto", "farmer"
"TRUMPINESS"
Media reports indicate that Trump switched from an Android phone to an iPhone in early 2017. Before the switch, Trump used Android for tweeting, while his staffers used other platforms, including Twitter for iPhone and Twitter Web Client. We can therefore use the tweets from this pre-iPhone period as a training set, with the device serving as the ground-truth label for whether Trump himself posted a given tweet.
We extract features from the text and tweet metadata as our predictors. Using TF-IDF, we generate a vector weighting the most important words in Trump's vocabulary. We also extract dummy features indicating whether a tweet includes a link, picture, video, hashtag, or "@" mention, as well as the sentiment scores (polarity and subjectivity) of each tweet.
We then fit these features with a random forest classifier to estimate the probability that a given tweet was posted by Trump himself rather than by a member of his team.
Features include the hour, year, month, day, and minute of the tweet; whether the tweet includes a link; whether it includes a hashtag; whether it includes "..." (indicating threading); the polarity and subjectivity sentiment scores; and the generated TF-IDF vectors.
We then use the standard 0.5 cutoff to classify whether a tweet was sent by Trump himself or by other members of his team. With this binary classification we create two datasets: one without any filtering, named "Unfiltered", and one containing only the tweets classified as Trump's own, named "Filtered".
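A minimal sketch of this feature extraction with scikit-learn's TfidfVectorizer; the file name, the number of TF-IDF terms kept, and the column names are assumptions.

```python
# Sketch: TF-IDF text features plus simple metadata dummies for each tweet.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tweets_df = pd.read_csv("trump_tweets.csv")     # hypothetical file with a "text" column
vectorizer = TfidfVectorizer(max_features=500, stop_words="english")
tfidf = vectorizer.fit_transform(tweets_df["text"])

tweets_df["has_link"] = tweets_df["text"].str.contains("http").astype(int)
tweets_df["has_hashtag"] = tweets_df["text"].str.contains("#").astype(int)
tweets_df["has_mention"] = tweets_df["text"].str.contains("@").astype(int)
```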
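Continuing the hypothetical `tweets_df` and `tfidf` from the sketch above, the authorship classification and filtering step might look roughly like this; the `source` and `year` columns used as ground truth and training mask are assumptions.

```python
# Sketch: random-forest authorship classifier with a 0.5 probability cutoff.
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier

meta = tweets_df[["has_link", "has_hashtag", "has_mention"]].to_numpy()
X = hstack([tfidf, csr_matrix(meta)]).tocsr()
is_trump = (tweets_df["source"] == "Twitter for Android").astype(int)  # device as ground truth
pre_iphone = (tweets_df["year"] < 2017).to_numpy()                     # assumed pre-iPhone period

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[pre_iphone], is_trump[pre_iphone])     # train only on the pre-iPhone tweets

prob_trump = clf.predict_proba(X)[:, 1]
unfiltered = tweets_df.copy()                    # "Unfiltered": every tweet
filtered = tweets_df[prob_trump >= 0.5]          # "Filtered": tweets classified as Trump's own
```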
There is a peak at around hours 11-12, compared with the staff tweets, whose peak occurs at around hour 20; the middle of the distribution falls between hours 15 and 1.
Conclusion
We used the Dow Jones index and parameters derived from Donald Trump's tweets to build a classification model that predicts whether a day's VIX Open index will significantly increase, plateau, or significantly decrease compared with the last trading day. We first trained a binary classification model on Donald Trump's tweets, using sentiment analysis and TF-IDF vectors, to identify the tweets actually posted by Donald Trump himself, who presumably would have the larger market impact. Then, using the filtered Trump tweets, their sentiment scores, dummy variables indicating whether each keyword appears in the tweets, and the Dow Jones index, we trained a series of random-forest models to classify the change in the VIX Open index.
For our best model, which includes all of our parameters, we achieved a test accuracy of 75.11%, 4.89 percentage points higher than the model that includes only the Dow Jones index. This indicates that the parameters we derived from Donald Trump's tweets are a valid signal and can inform decision making.
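For concreteness, the three-class VIX target might be constructed along these lines; the ±5% cutoffs and the file/column names are illustrative assumptions, since the project's exact thresholds are not stated here.

```python
# Sketch: label each day's change in VIX Open as decrease / plateau / increase.
import pandas as pd

vix = pd.read_csv("vix_daily.csv", parse_dates=["Date"]).sort_values("Date")  # hypothetical
pct_change = vix["VIX_Open"].pct_change()
vix["vix_class"] = pd.cut(
    pct_change,
    bins=[float("-inf"), -0.05, 0.05, float("inf")],   # assumed cutoffs
    labels=["decrease", "plateau", "increase"],
)
```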
Maximize Lending Interest with Fairness
A Project by Andrew Rittenhouse et al.
We use several models, including kNN, logistic regression, random forest, and a neural network, to investigate the expected return value.
Interest rates and specific financial ratios vary across different groups.
“How can we build an investment strategy that advises our client on which Lending Club loans to invest in?”
Context
Lending Club releases quarterly and annual data on the loans that it has facilitated, providing huge databases with past results and parameters from successful and unsuccessful loans. However, due to the staggering size of these databases, it is impossible for any human investor to utilize this data in an efficient way.
Nevertheless, this abundance of data presents an opportunity to build highly lucrative models using data science techniques. Investors on Lending Club are eager to determine which loans seem the most promising, in the sense that the loans will have high returns. Models built with these techniques can predict which loans are profitable to invest in, taking into account the risk of default as well as the potential interest returns from a loan. Creating successful and accurate models would allow us to invest with relatively high confidence and encourage investor participation in Lending Club.
Summary of Discrimination
In order to ascertain the extent of racial discrimination in the model, we calculated the average racial demographics of the test set by averaging the proportion of each demographic across all observations. Then, after selecting the best n loans, we once again calculated the average racial demographics.
For every racial demographic, the demographics of the n “best loans” reflected those of the greater test set to within 2 percent. Thus, we conclude with reasonable confidence that our models are choosing loans that reflect the data that has been fed into them and therefore are not statistically discriminating.
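A sketch of this demographic comparison, assuming a test-set DataFrame that already contains zip-code demographic shares and the model's predicted return for each loan (all names hypothetical):

```python
# Sketch: compare average demographics of the full test set with the top-n loans.
import pandas as pd

test_df = pd.read_csv("lending_test_with_predictions.csv")   # hypothetical predictions file
demo_cols = ["pct_white", "pct_black", "pct_asian", "pct_hispanic"]
n = 1000

overall = test_df[demo_cols].mean()                               # whole test set
best = test_df.nlargest(n, "predicted_return")[demo_cols].mean()  # n "best" loans
print((best - overall).abs())   # differences stayed within ~2 percentage points
```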
That being said, selection bias may still lurk in this dataset. It may well be that certain racial groups use Lending Club at rates that exceed their share of their local zip code's population. Our models unfortunately cannot correct for this possible selection bias.
Summary of Models / Analysis of Models
In order to build our models, we first had to clean our dataset thoroughly. This involved imputing any missing data within our data frame and selecting important predictors from the thousands of features available to us. Using a random forest regressor, we selected the 120 most significant predictors in the dataset.
Afterwards, we built five models based on the major prediction schemes we learned in class: logistic regression, an unlimited-depth decision forest, a limited-depth decision forest, k-nearest neighbors, and a neural network. After optimizing our models, we concluded that the Logistic Model yielded the best results.
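The feature-selection step could look roughly like the sketch below, using impurity-based importances from a random forest regressor; the file name and target column are assumptions.

```python
# Sketch: keep the 120 most important predictors according to a random forest regressor.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

loans = pd.read_csv("lending_club_cleaned.csv")        # hypothetical cleaned/imputed data
X_full, y = loans.drop(columns=["loan_return"]), loans["loan_return"]

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_full, y)

top_120 = X_full.columns[rf.feature_importances_.argsort()[::-1][:120]]
X_selected = X_full[top_120]
```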
Nevertheless, we were able to greatly exceed Lending Club’s existing standards for investment interest returns. Our Logistic Model achieved an average of 12.58% interest returns on selected investments, compared to Lending Club’s average of 4-5% returns reported on their website.
| Model | Test Accuracy | Investment Returns | ROC AUC |
|---|---|---|---|
| Logistic | 0.735294 | 0.126066 | 0.662668 |
| Decision Forest (Unlimited) | 0.700163 | 0.126803 | 0.650907 |
| Decision Forest (Limited) | 0.745915 | 0.119016 | 0.663640 |
| kNN | 0.684641 | 0.133525 | 0.551456 |
| Neural Network | 0.424837 | 0.112951 | 0.554948 |