Maximize Lending Interest with Fairness

A Project by Andrew Rittenhouse et al.

Using several models, including kNN, logistic regression, random forest, and a neural network, to investigate the expected return.

Interest rates and specific financial ratios vary across different groups.

How can we build an investment strategy that advises our Lending Club client on which loans to invest in?

Context

Lending Club releases quarterly and annual data on the loans that it has facilitated, providing huge databases with past results and parameters from successful and unsuccessful loans. However, because these databases are so large, it is impractical for a human investor to use this data efficiently by hand.

Nevertheless, this abundance of data presents an opportunity to build profitable models with data science techniques. Investors on Lending Club are eager to determine which loans are the most promising, meaning which loans will yield high returns. Such models would predict which loans are profitable to invest in, taking into account both the risk of default and the potential interest returns from a loan. Accurate models would allow us to invest with relatively high confidence and would encourage investors to participate in Lending Club.

Summary of Discrimination

In order to ascertain the extent of racial discrimination in the model, we calculated the average racial demographics of the test set by averaging the proportion of each demographic for all of the observations. Then, after selecting the best n loans, we once again calculated the average racial demographics.

For every racial demographic, the racial demographics of the n “best loans” reflected those of the greater test set within 2 percent. Thus, we conclude with reasonable confidence that our models are choosing loans that reflect the data that has been fed into them and therefore are not statistically discriminating.
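The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the field names `score`, `group_a`, and `group_b` and the sample values are hypothetical, and "within 2 percent" is read here as an absolute gap of at most 0.02 in average proportion.

```python
from statistics import mean

def average_demographics(loans, groups):
    """Average each group's proportion across a list of loan records."""
    return {g: mean(loan[g] for loan in loans) for g in groups}

def demographic_gaps(test_set, selected, groups):
    """Difference (selected minus test set) in average proportion per group."""
    base = average_demographics(test_set, groups)
    chosen = average_demographics(selected, groups)
    return {g: chosen[g] - base[g] for g in groups}

def within_tolerance(gaps, tolerance=0.02):
    """True if every group's gap is at most `tolerance` in absolute value."""
    return all(abs(gap) <= tolerance for gap in gaps.values())

# Hypothetical test set: each loan carries a model score and the
# demographic proportions attached to it (made-up values).
test_set = [
    {"score": 0.9, "group_a": 0.60, "group_b": 0.40},
    {"score": 0.8, "group_a": 0.50, "group_b": 0.50},
    {"score": 0.3, "group_a": 0.55, "group_b": 0.45},
    {"score": 0.2, "group_a": 0.57, "group_b": 0.43},
]
groups = ["group_a", "group_b"]

# Select the best n loans by model score, then compare demographics.
n = 2
best_n = sorted(test_set, key=lambda loan: loan["score"], reverse=True)[:n]
gaps = demographic_gaps(test_set, best_n, groups)
passes = within_tolerance(gaps)
```

A check like this only compares the selected loans against the test set itself, so it inherits whatever selection bias the test set already contains.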

That being said, selection bias may still lurk in this dataset. It is quite possible that certain racial groups use Lending Club at a rate that exceeds their share of the population in their local zip code. Our models unfortunately cannot correct for this possible selection bias.

Summary of Models / Analysis of Models

In order to build our models, we first had to clean our dataset completely. This involved imputing any missing data within our data frame and selecting important predictors from the thousands of features available to us. Using a Random Decision Forest Regressor, we were able to select the 120 most significant predictors in the dataset.
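The selection step can be illustrated with a small Python sketch. It assumes that importance scores have already been obtained from a fitted forest (in scikit-learn, for example, a fitted `RandomForestRegressor` exposes them as `feature_importances_`; the source does not say which library was used). The feature names and scores below are made up for illustration, and `k` would be 120 in the project.

```python
def top_k_features(importances, k):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def keep_columns(rows, columns):
    """Restrict each record (a dict) to the selected columns."""
    return [{c: row[c] for c in columns} for row in rows]

# Hypothetical importance scores, one per candidate predictor.
importances = {
    "int_rate": 0.31,
    "dti": 0.22,
    "annual_inc": 0.18,
    "emp_length": 0.04,
}

selected = top_k_features(importances, k=2)

# Hypothetical cleaned records, reduced to the selected predictors.
rows = [
    {"int_rate": 0.13, "dti": 18.2, "annual_inc": 52000, "emp_length": 3},
    {"int_rate": 0.08, "dti": 9.7, "annual_inc": 91000, "emp_length": 10},
]
reduced = keep_columns(rows, selected)
```

Ranking by importance and truncating is simple, but correlated predictors can split importance between them, so the cutoff is a judgment call rather than a hard rule.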

Afterwards, we built five models based on the major prediction schemes we learned in class: Logistic, Unlimited Depth Decision Forest, Limited Depth Decision Forest, K Nearest Neighbors, and Neural Network. After optimizing our models, we conclude that the Logistic Model yielded the best results.

Overall, we were able to greatly exceed Lending Club’s existing standards for investment interest returns. Our Logistic Model achieved an average interest return of 12.58% on selected investments, compared with the average return of 4 to 5% that Lending Club reports on its website.

Model                         Test Accuracy   Investment Returns   ROC AUC
Logistic                      0.735294        0.126066             0.662668
Decision Forest (Unlimited)   0.700163        0.126803             0.650907
Decision Forest (Limited)     0.745915        0.119016             0.663640
kNN                           0.684641        0.133525             0.551456
Neural Network                0.424837        0.112951             0.554948
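For readers who want to see how the three columns in the table above could be computed, here is a minimal standard-library sketch. Accuracy and ROC AUC follow their usual definitions. The source does not define how "Investment Returns" was calculated, so `mean_return_of_top_n` is an assumption: it averages the realized return of the n loans the model scores highest. All sample values are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half), which equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_return_of_top_n(scores, returns, n):
    """Assumed definition: average realized return of the n top-scored loans."""
    ranked = sorted(zip(scores, returns), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n]
    return sum(r for _, r in top) / len(top)

# Hypothetical data: 1 = loan repaid, 0 = loan defaulted.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]           # model's predicted repayment scores
returns = [0.12, -0.40, 0.10, -0.25]    # realized return per loan

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
top2_return = mean_return_of_top_n(scores, returns, n=2)
```

The pairwise AUC above is quadratic in the number of loans, which is fine for a sketch; a library routine would be preferable on the full dataset.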