Neighborhood Factors and Student Success
This project completes the Capstone described in IBM/Coursera's certification course for Data Scientists. One requirement is to use data from Foursquare's API to draw conclusions about geographic locations. Given my interest in education research, I have chosen to compare the data collected about New York neighborhoods with another dataset: average SAT scores in New York schools. The goal of this analysis is to determine whether student success is correlated with certain environmental factors, described below. If it is, the next goal is to determine which factors hold the most influence. If these factors can be revealed, decision-makers and educators may be able to form better strategies to help their students succeed.
- I will be using neighborhood definitions from the City of New York website. I only need latitude and longitude data to define where each "neighborhood" is.
- Each neighborhood is assigned information meant to help characterize it: the number of venues of each type, courtesy of Foursquare.
- I will be using income data, separated by ZIP code, found on Kaggle. It lists the income brackets for each ZIP code, along with the total number of returns and the total income reported (summed over those returns).
- I will be using New York City crime data from 2014-15, also found on Kaggle. It lists each incident, along with its description and location.
- Finally, the average SAT scores are also from Kaggle, collected in 2014. The dataset lists individual schools, along with their locations and average SAT scores for senior students, separated by subject.
First, one must decide on a good definition for student "success." Thankfully, reliable metrics such as test scores (SAT and others) have been in use for many years, and this data is readily available. Because an SAT score is a numeric value and I want to validate the model by how well it predicts, I will use various forms of regression to try to match neighborhood features with that neighborhood's average SAT scores.
It becomes necessary to select and synthesize very specific variables from the data listed above.
First, I ran into the difficulty that some neighborhoods share a ZIP code. Therefore, information keyed by ZIP code had to be shared across multiple neighborhoods. I used the "uszipcode" Python library to assign each neighborhood a ZIP code.
From the income data set, my goal was to get a general "income per capita" figure. Therefore, I took the sum of each income bracket's "income" figure and divided it by the sum of each bracket's "number of returns filed" to try to get a rough estimate. Because the data was separated by ZIP code, I assigned it to our neighborhood data using that.
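The bracket-summing step can be sketched as follows. This is a minimal illustration rather than the original notebook code: the column names (`zipcode`, `num_returns`, `total_income`), the neighborhood names, and all values are hypothetical stand-ins for the Kaggle income data.

```python
import pandas as pd

# Hypothetical layout: one row per (ZIP code, income bracket).
# Column names and values are made up for illustration.
income = pd.DataFrame({
    "zipcode": ["10001", "10001", "10002", "10002"],
    "num_returns": [100, 50, 200, 100],
    "total_income": [2_000_000, 5_000_000, 3_000_000, 6_000_000],
})

# Sum returns and income over all brackets within each ZIP code.
by_zip = income.groupby("zipcode")[["num_returns", "total_income"]].sum()

# Rough "income per capita": total reported income / number of returns filed.
by_zip["income_per_return"] = by_zip["total_income"] / by_zip["num_returns"]

# Neighborhoods that share a ZIP code receive the same figure.
neighborhoods = pd.DataFrame({
    "neighborhood": ["A", "B", "C"],
    "zipcode": ["10001", "10001", "10002"],
})
neighborhoods["income_per_return"] = neighborhoods["zipcode"].map(
    by_zip["income_per_return"]
)
print(neighborhoods)
```

Note that this figure is income per tax return, not per resident, so it is only a rough proxy for income per capita.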
From the crime data, I only wanted the number of crimes that occurred in a given neighborhood, regardless of type. Crime data was located using two coordinate systems: Latitude/Longitude pairs and New York City's own historical grid reference. Using the Latitude/Longitude data required the use of geopy's "distance" method to figure out which neighborhood center was closest to a given incident. This is how I assigned each incident to a neighborhood. However, even with a small subset of the original dataset, the process took all night.
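The nearest-center assignment can be sketched as below. To keep the example free of third-party dependencies, it swaps in a standard-library haversine (great-circle) distance in place of geopy's distance calculation; the neighborhood centers and incident coordinates are approximate, hypothetical values.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))

# Hypothetical neighborhood centers (approximate coordinates).
centers = {
    "Chelsea": (40.7465, -74.0014),
    "Astoria": (40.7644, -73.9235),
    "Park Slope": (40.6710, -73.9814),
}

def nearest_neighborhood(point, centers):
    """Return the name of the center closest to `point`."""
    return min(centers, key=lambda name: haversine_km(point, centers[name]))

# Hypothetical incident locations; count incidents per neighborhood.
incidents = [(40.7480, -73.9990), (40.6700, -73.9800)]
counts = {}
for incident in incidents:
    name = nearest_neighborhood(incident, centers)
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

Every incident is compared against every center, so the work grows with (number of incidents) x (number of neighborhoods). That is one plausible reason a pure-Python loop over a large crime dataset runs for hours.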
For the school data set, I needed to attribute each school to a neighborhood. Because Latitude/Longitude pairs were provided for each school, we used the same method as we did for our crime data. The process was performed much more quickly, given the number of rows involved. However, what we discovered was that not all neighborhoods in New York have schools in them. Because we cannot confirm predictions for neighborhoods that don't have schools, we dropped those rows, leaving us 137 neighborhoods to work with.
Finally, I used data that had been collected from Foursquare's "explore" feature to get a list of venues, by type, in each neighborhood and represented them as a percentage of that neighborhood's whole.
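One way to turn a venue list into per-neighborhood percentages is a normalized cross-tabulation. The sketch below assumes a hypothetical table with one row per venue; the column names and categories are illustrative, not the actual Foursquare response format.

```python
import pandas as pd

# Hypothetical venue list: one row per venue returned for a neighborhood.
venues = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "A", "B", "B"],
    "category": ["Pizza Place", "Pizza Place", "Park", "Pharmacy", "Park", "Park"],
})

# Count venues per (neighborhood, category), then divide each row by its
# total so every neighborhood's shares sum to 1.
shares = pd.crosstab(venues["neighborhood"], venues["category"], normalize="index")
print(shares)
```

Each row of `shares` can then be joined to the other neighborhood features as a set of numeric columns.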
The final form of our data
Some quick exploration gives us the following visualizations. Here, I tried to determine whether there was an obvious correlation between presumptive predictors, such as average income and crime rates, and students' average SAT scores. Note that there is an apparent negative correlation between crime rates and SAT scores. However, even setting aside some outliers that make analysis difficult, our income data does not show such an obvious correlation.
Because I am trying to predict a numeric value, I decided to compare and contrast several forms of linear regression. Using the data, defined above, I trained the following models:
- Multiple Linear Regression
- Multiple Linear Regression with Polynomial Features
- Ridge Regression
- Ridge Regression with Polynomial Features
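The four models listed above can be set up with scikit-learn pipelines, as in the following sketch. The data here is synthetic, and the scaling step, polynomial degree, and `alpha` value are assumptions chosen for illustration; the original analysis may have used different settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the neighborhood feature table (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(137, 5))
y = 1200 + 40 * X[:, 0] - 30 * X[:, 1] + rng.normal(scale=20, size=137)

# Hold out a test group, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# degree=2 and alpha=1.0 are illustrative choices.
models = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Polynomial linear": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Polynomial ridge": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), Ridge(alpha=1.0)
    ),
}

# Fit each model on the training group and predict the held-out group.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Wrapping each model in a pipeline keeps the scaling and polynomial expansion fitted on the training group only, so the held-out group does not leak into preprocessing.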
When I trained each model, I held out a test group. Using this test group, I tried to predict what each neighborhood's average SAT score would be, given its other features. Below, I have made a distribution plot to contrast the real SAT values with each model's predictions.
The above diagrams indicate that the Polynomial Ridge Regression algorithm was most effective in predicting the distribution of student scores in the test group. However, when I compare the precise difference between each prediction and its corresponding real value, it paints a different picture.
Model Errors, Compared
Here, I have listed each model's score on the test set in terms of "Mean Squared Error" and "R^2 Score". For R^2, we want a value close to 1. For Mean Squared Error, lower is better. While our Polynomial Ridge Regression model seemed to give a better approximation of the distribution, the simple Ridge Regression model gives a significantly lower error.
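Both metrics are short enough to compute by hand, which helps when interpreting them. The sketch below uses made-up scores and hypothetical helper names (`mse`, `r2`); scikit-learn's `sklearn.metrics` module offers equivalent `mean_squared_error` and `r2_score` functions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors; lower is better."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up average SAT scores for three neighborhoods.
y_true = np.array([1000.0, 1200.0, 1400.0])

good = np.array([1010.0, 1190.0, 1400.0])      # close to the real values
baseline = np.full(3, y_true.mean())           # always predicts the mean
poor = np.array([1400.0, 1200.0, 1000.0])      # worse than the mean

for name, pred in [("good", good), ("baseline", baseline), ("poor", poor)]:
    print(name, mse(y_true, pred), r2(y_true, pred))
```

A model that always predicts the mean scores an R^2 of exactly 0, so a negative R^2 means the model does worse on the test set than that constant guess.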
The goal of this analysis is to determine which factors have the greatest sway over each model's predictions. We could compare each model's coefficients, trying to relate them back to their original features. However, thankfully, scikit-learn provides a function for estimating feature importance, "permutation_importance". Here, I have used this function to try to determine the importance of each feature using the most predictive model, Ridge Regression.
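A minimal sketch of this step is shown below, using synthetic data in which one feature is deliberately made dominant. The feature names, coefficients, and `alpha` value are hypothetical; only the call to `permutation_importance` mirrors the approach described above.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: the first feature drives the target, the second has a
# small effect, and the last two are pure noise. Names are hypothetical.
rng = np.random.default_rng(0)
feature_names = ["crime_count", "income_per_return", "pct_parks", "pct_pharmacies"]
X = rng.normal(size=(200, 4))
y = 1200 - 50 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Shuffle one feature at a time on the held-out data and record how much
# the model's score (R^2 for regressors) drops on average.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the importance is measured as a drop in the model's own score, the ranking is only as trustworthy as the model: when the model predicts poorly, the importances say little about the real-world features.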
Feature Importance, Ranked
What we have determined is that, once a predictive model is well trained, we can analyze it to determine which features are most important in making its predictions. However, as it stands, even the best model I have presented is not very predictive (with an R^2 score of less than 0). It is clear that there is too much noise. One would do well to begin again, using more of the available data and more computational power (enabling us to explore higher-degree polynomial models). Further refinement will mean selecting only a few features from each dataset: for instance, those features with the most predictive power according to the Ridge Regression model above, and no others. Future refinements of this analysis might:
- Attempt the above analysis again, limiting venue data to a subset of decided "potential" influencers, to reduce noise.
- Search Foursquare for the complete set of potentially predictive venues, such as Playgrounds, Pharmacies, and Metro stations.
- Instead of including the "count" of venues in the neighborhood, reassess the model to use "distance" from each school to the venue.
- Use the entire crime dataset (instead of a subset) to be sure of accurate neighborhood distributions.
- Attempt to find more granular information on individual income (or try analyzing income by the number of individuals in each income bracket).
- Try disregarding the "neighborhood" distinction completely and instead judge each school individually, with features established by Latitude/Longitude pairs and distance.
The above analysis has demonstrated this: by iteratively improving models, we have seen corresponding iterative improvements in the predictiveness of said models. Therefore, we might conclude that, by some continued refinement, we will be able to achieve a reasonably predictive model. Much work is to be done, and the models trained above are not helpful in determining the importance of real-world features. However, they do promise improvement.