Week 3 of Civic Analytics with Hub (correlation and clustering)

ManushiMajumdar · 07-28-2020

This week we focus on the attributes of a dataset: how relationships between attributes can be detected and interpreted. We then build on that understanding to spot hidden patterns in our data.

In the first example we fetch neighborhood boundaries for Washington, DC to observe correlation among socioeconomic factors. We enrich the neighborhoods layer with a few socioeconomic variables, such as Population, Median Household Income, and Households below poverty levels. We then display the data as a scatter matrix - a collection of scatter plots - which plots each numerical variable against every other, to see whether changes in one variable are reflected in changes in another. Having obtained a visual understanding of the correlated variable pairs, we then use statistical tests from Python's scipy (Scientific Python) library to compute the correlation numerically for a few of those pairs.
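As a rough illustration of the numerical step, the sketch below computes a correlation coefficient for one variable pair with scipy. The data is synthetic and the variable names are placeholders, not the enriched DC layer from the notebook. The choice of Pearson and Spearman tests is also an assumption, since the post does not say which tests the notebook uses.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for two enriched neighborhood attributes.
# These are NOT real Washington, DC values; they only illustrate the calls.
rng = np.random.default_rng(0)
median_income = rng.normal(80_000, 20_000, size=50)

# Build a second variable that tends to fall as income rises, plus noise.
households_below_poverty = (
    900 - 0.008 * median_income + rng.normal(0, 60, size=50)
)

# Pearson measures linear association; Spearman measures monotonic (rank) association.
r, p_value = stats.pearsonr(median_income, households_below_poverty)
rho, p_rho = stats.spearmanr(median_income, households_below_poverty)

print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3g})")
```

A coefficient near -1 or 1 indicates a strong relationship and a value near 0 a weak one. For the visual step, `pandas.plotting.scatter_matrix` is one way to draw a scatter matrix from a dataframe, though the notebook may use a different plotting tool.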

The second notebook demonstrates two different techniques for detecting clusters or patterns in data. We begin by fetching data for rodent inspection and treatment sites in Washington, DC over the last 30 days and check for point clusters, which can help inform strategies for follow-up treatments and inspections. The second example checks whether neighborhoods within the city of Tucson can be grouped together based on similarities in income variables. We read in the data and extract the variables of interest into a separate dataframe. This dataframe is the input to the KMeans unsupervised learning method from Python's scikit-learn library, which helps us detect clusters of neighborhoods that are similar in our chosen variables.
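The KMeans step can be sketched roughly as follows. The two income columns, their values, and the choice of two clusters are all made up for illustration and do not come from the Tucson data. Standardizing the inputs first is an assumption too, shown here because KMeans is distance-based and sensitive to scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic "neighborhoods" with two hypothetical income variables:
# column 0 = median household income, column 1 = per capita income.
rng = np.random.default_rng(42)
lower = rng.normal([35_000, 18_000], [3_000, 2_000], size=(20, 2))
higher = rng.normal([95_000, 50_000], [5_000, 4_000], size=(20, 2))
X = np.vstack([lower, higher])

# Put both variables on a comparable scale before clustering.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an illustrative choice, not a value from the notebook.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

print(labels)
```

Each entry in `labels` is the cluster assigned to the corresponding row, so it can be attached back to the neighborhoods dataframe as a new column and mapped.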