Predicting Health with Spatial Data Variables

MatthewKetterer · ‎03-20-2025

Hello, my name is Matt Ketterer and I am a recent Master of Applied Statistics graduate from Pennsylvania State University. In my research project I used spatial data to estimate the health benefits of public park access. While I am not an ArcGIS user, I found through my research that spatial data can be a source of new engineered feature variables. This blog post is about my journey using spatial data to predict health by combining it with national health survey data.

The Trust for Public Land (tpl.org) is dedicated to increasing and tracking public park access in the US. Its mission of “connecting everyone to the outdoors” is an easy concept to get behind and would likely benefit people’s health. Still, I wondered whether I could model this relationship to show whether access improves health outcomes, and by how much. For instance, from a healthcare cost savings standpoint, it would be valuable to know the extent to which park access could reduce healthcare visits by encouraging outdoor and physical activity. I wanted to see if I could model this using the US metropolitan and micropolitan areas from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) data. This is a subset of a massive telephone health survey that focuses on town and city areas, where most of the population resides and where the population is predicted to grow in the future.

The ParkServe database is a great tool that I came across. While its main function is to serve as an inventory for parks, it has census demographics and health-related filters embedded in the website. I used it to compare access percentages across the surveyed health responses from the CDC.

Data Gathering and Exploratory Analysis

To manage, model, and report the data I used the R programming language, as it is freely available and robust. In the future, I would like to use ArcGIS and R together to make map creation and the estimation of spatial variables easier. Since the public park data is only available through a web portal, I had to manually retrieve the values for each of the CDC’s 130 surveyed metropolitan and micropolitan areas from the ParkServe Database GIS portal. This part of the data gathering was sometimes cumbersome. Once I had the values for each city/region, I could merge them in R with the approximately 220,000 individual responses in the CDC dataset, matching each response to its corresponding city. The hardest part was that some of the regions didn’t match completely, due to slight changes in the core-based statistical area (CBSA) definitions between 2015 and 2024. Where a county was present in one CBSA definition but missing from the other, I had to manually combine counties to come up with a park access estimate using population-weighted averages. The map below shows all the areas used in the dataset and the corresponding park access percentage for each city's residents.
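The county-combining step and the merge described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the R code used for the project, and everything in it is an assumption for the sake of the example: the function names, the `cbsa` and `park_access` field names, the CBSA codes, and all of the numbers are hypothetical.

```python
def weighted_park_access(counties):
    """Population-weighted mean of county park access percentages.

    `counties` is a list of (population, access_percent) pairs for the
    counties being combined into one region.
    """
    total_pop = sum(pop for pop, _ in counties)
    if total_pop <= 0:
        raise ValueError("total population must be positive")
    return sum(pop * pct for pop, pct in counties) / total_pop


def attach_park_access(responses, access_by_cbsa):
    """Add a 'park_access' field to each survey response by CBSA code.

    Works like a left join: responses whose CBSA code has no park value
    get None, so unmatched regions are easy to find and fix by hand.
    """
    return [dict(r, park_access=access_by_cbsa.get(r["cbsa"])) for r in responses]


# Hypothetical example: two counties combined into one region.
# Populations and percentages are made up for illustration.
region_access = weighted_park_access([(300_000, 80.0), (100_000, 40.0)])

# Hypothetical survey rows and region-level park values.
responses = [
    {"id": 1, "cbsa": "A", "unhealthy_days": 0},
    {"id": 2, "cbsa": "B", "unhealthy_days": 5},
]
merged = attach_park_access(responses, {"A": region_access})
print(merged)
```

Weighting by population means a large county with good access counts for more than a small county with poor access, which matches the idea of estimating the share of a region's residents who have park access. Leaving unmatched regions as missing, instead of dropping them, makes the CBSA mismatches visible.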

Building a Statistical Model and Results

To estimate the effect of public park access, I chose as the response variable the answer to the BRFSS survey question “How many days in the last 30 days was your physical health not good?”. This is a discrete count variable between 0 and 30 that can be modelled with a count distribution, in this case a zero-inflated negative binomial distribution (not to be confused with the binomial distribution used in logistic regression). The advantage of using a count variable in a statistical model, as opposed to a categorical one such as asking if your health is “Good”, “Fair”, or “Poor”, is that it carries more information and therefore gives more statistical power, since there are 31 ordered ‘bins’ of responses. The model was also adjusted for several other factors that affect health, which lends some validation to the park access variable. The predictor variables used in the model were park access, age, race, sex, education, and smoking. The region itself was also included in the model to see if there were any differences in unhealthy days responses between cities, but no city was statistically different from any other.
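For readers unfamiliar with the zero-inflated negative binomial, here is a minimal sketch of its probability mass function. It is written in Python rather than R and is not the fitted model from this project. The parameter names `mu` (negative binomial mean), `theta` (dispersion), and `pi_zero` (extra probability of a zero) are assumptions chosen for illustration, and the values in the example are made up.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial probability of count y, with mean mu and dispersion theta."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, theta, pi_zero):
    """Zero-inflated negative binomial probability of count y.

    With probability pi_zero the response is a "structural" zero;
    otherwise it is drawn from the negative binomial.
    """
    base = (1.0 - pi_zero) * nb_pmf(y, mu, theta)
    return pi_zero + base if y == 0 else base


def zinb_mean(mu, pi_zero):
    """Expected count under the zero-inflated model."""
    return (1.0 - pi_zero) * mu


# Made-up parameter values, for illustration only.
probs = [zinb_pmf(y, mu=4.0, theta=0.5, pi_zero=0.3) for y in range(31)]
print(probs[:3], zinb_mean(4.0, 0.3))
```

The zero-inflation term adds probability mass at zero on top of what the negative binomial already allows, which is the kind of pattern this distribution is designed for. Note that the distribution itself has no upper limit, whereas the survey response stops at 30 days.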

The figure above shows that predicted unhealthy days decrease as park access increases, stratified by age and education.

In this graph, the predicted unhealthy days response is stratified by sex and race and shows the same trend: fewer unhealthy days for residents of cities with higher park access.

After adjusting for other factors known to cause poor health, park access has a small but statistically significant effect. Residents of a city with a better park system have a lower predicted number of sick days than those who live in a city in the lowest tier. This suggests that increased access helps drive park usage and physical activity, such as walking and being in nature, which in turn improves health. Also, when we are talking about whole populations, small positive benefits add up to cost savings.

Final Thoughts

I am excited and inspired to combine more spatial and non-spatial data in my future research. I hope to work directly with ArcGIS and R in the future to explore their possibilities. In the meantime, I’m going to take my kid to the park. If you ever get a chance to explore the Puget Sound area, please do, or just visit your local park, or advocate for more parks where you live.

Thanks for reading! Please feel free to contact me with questions by email at mketterbob@gmail.com or on LinkedIn.