Answering Why Questions
An introduction to using regression analysis with spatial data
By Lauren Scott, Esri Geoprocessing Spatial Statistics Product Engineer, and Monica Pratt, ArcUser Editor
Regression analysis allows you to model, examine, and explore spatial relationships and can help explain the factors behind observed spatial patterns. Regression analysis is also used for prediction. Tools included in the Modeling Spatial Relationships toolset, found in ArcToolbox, help answer why questions such as
• Why are there places in the United States where people persistently die young? What might be causing this?
• Why do some places experience more crime or fire events? Can we model the characteristics of these places to help reduce these incidents?
• Why do some locations have a higher-than-expected rate of traffic accidents? Are there factors contributing to this? Are there policy implications or mitigating actions that might reduce traffic accidents across the city and/or in particular areas?
You may want to understand why people are persistently dying young in certain regions, for example, or predict rainfall where there are no rain gauges.
The tools in this toolset include Ordinary Least Squares (OLS) Regression and Geographically Weighted Regression (GWR). OLS, the best known of all regression techniques, is the proper starting point for all spatial regression analyses. It provides a global model of the variable or process you are trying to understand or predict (early death/rainfall) and creates a single regression equation to represent that process. GWR is one of several spatial regression techniques increasingly used in geography and other disciplines. GWR provides a local model of the variable or process you are trying to understand/predict by fitting a regression equation to every feature in the dataset. When used properly, these methods are powerful and reliable statistics for examining/estimating linear relationships.
Linear relationships are either positive or negative. If you find that the number of search and rescue events increases when daytime temperatures rise, the relationship is said to be positive; there is a positive correlation.
Another way to express this positive relationship is to say that search and rescue events decrease as daytime temperatures decrease. Conversely, if you find that the number of crimes goes down as the number of police officers patrolling an area goes up, the relationship is said to be negative. You can also express this negative relationship by stating that the number of crimes increases as the number of patrolling officers decreases.
The illustration at the top of the next page depicts both positive and negative relationships as well as the case where there is no relationship between two variables. Correlation analyses and their associated graphics, depicted in this illustration, test the strength of the relationship between two variables. Regression analyses, on the other hand, make a stronger claim. These analyses attempt to demonstrate the degree to which one or more variables potentially promote positive or negative change in another variable.
Using Regression Analysis
Regression analysis can be used for many types of applications such as modeling fire frequency to determine high-risk areas and better understand the factors that contribute to high-risk areas. It can be used to model property loss from fire as a function of variables such as degree of fire department involvement, response time, or property value. If you find that response time is the key factor, you may need to build more fire stations. If you find that involvement is the key factor, you may need to increase equipment/officers dispatched.
Regression analysis can help you better understand phenomena to make better decisions, predict values for phenomena at other locations or times, and test hypotheses. Modeling a phenomenon can yield a better understanding that can affect policy or provide input for deciding which actions are most appropriate. The basic objective is to measure
Regression Analysis Components
Terms and basic concepts
It is impossible to discuss regression analysis without first becoming familiar with a few of the terms and basic concepts specific to regression statistics. The regression equation is the mathematical formula applied to the explanatory variables to best predict the dependent variable you are trying to model. Although those in the geosciences think of X and Y as coordinates, the notation in regression equations uses X and y. The dependent variable is always y, and independent or explanatory variables are always X. Each independent variable is associated with a regression coefficient describing the strength and the sign of that variable's relationship to the dependent variable. A regression equation might look like the accompanying illustration where y is the dependent variable, the Xs are the explanatory variables, and the βs are regression coefficients.
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$
Elements of an OLS regression equation: y is the dependent variable, the βs are the coefficients, the Xs are the explanatory variables, and ε is the random error term/residuals.
Suppose you want to both model and predict residential burglary (RES_BURG) for the census tracts in your community. You've identified median income (MED_INC), the number of vandalism incidents (VAND), and the number of household units (HH_UNITS) to be key explanatory variables. The equation would have the elements shown here:
$$\text{RES\_BURG} = \beta_0 + \beta_1(\text{MED\_INC}) + \beta_2(\text{VAND}) + \beta_3(\text{HH\_UNITS}) + \varepsilon$$
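To make the burglary equation more concrete, the following is a minimal Python sketch of how the coefficients in an equation of this form can be estimated by ordinary least squares. It uses NumPy rather than the OLS tool in ArcToolbox, so it only illustrates the arithmetic and is not the tool's implementation. The data are synthetic, the "true" coefficient values used to generate them are invented for illustration, and `fit_ols` is a hypothetical helper name.

```python
import numpy as np


def fit_ols(X, y):
    """Estimate [b0, b1, ..., bk] for y = b0 + b1*X1 + ... + bk*Xk + e.

    Returns the coefficient vector and the residuals (observed minus fitted).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept b0.
    design = np.column_stack([np.ones(len(y)), X])
    # Least-squares solution: minimizes the sum of squared residuals.
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    return coefs, residuals


# Synthetic census-tract data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n_tracts = 200
med_inc = rng.uniform(20, 120, n_tracts)        # median income, in thousands
vand = rng.poisson(15, n_tracts).astype(float)  # vandalism incidents
hh_units = rng.uniform(500, 3000, n_tracts)     # household units

# Invented "true" relationship: b0, b1 (MED_INC), b2 (VAND), b3 (HH_UNITS).
true_coefs = np.array([10.0, -0.2, 1.5, 0.01])
noise = rng.normal(0.0, 2.0, n_tracts)          # random error term
res_burg = (true_coefs[0]
            + true_coefs[1] * med_inc
            + true_coefs[2] * vand
            + true_coefs[3] * hh_units
            + noise)

explanatory = np.column_stack([med_inc, vand, hh_units])
coefs, residuals = fit_ols(explanatory, res_burg)

for name, b in zip(["Intercept", "MED_INC", "VAND", "HH_UNITS"], coefs):
    sign = "positive" if b > 0 else "negative"
    print(f"{name:>9}: {b: .4f} ({sign} relationship)")
```

In this sketch the sign of each estimated coefficient indicates whether the relationship with RES_BURG is positive or negative, and the residuals are the portion of the dependent variable the equation does not explain.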
40 ArcUser Spring 2009
www.esri.com