Answering Why Questions
Continued from page 41

you are trying to model or understand, running the regression tool to determine which variables are effective predictors, then removing or adding variables until you find the best model possible. The accompanying article, "Regression Analysis Components: An introduction to terms and basic concepts," defines the terms used when discussing this type of analysis.

Regression Analysis Issues

OLS regression is a straightforward method that has both well-developed theory behind it and effective diagnostics to assist with interpretation and troubleshooting. OLS is only effective and reliable, however, if your data and regression model satisfy all the assumptions inherently required by this method. Be sure to visit the ArcGIS Resource Center for ArcGIS Desktop and read "How Regression Models Go Bad." This article supplies an excellent discussion of this topic with examples. Spatial data often violates the assumptions and requirements of OLS regression, so it is important to use regression tools in conjunction with appropriate diagnostic tools that can assess whether regression is an appropriate method for your analysis, given the structure of the data and the model being implemented.

Spatial Regression

Spatial data exhibits two properties that make it difficult (but not impossible) to meet the assumptions and requirements of traditional (nonspatial) statistical methods such as OLS regression.

First, geographic features are more often than not spatially autocorrelated. This means that features near each other tend to be more similar than features that are farther away. Because nearby observations are not independent, traditional (nonspatial) regression methods treat them as carrying more information than they do, which creates an overcount type of bias.

Second, geography is important. Often, the processes most important to the model are nonstationary; these processes behave differently in different parts of the study area. This characteristic of spatial data can be referred to as regional variation or spatial drift.

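To make regional variation concrete, here is a minimal Python sketch (NumPy assumed; the data, the east/west split, and every name are hypothetical and not taken from the article). It simulates a study area in which the same explanatory variable has a strong effect in the west and a weak effect in the east. A single global OLS fit reports one in-between slope that describes neither region well. The sketch uses two ordinary OLS fits for illustration only; it is not GWR and not an ArcGIS tool.

```python
import numpy as np

def ols_fit(x, y):
    """Return (intercept, slope) from an ordinary least squares fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

# Hypothetical study area: the first half of the features lie in the west,
# the second half in the east.
rng = np.random.default_rng(0)
n = 200
east = np.arange(n) >= n // 2
x = rng.uniform(0, 10, n)

# Assumed nonstationary process: slope 2.0 in the west, 0.5 in the east.
y = np.where(east, 1 + 0.5 * x, 1 + 2.0 * x) + rng.normal(0, 0.1, n)

_, global_slope = ols_fit(x, y)
_, west_slope = ols_fit(x[~east], y[~east])
_, east_slope = ols_fit(x[east], y[east])

print(f"global slope: {global_slope:.2f}")
print(f"west slope:   {west_slope:.2f}")
print(f"east slope:   {east_slope:.2f}")
```

The global slope falls between the two regional slopes, which is the kind of averaging-out that a global model can produce when the underlying process is nonstationary.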
True spatial regression methods were developed to robustly deal with these two characteristics of spatial data and even incorporate the special qualities of spatial data to better model data relationships. Some spatial regression methods deal effectively with spatial autocorrelation, while others accommodate nonstationary processes well. At present, no spatial regression methods are effective for both characteristics. However, for a properly specified GWR model, spatial autocorrelation is typically not a problem.

There seems to be a big difference between how a traditional statistician and a spatial statistician view spatial autocorrelation. The traditional statistician sees spatial autocorrelation as a bad thing that needs to be removed from the data (through resampling, for example) because it violates the underlying assumptions of many traditional (nonspatial) statistical methods. For the geographer or GIS analyst, spatial autocorrelation is evidence of important underlying spatial processes at work. It is an integral component of the data. Removing space removes the data from its spatial context; it is like getting only half the story. The spatial processes and spatial relationships evident in the data are a primary interest and are one of the reasons geographers get so excited about spatial data analysis.

However, to avoid an overcounting type of bias in your model, you must identify the full set of explanatory variables that will effectively capture the inherent spatial structure in your dependent variable. If you cannot identify all these variables, you will very likely see statistically significant spatial autocorrelation in the model residuals. Unfortunately, you cannot trust your regression results until this is remedied. Use the Spatial Autocorrelation tool in the Spatial Statistics toolbox to test for statistically significant spatial autocorrelation in your regression residuals.

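The residual check described above can be sketched in Python (NumPy assumed). This is an illustration, not the Spatial Autocorrelation tool's implementation: it computes Global Moran's I with simple binary distance-band weights and a permutation-based pseudo p-value, and the grid, variables, and function names are all hypothetical. The model that omits a spatially structured explanatory variable leaves clustered residuals; the model that includes it does not.

```python
import numpy as np

def distance_band_weights(coords, threshold):
    """Binary spatial weights: 1 if two distinct features lie within threshold."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return ((d > 0) & (d <= threshold)).astype(float)

def morans_i(values, weights):
    """Global Moran's I for one variable under the given spatial weights."""
    z = values - values.mean()
    return (len(values) / weights.sum()) * (z @ weights @ z) / (z @ z)

def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, weights)
    sims = np.array([morans_i(rng.permutation(values), weights)
                     for _ in range(n_perm)])
    return (np.sum(sims >= observed) + 1) / (n_perm + 1)

def ols_residuals(X, y):
    """Residuals from an OLS fit of y on X (an intercept is added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ coef

# Hypothetical study area: a 10 x 10 grid of features with unit spacing.
rows, cols = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
coords = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
w = distance_band_weights(coords, threshold=1.0)  # edge-sharing neighbors only

rng = np.random.default_rng(42)
x1 = rng.normal(size=100)   # explanatory variable with no spatial pattern
x2 = coords[:, 1]           # explanatory variable with a west-to-east gradient
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=100)

resid_missing = ols_residuals(x1, y)                       # x2 left out
resid_full = ols_residuals(np.column_stack([x1, x2]), y)   # x2 included

i_missing = morans_i(resid_missing, w)
i_full = morans_i(resid_full, w)
p_missing = permutation_p_value(resid_missing, w)
p_full = permutation_p_value(resid_full, w)

print(f"x2 omitted:  Moran's I = {i_missing:.2f}, pseudo p = {p_missing:.3f}")
print(f"x2 included: Moran's I = {i_full:.2f}, pseudo p = {p_full:.3f}")
```

A large positive Moran's I with a small pseudo p-value for the first model signals that a key explanatory variable is missing, which is the situation in which the regression results should not be trusted.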
There are at least three strategies for dealing with spatial autocorrelation in regression model residuals: resampling input variables, isolating spatial and nonspatial components, and incorporating spatial autocorrelation into the regression model.

• Resample until the input variables no longer exhibit statistically significant spatial autocorrelation. While this does not ensure the analysis is free of spatial autocorrelation problems, these problems are far less likely

Regression Analysis Components
Continued from page 41

P-values are generated by a statistical test that most regression methods perform to compute a probability for the coefficient associated with each independent variable. The null hypothesis for this statistical test states that a coefficient is not significantly different from zero (in other words, for all intents and purposes, the coefficient is zero and the associated explanatory variable is not helping your model). Small p-values reflect small probabilities and suggest that the coefficient is, indeed, important to your model with a value that is significantly different from zero (the coefficient is not zero). For example, you would say that a coefficient with a p-value of 0.01 is statistically significant at the 99 percent confidence level; the associated variable is an effective predictor. Variables whose coefficients are not significantly different from zero do not help predict or model the dependent variable; they are almost always removed from the regression equation unless there are strong theoretical reasons to keep them.

R² / R-Squared values, which range from 0 to 1.0 (0 to 100 percent), are a measure of model performance. Multiple R-Squared and Adjusted R-Squared are both statistics derived from the regression equation to quantify model performance. If your model fits the observed dependent variable values perfectly, R-Squared is 1.0 and you (no doubt) have made an error. Perhaps you've used a form of y to predict y.
More likely, you will see R-Squared values such as 0.49, which you can interpret in the following manner: this model explains 49 percent of the variation in the dependent variable. To understand what the R-Squared value is indicating, create a bar graph showing both the estimated and observed y values sorted by the estimated values. Notice how much overlap there is. This graphic provides a visual representation of how well the model's predicted values explain the variation in the observed dependent variable values. The Adjusted R-Squared value is always a bit lower than the Multiple R-Squared value because it reflects model complexity (the number of variables) as it relates to the data.

www.esri.com 42 ArcUser Spring 2009
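As a rough illustration of the R-Squared and Adjusted R-Squared statistics described in the sidebar, the Python sketch below (NumPy assumed) uses the standard textbook formulas, which are an assumption here because the article does not spell them out. The data and all names are hypothetical. The last step reproduces the "form of y to predict y" mistake, which drives R-Squared to 1.0.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variation in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_vars):
    """Penalize R-Squared for the number of explanatory variables used."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)

def ols_predict(X, y):
    """Fitted values from an OLS fit of y on X (an intercept is added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ coef

# Hypothetical data: three candidate variables, only the first one matters.
rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=2.0, size=n)

r2 = r_squared(y, ols_predict(X, y))
adj_r2 = adjusted_r_squared(r2, n_obs=n, n_vars=X.shape[1])

# The mistake the sidebar warns about: a rescaled copy of y used as a predictor.
X_leaky = np.column_stack([X, 2.0 * y])
r2_leaky = r_squared(y, ols_predict(X_leaky, y))

print(f"R-Squared:          {r2:.2f}")
print(f"Adjusted R-Squared: {adj_r2:.2f}")
print(f"R-Squared with a form of y as a predictor: {r2_leaky:.2f}")
```

Because two of the three variables contribute nothing, the adjusted value drops below the multiple R-Squared value, reflecting the cost of the extra model complexity.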