Building an urban growth prediction model enables city planners to make informed urban policy decisions and assists commercial investors in making profitable choices on where to build city infrastructure. While the resultant monetary and social benefits are abundant, it can be a challenge to pinpoint which areas would be most suitable for urbanization or how probable urban development might be in a specific area.
A team that included Jian Lange and Witold Fraczek from Esri and Carsten Lange, a professor in the Department of Economics at California State Polytechnic University, Pomona, worked on a project to develop and build a machine learning model to identify locations with a higher probability of urban development.
The team chose the Research Triangle in North Carolina as the study area. The region encompasses North Carolina State University, Duke University, and the University of North Carolina at Chapel Hill, and spans approximately 3,744 square miles. It was chosen for its relatively limited size and because its land use patterns are geographically typical for the United States.
The goal of the project was to extract, analyze, and compare patterns of urban development from land cover raster data between 2001 and 2016, and then build a machine learning model that would best predict future patterns of urban development.
When analytic framework preferences seemingly diverge
While the project scope acted as the true north star, the team members had different preferences when it came to working with the data itself. The team from Esri preferred working with ArcGIS Pro, whereas Professor Lange preferred working with R, a programming language for statistical computing. These different methods posed the risk of having to constantly move large amounts of data between the two software applications, potentially compromising data integrity and hindering efficient collaboration. In the early stages of the project, data was exchanged between ArcGIS Pro and R using comma-separated values (CSV) files, but this proved to be a time-consuming process. In addition, whenever a change was made in the underlying GIS dataset, several intermediate datasets had to be recreated, stored, and exchanged.
Bringing ArcGIS and R Together
The solution to this challenge was to introduce R-ArcGIS Bridge, an Esri R package that allows the seamless passing of data and analytic results between ArcGIS Pro and R. Leveraging R-ArcGIS Bridge enabled the team to automate analytic processes and maintain changes made to the underlying GIS datasets. It also enabled the team to tap into the powerful spatial data processing and advanced mapping capabilities of ArcGIS Pro and combine them with the statistical modeling capabilities of R.
Project Workflow and Model Summary
The team prepared and processed the land cover raster datasets in ArcGIS Pro, and then passed the data via R-ArcGIS Bridge to R as an R data frame. A Random Forest model was then trained in R to generate predictions. Finally, the predictions were passed back to ArcGIS Pro via R-ArcGIS Bridge to visualize the results.
The detailed workflow consisted of the following steps:
1. Data preparation
The data itself was sourced in the form of land cover raster data from the National Land Cover Database (NLCD) from 2001 and 2016. In order to identify areas of urban growth during this period, ArcGIS Pro was used to re-categorize the datasets into urban and non-urban land use types. Furthermore, to prepare explanatory variables for the predictive model, ArcGIS Pro was used to create seven new raster layers – including drive time to the nearest urban center, proximity to freeways, and proximity to environmentally protected areas.
The various spatial datasets then needed to be aggregated into a single attribute table in order to be analyzed in R. First, the ‘Raster to Point’ conversion tool was used to convert the raster data to a point feature class. Then, the ‘Extract Multi Values to Points’ tool available via ArcGIS Spatial Analyst was used to extract values from the explanatory raster datasets and append them as attributes to the point feature class previously created. This resulted in a dataset with 8.9 million records in total. This entire process was automated using ModelBuilder.
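The team performed the re-categorization in ArcGIS Pro, but the underlying logic can be sketched in a few lines of R. The example below is only an illustration: it assumes that the NLCD "Developed" classes (codes 21 to 24) are the ones treated as urban, which the article does not state, and the cell values are made up.

```r
# NLCD codes 21-24 are the "Developed" land cover classes. Treating exactly
# these codes as urban is an assumption made for this illustration.
urban_codes <- c(21, 22, 23, 24)

# Hypothetical land cover codes for the same six cells in each year.
nlcd_2001 <- c(41, 81, 21, 82, 42, 23)
nlcd_2016 <- c(41, 22, 21, 82, 23, 24)

# Re-categorize each year into urban (TRUE) and non-urban (FALSE).
urban_2001 <- nlcd_2001 %in% urban_codes
urban_2016 <- nlcd_2016 %in% urban_codes

# A cell counts as new urban development if it was non-urban in 2001
# and urban in 2016.
became_urban <- !urban_2001 & urban_2016

# Count changed and unchanged cells.
table(became_urban)
```

A flag like became_urban is the kind of target variable a classification model can be trained on, with the explanatory raster values as predictors.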
2. Passing data from ArcGIS Pro to R
In order to pass the resulting dataset to R for analysis, R-ArcGIS Bridge needed to be set up on a computer that had R, RStudio, and ArcGIS Pro installed. The Bridge was set up using the R-ArcGIS support option in the Geoprocessing tab, and then RStudio was started and the arcgisbinding package was loaded along with other packages. The data transfer process started with the function arc.check_product(), which binds the RStudio session to the ArcGIS installation. To pass the feature class attributes to an R data frame, the function arc.open() was run with the full path to the feature class, including its name, to connect the feature class to an R variable. Then, arc.select() was used to pass the feature class attributes to the R data frame DataAllFromPro.
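The sequence of calls described above might look roughly like the following sketch. The function name read_points_from_pro and the geodatabase path in the usage comment are hypothetical; only arc.check_product(), arc.open(), arc.select(), and the data frame name DataAllFromPro come from the article. The calls are wrapped in a function so that nothing is executed until a machine with ArcGIS Pro and the arcgisbinding package is available.

```r
# Hypothetical helper that reads a feature class attribute table into R.
# It requires ArcGIS Pro and the arcgisbinding package at call time.
read_points_from_pro <- function(feature_class_path) {
  # Bind the R session to the local ArcGIS installation and license.
  arcgisbinding::arc.check_product()

  # Open a connection to the feature class.
  feature_class <- arcgisbinding::arc.open(feature_class_path)

  # Read the attributes of the feature class into an R data frame.
  arcgisbinding::arc.select(feature_class)
}

# Usage (the path is a placeholder, not the team's actual geodatabase):
# DataAllFromPro <- read_points_from_pro(
#   "C:/Projects/UrbanGrowth/UrbanGrowth.gdb/LandCoverPoints"
# )
# str(DataAllFromPro)
```

Because the data is read directly from the geodatabase, rerunning the script after a change in ArcGIS Pro picks up the updated attributes without any intermediate CSV export.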
3. Training and modeling in R
The team decided to use Random Forest, a supervised machine learning algorithm based on multiple decision trees, to predict urban development. The dataset was split into training and test data, with the former consisting of 85% of randomly chosen records and the latter consisting of the remaining 15%, which were used to validate the model’s performance. With the help of the R-ArcGIS Bridge, the team was able to dynamically access the spatial data contained in ArcGIS with R to train the Random Forest model.
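As a minimal sketch of the split and training step, the following R code uses a small synthetic data frame in place of the 8.9 million records. The column names, the seed, the number of trees, and the use of the randomForest package are all assumptions for illustration; the article does not say which Random Forest implementation or settings the team used.

```r
set.seed(42)  # make the random split reproducible

# Synthetic stand-in for the attribute table (hypothetical column names).
n <- 1000
points_df <- data.frame(
  drive_time   = runif(n, 0, 60),  # minutes to the nearest urban center
  dist_freeway = runif(n, 0, 20)   # distance to the nearest freeway
)

# Simulated outcome: cells closer to urban centers and freeways are more
# likely to become urban.
p_urban <- plogis(2 - 0.1 * points_df$drive_time - 0.15 * points_df$dist_freeway)
points_df$became_urban <- factor(
  ifelse(runif(n) < p_urban, "yes", "no"),
  levels = c("no", "yes")
)

# Randomly assign 85% of the records to training and the rest to testing.
train_idx <- sample(seq_len(n), size = floor(0.85 * n))
train_df  <- points_df[train_idx, ]
test_df   <- points_df[-train_idx, ]

# Train and predict only if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_model <- randomForest::randomForest(
    became_urban ~ .,
    data  = train_df,
    ntree = 200
  )

  # Predicted probability of the "yes" class for each held-out record.
  prob_urban <- predict(rf_model, newdata = test_df, type = "prob")[, "yes"]
  print(head(prob_urban))
}
```

Requesting type = "prob" returns class probabilities rather than hard labels, which is what makes it possible to map a probability of urban development for each location later on.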
The team also decided to balance the dataset, in which 99% of the area had remained unchanged between 2001 and 2016 (i.e., showed no urban development), with only 0.01% showing change. With such a class imbalance, the machine learning model could simply learn to predict no change in urban development every time. To balance the dataset, the Synthetic Minority Oversampling Technique (SMOTE) from the DMwR package in R was used. SMOTE removes records from the majority class and then uses a k-nearest-neighbors algorithm to artificially create new records for the minority class. The Random Forest model yielded predictions with 89% accuracy for urban development and 95% accuracy for no urban development.
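The balancing step and the per-class evaluation can be sketched as follows. The data is synthetic, the column names are hypothetical, and the perc.over, perc.under, and k values are simply the defaults of DMwR::SMOTE, not the team's settings. The SMOTE call only runs if DMwR is installed. The helper per_class_accuracy is also hypothetical; it reflects one possible reading of the reported figures, namely the share of each actual class that was predicted correctly.

```r
set.seed(1)

# Synthetic, heavily imbalanced stand-in data (hypothetical column names).
n_major <- 980
n_minor <- 20
imbalanced <- data.frame(
  drive_time   = c(runif(n_major, 10, 60), runif(n_minor, 0, 20)),
  dist_freeway = c(runif(n_major, 2, 20),  runif(n_minor, 0, 5)),
  became_urban = factor(rep(c("no", "yes"), times = c(n_major, n_minor)),
                        levels = c("no", "yes"))
)

# Balance the classes with SMOTE if the DMwR package is available.
if (requireNamespace("DMwR", quietly = TRUE)) {
  balanced <- DMwR::SMOTE(
    became_urban ~ .,
    data       = imbalanced,
    perc.over  = 200,  # synthetic minority records to create, in percent
    k          = 5,    # nearest neighbors used to generate each record
    perc.under = 200   # majority records kept per synthetic record, in percent
  )
  print(table(balanced$became_urban))
}

# Hypothetical helper: share of records in each actual class that were
# predicted correctly. `actual` must be a factor.
per_class_accuracy <- function(actual, predicted) {
  predicted <- factor(predicted, levels = levels(actual))
  confusion <- table(actual = actual, predicted = predicted)
  diag(confusion) / rowSums(confusion)
}

# Tiny made-up example: 3 of 4 "no" and 1 of 2 "yes" records are correct.
actual    <- factor(c("no", "no", "no", "no", "yes", "yes"),
                    levels = c("no", "yes"))
predicted <- c("no", "no", "no", "yes", "yes", "no")
per_class_accuracy(actual, predicted)
```

Reporting accuracy per class matters with imbalanced data, because a model that always predicts no change would score very high on overall accuracy while never identifying any urban development.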
4. Passing data back from R to ArcGIS
Once the predictions were generated, the resulting data frame, which included the predicted probabilities of urban development, needed to be transferred back to ArcGIS Pro for visualization. In RStudio, arc.write() was used to write the data frame from R to the original geodatabase in ArcGIS. The resulting table in ArcGIS included columns containing the prediction results from the Random Forest model, including the predicted probability for each record to change from non-urban to urban. The ArcGIS Pro table imported from R was then joined to the original point feature class using the ‘Add Join’ tool, and a raster layer was created from the joined point features using the ‘Point to Raster’ tool.
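Writing the results back might look like the sketch below. The function name write_predictions_to_pro, the table path, and the results_df name in the usage comment are hypothetical, and overwriting an existing table is a choice made here for illustration. The sketch assumes the data frame keeps an identifier column from the original point feature class so that the ‘Add Join’ step has a key to match on.

```r
# Hypothetical helper that writes a data frame of predictions to a
# geodatabase table. It requires ArcGIS Pro and the arcgisbinding package
# at call time.
write_predictions_to_pro <- function(predictions_df, table_path) {
  # Bind the R session to the local ArcGIS installation and license.
  arcgisbinding::arc.check_product()

  # Write the data frame as a table, replacing any existing table
  # with the same name.
  arcgisbinding::arc.write(table_path, predictions_df, overwrite = TRUE)
}

# Usage (the path and data frame name are placeholders):
# write_predictions_to_pro(
#   results_df,
#   "C:/Projects/UrbanGrowth/UrbanGrowth.gdb/RF_Predictions"
# )
```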
5. Visualizing final results
The advanced cartographic capabilities of ArcGIS Pro made it ideal for mapping and visually evaluating the prediction results. The predicted patterns were as expected, with projected urbanization areas located close to existing developed areas and roads. The team also found that some of the predicted urban areas matched actual urban areas shown in 2020 satellite imagery basemaps.
Enabling multi-disciplinary analytic collaboration
Overall, this project achieved its objective of predicting urban growth in the study area and allowed the team to leverage their strengths in both ArcGIS and R. The R-ArcGIS Bridge was a key component in ensuring smooth analytic operations and collaboration by enabling R to dynamically access data from ArcGIS Pro and save R results back to an ArcGIS dataset. This prototype project also shows how multi-disciplinary analytics teams can work closely together to solve complex problems with spatial data science while playing to their strengths and using their preferred analysis frameworks.
For more details on the project, read the full article here.
For more information on the R-ArcGIS Bridge, visit our page.