Data selection and preparation

Do you perform online searches to look for publicly available datasets that contain the information you need for an analysis? Of course. Now, how often do you find a clean dataset with the exact information that you need, nothing more, nothing less, and in a form that is ready to use for your scenario? Probably very rarely. In this blog post, we will look at a process for data search, selection and cleanup.

Let’s say you are an analyst at a marketing firm. Your client is a university that wants to boost its enrollment from public schools in Marion County, Indiana. You are responsible for allocating your firm’s resources to outreach and promotion efforts on behalf of the university. To begin your analysis, you need the locations of all public high schools in Marion County, IN. A search on ArcGIS Hub for “public schools united states” returns several results. One of them is a recently updated dataset of all public schools in the United States, shared by the Oak Ridge National Laboratory.

The 'Last Updated' date tells us that this data was updated recently.

Click on the title Public Schools to open it.

Click on View Metadata and look for information on terms of use. Under Constraints, you can confirm that the dataset is in the public domain, so you are permitted to use it in your analysis.

Next, on the Data tab, use the filter buttons in the column headers to filter the dataset so that it includes only schools in Marion County, IN that teach Grades 9 through 12. This will be the preliminary list of high schools your promotion needs to cover. Download the filtered dataset as a shapefile.
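If you would rather script this filter than click through the column headers, the logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (NAME, COUNTY, STATE, ST_GRADE, END_GRADE), the grade labels, and the sample records are hypothetical, and the rule "the whole grade range falls within 9 through 12" is one possible reading of "teach Grades 9 through 12".

```python
# Hypothetical sample records. The field names and values are assumptions
# made for illustration; check the real attribute table before reusing this.
schools = [
    {"NAME": "Example High School", "COUNTY": "MARION", "STATE": "IN", "ST_GRADE": "09", "END_GRADE": "12"},
    {"NAME": "Example Ninth Grade Center", "COUNTY": "MARION", "STATE": "IN", "ST_GRADE": "09", "END_GRADE": "09"},
    {"NAME": "Example Elementary", "COUNTY": "MARION", "STATE": "IN", "ST_GRADE": "KG", "END_GRADE": "05"},
    {"NAME": "Other County High", "COUNTY": "HAMILTON", "STATE": "IN", "ST_GRADE": "09", "END_GRADE": "12"},
    {"NAME": "Marion High (Ohio)", "COUNTY": "MARION", "STATE": "OH", "ST_GRADE": "09", "END_GRADE": "12"},
]


def grade_to_int(grade):
    """Convert a grade label to a number; PK and KG sort below grade 1."""
    special = {"PK": -1, "KG": 0}
    return special[grade] if grade in special else int(grade)


def is_marion_high_school(rec):
    """True for Marion County, IN schools whose grade range lies within 9-12."""
    in_county = rec["COUNTY"] == "MARION" and rec["STATE"] == "IN"
    start = grade_to_int(rec["ST_GRADE"])
    end = grade_to_int(rec["END_GRADE"])
    return in_county and start >= 9 and end <= 12


high_schools = [rec for rec in schools if is_marion_high_school(rec)]
for rec in high_schools:
    print(rec["NAME"])
```

Checking both the county and the state matters here, because several states have a Marion County.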

Next, you will examine the data. In ArcGIS Pro, add the shapefile to a map and open its attribute table. Sort the schools in descending order of address. Notice that some of the 58 high schools are located extremely close to others. Use the select tool to select one such cluster on the map. You can tell that it contains 3 schools because 3 records are selected in the attribute table.
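Spotting clusters by eye works for a single county, but the same check can be roughed out in code. The sketch below flags pairs of sites that lie within a chosen distance of each other, using the haversine great-circle formula. The site names, coordinates, and the 200 meter threshold are all hypothetical values chosen for illustration, not figures from the dataset.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Hypothetical (name, latitude, longitude) tuples, for illustration only.
sites = [
    ("Site A", 39.7600, -86.2900),
    ("Site B", 39.7605, -86.2902),
    ("Site C", 39.9000, -86.0000),
]


def close_pairs(sites, max_distance_m=200):
    """Return name pairs of sites no farther apart than max_distance_m."""
    return [
        (a[0], b[0])
        for a, b in combinations(sites, 2)
        if haversine_m(a[1], a[2], b[1], b[2]) <= max_distance_m
    ]


pairs = close_pairs(sites)
print(pairs)
```

Comparing every pair is fine for a few dozen schools. Each flagged pair is only a candidate for review; whether two nearby records are really one facility still needs a look at the map and the attributes.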

The Marion County boundary is also included in the screenshot for reference

One of those schools is named “Area 31 Career & Tech Center” and has an enrollment of 0. It is evidently not a high school; the facility appears to be used as a career and tech center (presumably after school hours). Delete this location from the list.
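A record like this one can also be surfaced with a simple check on the enrollment value. The following is a minimal sketch, assuming a hypothetical ENROLLMENT field and made-up records. It sets zero or negative values aside for manual review rather than deleting them outright, since a missing value is sometimes stored as a zero or negative placeholder.

```python
# Hypothetical records; the ENROLLMENT field name and values are assumptions.
records = [
    {"NAME": "Example High School", "ENROLLMENT": 1200},
    {"NAME": "Example Career Center", "ENROLLMENT": 0},
    {"NAME": "Example Academy", "ENROLLMENT": 350},
]


def split_by_enrollment(records):
    """Separate records with positive enrollment from those needing review."""
    keep = [rec for rec in records if rec["ENROLLMENT"] > 0]
    review = [rec for rec in records if rec["ENROLLMENT"] <= 0]
    return keep, review


keep, review = split_by_enrollment(records)
print("Needs review:", [rec["NAME"] for rec in review])
```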

The other two are “Ben Davis High School” and “Ben Davis Ninth Grade Center”. From the Start Grade and End Grade columns, it is evident that the Ninth Grade Center serves only Ninth Grade students and the High School serves Grades 10 through 12. From the map, it appears that they are part of the same facility or school building.

Since you need school locations to plan direct outreach and marketing, it makes sense to treat this as a single high school location that needs to be covered, rather than two. Merge the Ninth Grade Center with the high school using these steps:
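The merge itself is done interactively in ArcGIS Pro. As a rough illustration of what the merge means as a data operation, here is a minimal Python sketch. It assumes, hypothetically, that the merged site keeps the high school's name and location, that the grade range widens to cover both records, and that enrollment is summed. The field names and enrollment figures are placeholders, not values from the dataset.

```python
def merge_sites(primary, secondary):
    """Fold secondary into primary.

    Assumed merge rules (not taken from the original steps): keep the
    primary record's name and location, widen the grade range to cover
    both records, and add the enrollments together.
    """
    merged = dict(primary)
    merged["ST_GRADE"] = min(primary["ST_GRADE"], secondary["ST_GRADE"])
    merged["END_GRADE"] = max(primary["END_GRADE"], secondary["END_GRADE"])
    merged["ENROLLMENT"] = primary["ENROLLMENT"] + secondary["ENROLLMENT"]
    return merged


# Enrollment numbers below are placeholders, not the real figures.
high_school = {"NAME": "Ben Davis High School", "ST_GRADE": 10, "END_GRADE": 12, "ENROLLMENT": 3000}
ninth_grade = {"NAME": "Ben Davis Ninth Grade Center", "ST_GRADE": 9, "END_GRADE": 9, "ENROLLMENT": 1000}

site = merge_sites(high_school, ninth_grade)
print(site)
```

Copying the primary record before changing it leaves the two input records untouched, which makes it easy to compare the merged site against its sources.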

Continue cleaning up the dataset by following these steps:

After these data cleanup steps are complete, the dataset has 42 high school sites.

It’s rare to find data that’s already perfectly formatted for your needs, but the work you do to prepare and clean a dataset also gives you a better understanding of the analysis you’re embarking on. In another blog post, this prepared schools dataset is used to enrich a boundary feature layer for use in a territory design analysis. You can read about the analysis in the Learn ArcGIS lesson Balance Territories for College Recruiters.

About the author

I am a Product Engineer and Writer at Esri, focused primarily on the Business Analyst applications for web, mobile and desktop.
