ArcGIS Pro

End-to-end spatial data science 3: Data preparation and data engineering using Python


In the previous blog article, we used R to process a 30-year daily precipitation dataset (~11,000 rasters) into a collection of 90 CSV files, where each CSV file contains seasonal calculations of four precipitation variables at each location in a 4km by 4km gridded dataset of the US.  For example, the x/y location in the first row of this CSV file experienced precipitation in 14 of the 90 days in the Fall of 2010, totaling about 188 millimeters of precipitation.

Example of the summer/fall 2010 CSV file with precipitation variable calculations.

In this blog article, we’ll use Python* to do some additional data engineering steps to further aggregate the data by calculating the long-term, 30-year averages of each of the four precipitation variables, for each season.  Our final dataset will therefore contain 16 precipitation variables for each location in the 4km by 4km grid.

Python libraries

Just as we did in the previous blog post, our first few lines of code will be for loading the Python libraries we require for our task.

Combining multiple CSVs into one Pandas DataFrame

The first thing we need to do is have a look inside the folder of CSVs we created in the first blog.  We’ll create a variable for the folder location where the CSVs are stored, then use the glob library to search for files in this folder location that match a certain pattern, in this case all files that end with a .csv file extension.  The result is stored as a list with 90 elements representing the 90 seasonal CSV files.

Next, we’ll use the Pandas .read_csv function in a Python list comprehension to read each CSV file into a Pandas DataFrame, then use the pandas .concat function to combine all 90 DataFrames into one large DataFrame.

We can have a look at the dataset dimensions (rows and columns) using the .shape property, then use the .head function to look at the first few rows of the DataFrame.  You should recognize the precipitation variables from the columns in the CSV files, then double check that the total of 57,795,720 rows corresponds to the 30-year seasonal time series at each location (481,631 locations * 120 seasons).

Last, let’s make sure we get a sense of whether we have any missing data.  For this, I’ll use the .isnull function to find missing values in the DataFrame, then the .sum function to add them up for each row.  In this case, the axis parameter indicates whether you are adding horizontally (rows) or vertically (columns).

Here we can see that four of the columns contain about 0.25% missing values, which by most standards is negligible and will not have an impact on the final results.  In our case, we know that these missing values are expected because they are locations that experienced zero precipitation throughout an entire season, and therefore will not have measures of variability (Gini Coefficient) or inequity (Lorenz Asymmetry Coefficient) within a season.

Data engineering: Calculating the 30-year averages of the four seasonal precipitation variables

At this point, we have a 120-season time series of each of the four precipitation variables at each location in this dataset, where the season is indicated by the “season_year” column.  We ultimately want to calculate a single average of each of the precipitation variables for each season, at each location, so the resulting dataset will contain four seasonal averages (winter, spring, summer, fall) of each precipitation variable, at each location.

We’ll first create a new “season” column by stripping off the year from the “season_year” column.  This is achieved using the .str.split function, which splits a text string based on a separator or delimiter (in this case, a space) and then returns the first index position, which here is only the season from the “season_year” column.  We can then use the .unique function on the new column to verify that the four seasons are the only values available.

We now have what we need to calculate one single average for each precipitation variable, for each season, at each location.  We’ll use .groupby to create groupings of data, then .agg to apply an operation to the groupings.  In our case, we’re grouping the data by combination of location and season (e.g. four seasons at each location), then taking the averages of all four precipitation variables within each season.  The .reset_index method at the end ensures that the DataFrame has a numeric index that starts at 0 and increases by 1 for each subsequent row.

We can see here that the DataFrame has been reduced from 57,795,720 rows (4 seasons x 30 years x 481,631 locations) to 1,926,524 rows containing the average of each of the four precipitation variables, for each season, in each location (4 seasonal averages x 481,631 locations).

While this step has drastically reduced the number of rows in the table, it is still a long table, meaning that there is duplicate information in the table rows.  In this case, there are four rows for each location (“coordinates” column), with each of these rows representing the seasonal average of the four precipitation variables.  We’ll next use the .pivot_table function to flip the table from long to wide, such that each row represents an individual location and the columns contain the corresponding season/precipitation variable information.  The index parameter specifies which column becomes the row index of the resulting DataFrame, which in this case is the “coordinates” column.  We specify “season” in the columns parameter, and the values parameter contains the four precipitation variables.  The resulting DataFrame now contains one row for each location and a total of 16 precipitation variables (4 variables x 4 seasons).

You might notice that the resulting DataFrame appears to have two header rows (the precipitation variable, with four seasons under it).  The pivot table step actually results in something called a MultiIndex DataFrame (e.g. hierarchical index), which means that it can have multiple levels.  In this example, the two levels are the precipitation variable and the season.  We had to use one more step to flatten the multiindex, which we achieved by converting the Multindex DataFrame to a NumPy array, then back to a single index DataFrame using the .to_records function.

As you can see, the column names contain parentheses and single quotes, and it is always a best practice to remove such special characters.  We can print all the column names using the .columns attribute, then pass a dict into the .rename function, where each key is the original column name mapped to each value as the new column name.

At this point, we have nearly everything we need to proceed with the next parts of our analysis.  Each row in the DataFrame contains 16 columns representing 30-year seasonal averages of four different precipitation variables (4 seasons x 4 variables), along with a “coordinates” column indicating one of the 481,631 locations in the dataset.  Obviously, this is the column that we’ll use to get this information on the map, but there are a few more simple yet crucial steps we have to do first.

The first step is to create two new columns representing the x- and y-coordinates of each location.  Recall earlier in this blog that we used the .str.split function to split a text string based on a separator or delimiter, and we’ll do the same thing again here to split the values in the “coordinates” column based on the space between them.  This results in two new individual columns, “x_coord” and “y_coord”.

With our x- and y-coordinates now in their own columns, we can use the ArcGIS API for Python to convert the DataFrame into a Spatially Enabled DataFrame (SeDF).  A SeDF is essentially a Pandas DataFrame with an additional SHAPE column representing the geometry of each row.  This means that it can be used non-spatially in traditional Pandas operations, but can also be displayed on a map and used in true spatial operations such as buffers, distance calculations, and more.

Here, we pass the DataFrame into the .from_xy method on the ArcGIS API for Python’s GeoAccessor class, specifying “x_coord” and “y_coord” as the x- and y-columns, respectively, as well as the appropriate coordinate system.

Note the addition of the "SHAPE" field to the DataFrame, and its "geometry" data type.

Last, we’ll use the ArcGIS API for Python’s .to_featureclass method to export the final, cleaned SeDF as a feature class in a geodatabase so we can use it in further analysis in ArcGIS Pro.

Final thoughts

At this point, we now have a feature class that represents a gridded point dataset covering the contiguous United States.  Each of the 481,631 grid points corresponds to the raster cell centroid of the original 4km by 4km PRISM precipitation dataset.

Example of the 4km by 4km gridded dataset overlaid on the average precipitation raster from 1981-2010.

Each of these points also contains 16 columns representing the 30-year seasonal averages of the four precipitation variables calculated in Blog #2 in the series.  Using ArcGIS Pro’s Data Engineering view, we can see and explore the summary statistics and distributions of each variable in the feature class.

ArcGIS Pro Data Engineering view. The 4km by 4km dataset contains 481,631 locations.


Or if you prefer… do it in R!

1. R packages

2. Combine all CSV files into one R data frame

3. Data engineering and wrangling

4. Export data for further analysis in ArcGIS using the R-ArcGIS Bridge

About the author

Nick Giner is a Product Manager for Spatial Analysis and Data Science. Prior to joining Esri in 2014, he completed Bachelor’s and PhD degrees in Geography from Penn State University and Clark University, respectively. In his spare time, he likes to play guitar, golf, cook, cut the grass, and read/watch shows about history.

Inline Feedbacks
View all comments

Next Article

Map in a minute: Map an earthquake using ArcGIS Online and ArcGIS Living Atlas

Read this article