{"id":2714642,"date":"2025-03-10T06:00:51","date_gmt":"2025-03-10T13:00:51","guid":{"rendered":"https:\/\/www.esri.com\/arcgis-blog\/?post_type=blog&#038;p=2714642"},"modified":"2025-10-06T10:49:57","modified_gmt":"2025-10-06T17:49:57","slug":"spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering","status":"publish","type":"blog","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering","title":{"rendered":"Spatial data science using ArcGIS Notebooks  Blog 3: Data validation and data engineering"},"author":154341,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","format":"standard","meta":{"_acf_changed":false,"_searchwp_excluded":""},"categories":[23341],"tags":[387782,555752,760452,24341,759592],"industry":[],"product":[36841,36561],"class_list":["post-2714642","blog","type-blog","status-publish","format-standard","hentry","category-analytics","tag-arcgis-api-for-python","tag-arcgis-notebooks","tag-data-engineering","tag-python","tag-spatial-data-science","product-api-python","product-arcgis-pro"],"acf":{"authors":[{"ID":154341,"user_firstname":"Nicholas","user_lastname":"Giner","nickname":"Nick Giner","user_nicename":"nginer","display_name":"Nicholas Giner","user_email":"NGiner@esri.com","user_url":"","user_registered":"2021-01-07 14:31:25","user_description":"Nick Giner is a Product Manager for Spatial Analysis and Data Science.  Prior to joining Esri in 2014, he completed Bachelor\u2019s and PhD degrees in Geography from Penn State University and Clark University, respectively. 
In his spare time, he likes to play guitar, golf, cook, cut the grass, and read\/watch shows about history.","user_avatar":"<img data-del=\"avatar\" src='https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2021\/01\/headshot-e1610030307989-213x200.jpeg' class='avatar pp-user-avatar avatar-96 photo ' height='96' width='96'\/>"},{"ID":358272,"user_firstname":"Halle","user_lastname":"Martinucci","nickname":"Halle Martinucci","user_nicename":"hmartinucci","display_name":"Halle Martinucci","user_email":"hmartinucci@esri.com","user_url":"","user_registered":"2024-06-18 18:43:42","user_description":"Halle is an Associate Product Marketing Manager on Esri\u2019s Spatial Analytics &amp; Data Science team.","user_avatar":"<img data-del=\"avatar\" src='https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/01\/halle-martinucci-3z7a7050-213x200.jpg' class='avatar pp-user-avatar avatar-96 photo ' height='96' width='96'\/>"}],"short_description":"Use an ArcGIS Pro Notebook for a complete spatial data science workflow to map gentrification in US cities.","flexible_content":[{"acf_fc_layout":"content","content":"<h2>Introduction<\/h2>\n<p>After all the data integration and wrangling steps in the <strong><a href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-2-data-fusion-wrangling-and-preparation\">previous blog<\/a><\/strong>, it\u2019s time to have a good hard look at the data and determine whether it is going to serve the purpose I need it for.\u00a0 I want to find out:<\/p>\n<ul>\n<li>Are all of my columns of the expected data type?<\/li>\n<li>Do any of my columns contain missing values?<\/li>\n<li>Are the columns named consistently?<\/li>\n<li>Are the columns named in a way that makes sense to me or others who will use the dataset?<\/li>\n<li>Are there any unnecessary or non-useful columns?<\/li>\n<li>Do any of the columns appear to have unusually high or low values for basic descriptive statistics 
such as minimum, average, or maximum?<\/li>\n<li>Do similar-themed columns from two different datasets have identical or very similar values (e.g. population for a given year from the US census vs. from the American Community Survey)?<\/li>\n<\/ul>\n"},{"acf_fc_layout":"content","content":"<h2>Data validation and data engineering steps<\/h2>\n<p><em><strong>Data validation<\/strong><\/em><\/p>\n<p>Many of these types of questions can be answered using just a few simple Pandas functions.\u00a0 For example, the <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.info.html\"><em>.info<\/em><\/a><\/strong> method prints out every column in a DataFrame, along with a count of the column\u2019s non-null values and data type.\u00a0 This is incredibly useful for determining which columns contain missing values.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2714672,"id":2714672,"title":"info1","filename":"info1.jpg","filesize":68300,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/info1","alt":"","author":"154341","description":"","caption":"","name":"info1","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 15:25:21","modified":"2025-03-04 
15:25:21","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":598,"height":440,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","medium-width":355,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","medium_large-width":598,"medium_large-height":440,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","large-width":598,"large-height":440,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","1536x1536-width":598,"1536x1536-height":440,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","2048x2048-width":598,"2048x2048-height":440,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","card_image-width":598,"card_image-height":440,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info1.jpg","wide_image-width":598,"wide_image-height":440}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"image","image":{"ID":2714682,"id":2714682,"title":"info2","filename":"info2.jpg","filesize":69993,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/info2","alt":"","author":"154341","description":"","caption":"The null values in the two highlighted columns will need to be addressed later in the analysis.","name":"info2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 15:25:38","modified":"2025-03-04 
15:26:23","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":571,"height":343,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","medium-width":434,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","medium_large-width":571,"medium_large-height":343,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","large-width":571,"large-height":343,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","1536x1536-width":571,"1536x1536-height":343,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","2048x2048-width":571,"2048x2048-height":343,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","card_image-width":571,"card_image-height":343,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/info2.jpg","wide_image-width":571,"wide_image-height":343}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Another small, yet important data cleaning task was converting all column names from capital letters to lowercase.\u00a0 Here, I used the <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.Series.str.lower.html\"><em>str.lower<\/em><\/a><\/strong> function within a list comprehension to convert all column names in the DataFrame from uppercase to 
lowercase.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2714702,"id":2714702,"title":"lowercase","filename":"lowercase.jpg","filesize":88482,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/lowercase","alt":"","author":"154341","description":"","caption":"","name":"lowercase","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 15:27:28","modified":"2025-03-04 15:27:28","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":617,"height":488,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase.jpg","medium-width":330,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase.jpg","medium_large-width":617,"medium_large-height":488,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase.jpg","large-width":617,"large-height":488,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase.jpg","1536x1536-width":617,"1536x1536-height":488,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase.jpg","2048x2048-width":617,"2048x2048-height":488,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase-588x465.jpg","card_image-width":588,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/lowercase.jpg","wide_image-width":617,"wide_image-height":488}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>There was one column, 
however, that did need to be converted back to all capital letters.\u00a0 The \u201cshape\u201d column is actually a <em>geometry<\/em> type, meaning that it contains a spatial geometry.\u00a0 For it to be understood as such by the SeDF, it needs to be named \u201cSHAPE\u201d.\u00a0 Here, we use the <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.rename.html\"><em>.rename<\/em><\/a><\/strong> function to convert this column to all uppercase.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2714712,"id":2714712,"title":"rename","filename":"rename.jpg","filesize":172996,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/rename-4","alt":"","author":"154341","description":"","caption":"","name":"rename-4","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 15:29:57","modified":"2025-03-04 
15:29:57","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":929,"height":583,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename.jpg","medium-width":416,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename.jpg","medium_large-width":768,"medium_large-height":482,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename.jpg","large-width":929,"large-height":583,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename.jpg","1536x1536-width":929,"1536x1536-height":583,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename.jpg","2048x2048-width":929,"2048x2048-height":583,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename-741x465.jpg","card_image-width":741,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/rename.jpg","wide_image-width":929,"wide_image-height":583}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Next, I wanted to learn a bit more about the distribution of values within each column. 
\u00a0The <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.describe.html\"><em>.describe<\/em><\/a><\/strong> method tells you which columns are missing values, and includes summary statistics for each column representing measures of centrality and dispersion.\u00a0 When I have many columns to look at, I often add the <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.transpose.html\"><em>.transpose<\/em><\/a><\/strong> method to flip the summary statistics such that the columns are listed as rows.\u00a0 Here is an example of the descriptive summary statistics for several of the education-related variables.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2714752,"id":2714752,"title":"describe_transpose","filename":"describe_transpose.jpg","filesize":98567,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/describe_transpose","alt":"","author":"154341","description":"","caption":"The two highlighted columns represent the population 25 years or older with a 4-year college degree in 2000 (b69ac2000) and 2022 (educationalattainment_acsbachdeg), respectively.","name":"describe_transpose","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 15:45:25","modified":"2025-03-04 
15:46:29","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":890,"height":378,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose.jpg","medium-width":464,"medium-height":197,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose.jpg","medium_large-width":768,"medium_large-height":326,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose.jpg","large-width":890,"large-height":378,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose.jpg","1536x1536-width":890,"1536x1536-height":378,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose.jpg","2048x2048-width":890,"2048x2048-height":378,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose-826x351.jpg","card_image-width":826,"card_image-height":351,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/describe_transpose.jpg","wide_image-width":890,"wide_image-height":378}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>After a bit of manual inspection, I determined that these missing values occurred for two main reasons:<\/p>\n<ol>\n<li>Some large census tracts in 2000 were subdivided into new, smaller census tracts in 2010<\/li>\n<li>Some small census tracts in 2000 were aggregated into new, larger census tracts in 2010<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p>Bottom line: not all census tracts persist between two decennial censuses, and boundaries change over time.\u00a0 Fortunately, the NHGIS has developed a methodology to crosswalk census 
data over time (e.g. 2000 to 2020) by standardizing data to the 2010 census boundaries.\u00a0 This allows for comparison of one year\u2019s census data for a different year\u2019s geographic boundaries, enabling time series or \u201clongitudinal\u201d analysis. More information about the methodology for standardizing the geographic boundaries can be found <strong><a href=\"https:\/\/www.nhgis.org\/geographic-crosswalks\">here<\/a><\/strong>.<\/p>\n<p><strong><em>Dealing with missing values (Imputation)<\/em><\/strong><\/p>\n<p>Truth be told, there are <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/space-time-pattern-mining\/learnmorefillmissingvalues.htm\"><strong>many strategies available<\/strong><\/a> for dealing with missing values in data.\u00a0 Depending on how many missing values are in a column, you might decide to drop the entire column or delete only the rows missing the values from the dataset.\u00a0 By choosing either of these options, however, you run the risk of removing valuable information from your analysis.<\/p>\n<p>If you choose not to remove data and want to replace it instead, you can fill missing values with global statistics of the column of interest, such as the mean\/median for numeric columns and mode for categorical columns.\u00a0 If you have time series data, you have an entirely different set of options for replacing missing values along the timeline, which relies on recent\/future values or even temporal trends to fill in the gaps.<\/p>\n<p>Because my three columns of interest that are missing data are numeric, I will first use each column\u2019s global statistics (mean and median) to replace its missing values.\u00a0 This is achieved using the pandas <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.fillna.html\"><em>.fillna<\/em><\/a><\/strong>, <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.mean.html\"><em>.mean<\/em><\/a><\/strong>, 
and <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.median.html\"><em>.median<\/em><\/a><\/strong> methods, respectively.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715062,"id":2715062,"title":"mean_fill","filename":"mean_fill-2.jpg","filesize":145519,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/mean_fill-3","alt":"","author":"154341","description":"","caption":"The highlighted code snippets show the three columns of interest that we filled with the mean value of the column.  ","name":"mean_fill-3","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 16:52:42","modified":"2025-03-04 16:52:48","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":755,"height":686,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2.jpg","medium-width":287,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2.jpg","medium_large-width":755,"medium_large-height":686,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2.jpg","large-width":755,"large-height":686,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2.jpg","1536x1536-width":755,"1536x1536-height":686,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2.jpg","2048x2048-width":755,"2048x2048-height":686,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2-512x465.jpg","card_image-width":512,"card_image-he
ight":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mean_fill-2.jpg","wide_image-width":755,"wide_image-height":686}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"image","image":{"ID":2715072,"id":2715072,"title":"median_fill","filename":"median_fill-2.jpg","filesize":154192,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/median_fill-3","alt":"","author":"154341","description":"","caption":"The highlighted code snippets show the three columns of interest that we filled with the median value of the column.  ","name":"median_fill-3","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 16:53:04","modified":"2025-03-04 16:53:10","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":785,"height":683,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2.jpg","medium-width":300,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2.jpg","medium_large-width":768,"medium_large-height":668,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2.jpg","large-width":785,"large-height":683,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2.jpg","1536x1536-width":785,"1536x1536-height":683,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2.jpg","2048x2048-width":785,"2048x2048-height":683,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/upload
s\/2025\/03\/median_fill-2-534x465.jpg","card_image-width":534,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/median_fill-2.jpg","wide_image-width":785,"wide_image-height":683}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>We can also have a look at the summary statistics for the three columns of interest and compare them to the summary statistics for their imputed versions (mean and median).<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715092,"id":2715092,"title":"filled_stats","filename":"filled_stats-1.jpg","filesize":141891,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/filled_stats-2","alt":"","author":"154341","description":"","caption":"","name":"filled_stats-2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 16:54:51","modified":"2025-03-04 
16:54:51","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":949,"height":498,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1.jpg","medium-width":464,"medium-height":243,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1.jpg","medium_large-width":768,"medium_large-height":403,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1.jpg","large-width":949,"large-height":498,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1.jpg","1536x1536-width":949,"1536x1536-height":498,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1.jpg","2048x2048-width":949,"2048x2048-height":498,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1-826x433.jpg","card_image-width":826,"card_image-height":433,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/filled_stats-1.jpg","wide_image-width":949,"wide_image-height":498}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>In the example above, we filled missing values in three of our data columns with <em>global statistics<\/em>, e.g. the mean and median of all the values in each column.\u00a0 Let\u2019s not forget that we are working with spatial data, however, so it may also be worth looking into filling our data columns with <em>local statistics<\/em>, e.g. 
the mean and median of <u>only the features that are spatial neighbors of the missing values<\/u>.\u00a0 To do this, we\u2019ll use the <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/space-time-pattern-mining\/fillmissingvalues.htm\"><strong>Fill Missing Values<\/strong><\/a> tool from the <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/space-time-pattern-mining\/an-overview-of-the-space-time-pattern-mining-toolbox.htm\"><strong>Space Time Pattern Mining toolbox<\/strong><\/a>.<\/p>\n<p>It is important to remember that when imputing or filling missing values in data, you are essentially estimating data that <em>you don\u2019t actually have<\/em>.\u00a0 As such, it is vitally important to think carefully about how many values you are filling, what imputation method you choose, how the filled results will be used in further analysis, and how you will communicate the fact that you chose to impute data in the first place.\u00a0 There are a number of best practices and things to consider when filling missing values, which you can read more about <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/space-time-pattern-mining\/learnmorefillmissingvalues.htm\"><strong>here<\/strong><\/a>.<\/p>\n<p>To use the <strong>Fill Missing Values<\/strong> geoprocessing tool, we need to convert our Spatially Enabled DataFrame to a feature class within the geodatabase using the ArcGIS API for Python\u2019s <strong><a href=\"https:\/\/developers.arcgis.com\/python\/guide\/part3-data-io-writing-data\/#write-to-a-local-file\"><em>.to_featureclass<\/em><\/a><\/strong> 
method.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715102,"id":2715102,"title":"to_featureclass","filename":"to_featureclass-1.jpg","filesize":28349,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/to_featureclass-2","alt":"","author":"154341","description":"","caption":"","name":"to_featureclass-2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 16:55:23","modified":"2025-03-04 16:55:23","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":721,"height":90,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1-213x90.jpg","thumbnail-width":213,"thumbnail-height":90,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","medium-width":464,"medium-height":58,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","medium_large-width":721,"medium_large-height":90,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","large-width":721,"large-height":90,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","1536x1536-width":721,"1536x1536-height":90,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","2048x2048-width":721,"2048x2048-height":90,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","card_image-width":721,"card_image-height":90,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/to_featureclass-1.jpg","wide_image-width":721,"wide_image-height":90}},"image_position":"center","orientation":"horizontal","hype
rlink":""},{"acf_fc_layout":"content","content":"<p>We\u2019ll then run the <strong>Fill Missing Values<\/strong> tool twice\u2014once to impute the local mean and once to impute the local median.\u00a0 Here is a bit more information on some of the important tool parameters:<\/p>\n<ul>\n<li><em>fields_to_fill \u2013 <\/em>the columns in your dataset containing missing values. In my case, it is three columns: \u201cb69ac2000\u201d (year 2000, age &gt; 24, college educated), \u201cab2aa2000\u201d (year 2000, median family income), \u201chp3001\u201d (year 2000, median gross rent).<\/li>\n<\/ul>\n<ul>\n<li><em>fill_method \u2013 <\/em>statistic that is applied to missing values. For spatial data, you can impute missing values based on spatial neighbors, time series values, or space-time neighbors (e.g. <em>local statistics<\/em>).\u00a0 For standalone tables, you can impute with <em>global statistics<\/em>, as I did in the previous section.\u00a0 In this workflow, I\u2019ve chosen to fill missing values with the local mean and local median, for comparison purposes.<\/li>\n<\/ul>\n<ul>\n<li><em>conceptualization_of_spatial_relationships \u2013<\/em> how you define which features become the spatial neighbors (e.g. the local neighborhood). These can be based on distance, contiguity\/adjacency, or manually defined via a spatial weights matrix.\u00a0 You can read more details about each of these different conceptualizations along with best practices <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/spatial-statistics\/modeling-spatial-relationships.htm\"><strong>here<\/strong><\/a>.\u00a0\u00a0 In this example, I am using <strong>Contiguity edges corners<\/strong> (e.g. 
Queen\u2019s case), which means that each missing polygon value will be filled with the mean or median of the neighboring polygons that it shares an edge or corner with.<\/li>\n<\/ul>\n<ul>\n<li><em>distance_band &#8211; <\/em>the cutoff distance for the Conceptualization of Spatial Relationships parameter&#8217;s <strong>Fixed distance.<\/strong> Not applicable in my case because I am using <strong>Contiguity edges corners.<\/strong><\/li>\n<\/ul>\n<ul>\n<li><em>number_of_spatial_neighbors \u2013 <\/em>the number of nearest neighbors included in the local calculations. For <strong>Fixed distance<\/strong>, <strong>Contiguity edges only<\/strong>, or <strong>Contiguity edges corners<\/strong>, this number is the minimum number of neighbors.<\/li>\n<\/ul>\n"},{"acf_fc_layout":"image","image":{"ID":2715122,"id":2715122,"title":"local_mean_fill","filename":"local_mean_fill-1.jpg","filesize":79531,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/local_mean_fill-2","alt":"","author":"154341","description":"","caption":"","name":"local_mean_fill-2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 16:59:09","modified":"2025-03-04 
16:59:09","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":696,"height":480,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1.jpg","medium-width":378,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1.jpg","medium_large-width":696,"medium_large-height":480,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1.jpg","large-width":696,"large-height":480,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1.jpg","1536x1536-width":696,"1536x1536-height":480,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1.jpg","2048x2048-width":696,"2048x2048-height":480,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1-674x465.jpg","card_image-width":674,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_mean_fill-1.jpg","wide_image-width":696,"wide_image-height":480}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"image","image":{"ID":2715132,"id":2715132,"title":"local_median_fill","filename":"local_median_fill-1.jpg","filesize":81365,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/local_median_fill-2","alt":"","author":"154341","description":"","caption":"Python syntax for the Fill Missing Values tool using local mean (top) and local 
median (bottom).","name":"local_median_fill-2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 16:59:26","modified":"2025-03-04 16:59:31","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":696,"height":477,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1.jpg","medium-width":381,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1.jpg","medium_large-width":696,"medium_large-height":477,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1.jpg","large-width":696,"large-height":477,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1.jpg","1536x1536-width":696,"1536x1536-height":477,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1.jpg","2048x2048-width":696,"2048x2048-height":477,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1-678x465.jpg","card_image-width":678,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_median_fill-1.jpg","wide_image-width":696,"wide_image-height":477}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"sidebar","content":"<p><strong>Note:<\/strong>\u00a0 At ArcGIS Pro 3.2, we added the <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/spatial-statistics\/introduction-to-neighborhood-explorer.htm\"><strong>Neighborhood Explorer<\/strong><\/a>, which is a new analysis experience that helps you explore and understand spatial relationships in your data.\u00a0 You can use it to 
interactively visualize neighborhoods in your data, which can help you when choosing your conceptualization of spatial relationships in many tools, including <strong>Fill Missing Values<\/strong>.\u00a0 Check out the <a href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/introducing-neighborhood-explorer-in-arcgis-pro\/\"><strong>blog<\/strong><\/a> and <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/spatial-statistics\/introduction-to-neighborhood-explorer.htm\"><strong>help documentation<\/strong><\/a> for more information!<\/p>\n","image_reference":false,"layout":"standard","image_reference_figure":"","snippet":"","spotlight_name":"","section_title":"","position":"Center","spotlight_image":false},{"acf_fc_layout":"content","content":"<p>Now that we have filled missing values in three variables with both their global and local mean and median, we will do a quick assessment of the results.<\/p>\n<p>First, I\u2019ll convert my feature class back to a Spatially Enabled DataFrame using the ArcGIS API for Python\u2019s\u00a0 <a href=\"https:\/\/developers.arcgis.com\/python\/guide\/part2-data-io-reading-data\/#read-in-local-gis-data\"><strong><em>.from_featureclass<\/em><\/strong><\/a> method.\u00a0 This allows me to take advantage of all the data manipulation functionality of Pandas.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715152,"id":2715152,"title":"sedf_filled","filename":"sedf_filled-1.jpg","filesize":112780,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/sedf_filled-2","alt":"","author":"154341","description":"","caption":"","name":"sedf_filled-2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 17:00:31","modified":"2025-03-04 
17:00:31","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":940,"height":690,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1.jpg","medium-width":356,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1.jpg","medium_large-width":768,"medium_large-height":564,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1.jpg","large-width":940,"large-height":690,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1.jpg","1536x1536-width":940,"1536x1536-height":690,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1.jpg","2048x2048-width":940,"2048x2048-height":690,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1-633x465.jpg","card_image-width":633,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_filled-1.jpg","wide_image-width":940,"wide_image-height":690}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Next, a best practice is to examine and compare the data distribution and summary statistics of the variables before and after filling the missing values.\u00a0 I\u2019ll use the <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.describe.html\"><em>.describe<\/em><\/a><\/strong> method to generate summary statistics of the three variables before and after imputation.\u00a0 Here, I\u2019ve also added the <strong><a 
href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.transpose.html\"><em>.transpose<\/em><\/a><\/strong> method to flip the summary statistics such that the columns are listed as rows.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715232,"id":2715232,"title":"sedf_stats","filename":"sedf_stats-1.jpg","filesize":200683,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/sedf_stats-2","alt":"","author":"154341","description":"","caption":"Summary statistics for the three variables (b69ac2000, ab2aa2000, hp3001) both before and after imputation with both global and local statistics.  The highlighted rows represent the original, unfilled variables and summary statistics.","name":"sedf_stats-2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 17:20:38","modified":"2025-03-04 
17:20:45","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":958,"height":724,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1.jpg","medium-width":345,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1.jpg","medium_large-width":768,"medium_large-height":580,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1.jpg","large-width":958,"large-height":724,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1.jpg","1536x1536-width":958,"1536x1536-height":724,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1.jpg","2048x2048-width":958,"2048x2048-height":724,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1-615x465.jpg","card_image-width":615,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_stats-1.jpg","wide_image-width":958,"wide_image-height":724}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>I can get this same information directly within ArcGIS Pro via the <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/help\/analysis\/geoprocessing\/data-engineering\/what-is-data-engineering.htm\"><strong>Data Engineering<\/strong><\/a> view, as well as in the tool messages of the <strong>Fill Missing Values<\/strong> tool.\u00a0 The other great thing about the Data Engineering view is that it gives you a chart preview based on the field data type.\u00a0 The example below shows the summary statistics and histograms for the median family income variable (ab2aa2000) 
before and after imputation.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715242,"id":2715242,"title":"DE_histos","filename":"DE_histos-1.jpg","filesize":72322,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/de_histos-2","alt":"","author":"154341","description":"","caption":"","name":"de_histos-2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 17:21:11","modified":"2025-03-04 17:21:11","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1186,"height":201,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1.jpg","medium-width":464,"medium-height":79,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1.jpg","medium_large-width":768,"medium_large-height":130,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1.jpg","large-width":1186,"large-height":201,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1.jpg","1536x1536-width":1186,"1536x1536-height":201,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1.jpg","2048x2048-width":1186,"2048x2048-height":201,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1-826x140.jpg","card_image-width":826,"card_image-height":140,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos-1.jpg","wide_image-width":1186,"wide_image-height":201}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"conten
t","content":"<p>In addition to exploring the summary statistics prior to and after imputation, I also wanted to explore the data visually on the map.<\/p>\n<p>The first graphic below shows the median family income variable for 2000 (ab2aa2000) displayed on the map and visualized as a histogram.\u00a0 The pink census tracts contain missing values for this variable.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715372,"id":2715372,"title":"income_missing_map","filename":"income_missing_map.jpg","filesize":294535,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/income_missing_map","alt":"","author":"154341","description":"","caption":"","name":"income_missing_map","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 18:23:37","modified":"2025-03-04 18:23:37","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1847,"height":707,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map.jpg","medium-width":464,"medium-height":178,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map.jpg","medium_large-width":768,"medium_large-height":294,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map.jpg","large-width":1847,"large-height":707,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map-1536x588.jpg","1536x1536-width":1536,"1536x1536-height":588,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map.jpg"
,"2048x2048-width":1847,"2048x2048-height":707,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map-826x316.jpg","card_image-width":826,"card_image-height":316,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/income_missing_map.jpg","wide_image-width":1847,"wide_image-height":707}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>The following two graphics show the same variable under the four different imputation scenarios:\u00a0 Imputation with the global mean, global median, local mean, and local median.\u00a0 The dark black outlined census tracts contained missing values and have now been filled.\u00a0 The histograms show the distribution of the now-filled dataset, with the black highlighted bars corresponding to the black outlined polygons on the map.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715382,"id":2715382,"title":"global_filled_maps","filename":"global_filled_maps.jpg","filesize":491260,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/global_filled_maps","alt":"","author":"154341","description":"","caption":"","name":"global_filled_maps","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 18:26:59","modified":"2025-03-04 
18:26:59","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1633,"height":1029,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps.jpg","medium-width":414,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps.jpg","medium_large-width":768,"medium_large-height":484,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps.jpg","large-width":1633,"large-height":1029,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps-1536x968.jpg","1536x1536-width":1536,"1536x1536-height":968,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps.jpg","2048x2048-width":1633,"2048x2048-height":1029,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps-738x465.jpg","card_image-width":738,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/global_filled_maps.jpg","wide_image-width":1633,"wide_image-height":1029}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"image","image":{"ID":2715392,"id":2715392,"title":"local_filled_maps","filename":"local_filled_maps.jpg","filesize":485265,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/local_filled_maps","alt":"","author":"154341","description":"","caption":"","name":"local_filled_maps","status":"inherit","uploaded_to":2714642,"date":"202
5-03-04 18:27:24","modified":"2025-03-04 18:27:24","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1633,"height":1031,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps.jpg","medium-width":413,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps.jpg","medium_large-width":768,"medium_large-height":485,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps.jpg","large-width":1633,"large-height":1031,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps-1536x970.jpg","1536x1536-width":1536,"1536x1536-height":970,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps.jpg","2048x2048-width":1633,"2048x2048-height":1031,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps-737x465.jpg","card_image-width":737,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/local_filled_maps.jpg","wide_image-width":1633,"wide_image-height":1031}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>As mentioned earlier, there is no hard and fast rule for which method to choose when filling missing values, but there are a number of <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/space-time-pattern-mining\/learnmorefillmissingvalues.htm#ESRI_SECTION1_3B1EA53065C845529A4EFCABD683F783\"><strong>best practices<\/strong><\/a> to consider.\u00a0 After mapping my variable of interest before and after imputation, I\u2019m confident that there is no 
distinct spatial pattern in the location of the census tracts with missing values, or in the median family income at those locations (i.e. both are relatively random).<\/p>\n<p>Looking at the maps, summary statistics, and histograms, it appears that the local imputation strategies produced descriptive statistics and data distributions most similar to those of the original dataset.\u00a0 For example, filling missing values with the global mean or median noticeably pulls many census tracts into the $40,000-$50,000 median income range, while the local imputation methods produce histogram bin counts much closer to those of the original data.<\/p>\n<p>For all of the above reasons, and based on my knowledge of the dataset and problem, I decided to proceed with the local mean imputation strategy.<\/p>\n"},{"acf_fc_layout":"sidebar","content":"<p><strong>Note:<\/strong> For the median family income variable in 2000, there are three census tracts that still contain missing values after filling with the local mean.\u00a0 After a quick examination, I recognized that these three census tracts had no median family income value in 2000.\u00a0 These will be removed from the analysis in a subsequent step.<\/p>\n","image_reference":false,"layout":"standard","image_reference_figure":"","snippet":"","spotlight_name":"","section_title":"","position":"Center","spotlight_image":false},{"acf_fc_layout":"content","content":"<p><strong><em>Calculate the five gentrification indicators for the years 2000 and 2020<\/em><\/strong><\/p>\n<p>The next step in the data engineering process is to create the five gentrification indicators for the years 2000 and 2020.\u00a0 The socio-demographic composition, education level, and age indicators were created by normalizing the raw count of each individual variable by the total count of that variable for each year.\u00a0 The median family income and median gross rent variables were used as 
is.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715432,"id":2715432,"title":"variable_calculations","filename":"variable_calculations.jpg","filesize":153139,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/variable_calculations","alt":"","author":"154341","description":"","caption":"Example calculation of the college-educated indicator.  For each year (2000 and 2020), we divide the population that is older than 24 years with at least a 4-year college degree in a census tract by the total population in that census tract.  Note: the \u201ccollege-educated\u201d variable was one of the three variables missing data in 2000.  This is why we are using the imputed version in the calculations.","name":"variable_calculations","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 18:37:53","modified":"2025-03-04 
18:38:16","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1144,"height":680,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations.jpg","medium-width":439,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations.jpg","medium_large-width":768,"medium_large-height":457,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations.jpg","large-width":1144,"large-height":680,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations.jpg","1536x1536-width":1144,"1536x1536-height":680,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations.jpg","2048x2048-width":1144,"2048x2048-height":680,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations-782x465.jpg","card_image-width":782,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations.jpg","wide_image-width":1144,"wide_image-height":680}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p><strong><em>Calculate relative change in the five gentrification indicators between 2000 and 2020<\/em><\/strong><\/p>\n<p>Gentrification is a process that causes change in a location over time, so the final data engineering step using Pandas in the ArcGIS Notebook was to calculate the relative change in each of the five gentrification indicators between 2000 and 2020.\u00a0 For this, we used the simple relative change calculation:<\/p>\n<pre>(Value at time 2 \/ Value at time 1) \u2013 
1<\/pre>\n"},{"acf_fc_layout":"image","image":{"ID":2715442,"id":2715442,"title":"variable_calculations_v2","filename":"variable_calculations_v2.jpg","filesize":179726,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/variable_calculations_v2","alt":"","author":"154341","description":"","caption":"Example calculation of change in college-educated population between 2000 and 2020, and the resulting columns.","name":"variable_calculations_v2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 18:40:29","modified":"2025-03-04 18:41:13","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1168,"height":690,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2.jpg","medium-width":442,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2.jpg","medium_large-width":768,"medium_large-height":454,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2.jpg","large-width":1168,"large-height":690,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2.jpg","1536x1536-width":1168,"1536x1536-height":690,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2.jpg","2048x2048-width":1168,"2048x2048-height":690,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2-787x465.jpg","card_image-width":787,"card_image-he
ight":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/variable_calculations_v2.jpg","wide_image-width":1168,"wide_image-height":690}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p><strong><em>Data validation checks<\/em><\/strong><\/p>\n<p>You may have noticed that there are \u201cNaN\u201d and \u201cinf\u201d values in several of the new columns that represent the relative change in the gentrification indicators between 2000 and 2020.\u00a0\u00a0 \u201cNaN\u201d stands for \u201cNot a Number\u201d and often represents an undefined or missing value.\u00a0 In our case, NaN values in the gentrification change columns result from NaN (missing) values in the 2000 column, the 2020 column, or both, for the given gentrification indicator.<\/p>\n<p>\u201cinf\u201d stands for \u201cpositive infinity\u201d.\u00a0 In our dataset, we get inf values in the gentrification change columns when a census tract has a positive value in 2020 but a value of zero in 2000.\u00a0 In other words, the &#8220;inf&#8221; values are the result of dividing by zero, which Pandas returns as infinity rather than raising an error.<\/p>\n<p>We can use the Pandas <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.isnull.html\"><em>.isnull<\/em><\/a><\/strong> function to identify the missing values in each gentrification change column, and then use the <strong><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.isin.html\"><em>.isin<\/em><\/a><\/strong> method to find every row in a column that contains a certain value.\u00a0 In our case, we\u2019ll pass in the <a href=\"https:\/\/numpy.org\/devdocs\/reference\/constants.html#numpy.inf\"><em><strong>np.inf<\/strong> <\/em>and <em><strong>-np.inf<\/strong><\/em><\/a> constants to find positive and negative infinity values, 
respectively.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715482,"id":2715482,"title":"nulls_infs","filename":"nulls_infs.jpg","filesize":144472,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/nulls_infs","alt":"","author":"154341","description":"","caption":"Locating null values (red boxes) and infinity values (blue boxes) in one of the gentrification change columns.  Note that \"N\/A\" values occur when you divide a value in 2020 by \"N\/A\" and \"inf\" values occur when you divide a value in 2020 by zero.","name":"nulls_infs","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 18:54:54","modified":"2025-03-07 14:54:43","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1154,"height":638,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs.jpg","medium-width":464,"medium-height":257,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs.jpg","medium_large-width":768,"medium_large-height":425,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs.jpg","large-width":1154,"large-height":638,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs.jpg","1536x1536-width":1154,"1536x1536-height":638,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs.jpg","2048x2048-width":1154,"2048x2048-height":638,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs-826x457.jpg","card_image-width":826,"card_image-height":457,"wide
_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nulls_infs.jpg","wide_image-width":1154,"wide_image-height":638}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Below is a table showing the number of \u201cNaN\u201d and \u201cinf\u201d values for each of the five gentrification change indicators.\u00a0 Don\u2019t worry too much about these right now.\u00a0 There will be several steps later in the workflow to remove these from the analysis, but for now, the most important thing is that we have a good understanding of <em>why<\/em> these values are occurring.<\/p>\n"},{"acf_fc_layout":"content","content":"<table style=\"height: 569px\" width=\"989\">\n<tbody>\n<tr>\n<td width=\"242\"><strong>Gentrification indicator<\/strong><\/td>\n<td width=\"173\"><strong>NaN \/ NA (missing values)<\/strong><\/td>\n<td width=\"138\"><strong>inf (infinity values)<\/strong><\/td>\n<td width=\"85\"><strong>Total<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"242\">Socio-demographic composition, e.g. 
Non-Hispanic White population (<em>chg_nonhispwhite_2000<\/em>)<\/td>\n<td width=\"173\">25<\/td>\n<td width=\"138\">2<\/td>\n<td width=\"85\">27<\/td>\n<\/tr>\n<tr>\n<td width=\"242\">College-educated population<\/p>\n<p>(<em>chg_college_ed_2000_local_mean<\/em>)<\/td>\n<td width=\"173\">36<\/td>\n<td width=\"138\">15<\/td>\n<td width=\"85\">51<\/td>\n<\/tr>\n<tr>\n<td width=\"242\">Population aged 20-34<\/p>\n<p>(<em>chg_twenties_thirties_2000<\/em>)<\/td>\n<td width=\"173\">24<\/td>\n<td width=\"138\">3<\/td>\n<td width=\"85\">27<\/td>\n<\/tr>\n<tr>\n<td width=\"242\">Median family income<\/p>\n<p>(<em>chg_fam_income_2000_local_mean<\/em>)<\/td>\n<td width=\"173\">37<\/td>\n<td width=\"138\">10<\/td>\n<td width=\"85\">47<\/td>\n<\/tr>\n<tr>\n<td width=\"242\">Median gross rent<\/p>\n<p>(<em>chg_rent_2000_local_mean<\/em>)<\/td>\n<td width=\"173\">13<\/td>\n<td width=\"138\">0<\/td>\n<td width=\"85\">13<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n"},{"acf_fc_layout":"content","content":"<p><strong><em>Write out cleaned feature class to use in ArcGIS Pro<\/em><\/strong><\/p>\n<p>Now that we\u2019ve finished exploring, cleaning, and engineering new variables in our data using Pandas, we\u2019ll use the ArcGIS API for Python to convert our Spatially Enabled DataFrame to a feature class within the geodatabase.<\/p>\n<p>We first use the <strong><a href=\"https:\/\/developers.arcgis.com\/python\/latest\/api-reference\/arcgis.features.toc.html#arcgis.features.GeoAccessor.from_df\"><em>.from_df<\/em><\/a><\/strong> method on the ArcGIS API for Python\u2019s <a href=\"https:\/\/developers.arcgis.com\/python\/latest\/api-reference\/arcgis.features.toc.html#geoaccessor\"><strong>GeoAccessor<\/strong><\/a> class, passing the \u201cSHAPE\u201d field into the <em>geometry_column<\/em> parameter.\u00a0 This ensures that we are working with a true Spatially Enabled DataFrame, not a regular Pandas 
DataFrame.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715532,"id":2715532,"title":"sedf_to_fc","filename":"sedf_to_fc.jpg","filesize":18560,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/sedf_to_fc-3","alt":"","author":"154341","description":"","caption":"","name":"sedf_to_fc-3","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 18:59:17","modified":"2025-03-04 18:59:17","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":707,"height":68,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc-213x68.jpg","thumbnail-width":213,"thumbnail-height":68,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","medium-width":464,"medium-height":45,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","medium_large-width":707,"medium_large-height":68,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","large-width":707,"large-height":68,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","1536x1536-width":707,"1536x1536-height":68,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","2048x2048-width":707,"2048x2048-height":68,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","card_image-width":707,"card_image-height":68,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/sedf_to_fc.jpg","wide_image-width":707,"wide_image-height":68}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>We then use the <strong><a 
href=\"https:\/\/developers.arcgis.com\/python\/guide\/part3-data-io-writing-data\/#write-to-a-local-file\"><em>.to_featureclass<\/em><\/a><\/strong>\u00a0method to export the final, cleaned SeDF as a feature class in a geodatabase so we can use it in further analysis in ArcGIS Pro.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715562,"id":2715562,"title":"clean_fc","filename":"clean_fc.jpg","filesize":23209,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/clean_fc","alt":"","author":"154341","description":"","caption":"","name":"clean_fc","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:00:31","modified":"2025-03-04 19:00:31","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":766,"height":89,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc-213x89.jpg","thumbnail-width":213,"thumbnail-height":89,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","medium-width":464,"medium-height":54,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","medium_large-width":766,"medium_large-height":89,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","large-width":766,"large-height":89,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","1536x1536-width":766,"1536x1536-height":89,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","2048x2048-width":766,"2048x2048-height":89,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","card_image-width":766,"card_image-height":89,"wide_image":"https:\/\/www.esri.c
om\/arcgis-blog\/app\/uploads\/2025\/03\/clean_fc.jpg","wide_image-width":766,"wide_image-height":89}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>This feature class contains the columns representing the five gentrification change indicators:<\/p>\n<ul>\n<li><em>chg_nonhispwhite_2000<\/em>: Change in socio-demographic composition (non-Hispanic white population) between 2000 and 2020<\/li>\n<li><em>chg_college_ed_2000_local_mean<\/em>: Change in college-educated population between 2000 and 2020<\/li>\n<li><em>chg_twenties_thirties_2000<\/em>: Change in population aged 20-34 between 2000 and 2020<\/li>\n<li><em>chg_fam_income_2000_local_mean<\/em>: Change in median family income between 2000 and 2020<\/li>\n<li><em>chg_rent_2000_local_mean<\/em>: Change in median gross rent between 2000 and 2020<\/li>\n<\/ul>\n"},{"acf_fc_layout":"image","image":{"ID":2715582,"id":2715582,"title":"five_indicators","filename":"five_indicators.jpg","filesize":78744,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/five_indicators","alt":"","author":"154341","description":"","caption":"","name":"five_indicators","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:01:46","modified":"2025-03-04 
19:01:46","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1079,"height":220,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators.jpg","medium-width":464,"medium-height":95,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators.jpg","medium_large-width":768,"medium_large-height":157,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators.jpg","large-width":1079,"large-height":220,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators.jpg","1536x1536-width":1079,"1536x1536-height":220,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators.jpg","2048x2048-width":1079,"2048x2048-height":220,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators-826x168.jpg","card_image-width":826,"card_image-height":168,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/five_indicators.jpg","wide_image-width":1079,"wide_image-height":220}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p><strong><em>Final data cleaning steps<\/em><\/strong><\/p>\n<p>Recall from the previous section that several of the gentrification indicator columns contain \u201cnull\u201d or \u201cinf\u201d values.\u00a0 In ArcGIS Pro, these are combined and treated as &lt;Null&gt; values.<\/p>\n<p>To explore these values further, I used the <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/data-management\/select-layer-by-attribute.htm\"><strong>Select Layer By Attribute<\/strong><\/a> tool to run a simple query that selected all 
census tracts from any of the five gentrification indicators that contained a null value.\u00a0 In my case, I was able to perform this multiple-value selection using the OR query operator.\u00a0 This initial query returned 62\/2,164 census tracts with null values <em>in at least one<\/em> of the five indicators.\u00a0 Generally speaking, we can interpret these null values as resulting from the presence of missing values in either or both of the 2000 or 2020 columns for the given gentrification indicator.<\/p>\n<p>In the original paper, the authors also removed any census tract labeled as \u201cpark-cemetery\u201d, which was easy for us to replicate because we spatially joined the New York City neighborhood boundaries to our census tracts in one of the early steps in this analysis.<\/p>\n<p>The graphic below shows the full attribute query (via the tool interface and SQL) selecting any null value for a gentrification indicator, plus any census tract that is considered a park or cemetery.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715622,"id":2715622,"title":"final_query","filename":"final_query.jpg","filesize":125194,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/final_query","alt":"","author":"154341","description":"","caption":"","name":"final_query","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:18:41","modified":"2025-03-04 
19:18:41","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1237,"height":666,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query.jpg","medium-width":464,"medium-height":250,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query.jpg","medium_large-width":768,"medium_large-height":413,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query.jpg","large-width":1237,"large-height":666,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query.jpg","1536x1536-width":1237,"1536x1536-height":666,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query.jpg","2048x2048-width":1237,"2048x2048-height":666,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query-826x445.jpg","card_image-width":826,"card_image-height":445,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_query.jpg","wide_image-width":1237,"wide_image-height":666}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Switching the selection and exporting the feature class resulted in a dataset of 2,093 census 
tracts.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715632,"id":2715632,"title":"nyc_finaltracts_withaerial","filename":"nyc_finaltracts_withaerial.jpg","filesize":215767,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/nyc_finaltracts_withaerial","alt":"","author":"154341","description":"","caption":"Maps showing the census tract dataset.  On the left map, census tracts outlined in black and\/or filled in with green (e.g. parks, cemeteries) will be removed from the analysis.  The right map shows the final, cleaned census tracts (n=2,093) overlaid on aerial imagery.","name":"nyc_finaltracts_withaerial","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:20:14","modified":"2025-03-04 19:20:36","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1238,"height":522,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial.jpg","medium-width":464,"medium-height":196,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial.jpg","medium_large-width":768,"medium_large-height":324,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial.jpg","large-width":1238,"large-height":522,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial.jpg","1536x1536-width":1238,"1536x1536-height":522,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial.jpg","2048x2048-
width":1238,"2048x2048-height":522,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial-826x348.jpg","card_image-width":826,"card_image-height":348,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/nyc_finaltracts_withaerial.jpg","wide_image-width":1238,"wide_image-height":522}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"image","image":{"ID":2715652,"id":2715652,"title":"cleaned_tract_zooms","filename":"cleaned_tract_zooms.jpg","filesize":483291,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/cleaned_tract_zooms","alt":"","author":"154341","description":"","caption":"Zoomed examples of census tracts that were removed from the analysis.  A) JFK Airport, Floyd Bennett Field, Jamaica Bay Wildlife Refuge, B) Highland Park Cemetery, Forest Park Golf Course, C) LaGuardia Airport, Rikers Island, Willets Point, D) Central Park, Randall\u2019s Island, Roosevelt Island.","name":"cleaned_tract_zooms","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:20:57","modified":"2025-03-04 
19:21:23","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1461,"height":1038,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms.jpg","medium-width":367,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms.jpg","medium_large-width":768,"medium_large-height":546,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms.jpg","large-width":1461,"large-height":1038,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms.jpg","1536x1536-width":1461,"1536x1536-height":1038,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms.jpg","2048x2048-width":1461,"2048x2048-height":1038,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms-654x465.jpg","card_image-width":654,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/cleaned_tract_zooms.jpg","wide_image-width":1461,"wide_image-height":1038}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p><strong><em>Data exploration <\/em><\/strong><\/p>\n<p>Before we complete the final analysis, there are a few more data preparation\/engineering steps to do.\u00a0 When we open the Data Engineering view of our cleaned census tract feature class, we see that all five of the gentrification change indicators are right-skewed (positively skewed) to some degree.\u00a0 Positive skew occurs when there is a relatively small number of large values in a data distribution and is characterized by a longer tail to the 
right of the histogram.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715692,"id":2715692,"title":"DE_histos_clean","filename":"DE_histos_clean.jpg","filesize":69851,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/de_histos_clean","alt":"","author":"154341","description":"","caption":"","name":"de_histos_clean","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:25:42","modified":"2025-03-04 19:25:42","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":948,"height":218,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean.jpg","medium-width":464,"medium-height":107,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean.jpg","medium_large-width":768,"medium_large-height":177,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean.jpg","large-width":948,"large-height":218,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean.jpg","1536x1536-width":948,"1536x1536-height":218,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean.jpg","2048x2048-width":948,"2048x2048-height":218,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean-826x190.jpg","card_image-width":826,"card_image-height":190,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_histos_clean.jpg","wide_image-width":948,"wide_image-height":218}},"image_position":"center","orientation":"horizon
tal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Scrolling to the right of the Data Engineering view also gives us <strong><a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/help\/analysis\/geoprocessing\/data-engineering\/view-statistics.htm#ESRI_SECTION1_5308FA9653A84AB3940B7F0A83DB75B9\">additional statistics<\/a><\/strong> that help us quantify the shape of a data distribution.\u00a0 Skewness measures how symmetrical the data distribution is (a \u201cnormal\u201d distribution, for example, is symmetrical), and kurtosis measures the heaviness of the distribution\u2019s tails.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715702,"id":2715702,"title":"skewness_kurtosis","filename":"skewness_kurtosis.jpg","filesize":60081,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/skewness_kurtosis","alt":"","author":"154341","description":"","caption":"","name":"skewness_kurtosis","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:26:48","modified":"2025-03-04 
19:26:48","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":948,"height":216,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis.jpg","medium-width":464,"medium-height":106,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis.jpg","medium_large-width":768,"medium_large-height":175,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis.jpg","large-width":948,"large-height":216,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis.jpg","1536x1536-width":948,"1536x1536-height":216,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis.jpg","2048x2048-width":948,"2048x2048-height":216,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis-826x188.jpg","card_image-width":826,"card_image-height":188,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/skewness_kurtosis.jpg","wide_image-width":948,"wide_image-height":216}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Skewness values close to zero suggest that a data distribution is roughly symmetrical, as a normal distribution is, while values below -1 or above +1 indicate pronounced negative (left) or positive (right) skew, respectively.\u00a0 Kurtosis values less than 3 describe a data distribution with a flatter peak and lighter tails (fewer extreme values) than a normal distribution, while values greater than 3 suggest heavier tails with more extreme values.<\/p>\n<p>The shape of the histograms combined with the skewness and kurtosis values for each of the five 
gentrification change indicators confirm that they are each positively skewed.\u00a0 This aligns with what the authors reported in the original paper, so our next step in replicating their analysis is data transformation.<\/p>\n<p>To accomplish this, we\u2019ll use the <strong><a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/data-management\/transform-field.htm\">Transform Field<\/a><\/strong> tool.\u00a0 In the tool syntax, we pass in the input feature class and specify the attribute fields that represent the five skewed gentrification change indicators.\u00a0 The <em>method<\/em> parameter allows you to choose from several different transformation methods (e.g. Square root, Log, Box-Cox, etc.).\u00a0 The <em>shift<\/em> parameter allows you to input a constant value that is used to adjust the entire data distribution to the right (positive) or left (negative) and is often used when a data variable contains either zero or negative values.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715742,"id":2715742,"title":"transform_field","filename":"transform_field.jpg","filesize":43313,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/transform_field","alt":"","author":"154341","description":"","caption":"","name":"transform_field","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:29:43","modified":"2025-03-04 
19:29:43","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1169,"height":219,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field.jpg","medium-width":464,"medium-height":87,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field.jpg","medium_large-width":768,"medium_large-height":144,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field.jpg","large-width":1169,"large-height":219,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field.jpg","1536x1536-width":1169,"1536x1536-height":219,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field.jpg","2048x2048-width":1169,"2048x2048-height":219,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field-826x155.jpg","card_image-width":826,"card_image-height":155,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/transform_field.jpg","wide_image-width":1169,"wide_image-height":219}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Following the steps in the paper, we chose a log transformation with a shift of 1.01 to account for any zero or negative rate changes in each of the five gentrification change indicators.<\/p>\n<p>The resulting changes are shown in the Data Engineering view 
below.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715762,"id":2715762,"title":"DE_transforms","filename":"DE_transforms.jpg","filesize":114355,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/de_transforms","alt":"","author":"154341","description":"","caption":"","name":"de_transforms","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:31:21","modified":"2025-03-04 19:31:21","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1083,"height":377,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms.jpg","medium-width":464,"medium-height":162,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms.jpg","medium_large-width":768,"medium_large-height":267,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms.jpg","large-width":1083,"large-height":377,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms.jpg","1536x1536-width":1083,"1536x1536-height":377,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms.jpg","2048x2048-width":1083,"2048x2048-height":377,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms-826x288.jpg","card_image-width":826,"card_image-height":288,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/DE_transforms.jpg","wide_image-width":1083,"wide_image-height":377}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout"
:"content","content":"<p>Here, we can see that applying the log transformation improved the shape of two of the variables (<em>chg_nonhispwhite_2000<\/em>, <em>chg_college_ed_2000_local_mean<\/em>) but actually caused the other three to become negatively (left) skewed.<\/p>\n<p>We can dig into this a little more by creating a <strong><a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/help\/analysis\/geoprocessing\/charts\/scatter-plot-matrix.htm\">scatter plot matrix<\/a><\/strong> containing the five gentrification indicator variables.\u00a0 This not only lets us visualize the strength and direction of the relationships among the indicators, but is also a useful way to identify outliers.<\/p>\n<p>In the graphic below, the left scatter plot matrix shows the cleaned feature class containing 2,093 census tracts.\u00a0 Note the red box, which shows the presence of outliers in the scatter plot.\u00a0 The right scatter plot matrix shows the scatter plots and associated Pearson\u2019s correlation coefficients (<em>r<\/em>) after removing the five selected outliers.\u00a0 The scatter plot matrix also shows that nearly all of the correlation coefficients between pairs of variables are around 0.5 or lower, which means that there is little redundancy among our gentrification indicators.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715802,"id":2715802,"title":"outlier_scatterplots","filename":"outlier_scatterplots.jpg","filesize":89245,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/outlier_scatterplots","alt":"","author":"154341","description":"","caption":"Scatter plots showing the relationships between the five gentrification change indicators and corresponding Pearson\u2019s correlation coefficients (r) prior to and following outlier 
removal.  The preview scatter plot in the top right corner corresponds to the highlighted box on the scatter plot matrix, which in this case is the log of change in non-Hispanic white population (x) vs. the log of change in college-educated population (y) between 2000 and 2020.","name":"outlier_scatterplots","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:34:07","modified":"2025-03-07 15:18:17","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1284,"height":544,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots.jpg","medium-width":464,"medium-height":197,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots.jpg","medium_large-width":768,"medium_large-height":325,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots.jpg","large-width":1284,"large-height":544,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots.jpg","1536x1536-width":1284,"1536x1536-height":544,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots.jpg","2048x2048-width":1284,"2048x2048-height":544,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots-826x350.jpg","card_image-width":826,"card_image-height":350,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/outlier_scatterplots.jpg","wide_image-width":1284,"wide_image-height":544}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>After a bit more experimentation and exploration, I made the decision to remove these five additional 
outliers, which gave me a final dataset of 2,088 census tracts.\u00a0 Based on the distribution of each of the variables, the final five chosen for the analysis are the logarithms of <em>chg_nonhispwhite_2000<\/em>, <em>chg_college_ed_2000_local_mean<\/em>, and <em>chg_twenties_thirties_2000<\/em>, and the non-transformed versions of <em>chg_fam_income_2000_local_mean<\/em> and <em>chg_rent_2000_local_mean<\/em>.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2715822,"id":2715822,"title":"final_variables_v2","filename":"final_variables_v2.jpg","filesize":96500,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/final_variables_v2","alt":"","author":"154341","description":"","caption":"","name":"final_variables_v2","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:37:18","modified":"2025-03-04 
19:37:18","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":898,"height":429,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2.jpg","medium-width":464,"medium-height":222,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2.jpg","medium_large-width":768,"medium_large-height":367,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2.jpg","large-width":898,"large-height":429,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2.jpg","1536x1536-width":898,"1536x1536-height":429,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2.jpg","2048x2048-width":898,"2048x2048-height":429,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2-826x395.jpg","card_image-width":826,"card_image-height":395,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/final_variables_v2.jpg","wide_image-width":898,"wide_image-height":429}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"image","image":{"ID":2715832,"id":2715832,"title":"mapped_change_indicators","filename":"mapped_change_indicators.jpg","filesize":568616,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators.jpg","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\/mapped_change_indicators","alt":"","author":"154341","description":"","caption":"Maps showing the raw relative change (non-transformed) in the five 
gentrification indicators between 2000 and 2020.  Note that all five indicators generally show an increase over time, which is consistent with what we would expect as part of the gentrification process.","name":"mapped_change_indicators","status":"inherit","uploaded_to":2714642,"date":"2025-03-04 19:37:55","modified":"2025-03-04 19:38:15","menu_order":0,"mime_type":"image\/jpeg","type":"image","subtype":"jpeg","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1716,"height":1076,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators-213x200.jpg","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators.jpg","medium-width":416,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators.jpg","medium_large-width":768,"medium_large-height":482,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators.jpg","large-width":1716,"large-height":1076,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators-1536x963.jpg","1536x1536-width":1536,"1536x1536-height":963,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators.jpg","2048x2048-width":1716,"2048x2048-height":1076,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators-742x465.jpg","card_image-width":742,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/mapped_change_indicators.jpg","wide_image-width":1716,"wide_image-height":1076}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>To close out this blog, I do want to reiterate\u00a0that more often than not, there is no concrete set of steps for doing these data 
exploration and engineering tasks.\u00a0 There are certainly some <strong><a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/help\/analysis\/geoprocessing\/data-engineering\/view-statistics.htm#ESRI_SECTION1_5308FA9653A84AB3940B7F0A83DB75B9\">best practices and rules of thumb<\/a><\/strong> for filling missing values, doing data transformations, detecting\/removing outliers, etc., but these decisions <em><u>will almost always<\/u><\/em> depend on your analysis goal, your chosen methods, and the characteristics of your data.\u00a0 In this sense, much of this process is <em>as much art as it is science<\/em>!<\/p>\n<p>In the <strong><a href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-4-creating-a-gentrification-index-using-principal-components-analysis-pca\">next blog<\/a><\/strong>, we\u2019ll finally be able to use all of this good data to actually map gentrification across New York City!<\/p>\n"},{"acf_fc_layout":"sidebar","content":"<h2>Spatial data science in ArcGIS Pro Notebooks<\/h2>\n<p>Here are the links to all the blogs in the series:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks\"><strong>Part 1.<\/strong><\/a> Mapping gentrification in US Cities<\/li>\n<li><strong><a href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-2-data-fusion-wrangling-and-preparation\">Part 2.<\/a><\/strong> Data fusion, wrangling, and preparation<\/li>\n<li><strong><a href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\">Part 3.<\/a><\/strong> Data validation and data engineering<\/li>\n<li><strong><a 
href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-4-creating-a-gentrification-index-using-principal-components-analysis-pca\">Part 4.<\/a><\/strong> Creating a gentrification index using Principal Components Analysis (PCA)<\/li>\n<li>Part 5. (Coming soon!) Creating a gentrification index using Calculate Composite Index<\/li>\n<\/ul>\n","image_reference":false,"layout":"standard","image_reference_figure":"","snippet":"","spotlight_name":"","section_title":"","position":"Center","spotlight_image":false}],"related_articles":"","show_article_image":false,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/blog-3-card.png","wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/blog-3-wide.png"},"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Data Validation and Data Engineering using ArcGIS Notebooks<\/title>\n<meta name=\"description\" content=\"This is the third installment in a series of blog articles focused on using ArcGIS Notebooks to map gentrification in US cities.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Spatial data science using ArcGIS Notebooks Blog 3: Data validation and data engineering\" \/>\n<meta property=\"og:description\" content=\"This is the third installment in a series of blog articles focused on using ArcGIS Notebooks to map gentrification in US cities.\" \/>\n<meta property=\"og:url\" 
content=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\" \/>\n<meta property=\"og:site_name\" content=\"ArcGIS Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/esrigis\/\" \/>\n<meta property=\"article:modified_time\" content=\"2025-10-06T17:49:57+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@ESRI\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"23 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":[\"Article\",\"BlogPosting\"],\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\"},\"author\":{\"name\":\"Nicholas Giner\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2dc4741deea59d3274cfa775e52501b2\"},\"headline\":\"Spatial data science using ArcGIS Notebooks Blog 3: Data validation and data engineering\",\"datePublished\":\"2025-03-10T13:00:51+00:00\",\"dateModified\":\"2025-10-06T17:49:57+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\"},\"wordCount\":12,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\"},\"keywords\":[\"ArcGIS API for Python\",\"ArcGIS Notebooks\",\"Data Engineering\",\"python\",\"spatial data 
science\"],\"articleSection\":[\"Analytics\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\",\"name\":\"Data Validation and Data Engineering using ArcGIS Notebooks\",\"isPartOf\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#website\"},\"datePublished\":\"2025-03-10T13:00:51+00:00\",\"dateModified\":\"2025-10-06T17:49:57+00:00\",\"description\":\"This is the third installment in a series of blog articles focused on using ArcGIS Notebooks to map gentrification in US cities.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.esri.com\/arcgis-blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Spatial data science using ArcGIS Notebooks Blog 3: Data validation and data 
engineering\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#website\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/\",\"name\":\"ArcGIS Blog\",\"description\":\"Get insider info from Esri product teams\",\"publisher\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.esri.com\/arcgis-blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\",\"name\":\"Esri\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png\",\"contentUrl\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png\",\"width\":400,\"height\":400,\"caption\":\"Esri\"},\"image\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/esrigis\/\",\"https:\/\/x.com\/ESRI\",\"https:\/\/www.linkedin.com\/company\/5311\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2dc4741deea59d3274cfa775e52501b2\",\"name\":\"Nicholas Giner\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2021\/01\/headshot-e1610030307989-213x200.jpeg\",\"contentUrl\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2021\/01\/headshot-e1610030307989-213x200.jpeg\",\"caption\":\"Nicholas Giner\"},\"description\":\"Nick Giner is a Product Manager for Spatial Analysis and Data Science. 
Prior to joining Esri in 2014, he completed Bachelor\u2019s and PhD degrees in Geography from Penn State University and Clark University, respectively. In his spare time, he likes to play guitar, golf, cook, cut the grass, and read\/watch shows about history.\",\"sameAs\":[\"www.linkedin.com\/in\/nicholas-giner-0282966b\",\"https:\/\/x.com\/NickGiner\"],\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/author\/nginer\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Data Validation and Data Engineering using ArcGIS Notebooks","description":"This is the third installment in a series of blog articles focused on using ArcGIS Notebooks to map gentrification in US cities.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering","og_locale":"en_US","og_type":"article","og_title":"Spatial data science using ArcGIS Notebooks Blog 3: Data validation and data engineering","og_description":"This is the third installment in a series of blog articles focused on using ArcGIS Notebooks to map gentrification in US cities.","og_url":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering","og_site_name":"ArcGIS Blog","article_publisher":"https:\/\/www.facebook.com\/esrigis\/","article_modified_time":"2025-10-06T17:49:57+00:00","twitter_card":"summary_large_image","twitter_site":"@ESRI","twitter_misc":{"Est. 
reading time":"23 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":["Article","BlogPosting"],"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#article","isPartOf":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering"},"author":{"name":"Nicholas Giner","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2dc4741deea59d3274cfa775e52501b2"},"headline":"Spatial data science using ArcGIS Notebooks Blog 3: Data validation and data engineering","datePublished":"2025-03-10T13:00:51+00:00","dateModified":"2025-10-06T17:49:57+00:00","mainEntityOfPage":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering"},"wordCount":12,"commentCount":0,"publisher":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization"},"keywords":["ArcGIS API for Python","ArcGIS Notebooks","Data Engineering","python","spatial data science"],"articleSection":["Analytics"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering","url":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering","name":"Data Validation and Data Engineering using ArcGIS 
Notebooks","isPartOf":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#website"},"datePublished":"2025-03-10T13:00:51+00:00","dateModified":"2025-10-06T17:49:57+00:00","description":"This is the third installment in a series of blog articles focused on using ArcGIS Notebooks to map gentrification in US cities.","breadcrumb":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.esri.com\/arcgis-blog\/"},{"@type":"ListItem","position":2,"name":"Spatial data science using ArcGIS Notebooks Blog 3: Data validation and data engineering"}]},{"@type":"WebSite","@id":"https:\/\/www.esri.com\/arcgis-blog\/#website","url":"https:\/\/www.esri.com\/arcgis-blog\/","name":"ArcGIS Blog","description":"Get insider info from Esri product 
teams","publisher":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.esri.com\/arcgis-blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization","name":"Esri","url":"https:\/\/www.esri.com\/arcgis-blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png","contentUrl":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png","width":400,"height":400,"caption":"Esri"},"image":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/esrigis\/","https:\/\/x.com\/ESRI","https:\/\/www.linkedin.com\/company\/5311\/"]},{"@type":"Person","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2dc4741deea59d3274cfa775e52501b2","name":"Nicholas Giner","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/image\/","url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2021\/01\/headshot-e1610030307989-213x200.jpeg","contentUrl":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2021\/01\/headshot-e1610030307989-213x200.jpeg","caption":"Nicholas Giner"},"description":"Nick Giner is a Product Manager for Spatial Analysis and Data Science. Prior to joining Esri in 2014, he completed Bachelor\u2019s and PhD degrees in Geography from Penn State University and Clark University, respectively. 
In his spare time, he likes to play guitar, golf, cook, cut the grass, and read\/watch shows about history.","sameAs":["www.linkedin.com\/in\/nicholas-giner-0282966b","https:\/\/x.com\/NickGiner"],"url":"https:\/\/www.esri.com\/arcgis-blog\/author\/nginer"}]}},"text_date":"March 10, 2025","author_name":"Multiple Authors","author_page":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/analytics\/spatial-data-science-using-arcgis-notebooks-blog-3-data-validation-and-data-engineering","custom_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/blog-3-wide.png","primary_product":"ArcGIS Pro","tag_data":[{"term_id":387782,"name":"ArcGIS API for Python","slug":"arcgis-api-for-python","term_group":0,"term_taxonomy_id":387782,"taxonomy":"post_tag","description":"","parent":0,"count":44,"filter":"raw"},{"term_id":555752,"name":"ArcGIS Notebooks","slug":"arcgis-notebooks","term_group":0,"term_taxonomy_id":555752,"taxonomy":"post_tag","description":"","parent":0,"count":38,"filter":"raw"},{"term_id":760452,"name":"Data Engineering","slug":"data-engineering","term_group":0,"term_taxonomy_id":760452,"taxonomy":"post_tag","description":"","parent":0,"count":34,"filter":"raw"},{"term_id":24341,"name":"python","slug":"python","term_group":0,"term_taxonomy_id":24341,"taxonomy":"post_tag","description":"","parent":0,"count":171,"filter":"raw"},{"term_id":759592,"name":"spatial data science","slug":"spatial-data-science","term_group":0,"term_taxonomy_id":759592,"taxonomy":"post_tag","description":"","parent":0,"count":17,"filter":"raw"}],"category_data":[{"term_id":23341,"name":"Analytics","slug":"analytics","term_group":0,"term_taxonomy_id":23341,"taxonomy":"category","description":"","parent":0,"count":1328,"filter":"raw"}],"product_data":[{"term_id":36841,"name":"ArcGIS API for 
Python","slug":"api-python","term_group":0,"term_taxonomy_id":36841,"taxonomy":"product","description":"","parent":36601,"count":151,"filter":"raw"},{"term_id":36561,"name":"ArcGIS Pro","slug":"arcgis-pro","term_group":0,"term_taxonomy_id":36561,"taxonomy":"product","description":"","parent":0,"count":2036,"filter":"raw"}],"primary_product_link":"https:\/\/www.esri.com\/arcgis-blog\/?s=#&products=arcgis-pro","_links":{"self":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog\/2714642","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/users\/154341"}],"replies":[{"embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/comments?post=2714642"}],"version-history":[{"count":0,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog\/2714642\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/media?parent=2714642"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/categories?post=2714642"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/tags?post=2714642"},{"taxonomy":"industry","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/industry?post=2714642"},{"taxonomy":"product","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/product?post=2714642"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}