{"id":2941215,"date":"2025-10-01T13:06:50","date_gmt":"2025-10-01T20:06:50","guid":{"rendered":"https:\/\/www.esri.com\/arcgis-blog\/?post_type=blog&#038;p=2941215"},"modified":"2025-10-01T13:06:50","modified_gmt":"2025-10-01T20:06:50","slug":"streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine","status":"publish","type":"blog","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine","title":{"rendered":"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine"},"author":323502,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","format":"standard","meta":{"_acf_changed":false,"_searchwp_excluded":""},"categories":[23341,23851,37511],"tags":[],"industry":[],"product":[765842,36551],"class_list":["post-2941215","blog","type-blog","status-publish","format-standard","hentry","category-analytics","category-data-management","category-sharing-collaboration","product-geoanalytics-engine","product-arcgis-online"],"acf":{"authors":[{"ID":323502,"user_firstname":"Arif","user_lastname":"Masrur","nickname":"Arif Masrur","user_nicename":"amasrur","display_name":"Arif Masrur","user_email":"amasrur@esri.com","user_url":"","user_registered":"2022-12-01 19:41:13","user_description":"Arif Masrur is a Sr. Solutions Engineer at Esri with expertise in spatial data science and GeoAI. 
He has a PhD in Geography (GIScience and Data Analytics) from Penn State, and loves transforming space-time data into actionable intelligence.","user_avatar":"<img data-del=\"avatar\" src='https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/10\/g2513150-dev-2022-portraits-1042-scaled-e1759350705582-213x200.jpg' class='avatar pp-user-avatar avatar-96 photo ' height='96' width='96'\/>"},{"ID":370532,"user_firstname":"Kyunam","user_lastname":"Kim","nickname":"Kyunam Kim","user_nicename":"kyunam_kim","display_name":"Kyunam Kim","user_email":"kyunam_kim@esri.com","user_url":"","user_registered":"2025-04-07 20:07:32","user_description":"Kyunam Kim is a Spatial Data Science Manager at Esri with 20+ years of experience in geospatial analytics, AI, and big data. A true doer in the trenches, he leads teams delivering advanced Spatial Analytics, GeoAI, and Generative AI solutions, while driving large-scale geospatial innovation on cloud platforms.","user_avatar":"<img data-del=\"avatar\" src='https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/10\/bio-467x465.jpg' class='avatar pp-user-avatar avatar-96 photo ' height='96' width='96'\/>"}],"short_description":"Streamline spatial analysis on EMR using AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine for easy data discovery and reuse.","flexible_content":[{"acf_fc_layout":"content","content":"<p>Spatial data scientists and GIS analysts increasingly deal with massive, distributed datasets stored in cloud data lakes. 
While <a href=\"https:\/\/www.esri.com\/en-us\/arcgis\/products\/arcgis-geoanalytics-engine\/overview\">ArcGIS GeoAnalytics Engine<\/a> brings powerful, scalable spatial analysis to Spark environments like <a href=\"https:\/\/aws.amazon.com\/emr\/\">Amazon EMR<\/a>, many users still struggle to find, manage, and reuse spatial datasets efficiently in shared environments.<\/p>\n<p>This is where the <a href=\"https:\/\/docs.aws.amazon.com\/prescriptive-guidance\/latest\/serverless-etl-aws-glue\/aws-glue-data-catalog.html\">AWS Glue Data Catalog<\/a> paired with GeoAnalytics Engine becomes valuable in a modern spatial data science workflow. As a centralized metadata repository, the Glue Data Catalog helps data science teams discover, organize, and manage the schema of their datasets stored in Amazon S3, RDS, Redshift, and more. When Spark on EMR is configured to use the AWS Glue Data Catalog for table metadata, you can:<\/p>\n<ul>\n<li>Get a unified view of data sources directly in your PySpark notebook \u2013 making it easier to discover, access, and analyze spatial data with varied geometry types (e.g., points, multipoints, linestrings, polygons) shared and managed across teams and projects.<\/li>\n<li>Ensure schema consistency for repeatable spatial analysis workflows.<\/li>\n<li>Automate dataset registration using crawlers or <code>.saveAsTable<\/code>.<\/li>\n<li>Apply GeoAnalytics Engine\u2019s SQL functions and tools to discoverable datasets directly within your PySpark notebook.<\/li>\n<\/ul>\n<p>In this blog post, we&#8217;ll show how you can use the AWS Glue Data Catalog with ArcGIS GeoAnalytics Engine running on your EMR cluster to streamline spatial data <strong>discovery<\/strong> and <strong>analytics<\/strong> workflows. Our example use case focuses on a county-level Covid-19 dataset from the United States, containing over 2.8 million records, to help understand outbreak patterns. 
As we explore this dataset, we\u2019ll specifically walk you through the steps of:<\/p>\n<ol>\n<li>How to catalog your spatial data in AWS Glue.<\/li>\n<li>How to configure your EMR Cluster to use AWS Glue Data Catalog as an external metastore that can be accessed by GeoAnalytics Engine.<\/li>\n<li>How to discover, read, analyze, and write S3 tables using AWS Glue Data Catalog and GeoAnalytics Engine.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p>In the AWS Glue Data Catalog, metadata is organized into tables, where each table represents the structure and properties of a specific data source. You can populate the Data Catalog using a crawler, which automatically scans data sources and extracts metadata. Crawlers can connect to both AWS-native and external data sources. In this example, we&#8217;ll use Amazon S3 as our data source. The diagram below presents a high-level architecture that integrates Amazon S3, the AWS Glue Data Catalog, and ArcGIS GeoAnalytics Engine running on Amazon EMR.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2941469,"id":2941469,"title":"AWS S3 Glue EMR","filename":"AWS-S3-Glue-EMR.png","filesize":40870,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/aws-s3-glue-emr","alt":"","author":"323502","description":"","caption":"","name":"aws-s3-glue-emr","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 19:24:23","modified":"2025-09-29 
19:24:23","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":930,"height":440,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR.png","medium-width":464,"medium-height":220,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR.png","medium_large-width":768,"medium_large-height":363,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR.png","large-width":930,"large-height":440,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR.png","1536x1536-width":930,"1536x1536-height":440,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR.png","2048x2048-width":930,"2048x2048-height":440,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR-826x391.png","card_image-width":826,"card_image-height":391,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/AWS-S3-Glue-EMR.png","wide_image-width":930,"wide_image-height":440}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<h3><strong>1. 
Catalog Your Data in AWS Glue<\/strong><\/h3>\n<p>To catalog your data in S3 using AWS Glue Data Catalog:<\/p>\n<ul>\n<li>Store your data in S3 and create an <a href=\"https:\/\/aws.amazon.com\/iam\/\">Identity and Access Management (IAM)<\/a> role with access to both S3 and Glue.<\/li>\n<li>In the Glue Console, optionally create a database to organize your tables.<\/li>\n<li>Create a <a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/webapi\/API_Crawler.html\">crawler<\/a>, point it to your S3 path, assign the IAM role, and choose your target database (for example, the one you just created).<\/li>\n<li>Configure it to run on demand or on a schedule.<\/li>\n<li>After running the crawler, you\u2019ll find the table metadata under the Tables section \u2013 ready for use with EMR, Glue jobs, Athena, and more.<\/li>\n<\/ul>\n<p>The video below walks through these steps:<\/p>\n<div style=\"width: 640px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-2941215-1\" width=\"640\" height=\"360\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue-cataloging-GIF.mp4?_=1\" \/><a href=\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue-cataloging-GIF.mp4\">https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue-cataloging-GIF.mp4<\/a><\/video><\/div>\n<h3><strong>2. 
Configure EMR Cluster to Use AWS Glue Data Catalog as External Metastore<\/strong><\/h3>\n<p>Navigate to Amazon EMR and click\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-gs.html#emr-getting-started-launch-sample-cluster\"><strong>Create cluster<\/strong><\/a>. Next, follow this <a href=\"https:\/\/developers.arcgis.com\/geoanalytics\/install\/aws_emr\/\">step-by-step workflow to install<\/a> ArcGIS GeoAnalytics Engine in the cluster. First, choose an Amazon EMR release and custom application bundle that are compatible with your version of the GeoAnalytics Engine distribution (see the <a href=\"https:\/\/developers.arcgis.com\/geoanalytics\/install\/aws_emr\/#:~:text=The%20table%20below%20summarizes%20the%20Amazon%20EMR%20runtimes%20supported%20by%20each%20version%20of%20GeoAnalytics%20Engine\">table of supported runtimes<\/a>).<\/p>\n<p>By default, Spark on EMR uses the local Hive metastore, which is ephemeral \u2013 all metadata is lost when the cluster is terminated. Starting with EMR release 5.8.0, you can configure Spark to use the AWS Glue Data Catalog as its Hive-compatible metastore. Selecting \u201cUse for Spark table metadata\u201d, as shown below, lets Spark use the AWS Glue Data Catalog as its metastore, so that tables created with Spark SQL or DataFrames are stored persistently and can be shared across clusters and other AWS services. \u201cUse for Hive table metadata\u201d does the same for Hive but does not affect Spark unless the Spark setting is also checked. 
You can enable both options to share table metadata across Spark, Hive, and other services.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2941472,"id":2941472,"title":"EMR_Cluster_Config","filename":"EMR_Cluster_Config.png","filesize":68048,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/emr_cluster_config","alt":"","author":"323502","description":"","caption":"","name":"emr_cluster_config","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 19:31:51","modified":"2025-09-29 19:31:51","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":750,"height":534,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config.png","medium-width":367,"medium-height":261,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config.png","medium_large-width":750,"medium_large-height":534,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config.png","large-width":750,"large-height":534,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config.png","1536x1536-width":750,"1536x1536-height":534,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config.png","2048x2048-width":750,"2048x2048-height":534,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/EMR_Cluster_Config-653x465.png","card_image-width":653,"card_image-height":465,"wide_image":"https:\/\/www.esri.com\/arcgis-blo
g\/app\/uploads\/2025\/09\/EMR_Cluster_Config.png","wide_image-width":750,"wide_image-height":534}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>As for IAM permissions, if you use the default EC2 instance profile, no further action is required, because the\u00a0AmazonElasticMapReduceforEC2Role\u00a0managed policy attached to the\u00a0EMR_EC2_DefaultRole\u00a0allows all necessary AWS Glue actions. However, if you specify a custom EC2 instance profile and permissions, follow the <a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-spark-glue.html#:~:text=used%20for%20encryption.-,Permissions%20for%20AWS%20Glue%20actions,-If%20you%20use\">AWS guidance on permissions for AWS Glue actions<\/a>.<\/p>\n<h3><strong>3. Discover, Read, Analyze, and Write S3 Tables Using AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine<\/strong><\/h3>\n<p>After you have successfully created the cluster, launch your Workspace (notebook) and attach it to the cluster. 
In your notebook cell, you can begin by listing all available Glue databases.<\/p>\n<pre><code>\r\nspark.sql(\"show databases\").<span style=\"color: #d73a49\">show<\/span>()\r\n<\/code><\/pre>\n"},{"acf_fc_layout":"image","image":{"ID":2941474,"id":2941474,"title":"glue_databases","filename":"glue_databases.png","filesize":1399,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/glue_databases","alt":"","author":"323502","description":"","caption":"","name":"glue_databases","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 19:47:12","modified":"2025-09-29 19:47:12","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":114,"height":144,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","thumbnail-width":114,"thumbnail-height":144,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","medium-width":114,"medium-height":144,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","medium_large-width":114,"medium_large-height":144,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","large-width":114,"large-height":144,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","1536x1536-width":114,"1536x1536-height":144,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","2048x2048-width":114,"2048x2048-height":144,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","card_image-width":114,"card_image-height":144,"wide_image":"ht
tps:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/glue_databases.png","wide_image-width":114,"wide_image-height":144}},"image_position":"left-center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Let\u2019s set the active database for your Spark session to <code>crawl_s3<\/code> (the one you created in the previous step by following the video demo) and discover what tables are within it.<\/p>\n<pre><code>\r\nspark.catalog.setCurrentDatabase(\"crawl_s3\")\r\nspark.sql(\"show tables\").<span style=\"color: #d73a49\">show<\/span>()\r\n<\/code><\/pre>\n"},{"acf_fc_layout":"image","image":{"ID":2941475,"id":2941475,"title":"show_tables","filename":"show_tables.png","filesize":4479,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/show_tables","alt":"","author":"323502","description":"","caption":"","name":"show_tables","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 19:49:13","modified":"2025-09-29 
19:49:13","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":297,"height":162,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables-213x162.png","thumbnail-width":213,"thumbnail-height":162,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","medium-width":297,"medium-height":162,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","medium_large-width":297,"medium_large-height":162,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","large-width":297,"large-height":162,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","1536x1536-width":297,"1536x1536-height":162,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","2048x2048-width":297,"2048x2048-height":162,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","card_image-width":297,"card_image-height":162,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_tables.png","wide_image-width":297,"wide_image-height":162}},"image_position":"left-center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Suppose you&#8217;re interested in analyzing county-level Covid-19 case data stored in a table named <code>covid<\/code>. But before diving into the analysis, let&#8217;s take a short detour: what if you weren&#8217;t sure which tables in <code>crawl_s3<\/code> actually contain geospatial data? In principle, there could be hundreds of tables in the <code>crawl_s3<\/code> database. While the Glue Catalog currently does not natively distinguish between spatial and non-spatial tables, you can add custom metadata to help with discovery. 
For example, you can tag spatial tables using table properties. Here we do this for our <code>covid<\/code> table, since it contains location information:<\/p>\n<pre><code>\r\nspark.sql(\"\"\"\r\n  ALTER TABLE covid\r\n  SET TBLPROPERTIES (\r\n    'has_spatial' = 'true'\r\n  )\r\n\"\"\")\r\n<\/code><\/pre>\n<p>This simple metadata tag makes it easier to identify and manage spatial datasets at scale. You can read the property back to check whether a table is flagged as spatial:<\/p>\n<pre><code>\r\nhas_spatial <span style=\"color: #005cc5\">=<\/span> dict(spark.sql(\"SHOW TBLPROPERTIES `covid`\").<span style=\"color: #d73a49\">collect<\/span>()).<span style=\"color: #d73a49\">get<\/span>(\"has_spatial\")\r\nprint(f\"has_spatial: {has_spatial}\")\r\n<\/code><\/pre>\n<pre><code>has_spatial: true<\/code><\/pre>\n<p>Now, let\u2019s read and analyze data records from the <code>covid<\/code>\u00a0table (via the Glue Catalog, rather than directly from S3) as a Spark DataFrame:<\/p>\n<pre><code>\r\ndf <span style=\"color: #005cc5\">=<\/span> spark.sql(\"SELECT * FROM covid\")\r\ndf.show()\r\n<\/code><\/pre>\n"},{"acf_fc_layout":"image","image":{"ID":2941510,"id":2941510,"title":"covid_df","filename":"covid_df-1.png","filesize":7614,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/covid_df-2","alt":"","author":"323502","description":"","caption":"","name":"covid_df-2","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 21:12:32","modified":"2025-09-29 
21:12:32","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":473,"height":140,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1-213x140.png","thumbnail-width":213,"thumbnail-height":140,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","medium-width":464,"medium-height":137,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","medium_large-width":473,"medium_large-height":140,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","large-width":473,"large-height":140,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","1536x1536-width":473,"1536x1536-height":140,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","2048x2048-width":473,"2048x2048-height":140,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","card_image-width":473,"card_image-height":140,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid_df-1.png","wide_image-width":473,"wide_image-height":140}},"image_position":"left-center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>While reading from a Glue Catalog table is not inherently faster than a direct S3 read, it provides important advantages like automatic schema handling and partition awareness, which can improve usability and performance in real-world workflows. 
What that means is that the Data Catalog automatically captures and manages the schema of your data sources \u2013 including <a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/catalog-and-crawler.html#:~:text=Data%20Catalog%20.-,Schema%20management,-The%20Data%20Catalog\">schema inference, evolution, and versioning<\/a> \u2013 and you can update your schema and <a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/zero-etl-data-partitioning.html#partitioning-overview\">partitions<\/a> in the Data Catalog using AWS Glue ETL jobs.<\/p>\n<p>While the <code>covid<\/code>\u00a0table includes county-level identifiers (like FIPS codes), it does not contain actual geometry data required for spatial analysis. To enable mapping and spatial operations, first we need to enrich this dataset with geospatial features. For that, using ArcGIS GeoAnalytics Engine, we load a USA Census County <a href=\"https:\/\/developers.arcgis.com\/geoanalytics\/data\/data-sources\/feature-service\/\">feature service<\/a> as a Spark DataFrame and then using Spark SQL we perform the following steps:<\/p>\n<ol>\n<li>Perform an inner join between Covid-19 and County tables on the FIPS code.<\/li>\n<li>Extract year and month separately from the date column using YEAR() and MONTH() functions, enabling temporal aggregation at the monthly level.<\/li>\n<li>Aggregate Covid-19 data using SUM(cases) and SUM(deaths) by county, state, month, and year to get the total number of reported cases and deaths, respectively, for each county per month.<\/li>\n<li>Convert the geometry column <code>shape<\/code> to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Well-known_text_representation_of_geometry\">Well-Known Binary (WKB)<\/a> format using the <a href=\"https:\/\/developers.arcgis.com\/geoanalytics\/sql-functions\/st_as_binary\/\">ST_AsBinary<\/a> function from the GeoAnalytics Engine. This step is necessary because AWS Glue does not natively support geometry or geography spatial types. 
Hence, we use standard text or binary formats to represent our spatial data. Converting to <code>WKB<\/code> allows the aggregated data to be cataloged in Glue and later interpreted correctly by the GeoAnalytics Engine for geospatial queries and analysis.<\/li>\n<\/ol>\n<pre><code>\r\nURL <span style=\"color: #005cc5\">=<\/span> \"https:\/\/services.arcgis.com\/P3ePLMYs2RVChkJx\/arcgis\/rest\/services\/USA_Census_Counties\/FeatureServer\/0\"\r\n\r\nspark.read.format(\"feature-service\").load(URL).<span style=\"color: #d73a49\">select<\/span>(\"FIPS\", \"NAME\", \"shape\").createOrReplaceTempView(\"county\")\r\n\r\ndf <span style=\"color: #005cc5\">=<\/span> spark.sql(\"\"\"\r\nSELECT \r\n    county,\r\n    state,\r\n    YEAR(date) AS year,\r\n    MONTH(date) AS month,\r\n    SUM(cases) AS total_cases,\r\n    SUM(deaths) AS total_deaths,\r\n    ST_asBinary(shape) AS WKB\r\nFROM covid \r\nINNER JOIN county \r\n    ON covid.fips = county.FIPS\r\nGROUP BY \r\n    county, \r\n    state, \r\n    YEAR(date),\r\n    MONTH(date),\r\n    shape\r\n\"\"\")\r\n\r\ndf.show()\r\n\r\n<\/code><\/pre>\n"},{"acf_fc_layout":"image","image":{"ID":2941477,"id":2941477,"title":"output_table_binary","filename":"output_table_binary.png","filesize":12227,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/output_table_binary","alt":"","author":"323502","description":"","caption":"","name":"output_table_binary","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 19:55:18","modified":"2025-09-29 
19:55:18","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":572,"height":175,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary-213x175.png","thumbnail-width":213,"thumbnail-height":175,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","medium-width":464,"medium-height":142,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","medium_large-width":572,"medium_large-height":175,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","large-width":572,"large-height":175,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","1536x1536-width":572,"1536x1536-height":175,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","2048x2048-width":572,"2048x2048-height":175,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","card_image-width":572,"card_image-height":175,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/output_table_binary.png","wide_image-width":572,"wide_image-height":175}},"image_position":"left-center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>At this point, you can optionally <a href=\"https:\/\/developers.arcgis.com\/geoanalytics\/tutorials\/data\/write-to-feature-services\/\">export the resulting DataFrame to ArcGIS Online as a hosted feature service layer<\/a> and then create a time slider map to analyze the Covid-19 outbreak patterns across U.S. 
counties.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2941478,"id":2941478,"title":"covid-19 timeslider","filename":"covid-19-timeslider.gif","filesize":5898189,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider.gif","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/covid-19-timeslider","alt":"","author":"323502","description":"","caption":"Fig. County-level aggregated Covid-19 case counts in the United States (2020-2022 July).","name":"covid-19-timeslider","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 19:56:14","modified":"2025-09-29 19:56:50","menu_order":0,"mime_type":"image\/gif","type":"image","subtype":"gif","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1918,"height":948,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider-213x200.gif","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider.gif","medium-width":464,"medium-height":229,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider.gif","medium_large-width":768,"medium_large-height":380,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider.gif","large-width":1918,"large-height":948,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider-1536x759.gif","1536x1536-width":1536,"1536x1536-height":759,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider.gif","2048x2048-width":1918,"2048x2048-height":948,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider-826x408.gif","card_image-width":826,"card_image-height":408,"wide_image":"h
ttps:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/covid-19-timeslider.gif","wide_image-width":1918,"wide_image-height":948}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Finally, let\u2019s write the DataFrame in <code>parquet<\/code> format to a designated S3 location and register it as a table named <code>Covid_ByCounty<\/code>. This approach lets the geometry be cataloged as a standard binary column while retaining its spatial information. By registering the output in the AWS Glue Data Catalog, the geospatial data becomes immediately accessible for future querying and analysis using tools like GeoAnalytics Engine or <a href=\"https:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/querying-geospatial-data.html\">Amazon Athena<\/a>, without needing to reload it from the original S3 source.<\/p>\n<pre><code>\r\noutput_path <span style=\"color: #005cc5\">=<\/span> \"s3:\/\/ga-engine\/data_for_glue_cataloging\/Outputs\/\"\r\n\r\ndf.write.format(\"parquet\") \\\r\n.mode(\"overwrite\") \\\r\n.option(\"path\", output_path) \\\r\n.saveAsTable(\"Covid_ByCounty\")\r\n<\/code><\/pre>\n<p>You can verify the registration by listing your Glue database tables:<\/p>\n<pre><code>\r\nspark.sql(\"show tables\").<span style=\"color: #d73a49\">show<\/span>()\r\n<\/code><\/pre>\n"},{"acf_fc_layout":"image","image":{"ID":2941479,"id":2941479,"title":"show_out_tables","filename":"show_out_tables.png","filesize":6102,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\/show_out_tables","alt":"","author":"323502","description":"","caption":"","name":"show_out_tables","status":"inherit","uploaded_to":2941215,"date":"2025-09-29 19:58:53","modified":"2025-09-29 
19:58:53","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":300,"height":181,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables-213x181.png","thumbnail-width":213,"thumbnail-height":181,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","medium-width":300,"medium-height":181,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","medium_large-width":300,"medium_large-height":181,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","large-width":300,"large-height":181,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","1536x1536-width":300,"1536x1536-height":181,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","2048x2048-width":300,"2048x2048-height":181,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","card_image-width":300,"card_image-height":181,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/09\/show_out_tables.png","wide_image-width":300,"wide_image-height":181}},"image_position":"left-center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>By writing \u201cgeo-informed\u201d DataFrames to S3 and registering them in the AWS Glue Data Catalog, spatial data teams gain centralized schema management and seamless interoperability across AWS analytics services.\u00a0This approach enhances discoverability, reusability, and governance, making your geospatial and non-geospatial data workflows more scalable and collaborative.<\/p>\n"}],"related_articles":"","show_article_image":false,"card_image":false,"wide_image":false},"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin 
v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\" \/>\n<meta property=\"og:site_name\" content=\"ArcGIS Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/esrigis\/\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@ESRI\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":[\"Article\",\"BlogPosting\"],\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\"},\"author\":{\"name\":\"Arif Masrur\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2c8aee17d2bb73b7dcd3e3817a88b72a\"},\"headline\":\"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine\",\"datePublished\":\"2025-10-01T20:06:50+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\"},\"wordCount\":18,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\"},\"articleSection\":[\"Analytics\",\"Data Management\",\"Sharing and 
Collaboration\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\",\"name\":\"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine\",\"isPartOf\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#website\"},\"datePublished\":\"2025-10-01T20:06:50+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.esri.com\/arcgis-blog\/\"},{\"@type\":\"ListItem\",\"position\
":2,\"name\":\"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#website\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/\",\"name\":\"ArcGIS Blog\",\"description\":\"Get insider info from Esri product teams\",\"publisher\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.esri.com\/arcgis-blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\",\"name\":\"Esri\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png\",\"contentUrl\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png\",\"width\":400,\"height\":400,\"caption\":\"Esri\"},\"image\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/esrigis\/\",\"https:\/\/x.com\/ESRI\",\"https:\/\/www.linkedin.com\/company\/5311\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2c8aee17d2bb73b7dcd3e3817a88b72a\",\"name\":\"Arif 
Masrur\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/10\/g2513150-dev-2022-portraits-1042-scaled-e1759350705582-213x200.jpg\",\"contentUrl\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/10\/g2513150-dev-2022-portraits-1042-scaled-e1759350705582-213x200.jpg\",\"caption\":\"Arif Masrur\"},\"description\":\"Arif Masrur is a Sr. Solutions Engineer at Esri with expertise in spatial data science and GeoAI. He has a PhD in Geography (GIScience and Data Analytics) from Penn State, and loves transforming space-time data into actionable intelligence.\",\"sameAs\":[\"https:\/\/x.com\/Arif_Masrur\"],\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/author\/amasrur\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine","og_locale":"en_US","og_type":"article","og_title":"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine","og_url":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine","og_site_name":"ArcGIS Blog","article_publisher":"https:\/\/www.facebook.com\/esrigis\/","twitter_card":"summary_large_image","twitter_site":"@ESRI","twitter_misc":{"Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":["Article","BlogPosting"],"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#article","isPartOf":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine"},"author":{"name":"Arif Masrur","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2c8aee17d2bb73b7dcd3e3817a88b72a"},"headline":"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine","datePublished":"2025-10-01T20:06:50+00:00","mainEntityOfPage":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine"},"wordCount":18,"commentCount":0,"publisher":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization"},"articleSection":["Analytics","Data Management","Sharing and 
Collaboration"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine","url":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine","name":"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data Catalog and ArcGIS GeoAnalytics Engine","isPartOf":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#website"},"datePublished":"2025-10-01T20:06:50+00:00","breadcrumb":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.esri.com\/arcgis-blog\/"},{"@type":"ListItem","position":2,"name":"Streamlining Spatial Data Discovery and Analytics on Amazon EMR with AWS Glue Data 
Catalog and ArcGIS GeoAnalytics Engine"}]},{"@type":"WebSite","@id":"https:\/\/www.esri.com\/arcgis-blog\/#website","url":"https:\/\/www.esri.com\/arcgis-blog\/","name":"ArcGIS Blog","description":"Get insider info from Esri product teams","publisher":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.esri.com\/arcgis-blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization","name":"Esri","url":"https:\/\/www.esri.com\/arcgis-blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png","contentUrl":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png","width":400,"height":400,"caption":"Esri"},"image":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/esrigis\/","https:\/\/x.com\/ESRI","https:\/\/www.linkedin.com\/company\/5311\/"]},{"@type":"Person","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/2c8aee17d2bb73b7dcd3e3817a88b72a","name":"Arif Masrur","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/image\/","url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/10\/g2513150-dev-2022-portraits-1042-scaled-e1759350705582-213x200.jpg","contentUrl":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/10\/g2513150-dev-2022-portraits-1042-scaled-e1759350705582-213x200.jpg","caption":"Arif Masrur"},"description":"Arif Masrur is a Sr. Solutions Engineer at Esri with expertise in spatial data science and GeoAI. 
He has a PhD in Geography (GIScience and Data Analytics) from Penn State, and loves transforming space-time data into actionable intelligence.","sameAs":["https:\/\/x.com\/Arif_Masrur"],"url":"https:\/\/www.esri.com\/arcgis-blog\/author\/amasrur"}]}},"text_date":"October 1, 2025","author_name":"Multiple Authors","author_page":"https:\/\/www.esri.com\/arcgis-blog\/products\/geoanalytics-engine\/analytics\/streamlining-spatial-data-discovery-and-analytics-on-amazon-emr-with-aws-glue-data-catalog-and-arcgis-geoanalytics-engine","custom_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/08\/Newsroom-Keyart-Wide-1920-x-1080.jpg","primary_product":"ArcGIS GeoAnalytics Engine","tag_data":[],"category_data":[{"term_id":23341,"name":"Analytics","slug":"analytics","term_group":0,"term_taxonomy_id":23341,"taxonomy":"category","description":"","parent":0,"count":1329,"filter":"raw"},{"term_id":23851,"name":"Data Management","slug":"data-management","term_group":0,"term_taxonomy_id":23851,"taxonomy":"category","description":"","parent":0,"count":920,"filter":"raw"},{"term_id":37511,"name":"Sharing and Collaboration","slug":"sharing-collaboration","term_group":0,"term_taxonomy_id":37511,"taxonomy":"category","description":"","parent":0,"count":424,"filter":"raw"}],"product_data":[{"term_id":765842,"name":"ArcGIS GeoAnalytics Engine","slug":"geoanalytics-engine","term_group":0,"term_taxonomy_id":765842,"taxonomy":"product","description":"","parent":36601,"count":23,"filter":"raw"},{"term_id":36551,"name":"ArcGIS 
Online","slug":"arcgis-online","term_group":0,"term_taxonomy_id":36551,"taxonomy":"product","description":"","parent":0,"count":2427,"filter":"raw"}],"primary_product_link":"https:\/\/www.esri.com\/arcgis-blog\/?s=#&products=geoanalytics-engine","_links":{"self":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog\/2941215","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/users\/323502"}],"replies":[{"embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/comments?post=2941215"}],"version-history":[{"count":0,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog\/2941215\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/media?parent=2941215"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/categories?post=2941215"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/tags?post=2941215"},{"taxonomy":"industry","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/industry?post=2941215"},{"taxonomy":"product","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/product?post=2941215"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}