Nov 15, 2023

Leverage Apache Arrow in ArcGIS Pro

By Hannes Ziegler

ArcGIS continues to grow as a geospatial data science platform, incorporating specialized geospatial data science tools with open-source components. As this network of components and the volume of data both keep growing, we need an efficient way to connect the pieces. Apache Arrow may help.

Introduction to Apache Arrow

Apache Arrow is a burgeoning, ambitious, open-source project by Wes McKinney and partners. For some time now it has been slowly finding its way into various popular data and analytics platforms. In short, Apache Arrow is an in-memory, columnar, cross-platform, cross-language, and open-source data representation that allows you to efficiently transfer data between components. It is intended to sit low in the stack:

[Apache Arrow] is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.

The Apache Organization

In other words, unlike user-facing Pandas and Spark data frames, Apache Arrow’s data representation is intended to sit behind the scenes at a lower level, efficiently running the logistics regardless of platform or language.

One of the most powerful promises of Arrow is to serve as a sort of Esperanto (or common language) for data transport—a super-efficient, often zero-copy vehicle that can thread the interfaces between various platforms, including ArcGIS Pro.

In this blog, you’ll learn how to leverage Apache Arrow to improve your workflows across components like Pandas (including Spatially Enabled Data Frames and Geopandas), Spark, Parquet, and ArcPy.

Leverage Apache Arrow in ArcPy

We added support for reading and writing Arrow Tables to ArcPy at ArcGIS Pro 2.9. With the release of ArcGIS Pro 3.2, we improved upon this feature by adding support for additional data types and geometry encodings. This allows you to connect your ArcGIS Pro workflows with other data and analytics platforms by transporting your geospatial data using Arrow Tables. As Apache Arrow grows in popularity and adoption, support for it will expand on other platforms. So, if you’re searching for an efficient path for bringing your geospatial data from other projects into ArcGIS (or vice versa), leveraging ArcPy’s integration with Arrow may, in some cases, offer the best solution.

Apache Arrow in Python

Apache Arrow’s interface for Python is provided by the PyArrow library.

Arrow Tables

Arrow Tables are a tabular data representation composed of columns, in which each column has a field name, data type, and the data itself (as well as optional metadata, more on this later).

Write an Arrow Table from ArcPy

To convert a Featureclass to an Arrow Table, you can use the arcpy.da.TableToArrowTable function.
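As a rough sketch of that call, the helper below wraps arcpy.da.TableToArrowTable. The helper name is hypothetical, and arcpy is imported inside the function only so that the sketch can be defined outside of ArcGIS Pro; calling it still requires an environment where arcpy is available.

```python
def feature_class_to_arrow(feature_class, geometry_encoding=None):
    """Export a feature class or table to an Arrow Table via ArcPy."""
    # Imported lazily; arcpy only exists in an ArcGIS Pro Python environment.
    import arcpy

    if geometry_encoding is None:
        # Use ArcPy's default geometry encoding.
        return arcpy.da.TableToArrowTable(feature_class)
    return arcpy.da.TableToArrowTable(
        feature_class, geometry_encoding=geometry_encoding
    )

# Example call inside ArcGIS Pro, using the layer name from later examples:
# nsowlnests_at = feature_class_to_arrow("northern_spotted_owl_nests")
```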

The geometry column in the resulting Arrow Table will be encoded in the EsriShape binary format. This format is efficient and lossless, but it is also incompatible with most other analytics platforms. When you need the exported geometry data to be compatible with another platform, you can choose a different geometry encoding with the optional geometry_encoding parameter, which supports the additional geometry encodings EsriJSON, GeoJSON, WKT, and WKB. These are publicly documented formats for representing geometries.

Most geospatial analytics and data platforms will support reading or writing at least one of these formats.
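To give a feel for how these encodings differ, the sketch below writes one made-up point as WKT, GeoJSON, and WKB using only the Python standard library. It is an illustration of the formats themselves, not of how ArcPy produces them.

```python
import json
import struct

x, y = -123.1, 44.05  # hypothetical longitude and latitude

# WKT and GeoJSON are text, so they travel in string columns.
wkt = f"POINT ({x} {y})"
geojson = json.dumps({"type": "Point", "coordinates": [x, y]})

# WKB is binary: a byte-order flag (1 = little endian), the geometry
# type (1 = Point), then X and Y as 8-byte doubles.
wkb = struct.pack("<BIdd", 1, 1, x, y)

print(wkt)
print(geojson)
print(wkb.hex())
```

The text encodings are easy to inspect by eye, while WKB is more compact and avoids parsing numbers from text.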

Read an Arrow Table to ArcPy

Reading an Arrow Table into ArcGIS Pro is done by passing the Arrow Table into arcpy.management.CopyRows (for Tables) or arcpy.management.CopyFeatures (for Featureclasses). In fact, geoprocessing tools accept Arrow Tables as a data source, so you can directly use an Arrow Table as input to a geoprocessing tool.
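A minimal sketch of that step might look like the following. The helper name and its has_geometry flag are hypothetical conveniences; the two geoprocessing tools are the ones named above. As before, arcpy is imported lazily so the function can be defined anywhere, but it only runs inside an ArcGIS Pro Python environment.

```python
def arrow_to_geodatabase(arrow_table, out_path, has_geometry=True):
    """Write an Arrow Table to a feature class or a standalone table."""
    # Imported lazily; arcpy only exists in an ArcGIS Pro Python environment.
    import arcpy

    if has_geometry:
        # Tables with a geometry column become feature classes.
        return arcpy.management.CopyFeatures(arrow_table, out_path)
    # Tables without geometry become standalone tables.
    return arcpy.management.CopyRows(arrow_table, out_path)

# Example call inside ArcGIS Pro (the output name is a placeholder):
# arrow_to_geodatabase(nsowlnests_at, "nsowlnests_copy")
```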

ArcPy can read Arrow Tables with all the same geometry encodings it can write (EsriShape, EsriJSON, GeoJSON, WKT, and WKB). However, when the Arrow Table did not originate from ArcPy, you may need to do some additional prep work to ensure the table can be successfully read by ArcPy. You’ll learn about that in the next section.

Interoperability with other analytics components

ArcPy uses metadata keys embedded with the Arrow Table columns to determine how to interpret the data. The metadata is stored as part of a table’s schema. When using Arrow Tables as a vehicle for moving data between different geospatial data and analytics platforms, it is important to understand the schema specification for Apache Arrow Tables that ArcPy supports. You can view an Arrow Table’s schema using its schema attribute.

<Arrow Table Object>.schema

For the Arrow Table from the previous example, 'nsowlnests_at', which contains the columns OID, Shape, NEST_ID, TREE_SPECIES, NEST_HEIGHT_M, LAST_ACTIVE_YEAR, and NOTES, the schema looks like this:

Note the metadata attached to the Shape field. The esri.sr_wkt key defines the coordinate system of the geometry stored in this column using Well-known-text of Coordinate Reference Systems (WKT CRS). The esri.encoding key specifies the geometry encoding of the data, in this case EsriShape. The data type of the Shape field is binary. Note that different geometry encodings may require different field data types. For example, if the Shape field held GeoJSON encoded geometry instead, it would need to be of data type string.

You can find additional information about the required schema and the mappings and metadata for the supported field data types in the Type conversions section of the Apache Arrow in ArcGIS documentation.

The schema profile for Apache Arrow Tables supported by ArcPy is not the only Arrow Table schema profile for geospatial data; there is also the GeoArrow specification. ArcPy supports reading Arrow Tables with a GeoArrow schema, but it will not create Arrow Tables with the GeoArrow schema.

Some platforms may not preserve an Arrow Table’s original schema or produce an Arrow Table with a schema ArcPy understands. In cases where they don’t, you will need to reconstruct the schema either from scratch or using the original schema.

Parquet

While Apache Arrow is an efficient but temporary in-memory data structure for fast operations, Apache Parquet is an on-disk data structure for space efficient long-term storage. In short, Apache Arrow is for processing and moving of data, and Apache Parquet is for storage. The two formats are optimized for compatibility. This compatibility means that the schema will be preserved when writing an Arrow Table to a parquet file for long-term storage.

Here’s how you can move geospatial data between Parquet files and ArcGIS Pro using Apache Arrow:

Pandas DataFrames

The Pandas DataFrame is a table-like in-memory data structure with an interface for data analysis. The Pandas team plans to completely back Pandas with Apache Arrow (instead of NumPy) when Pandas 3.0 is released. With the recently released Pandas 2.0, backing a DataFrame with Apache Arrow is optional. ArcGIS Pro 3.2 ships with Pandas version 2.0.2, so you can try this out yourself.

In the following example, we will use Arrow to move geospatial data between a Pandas DataFrame and ArcGIS Pro, and leverage the new Arrow backed data types in Pandas:

import pandas as pd
import pyarrow as pa

# Using the `nsowlnests_at` created previously from ArcPy:

# Store the Arrow Table’s schema in `schema` for later, because
# it will not be preserved during the conversion to a Pandas DataFrame.
schema = nsowlnests_at.schema

# Define a data type mapping (to arrow data types) for Pandas to use.
# `pd.ArrowDtype` is the public name of the Arrow-backed extension dtype.
dtype_mapping = {
    pa.int8(): pd.ArrowDtype(pa.int8()),
    pa.int16(): pd.ArrowDtype(pa.int16()),
    pa.int32(): pd.ArrowDtype(pa.int32()),
    pa.int64(): pd.ArrowDtype(pa.int64()),
    pa.uint8(): pd.ArrowDtype(pa.uint8()),
    pa.uint16(): pd.ArrowDtype(pa.uint16()),
    pa.uint32(): pd.ArrowDtype(pa.uint32()),
    pa.uint64(): pd.ArrowDtype(pa.uint64()),
    pa.float32(): pd.ArrowDtype(pa.float32()),
    pa.float64(): pd.ArrowDtype(pa.float64()),
    pa.bool_(): pd.ArrowDtype(pa.bool_()),
    pa.binary(): pd.ArrowDtype(pa.binary()),
    pa.string(): pd.ArrowDtype(pa.string()),
}

# Convert the Arrow Table to a Pandas DataFrame using `dtype_mapping`.
nsowlnests_pdf = nsowlnests_at.to_pandas(types_mapper=dtype_mapping.get)  # _pdf for Pandas DataFrame

# After some processing performed on the Pandas DataFrame...

# Convert the Pandas DataFrame back to an Arrow Table,
# applying the schema stored earlier.
retrieved_at = pa.Table.from_pandas(nsowlnests_pdf, schema=schema)

# Now use `retrieved_at` in ArcPy.

In testing, the from_pandas operation sees a significant performance boost of roughly 40 percent from the Pandas DataFrame being backed with Arrow data types rather than NumPy, but your mileage may vary. Pandas 3.0 is expected to standardize this once it is released.

While moving data between ArcGIS and Pandas can be useful, Pandas has no inherent geospatial data processing and analysis capabilities. For this, you will need to look to the ArcGIS API Spatially Enabled DataFrame in the next section.

Spatially Enabled DataFrames

The ArcGIS API for Python’s Spatially Enabled DataFrame (SEDF) is built on top of Pandas. Essentially, it extends the Pandas DataFrame with geospatial capabilities, with interoperability between SEDF and ArcPy. An SEDF can be created from a Featureclass using the ArcGIS API, and ArcPy can directly read the SEDF format as input to geoprocessing tools, so you don’t necessarily need to use Arrow. However, you can use Arrow in this transaction as well. By converting the SEDF to an Arrow Table first, and then using the Arrow Table with ArcPy instead of the SEDF, testing resulted in roughly a 14 percent boost in performance (again, your mileage may vary).

The below code shows how you can leverage Arrow to move geospatial data between the ArcGIS API SEDF and ArcGIS Pro to gain a slight performance boost:

Note that the Arrow Table that results from the spatial.to_arrow method adheres to the GeoArrow specification instead of Esri’s schema profile for Apache Arrow Tables.

Geopandas

The Geopandas GeoDataFrame is also built on top of Pandas and, like SEDF, extends the Pandas DataFrame with geospatial capabilities. You can convert a Featureclass to a GeoDataFrame using geopandas.read_file. However, converting a GeoDataFrame to a Featureclass is not directly supported. You can go one of two routes here: either convert the GeoDataFrame to an SEDF using pd.DataFrame.spatial.from_geodataframe, or leverage Arrow.

In this example, we will use Arrow to move geospatial data between a Geopandas GeoDataFrame and ArcGIS Pro:

import geopandas  # Must install into environment before import
import pyarrow as pa

# Read a featureclass to a Geopandas GeoDataFrame
ws = arcpy.env.workspace
nsowlnests_gdf = geopandas.read_file(ws, layer="northern_spotted_owl_nests")
# _gdf for GeoDataFrame

# After some geospatial processing performed on the Geopandas GeoDataFrame...

# The GeoDataFrame geometry format is incompatible with ArcPy,
# so convert it to WKB.
nsowlnests_gdf2 = nsowlnests_gdf.to_wkb()

# Create (from scratch) the schema for the Arrow Table. 
# The schema must adhere to Esri’s schema profile 
# for Apache Arrow Tables.

# You can grab the spatial reference from the original layer,
# ("northern_spotted_owl_nests").
sr = arcpy.Describe("northern_spotted_owl_nests").spatialReference.exportToString()

# To help with determining the Arrow data types, use nsowlnests_gdf2.dtypes 
# to view the existing DataFrame data types.
# The table below shows the mapping chosen for this table:
# ColumnName    Pandas dtype  ->  Arrow dtype
# ------------------------------------------
# NEST_ID              int64  ->  int64
# TREE_SPECIES        object  ->  string
# NEST_HEIGHT_M        int64  ->  uint8
# LAST_ACTIVE_YEAR     int64  ->  uint16
# NOTES               object  ->  string
# geometry            object  ->  binary
# You will have to decide the appropriate Arrow data types to map 
# to the Pandas data types.
fields = [
    pa.field("NEST_ID", pa.int64()),
    pa.field("TREE_SPECIES", pa.string()),
    pa.field("NEST_HEIGHT_M", pa.uint8()),
    pa.field("LAST_ACTIVE_YEAR", pa.uint16()),
    pa.field("NOTES", pa.string()),
    pa.field(
        "geometry",
        pa.binary(),
        metadata={b'esri.encoding': "WKB", b'esri.sr_wkt': sr}
    )
]
schema = pa.schema(fields)

retrieved_at = pa.Table.from_pandas(nsowlnests_gdf2, schema=schema)

# Now use `retrieved_at` in ArcPy.

Because you must create the ArcPy compatible schema for the Arrow Table from scratch, this workflow is quite a bit more involved than simply converting the GeoDataFrame to an SEDF. Consider it an example of moving geospatial data by brute force. In this case, a better alternative exists by first converting to SEDF, but other third-party analytics components may not offer such integrations, so an approach like this may come in handy.

Apache Spark

Apache Spark is a scalable distributed data processing and analytics engine. It can also be run locally, but the real benefit of using Spark comes from its ability to parallel-process large data distributed over clusters of computers.

The following example shows how you can leverage Arrow to move geospatial data between a Spark DataFrame and ArcGIS Pro.

# Running Spark requires setting up an environment with additional packages.
#
# Prior to starting, 
# Run the following commands from the Python Command Prompt:
#   conda create --clone arcgispro-py3 -n arcgispro-py3-spark --pinned
#   proswap arcgispro-py3-spark
#   conda install deep-learning-essentials
#   conda install openjdk
#
# Now you are ready to start a basic (local) Spark session.
from pyspark.sql import SparkSession  # Must have PySpark & Java, see above
import arcpy
import pandas as pd
import pyarrow as pa

# Start a SparkSession
spark = (
    SparkSession.builder
    .appName("Moving Data With Arrow")
    # Enable Arrow-based transfers. Spark 3.x prefers the newer key
    # "spark.sql.execution.arrow.pyspark.enabled".
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

# Convert a Featureclass to an Arrow Table with WKB encoded geometry
nsowlnests_at = arcpy.da.TableToArrowTable(
    "northern_spotted_owl_nests",
    geometry_encoding="WKB"
)

# Store the Arrow Table’s schema in `schema` for later, because
# it will not be preserved during the conversion to a Pandas DataFrame.
schema = nsowlnests_at.schema

# Define a data type mapping (to arrow data types) for Pandas to use.
# `pd.ArrowDtype` is the public name of the Arrow-backed extension dtype.
dtype_mapping = {
    pa.int8(): pd.ArrowDtype(pa.int8()),
    pa.int16(): pd.ArrowDtype(pa.int16()),
    pa.int32(): pd.ArrowDtype(pa.int32()),
    pa.int64(): pd.ArrowDtype(pa.int64()),
    pa.uint8(): pd.ArrowDtype(pa.uint8()),
    pa.uint16(): pd.ArrowDtype(pa.uint16()),
    pa.uint32(): pd.ArrowDtype(pa.uint32()),
    pa.uint64(): pd.ArrowDtype(pa.uint64()),
    pa.float32(): pd.ArrowDtype(pa.float32()),
    pa.float64(): pd.ArrowDtype(pa.float64()),
    pa.bool_(): pd.ArrowDtype(pa.bool_()),
    pa.binary(): pd.ArrowDtype(pa.binary()),
    pa.string(): pd.ArrowDtype(pa.string()),
}

# Convert the Arrow Table to a Pandas DataFrame using `dtype_mapping`.
nsowlnests_pdf = nsowlnests_at.to_pandas(types_mapper=dtype_mapping.get)

# Convert the Pandas DataFrame to a Spark DataFrame
nsowlnests_sdf = spark.createDataFrame(nsowlnests_pdf)  # _sdf for Spark DataFrame

# After some processing performed on the Spark DataFrame...

# Convert the Spark DataFrame back to a Pandas DataFrame
nsowlnests_pdf2 = nsowlnests_sdf.select("*").toPandas()

# Convert the Pandas DataFrame back to an Arrow Table
nsowlnests_at2 = pa.Table.from_pandas(nsowlnests_pdf2, schema=schema)

# Now use `nsowlnests_at2` in ArcPy.
arcpy.management.CopyFeatures(nsowlnests_at2, "TestPoint_Copy")

This is one of many ways to prepare your data for use in Spark. Engines typically have user-friendly interfaces for common data transport operations. To perform geospatial analysis on your Spark DataFrame, you will need to look to geospatial analytics engines. For example, Esri’s Geoanalytics Engine includes over 100 functions and tools that operate on Spark DataFrames to manage, enrich, summarize, or analyze entire geospatial datasets. However, discussing these engines in detail is beyond the scope of this blog.

Conclusion

Integrating with the Apache Arrow ecosystem opens the door for you to transport geospatial data from other participating components into ArcGIS Pro, and vice versa. The Apache Arrow story is still developing, and ArcGIS integration will grow alongside it. As the Arrow platform becomes more integrated in other components, it will continue removing barriers to allow you to leverage data from various open-source geospatial data and analytics components with the ArcGIS Pro platform.

Hannes Ziegler

Hannes is a product engineer on the Python team. He has five years of experience streamlining spatial data analysis workflows in the public and private sectors, and has been with Esri since 2019, where he focuses on the design, evaluation, and documentation of new and existing Python functionality.
