ArcGIS continues to grow as a geospatial data science platform, combining specialized geospatial data science tools with open-source components. As this network of components and the volume of data both keep growing, we need an efficient way to move data between them. Apache Arrow may help.
Introduction to Apache Arrow
Apache Arrow is a burgeoning, ambitious, open-source project by Wes McKinney and partners. For some time now it has been slowly finding its way into various popular data and analytics platforms. In short, Apache Arrow is an in-memory, columnar, cross-platform, cross-language, and open-source data representation that allows you to efficiently transfer data between components. It is intended to sit low in the stack:
[Apache Arrow] is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
In other words, unlike user-facing Pandas and Spark data frames, Apache Arrow’s data representation is intended to sit behind the scenes at a lower level, efficiently running the logistics regardless of platform or language.
One of the most powerful promises of Arrow is to serve as a sort of Esperanto (or common language) for data transport—a super-efficient, often zero-copy vehicle that can thread the interfaces between various platforms, including ArcGIS Pro.
In this blog, you’ll learn how to leverage Apache Arrow to improve your workflows across components like Pandas (including Spatially Enabled DataFrames and GeoPandas), Spark, Parquet, and ArcPy.
Leverage Apache Arrow in ArcPy
We added support for reading and writing Arrow Tables to ArcPy at ArcGIS Pro 2.9. With the release of ArcGIS Pro 3.2, we improved upon this feature by adding support for additional data types and geometry encodings. This allows you to connect your ArcGIS Pro workflows with other data and analytics platforms by transporting your geospatial data using Arrow Tables. As Apache Arrow grows in popularity and adoption, support for it will expand on other platforms. So, if you’re searching for an efficient path for bringing your geospatial data from other projects into ArcGIS (or vice versa), leveraging ArcPy’s integration with Arrow may, in some cases, offer the best solution.
Apache Arrow in Python
Apache Arrow’s interface for Python is provided by the PyArrow library.
Arrow Tables are a tabular data representation composed of columns, in which each column has a field name, data type, and the data itself (as well as optional metadata, more on this later).
Write an Arrow Table from ArcPy
To convert a Featureclass to an Arrow Table, you can use the
The geometry column in the resulting Arrow Table will be encoded in the EsriShape binary format. This format is efficient and lossless, but it is also incompatible with most other analytics platforms. When you need the exported geometry data to be compatible with another platform, you can choose a different geometry encoding with the optional geometry_encoding parameter, which supports the additional geometry encodings EsriJSON, GeoJSON, WKT, and WKB. These are publicly documented formats for representing geometries, which you can read about at the following sites:
- EsriJSON Specification
- GeoJSON Specification
- Well-known Text (WKT) and Well-known Binary (WKB) Specification
Most geospatial analytics and data platforms will support reading or writing at least one of these formats.
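To make the difference between these encodings concrete, here is a small sketch that encodes the same 2D point as WKT, GeoJSON, and WKB using only the Python standard library. The helper names and coordinates are hypothetical and exist only for this illustration; real workflows would rely on a geometry library or on ArcPy itself.

```python
import json
import struct


def point_to_wkt(x, y):
    """Encode a 2D point as Well-known Text."""
    return f"POINT ({x} {y})"


def point_to_geojson(x, y):
    """Encode a 2D point as a GeoJSON geometry string."""
    return json.dumps({"type": "Point", "coordinates": [x, y]})


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian Well-known Binary."""
    # 1 byte: byte order flag (1 = little-endian)
    # 4 bytes: geometry type (1 = Point)
    # 16 bytes: x and y as 64-bit floats
    return struct.pack("<BIdd", 1, 1, x, y)


x, y = -123.25, 45.5  # made-up coordinates
print(point_to_wkt(x, y))
print(point_to_geojson(x, y))
print(point_to_wkb(x, y).hex())
```

WKT and GeoJSON produce text, while WKB produces raw bytes, which is why the choice of encoding also affects the data type of the geometry column.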
Read an Arrow Table to ArcPy
To read an Arrow Table into ArcGIS Pro, pass it to arcpy.management.CopyRows (for Tables) or arcpy.management.CopyFeatures (for Featureclasses). In fact, geoprocessing tools accept Arrow Tables as a data source, so you can directly use an Arrow Table as input to a geoprocessing tool.
ArcPy can read Arrow Tables with all the same geometry encodings it can write (EsriShape, EsriJSON, GeoJSON, WKT, and WKB). However, when the Arrow Table did not originate from ArcPy, you may need to do some additional prep work to ensure the table can be successfully read by ArcPy. You’ll learn about that in the next section.
Interoperability with other analytics components
ArcPy uses metadata keys embedded with the Arrow Table columns to determine how to interpret the data. The metadata is stored as part of a table’s schema. When using Arrow Tables as a vehicle for moving data between different geospatial data and analytics platforms, it is important to understand the schema specification for Apache Arrow Tables that ArcPy supports. You can view an Arrow Table’s schema using its schema attribute.
<Arrow Table Object>.schema
For the Arrow Table from the previous example, 'nsowlnests_at', which contains the columns OID, Shape, NEST_ID, TREE_SPECIES, NEST_HEIGHT_M, LAST_ACTIVE_YEAR, and NOTES, the schema looks like this:
Note the metadata attached to the Shape field. The esri.sr_wkt key defines the coordinate system of the geometry stored in this column using the well-known text representation of coordinate reference systems (WKT CRS). The esri.encoding key specifies the geometry encoding of the data, in this case EsriShape. The data type of the Shape field is binary. Note that different geometry encodings may require different field data types. For example, if the Shape field held GeoJSON-encoded geometry instead, it would need to be of data type string.
You can find additional information about the required schema and the mappings and metadata for the supported field data types in the Type conversions section of the Apache Arrow in ArcGIS documentation.
The schema profile for Apache Arrow Tables supported by ArcPy is not the only Arrow Table schema profile for geospatial data; there is also the GeoArrow specification. ArcPy supports reading Arrow Tables with a GeoArrow schema, but it will not create Arrow Tables with the GeoArrow schema.
Some platforms may not preserve an Arrow Table’s original schema or produce an Arrow Table with a schema ArcPy understands. In cases where they don’t, you will need to reconstruct the schema either from scratch or using the original schema.
While Apache Arrow is an efficient but temporary in-memory data structure for fast operations, Apache Parquet is an on-disk data structure for space-efficient long-term storage. In short, Apache Arrow is for processing and moving data, and Apache Parquet is for storing it. The two formats are designed to work well together, which means that the schema will be preserved when writing an Arrow Table to a Parquet file for long-term storage.
Here’s how you can move geospatial data between Parquet files and ArcGIS Pro using Apache Arrow:
The Pandas DataFrame is a table-like in-memory data structure with an interface for data analysis. The Pandas team plans to completely back Pandas with Apache Arrow (instead of NumPy) when Pandas 3.0 is released. With the recently released Pandas 2.0, backing a DataFrame with Apache Arrow is optional. ArcGIS Pro 3.2 ships with Pandas version 2.0.2, so you can try this out yourself.
In the following example, we will use Arrow to move geospatial data between a Pandas DataFrame and ArcGIS Pro, and leverage the new Arrow-backed data types in Pandas:
In testing, the from_pandas operation sees a significant performance boost of roughly 40 percent from the Pandas DataFrame being backed with Arrow data types rather than NumPy, but your mileage may vary. Pandas 3.0 is expected to standardize this once it is released.
While moving data between ArcGIS and Pandas can be useful, Pandas has no inherent geospatial data processing and analysis capabilities. For this, you will need to look to the ArcGIS API Spatially Enabled DataFrame in the next section.
Spatially Enabled DataFrames
The ArcGIS API for Python’s Spatially Enabled DataFrame (SEDF) is built on top of Pandas. Essentially, it extends the Pandas DataFrame with geospatial capabilities, with interoperability between SEDF and ArcPy. An SEDF can be created from a Featureclass using the ArcGIS API, and ArcPy can directly read the SEDF format as input to geoprocessing tools, so you don’t necessarily need to use Arrow. However, you can use Arrow in this transaction as well. In testing, converting the SEDF to an Arrow Table first and then passing the Arrow Table to ArcPy instead of the SEDF gave roughly a 14 percent performance boost (again, your mileage may vary).
The code below shows how you can leverage Arrow to move geospatial data between the ArcGIS API SEDF and ArcGIS Pro to gain a slight performance boost:
Note that the Arrow Table that results from the spatial.to_arrow method adheres to the GeoArrow specification instead of Esri’s schema profile for Apache Arrow Tables.
The GeoPandas GeoDataFrame is also built on top of Pandas and, like SEDF, extends the Pandas DataFrame with geospatial capabilities. You can convert a Featureclass to a GeoDataFrame using geopandas.read_file. However, converting a GeoDataFrame to a Featureclass is not directly supported. You can go one of two routes here: either convert the GeoDataFrame to an SEDF using pd.DataFrame.spatial.from_geodataframe, or leverage Arrow.
In this example, we will use Arrow to move geospatial data between a GeoPandas GeoDataFrame and ArcGIS Pro:
Because you must create the ArcPy-compatible schema for the Arrow Table from scratch, this workflow is quite a bit more involved than simply converting the GeoDataFrame to an SEDF. Consider it an example of moving geospatial data by brute force. In this case, converting to an SEDF first is the better alternative, but other third-party analytics components may not offer such integrations, so an approach like this may come in handy.
Apache Spark is a scalable distributed data processing and analytics engine. It can also be run locally, but the real benefit of using Spark comes from its ability to parallel-process large data distributed over clusters of computers.
The following example shows how you can leverage Arrow to move geospatial data between a Spark DataFrame and ArcGIS Pro.
This is one of many ways to prepare your data for use in Spark. Engines typically have user-friendly interfaces for common data transport operations. To perform geospatial analysis on your Spark DataFrame, you will need to look to geospatial analytics engines. For example, Esri’s GeoAnalytics Engine includes over 100 functions and tools that operate on Spark DataFrames to manage, enrich, summarize, or analyze entire geospatial datasets. However, discussing these engines in detail is beyond the scope of this blog.
Integrating with the Apache Arrow ecosystem opens the door for you to transport geospatial data from other participating components into ArcGIS Pro, and vice versa. The Apache Arrow story is still developing, and ArcGIS integration will grow alongside it. As the Arrow platform becomes more integrated in other components, it will continue removing barriers to allow you to leverage data from various open-source geospatial data and analytics components with the ArcGIS Pro platform.