Aug 29, 2011

Python Multiprocessing – Approaches and Considerations

By shitijmehta

The multiprocessing Python module provides functionality for distributing work between multiple processes, taking advantage of multiple CPU cores and larger amounts of available system memory. When analyzing or working with large amounts of data in ArcGIS, there are scenarios where multiprocessing can improve performance and scalability. However, there are many cases where multiprocessing can negatively affect performance, and even some instances where it should not be used.

There are two approaches to using multiprocessing for improving performance or scalability:

Processing many individual datasets
Processing datasets with many features

The goal of this article is to share simple coding patterns for effectively performing multiprocessing for geoprocessing. The article will cover relevant considerations and limitations, which are important when attempting to implement multiprocessing.

1. Processing large numbers of datasets

The first example performs a specific operation on a large number of datasets in a workspace or set of workspaces. When there are many datasets, multiprocessing can help get the job done faster. The following code uses the multiprocessing module to define a projection, add a field, and calculate the field for a large list of shapefiles. It creates a pool of processes equal to the number of CPUs or CPU cores available, and that pool is then used to process the feature classes.

import os
import re
import multiprocessing
import arcpy

def update_shapefiles(shapefile):
    # Define the projection to wgs84; the factory code is 4326.
    arcpy.management.DefineProjection(shapefile, 4326)

    # Add a field named CITY of type TEXT.
    arcpy.management.AddField(shapefile, 'CITY', 'TEXT')

    # Calculate field 'CITY', stripping '_base' from the shapefile name.
    # Use the file name only, so the workspace path does not end up in the value.
    city_name = os.path.basename(shapefile).split('_base')[0]
    city_name = re.sub('_', ' ', city_name)
    arcpy.management.CalculateField(shapefile, 'CITY', '"{0}"'.format(city_name.upper()), 'PYTHON')

# End update_shapefiles

def main():
    # Create a pool class and run the jobs. The number of jobs is equal to the number of shapefiles.
    workspace = r'C:\GISData\USA\usa'
    arcpy.env.workspace = workspace
    fcs = arcpy.ListFeatureClasses('*')
    fc_list = [os.path.join(workspace, fc) for fc in fcs]

    pool = multiprocessing.Pool()
    pool.map(update_shapefiles, fc_list)

    # Synchronize the main process with the job processes to ensure proper cleanup.
    pool.close()
    pool.join()

# End main

if __name__ == '__main__':
    main()
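The city-name step in the example can be tried on its own, without arcpy. The following is a minimal sketch, assuming the name should come from the file name rather than the full path; the function name `city_from_shapefile` and the sample paths are hypothetical and are not part of the original script.

```python
import ntpath
import re


def city_from_shapefile(path):
    """Derive an upper-case city name from a shapefile path (hypothetical helper)."""
    # ntpath splits on Windows-style backslashes regardless of the host OS.
    file_name = ntpath.basename(path)
    # Keep the part before '_base', then turn underscores into spaces.
    city_name = file_name.split('_base')[0]
    city_name = re.sub('_', ' ', city_name)
    return city_name.upper()


# Hypothetical sample path, used only for illustration.
print(city_from_shapefile(r'C:\data\new_york_base.shp'))
```

Checking this kind of string handling in a single process first is usually easier than debugging it inside a pool of worker processes.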

2. Processing an individual dataset with a lot of features and records

This second example looks at geoprocessing tools analyzing an individual dataset with a lot of features and records. In this situation, we can benefit from multiprocessing by splitting data into groups to be processed simultaneously. For example, finding identical features may be faster when you split a large feature class into groups, based on spatial extents. The following code uses a pre-defined fishnet of polygons covering the extent of 1 million points (Figure 1).

Figure 1: A fishnet of polygons covering the extent of one million points.

import multiprocessing
import arcpy

def find_identical(oid):
    # Create a feature layer for the tile in the fishnet.
    tile = arcpy.management.MakeFeatureLayer(r'c:\testing\testing.gdb\fishnet', 'layer{0}'.format(oid[0]),
                                             """OID = {0}""".format(oid[0]))

    # Get the extent of the feature layer and set the extent environment.
    tile_row = arcpy.SearchCursor(tile)
    geometry = tile_row.next().shape
    arcpy.env.extent = geometry.extent

    # Execute Find Identical
    identical_table = arcpy.management.FindIdentical(r'c:\testing\testing.gdb\random1mil',
                                                     r'c:\cursortesting\identical{0}.dbf'.format(oid[0]), 'Shape')
    return identical_table.getOutput(0)

# End find_identical

def main():
    # Create a list of OIDs used to chunk the inputs
    fishnet_rows = arcpy.SearchCursor(r'c:\testing\testing.gdb\fishnet', '', '', 'OID')
    oids = [[row.getValue('OID')] for row in fishnet_rows]

    # Create a pool class and run the jobs. The number of jobs is equal to the length of the oids list.
    pool = multiprocessing.Pool()
    result_tables = pool.map(find_identical, oids)

    # Merge all the temporary output tables. This is optional; omitting it can increase performance.
    arcpy.management.Merge(result_tables, r'C:\cursortesting\ctesting.gdb\find_identical')

    # Synchronize the main process with the job processes to ensure proper cleanup.
    pool.close()
    pool.join()

# End main

if __name__ == '__main__':
    main()
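The example relies on a fishnet feature class that already exists. To illustrate the idea of splitting an overall extent into tiles, here is a minimal pure-Python sketch that computes the extents of a regular grid. The function name `tile_extents` and the grid dimensions are hypothetical; in the article's workflow the tiles come from the fishnet polygons, not from this calculation.

```python
def tile_extents(xmin, ymin, xmax, ymax, rows, cols):
    """Return (xmin, ymin, xmax, ymax) tuples for a rows x cols grid, row by row."""
    width = (xmax - xmin) / float(cols)
    height = (ymax - ymin) / float(rows)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append((xmin + c * width,
                          ymin + r * height,
                          xmin + (c + 1) * width,
                          ymin + (r + 1) * height))
    return tiles


# A 2 x 2 grid over a hypothetical 10 x 10 extent.
for tile in tile_extents(0, 0, 10, 10, 2, 2):
    print(tile)
```

Each tuple plays the role of one fishnet tile: one job per tile, with the extent environment restricting the tool to that tile.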

Some tools do not require the data to be split spatially. The Generate Near Table example below processes the data in groups of roughly 250,000 features, selected by object ID range.

import multiprocessing
import arcpy

def generate_near_table(oid_range):
    i = oid_range[0]
    j = oid_range[1]

    # Select the features whose object IDs fall within this range.
    lyr = arcpy.management.MakeFeatureLayer(r'c:\testing\testing.gdb\random1mil', 'layer{0}'.format(i),
                                            """OID >= {0} AND OID <= {1}""".format(i, j))

    gn_table = arcpy.analysis.GenerateNearTable(lyr, r'c:\testing\testing.gdb\random10000',
                                                r'c:\testing\out\near{0}.dbf'.format(i))
    return gn_table.getOutput(0)

# End generate_near_table function

def main():
    oid_ranges = [[0, 250000], [250001, 500000], [500001, 750000], [750001, 1000001]]
    arcpy.env.overwriteOutput = True

    # Create a pool class and run the jobs
    pool = multiprocessing.Pool()
    result_tables = pool.map(generate_near_table, oid_ranges)

    # Merging the resulting tables is optional. It can add overhead if not required.
    arcpy.management.Merge(result_tables, r'c:\cursortesting\ctesting.gdb\generate_near_table')

    # Synchronize the main process with the job processes to ensure proper cleanup.
    pool.close()
    pool.join()

# End main

if __name__ == '__main__':
    main()
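The object ID ranges in the example are hardcoded. As a minimal sketch, the ranges could instead be generated from the first and last object ID and a chunk size. The function name `make_oid_ranges` is hypothetical, and the sketch assumes the object IDs are contiguous integers; gaps in the IDs would make some chunks smaller than others.

```python
def make_oid_ranges(first_oid, last_oid, chunk_size):
    """Split [first_oid, last_oid] into inclusive [start, end] ranges of at most chunk_size IDs."""
    ranges = []
    start = first_oid
    while start <= last_oid:
        # The last range is clipped so it never runs past last_oid.
        end = min(start + chunk_size - 1, last_oid)
        ranges.append([start, end])
        start = end + 1
    return ranges


# One million object IDs in chunks of 250,000.
print(make_oid_ranges(1, 1000000, 250000))
```

The resulting list has the same shape as `oid_ranges` in the example, so it could be passed to `pool.map` in the same way.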

Considerations

Here are some important considerations before deciding to use multiprocessing:

The scenario demonstrated in the first example will not work with feature classes in a file geodatabase (FGDB), because each update must acquire a schema lock on the workspace. A schema lock effectively prevents any other process from simultaneously updating the FGDB. This example will work with shapefiles and ArcSDE geodatabase data.

Each process incurs a start-up cost of 1-3 seconds to load the arcpy library. Depending on the complexity and size of the data, this can cause the multiprocessing script to take longer to run than a script without multiprocessing. In many cases, the final step in the multiprocessing workflow is to aggregate all results together, which is an additional cost.
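A rough back-of-the-envelope model can show when the start-up cost outweighs the gain. The sketch below is only an estimate under stated assumptions: all jobs take the same time, the worker processes start concurrently so the start-up cost is counted once, and the default of 3 seconds is the upper end of the range quoted above. The function name `estimate_seconds` and its parameters are hypothetical.

```python
import math


def estimate_seconds(n_jobs, seconds_per_job, workers, startup=3.0, merge=0.0):
    """Return (serial, parallel) run-time estimates in seconds."""
    serial = n_jobs * seconds_per_job
    # Jobs run in rounds of at most `workers` jobs at a time.
    rounds = math.ceil(n_jobs / float(workers))
    parallel = startup + rounds * seconds_per_job + merge
    return serial, parallel


# Eight 2-second jobs on four workers: the pool is estimated to be faster.
print(estimate_seconds(8, 2.0, 4))
# Four half-second jobs on four workers: start-up cost dominates.
print(estimate_seconds(4, 0.5, 4))
```

Real timings depend on disk access, data size, and the merge step, so measuring the actual workflow remains the only reliable test.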

Determining whether multiprocessing is appropriate for your workflow is often a trial-and-error process. For a one-off operation, the time spent experimenting can cancel out the gains from multiprocessing; however, the experimentation can be very valuable if the final workflow will be run multiple times or applied to similar workflows using large data. For example, if you run the Find Identical tool on a weekly basis and it takes hours with your data, multiprocessing may be worth the effort.

Whenever possible, take advantage of the “in_memory” workspace for creating temporary data to improve performance. However, depending on the size of data being created in-memory, it may be necessary to write temporary data to disk. Temporary datasets cannot be created in a file geodatabase because of schema locking. Deleting the in-memory dataset when you are finished can prevent out of memory errors.

Summary

These are just a few examples showing how multiprocessing can be used to increase performance and scalability when doing geoprocessing. However, it is important to remember that multiprocessing does not always mean better performance.

The multiprocessing module was included in Python 2.6, and the examples above will work in ArcGIS 10.0. For more information about the multiprocessing module, refer to the Python documentation.

Please provide any feedback and comments to this blog posting, and stay tuned for another posting coming soon about “Being successful processing large complex data with the geoprocessing overlay tools”.

This post was contributed by Jason Pardy, a product engineer on the Analysis and Geoprocessing team.
