Using Arcpy with multiprocessing – Part 3

The following post builds upon the script and methods developed in Part 1 and Part 2. ESRI also has a blog post describing the use of Python multiprocessing with Arcpy; it did not work for me in a complex situation involving Network Analyst, but may work in simpler applications.

The Arcpy script refers to two GDBs, which I have zipped into an archive; it can be downloaded from the right-hand side of this page.

Now that the most calculation-intensive part of the process has been made into a parallelisable module, we can actually implement multiprocessing. Here I will describe basic use cases for two multiprocessing libraries: the inbuilt Python Multiprocessing, and Parallel Python. I include both because neither seems to work in all situations; Parallel Python has caused problems when ArcGIS extensions are checked out, and I cannot get Multiprocessing to work when geocoding addresses. If you have odd problems with one library, I would suggest trying the other.

Both libraries can use a Python list to which jobs are appended, and from which the results are extracted. Each job is processed in turn by the first available worker (CPU/core). So, if there are six jobs in the list on a system with two CPUs, the first two jobs start near simultaneously; then, as each CPU finishes its task and becomes free, it starts working on the next job in the list. This is known as asynchronous processing, as the order and timing of completion are unimportant. Both methods can use all available CPUs on the system, they can be limited to a predefined number (see below), or they may be limited by the task itself – in this example, as there are only six polygon types, at most six CPUs would be used.
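The job-list pattern can be sketched without arcpy at all, using a toy square() worker (a stand-in for illustration, not part of the script below); the apply_async/get structure is the same one used in the Multiprocessing branch of the full script:

```python
import multiprocessing

def square(x):
    # stand-in for the expensive per-polygon-type arcpy work
    return {x: x * x}

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)  # limit the pool to two workers
    jobs = [pool.apply_async(square, (n,)) for n in range(6)]  # queue six jobs
    results = {}
    for job in jobs:  # get() blocks until that particular job has finished
        results.update(job.get())
    pool.close()
    pool.join()
    print(results)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
```

With two workers and six jobs, the first two jobs start immediately and the remaining four are picked up as workers become free, exactly as described above.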

'''
Step 4 of 4: Example of using Multiprocessing/Parallel Python with Arcpy...

Can be run either:
 1. from the command line/a Python IDE (adjust paths to feature classes, as necessary)
 2. as a Script tool within ArcGIS (ensure 'Run Python script in Process' is NOT checked when importing)

 The Parallel Python library must be installed before it can be used.
'''

import arcpy
import multiprocessing
import time
try:
	import pp
	forceMP = False
except ImportError:
	forceMP = True

def performCalculation(points_fC, polygons_fC, searchDist, typeList, calcPlatform_input=None):
	'''performCalculation(pointsFeatureClass, polygonsFeatureClass, searchDistance, typeList, calcPlatform)

	All inputs are specific.

	calcPlatform is either 'mp', to use the inbuilt Multiprocessing library, or 'pp', to use Parallel Python (if it exists, otherwise defaults to mp)

	'''

	# ---------------------------------------------------------------------------
	## Pre-calculation checks
	# ---------------------------------------------------------------------------

	# Check calculation platform is valid, or resort to default...
	defaultCalcTpye = 'mp'
	calcTypeExplainDict = {'pp':'Parallel Python', 'mp':'Multiprocessing'}
	if forceMP: # unable to import Parallel Python library (it has to be installed separately)
		arcpy.AddMessage("   WARNING: Cannot load Parallel Python library, forcing Multiprocessing library...")
		calcPlatform = 'mp'
	elif (calcPlatform_input not in ['mp', 'pp']) or (calcPlatform_input == None): # Invalid/no input, resort to default
		arcpy.AddMessage("   WARNING: Input calculation platform '%s' invalid; should be either 'mp' or 'pp'. Defaulting to %s..." % (calcPlatform_input, calcTypeExplainDict[defaultCalcTpye]))
		calcPlatform = defaultCalcTpye
	else:
		calcPlatform = calcPlatform_input

	# ---------------------------------------------------------------------------
	## Data extraction (parallel execution)
	# ---------------------------------------------------------------------------

	searchDist = int(searchDist) # convert search distance to integer...

	# check all datasets are OK; if not, provide some useful output and terminate
	valid_points = arcpy.Exists(points_fC)
	arcpy.AddMessage("Points Feature Class: "+points_fC)
	valid_polygons = arcpy.Exists(polygons_fC)
	arcpy.AddMessage("Polygons Feature Class: "+polygons_fC)
	dataCheck  = valid_points & valid_polygons

	if not dataCheck:
		if not valid_points:
			arcpy.AddError("Points database or feature class, %s,  is invalid..." % (points_fC))
		if not valid_polygons:
			arcpy.AddError("Polygons database or feature class, %s,  is invalid..." % (polygons_fC))

	else: # Inputs are OK, start calculation...

		for type in typeList: # add fields to Points file
			arcpy.AddField_management(points_fC, "polygon_type%s_Sum" % type, "DOUBLE")
			arcpy.CalculateField_management(points_fC, "polygon_type%s_Sum" % type, 0, "PYTHON")
			arcpy.AddField_management(points_fC, "polygon_type%s_Count" % type, "DOUBLE")
			arcpy.CalculateField_management(points_fC, "polygon_type%s_Count" % type, 0, "PYTHON")
			arcpy.AddMessage("    Added polygon_type%s_Sum and polygon_type%s_Count fields to Points." % (type,type))

		pointsDataDict = {} # dictionary: pointsDataDict[pointID][Type]=[sum of Polygon type weightings, count of Polygons of type]
		jobs = [] # jobs are added to the list, then processed in parallel by the available workers (CPUs)

		if calcPlatform == 'mp':
			arcpy.AddMessage("    Utilising Python Multiprocessing library:")
			pool = multiprocessing.Pool() # initiate the processing pool - use all available resources
#			pool = multiprocessing.Pool(2) # Example: limit the processing pool to 2 CPUs...
			for type in typeList: # For this specific example
				arcpy.AddMessage("      Passing %s to processing pool..." % type)
				jobs.append(pool.apply_async(findPolygons, (points_fC, polygons_fC, type, searchDist))) # send jobs to be processed

			for job in jobs: # collect results (get() waits until that job's processing is complete)
				result = job.get()
				if len(pointsDataDict) == 0: # first output becomes the new dictionary
					pointsDataDict.update(result)
				else:
					for key in result: # later outputs should be added per key...
						pointsDataDict[key].update(result[key])
			del jobs

		elif calcPlatform == 'pp':
			ppservers=()
			job_server = pp.Server(ppservers=ppservers) # initiate the job server - use all available resources
#			job_server = pp.Server(2, ppservers=ppservers) # Example: limit the processing pool to 2 CPUs...
			arcpy.AddMessage("    Utilising Parallel Python library:")
			for type in typeList: # For this specific example, at most six processes would be initiated anyway...
				arcpy.AddMessage("      Passing %s to processing pool..." % type)
				jobs.append(job_server.submit(findPolygons, (points_fC, polygons_fC, type, searchDist), (), ("arcpy",))) # send jobs to be processed

			for job in jobs: # collect results from the job server (waits until all the processing is complete)
				result = job()
				if len(pointsDataDict) == 0: # first output becomes the new dictionary
					pointsDataDict.update(result)
				else:
					for key in result: # later outputs should be added per key...
						pointsDataDict[key].update(result[key])
			del jobs

		# ---------------------------------------------------------------------------
		## Writing data back to points file
		# ---------------------------------------------------------------------------

		arcpy.AddMessage("    Values extracted; writing results to Points...")
		pointsRowList = arcpy.UpdateCursor(points_fC)
		for pointsRow in pointsRowList: # write the values for every point
			pointID = pointsRow.getValue("PointID")
			for type in typeList:
				pointsRow.setValue("polygon_type%s_Sum" % type, pointsRow.getValue("polygon_type%s_Sum" % type) + pointsDataDict[pointID][type][0])
				pointsRow.setValue("polygon_type%s_Count" % type, pointsRow.getValue("polygon_type%s_Count" % type) + pointsDataDict[pointID][type][1])

			pointsRowList.updateRow(pointsRow)
			del pointsRow

		del pointsRowList
		# just make sure any locks are cleared...
		del calcPlatform, calcPlatform_input, calcTypeExplainDict, dataCheck, defaultCalcTpye, job, key, pointID, pointsDataDict, points_fC, polygons_fC, type, valid_points, valid_polygons, searchDist, typeList

def findPolygons(points_FC, polygons_FC, Type, search_dist):
	funcTempDict = {}
	arcpy.MakeFeatureLayer_management(polygons_FC, '%s_%s' % (polygons_FC, Type))
	arcpy.MakeFeatureLayer_management(points_FC, '%s_%s' % (points_FC, Type))
	pointsRowList = arcpy.SearchCursor('%s_%s' % (points_FC, Type))
	for pointRow in pointsRowList: # for every origin
		pointID = pointRow.getValue("PointID")

		try:
			funcTempDict[pointID][Type] = [0,0]
		except KeyError: # first time within row
			funcTempDict[pointID] = {}
			funcTempDict[pointID][Type] = [0,0]

		arcpy.SelectLayerByAttribute_management('%s_%s' % (points_FC, Type), 'NEW_SELECTION', '"PointID" = \'%s\'' % pointID)
		arcpy.SelectLayerByLocation_management('%s_%s' % (polygons_FC, Type), 'INTERSECT', '%s_%s' % (points_FC, Type), search_dist, 'NEW_SELECTION')
		arcpy.SelectLayerByAttribute_management('%s_%s' % (polygons_FC, Type), 'SUBSET_SELECTION', '"polyType" = %s' % Type)
		polygonsRowList = arcpy.SearchCursor('%s_%s' % (polygons_FC, Type))
		for polygonsRow in polygonsRowList:
			funcTempDict[pointID][Type][0] += polygonsRow.getValue("polyWeighting")
			funcTempDict[pointID][Type][1] += 1

	return funcTempDict

if __name__ == '__main__':
	# Read the parameter values:
	#  1: points feature class (string to location, e.g. c:\GIS\Data\points.gdb\points01)
	#  2: polygons feature class (string to location)
	#  3: search distance for polygons, integer, e.g. 500
	#  4: calculation type ('mp' to use Multiprocessing library, 'pp' to use Parallel Python library (if available, otherwise defaults to mp))
	pointsFC = arcpy.GetParameterAsText(0) # required
	polygonsFC = arcpy.GetParameterAsText(1) # required
	search_Distance = arcpy.GetParameterAsText(2) # required
	calcType = arcpy.GetParameterAsText(3) # optional (will default to MP)

	t_start = time.clock()

	type_list = [1,2,3,4,5,6] # polygon types to search for...

	# run calculation
	if '' in [pointsFC, polygonsFC, search_Distance]:# if not running from Arc, the input parameters all come out as ''
		pointsFC = "c:\\Work\\GIS\\Data\\Points.gdb\\points01"
		polygonsFC = "c:\\Work\\GIS\\Data\\Polygons.gdb\\polygons01"
		search_Distance = 1000
		performCalculation(pointsFC, polygonsFC, search_Distance, type_list)
	else:
		performCalculation(pointsFC, polygonsFC, search_Distance, type_list, calcType)

	arcpy.AddMessage("      ... complete in %s seconds." % (time.clock() - t_start))

When importing the script as an Arc script tool, ‘Run Python script in Process’ must be unchecked. If you don’t do this, the script will try to open new instances of ArcMap/ArcCatalog and fail.

Execution times:
On my computer, the scripts posted in Part 2 take about 300 seconds to execute from the command line, and about 400 seconds as a tool within ArcMap or ArcCatalog. The script above takes about 170 seconds from the command line, and 340 seconds as a tool within ArcMap or ArcCatalog. The first set of processes through the queue (i.e. two if two CPUs are being used, four if four CPUs are being used) is delayed, but later processes run much faster. This startup cost is called overhead, and it often means that parallelised operations will not run as fast as expected (i.e. going from one process to four processes does not speed up the total execution time by a factor of four).

Feel free to ask questions or point out improvements regarding any of these posts. The following is a list of tips/issues I have noticed with arcpy scripting, and multiprocessed applications in particular (note: multiprocessing refers to the general concept; Multiprocessing refers to the inbuilt Python library).

Notes on using multiprocessing:

  1. Creating and debugging complex Arcpy scripts is a real pain in the ass, and it is even worse when you are using multiprocessing. I wouldn’t recommend it unless it is really necessary and you will be able to reuse the code extensively in the future. Errors are a little harder to debug, as error message output isn’t produced until the job result is accessed.
  2. Python Multiprocessing vs. Parallel Python: there doesn’t seem to be much speed difference between the two, however:
    1. Parallel Python requires an extra package to be installed (Multiprocessing is included with Python)
    2. Parallel Python will fail if you use arcpy.AddMessage() or arcpy.AddError() within the parallelised function
    3. Parallel Python has issues if you are using tools that must be ‘Checked Out’ before execution (such as Network Analyst)
    4. I have not been able to get Multiprocessing to work with Geocoding
  3. arcpy.Exists() is essential – it should be used to check every input dataset. Not just because it catches input errors: without it, Arc will regularly fail to do basic operations for no apparent reason. It may be that checking for existence forces a file refresh…?
  4. Try…Except – I have found that these statements, which ESRI seem to like using in the Arcpy help/tutorials, will often cause the script to hang indefinitely when run from Arc. The best approach is to test rigorously in Python (passing the paths as strings and printing them), where you get useful output, then try the script out in Arc. If it ran fine directly but has problems at this stage, use print to check that the paths coming out of Arc are what you were actually expecting.
  5. Cancelling a multiprocessing script running in Arc does not kill all the Python processes – unlike hitting Ctrl-C from the command line. Use task manager to end them.
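Point 1 can be demonstrated with a toy worker (hypothetical, no arcpy involved): an exception raised inside a child process stays hidden inside the AsyncResult and only re-raises when the result is accessed with get().

```python
import multiprocessing

def worker(x):
    if x == 2:
        raise ValueError('bad input: %s' % x)  # raised in the child process
    return x * 10

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    jobs = [pool.apply_async(worker, (n,)) for n in range(3)]
    # no error has surfaced yet -- it is held inside each AsyncResult
    for job in jobs:
        try:
            print(job.get())  # a worker's ValueError re-raises here, not earlier
        except ValueError as err:
            print('job failed: %s' % err)
    pool.close()
    pool.join()
```

This is why a multiprocessed arcpy script can appear to run cleanly for minutes before any error message appears: the traceback is deferred until the results loop.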


Parallelised Composite Geocoding script

The code below is an implementation of a Composite Address Locator for geocoding that has multiprocessor parallelisation capabilities and can be resumed if interrupted. I initially developed it because Composite Locators run from Arcpy scripts would crash in ArcGIS 10.1. Documentation of the inputs is provided in the docstring of the parallelCompositeGeocode module. Inputs are modified within the if __name__ == '__main__': indented block at the bottom of the script.

Notes:

  • the script will not modify or destroy inputs, but it will overwrite output locations and delete some of its own temporary files
  • Parallel Python is required if setting the number of processes higher than one (i.e. parallelising/using multiple CPUs)
  • if parallelising, the number of processes should ideally be set equal to or slightly less than the number of logical CPU cores (a CPU with two physical cores and HyperThreading has four logical cores)
  • the script makes use of the Data Access module, and consequently will only work with ArcGIS 10.1
  • the script can accept any number of input address fields, but these fields must correspond to fields accepted by the address locators
  • all address locators must accept the same input fields
  • there is no type or existence checking on the inputs, so any errors will raise an Exception
  • a WindowsError: [Error 32] exception indicates that a temporary or output GDB is open in ArcMap or ArcCatalog; it may be necessary to exit these programs and use Task Manager to kill any rogue Arc processes, including AppROT.exe
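For the note on logical cores, a starting value for the number of processes can be derived from the machine itself: multiprocessing.cpu_count() returns the number of logical cores (HyperThreading included). The helper name default_process_count below is hypothetical, a minimal sketch of one way to pick nProcessesDefault:

```python
import multiprocessing

def default_process_count(reserve=1):
    # leave `reserve` logical cores free for the OS / the ArcGIS session,
    # but never go below one process
    return max(1, multiprocessing.cpu_count() - reserve)

print(default_process_count())
```

On a CPU with two physical cores and HyperThreading this would report three (four logical cores, one reserved).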

Please leave a comment if you find this script useful, if you have any issues with it (it has not been thoroughly tested), or you think the documentation could be improved.

import arcpy
import time
import os
import shutil

def parallelCompositeGeocode(inputGDBpath, inputFCname, inputAddressFields, addressLocators, locatorFields, completedLocators, outputFolder, outputGDBname, outputFCname, outputType, outputSRS, tempFolder, nProcessesTxtPath, nProcessesDefault, numSplits):

	'''
	parallelCompositeGeocode(inputGDBpath, inputFCname, inputAddressFields, addressLocators, locatorFields, completedLocators, outputFolder, outputGDBname, outputFCname, outputType, outputSRS, tempFolder, nProcessesTxtPath, nProcessesDefault, numSplits)

	Inputs:
		- inputGDBpath <str>
			Full or relative path to the input GDB, includes the GDB name.
			Example: 'C:\\test\\Addresses.gdb'

		- inputFCname <str>
			Name of the input addresses Feature Class.
			Example: 'addresses'

		- inputAddressFields <tuple> of field names <str>
			Field names of address fields in the input feature class.
			!! These fields MUST correspond to the fields accepted by the locators.
			Example: ('streetAddress', 'locality', 'town')

		- addressLocators <tuple> of <tuple> of <str>
			Chained address locators, will be run in the order provided.
			The first part of sub-tuple specifies path to locator, second part defines file name of the locator.
			Example: (('C:\\locators', 'locator1'), ('C:\\locators', 'locator2'))

		- locatorFields <tuple> of field names <str>
			Fields used by the address locators. Can be adjusted in the locator properties under Input Address Fields.
			!! These MUST correspond to (same number of, and referring to the same information as) the inputAddressFields input.
			!! All locators MUST accept at least this set of fields.
			Example: ('Address', 'Suburb', 'Zone')

		- completedLocators <tuple>
			Names of completed locators that should not be re-run (useful if adding an additional locator or previous process was interrupted).
			Example (empty): ()
			Example: ('locator1',)

		- outputFolder <str>
			Full or relative path of folder for output GDB.
			!! If this folder does not exist it will be created.
			Example: 'C:\\Data\\geocodedAddresses'

		- outputGDBname <str>
			Name of GDB for output.
			!! If this GDB already exists (in the outputFolder) it will be deleted.
			Example: 'GeocodedAddresses.gdb'

		- outputFCname <str>
			Name of the output addresses Feature Class.
			Example: 'geocodedAddresses'

		- outputType <str>; one of 'TABLE' or 'POINTS'
			Type of output: 'TABLE' writes x,y coordinate fields to a table; 'POINTS' creates point features instead.
			Example: 'TABLE'

		- outputSRS <arcpy Spatial Reference>
			Required if output type is set to POINTS, otherwise can be blank or None
			Example: arcpy.SpatialReference(2193)

		- tempFolder <str>
			Full or relative path to a temporary folder where some GDBs are stored and written to/from.
			Best placed on an SSD if available. This folder can be manually deleted after use.
			!! If this folder does not exist it will be created.
			Example: 'D:\\Temp\\geocodingTemp'

		- nProcessesTxtPath <str>
			Full or relative path to a text file containing only a single integer value representing the desired number of processes to be used.
			This allows the number of processes being used to be changed while the program is running.
			!! Updates to this file are only propagated when the next locator is initialised.
			Example: '_nCPUs.txt'

		- nProcessesDefault <int>
			Number of processes to use initially, will be used if the text file specified by nProcessesTxtPath does not exist or cannot be read.
			Example: 4

		- numSplits <int>
			Number of parts to split the input addresses into for geocoding; the value should be a multiple of the maximum number of desired processes.
			Increasing the number of splits will decrease maximum memory consumption for the same set of addresses.
			Example: 16


	Notes:
		No existence or type checking is carried out
		Requires ArcGIS 10.1

	Author: StacyR

http://pythongisandstuff.wordpress.com/

	Version: 310713
    '''

	nProcesses = nProcessesDefault

	infoDict = {'nInputRows': 0, 'nOutputRows': 0}

	t0 = time.time()

	inputFC = os.path.join(inputGDBpath, inputFCname)

	# paths/names for temp GDBs, FCs and layers
	address_tableName = 'address'
	addressGeocoded_tableName = 'addresses_geocoded'
	tempAddresses_layerName = 'addresses_temp'
	splitGDBNames = ['Split_%s.gdb' % i for i in range(numSplits)]
	masterAddresses_gdbName = 'masterAddresses.gdb'
	masterAddresses_gdbPath = os.path.join(tempFolder, masterAddresses_gdbName)
	masterAddress_table = os.path.join(masterAddresses_gdbPath, address_tableName)

	# output path and FC name
	outputGDB = os.path.join(outputFolder, outputGDBname)
	outputFC = os.path.join(outputGDB, outputFCname)

	# create tempFolder if it does not currently exist
	if not os.path.exists(tempFolder):
		os.makedirs(tempFolder)

	# create outputFolder if it does not currently exist
	if not os.path.exists(outputFolder):
		os.makedirs(outputFolder)

	if len(completedLocators) == 0:
		# delete/create master address GDB, add fields
		# delete output files for writing
		if os.path.exists(masterAddresses_gdbPath):
			shutil.rmtree(masterAddresses_gdbPath)
			arcpy.AddMessage('  Deleted %s.' % masterAddresses_gdbPath)
		arcpy.CreateFileGDB_management(tempFolder, masterAddresses_gdbName, "CURRENT")
		arcpy.AddMessage('    Created GDB: %s' % masterAddresses_gdbPath)
		arcpy.CreateTable_management(masterAddresses_gdbPath, address_tableName)
		for field in locatorFields:
			arcpy.AddField_management(masterAddress_table, field, 'TEXT')
		arcpy.AddField_management(masterAddress_table, 'matchAddress', 'TEXT')
		arcpy.AddField_management(masterAddress_table, 'x', 'FLOAT')
		arcpy.AddField_management(masterAddress_table, 'y', 'FLOAT')
		arcpy.AddField_management(masterAddress_table, 'locatorNum', 'SHORT')

		arcpy.AddMessage('  Master address table created and fields added in %s.' % timeDisp(time.time() - t0))
		t1 = time.time()
		arcpy.AddMessage('  Extracting data from input address database (%s), adding to master table.' % inputFC)

		inputFields = [field for field in inputAddressFields]

		with arcpy.da.SearchCursor(inputFC, inputFields) as sC:

			masterAddressFields = [field for field in locatorFields]
			masterAddressFields.append('locatorNum')

			with arcpy.da.InsertCursor(masterAddress_table, masterAddressFields) as iC:
				for values in sC:
					listValues = list(values)
					listValues.append(0)
					iC.insertRow(listValues)
					infoDict['nInputRows'] += 1

		arcpy.AddMessage('    %i addresses extracted and added to master table in %s.' % (infoDict['nInputRows'], timeDisp(time.time() - t1)))

	else:
		# overwrite locatorNum in master table (for locators that are not defined as completed) with value 0
		## this means that some addresses previously geocoded may be run again (if their locator is listed as completed)
		with arcpy.da.SearchCursor(masterAddress_table, ('OID@', )) as sC:
			for row in sC:
				infoDict['nInputRows'] += 1
		arcpy.AddMessage('    %i input addresses' % infoDict['nInputRows'])

		for addressLocator in addressLocators:
			if addressLocator[1] not in completedLocators:
				locatorNum = addressLocators.index(addressLocator) + 1

				with arcpy.da.UpdateCursor(masterAddress_table, 'locatorNum', '"locatorNum" = %i' % locatorNum) as uC:
					for row in uC:
						row[0] = 0
						uC.updateRow(row)
		arcpy.AddMessage('    Partially completed locators set to run again in %s.' % timeDisp(time.time() - t0))

	locatorNum = 1
	t2 = time.time()

	for addressLocator in addressLocators:

		if addressLocator[1] not in completedLocators:

			tG0 = time.time()

			arcpy.AddMessage('  Starting geocode with address locator (%i of %i): %s' % (locatorNum, len(addressLocators), os.path.join(*addressLocator)))

			# make view of the non-geocoded addresses from Master table to Temp table
			arcpy.MakeTableView_management(masterAddress_table, tempAddresses_layerName, ' "locatorNum" = 0')

			arcpy.AddField_management(tempAddresses_layerName, 'SplitID', 'LONG')
			with arcpy.da.UpdateCursor(tempAddresses_layerName, ('SplitID', )) as uC:
				splitID = 0
				for row in uC:
					row[0] = splitID
					uC.updateRow(row)
					splitID += 1
					if splitID == numSplits:
						splitID = 0
				del splitID

			# copy splits to split GDBs
			for i in range(numSplits):
				splitGDB_path = os.path.join(tempFolder, splitGDBNames[i])
				if os.path.exists(splitGDB_path):
					shutil.rmtree(splitGDB_path)
				arcpy.CreateFileGDB_management(tempFolder, splitGDBNames[i], "CURRENT")
				splitOutTablePath = os.path.join(splitGDB_path, address_tableName)
				arcpy.SelectLayerByAttribute_management(tempAddresses_layerName, "NEW_SELECTION", ' "SplitID" = %s' % i)
				arcpy.CopyRows_management(tempAddresses_layerName, splitOutTablePath)
			arcpy.Delete_management(tempAddresses_layerName)
			arcpy.AddMessage('    %i split GDBs created and data added in %s' % (numSplits, timeDisp(time.time() - tG0)))

			tG1 = time.time()

			nProcesses, nProcInfo = updProcesses(nProcesses, nProcessesTxtPath)
			if len(nProcInfo) > 0: arcpy.AddMessage('  ' + nProcInfo)
			# set up PP
			if nProcesses > 1:
				import pp
				jobs = []
				job_server = pp.Server(nProcesses, ppservers=())
				arcpy.AddMessage('    Job server started with %i processes.' % nProcesses)

			resultsDict = {}

			# run geocoding
			for i in range(numSplits):
				inputAddresses = os.path.join(tempFolder, splitGDBNames[i], address_tableName)
				outputAddresses = os.path.join(tempFolder, splitGDBNames[i], addressGeocoded_tableName)
				tempSplitAddresses_layerName = 'addresses_temp_%i' % i

				arcpy.MakeTableView_management(inputAddresses, tempSplitAddresses_layerName)
				inputAddresses_count = int(arcpy.GetCount_management(tempSplitAddresses_layerName).getOutput(0))
				arcpy.Delete_management(tempSplitAddresses_layerName)

				if nProcesses > 1:
					arcpy.AddMessage('      Submitting %i addresses in %s to job server' % (inputAddresses_count, inputAddresses))
					jobs.append(job_server.submit(geocodeAddresses, (inputAddresses, inputAddresses_count, outputAddresses, os.path.join(*addressLocator)), (), ('arcpy', 'time')))
				else:
					arcpy.AddMessage('      Geocoding %i addresses in %s' % (inputAddresses_count, inputAddresses))
					resultsDict_part, resultStr, resultTime = geocodeAddresses(inputAddresses, inputAddresses_count, outputAddresses, os.path.join(*addressLocator))
					arcpy.AddMessage('  ' + resultStr + timeDisp(resultTime) + '.')
					try: resultsDict.update(resultsDict_part)
					except TypeError: arcpy.AddMessage('      No output from geocode; %s' % resultStr)
					del resultsDict_part, resultStr, resultTime 

			# get result from PP...
			if nProcesses > 1:
				for job in jobs:
					resultsDict_part, resultStr, resultTime = job()
					arcpy.AddMessage('  ' + resultStr + timeDisp(resultTime) + '.')
					try: resultsDict.update(resultsDict_part)
					except TypeError: arcpy.AddMessage('      No output from geocode; %s' % resultStr)
					del resultsDict_part, resultStr, resultTime 
				job_server.destroy()
				del job, jobs

			# write results to master address table...
			count_thisLocator = len(resultsDict)
			with arcpy.da.UpdateCursor(masterAddress_table, ('OID@', 'matchAddress', 'x', 'y', 'locatorNum')) as uC:
				for row in uC:
					rowID = row[0]
					if rowID in resultsDict:
						row[1] = resultsDict[rowID][2]
						row[2] = resultsDict[rowID][0]
						row[3] = resultsDict[rowID][1]
						row[4] = locatorNum
						uC.updateRow(row)
				del resultsDict
			arcpy.AddMessage('    %i addresses (%.2f percent of all addresses) geocoded by this locator in %s.' % (count_thisLocator, 100.0*(float(count_thisLocator)/float(infoDict['nInputRows'])), timeDisp(time.time() - tG1)))
		else:
			arcpy.AddMessage('    !! Skipping locator (inputs state it has already completed) (%i of %i): %s' % (locatorNum, len(addressLocators), addressLocator[1]))

		arcpy.MakeTableView_management(masterAddress_table, tempAddresses_layerName, ' "locatorNum" > 0')
		allGeocodedAddresses_count = int(arcpy.GetCount_management(tempAddresses_layerName).getOutput(0))
		arcpy.Delete_management(tempAddresses_layerName)
		arcpy.MakeTableView_management(masterAddress_table, tempAddresses_layerName, ' "locatorNum" = 0')
		nonGeocodedAddresses_count = int(arcpy.GetCount_management(tempAddresses_layerName).getOutput(0))
		arcpy.Delete_management(tempAddresses_layerName)

		arcpy.AddMessage('    Total addresses geocoded thus far (all locators): %i; %.2f percent of all addresses.' % (allGeocodedAddresses_count, 100.0*(float(allGeocodedAddresses_count)/float(infoDict['nInputRows']))) )

		locatorNum += 1

	# delete split GDBs
	for i in range(numSplits):
		splitGDB_path = os.path.join(tempFolder, splitGDBNames[i])
		if os.path.exists(splitGDB_path):
			shutil.rmtree(splitGDB_path)

	arcpy.AddMessage('  Completed geocoding in %s' % (timeDisp(time.time() - t2)))
	arcpy.AddMessage('    Input addresses: %i' % infoDict['nInputRows'])
	arcpy.AddMessage('    Final addresses geocoded (all locators): %i (%.2f percent of all addresses)' % (allGeocodedAddresses_count, 100.0*(float(allGeocodedAddresses_count)/float(infoDict['nInputRows']))) )
	arcpy.AddMessage('    Non-geocoded addresses: %i' % nonGeocodedAddresses_count)

	t3 = time.time()
	arcpy.AddMessage('  Starting export of geocoded addresses to %s' % outputGDB)

	# delete output files for writing
	if os.path.exists(outputGDB):
		shutil.rmtree(outputGDB)
		arcpy.AddMessage('  Deleted %s.' % outputGDB)

	# export geocoded addresses
	arcpy.CreateFileGDB_management(outputFolder, outputGDBname, "CURRENT")
	arcpy.AddMessage('    Created GDB: %s' % outputGDB)

	if outputType == 'POINTS':
		arcpy.CreateFeatureclass_management(outputGDB, outputFCname, 'POINT', '', '', '', outputSRS)
	else:
		arcpy.CreateTable_management(outputGDB, outputFCname)

	for field in inputAddressFields:
		arcpy.AddField_management(outputFC, field, 'TEXT')
	arcpy.AddField_management(outputFC, 'matchAddress', 'TEXT')
	arcpy.AddField_management(outputFC, 'locator', 'TEXT')

	outputFields = [field for field in inputAddressFields]
	outputFields.append('matchAddress')

	if outputType == 'POINTS':
		outputFields.extend(['SHAPE@X', 'SHAPE@Y'])
	else:
		arcpy.AddField_management(outputFC, 'x', 'FLOAT')
		arcpy.AddField_management(outputFC, 'y', 'FLOAT')
		outputFields.extend(['x', 'y'])

	outputFields.append('locator')

	copyFields = [field for field in locatorFields]
	copyFields.extend(['matchAddress', 'x', 'y', 'locatorNum'])

	with arcpy.da.InsertCursor(outputFC, outputFields) as iC:
		with arcpy.da.SearchCursor(masterAddress_table, copyFields) as sC:

			for values in sC:

				if values[-1] == 0:
					locatorName = 'None'
				else:
					locatorName = addressLocators[values[-1] -1][1]

				listValues = list(values)
				listValues[-1] = locatorName
				iC.insertRow(listValues)
				del locatorName
				infoDict['nOutputRows'] += 1
		

	arcpy.AddMessage('    Exported %i addresses in %s.' % (infoDict['nOutputRows'], timeDisp(time.time() - t3)))
 


def geocodeAddresses(input_Addresses, input_Addresses_count, output_Addresses, address_Locator):
	'''geocodes a table of addresses and returns it as a Python dictionary
	suitable for parallelisation'''
	t_G0 = time.time()

	arcpy.GeocodeAddresses_geocoding(input_Addresses, address_Locator, '', output_Addresses)

	# select geocoded features, searchcursor through and write to a dictionary
	results_Dict = {}
	with arcpy.da.SearchCursor(output_Addresses, ('OID@', 'X', 'Y', 'Match_addr'), ' "X" > 0') as sC:
		for row in sC:
			results_Dict[row[0]] = (row[1], row[2], row[3])

	output_Addresses_count = len(results_Dict)
	geocodeStr = '      Geocoded addresses: %i of %i (%.2f percent) addresses in ' % (output_Addresses_count, input_Addresses_count, 100*(float(output_Addresses_count)/float(input_Addresses_count)))

	del input_Addresses, input_Addresses_count, output_Addresses, address_Locator
	return results_Dict, geocodeStr, time.time() - t_G0


def timeDisp(t):
	'''timeDisp(t)

	Converts seconds to minutes, hours or days, if appropriate

	Inputs:
		- t: time difference <float>
	Returns:
		- converted string <str>

	Usage example:
		print 'Program complete in %s.' % timeDisp(time.clock() - t0)
	'''
	L = [[24*60.0*60, 'days'], [60.0*60.0, 'hours'], [60.0, 'minutes']]
	for l in L:
		if t > 2*l[0]: return '%.2f %s' % (t/l[0], l[1])
	return '%.2f seconds' % t


def updProcesses(nProc, fPath):
	'''updProcesses(nProc, fPath)

	Checks a text file for the desired number of processes; reports whether the number changed or the file could not be read

	Inputs:
		- nProc: current number of Processes <int>
		- fPath: path to file containing number <str>
	Returns:
		- new processes number <int>
		- text explaining change or error <str>
	'''
	try:
		with open(fPath, 'r') as fObj:
			nProcesses_upd = int(fObj.read())
	except:
		return nProc, ' !! WARNING: cannot read from %s, number of Processes (currently %s) unchanged.' % (fPath, nProc)

	if nProcesses_upd != nProc:
		return nProcesses_upd, ' !! WARNING: number of Processes changed from %s to %s.' % (nProc, nProcesses_upd)
	else:
		return nProc, ''


if __name__ == '__main__':

	tMain = time.time()

	inputGDBpath = 'D:\\_TEMP\\inputAddresses.gdb'
	inputFCname = 'addresses'
	inputAddressFields = ('adds', 'zoneSub', 'Zone')
	addressLocators = (('C:\\_TEMP_data extracts\\Geocoders\\NZ_Geocoder\\', 'USSty_NZAddsSDC_City'), ('C:\\_TEMP_data extracts\\Geocoders\\NZ_Geocoder\\', 'USSty_NZAdds_City'), ('C:\\_TEMP_data extracts\\Geocoders\\NZ_Geocoder\\', 'NZSty_NZAddsSDC_City'))
	locatorFields = ('Address', 'Suburb', 'City')

	# completedLocators = ('USSty_NZAddsSDC_City', 'USSty_NZAdds_City', 'NZSty_NZAddsSDC_City')
	completedLocators = ()

	outputFolder = 'D:\\_TEMP\\output\\'
	outputGDBname = 'Geoadds.gdb'
	outputFCname = 'geoAdds'
	outputType = 'TABLE'

	outputSRS = arcpy.SpatialReference(2193)
	# outputSRS = None

	tempFolder = 'D:\\_TEMP\\geoTemp\\'
	nProcessesTxtPath =  'D:\\_TEMP\\_nCPUs.txt'
	nProcessesDefault = 1
	numSplits = 4

	arcpy.AddMessage('Starting parallel composite geocode...')
	parallelCompositeGeocode(inputGDBpath, inputFCname, inputAddressFields, addressLocators, locatorFields, completedLocators, outputFolder, outputGDBname, outputFCname, outputType, outputSRS, tempFolder, nProcessesTxtPath, nProcessesDefault, numSplits)
	arcpy.AddMessage('Geocode complete and written to output in: %s' % timeDisp(time.time() - tMain))


Locating Python, adding to Path and accessing Arcpy

To access the Python interpreter and run Python scripts you need to know the location of python.exe for your installation. To make running scripts easier you may wish to add this location to your system Path, either temporarily or permanently. It is also possible to allow Python installations that were not installed by ArcGIS to access Arcpy functionality. This post describes how to do each of these as well as some general considerations.

Opening the command prompt

You can open a command prompt through the start menu, or by running cmd.exe from the run dialogue. To change directories use ‘cd’ followed by the desired path to change to that directory, for example:

C:>cd \test
C:\test>

To open a command prompt in a folder (for example, a folder where you keep your scripts) right click (Windows XP) or hold down shift and right click (Windows 7) on the folder and select Open command prompt here. I haven’t been able to test this on all versions of Windows XP/7, so it may not work if you have Windows Home edition, for example.

Locating the Python installation

If you manually installed a version of Python it could be anywhere, but the default location is something like C:\PythonXX\, where XX represents the major and minor version; for example, 26 for Python 2.6.5.

ArcGIS 10.0 typically installs Python 2.6 to C:\Python26\ArcGIS10.0\ while 10.1 installs Python 2.7 to C:\Python27\ArcGIS10.1\, although either could be installed to another local drive such as D:. If you have installed 64-bit background geoprocessing (described here) on top of 10.1 it will install an additional Python, with a default location of C:\Python27\ArcGISx6410.1\

Accessing the interpreter and running scripts

To access the interpreter you can double click on python.exe.

Running scripts requires the script to be passed to python.exe. In the most basic case this can be done from a command prompt running in the Python directory, for example:
C:\Python27\ArcGIS10.1>python.exe test.py
would run test.py if test.py was also located in the Python directory.

However, it is much more useful to access Python from another location, such as the folder where you keep your scripts. To do this, use the full path to python.exe from a command prompt in the current location. For example, running test.py from a command prompt in D:\Projects\ (where the script is located) with Python installed at C:\Python26\ArcGIS10.0:
D:\Projects>C:\Python26\ArcGIS10.0\python.exe test.py

To save on typing it is preferable to add python.exe to the system Path (this allows it to be accessed from a command prompt anywhere in the system just by calling python).

Adding Python permanently to the system path

The system path can be changed through the Environment Variables dialogue. WARNING: Don’t make any mistakes here (like deleting things), or your system might not work. Also make sure not to insert extra spaces anywhere. To bring up the Environment Variables dialogue in Windows XP:

  • Right click on My Computer, select Properties, go to the Advanced tab, then click Environment Variables

On Windows 7 (and, I presume, 8 and Vista):

  • Right click on Computer, click on Advanced system settings on the left, then click Environment Variables

In the System variables pane, scroll the box down and select the Path variable, then go to Edit. Place the cursor in the box, hit End to get the cursor at the end, add a semi-colon (;) and then enter the path to your Python installation – so, for my installation, I added this:

;C:\Python27\ArcGIS10.1

Just be careful not to insert any additional spaces before or after the path.

Temporarily adding Python to the system path

This is particularly handy for systems where you do not have administrator rights and cannot edit the system path.

The command set lets you alter environment variables from the command prompt. The changes endure only for the instance of the command prompt from which they were invoked and end when it is exited. The best method is to add the path to the desired Python installation to the start of the system Path, like so:
set Path=C:\Python27\ArcGIS10.1;%Path%

If you need to do this regularly you can create a batch script (a file with .bat as the extension). To do this: open Notepad, enter the command as above, save the file as setPy.bat, and run it within the command prompt before running scripts.
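The batch file itself is just the command in a text file; for the installation path used above, setPy.bat would contain:

```bat
@echo off
set Path=C:\Python27\ArcGIS10.1;%Path%
```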

Accessing Arcpy functions from another Python

It is possible for other Python installations to use Arcpy, for example the Python installed by Python(x,y). Note that:

  • the major versions should match (2.6 for ArcGIS 10.0 and 2.7 for ArcGIS 10.1)
  • if Arc's Python is 32-bit, which it will be unless you have installed 64-bit background geoprocessing as described here, then the other Python must also be 32-bit; if you have installed 64-bit background geoprocessing, then the other Python must also be 64-bit
  • if you write scripts and use these as script tools within Arc they will use the Python installed by Arc; this means that any libraries you have installed to the other Python, but not installed to Arc's Python, will be inaccessible (causing an ImportError)

To allow other Python installations to access Arcpy a file must be copied from the \Lib\site-packages\ folder within the Arc Python installation and placed in the corresponding folder of the non-Arc Python. If you have not installed 64-bit background geoprocessing the file is Desktop10.1.pth; if you have installed it, the file is DTBGGP64.pth.

Miscellaneous notes

If, when attempting to install an additional library to Python, you receive an error stating that the correct Python version is not found in the registry:

  • Download the package source (usually a .zip file, not specific to any Python version)
  • Unzip the package
  • Open a command prompt within the package folder and run:
    python setup.py install
    where python is the Python you want to install the library to. If it is not on the path (either temporarily or permanently) you must supply the full path to it.

PYTHONPATH is an environment variable (like Path, described above) that is added to the list of places which Python searches for scripts, modules and libraries. It is not the path to or of the Python installation. If you wish to add folders to the PYTHONPATH, open the Environment Variables dialogue, as described above, click New, enter the name PYTHONPATH and then add folder paths as required. If adding multiple locations, they must be separated by a ;.

As far as I know it is not possible to move the Python that is installed by ArcGIS. Uninstalling and reinstalling ArcGIS should allow you to change the Python install location, except in the case of some silent network installs.


Upcoming Posts page

If you have suggestions or questions relating to the general blog content (rather than specific posts), or requests for new tutorials/posts check out the Upcoming Posts page. I will use comments posted there to help direct new content for the blog.


Arcpy: sending a result email

This was actually surprisingly easy to do. The hardest part is logging in to your SMTP server (well, finding all the specific details you need to know to do so); fortunately it’s pretty straightforward for Gmail. If you are bound to another service, search around on the net to see what you can find out about it.

This code has been written to use a Python function (third line: def…), which may be unfamiliar to many arcpy users. A function is a named block of code that is defined as the script is read — which is why it has to go at the top — and can then be called later. It isn't necessary for this code, but comes in handy where you are doing the same thing more than once; for example, if there were three slow operations being performed in this code, and you wanted to send an email after each part was complete, having the function would significantly shorten and de-clutter the code (vs. repeating the sending script three times). The main reason I did it here is to make it easier to copy and paste to other scripts.

import smtplib
import arcpy
 
def sendResultEmail(msgContents, success_TF):
	'''sendResultEmail(msgContents, success_TF)
	This is the function that does the sending - you should only need to edit the top four active lines'''
	to = 'receiver@email.com'
	send_username = 'sender@email.com'
	send_password = 'password' # warning - will just be an unencrypted string - so be careful not to leave this file lying around!
	smtpserver = smtplib.SMTP("smtp.gmail.com",587)

	
	if success_TF:
		subject = 'Arcpy testing results: script SUCCESS.'
	else:
		subject = 'Arcpy testing results: script FAILURE.'
		
	smtpserver.ehlo()
	smtpserver.starttls()
	smtpserver.ehlo() # note the parentheses - without them the method is referenced but never called
	smtpserver.login(send_username, send_password)
	
	header = 'To: ' + to + '\nFrom: ' + send_username + '\nSubject: ' + subject + '\n'

	msg = header + '\nArcpy results: \n\t' + msgContents + '\n'
	
	smtpserver.sendmail(send_username, to, msg)
	
	arcpy.AddMessage('Results email sent!')
	
	smtpserver.close()


### Here is the geoprocessing code; add (+=) to strResult every time you need to add something to the message (i.e. completion times)
### In a complex script this could be a lot of information, so separate with '\n' to force new lines in the text...

inFC = arcpy.GetParameterAsText(0)
outLoc = arcpy.GetParameterAsText(1)
outFC = arcpy.GetParameterAsText(2)
outTxt = arcpy.GetParameterAsText(3)

success = False # assume failure (changed to True if script is successful)
strResult = '' # start empty string

try:
	arcpy.FeatureClassToFeatureClass_conversion(inFC, outLoc, outFC)
	strResult += "Converted successfully"
	success = True
except arcpy.ExecuteError:
	strResult += arcpy.GetMessages()

arcpy.AddMessage(strResult)

# Now write results to the text file - if it exists it will be overwritten
txtFile = open(outTxt, "w")
txtFile.write(strResult)
txtFile.close()

# Now send the email. This line calls the module above, it requires the string and the T/F success variable...
sendResultEmail(strResult, success)


Tutorial: Arcpy basics

The following tutorial develops some basics for using the Arcpy library with Python.

1. Every script using Arc functions must start with import arcpy

'''script.py
'''
import arcpy

2. Get user inputs with ‘GetParameterAsText’

'''script.py
'''
import arcpy

# get input feature class
inputFeatureClass = arcpy.GetParameterAsText(0)

This gets the first user entered parameter as text. If you haven’t specified any parameters when adding to Arc (see step 3) it will cause the script to fail.

3. Using the tool in Arc

Right click on a toolbox you can edit (if you haven’t already, you will have to create one in “My Toolboxes” or within a folder), go down to ‘Add’ and select ‘Script…’, give it a name and label, select the script file, then add the parameters. For this example we are getting a Feature Class, so call it InputFeatureClass (or something meaningful) and set the Data Type to Feature Class. Click Finish. Double click on your new script tool to use it. At the moment it does nothing, but if you give it a feature class and click OK, it should run happily and say it was successful.

From now on we can just edit the script where it is and the changes will carry through, you won’t need to import it again.

It is important to note that paths to feature classes are just strings, e.g. ‘C:\\GISData\\NYC\\2007\\Networks.gdb\\BusRoutes’.

4. Adding messages to the display with ‘AddMessage’

'''script.py
'''
import arcpy

# get input feature class
inputFeatureClass = arcpy.GetParameterAsText(0)

# inform user of selected feature class path
arcpy.AddMessage('  Input Feature Class: '+inputFeatureClass)

Save the script and run it again, and it will now print the feature class path before finishing happily.

5. Performing a geoprocessing operation

Let’s do a buffer on the input features, and also get the user to input the buffer size for the operation; you will need to right click on the tool in Arc, go down to Properties, and in the parameters tab add the parameter BufferSize_m as a Long (integer).

Buffer also requires an output feature class, so you will have to add that to the parameters as well; its data type should be feature class, but you will need to set the Direction (under Parameter Properties) to Output.

The script now becomes (with the message made a little more meaningful):

'''script.py
'''
import arcpy

# get input feature class
inputFeatureClass = arcpy.GetParameterAsText(0)

# get buffer size (note: 'buffer' shadows a Python builtin of the same name; harmless here)
buffer = arcpy.GetParameterAsText(1)

# get output feature class
outputFeatureClass = arcpy.GetParameterAsText(2)

# inform user of selected feature class path
arcpy.AddMessage('  Buffering input Feature Class, '+inputFeatureClass+' by '+buffer+' Meters and writing to: '+outputFeatureClass)

buffer_corrected = buffer + ' Meters'

arcpy.Buffer_analysis(inputFeatureClass,  outputFeatureClass, buffer_corrected)

This should help you get started… Other tips:

  1. do some Python tutorials, and
  2. constantly refer to the documentation (either in Arc, or on the ESRI website)


Installing GDAL (and OGR) for Python on Windows

Overview:

1. Get the GDAL core files
2. Get the Python bindings
3. Install GDAL
4. Edit Environment Variables
5. Install Python bindings

After years of having different GDAL versions on my Windows and Linux installations, I have finally found out how to install recent versions of GDAL (and OGR) on Windows; and it’s surprisingly easy… The latest version offers some significant speed increases for some operations over version 1.6. Note that you will need administrator rights on your computer.

1. Get the GDAL core files

This scary looking site: http://www.gisinternals.com/sdk/ is where you download GDAL. Scroll down to the bottom of the top-most table and click on the relevant link in the ‘Downloads’ column – either 32 bit (second to last row) or 64 bit (last row).

From the new page download the GDAL Core file: gdal-[version]-[build]-core.msi (e.g. gdal-18-1600-core.msi)

2. Get the Python Bindings

Fortunately the Python bindings are located right here as well, called GDAL-[gdalVersion.system]-py[pythonVersion].[exe/msi] (e.g. GDAL-1.8.0.win32-py2.6.exe), just make sure to download the version matching your installed version of Python.

3. Install GDAL

Simply run the GDAL Core installer downloaded in step 1, and note the install path if you change it.

4. Edit Environment Variables

WARNING: Don’t make any mistakes here (like deleting things), or your system might not work…
To bring up the Environment Variables dialogue in Windows XP:

  • Right click on My Computer, select Properties, go to the Advanced tab, then click Environment Variables

On Windows 7 (and, I presume, Vista):

  • Right click on Computer, click on Advanced system settings on the left, then click Environment Variables

In the ‘System variables’ pane, scroll the box down and select the ‘Path’ variable, then go to Edit. Place the cursor in the box, and hit End to get the cursor at the end, add a semi-colon (;) and then enter the path to your GDAL installation – so, for my installation, I added this:

    ;C:\Program Files (x86)\GDAL

Just be careful not to insert any additional spaces before or after the path (see comments)! Click Ok, then click New and enter the following:
Variable name: GDAL_DATA
Variable value is the path to the GDAL data directory – so for me:

    C:\Program Files (x86)\GDAL\gdal-data

To test the installation, open up your command prompt (type cmd in the Run dialogue), type ogr2ogr and hit enter. This is one of the programs included with GDAL; if your Path is set correctly it will be accessible from anywhere. If some stuff about usage and options is printed out, GDAL is installed correctly and the Environment variables worked. If it instead says: ‘ogr2ogr’ is not recognized as an internal or external command, operable program or batch file., there is possibly a problem. First off try restarting the computer, if ogr2ogr still doesn’t work check your Environment variables for spelling mistakes or typos.

5. Install Python bindings

Run the Python bindings installer downloaded in step 2.
To test the entire setup, open a Python prompt and enter:

from osgeo import ogr

Hit enter, then:

from osgeo import gdal

If both of these run without printing out anything, you are good to go!

If you have problems please read through the comments (below), as they contain some solutions to common issues.

