Our current situation is that we have around 40,000 geolocated settlements across Africa, each located on a detailed road network shapefile. We would like to calculate the shortest distance by road from each settlement to every other settlement (i.e. create a huge 40,000 x 40,000 origin-destination matrix of distances). We have a methodology to do this for smaller datasets within GIS software, but are held back by memory limitations for such a large operation. Hence, we were interested to explore your methods before we consider developing our own solutions. Do you have any advice or suggestions on how to proceed?
40,000 by 40,000 is quite large. One way to tackle this is to create an OD matrix layer, set the output shape type to none (why bother creating straight lines if all you want is distances), and load all 40,000 locations as destinations. Then, in a loop, do the following: select the first 100 locations, load them as origins, Solve, and write out the results to a file geodatabase table. Remove the 100 origins and repeat this for the next 100.
So basically, do a 100 by 40,000 solve, 400 times. You may even be able to do 200 by 40,000.
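The batching scheme above can be sketched in plain Python. Note this is only a skeleton: `solve_od_batch` is a hypothetical placeholder for the select/load/Solve/write-out steps, not a real Network Analyst call.

```python
def batches(ids, size):
    """Yield consecutive chunks of `size` origin IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

settlement_ids = list(range(40000))  # stand-in for the 40,000 settlements
batch_size = 100

def solve_od_batch(origin_ids, destination_ids):
    # placeholder: select the origins, Solve, and append the resulting
    # rows to the output file geodatabase table
    pass

runs = 0
for origin_ids in batches(settlement_ids, batch_size):
    solve_od_batch(origin_ids, settlement_ids)
    runs += 1
print(runs)  # 400 solves of 100 x 40,000
```

Each pass writes its rows out immediately, so only one 100 x 40,000 slice of the matrix ever needs to sit in memory at a time.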
If you have the ArcGIS 10 pre-release and run it on a machine with at least 4 GB of RAM and a 64-bit OS (Windows 7 or Vista), you can run larger problems. Also, in ArcGIS 10, if you upgrade your network dataset and dissolve it with a new tool, you will get better performance.
After computing the pairwise distance matrix, I would like to write the contents of this matrix into a text file. Subsequently, I plan to use this matrix in a C++ and/or Matlab program which I have written, for doing some data mining. This is all a part of my research that I am doing on finding some useful patterns in the data. I work as a post-doc at the University of Florida.
I was wondering if you could show me a script that demonstrates how to compute the matrix (looping through all the destinations and a subset of the origins each time). Also, how and where would I run this script?
I noticed that the road network I have has several discontinuities along the roads. Is there a tool (like Create Topology) that lets me set a tolerance and join these discontinuities with small line fragments?
I have never used Network Analyst and OD matrices in ArcGIS, but here are some thoughts/hints until you get a better answer.
One way to build a loop is to program it in Python, calling the ArcGIS Geoprocessor methods that correspond to the tool(s) you need. If you already program in C++/MATLAB, Python will not be an issue. An easy way to start is to open the Model Builder in ArcMap, drag and drop the tool(s) you want to use, parameterize them, and export your model as a Python script (this way you will see how to create the Geoprocessor object, set the license, etc.). Once done, try to run it on a small dataset by typing "python yourScriptName.py" in a shell from the relevant folder (see note 1). If python is not recognized as an executable, you will have to update your PATH environment variable by adding the path to your Python25 directory. When this simple script (directly exported from the Model Builder) works, the next step is to nest part of it in a loop that modifies the input/output parameters of the different tool/function/method calls.
In case you want to implement most of the computation yourself: I did something quite similar 15 years ago based on Dijkstra's algorithm (well described on Wikipedia). I implemented it in C and it was extremely fast. There are different ways to implement it, more or less fast and/or memory-efficient, but I don't remember the details by heart.
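As a concrete illustration of that approach, here is a minimal Dijkstra in Python using a binary heap (`heapq`); the toy road network and its edge lengths are made up for the example.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from `source` to every reachable node.
    `graph` maps node -> list of (neighbor, edge_length) pairs."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already improved
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy network: A-B-C with a longer direct A-C road
roads = {
    "A": [("B", 1.0), ("C", 5.0)],
    "B": [("A", 1.0), ("C", 1.5)],
    "C": [("A", 5.0), ("B", 1.5)],
}
print(dijkstra(roads, "A"))  # {'A': 0.0, 'B': 1.0, 'C': 2.5}
```

For the full OD matrix you would run this once per settlement (one source, all destinations), which again lends itself naturally to batching and writing results out incrementally.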
If you are using MATLAB, there are implementations of Dijkstra on MATLAB Central, and you could use the Mapping Toolbox - if you have it - to directly read a shapefile that contains edge lengths (or compute them using the distance() function of this toolbox on the lines or polylines that link your nodes).
Note 1: editing/running your code in Eclipse would be a good thing to do if you are planning to play a bit with Python. Once you've installed a version of Eclipse (e.g. the Classic) and the PyDev extension ('Help/Install New Software', specifying the PyDev repository "http://pydev.org/updates"), you just have a few things to do: in 'Window/Preferences/PyDev/Interpreter - Python', add your Python interpreter ("YourPath\Python25\python.exe"). When asked for the PYTHONPATH, check the "YourPath\ArcGIS\bin" entry. When done, you can create a new PyDev project, add a new file (yourFilename.py), and run it ("as Python run"). Cutting and pasting the export from the Model Builder should work if the export contains absolute paths to your resources (shapefiles, tables/dbf, file geodatabase, ...), which I guess is the case; otherwise you'll have to update the paths.
I thought about it a little more, and I was wondering if you couldn't first analyze the links between clusters before trying to compute shortest paths between all the nodes.
I guess there are multiple reasons for clustering to happen: rivers with bridges at specific locations, borders and control points, valleys, etc. Wouldn't it be more efficient to build a solution around the following strategy?
1. Define clusters based on the aforementioned criteria.
2. Build a list of the nodes that connect these clusters ("connection nodes").
3. Determine the shortest intra-cluster paths within each cluster.
4. Based on 2. and 3., build a second ("virtual") network whose nodes are the connection nodes from 2. and whose edge lengths are the shortest paths between these nodes obtained in 3.
Once done, you can determine the shortest path between two nodes as follows:
- If the two nodes are in the same cluster, it is given by 3.
- If the two nodes are in different clusters, add them to the virtual network, connected to each connection node of their own clusters by edges whose lengths are the shortest paths to these nodes, then compute the shortest path on this extended virtual network.
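To make the idea concrete, here is a tiny Python sketch of steps 3 and 4 on a made-up two-cluster network (all node names and lengths are hypothetical); the a1 -> d2 distance found through the virtual network matches the one you would get by running Dijkstra on the full graph.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from `source`; `graph` maps node -> [(nbr, length)]."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Two clusters joined only by the bridge b-c; b and c are the connection nodes.
cluster1 = {"a1": [("a2", 1), ("b", 4)], "a2": [("a1", 1), ("b", 2)],
            "b": [("a1", 4), ("a2", 2)]}
cluster2 = {"c": [("d1", 3), ("d2", 5)], "d1": [("c", 3), ("d2", 1)],
            "d2": [("c", 5), ("d1", 1)]}

# Step 3: intra-cluster shortest paths from each connection node.
from_b = dijkstra(cluster1, "b")
from_c = dijkstra(cluster2, "c")

# Step 4 + query: virtual network of connection nodes, with the two query
# nodes attached by their intra-cluster shortest distances.
virtual = {
    "a1": [("b", from_b["a1"])],
    "b":  [("c", 10.0)],          # the inter-cluster bridge edge
    "c":  [("d2", from_c["d2"])],
}
print(dijkstra(virtual, "a1")["d2"])  # 17.0
```

At no point does the computation need more than one cluster's distance matrix in memory, which is the whole point of the strategy.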
This would have several major advantages, I guess. The obvious one is reducing the size of the objects (matrices, etc.) in memory by splitting them into multiple smaller ones. This is also relevant for your processing in MATLAB, because it would not require one huge contiguous block of memory (which must be smaller than the free memory left on your machine). Moreover, as the clusters are not represented as directly connected within the same matrix, the sum of the smaller matrices is likely to require significantly less memory than a 40,000 x 40,000 full/dense matrix (you can see all the smaller matrices as blocks of a 40k x 40k block-sparse matrix). The price to pay is a bit more algorithm development compared to the direct, "brute-force" method.
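A quick back-of-envelope calculation shows how much this can save, assuming 8-byte double-precision distances and a purely illustrative split into 40 clusters of 1,000 nodes each:

```python
# Dense case: the full 40,000 x 40,000 matrix of 8-byte doubles.
n = 40_000
dense_gb = n * n * 8 / 1024**3
print(round(dense_gb, 1))   # ~11.9 GB

# Block-sparse case: only the intra-cluster blocks are stored
# (plus a small virtual-network matrix, negligible here).
clusters, size = 40, 1_000
block_gb = clusters * size * size * 8 / 1024**3
print(round(block_gb, 2))   # ~0.3 GB
```

The exact savings depend on how evenly the network actually splits into clusters, but the order of magnitude is what makes the approach attractive on a memory-limited machine.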
This might not be suited to the data mining you want to perform, and I am far from being a specialist, so you might want to ask a real computer/network person or wait for my post to be fairly criticized 😉