Battle of the Runtimes

09-23-2021 10:42 AM
MarkCederholm
Regular Contributor II

Some time ago I posted an article comparing the performance of older Esri SDKs (ArcObjects: .NET, C++) with newer ones (Pro, Enterprise, Runtime: .NET).  The Pro and Enterprise SDKs barely performed better than their ArcObjects .NET counterparts: instead of being re-engineered from scratch, they obviously leveraged the older ArcObjects technology and were bogged down by the same COM interop performance issues.  Runtime, on the other hand, proved to be a true innovation and far outperformed any of the other SDKs.

As a fun exercise, I decided to compare the three flavors of Runtime 100.12 available for Windows desktop (.NET, Java, Qt).  Again, I used the same purely computational benchmark: creating convex hulls for 100,000 random polygons.  I built all three examples as standalone console applications (release builds), and executed them outside of their respective IDEs.  I ran each benchmark five times, and picked the best time for each.
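For context, the overall shape of the benchmark can be sketched in plain Java. This is an illustration only: a simple monotone-chain convex hull stands in for the Runtime geometry engine, and the polygon and vertex counts are placeholders, not the values from the attached code.

```java
import java.util.*;

public class HullBenchmark {
    static final Random rand = new Random(42); // fixed seed for repeatability

    // Monotone-chain (Andrew's) convex hull; stands in for the SDK's native call.
    static List<double[]> convexHull(List<double[]> pts) {
        List<double[]> p = new ArrayList<>(pts);
        p.sort((a, b) -> a[0] != b[0] ? Double.compare(a[0], b[0])
                                      : Double.compare(a[1], b[1]));
        int n = p.size();
        double[][] hull = new double[2 * n][];
        int k = 0;
        for (int i = 0; i < n; i++) {                 // build lower hull
            while (k >= 2 && cross(hull[k - 2], hull[k - 1], p.get(i)) <= 0) k--;
            hull[k++] = p.get(i);
        }
        for (int i = n - 2, t = k + 1; i >= 0; i--) { // build upper hull
            while (k >= t && cross(hull[k - 2], hull[k - 1], p.get(i)) <= 0) k--;
            hull[k++] = p.get(i);
        }
        return new ArrayList<>(Arrays.asList(hull).subList(0, Math.max(k - 1, 0)));
    }

    static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    public static void main(String[] args) {
        int numPolygons = 1000; // the post uses 100,000
        int numPoints = 20;     // vertices per random multipoint (assumed)
        long start = System.nanoTime();
        long totalHullVertices = 0;
        for (int i = 0; i < numPolygons; i++) {
            List<double[]> pts = new ArrayList<>(numPoints);
            for (int j = 0; j < numPoints; j++)
                pts.add(new double[] { rand.nextDouble() * 100, rand.nextDouble() * 100 });
            totalHullVertices += convexHull(pts).size();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println("Hull vertices: " + totalHullVertices + ", seconds: " + seconds);
    }
}
```

The real benchmarks differ in that the hull and area work happens inside the shared native library; the harness loop above only shows where the per-call wrapping overhead being measured would sit.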

Here's the benchmark comparison:

Runtime SDK | Execution time (seconds)
.NET        | 16
Java        | 19
Qt (C++)    | 33

 

And here's the normalized performance index:

[Image: Runtime_Performance.png (normalized performance index chart)]

I expected .NET and Java to be pretty much neck-and-neck, since no COM interop is involved in the benchmark (COM interop performance is much worse in Java than in .NET).  Qt was a bit of a surprise, although I've dabbled with Runtime for Qt in the past and noticed that fine-grained code seems not to be as fast as it could be.  While C++ is my favorite programming language, and I admire Qt's "write once, deploy many" approach, it's obvious that the framework carries some baggage.

[See attachment for the code.]

Update:

This has certainly been a fascinating topic, and there's been some good participation and feedback.  While the original purpose of the exercise was to compare the relative interop performance of the various flavors of Runtime in making a large number of fine-grained calls to the common libraries, it has since been demonstrated that tweaks to the logic can make a significant difference in performance.  And in one case so far, the exercise has led to Esri's discovering and fixing a bug.  Kudos to everyone who participated.

9 Comments
dotMorten_esri
Esri Frequent Contributor

Interesting comparison, and I agree Qt is a bit of a surprise. I'm not an expert on the Java and Qt code, but just looking over the .NET code and the general approach taken, I do have a few initial thoughts about this comparison:

First, calculating the convex hull and area is performed entirely in native code, and all three SDKs use the exact same native binary for this calculation, so there should be absolutely no difference in the actual computation. The only difference would be in creating the geometry objects and sending them into that native library. The overhead here would mostly fall on the managed languages like Java and .NET, so Qt being that much slower makes me wonder what we're actually measuring.

Second, running only 5 times and picking the best one doesn't say much. When we do performance tests, we typically run them thousands, often hundreds of thousands, of times: doing warm-up runs, doing pre-runs to estimate roughly how many runs we really need, measuring each run, removing extreme outliers, and considering the standard deviations when interpreting the results. With only 5 runs, we can't say much about the standard deviation, and if the standard deviations are very large (say 10+ seconds), you can't really conclude that the above results are actually different. It could just be that .NET got lucky and Qt got unlucky. We also turn off all unnecessary apps and services, disable antivirus, etc., since they can have a huge impact on the results. It could be, for instance, that the Qt tests ran while a scanning tool was busy. Again, the standard deviations and LOTS of runs at different times would help you determine that.

Also, when you rely on random input data, you're not really comparing apples to apples: .NET could just have gotten lucky here with easier calculations, and you helped that by picking only the best of the 5 runs.
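One way to make the inputs repeatable, at least within a single language, is to seed the random generator; a sketch (this is a suggestion, not something the attached benchmarks do, and sharing identical sequences across languages would additionally require implementing a common PRNG):

```java
import java.util.Random;

public class SeededInput {
    public static void main(String[] args) {
        // Two generators with the same seed produce the same "random" coordinates,
        // so every run of the benchmark measures identical input data.
        Random a = new Random(12345);
        Random b = new Random(12345);
        boolean same = true;
        for (int i = 0; i < 1000; i++) {
            same &= a.nextDouble() == b.nextDouble();
        }
        System.out.println(same); // prints true
    }
}
```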

Specifically for .NET (and probably the others), I would remove all calls to Console.WriteLine / Debug.WriteLine.  That can actually be a rather expensive call, and I assume it's not something we want to measure here. There are various other minor optimizations that could be applied, but again, it's all about what you're trying to measure. For instance, you're also measuring the performance of getting a random number, and that could differ on each platform as well. For .NET, the DateTime object isn't good for performance measuring; you should use a high-performance timer API instead. You should be able to rely on the QueryPerformanceCounter API on all 3 platforms, for instance.

As a simple change, it might be better to measure the time spent on a single run instead of all 100,000 combined (using a high-performance timer API), then get the average and standard deviation for each. The distribution of the runs might tell you a lot more. And it's often worth breaking down what you measure into smaller pieces: measure polygon creation as one test, convex hull calculation as another, area calculation as a third, etc. As mentioned above, the main difference between the 3 APIs would be the object creation, not so much the geometry engine calls.
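The measurement style described above can be sketched in plain Java (the same idea applies with QueryPerformanceCounter in C++ or Stopwatch in .NET); the warm-up and run counts here are arbitrary placeholders:

```java
import java.util.*;
import java.util.function.Supplier;

public class MicroBench {
    // Times a single operation many times and reports mean and standard deviation,
    // discarding warm-up iterations; 'work' stands in for one polygon's hull calculation.
    static double[] measure(Supplier<?> work, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) work.get();       // let JIT/caches settle
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();                   // monotonic high-res timer
            work.get();
            samples[i] = (System.nanoTime() - t0) / 1e6;   // milliseconds
        }
        double mean = Arrays.stream(samples).average().orElse(0);
        double var = Arrays.stream(samples)
                           .map(s -> (s - mean) * (s - mean)).average().orElse(0);
        return new double[] { mean, Math.sqrt(var) };
    }

    public static void main(String[] args) {
        double[] stats = measure(() -> {
            double sum = 0;
            for (int i = 0; i < 100_000; i++) sum += Math.sqrt(i);
            return sum;
        }, 50, 200);
        System.out.printf("mean=%.3f ms, stddev=%.3f ms%n", stats[0], stats[1]);
    }
}
```

In practice you would also drop outlier samples and inspect the full distribution rather than just these two numbers, as the comment above describes.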

JamesBallard1
Esri Contributor

@MarkCederholm, thanks for posting this. I agree with everything @dotMorten_esri said. We will look into this on the Qt team to make sure we don't have this kind of overhead compared to the .NET and Java Runtime SDKs. It is very surprising that the Qt C++ SDK, which is native C++ through the entire stack, would be roughly twice as slow as its companion Runtime SDKs.

MarkCederholm
Regular Contributor II

My bad, I'm aware that the different flavors of Runtime call the same native libraries.  What I neglected to point out is that, by making a large number of fine-grained calls, my goal was to measure the relative performance of the various frameworks in wrapping and interacting with those libraries.

MatveiStefarov
Esri Contributor

Hi Mark! Thank you for sharing these benchmarks, it's been very interesting to profile and to look for optimization opportunities.

I was able to get a considerable speedup in .NET by using the PointCollection class instead of inserting one MapPoint at a time.  It is a very efficient and reusable data structure for coordinates, and it can be used with any constructor that takes IEnumerable<MapPoint>.  Here's what my code looks like:

// Inside DoIt():
var points = new PointCollection();
for (int i = 0; i < iNumPolygons; i++)
{
  PolyResult result = GeneratePolygon(points);
  // ...
}

// Further down:
static PolyResult GeneratePolygon(PointCollection points)
{
  // Set up random point generation
  PolyResult result = new PolyResult { NumVertices = 0, Area = 0.0 };

  // Generate multipoint geometry
  for (int i = 0; i < iNumPoints; i++)
  {
    double dX = dXMin + (dXMax - dXMin) * _rand.NextDouble();
    double dY = dYMin + (dYMax - dYMin) * _rand.NextDouble();
    points.Add(dX, dY);
  }

  Multipoint mp = new Multipoint(points, _sr);
  points.Clear(); // to be reused for the next polygon
  // ...
With this one change, .NET execution time went from 17 to 7 seconds on my machine.

JamesBallard1
Esri Contributor

@MarkCederholm we checked into it and sure enough we do have a bug in Qt. We are fixing it for the upcoming 100.13 release, but in the meantime here's a workaround to get the time down for Qt.

// Generate multipoint geometry
MultipointBuilder* mb = new MultipointBuilder(_sr, NULL);
auto points = mb->points();
for (int i = 0; i < iNumPoints; i++)
{
    double dX = dXMin + (dXMax - dXMin) * _rand->generateDouble();
    double dY = dYMin + (dYMax - dYMin) * _rand->generateDouble();
    points->addPoint(dX, dY);
}

 

We weren't caching the PointCollection class internally, and calling mb->points() in a loop that way exposed the issue.

In my testing, with that one change the time drops from ~32 seconds (on macOS) to ~4 seconds. I'd be interested to hear whether you see a similar speedup, though that's certainly not required.

Thank you again for posting this. You helped us identify and fix a bug!

MarkCederholm
Regular Contributor II

Well, this is interesting!  I have confirmed the previous two comments: tweaking the logic of the benchmark produces markedly different results depending on the flavor of Runtime.  I've attached another set of code representing three different options:

Opt 1: Access the MultipointBuilder point collection inside the random point generation loop.

Opt 2: Access the MultipointBuilder point collection outside the random point generation loop.

Opt 3: Create and recycle a PointCollection outside the polygon creation function.

The last option is a bit tricky because Runtime for Qt doesn't have a Multipoint constructor that takes a PointCollection argument.  Instead, I used the following code for Qt:

    MultipointBuilder* mb = new MultipointBuilder(_sr, NULL);
    mb->setPoints(points);
    const Geometry mp = mb->toGeometry();
    delete mb;
    points->removeAll();

And the following for .NET/Java:

MultipointBuilder mb = new MultipointBuilder(points, _sr);
Multipoint mp = mb.ToGeometry();
points.Clear();

As expected, Opts 1 and 2 made no significant difference in .NET and Java, and would have made no difference for Qt but for the bug already pointed out.  Opt 3 made no significant difference for Qt, but actually had opposite results for .NET and Java!  Check out these results:

Results (seconds) | Opt 1 | Opt 2 | Opt 3
.NET              | 16    | 16    | 5
Qt                | 35    | 5     | 5
Java              | 22    | 22    | 52

 

Is that bizarre, or what?

 

 

GEOINT_ENGINEER
Esri Contributor

@MarkCederholm, very interesting shootout. In the past we analysed the memory consumption of the three Runtimes. In our daily work, especially in disconnected environments, we constantly face memory-intensive workflows. Last year I created a Medium post (GEOINT App: Using web maps as the spatial ground truth) about a simple use case. The sample code is hosted on GitHub (GEOINT Monitor, poc-viewer branch).

For performance-critical workflows, we usually depend on low-level libraries which integrate perfectly into the Qt (GPL/LGPL) ecosystem or are specified by system integrators. A recurring restriction is the use of the GEOS C/C++ (LGPL) libraries, especially in combination with PostGIS. I could easily outperform the Qt Opt 1 scenario (Avg area = 21715 / Avg pts = 12 / Seconds elapsed: ~24) on my machine by using a thin wrapper around GEOS GeometryFactory::createMultiPoint and Geometry::convexHull, ::getNumPoints, ::getArea (Avg area = 21715 / Avg pts = 12 / Seconds elapsed: ~5). We should keep in mind that these libraries were designed for low-level geometry operations (the multipoint implementation is essentially a std::vector of coordinates, each a struct of two/three doubles) and avoid any copy construction/copy assignment in the first place.
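The point about flat coordinate storage can be illustrated in plain Java; packing coordinates into one double array avoids a heap allocation per vertex, which is roughly the spirit of a GEOS coordinate sequence. This is an illustration only, not GEOS code:

```java
import java.util.*;

public class PackedCoords {
    // Object-per-point layout: one small heap object per vertex.
    static double boundingArea(List<double[]> pts) {
        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (double[] p : pts) {
            minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
        }
        return (maxX - minX) * (maxY - minY);
    }

    // Packed layout: one flat double[] of interleaved x,y pairs; no per-point objects,
    // so iteration is a linear scan over contiguous memory.
    static double boundingAreaPacked(double[] xy) {
        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (int i = 0; i < xy.length; i += 2) {
            minX = Math.min(minX, xy[i]);     maxX = Math.max(maxX, xy[i]);
            minY = Math.min(minY, xy[i + 1]); maxY = Math.max(maxY, xy[i + 1]);
        }
        return (maxX - minX) * (maxY - minY);
    }

    public static void main(String[] args) {
        Random rand = new Random(7);
        int n = 100_000;
        List<double[]> pts = new ArrayList<>(n);
        double[] xy = new double[2 * n];
        for (int i = 0; i < n; i++) {
            double x = rand.nextDouble() * 100, y = rand.nextDouble() * 100;
            pts.add(new double[] { x, y });
            xy[2 * i] = x;
            xy[2 * i + 1] = y;
        }
        // Same answer either way; the packed version avoids n heap objects
        // and the pointer chasing that comes with them.
        System.out.println(boundingArea(pts) == boundingAreaPacked(xy)); // prints true
    }
}
```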

Because of the relatively low memory consumption and the easy integration of various low-level libraries, the ArcGIS Runtime for Qt is the way to go for most of our "Runtime Desktop" use cases.

We should spend more time doing this kind of stress and performance testing.

Thanks for sharing, and best regards from Germany.

MarkCederholm
Regular Contributor II

GEOS!  What fun!  I played with that some time ago.

ColinAnderson1
Esri Contributor

@MarkCederholm, for the Opt 3 Java case, I think the API could definitely be optimized to handle PointCollection better, but you don't actually need to use a PointCollection here. The MultipointBuilder constructor takes an Iterable<Point>, which means you can just create a List<Point>, e.g.

var points = new ArrayList<Point>();
for (int i = 0; i < iNumPoints; i++) {
    double dX = dXMin + (dXMax - dXMin) * _rand.nextDouble();
    double dY = dYMin + (dYMax - dYMin) * _rand.nextDouble();
    points.add(new Point(dX, dY));
}

MultipointBuilder mb = new MultipointBuilder(points, _sr);

which on my machine takes about 10s vs 25s when you use a PointCollection.

 

Hopefully PointCollection handling can be optimized in a future release.