XML Data Structures - Characteristics and Limitations

RJSunderman · 07-25-2018

In the blog JSON Data Structures - Working with Hierarchy and Multicardinality, I wrote about how data can be organized in a JSON structure, how to recognize data hierarchy and cardinality from a GeoEvent Definition, and how to access data values in a hierarchical, multicardinal data structure.

In this blog, we'll explore XML, another self-describing data format which -- like JSON -- has a specific syntax that organizes data using key/value pairs. XML is similar to JSON, but the two data formats are not interchangeable.

What does XML support that JSON does not?

One difference is that XML supports both attribute and element values, whereas JSON only supports key/value pairs. With JSON you generally expect data values to be associated with named fields. Consider the two examples below (credit: w3schools.com):

<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

The XML in this first example above provides information on a person, "Anna". Her first and last name are provided as elements whereas her gender is provided as an attribute value.

<person>
  <sex>female</sex>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

The XML in this second example above provides the same information, except now all of the data is provided using element values.

Both XML structures are valid, but if you have any influence with your data provider, it is probably better to avoid attribute values and instead use elements exclusively when ingesting XML data into GeoEvent Server. The inbound XML adapter can successfully translate XML which contains attribute values and node element values with some limitations we'll look at shortly.
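To make the difference concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is only an illustration of how a generic XML parser exposes attributes and elements differently; it is not how GeoEvent Server reads XML, and the read_person helper is a name I made up for this example.

```python
import xml.etree.ElementTree as ET

# The two w3schools-style examples from above.
WITH_ATTRIBUTE = """<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>"""

ELEMENTS_ONLY = """<person>
  <sex>female</sex>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>"""


def read_person(xml_text):
    """Return a person's fields as a flat dict, whichever style is used."""
    root = ET.fromstring(xml_text)
    # Attribute values live in a dictionary on the node itself ...
    fields = dict(root.attrib)
    # ... while element values are the text of the child nodes.
    for child in root:
        fields[child.tag] = child.text
    return fields


print(read_person(WITH_ATTRIBUTE))
print(read_person(ELEMENTS_ONLY))
```

A parser has to look in two different places to collect the same three values, which is part of why an elements-only structure tends to be simpler to consume.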

Here's a little secret: GeoEvent Server does not actually handle XML data at all.

GeoEvent Server uses third-party libraries to translate the XML it receives to JSON. The JSON adapter is then used to interpret the data and create event records from the translated data. Because JSON does not support attribute values, all data values in an XML structure must be translated as elements. Consider the following illustration, which shows how a block of XML data might be translated to JSON by GeoEvent Server:

XML vs. JSON

Notice the first line of the XML on the left side of the illustration above. This line declares the XML version and encoding being used. The libraries GeoEvent Server uses to translate the XML to JSON really like seeing this declaration as part of the XML data.

Also, notice that the JSON to the right of the XML sample organizes each event record's data as a separate element in a JSON array. Data for employee "James Albert (Emp #1234)" is represented in its own set of curly braces as a single JSON element. There are three employee JSON elements in the array.
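The general idea of this kind of translation can be sketched in a few lines of Python. To be clear, this is not the third-party library GeoEvent Server uses, and the sample XML is a hypothetical stand-in for the illustration (I reused two employee names from this blog). The to_json_value function is my own simplified assumption of the approach: attributes are promoted to ordinary keys and repeated sibling tags are collected into an array.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML shown in the illustration.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<employees>
  <employee fname="James" lname="Albert">
    <employee_number>1234</employee_number>
  </employee>
  <employee fname="Mary" lname="Smith">
    <employee_number>7890</employee_number>
  </employee>
</employees>"""


def to_json_value(node):
    """Translate one XML node into a JSON-friendly Python value."""
    children = list(node)
    if not children and not node.attrib:
        return node.text  # plain leaf element: keep just its text
    result = dict(node.attrib)  # attributes become ordinary keys
    for child in children:
        value = to_json_value(child)
        if child.tag in result:
            # Repeated sibling tags are collected into a JSON array.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


translated = to_json_value(ET.fromstring(SAMPLE))
print(json.dumps(translated, indent=2))
```

Each employee ends up as its own object inside an array, with the fname and lname attributes sitting alongside employee_number as regular keys.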

Sometimes XML will include non-visible characters such as a BOM (byte-order mark). If the XML you are trying to ingest is not being recognized by an input you've configured, try copying the XML into a text editor which doesn't mask the sort of characters you might find at the beginning or end of a document. Saving the raw text after stripping out any hidden characters should help create a cleaner XML document.
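If you would rather script the cleanup than use a text editor, something like the following Python sketch may help. It assumes the data is UTF-8 encoded and only trims characters from the very beginning and end of the document; the strip_hidden_characters name and the particular set of characters removed are my own choices for illustration.

```python
# Characters that are often invisible in editors: the BOM, zero-width
# spaces and joiners, and left-to-right / right-to-left marks.
HIDDEN = "\ufeff\u200b\u200c\u200d\u200e\u200f"


def strip_hidden_characters(raw: bytes) -> str:
    """Decode UTF-8 bytes and trim hidden characters from both ends."""
    # The "utf-8-sig" codec drops a leading byte-order mark if present.
    text = raw.decode("utf-8-sig")
    # str.strip() with an argument removes any mix of the listed
    # characters from the start and end, leaving the interior untouched.
    return text.strip(HIDDEN + " \t\r\n")


dirty = b"\xef\xbb\xbf<?xml version=\"1.0\" encoding=\"utf-8\"?><a/>\xe2\x80\x8d\n"
clean = strip_hidden_characters(dirty)
print(repr(clean))
```

Only the ends of the document are touched because characters such as zero-width joiners can be legitimate inside text values.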

Other limitations to consider when ingesting XML

There are several other limitations to consider when ingesting XML data into GeoEvent Server. Because XML is translated to JSON before it is adapted, limitations of the inbound JSON adapter apply as well. Sometimes a block of JSON might pass an online JSON validator such as the one provided by JSON Lint, but GeoEvent Server's inbound JSON adapter is still not able to adapt the JSON to create an event record for processing. Esri Feature JSON and GeoJSON are two examples which require special handling of arrays that don't have keys associated with them.

Mixing and Matching Attributes and Element Values

The following XML cannot be parsed:

<place>
  <latitude units="Decimal Degrees">45.125</latitude>
  <longitude units="Decimal Degrees">-115.375</longitude>
  <height units="Long Integer">10</height>
</place>

The units attribute in each node needs to be pushed down to become a child element beneath the parent node. When that happens, the parent node's own value (45.125, -115.375, and 10 in this example) is lost. There are two ways to work around this known limitation.

You could leave each parent node's value null and instead incorporate all of the data as node-level attributes:

<place>
  <latitude units="Decimal Degrees" value="45.125"></latitude>
  <longitude units="Decimal Degrees" value="-115.375"></longitude>
  <height units="Long Integer" value="10"></height>
</place>

XML requires attribute values to be enclosed in quotes (single or double). When tailoring your GeoEvent Definition you can specify that the latitude and longitude values be adapted as Double and the height be adapted as a Long to avoid bringing the data in as literal strings.

Alternatively, you could wrap each node's value in a nested tag, explicitly making it a child. When the parent node's attributes are pushed down to become children, they will be siblings of the formal child elements named value:

<place>
  <latitude units="Decimal Degrees"><value>45.125</value></latitude>
  <longitude units="Decimal Degrees"><value>-115.375</value></longitude>
  <height units="Long Integer"><value>10</value></height>
</place>
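If you cannot get your data provider to change the XML, you could pre-process it before sending it to GeoEvent Server. The Python sketch below applies the second workaround: any node that has both attributes and a text value gets that value moved into a nested child element. The wrap_mixed_values function and its child_tag parameter are hypothetical names for this example, not part of any GeoEvent Server API.

```python
import xml.etree.ElementTree as ET

MIXED = """<place>
  <latitude units="Decimal Degrees">45.125</latitude>
  <longitude units="Decimal Degrees">-115.375</longitude>
  <height units="Long Integer">10</height>
</place>"""


def wrap_mixed_values(root, child_tag="value"):
    """Move the text of attribute-bearing leaf nodes into a child element."""
    # Copy the node list first so the tree is not modified while iterating.
    for node in list(root.iter()):
        has_text = node.text is not None and node.text.strip() != ""
        if node.attrib and has_text and len(node) == 0:
            wrapper = ET.SubElement(node, child_tag)
            wrapper.text = node.text.strip()
            node.text = None  # the parent no longer carries a value itself
    return root


place = wrap_mixed_values(ET.fromstring(MIXED))
print(ET.tostring(place, encoding="unicode"))
```

The units attributes are left where they are, so once they are pushed down during translation they become siblings of the new value elements.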


Mixing and Matching Data Element Types

Consider the following block of XML data which includes data on both "vehicles" and "personnel".

<?xml version="1.0" encoding="utf-8"?>
<data>
  <vehicles>
    <vehicle make="Ford" model="Explorer">
      <license_plate>4GHG892</license_plate>
    </vehicle>
    <vehicle make="Toyota" model="Prius">
      <license_plate>6KLM153</license_plate>
    </vehicle>
  </vehicles>
  <personnel>
    <person fname="James" lname="Albert">
      <employee_number>1234</employee_number>
    </person>
    <person fname="Mary" lname="Smith">
      <employee_number>7890</employee_number>
    </person>
  </personnel>
</data>

The self-describing nature of the XML makes it apparent to a reader which data elements are which, but an input in GeoEvent Server will have trouble identifying the multiple occurrences of the different data items if the inbound adapter's XML Object Name property is not specified.

Here is the GeoEvent Definition the inbound adapter generates when its XML Object Name property is left unspecified and the XML data sample above is ingested into GeoEvent Server:

GeoEvent Definition

In testing, the very first time the XML with the combination of "vehicles" and "personnel" was received and written out as JSON to a system text file, I observed only one person and one vehicle written to the output file. Worse yet, without changing the generated GeoEvent Definition or any of the input connector's properties, sending the exact same XML a second time produced an output file with "vehicles" and "personnel" elements that were empty.

The blog JSON Data Structures - Working with Hierarchy and Multicardinality suggests that, at the very least, the cardinality specified by the generated GeoEvent Definition is not correct. The GeoEvent Definition also implies a nesting of groups within groups which won't work once an XML Object Name is specified.

Let's explore how you might work around this issue using the configurable properties available in GeoEvent Server. First, ensure the XML input connector specifies which node in the XML should be treated as the root node by setting the XML Object Name property accordingly as illustrated below:

GeoEvent Input

Second, verify the GeoEvent Definition has the correct cardinality for the data sub-structure beneath the specified root node as illustrated below:

GeoEvent Definition

With these properties configured, GeoEvent Server will only consider data within the sub-structure found beneath a "vehicles" root node and should allow for that sub-structure containing more than one "vehicle".

XML Sample
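Conceptually, naming a root node narrows the input to one sub-structure of the document. The Python sketch below mimics that idea outside of GeoEvent Server. It is not how the XML Object Name property is implemented, and records_under is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<data>
  <vehicles>
    <vehicle make="Ford" model="Explorer">
      <license_plate>4GHG892</license_plate>
    </vehicle>
    <vehicle make="Toyota" model="Prius">
      <license_plate>6KLM153</license_plate>
    </vehicle>
  </vehicles>
  <personnel>
    <person fname="James" lname="Albert">
      <employee_number>1234</employee_number>
    </person>
    <person fname="Mary" lname="Smith">
      <employee_number>7890</employee_number>
    </person>
  </personnel>
</data>"""


def records_under(xml_bytes, object_name):
    """Return one flat dict per child found beneath the named node."""
    root = ET.fromstring(xml_bytes)
    # Use the document root if it matches, else the first matching descendant.
    node = root if root.tag == object_name else root.find(f".//{object_name}")
    if node is None:
        return []
    records = []
    for child in node:
        record = dict(child.attrib)  # attributes promoted to fields
        for field in child:
            record[field.tag] = field.text
        records.append(record)
    return records


vehicles = records_under(SAMPLE, "vehicles")
people = records_under(SAMPLE, "personnel")
print(vehicles)
print(people)
```

Calling the helper once per node name gives two homogeneous lists, which mirrors the idea of using a separate input for each sub-structure.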

With this approach, there are two ramifications you might want to consider.

First, the inbound adapter is literally throwing half of the received data away by excluding data from any sub-structure found beneath the "personnel" nodes. This can be addressed by making a copy of the existing Receive XML on a REST Endpoint input and configuring this copy to use "personnel" as its XML Object Name. The copied input should also use a different GeoEvent Definition -- one which specifies "person" as an event attribute with cardinality Many and the attributes of a "person" (rather than a "vehicle") as illustrated below.

Copied Input Configuration

Second, the event record being ingested has multiple vehicles (or people) as items in an array. You'll likely want to process each vehicle (or person) as individual event records. To address this, it's recommended you use a processor available on the ArcGIS GeoEvent Server Gallery, specifically the Multicardinal Field Splitter Processor. There are two different field splitter processors provided in the download, so make sure to use the processor that handles multicardinal data structures.

A Multicardinal Field Splitter Processor, added to a GeoEvent Service as illustrated below, will clone the event records it receives and split each one so that every output record has only one vehicle (or person). Notice that each event record output from the Multicardinal Field Splitter Processor includes the index at which the element was found in the original array.

GeoEvent Service
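The splitting behavior described above can be approximated in Python as follows. This is only a sketch of the concept, not the processor's actual code: the split_multicardinal function, the index field name, and the zero-based numbering are all assumptions I made for the example.

```python
import copy


def split_multicardinal(record, field, index_field="index"):
    """Return one clone of the record per item in the list at record[field]."""
    results = []
    for position, item in enumerate(record[field]):
        # Clone everything except the multicardinal field ...
        clone = {k: copy.deepcopy(v) for k, v in record.items() if k != field}
        # ... then attach a single item and remember where it came from.
        clone[field] = copy.deepcopy(item)
        clone[index_field] = position  # zero-based here (an assumption)
        results.append(clone)
    return results


event = {
    "vehicle": [
        {"make": "Ford", "model": "Explorer", "license_plate": "4GHG892"},
        {"make": "Toyota", "model": "Prius", "license_plate": "6KLM153"},
    ]
}

for single in split_multicardinal(event, "vehicle"):
    print(single)
```

The original record is left unchanged; each output record carries exactly one vehicle plus the position it held in the original array.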

Conclusion

The examples I've referenced in this blog are obviously academic. There's no good reason why a data provider would mash up people and vehicles this way in the same XML data structure. However, you might come across data structures which are not homogeneous and need to use one or more of the approaches highlighted in this blog to extract a portion of the data out of a data structure. Or you might need to debug your input connector's configuration to figure out why attribute or element values you know to exist in the XML being received are not coming through in the event records that are output. Or maybe you expect multiple event records to be ingested from the data you're receiving and end up observing only a few -- or maybe only one. Hopefully the information provided will help you address these challenges when you encounter them.

To summarize, below are the tips I highlighted in this article:

Use the GeoEvent Definition as a clue to the hierarchy and cardinality GeoEvent Server is using to define each event record's structure.
Specify the root node or element when ingesting XML or JSON; don't let the inbound adapter assume which node should be considered the root. If necessary, specify an interior node as the root node so only a subset of the data is actually considered.
Avoid XML data which uses attributes. If you must use XML data with attributes, know that an attempt will be made to promote these as elements when the XML is translated to JSON.
Encourage your data providers to design data structures whose records are homogeneous. This can run counter to database normalization instincts where data common to all records is included in a sub-section above each of the actual records. Sometimes simple is better, even when "simple" makes individual data records verbose.
Make sure the XML you ingest includes a header specifying its version and encoding -- the libraries GeoEvent Server is using really like seeing this metadata. Also, watch out for hidden characters which are sometimes present in the data.