rsunderman-esristaff

XML Data Structures - Characteristics and Limitations

Blog Post created by rsunderman-esristaff Employee on Jul 25, 2018

In a separate blog, JSON Data Structures - Working with Hierarchy and Multicardinality, I wrote about how data can be organized in a JSON structure, how to recognize data hierarchy and cardinality from a GeoEvent Definition, and how to access data values given a hierarchical, multi-cardinal, data structure.

In this blog, we'll explore XML, another self-describing data format which -- like JSON -- has a specific syntax that organizes data using key/value pairs. XML is similar to JSON, but the two data formats are not interchangeable.

What does XML support that JSON does not?

One difference is that XML supports both attribute and element values whereas JSON really only supports key/value pairs. With JSON you generally expect data values will be associated with named fields. Consider the two examples below (credit: w3schools.com):

<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

The XML in this first example above provides information on a person, "Anna". Her first and last name are provided as elements whereas her gender is provided as an attribute value.

<person>
  <sex>female</sex>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

The XML in this second example above provides the same information, except now all of the data is provided using element values

Both XML structures are valid, but if you have any influence with your data provider, it is probably better to avoid attribute values and instead use elements exclusively when ingesting XML data into GeoEvent Server. This is only a recommendation, not a requirement. As you will see in the following examples, GeoEvent Server can successfully adapt XML which contains attribute values.

Here's a little secret:  GeoEvent Server does not actually handle XML data at all.

GeoEvent Server uses third party libraries to translate XML it receives to JSON. The JSON adapter is used interpret the data and create event records from the translated data. Because JSON does not support attribute values, all data values in an XML structure must be translated as elements. Consider the following illustration which shows how a block of XML data might be translated to JSON by GeoEvent Server:

XML vs. JSON

Notice the JSON on the right in this example organizes each event record as separate elements in a JSON array. Also notice the first line of the XML on the left which declares the version and encoding being used. The libraries GeoEvent Server uses to translate the XML to JSON really like seeing this information as part of the XML data. Finally, sometimes XML will include non-visible characters such as a BOM (byte-order mark). If the XML you are trying to ingest is not being recognized by an input you've configured, try copying the XML into a text editor and saving a text-only version to strip out any hidden characters.

 

Other limitations to consider when ingesting XML

There are several other limitations to consider when ingesting XML data into GeoEvent Server. Sometimes a block of JSON might pass an online JSON validator such as the one provided by JSON Lint but still not be ingested into GeoEvent Server. The JSON syntax rules, for example, do not require that every nested element have a name; yet without a name, it is impossible to construct a GeoEvent Definition since every event attribute must have a name to create a complete GeoEvent Definition.

Similarly, there are XML structures which are perfectly valid which GeoEvent Server may have trouble ingesting. Consider the following block of XML data as an example:

<?xml version="1.0" encoding="utf-8"?>
<data>
  <vehicles>
    <vehicle make="Ford" model="Explorer">
      <license_plate>4GHG892</license_plate>
    </vehicle>
    <vehicle make="Toyota" model="Prius">
      <license_plate>6KLM153</license_plate>
    </vehicle>
  </vehicles>
  <personnel>
    <person fname="James" lname="Albert">
      <employee_number>1234</employee_number>
    </person>
    <person fname="Mary" lname="Smith">
      <employee_number>7890</employee_number>
    </person>
  </personnel>
</data>

The XML data illustrated above contains a mix of both "vehicles" and "personnel". The self-describing nature of the XML makes it apparent to a reader which data elements are which, but an input in GeoEvent Server may still have trouble identifying the multiple occurrences of the different data items if the inbound adapter's XML Object Name property is not specified.

Here is the GeoEvent Definition the inbound adapter generates when its XML Object Name property is left unspecified and the XML data sample above is ingested into GeoEvent Server:

GeoEvent Definition

In testing, the very first time the XML with the combination of "vehicles" and "personnel" was received and written out as JSON to a system text file, I observed only one person and one vehicle were written to the output file. Worse yet, without changing the generated GeoEvent Definition or any of the input connector's properties, sending the exact same XML a second time produced an output file with "vehicles" and "personnel" elements that were empty.

We know from the JSON Data Structures - Working with Hierarchy and Multicardinality blog that, at the very least, the cardinality specified by the generated GeoEvent Definition is not correct. The GeoEvent Definition also implies a nesting of groups within groups, which is probably not correct.

Working around the issue

Let's explore how you might work around the issue identified above using the configurable properties available in GeoEvent Server. First, ensure the XML input connector specifies which node in the XML should be treated as the root node by setting the XML Object Name property accordingly as illustrated below:

GeoEvent Input

Second, verify the GeoEvent Definition has the correct cardinality for the data sub-structure beneath the specified root node as illustrated below:

GeoEvent Definition

By configuring these above properties accordingly, GeoEvent Server will only consider data within a sub-structure found beneath a "vehicles" root node and should make allowances that the sub-structure may contain more than one "vehicle".

XML Sample

With this approach, there are two ramifications you might want consider. First, the inbound adapter is literally throwing half of the received data away by excluding data from any sub-structure found beneath the "personnel" nodes. This can be addressed by making a copy of the existing Receive XML on a REST Endpoint input and configuring this copy to use "personnel" as its XML Object Name. The copied input should also use a different GeoEvent Definition -- one which specifies "person" as an event attribute with cardinality Many and the attributes of a "person" (rather than a "vehicle") as illustrated below.

Copied Input Configuration

Second, the event record being ingested has multiple vehicles (or people) as items in an array. You'll likely want to process each vehicle (or person) as individual event records. To address this, it's recommended you use a processor available on the ArcGIS GeoEvent Server Gallery, specifically the Multicardinal Field Splitter Processor. There are two different field splitter processors provided in the download, so make sure to use the processor that handles multicardinal data structures.

A Multicardinal Field Splitter Processor, added to a GeoEvent Service illustrated below, will clone event records it receives and split the event record so that each record output has only one vehicle (or person). Notice that each event record output from the Multicardinal Field Splitter Processor includes an index at which the element was found in the original array.

GeoEvent Service

Conclusion

The examples I've referenced in this blog are obviously academic. There's no good reason why a data provider would mashup people and vehicles this way in the same XML data structure. However, you might come across data structures which are not homogeneous and need to use one or more of the approaches highlighted in this blog to extract a portion of the data out of a data structure. Or you might need to debug your input connector's configuration to figure out why attribute or element values you know to exist in the XML being received are not coming through in the event records that output. Or maybe in the data you're receiving you expect multiple event records to be ingested and end up only observing a few -- or maybe only one -- event records being ingested. Hopefully the information provided will help you address these challenges when you encounter them.

To summarize, below are the tips I highlighted in this article:

  • Use the GeoEvent Definition as a clue to the hierarchy and cardinality GeoEvent Server is using to define each event record's structure.
  • Specify the root node or element when ingesting XML or JSON; don't let the inbound adapter assume which node should be considered the root. If necessary, specify an interior node as the root node so only a subset of the data is actually considered.
  • Avoid XML data which uses attributes. If you must use XML data with attributes, know that an attempt will be made to promote these as elements when the XML is translated to JSON.
  • Encourage your data providers to design data structures whose records are homogeneous. This can run counter to database normalization instincts where data common to all records is included in a sub-section above each of the actual records. Sometimes simple is better, even when "simple" makes individual data records verbose.
  • Make sure the XML you ingest includes a header specifying its version and encoding -- the libraries GeoEvent Server is using really like seeing this metadata. Also, watch out for hidden characters which are sometimes present in the data.

Outcomes