When running any of the statistical tools inside ArcPro, we are limited to numerical variables. We frequently want to use categorical variables (state names, urban/rural, ...). It would really help if there were a tool, call it "Encode" or something similar, that could convert all Text fields to Integer. Using the examples above, StateName would go to StateName_num, urban_rural would go to urban_rural_num, etc. Even better, we would have the option to display either the category alias or the number.
Right now, I either script it in Python or continue my data prep work in Stata, which does this off the shelf with Encode. Ideally I would be able to do all of this in the ArcPro/ArcPy environment so we can reduce license requirements for outside software.
Good news, there is an existing tool in ArcGIS Pro for encoding categorical variables! Take a look:
Encode Field (Data Management)
I also recommend checking out the Data Engineering view in Pro, which gives you direct access to this tool and many others for exploring and preparing your data for analysis.
Thank you for your reply and for linking to that tool. That's not exactly what I had in mind. That tool splits each category out into its own new binary field. Thus, if you had 10 potential categories, you would get 10 new fields, each one a binary. The Stata Encode tool encodes all categorical values inside a single field. Additionally, it keeps the text label, so we still know which categories we're actually talking about when we view the outputs, say in a regression. Perhaps this example is clearer than the PDF I linked to: https://www.youtube.com/watch?v=ZRWHjdIZyxo
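For readers following along, here is a minimal sketch of the single-field encoding described in this post, written in plain Python rather than ArcPy. The function name `encode_labels` and the sample values are hypothetical, and the choice to number categories in sorted order is an assumption made for illustration. It returns one integer per row plus a code-to-label lookup, so the text labels are not lost.

```python
def encode_labels(values):
    """Map each distinct text value to an integer code (1, 2, 3, ...).

    Codes are assigned in sorted order of the text values (an assumption
    for this sketch). None (null) values stay None.
    Returns (codes, labels), where labels maps code -> original text.
    """
    categories = sorted({v for v in values if v is not None})
    code_of = {text: i for i, text in enumerate(categories, start=1)}
    codes = [code_of[v] if v is not None else None for v in values]
    labels = {code: text for text, code in code_of.items()}
    return codes, labels


# Hypothetical sample data standing in for an urban_rural text field.
urban_rural = ["urban", "rural", "urban", None, "suburban"]
urban_rural_num, urban_rural_labels = encode_labels(urban_rural)
print(urban_rural_num)     # [3, 1, 3, None, 2]
print(urban_rural_labels)  # {1: 'rural', 2: 'suburban', 3: 'urban'}
```

The lookup dictionary plays the role of the value labels: the numeric field is what the analysis sees, and the dictionary is what a report or a coded domain would display.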
Appreciate the clarification. Here are some ideas:
However, I would caution that these methods are not appropriate for use in many statistical workflows, unless you are working with an ordinal variable (and encoding as such, with categories in rank order). For an unordered categorical variable (factor), typically a statistical analysis would use one-hot or one-cold encoded variables, as provided by the Encode Field tool linked in the comment above. (From my understanding, Stata handles "dummy encoding" of factor variables internally.)
Also, I will just note that several of the spatial statistical tools do allow categorical variable inputs without the need for additional data prep. For example:
Hope this helps!
Thank you for the tools. At present I use ArcPy to generate new number-based fields. I submitted the idea because I'm hoping that my organization can reduce its reliance on other software. As mentioned elsewhere in the ideas environment, we often extract geospatial data using ArcPro, export it as a CSV, bring it into Stata for statistical work, export a CSV again, relink it with the spatial data, and then proceed with data integration/analysis/graphics. My proposed idea, which thanks to your feedback is now to expand the options in the Encode Field tool, would enable my less Python-literate colleagues, economists for the most part, to keep more of their workflow within the ESRI ecosystem. They often use regressions of various types, hence the request for numeric variables. Ideally the encode tool would include the option to create a coded domain, so that the end user could encode both binary and non-binary variables and maintain the labels in a single tool GUI. Does that make sense?
ESRI is well established as the giant of the geospatial realm. It would be better placed to compete with PowerBI and Tableau if the statistical systems in Pro and AGOL/AE were more robust.
We'd absolutely like to help you keep your workflows in ArcGIS Pro where possible to simplify your data engineering workflows. To clarify the enhancement to Encode Field that you're suggesting, do you mean that you would like to be able to encode ordinal variables in a single field? In that case, we would need the user to specify the order of categories in the tool. If the categories do not have an order, then it is not appropriate to represent them numerically in the same field together (for example, red=1, green=2, blue=3) for use in regression.
I will bring the ideas of encoding ordinal variables and integrating domains back to the team. In the meantime, manually configuring domains is perhaps a good no-code option for representing ordinal data types with aliases.
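As a rough illustration of what "the user specifies the order of categories" could look like, here is a small sketch in plain Python. The function name `encode_ordinal` and the category names are hypothetical; this is not how Encode Field works today, only one way the idea could be expressed.

```python
def encode_ordinal(values, order):
    """Encode text categories as ranks 1..n using a user-supplied order.

    `order` lists the categories from lowest to highest. Any category
    missing from `order` raises an error instead of getting an arbitrary
    code. None (null) values stay None.
    """
    rank = {category: i for i, category in enumerate(order, start=1)}
    unknown = {v for v in values if v is not None and v not in rank}
    if unknown:
        raise ValueError(f"categories missing from order: {sorted(unknown)}")
    return [rank[v] if v is not None else None for v in values]


# Hypothetical ordered categories and sample rows.
income_order = ["low", "lower-middle", "upper-middle", "high"]
income_group = ["high", "low", "upper-middle", None, "low"]
print(encode_ordinal(income_group, income_order))  # [4, 1, 3, None, 1]
```

Requiring an explicit order, and refusing unknown categories, keeps the numeric codes meaningful as ranks rather than as arbitrary labels.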
For some reason it didn't take when I tried to reply weeks ago! I'm sorry for the delay!
Thank you for your reply. I was just thinking of categories where order did not matter, but since you mention it... ordinal category encoding would be great, too. I'm envisioning that in the GUI the encode options could be chosen via a drop-down, or via multiple optional parameters (like in the Spatial Analyst Distance tools, where there are many optional outputs). Does that make sense?
With ordinal encoding, since there may be known rankings of the categories that we wouldn't want to lose by shifting to dummies, it would be great to have additional options: assuming linearity, basic numeric; and, if non-linear, numeric + polynomial or numeric + spline. Moving from the abstract to the concrete, this would enable things like taking country-level income categories (low-, lower-middle,...high-income countries), which are associated with non-linear increases in GNI per capita, and exploring questions like how the outcome moves with money, even inside classes, not merely across classes (dummies). Here's the Stata doc for splines: https://www.stata.com/manuals13/rmkspline.pdf
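To make the "numeric + spline" option concrete, here is a minimal sketch of a linear spline basis in plain Python: a continuous value is split into one piece per interval between knots, so a regression can fit a different slope inside each class. The function name `linear_spline` and the knot values are hypothetical (they are not actual income-class thresholds), and this is only an illustration of the general technique, not a reproduction of any particular tool's output.

```python
def linear_spline(x, knots):
    """Split x into piecewise-linear segments at increasing knots.

    Returns one value per segment (len(knots) + 1 values). Each value is
    the part of x that falls inside that segment, so the values always
    sum back to x.
    """
    neg_inf, pos_inf = float("-inf"), float("inf")
    edges = [neg_inf] + list(knots) + [pos_inf]
    segments = []
    for lower, upper in zip(edges[:-1], edges[1:]):
        clipped = min(max(x, lower), upper)       # clamp x into this segment
        base = 0.0 if lower == neg_inf else lower  # first segment starts at 0
        segments.append(float(clipped - base))
    return segments


# Hypothetical knots separating four classes of a per-capita income measure.
knots = [1000, 4000, 12000]
print(linear_spline(2500, knots))   # [1000.0, 1500.0, 0.0, 0.0]
print(linear_spline(15000, knots))  # [1000.0, 3000.0, 8000.0, 3000.0]
```

Each returned value would become its own numeric field; the coefficient on a segment then describes how the outcome moves within that class.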
I understand that in regressions the true dummy variables will be treated as binaries, but from a data perspective it's much easier to keep them in a single column than to split out the binaries. For example, Stata can take a single field and then, during a regression, functionally make binaries from it with the i. operator (see 25.2 and 25.2.1 here: https://www.stata.com/manuals13/u25.pdf). Yes, it's functionally the same, but it makes things easier, and it would also enable a single variable call in ArcPy operations.
As an aside, couldn't a tool like this also be used to quickly build domains, which might be super useful to some of your other clients?
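The "keep one column, expand only when modeling" idea can be sketched in plain Python as follows. The function name `expand_indicators` and the sample data are hypothetical; dropping the lowest code as the base level is a choice made for this sketch, and passing nulls through as None is an assumption.

```python
def expand_indicators(codes, labels, base=None):
    """Expand one integer-coded column into 0/1 indicator columns.

    `labels` maps code -> text. One level (the lowest code unless `base`
    is given) is left out as the reference category. None (null) rows
    stay None in every indicator column.
    Returns a dict of {label: list of 0/1/None}.
    """
    levels = sorted(labels)
    if base is None:
        base = levels[0]
    columns = {}
    for level in levels:
        if level == base:
            continue  # reference category gets no indicator column
        columns[labels[level]] = [
            None if c is None else int(c == level) for c in codes
        ]
    return columns


# Hypothetical single coded field with its label lookup.
urban_rural_num = [3, 1, 3, None, 2]
urban_rural_labels = {1: "rural", 2: "suburban", 3: "urban"}
print(expand_indicators(urban_rural_num, urban_rural_labels))
# {'suburban': [0, 0, 0, None, 1], 'urban': [1, 0, 1, None, 0]}
```

The stored data stays as one labeled integer field, and the binaries exist only for the duration of the model call.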