Forms are often treated as passive containers for information. They quietly collect photos, text answers, and audio recordings, then push the real work downstream to someone else. Field workers upload images and type descriptions but still have to pick categories manually. Open-ended comment questions gather valuable, detailed feedback, but finding time to review and summarize every response is a challenge. A citizen reports a 311 issue by voice, eager to explain every detail, but if the message is in a language you don't speak, it needs extra translation work before anyone can act on it.
What if the form itself could help? What if it could look at a photo and understand what’s in it, read a paragraph and pull out what matters, listen to a voice message and turn it into clear text—then use all of that to automatically fill the rest of the questions?
This is no longer “what if”. We’re introducing three new AI features (beta) in Survey123 that let your forms see images, analyze text, and transcribe audio right where data is captured. Instead of asking people to do more work in your form, you can let the form do more work for them.
Image analysis: set a calculation on a target question with a custom prompt to automatically extract specific information from a source image question. For example:
Text analysis: set a calculation on a target question with a custom prompt to automatically analyze or extract information from a source text question. For example:
Audio transcription: set a calculation on a text question, with an optional prompt, to automatically transcribe the recording from an audio question into text. For example:
While each of these features is powerful on its own, they become truly transformative when you chain them together.
To show how image analysis, text analysis, and audio transcription work together in a realistic workflow, let's revisit the demo in the 2025 Esri User Conference plenary.
In the first part of the video, a community issue report survey uses a single photo of a broken sidewalk to automatically generate a clear description, classify the request, flag a safety hazard, and suggest the responsible department. In the second part, a community request form captures spoken feedback in Tagalog, transcribes it to text, detects the language, translates it into English, and again uses text analysis to route the request to the right team.
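To make the chaining concrete, here is a minimal Python sketch of the first pipeline, the community issue report. The function names and keyword rules are illustrative stand-ins of our own, not the actual model calls; in the survey itself, each step is a calculation on a question whose input is the output of the previous one.

```python
# Sketch of the chained workflow from the UC demo. Each function stands in
# for a model call that Survey123 would make through a calculation; the
# simple keyword logic is only a local placeholder so the flow is runnable.

def classify_request(description: str) -> str:
    """Stand-in for text analysis: map a description to a request type."""
    keywords = {
        "sidewalk": "Sidewalk repair",
        "pothole": "Road maintenance",
        "streetlight": "Lighting",
    }
    for word, request_type in keywords.items():
        if word in description.lower():
            return request_type
    return "General inquiry"

def flag_safety_hazard(description: str) -> bool:
    """Stand-in for text analysis: flag descriptions that suggest danger."""
    return any(w in description.lower() for w in ("broken", "exposed", "trip"))

def route_to_department(request_type: str) -> str:
    """Stand-in for text analysis: suggest the responsible department."""
    routing = {
        "Sidewalk repair": "Public Works",
        "Road maintenance": "Public Works",
        "Lighting": "Utilities",
    }
    return routing.get(request_type, "Customer Service")

# Chaining: the output of one "question" feeds the next calculation.
description = "Broken sidewalk slab creating a trip hazard near the bus stop"
request_type = classify_request(description)    # "Sidewalk repair"
hazard = flag_safety_hazard(description)        # True
department = route_to_department(request_type)  # "Public Works"
```

The point of the sketch is the shape of the chain, not the logic inside each step: in the real form, a single image analysis calculation produces the description, and downstream text analysis calculations consume it.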
To configure AI-powered image analysis, text analysis, and audio transcription in Survey123, it all starts with one idea: set a calculation on the question you want to populate automatically. We’ll use a few key questions from the UC demo to walk through the configuration steps in the Survey123 web designer.
Generate a description for the uploaded image – image analysis
In XLSForm, use the following expression for the calculation of the question “Description of issue”:
Classify the request type – text analysis
In XLSForm, use the following expression for the calculation of the question “Request type”:
Notes:
Transcribe audio to text – audio transcription
The prompt for audio transcription is optional. It is mainly for improving spelling accuracy on specific words, for example, ensuring that "Esri" is not transcribed as "Ezri". For most audio, it can be left empty.
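As an illustration (this prompt is our own, not from the product documentation), a spelling hint for domain terms might look like:

```
Spell these product names as written: Esri, ArcGIS, Survey123.
```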
In XLSForm, use the following expression for the calculation of the question “Transcription”:
Translate non-English text into English – text analysis
Since we don’t need to translate if the respondent speaks English, we can set the “Translation” question to be visible only when “Spoken language” is not English.
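In XLSForm, this kind of conditional visibility goes in the relevant column of the question. Assuming hypothetical question names (spoken_language with a choice value english), the relevant expression for the "Translation" question would look something like:

```
${spoken_language} != 'english'
```

The question is then shown, and its translation calculation runs, only when a non-English language was detected.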
Image analysis, text analysis, and audio transcription are currently available in beta. Before you can use them in your surveys, you need to enable the new analysis tools at the organization level. As an ArcGIS Online organization administrator, configure the required settings in both your ArcGIS Online organization and the Survey123 website by following the instructions in the Survey123 analysis tools (beta) blog.
How were these features built?
These capabilities are powered by AI models hosted in the Microsoft Azure OpenAI service. Image and text analysis use a general-purpose large language model, while audio transcription uses a dedicated speech-to-text model.
How is my data kept secure and private when I use these AI features?
We have built these features so that your data and input always remain secure and private to you. The implementation follows ArcGIS Online data security and privacy guidelines. Specifically, the prompts you set in survey questions and the images or audio you capture in the web app:
In addition, the Microsoft Azure OpenAI service may temporarily store and process user prompts, images, and audio for abuse monitoring, as described at: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/abuse-monitoring
Do these features consume ArcGIS Online credits?
At the moment, these features are in beta and available for testing and evaluation at no cost. However, once they become a fully supported part of the product, they are expected to incur an additional charge or consume ArcGIS Online credits. Pricing has not yet been determined.
How should I write prompts to get good results?
A good prompt is clear, specific, and tells the model exactly what you need back. Think of it as instructions to a new colleague: spell out the task, the format of the answer, and any rules they should follow. Avoid vague “do everything” prompts. Here are two examples:
If you’re not sure where to start, begin with a simple, very explicit prompt, check a few real responses, and then tighten the instructions based on what you see.
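As one illustration (our own, not from the demo), compare a vague prompt with a tightened version for the sidewalk photo:

```
Vague:    Describe the photo.

Specific: In one or two sentences, describe any visible damage to public
          infrastructure in the photo. Name the type of asset (sidewalk,
          road, sign, streetlight) and note any safety hazard. If no
          damage is visible, answer exactly "No damage visible".
```

The specific version constrains the length, the vocabulary, and the fallback answer, which makes the output far easier to use in downstream calculations.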
Are there situations where I should not rely on these AI features?
Do not use these features for high-stakes or safety-critical decisions, or in scenarios such as biometric identification or face recognition, medical or diagnostic imaging, low-quality inputs, or counting objects in dense crowds and small-object scenes. A human should always review results before they are used in sensitive or high-impact workflows.
How do I provide feedback?
We'd love to hear your thoughts on these new features, along with any feedback on how well they fit your workflows. You can provide feedback in either of the following ways:
The new AI-powered image, text, and audio analysis features in Survey123 don’t replace the way you design surveys today—they give you new ways to cut typing, reduce errors, and avoid follow-up work. We’re just getting started with these capabilities, and we’re keen to hear your feedback on how well they support your real-world surveys.