
Make your Survey123 forms smarter with AI-powered image, text, and audio analysis

2 weeks ago
ZhifangWang
Esri Regular Contributor

Forms are often treated as passive containers for information. They quietly collect photos, text answers, and audio recordings, then push the real work downstream to someone else. Field workers upload images and type descriptions but still have to pick categories manually. Open comment questions gather valuable, detailed feedback, but it is hard to find time to review and summarize every response. A citizen reports a 311 issue by voice, eager to explain every detail, but the message is in a language you don't speak, so it needs translation before anyone can act on it.

What if the form itself could help? What if it could look at a photo and understand what’s in it, read a paragraph and pull out what matters, listen to a voice message and turn it into clear text—then use all of that to automatically fill the rest of the questions?

 

AI-powered analysis for images, text, and audio in Survey123

 

This is no longer “what if”. We’re introducing three new AI features (beta) in Survey123 that let your forms see images, analyze text, and transcribe audio right where data is captured. Instead of asking people to do more work in your form, you can let the form do more work for them.

Image analysis: configures a calculation on a target question with a custom prompt to automatically extract specific information from a source image question. For example:

  • A user uploads an ID card or business card and the form automatically pulls the date or email address into the correct fields.
  • A road inspection survey reads the license plate and vehicle color from a photo and automatically fills those fields for the inspector.
  • A citizen 311 survey classifies a submitted photo as “illegal dumping”, “graffiti”, or “broken streetlight” and sets the correct incident category behind the scenes.

Text analysis: configures a calculation on a target question with a custom prompt to automatically analyze or extract information from a source text question. For example:

  • Determine the sentiment (positive, negative, neutral) of a visitor's comment about a public park or facility.
  • Extract key entities like equipment names and error codes from a technician's lengthy narrative field notes.
  • A community engagement form automatically translates non-English comments into English for downstream reviewers while keeping the original text.

Audio transcription: configures a calculation on a text question, with an optional prompt, to automatically transcribe the recording from an audio question into text. For example:

  • Allow a field inspector wearing gloves to dictate safety observations and hazards instead of typing.
  • A citizen reporting app lets users describe an issue verbally and then uses the transcribed text to populate the incident description and help center staff triage it.

While each of these features is powerful on its own, they become more transformative when you chain them together.

 

See it in action: demo from the 2025 Esri UC

 

To show how image analysis, text analysis, and audio transcription work together in a realistic workflow, let's revisit the demo from the 2025 Esri User Conference plenary.

In the first part of the video, a community issue report survey uses a single photo of a broken sidewalk to automatically generate a clear description, classify the request, flag a safety hazard, and suggest the responsible department. In the second part, a community request form captures spoken feedback in Tagalog, transcribes it to text, detects the language, translates it into English, and again uses text analysis to route the request to the right team.

 

Configure AI-powered analysis in Survey123

 

Configuring AI-powered image analysis, text analysis, and audio transcription in Survey123 starts with one idea: set a calculation on the question you want to populate automatically. We’ll use a few key questions from the UC demo to walk through the configuration steps in the Survey123 web designer.

Community issue report survey

 

Generate a description for the uploaded image – image analysis

  1. Select the question “Description of issue”, and set its calculation to “Analyze image”, based on the “Please upload a photo of the damage or issue:” question.
  2. In the textbox “What would you like to extract from the image?”, enter the following prompt: Analyze the image and write a short 311 issue description. Identify the damaged public asset (e.g., sidewalk, road, park equipment), its location in the scene, any obvious safety hazards, and the necessary action (repair or maintenance). Describe only what is visually present.

In XLSForm, use the following expression for the calculation of the question “Description of issue”:

  • pulldata("@ai","image2text",${photoDamageOrIssue},"Analyze the image and write a short 311 issue description. Identify the damaged public asset (e.g., sidewalk, road, park equipment), its location in the scene, any obvious safety hazards, and the necessary action (repair or maintenance). Describe only what is visually present.")

1 description of issue.gif

 

 

Classify the request type – text analysis

  1. Select the question “Request type”, and set its calculation to “Analyze text”, based on the “Description of issue” question.
  2. In the textbox “What would you like to extract from the text”, enter the following prompt: Categorize the 311 issue description into one of the following types: Sidewalk (sidewalk), Pothole (pothole), Graffiti (graffiti), Streetlight (streetlight), Other (other). Your response must be only the choice name in parentheses.

In XLSForm, use the following expression for the calculation of the question “Request type”:

  • pulldata("@ai","text2text",${description_of_issue},"Categorize the 311 issue description into one of the following types: Sidewalk (sidewalk), Pothole (pothole), Graffiti (graffiti), Streetlight (streetlight), Other (other). Your response must be only the choice name in parentheses.")

Notes:

  • At the time of the UC demo, the web designer only supported setting image analysis or text analysis for a text question. Now both calculation types can apply to most question types.
  • When writing a prompt for a single select (select_one) question, make sure the answer is the choice name; for a multiple select (select_multiple) question, the answer must be the names of the selected choices, separated by commas.
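
As a rough illustration of the multiple select rule, the row below sketches a select_multiple calculation, laid out as XLSForm column: value pairs. The question name hazards, the list name hazard_types, and the choice names are hypothetical and were not part of the UC demo; only the pulldata("@ai","text2text",...) form is taken from the expression above.

```text
type: select_multiple hazard_types
name: hazards
label: Safety hazards
calculation: pulldata("@ai","text2text",${description_of_issue},"From the 311 issue description, select every hazard that applies: Trip hazard (trip), Traffic obstruction (traffic), Exposed wiring (wiring), None (none). Your response must be only the choice names in parentheses, separated by commas.")
```

Including an explicit None (none) choice is one way to give the model a valid answer when no hazard applies; whether you need it depends on your choice list.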

2 request type.gif

 

 

Community request form survey

 

Transcribe audio to text – audio transcription

  1. Select the question “Transcription”, and set its calculation to “Transcribe audio”, based on the “Press to record” question.
  2. Leave the textbox “Transcription prompt” empty.

The prompt for audio transcription is optional. Its main use is to improve spelling accuracy for specific words, for example, to make sure that "Esri" isn't transcribed as "Ezri". For most audio, it can be left empty.

In XLSForm, use the following expression for the calculation of the question “Transcription”:

  • pulldata("@ai","audio2text",${press_to_record},"")
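
If your recordings are likely to contain product names or other terms that are easy to misspell, the optional prompt can carry them. The expression below is a minimal sketch that assumes the prompt can simply list the terms whose spelling matters; that wording is an assumption, not a documented format, so check the results against your own recordings.

```text
calculation: pulldata("@ai","audio2text",${press_to_record},"Esri, ArcGIS, Survey123")
```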

3 transcription.gif

 

 

Translate non-English text into English – text analysis

  1. Select the question “Translation”, and set its calculation to “Analyze text”, based on the “Transcription” question.
  2. In the textbox “What would you like to extract from the text”, enter the following prompt: Translate the input text into English. Only output the translation result.

Since we don’t need to translate if the respondent speaks English, we can set the “Translation” question to be visible only when “Spoken language” is not English.
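
In XLSForm, the translation step and its visibility rule could look roughly like the rows below, shown as column: value pairs. This is a sketch rather than the demo's actual form: the question names transcription, spoken_language, and translation are assumed, the language-detection prompt is illustrative, and the relevant expression assumes the detected language is stored as the plain text English.

```text
type: text
name: spoken_language
label: Spoken language
calculation: pulldata("@ai","text2text",${transcription},"Identify the language of the input text. Respond with only the language name in English, for example English or Tagalog.")

type: text
name: translation
label: Translation
calculation: pulldata("@ai","text2text",${transcription},"Translate the input text into English. Only output the translation result.")
relevant: ${spoken_language} != '' and ${spoken_language} != 'English'
```

The extra check against an empty string keeps the Translation question hidden until a language has actually been detected.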

4 translation.gif

 

 

How to get started

 

Image analysis, text analysis, and audio transcription are currently available in beta. Before you can use them in your surveys, you need to enable the new analysis tools at the organization level. As an ArcGIS Online organization administrator, configure the required settings in both your ArcGIS Online organization and the Survey123 website by following the instructions in the Survey123 analysis tools (beta) blog.

 

Frequently asked questions

 

How were these features built?

These capabilities are powered by AI models hosted in the Microsoft Azure OpenAI service. Image and text analysis use a general-purpose large language model, while audio transcription uses a dedicated speech-to-text model.

How is my data kept secure and private when I use these AI features?

We built these features to keep your data and input secure and private to you. The implementation follows ArcGIS Online data security and privacy guidelines. Specifically, the prompts you set in survey questions and the images or audio you capture in the web app:

  • Are not used to train Esri or third-party AI models
  • Are not used to improve an Esri or third-party product or service
  • Are not stored or shared
  • Are not used to improve the Survey123 web designer or the web app, even for your own use

In addition, the Microsoft Azure OpenAI service may temporarily store and process user prompts, images and audio for abuse monitoring as described at: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/abuse-monitoring

Do these features consume ArcGIS Online credits?

At the moment, these features are in beta and available for testing and evaluation purposes without any cost. However, once they become a fully supported part of the product, they are expected to incur an additional charge or consume ArcGIS Online credits. The pricing for these features has not been determined yet.

How should I write prompts to get good results?

A good prompt is clear, specific, and tells the model exactly what you need back. Think of it as instructions to a new colleague: spell out the task, the format of the answer, and any rules they should follow. Avoid vague “do everything” prompts. Here are two examples:

  • Image analysis – extract a license plate
    • Bad: “Look at the photo and tell me about the car.” - This is too broad: you might get color, make, or a long description instead of the one field you need.
    • Better (used in a calculation for a text question): “From the vehicle photo, return only the license plate number as plain text (no spaces, no extra words). If you cannot see a license plate, return UNKNOWN.”
  • Text analysis – classify comment sentiment
    • Bad: “Read the comment and analyze the sentiment in detail.” - The model may return a paragraph of explanation, which is hard to store in a single-choice field.
    • Better (used in a calculation for a select_one question): “Read the user’s comment about the software and return exactly one word from this list: positive, neutral, or negative. Do not include any explanation.” 

If you’re not sure where to start, begin with a simple, very explicit prompt, check a few real responses, and then tighten the instructions based on what you see.
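
Dropped into XLSForm calculations, the two better prompts above might look like the rows below, shown as column: value pairs. The question names vehicle_photo, license_plate, user_comment, and sentiment, and the list name sentiment_levels, are hypothetical; the sentiment list is assumed to define choices named positive, neutral, and negative so that the one-word answer matches a choice name.

```text
type: text
name: license_plate
label: License plate
calculation: pulldata("@ai","image2text",${vehicle_photo},"From the vehicle photo, return only the license plate number as plain text (no spaces, no extra words). If you cannot see a license plate, return UNKNOWN.")

type: select_one sentiment_levels
name: sentiment
label: Comment sentiment
calculation: pulldata("@ai","text2text",${user_comment},"Read the user’s comment about the software and return exactly one word from this list: positive, neutral, or negative. Do not include any explanation.")
```

A fixed fallback value such as UNKNOWN also makes it easy to filter out records that need a manual look.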

Are there situations where I should not rely on these AI features?

Do not use these features for high-stakes or safety-critical decisions, or in scenarios such as biometric identification or face recognition, medical or diagnostic imaging, low-quality inputs, or counting dense crowds or small objects. Results should always be reviewed by a human before use in sensitive or high-impact workflows.

How do I provide feedback?

We'd love to hear your thoughts on these new features, along with any feedback on how well they fit your workflow. You can provide feedback in either of the following ways:

  • In the Survey123 web app, once the result is returned, submit your feedback by clicking the thumbs-up or thumbs-down icon below the target question.
  • Share your feedback by posting in the Survey123 Beta Website Feedback forum.

 

The new AI-powered image, text, and audio analysis features in Survey123 don’t replace the way you design surveys today—they give you new ways to cut typing, reduce errors, and avoid follow-up work. We’re just getting started with these capabilities, and we’re keen to hear your feedback on how well they support your real-world surveys.

 

2 Comments
clai
Emerging Contributor

I set up audio transcription in the Survey123 web designer, and the question worked when I filled in the survey on the web. But the audio could not be transcribed to text when I filled in the survey on my iPhone. Any idea why this happens?

ZhifangWang
Esri Regular Contributor

Hi @clai,

Did you use the Survey123 field app on your iPhone? All three AI analysis features are currently only available in the Survey123 web app (browser) and are not supported by the Survey123 field app.