Is it possible to get the duration of an audio recording, captured as part of a Survey123 question, as a separate field?
No, this is not currently possible. It would require something like pulldata("@audio") (or similar) to read the details of the audio file and extract those values, and no such function is currently implemented. If you would like to see this as an enhancement in the future, I suggest you post it as an idea on the ArcGIS Ideas page for Survey123, or log an official enhancement request with Esri Support.
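In the meantime, one possible workaround is to calculate the duration outside Survey123, after the survey has been submitted. The following is only a rough sketch: it assumes you have already downloaded the audio attachment and that it is (or has been converted to) an uncompressed WAV file. The function name wav_duration_seconds is made up for illustration and is not part of Survey123 or any Esri API; it only uses Python's standard wave module.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of an uncompressed WAV file in seconds.

    Hypothetical helper for post-processing downloaded attachments;
    it does not run inside Survey123.
    """
    with wave.open(path, "rb") as wav_file:
        frame_count = wav_file.getnframes()   # total number of audio frames
        frame_rate = wav_file.getframerate()  # frames per second (sample rate)
    # Duration is the number of frames divided by the frames per second.
    return frame_count / float(frame_rate)
```

The result could then be written back to a separate numeric field on the feature by whatever update process you already use. Compressed formats would need a different library to read.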