Reformat ASR results

After retrieving a speech-to-text result in XML format, you can use this method to reformat the XML into one of several supported formats.

Output format description

The transcribe API can return the ASR results in several formats. The output format is determined by the value of the format query parameter. Some formats provide formatting controls that modify the output. Formatting controls are appended to the format type using the following structure:

{format},{control}:{value},{control}:{value},etc.

For example:

format=srt,generator:advanced,max_lines:2,line_char_limit:32
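The structure above can be assembled programmatically. The following is a minimal sketch; the helper name build_format_param is hypothetical and not part of the API. It only builds the parameter value, and does not send a request.

```python
def build_format_param(fmt, controls=None):
    """Join a format type and its controls as {format},{control}:{value},..."""
    parts = [fmt]
    # Controls are appended in insertion order, each as control:value.
    for name, value in (controls or {}).items():
        parts.append(f"{name}:{value}")
    return ",".join(parts)


value = build_format_param(
    "srt", {"generator": "advanced", "max_lines": 2, "line_char_limit": 32}
)
print(value)  # srt,generator:advanced,max_lines:2,line_char_limit:32
```

If the value is placed in a URL by hand, the commas and colons may need percent-encoding depending on the HTTP client used.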

XML

XML is the default format. It is returned if the format query parameter is not set or is set to xml.

recording
	segment	(start, end, speaker)
		traceback
			item	(type)
				orth 
				samples	(start, end)
				confidence

XML element description (all time values are floating point numbers in seconds from the beginning of the audio).

Element	Description
segment	One or more elements, each defining a speech segment.
	@start - The start time of the speech segment.
	@end - The end time of the speech segment.
	@speaker - The speaker detected for this segment. The speaker identifier is a string in the format spk{number}{gender}, where number is a value from 0 to the total number of speakers and gender is ‘m’ for male or ‘f’ for female.
item	Contains a recognized item.
	@type - The item type: “pronunciation” for a recognized word, or “punctuation” for text punctuation, which is a non-speech item.
orth	The text (word or punctuation) associated with the item.
samples	The time bounds within which the word was recognized.
	@start - The start time of the recognized word.
	@end - The end time of the recognized word.
confidence	The confidence of the recognized word, a floating-point value from 0.0 to 1.0.

XML Sample:

<?xml version="1.0" encoding="utf-8"?>
<recording>
  <segment name="43" start="0.040" end="1.440">
    <traceback name="43" type="xml">
      <item type="pronunciation">
        <orth>Can</orth>
        <confidence>1.00</confidence>
        <samples start="0.040" end="0.290" />
      </item>
      <item type="pronunciation">
        <orth>you</orth>
        <confidence>1.00</confidence>
        <samples start="0.290" end="0.390" />
      </item>
      <item type="pronunciation">
        <orth>call</orth>
        <confidence>1.00</confidence>
        <samples start="0.390" end="0.690" />
      </item>
      <item type="pronunciation">
        <orth>me</orth>
        <confidence>1.00</confidence>
        <samples start="0.690" end="0.810" />
      </item>
      <item type="pronunciation">
        <orth>a</orth>
        <confidence>1.00</confidence>
        <samples start="0.810" end="0.880" />
      </item>
      <item type="pronunciation">
        <orth>taxi</orth>
        <confidence>1.00</confidence>
        <samples start="0.880" end="1.430" />
      </item>
      <item type="punctuation">
        <orth>?</orth>
      </item>
    </traceback>
  </segment>
</recording>
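As an illustration of reading this structure, the sketch below walks the item elements with Python's standard xml.etree.ElementTree module. The function name read_items and the trimmed two-item sample are illustrative assumptions; punctuation items carry no samples or confidence element in the sample above, so those fields are returned as None.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above (two items only, declaration omitted).
SAMPLE = """<recording>
  <segment name="43" start="0.040" end="1.440">
    <traceback name="43" type="xml">
      <item type="pronunciation">
        <orth>Can</orth>
        <confidence>1.00</confidence>
        <samples start="0.040" end="0.290" />
      </item>
      <item type="punctuation">
        <orth>?</orth>
      </item>
    </traceback>
  </segment>
</recording>"""


def read_items(xml_text):
    """Return one dict per <item>; timing and confidence may be None."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        samples = item.find("samples")
        confidence = item.findtext("confidence")
        items.append({
            "type": item.get("type"),
            "text": item.findtext("orth"),
            "start": float(samples.get("start")) if samples is not None else None,
            "end": float(samples.get("end")) if samples is not None else None,
            "confidence": float(confidence) if confidence is not None else None,
        })
    return items


for entry in read_items(SAMPLE):
    print(entry)
```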

JSON

A JSON object that encapsulates the ASR results. The output is generated if the format query parameter value is json. See the XML element description above for the meaning of the values.

JSON sample:

{
  "segments": [
    {
      "start": 0.040,
      "end": 1.430,
      "text": "Can you call me a taxi?",
      "confidence": 1.00,
      "speaker": "",
      "items": [
        {"start": 0.040, "end": 0.290, "text": "Can", "confidence": 1.00},
        {"start": 0.290, "end": 0.390, "text": "you", "confidence": 1.00},
        {"start": 0.390, "end": 0.690, "text": "call", "confidence": 1.00},
        {"start": 0.690, "end": 0.810, "text": "me", "confidence": 1.00},
        {"start": 0.810, "end": 0.880, "text": "a", "confidence": 1.00},
        {"start": 0.880, "end": 1.430, "text": "taxi?", "confidence": 1.00}
      ]
    }
  ]
}
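A short sketch of consuming this output with Python's json module follows. Judging only from the sample, punctuation appears to be attached to the preceding word ("taxi?") and the segment text matches the space-joined item texts; both are observations about this one sample, not documented guarantees. The function name segment_words is hypothetical.

```python
import json

# The JSON sample from this section, condensed onto fewer lines.
SAMPLE = """
{"segments": [{"start": 0.040, "end": 1.430,
  "text": "Can you call me a taxi?", "confidence": 1.00, "speaker": "",
  "items": [
    {"start": 0.040, "end": 0.290, "text": "Can", "confidence": 1.00},
    {"start": 0.290, "end": 0.390, "text": "you", "confidence": 1.00},
    {"start": 0.390, "end": 0.690, "text": "call", "confidence": 1.00},
    {"start": 0.690, "end": 0.810, "text": "me", "confidence": 1.00},
    {"start": 0.810, "end": 0.880, "text": "a", "confidence": 1.00},
    {"start": 0.880, "end": 1.430, "text": "taxi?", "confidence": 1.00}
  ]}]}
"""


def segment_words(segment):
    """Return the item texts of one segment, in order."""
    return [item["text"] for item in segment["items"]]


result = json.loads(SAMPLE)
for segment in result["segments"]:
    words = segment_words(segment)
    print(segment["start"], segment["end"], " ".join(words))
```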

SRT

A UTF-8 SubRip-formatted file. The output is generated if the format query parameter value is srt.
The SRT format has the following formatting controls:

Format control	Description
generator	Selects the SRT generator.
	“simple” - Use the simple SRT generator, which does not use any advanced language analysis for SRT formatting. This is the default if not specified.
	“advanced” - Use the advanced SRT generator, which analyzes the text structure to create optimized SRT frames.
	If “advanced_srt_generation” isn’t supported for the provided language, the request will default to the “simple” generator.
	NOTE: The transcription must include punctuation for this to generate correct results.
line_char_limit	A positive integer value specifying the maximum number of characters per line.
max_lines	A positive integer value specifying the maximum number of lines per frame.
max_duration	A real value specifying the maximum duration of a frame in seconds. E.g. 10.4
min_duration	A real value specifying the minimum duration of a single frame in seconds.
min_frame_spacing	A real value specifying the minimum amount of time between frames in seconds.
maximum_pause_within_sentence	An integer value specifying the maximum amount of silence before the current sentence is split.

SRT Sample:

1
00:00:00,040 --> 00:00:01,430
Can you call me a taxi?
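To show how the floating-point second values used elsewhere in this document relate to the SRT timing line, here is a minimal sketch that formats seconds as hh:mm:ss,mmm and prints one frame. The function names are hypothetical; this is not the service's generator and applies none of the framing controls.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (hh:mm:ss,mmm)."""
    total_ms = round(seconds * 1000)  # work in integer milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_frame(index, start, end, lines):
    """Build one SRT frame: index, timing line, then the text lines."""
    timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
    return "\n".join([str(index), timing, *lines]) + "\n"


print(srt_frame(1, 0.040, 1.430, ["Can you call me a taxi?"]))
```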

TXT

A plain-text UTF-8 file that can include line timing information. The output is generated if the format query parameter value is txt.

The TXT format has the following formatting controls:

Format control	Description
txt_time_format	Selects the format used for the line timing information:
	clock - The start and end times are formatted as hh:mm:ss:ff (hours:minutes:seconds:milliseconds).
	none - The text is untimed.
	unset - If not specified, the times are floating-point values in seconds.
txt_separator	The text separator to use when including timing information. The default is a tab character.

TXT Sample:

0.040	1.430	Can you call me a taxi?
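A line in this form can be split back into its fields. The sketch below assumes the timing is in the floating-point seconds form shown in the sample and that the fields are separated by a tab, as they are in the sample; the function name parse_txt_line is hypothetical.

```python
def parse_txt_line(line, separator="\t"):
    """Split a timed TXT line into (start, end, text).

    Assumes float-second timing; the text itself may contain the separator,
    so only the first two separators are used for splitting.
    """
    start, end, text = line.rstrip("\n").split(separator, 2)
    return float(start), float(end), text


print(parse_txt_line("0.040\t1.430\tCan you call me a taxi?"))
```

If a different txt_separator is requested, pass the same string as the separator argument.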

ILS (subtitle)

A JSON format that presents the transcript as frames, where each frame consists of one or more lines of text. As with the SRT format, the output can be controlled so that the frames adhere to certain criteria. The output is generated if the format query parameter value is ils.
The ILS format has the following framing controls:

Format control	Description
line_char_limit	A positive integer value specifying the maximum number of characters per line.
max_lines	A positive integer value specifying the maximum number of lines per frame.
max_duration	A real value specifying the maximum duration of a frame in seconds. E.g. 10.4
min_duration	A real value specifying the minimum duration of a single frame in seconds.
min_frame_spacing	A real value specifying the minimum amount of time between frames in seconds.
maximum_pause_within_sentence	An integer value specifying the maximum amount of silence before the current sentence is split.

The ILS JSON structure

{
    "subtitles": [
        {
            "index": 1,
            "startTime": "0.140s",
            "stopTime": "2.327s",
            "lines": [
                {
                    "line": "This is line 1 of frame 1.",
                    "speakerId": "spk_1_m"
                },
                {
                    "line": "This is line 2 of frame 1.",
                    "speakerId": "spk_1_m"
                }
            ]
        },
        {
            "index": 2,
            "startTime": "2.410s",
            "stopTime": "4.627s",
            "lines": [
                {
                    "line": "This is line 1 of frame 2.",
                    "speakerId": "spk_1_f"
                },
                {
                    "line": "This is line 2 of frame 2.",
                    "speakerId": "spk_1_f"
                }
            ]
        }
    ]
}
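The time values in this structure are strings with an "s" suffix rather than numbers. The following sketch, with hypothetical function names and a shortened one-frame sample, shows one way to convert them to float seconds and iterate over the frames.

```python
import json

# Shortened, illustrative ILS document with a single frame.
SAMPLE = """
{"subtitles": [{"index": 1, "startTime": "0.140s", "stopTime": "2.327s",
  "lines": [
    {"line": "This is line 1 of frame 1.", "speakerId": "spk_1_m"},
    {"line": "This is line 2 of frame 1.", "speakerId": "spk_1_m"}
  ]}]}
"""


def parse_ils_time(value):
    """Convert an ILS time string such as "0.140s" to float seconds."""
    return float(value[:-1] if value.endswith("s") else value)


def frames(ils):
    """Yield (index, start, stop, text) for each subtitle frame."""
    for sub in ils["subtitles"]:
        text = "\n".join(entry["line"] for entry in sub["lines"])
        yield (
            sub["index"],
            parse_ils_time(sub["startTime"]),
            parse_ils_time(sub["stopTime"]),
            text,
        )


for frame in frames(json.loads(SAMPLE)):
    print(frame)
```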
