After retrieving a speech-to-text result in XML format, you can use this method to reformat the XML into one of several supported formats.
Output format description
The transcribe API can return the ASR results in several formats. The output format is determined by the value of the format query parameter. Some formats accept formatting controls that modify the output. Formatting controls are appended to the format type using the following structure:
{format},{control}:{value},{control}:{value},etc.
For example:
format=srt,generator:advanced,max_lines:2,line_char_limit:32
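The structure above can be assembled programmatically. The following is a minimal sketch; `build_format_param` is a hypothetical helper name, not part of the API, and the sketch assumes the service accepts a percent-encoded value in the query string.

```python
from urllib.parse import urlencode


def build_format_param(fmt, **controls):
    """Join a format type and its controls as {format},{control}:{value},..."""
    parts = [fmt]
    parts.extend(f"{name}:{value}" for name, value in controls.items())
    return ",".join(parts)


value = build_format_param(
    "srt", generator="advanced", max_lines=2, line_char_limit=32
)
print(value)  # srt,generator:advanced,max_lines:2,line_char_limit:32

# Percent-encode the value when placing it in a query string (assumption:
# the service decodes it in the usual way).
query = urlencode({"format": value})
```

Calling the helper with only a format type, such as `build_format_param("xml")`, returns the bare format name with no controls.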
XML
This is the default format. It is returned when the format query parameter is not set or is set to xml.
XML structure:
recording
  segment (start, end, speaker)
    traceback
      item (type)
        orth
        samples (start, end)
        confidence
XML element description (all time values are floating point numbers in seconds from the beginning of the audio).
Element | description |
---|---|
segment | A speech segment; a recording contains one or more. @start - The start time of the speech segment. @end - The end time of the speech segment. @speaker - The speaker detected for this segment. The speaker identifier is a string in the format spk{number}{gender}, where number is a value from 0 to the total number of speakers and gender is ‘m’ for male or ‘f’ for female. |
item | A recognized item. @type - The item type: “pronunciation” for a recognized word, or “punctuation” for text punctuation, which is a non-speech item. |
orth | Text (word or punctuation) associated with the item. |
samples | Identifies the time bounds where this word was recognized. @start - The start time of the recognized word. @end - The end time of the recognized word. |
confidence | A confidence value of the recognized word which is a floating-point value from 0.0 to 1.0. |
XML Sample:
<?xml version="1.0" encoding="utf-8"?>
<recording>
  <segment name="43" start="0.040" end="1.440">
    <traceback name="43" type="xml">
      <item type="pronunciation">
        <orth>Can</orth>
        <confidence>1.00</confidence>
        <samples start="0.040" end="0.290" />
      </item>
      <item type="pronunciation">
        <orth>you</orth>
        <confidence>1.00</confidence>
        <samples start="0.290" end="0.390" />
      </item>
      <item type="pronunciation">
        <orth>call</orth>
        <confidence>1.00</confidence>
        <samples start="0.390" end="0.690" />
      </item>
      <item type="pronunciation">
        <orth>me</orth>
        <confidence>1.00</confidence>
        <samples start="0.690" end="0.810" />
      </item>
      <item type="pronunciation">
        <orth>a</orth>
        <confidence>1.00</confidence>
        <samples start="0.810" end="0.880" />
      </item>
      <item type="pronunciation">
        <orth>taxi</orth>
        <confidence>1.00</confidence>
        <samples start="0.880" end="1.430" />
      </item>
      <item type="punctuation">
        <orth>?</orth>
      </item>
    </traceback>
  </segment>
</recording>
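A client might read this structure with Python's standard `xml.etree.ElementTree` module. The sketch below uses a shortened copy of the sample; `read_items` is a hypothetical helper, and it assumes that punctuation items carry no samples or confidence element, as in the sample.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample (two words plus punctuation).
SAMPLE = """<recording>
  <segment name="43" start="0.040" end="1.440">
    <traceback name="43" type="xml">
      <item type="pronunciation">
        <orth>Can</orth>
        <confidence>1.00</confidence>
        <samples start="0.040" end="0.290" />
      </item>
      <item type="pronunciation">
        <orth>you</orth>
        <confidence>1.00</confidence>
        <samples start="0.290" end="0.390" />
      </item>
      <item type="punctuation">
        <orth>?</orth>
      </item>
    </traceback>
  </segment>
</recording>"""


def read_items(xml_text):
    """Return one dict per <item>; timing and confidence are None when absent."""
    root = ET.fromstring(xml_text)
    items = []
    for segment in root.iter("segment"):
        for item in segment.iter("item"):
            samples = item.find("samples")
            confidence = item.findtext("confidence")
            items.append({
                "type": item.get("type"),
                "text": item.findtext("orth"),
                "start": float(samples.get("start")) if samples is not None else None,
                "end": float(samples.get("end")) if samples is not None else None,
                "confidence": float(confidence) if confidence is not None else None,
                "speaker": segment.get("speaker"),  # None if the attribute is absent
            })
    return items


items = read_items(SAMPLE)
for entry in items:
    print(entry["type"], entry["text"], entry["start"], entry["end"])
```

Keeping punctuation as separate entries mirrors the XML layout, so the caller decides how to attach it to the preceding word.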
JSON
A JSON object that encapsulates the ASR results. This output is generated when the format query parameter value is json. See the “XML structure” for a description of the values.
JSON sample:
{
  "segments": [
    {
      "start": 0.040,
      "end": 1.430,
      "text": "Can you call me a taxi?",
      "confidence": 1.00,
      "speaker": "",
      "items": [
        {"start": 0.040, "end": 0.290, "text": "Can", "confidence": 1.00},
        {"start": 0.290, "end": 0.390, "text": "you", "confidence": 1.00},
        {"start": 0.390, "end": 0.690, "text": "call", "confidence": 1.00},
        {"start": 0.690, "end": 0.810, "text": "me", "confidence": 1.00},
        {"start": 0.810, "end": 0.880, "text": "a", "confidence": 1.00},
        {"start": 0.880, "end": 1.430, "text": "taxi?", "confidence": 1.00}
      ]
    }
  ]
}
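The JSON output can be consumed with the standard `json` module. The sketch below loads the sample, rebuilds the segment text from its items, and filters items by confidence; `segment_text` and `low_confidence_items` are hypothetical helper names. In the sample, punctuation appears attached to the preceding word ("taxi?"), so joining the items with spaces reproduces the segment text; whether that holds for every response is an assumption.

```python
import json

SAMPLE = """
{
  "segments": [
    {"start": 0.040, "end": 1.430, "text": "Can you call me a taxi?",
     "confidence": 1.00, "speaker": "",
     "items": [
       {"start": 0.040, "end": 0.290, "text": "Can", "confidence": 1.00},
       {"start": 0.290, "end": 0.390, "text": "you", "confidence": 1.00},
       {"start": 0.390, "end": 0.690, "text": "call", "confidence": 1.00},
       {"start": 0.690, "end": 0.810, "text": "me", "confidence": 1.00},
       {"start": 0.810, "end": 0.880, "text": "a", "confidence": 1.00},
       {"start": 0.880, "end": 1.430, "text": "taxi?", "confidence": 1.00}
     ]}
  ]
}
"""


def segment_text(segment):
    """Rebuild a segment's text by joining its item texts with spaces."""
    return " ".join(item["text"] for item in segment["items"])


def low_confidence_items(result, threshold):
    """Collect every item whose confidence is below the threshold."""
    return [
        item
        for segment in result["segments"]
        for item in segment["items"]
        if item["confidence"] < threshold
    ]


result = json.loads(SAMPLE)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment_text(segment))
```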
SRT
A UTF-8 SubRip-formatted file. This output is generated when the format query parameter value is srt.
The SRT format has the following formatting controls:
Format control | description |
---|---|
generator | Selects the SRT generator. “simple” – Uses the simple SRT generator, which does not apply any advanced language analysis to SRT formatting. This is the default if not specified. “advanced” – Uses the advanced SRT generator, which analyzes the text structure to create optimized SRT frames. If “advanced_srt_generation” isn’t supported for the provided language, the request falls back to the “simple” generator. NOTE: The transcription must include punctuation for the advanced generator to produce correct results. |
line_char_limit | a positive integer value specifying the maximum number of characters per line. |
max_lines | a positive integer value specifying the maximum number of lines per frame. |
max_duration | a real value specifying the maximum duration of a frame in seconds. E.g. 10.4 |
min_duration | a real value specifying the minimum duration of a single frame in seconds. |
min_frame_spacing | a real value specifying the minimum amount of time between frames in seconds. |
maximum_pause_within_sentence | an integer value specifying the maximum amount of silence before the current sentence is split. |
SRT Sample:
1
00:00:00,040 --> 00:00:01,430
Can you call me a taxi?
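The SRT timestamps correspond to the floating-point seconds used in the XML and JSON outputs. The sketch below shows that conversion for a single frame; it is an illustration of the timestamp layout only, not the service's SRT generator, and `srt_timestamp` and `srt_frame` are hypothetical helper names.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as hh:mm:ss,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_frame(index, start, end, text):
    """Render one SRT frame: index line, timing line, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


print(srt_frame(1, 0.040, 1.430, "Can you call me a taxi?"))
```

Working in whole milliseconds before splitting into fields avoids floating-point drift in the seconds and milliseconds parts.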
TXT
A simple plain-text UTF-8 file that can include line timing information. This output is generated when the format query parameter value is txt.
The TXT format has the following formatting controls:
Format control | description |
---|---|
txt_time_format | Selects the format used for the line timing information. clock - The start and end times are formatted as hh:mm:ss:ff (hours:minutes:seconds:milliseconds). none - The text is untimed. unset - If not specified, the default is a floating-point value in seconds. |
txt_separator | Text separator to use when including timing information. The default is . |
TXT Sample:
0.040 1.430 Can you call me a taxi?
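A line in the default timing format can be split back into its start time, end time, and text. The sketch below is hedged: `parse_txt_line` is a hypothetical helper, and it assumes the separator is whitespace unless one is passed explicitly, matching what the sample appears to show. It does not handle the clock or none time formats.

```python
def parse_txt_line(line, separator=None):
    """Split a timed TXT line into (start, end, text).

    separator=None splits on any run of whitespace (an assumption about the
    default); pass the txt_separator value if one was set in the request.
    """
    start, end, text = line.split(separator, 2)
    return float(start), float(end), text


start, end, text = parse_txt_line("0.040 1.430 Can you call me a taxi?")
print(start, end, text)
```

Limiting the split to two separators keeps any separator characters inside the transcript text intact.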
ILS (subtitle)
A JSON format that presents the transcript as frames, where each frame consists of one or more lines of text. Like the SRT format, the output can be controlled so that the frames adhere to certain criteria. This output is generated when the format query parameter value is ils.
The ILS format has the following framing controls:
Format control | description |
---|---|
line_char_limit | a positive integer value specifying the maximum number of characters per line. |
max_lines | a positive integer value specifying the maximum number of lines per frame. |
max_duration | a real value specifying the maximum duration of a frame in seconds. E.g. 10.4 |
min_duration | a real value specifying the minimum duration of a single frame in seconds. |
min_frame_spacing | a real value specifying the minimum amount of time between frames in seconds. |
maximum_pause_within_sentence | an integer value specifying the maximum amount of silence before the current sentence is split. |
The ILS JSON structure
{
  "subtitles": [
    {
      "index": 1,
      "startTime": "0.140s",
      "stopTime": "2.327s",
      "lines": [
        {
          "line": "This is line 1 of frame 1.",
          "speakerId": "spk_1_m"
        },
        {
          "line": "This is line 2 of frame 1.",
          "speakerId": "spk_1_m"
        }
      ]
    },
    {
      "index": 2,
      "startTime": "2.410s",
      "stopTime": "4.627s",
      "lines": [
        {
          "line": "This is line 1 of frame 2.",
          "speakerId": "spk_1_f"
        },
        {
          "line": "This is line 2 of frame 2.",
          "speakerId": "spk_1_f"
        }
      ]
    }
  ]
}
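A client can walk the ILS frames and check them against the framing controls it requested. The sketch below uses a reduced copy of the structure with one line per frame; `parse_seconds` and `frames_exceeding` are hypothetical helper names, and it assumes time values always end in "s" as they do in the sample.

```python
import json

# Reduced copy of the structure: two frames, one line each.
SAMPLE = """
{
  "subtitles": [
    {"index": 1, "startTime": "0.140s", "stopTime": "2.327s",
     "lines": [{"line": "This is line 1 of frame 1.", "speakerId": "spk_1_m"}]},
    {"index": 2, "startTime": "2.410s", "stopTime": "4.627s",
     "lines": [{"line": "This is line 1 of frame 2.", "speakerId": "spk_1_f"}]}
  ]
}
"""


def parse_seconds(value):
    """Convert a time string such as '0.140s' to a float number of seconds."""
    return float(value.rstrip("s"))


def frames_exceeding(ils, max_lines, line_char_limit):
    """Return the indexes of frames that break either limit."""
    bad = []
    for frame in ils["subtitles"]:
        lines = [entry["line"] for entry in frame["lines"]]
        too_many = len(lines) > max_lines
        too_long = any(len(line) > line_char_limit for line in lines)
        if too_many or too_long:
            bad.append(frame["index"])
    return bad


ils = json.loads(SAMPLE)
for frame in ils["subtitles"]:
    duration = parse_seconds(frame["stopTime"]) - parse_seconds(frame["startTime"])
    print(frame["index"], round(duration, 3))

print(frames_exceeding(ils, max_lines=2, line_char_limit=32))
```

The same loop can be extended to compare frame durations against max_duration and min_duration if those controls were set.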