After retrieving a speech-to-text result in XML format, you can use this method to reformat the XML into one of several supported formats.
Output format description
The transcribe API can return the ASR results in several formats. The output format is determined by the format query parameter value. Some formats provide formatting controls that modify the resulting output. Formatting controls are appended to the format type using the following structure:
{format},{control}:{value},{control}:{value},etc.
For example:
format=srt,generator:advanced,max_lines:2,line_char_limit:32
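As a sketch, the structure above can be assembled programmatically. The helper name below is illustrative, not part of the API:

```python
# Build a format query parameter value from a base format and optional
# controls, following the {format},{control}:{value},... structure above.
def build_format_param(fmt, **controls):
    parts = [fmt] + ["{}:{}".format(name, value) for name, value in controls.items()]
    return ",".join(parts)

# Reproduces the example above:
print(build_format_param("srt", generator="advanced", max_lines=2, line_char_limit=32))
# srt,generator:advanced,max_lines:2,line_char_limit:32
```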
XML
This is the default format; it is output if the format query parameter is not set or is set to xml.
recording
  segment (start, end, speaker)
    traceback
      item (type)
        orth
        samples (start, end)
        confidence
XML element description (all time values are floating-point numbers, in seconds from the beginning of the audio).
Element | Description |
---|---|
segment | One or more elements, each defining a speech segment. @start - The start time of the speech segment. @end - The end time of the speech segment. @speaker - The speaker detected for this segment. The speaker identifier is a string in the format spk_{number}_{gender}, where number ranges from 0 to the total number of speakers and gender is 'm' for male or 'f' for female. |
item | Contains a recognized item. @type - The item type; "pronunciation" for a recognized word or "punctuation" for a non-speech punctuation item. |
orth | The text (word or punctuation) associated with the item. |
samples | The time bounds within which the word was recognized. @start - The start time of the recognized word. @end - The end time of the recognized word. |
confidence | The confidence of the recognized word; a floating-point value from 0.0 to 1.0. |
XML Sample:
<?xml version="1.0" encoding="utf-8"?>
<recording>
  <segment name="43" start="0.040" end="1.440">
    <traceback name="43" type="xml">
      <item type="pronunciation">
        <orth>Can</orth>
        <confidence>1.00</confidence>
        <samples start="0.040" end="0.290" />
      </item>
      <item type="pronunciation">
        <orth>you</orth>
        <confidence>1.00</confidence>
        <samples start="0.290" end="0.390" />
      </item>
      <item type="pronunciation">
        <orth>call</orth>
        <confidence>1.00</confidence>
        <samples start="0.390" end="0.690" />
      </item>
      <item type="pronunciation">
        <orth>me</orth>
        <confidence>1.00</confidence>
        <samples start="0.690" end="0.810" />
      </item>
      <item type="pronunciation">
        <orth>a</orth>
        <confidence>1.00</confidence>
        <samples start="0.810" end="0.880" />
      </item>
      <item type="pronunciation">
        <orth>taxi</orth>
        <confidence>1.00</confidence>
        <samples start="0.880" end="1.430" />
      </item>
      <item type="punctuation">
        <orth>?</orth>
      </item>
    </traceback>
  </segment>
</recording>
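The XML can be consumed with standard tooling. A minimal sketch using Python's xml.etree.ElementTree to pull out the recognized items and their timings (the sample is trimmed to two items for brevity):

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the XML sample above.
xml_result = """<recording>
  <segment name="43" start="0.040" end="1.440">
    <traceback name="43" type="xml">
      <item type="pronunciation">
        <orth>Can</orth>
        <confidence>1.00</confidence>
        <samples start="0.040" end="0.290" />
      </item>
      <item type="punctuation">
        <orth>?</orth>
      </item>
    </traceback>
  </segment>
</recording>"""

root = ET.fromstring(xml_result)
words = []
for item in root.iter("item"):
    orth = item.findtext("orth")
    if item.get("type") == "pronunciation":
        samples = item.find("samples")
        words.append((orth, float(samples.get("start")), float(samples.get("end"))))
    else:
        words.append((orth, None, None))  # punctuation items carry no timing

print(words)  # [('Can', 0.04, 0.29), ('?', None, None)]
```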
JSON
A JSON object that encapsulates the ASR results. This output is generated if the format query parameter value is json. See the XML element description above for the meaning of the values.
JSON sample:
{
  "segments": [
    {
      "start": 0.040,
      "end": 1.430,
      "text": "Can you call me a taxi?",
      "confidence": 1.00,
      "speaker": "",
      "items": [
        {"start": 0.040, "end": 0.290, "text": "Can", "confidence": 1.00},
        {"start": 0.290, "end": 0.390, "text": "you", "confidence": 1.00},
        {"start": 0.390, "end": 0.690, "text": "call", "confidence": 1.00},
        {"start": 0.690, "end": 0.810, "text": "me", "confidence": 1.00},
        {"start": 0.810, "end": 0.880, "text": "a", "confidence": 1.00},
        {"start": 0.880, "end": 1.430, "text": "taxi?", "confidence": 1.00}
      ]
    }
  ]
}
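A minimal sketch of consuming this JSON with Python's standard library; it reconstructs the segment text from the individual items:

```python
import json

# Abbreviated version of the JSON sample above.
json_result = """{
  "segments": [
    {"start": 0.040, "end": 1.430,
     "text": "Can you call me a taxi?",
     "confidence": 1.00, "speaker": "",
     "items": [
       {"start": 0.040, "end": 0.290, "text": "Can", "confidence": 1.00},
       {"start": 0.290, "end": 0.390, "text": "you", "confidence": 1.00},
       {"start": 0.390, "end": 0.690, "text": "call", "confidence": 1.00},
       {"start": 0.690, "end": 0.810, "text": "me", "confidence": 1.00},
       {"start": 0.810, "end": 0.880, "text": "a", "confidence": 1.00},
       {"start": 0.880, "end": 1.430, "text": "taxi?", "confidence": 1.00}
     ]}
  ]
}"""

data = json.loads(json_result)
segment = data["segments"][0]
# Joining the item texts with spaces recovers the segment-level text.
rebuilt = " ".join(item["text"] for item in segment["items"])
print(rebuilt)  # Can you call me a taxi?
```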
SRT
A UTF-8 SubRip-formatted file. This output is generated if the format query parameter value is srt.
The SRT format has the following formatting controls:
Format control | Description |
---|---|
generator | Selects the SRT generator. "simple" - Use the simple SRT generator, which does not apply any advanced language analysis to the SRT formatting; this is the default if not specified. "advanced" - Use the advanced SRT generator, which analyzes the text structure to create optimized SRT frames. If "advanced_srt_generation" isn't supported for the provided language, the request falls back to the "simple" generator. NOTE: The transcription must include punctuation for the advanced generator to produce correct results. |
line_char_limit | A positive integer specifying the maximum number of characters per line. |
max_lines | A positive integer specifying the maximum number of lines per frame. |
 | A real value specifying the maximum duration of a frame in seconds, e.g. 10.4. |
 | A real value specifying the minimum duration of a single frame in seconds. |
 | A real value specifying the minimum amount of time between frames in seconds. |
 | A real value specifying the maximum amount of silence before the current sentence is split. |
The following format controls apply if using the advanced generator.
Format control | description |
---|---|
new_speaker_symbol | E.g. '>>' or '-'. Will be added in case of speaker changes when input format is XML and speaker labels are available. Default is to not add a symbol. |
max_close_gap_duration | Gaps shorter than this duration (in seconds) will be closed to create back-to-back subtitles by increasing the first subtitle's end time. |
max_reading_speed | The maximum number of characters per second in a subtitle block (soft constraint). In the ASR use case, achieved by increasing end time of subtitle blocks. For translation, where timings are not changed, the constraint influences the actual segmentation instead, the number of allowed characters per block will be accounted for (if possible). |
use_multi_sentence_lines | Enforces a specific behavior of when to put subsequent sentences onto the same line. If set to true, put short sentences into the same line as preceding or following sentence wherever possible (see also 'multi_sentence_lines_max_pause'). If set to false, keep all sentences separate, this may however lead to short blocks that violate the minimum duration in some cases. If not set, no specific behavior is enforced and the algorithm tries to make optimal decisions per individual case, possibly depending on other configuration fields and language-specific defaults. |
use_multi_sentence_blocks | Enforces a specific behavior of when to put several sentences into one block. If set to true, put multiple sentences into one block separated by line breaks wherever possible (see also 'multi_sentence_blocks_max_pause'), if spoken by the same speaker. If set to false, start each sentence in a new block. If not set, no specific behavior is enforced and the algorithm tries to make optimal decisions per individual case, possibly depending on other configuration fields and language-specific defaults. If no speaker ids are available (via speaker diarization), all sentences are assumed to be spoken by the same speaker. |
use_multi_speaker_blocks | If set, implies 'use_multi_sentence_blocks' but allows putting multiple sentences into the same block even if spoken by different speakers. See also the 'dialogue_dash' option. |
multi_sentence_lines_max_pause | Only allow putting multiple sentences into one line according to 'use_multi_sentence_lines' if the pause between the sentences (in seconds) is not longer than this value. |
multi_sentence_blocks_max_pause | Only allow putting multiple sentences into one block according to 'use_multi_sentence_blocks' if the pause between the sentences (in seconds) is not longer than this value. |
dialogue_dash | If 'use_multi_speaker_blocks' is set, use this string to indicate speaker changes within multi-speaker blocks. Spacing-sensitive; for example, set to "- " to use a hyphen followed by a space. Use an empty string to add no symbol at all. Differs from 'new_speaker_symbol' in that the symbol is not added at every speaker change, but only where necessary to distinguish speakers within a multi-speaker block. |
SRT Sample:
1
00:00:00,040 --> 00:00:01,430
Can you call me a taxi?
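SubRip timestamps use HH:MM:SS,mmm notation. A small sketch converting them back to seconds for comparison with the XML/JSON timings:

```python
def srt_time_to_seconds(timestamp):
    # "HH:MM:SS,mmm" -> seconds, e.g. "00:00:01,430" -> 1.43
    hms, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0

# Timings of the sample frame above:
start = srt_time_to_seconds("00:00:00,040")
end = srt_time_to_seconds("00:00:01,430")
```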
TXT
A simple plain-text UTF-8 file which can include line timing information. This output is generated if the format query parameter value is txt.
The TXT format has the following formatting controls:
Format control | Description |
---|---|
txt_time_format | Selects the format used for the line timing information. If not specified, the default is a floating-point value in seconds. |
txt_separator | The text separator to use when including timing information. The default is <tab>. |
TXT Sample:
0.040 1.430 Can you call me a taxi?
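A sketch of reading one TXT output line under the default settings (floating-point seconds, tab separator); the sample above renders the tabs as spaces:

```python
# One line of TXT output with the default <tab> separator.
line = "0.040\t1.430\tCan you call me a taxi?"

# Split on the first two tabs only, so tabs inside the text (if any) survive.
start_field, end_field, text = line.split("\t", 2)
start, end = float(start_field), float(end_field)
print(start, end, text)  # 0.04 1.43 Can you call me a taxi?
```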
ILS (subtitle)
A JSON format that presents the transcript in frames, where each frame consists of one or more lines of text. Like the SRT format, the output can be controlled to create frames that adhere to certain criteria. This output is generated if the format query parameter value is ils.
The ILS format accepts the same format controls as the SRT format above.
The ILS JSON structure
{
  "subtitles": [
    {
      "index": 1,
      "startTime": "0.140s",
      "stopTime": "2.327s",
      "lines": [
        {
          "line": "This is line 1 of frame 1.",
          "speakerId": "spk_1_m"
        },
        {
          "line": "This is line 2 of frame 1.",
          "speakerId": "spk_1_m"
        }
      ]
    },
    {
      "index": 2,
      "startTime": "2.410s",
      "stopTime": "4.627s",
      "lines": [
        {
          "line": "This is line 1 of frame 2.",
          "speakerId": "spk_1_f"
        },
        {
          "line": "This is line 2 of frame 2.",
          "speakerId": "spk_1_f"
        }
      ]
    }
  ]
}
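ILS times carry a trailing "s" unit. A sketch that strips the unit and flattens each frame into a (start, stop, lines) tuple; the structure is trimmed to one frame for brevity:

```python
# Abbreviated version of the ILS structure above.
ils = {
    "subtitles": [
        {"index": 1, "startTime": "0.140s", "stopTime": "2.327s",
         "lines": [{"line": "This is line 1 of frame 1.", "speakerId": "spk_1_m"}]}
    ]
}

def ils_time_to_seconds(value):
    # "0.140s" -> 0.14; the trailing "s" is the seconds unit.
    return float(value.rstrip("s"))

frames = [
    (ils_time_to_seconds(sub["startTime"]),
     ils_time_to_seconds(sub["stopTime"]),
     [entry["line"] for entry in sub["lines"]])
    for sub in ils["subtitles"]
]
print(frames[0][0], frames[0][1])  # 0.14 2.327
```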