Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
This article documents the image generation and audio (speech) data plane inference REST API operations for Azure OpenAI in the 2024-10-21 GA release. For chat completions, embeddings, completions, and all other operations, see the official Azure OpenAI REST API reference.
API specs
Managing and interacting with Azure OpenAI models and resources is divided across three primary API surfaces:
- Control plane
- Data plane - authoring
- Data plane - inference
Each API surface/specification encapsulates a different set of Azure OpenAI capabilities. Each API has its own unique set of preview and stable/generally available (GA) API releases. Preview releases currently tend to follow a monthly cadence.
Important
There is now a new preview inference API. Learn more in our API lifecycle guide.
| API | Latest preview release | Latest GA release | Specifications | Description |
|---|---|---|---|---|
| Control plane | 2025-07-01-preview |
2025-06-01 |
Spec files | The control plane API is used for operations like creating resources, model deployment, and other higher level resource management tasks. The control plane also governs what is possible to do with capabilities like Azure Resource Manager, Bicep, Terraform, and Azure CLI. |
| Data plane | v1 preview |
v1 |
Spec files | The data plane API controls inference and authoring operations. |
Authentication
Azure OpenAI provides two methods for authentication. You can use either API Keys or Microsoft Entra ID.
API Key authentication: For this type of authentication, all API requests must include the API Key in the
api-keyHTTP header. The Quickstart provides guidance for how to make calls with this type of authentication.Microsoft Entra ID authentication: You can authenticate an API call using a Microsoft Entra token. Authentication tokens are included in a request as the
Authorizationheader. The token provided must be preceded byBearer, for exampleBearer YOUR_AUTH_TOKEN. You can read our how-to guide on authenticating with Microsoft Entra ID.
REST API versioning
The service APIs are versioned using the api-version query parameter. All versions follow the YYYY-MM-DD date structure. For example:
POST https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/chat/completions?api-version=2024-06-01
Data plane inference
The rest of this article covers the image and audio operations in the GA release of the Azure OpenAI data plane inference specification, 2024-10-21.
For the preview image and audio operations, see the preview image and audio REST API reference.
Transcriptions - Create
POST https://{endpoint}/openai/deployments/{deployment-id}/audio/transcriptions?api-version=2024-10-21
Transcribes audio into the input language.
URI Parameters
| Name | In | Required | Type | Description |
|---|---|---|---|---|
| endpoint | path | Yes | string url |
Supported Azure OpenAI endpoints (protocol and hostname, for example: https://aoairesource.openai.azure.com. Replace "aoairesource" with your Azure OpenAI resource name). https://{your-resource-name}.openai.azure.com |
| deployment-id | path | Yes | string | Deployment ID of the speech to text model. For information about supported models, see [/azure/ai-foundry/openai/concepts/models#audio-models]. |
| api-version | query | Yes | string | API version |
Request Header
| Name | Required | Type | Description |
|---|---|---|---|
| api-key | True | string | Provide Azure OpenAI API key here |
Request Body
Content-Type: multipart/form-data
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| file | string | The audio file object to transcribe. | Yes | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. | No | |
| response_format | audioResponseFormat | Defines the format of the output. | No | |
| temperature | number | The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit. | No | 0 |
| language | string | The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency. | No |
Responses
Status Code: 200
Description: OK
| Content-Type | Type | Description |
|---|---|---|
| application/json | audioResponse or audioVerboseResponse | |
| text/plain | string | Transcribed text in the output format (when response_format was one of text, vtt or srt). |
Examples
Example
Gets transcribed text and associated metadata from provided spoken audio data.
POST https://{endpoint}/openai/deployments/{deployment-id}/audio/transcriptions?api-version=2024-10-21
Responses: Status Code: 200
{
"body": {
"text": "A structured object when requesting json or verbose_json"
}
}
Example
Gets transcribed text and associated metadata from provided spoken audio data.
POST https://{endpoint}/openai/deployments/{deployment-id}/audio/transcriptions?api-version=2024-10-21
"---multipart-boundary\nContent-Disposition: form-data; name=\"file\"; filename=\"file.wav\"\nContent-Type: application/octet-stream\n\nRIFF..audio.data.omitted\n---multipart-boundary--"
Responses: Status Code: 200
{
"type": "string",
"example": "plain text when requesting text, srt, or vtt"
}
Translations - Create
POST https://{endpoint}/openai/deployments/{deployment-id}/audio/translations?api-version=2024-10-21
Transcribes and translates input audio into English text.
URI Parameters
| Name | In | Required | Type | Description |
|---|---|---|---|---|
| endpoint | path | Yes | string url |
Supported Azure OpenAI endpoints (protocol and hostname, for example: https://aoairesource.openai.azure.com. Replace "aoairesource" with your Azure OpenAI resource name). https://{your-resource-name}.openai.azure.com |
| deployment-id | path | Yes | string | Deployment ID of the whisper model which was deployed. For information about supported models, see [/azure/ai-foundry/openai/concepts/models#audio-models]. |
| api-version | query | Yes | string | API version |
Request Header
| Name | Required | Type | Description |
|---|---|---|---|
| api-key | True | string | Provide Azure OpenAI API key here |
Request Body
Content-Type: multipart/form-data
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| file | string | The audio file to translate. | Yes | |
| prompt | string | An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English. | No | |
| response_format | audioResponseFormat | Defines the format of the output. | No | |
| temperature | number | The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit. | No | 0 |
Responses
Status Code: 200
Description: OK
| Content-Type | Type | Description |
|---|---|---|
| application/json | audioResponse or audioVerboseResponse | |
| text/plain | string | Transcribed text in the output format (when response_format was one of text, vtt or srt). |
Examples
Example
Gets English language transcribed text and associated metadata from provided spoken audio data.
POST https://{endpoint}/openai/deployments/{deployment-id}/audio/translations?api-version=2024-10-21
"---multipart-boundary\nContent-Disposition: form-data; name=\"file\"; filename=\"file.wav\"\nContent-Type: application/octet-stream\n\nRIFF..audio.data.omitted\n---multipart-boundary--"
Responses: Status Code: 200
{
"body": {
"text": "A structured object when requesting json or verbose_json"
}
}
Example
Gets English language transcribed text and associated metadata from provided spoken audio data.
POST https://{endpoint}/openai/deployments/{deployment-id}/audio/translations?api-version=2024-10-21
"---multipart-boundary\nContent-Disposition: form-data; name=\"file\"; filename=\"file.wav\"\nContent-Type: application/octet-stream\n\nRIFF..audio.data.omitted\n---multipart-boundary--"
Responses: Status Code: 200
{
"type": "string",
"example": "plain text when requesting text, srt, or vtt"
}
Image generation
POST https://{endpoint}/openai/deployments/{deployment-id}/images/generations?api-version=2024-10-21
Generates a batch of images from a text caption on a given dall-e model deployment
URI Parameters
| Name | In | Required | Type | Description |
|---|---|---|---|---|
| endpoint | path | Yes | string url |
Supported Azure OpenAI endpoints (protocol and hostname, for example: https://aoairesource.openai.azure.com. Replace "aoairesource" with your Azure OpenAI resource name). https://{your-resource-name}.openai.azure.com |
| deployment-id | path | Yes | string | Deployment ID of the dall-e model which was deployed. |
| api-version | query | Yes | string | API version |
Request Header
| Name | Required | Type | Description |
|---|---|---|---|
| api-key | True | string | Provide Azure OpenAI API key here |
Request Body
Content-Type: application/json
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| prompt | string | A text description of the desired image(s). The maximum length is 4,000 characters. | Yes | |
| n | integer | The number of images to generate. | No | 1 |
| size | imageSize | The size of the generated images. | No | 1024x1024 |
| response_format | imagesResponseFormat | The format in which the generated images are returned. | No | url |
| user | string | A unique identifier representing your end-user, which can help to monitor and detect abuse. | No | |
| quality | imageQuality | The quality of the image that will be generated. | No | standard |
| style | imageStyle | The style of the generated images. | No | vivid |
Responses
Status Code: 200
Description: Ok
| Content-Type | Type | Description |
|---|---|---|
| application/json | generateImagesResponse |
Status Code: default
Description: An error occurred.
| Content-Type | Type | Description |
|---|---|---|
| application/json | dalleErrorResponse |
Examples
Example
Creates images given a prompt.
POST https://{endpoint}/openai/deployments/{deployment-id}/images/generations?api-version=2024-10-21
{
"prompt": "In the style of WordArt, Microsoft Clippy wearing a cowboy hat.",
"n": 1,
"style": "natural",
"quality": "standard"
}
Responses: Status Code: 200
{
"body": {
"created": 1698342300,
"data": [
{
"revised_prompt": "A vivid, natural representation of Microsoft Clippy wearing a cowboy hat.",
"prompt_filter_results": {
"sexual": {
"severity": "safe",
"filtered": false
},
"violence": {
"severity": "safe",
"filtered": false
},
"hate": {
"severity": "safe",
"filtered": false
},
"self_harm": {
"severity": "safe",
"filtered": false
},
"profanity": {
"detected": false,
"filtered": false
}
},
"url": "https://dalletipusw2.blob.core.windows.net/private/images/e5451cc6-b1ad-4747-bd46-b89a3a3b8bc3/generated_00.png?se=2023-10-27T17%3A45%3A09Z&...",
"content_filter_results": {
"sexual": {
"severity": "safe",
"filtered": false
},
"violence": {
"severity": "safe",
"filtered": false
},
"hate": {
"severity": "safe",
"filtered": false
},
"self_harm": {
"severity": "safe",
"filtered": false
}
}
}
]
}
}
Components
For the schema definitions used by chat, completions, embeddings, and other text operations, see the Azure OpenAI REST API reference. The following schemas support the image and audio operations on this page.
innerErrorCode
Error codes for the inner error object.
Description: Error codes for the inner error object.
Type: string
Default:
Enum Name: InnerErrorCode
Enum Values:
| Value | Description |
|---|---|
| ResponsibleAIPolicyViolation | The prompt violated one of more content filter rules. |
dalleErrorResponse
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| error | dalleError | No |
dalleError
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| param | string | No | ||
| type | string | No | ||
| inner_error | dalleInnerError | Inner error with additional details. | No |
dalleInnerError
Inner error with additional details.
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| code | innerErrorCode | Error codes for the inner error object. | No | |
| content_filter_results | dalleFilterResults | Information about the content filtering category (hate, sexual, violence, self_harm), if it has been detected, as well as the severity level (very_low, low, medium, high-scale that determines the intensity and risk level of harmful content) and if it has been filtered or not. Information about jailbreak content and profanity, if it has been detected, and if it has been filtered or not. And information about customer blocklist, if it has been filtered and its id. | No | |
| revised_prompt | string | The prompt that was used to generate the image, if there was any revision to the prompt. | No |
contentFilterSeverityResult
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| filtered | boolean | Yes | ||
| severity | string | No |
contentFilterDetectedResult
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| filtered | boolean | Yes | ||
| detected | boolean | No |
dalleFilterResults
Information about the content filtering category (hate, sexual, violence, self_harm), if it has been detected, as well as the severity level (very_low, low, medium, high-scale that determines the intensity and risk level of harmful content) and if it has been filtered or not. Information about jailbreak content and profanity, if it has been detected, and if it has been filtered or not. And information about customer blocklist, if it has been filtered and its id.
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| sexual | contentFilterSeverityResult | No | ||
| violence | contentFilterSeverityResult | No | ||
| hate | contentFilterSeverityResult | No | ||
| self_harm | contentFilterSeverityResult | No | ||
| profanity | contentFilterDetectedResult | No | ||
| jailbreak | contentFilterDetectedResult | No |
audioResponse
Translation or transcription response when response_format was json
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| text | string | Translated or transcribed text. | Yes |
audioVerboseResponse
Translation or transcription response when response_format was verbose_json
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| text | string | Translated or transcribed text. | Yes | |
| task | string | Type of audio task. | No | |
| language | string | Language. | No | |
| duration | number | Duration. | No | |
| segments | array | No |
audioResponseFormat
Defines the format of the output.
Description: Defines the format of the output.
Type: string
Default:
Enum Values:
- json
- text
- srt
- verbose_json
- vtt
imageQuality
The quality of the image that will be generated.
Description: The quality of the image that will be generated.
Type: string
Default: standard
Enum Name: Quality
Enum Values:
| Value | Description |
|---|---|
| standard | Standard quality creates images with standard quality. |
| hd | HD quality creates images with finer details and greater consistency across the image. |
imagesResponseFormat
The format in which the generated images are returned.
Description: The format in which the generated images are returned.
Type: string
Default: url
Enum Name: ImagesResponseFormat
Enum Values:
| Value | Description |
|---|---|
| url | The URL that provides temporary access to download the generated images. |
| b64_json | The generated images are returned as base64 encoded string. |
imageSize
The size of the generated images.
Description: The size of the generated images.
Type: string
Default: 1024x1024
Enum Name: Size
Enum Values:
| Value | Description |
|---|---|
| 1792x1024 | The desired size of the generated image is 1792x1024 pixels. |
| 1024x1792 | The desired size of the generated image is 1024x1792 pixels. |
| 1024x1024 | The desired size of the generated image is 1024x1024 pixels. |
imageStyle
The style of the generated images.
Description: The style of the generated images.
Type: string
Default: vivid
Enum Name: Style
Enum Values:
| Value | Description |
|---|---|
| vivid | Vivid creates images that are hyper-realistic and dramatic. |
| natural | Natural creates images that are more natural and less hyper-realistic. |
generateImagesResponse
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| created | integer | The unix timestamp when the operation was created. | Yes | |
| data | array | The result data of the operation, if successful | Yes |
Next steps
Learn about models and fine-tuning with the REST API. Learn more about the underlying models that power Azure OpenAI.