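Streaming output is central to the user experience of AI applications. This article breaks down the engineering details of streaming responses at three levels: the underlying SSE protocol, asynchronous stream handling in Python, and token-level output control.
1. Why Streaming Output Matters
The problem with non-streaming calls is obvious: after sending a request, the user waits several seconds, sometimes more than ten, before the complete reply appears.
The value of streaming lies in time to first token. The first characters typically appear within a few hundred milliseconds of the request (roughly 200-500 ms, depending on model and load), and the rest flows out incrementally. Psychologically, this turns "waiting" into "reading", and the perceived responsiveness improves dramatically.
Non-streaming: request → [wait 3 s] → complete reply appears at once
Streaming:     request → [200 ms] first token → tokens flow out → done
Streaming is, however, considerably more complex to implement than non-streaming. It involves the SSE protocol, asynchronous processing, backpressure control, and error recovery.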
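To make the first-token latency described above measurable, here is a minimal sketch that times any async token stream. `fake_stream` and `measure_latency` are hypothetical names introduced for illustration, and the fake stream only simulates a model with `asyncio.sleep`; with a real client you would pass the text deltas instead.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def fake_stream(tokens, first_delay=0.05, gap=0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a model stream: waits, then yields tokens."""
    await asyncio.sleep(first_delay)
    for token in tokens:
        yield token
        await asyncio.sleep(gap)


async def measure_latency(stream: AsyncIterator[str]) -> dict:
    """Record time to first token and total time for any async token stream."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    parts = []
    async for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
    return {
        "text": "".join(parts),
        "ttft": first_token_at,
        "total": time.perf_counter() - start,
    }


stats = asyncio.run(measure_latency(fake_stream(["Hel", "lo", "!"])))
print(f"TTFT {stats['ttft'] * 1000:.0f} ms, total {stats['total'] * 1000:.0f} ms")
```

Tracking the two numbers separately is the point: streaming mainly improves the first one, while the total generation time stays about the same.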
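2. The SSE Protocol: The Foundation of Streaming Output
2.1 What SSE Is
SSE (Server-Sent Events) is a one-way, long-lived streaming mechanism built on plain HTTP. The server keeps pushing data over a single response, so the client does not need to poll.
HTTP request:
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{"model": "gpt-4o", "messages": [...], "stream": true}
HTTP response:
HTTP/1.1 200 OK
Content-Type: text/event-stream

data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"Hel"}}]}

data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"lo"}}]}

data: {"id":"chatcmpl-xxx","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Each event is a data: line terminated by a blank line. Every payload except the last is a standalone JSON object whose delta.content carries the incremental text. The final [DONE] is a plain-text sentinel (an OpenAI API convention, not part of SSE itself) that marks the end of the stream.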
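The framing described above can be parsed with a few lines of standard-library Python. This is a minimal sketch, not a full SSE client: `iter_sse_data` and `collect_text` are hypothetical helper names, only the `data:` field is handled, and the sample lines mirror the response shown earlier.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event that is never terminated by a blank line is dropped.


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from chat-completion chunks until [DONE]."""
    parts = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"lo"}}]}',
    "",
    'data: {"id":"chatcmpl-xxx","choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
    "",
]
print(collect_text(sample))
```

In practice the SDK does this parsing for you; the sketch is only meant to show what travels over the wire.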
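2.2 SSE vs WebSocket
Why do AI APIs use SSE rather than WebSocket?
| Dimension | SSE | WebSocket |
|---|---|---|
| Direction | One-way (server → client) | Bidirectional |
| Protocol | HTTP | Separate protocol (started by an HTTP upgrade) |
| Complexity | Low | High |
| Compatibility | Good (works with existing HTTP infrastructure) | Needs additional support |
| Reconnection | Automatic in the browser EventSource API | Must be implemented manually |
In an AI chat scenario the client sends a single request and then receives a continuous stream of output. That is a one-way data flow, which SSE fits well. Note that EventSource only issues GET requests, so clients of POST-based chat APIs usually read the response stream themselves and do not get automatic reconnection.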
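3. Asynchronous Stream Handling in Python
3.1 Basic Implementation
Using the async streaming interface of the OpenAI SDK:
import asyncio
from openai import AsyncOpenAI

async def stream_chat():
    # Configure the client.
    # You can call the official API directly or go through a relay service.
    # This example uses the 魔芋AI relay (registration link in the comment below).
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
        # 魔芋AI registration (kept in a code comment):
        # https://www.moyu.info/register?aff=CRB8
        # New users get a free quota; GPT/Claude/Gemini/DeepSeek and other models are supported
    )
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Implement quicksort in Python"}],
        stream=True
    )
    async for chunk in stream:
        # Some chunks carry no choices, so guard before indexing
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()  # final newline

asyncio.run(stream_chat())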
3.2 Cancelling Mid-Stream
The user may cancel a request while output is still being generated, so cancellation needs to be handled correctly:
async def stream_chat_with_cancel():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a long article"}],
        stream=True
    )
    collected = []
    try:
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                collected.append(content)
                print(content, end="", flush=True)
    except asyncio.CancelledError:
        # Raised at the current await when this task is cancelled, for example
        # by task.cancel() or by Ctrl+C under asyncio.run()
        print(f"\n\n[Cancelled after receiving {len(collected)} chunks]")
        # Clean up here (e.g. save "".join(collected)), then re-raise so the
        # caller still sees the cancellation
        raise
    finally:
        await stream.close()  # release the HTTP connection
    return "".join(collected)
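The cancellation path can be exercised without any network access. The following sketch assumes a hypothetical endless `fake_stream` in place of a model stream and cancels the consuming task once three tokens have arrived; everything received before the cancellation remains available in `saved`.

```python
import asyncio


async def fake_stream():
    """Hypothetical endless token source standing in for a model stream."""
    n = 0
    while True:
        await asyncio.sleep(0)  # yield control, like waiting on the network
        yield f"tok{n} "
        n += 1


async def consume(saved: list, got_three: asyncio.Event) -> None:
    try:
        async for token in fake_stream():
            saved.append(token)
            if len(saved) == 3:
                got_three.set()
    except asyncio.CancelledError:
        # Cleanup point: `saved` already holds everything received so far.
        saved.append("[cancelled]")
        raise  # keep the cancellation visible to whoever awaits this task


async def main() -> list:
    saved: list = []
    got_three = asyncio.Event()
    task = asyncio.create_task(consume(saved, got_three))
    await got_three.wait()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: we cancelled it ourselves
    return saved


saved = asyncio.run(main())
print(saved[:3], saved[-1])
```

Passing the `saved` list in from outside is one way to keep partial output reachable even though the task itself ends with `CancelledError` rather than a return value.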
3.3 Backpressure Control
If the consumer is slower than the producer, backpressure is needed to keep buffered data from growing without bound:
import asyncio
from openai import AsyncOpenAI

async def stream_with_backpressure():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    # Use a Queue as a bounded buffer
    buffer = asyncio.Queue(maxsize=100)

    async def producer():
        """Receive data from the API and put it on the queue."""
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Tell me a long story"}],
                stream=True
            )
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    # put() suspends while the queue is full; that is the backpressure
                    await buffer.put(chunk.choices[0].delta.content)
        finally:
            await buffer.put(None)  # end-of-stream sentinel

    async def consumer():
        """Take data off the queue and process it."""
        total = 0
        while True:
            content = await buffer.get()
            if content is None:
                break
            # Simulate slow consumption (e.g. writing a file, calling another API)
            await asyncio.sleep(0.01)
            total += len(content)
        print(f"\nProcessed {total} characters in total")

    # Run the producer and consumer concurrently
    await asyncio.gather(producer(), consumer())
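To see the bounded queue actually holding the producer back, here is a small offline sketch. The names `run_pipeline` and `stats` are illustrative, the producer just emits integers, and the consumer is slowed down artificially; the recorded queue depth never exceeds `maxsize`.

```python
import asyncio


async def run_pipeline(n_items: int = 50, maxsize: int = 5) -> dict:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    stats = {"produced": 0, "consumed": 0, "max_depth": 0}

    async def producer():
        for i in range(n_items):
            await queue.put(i)  # suspends while the queue is full
            stats["produced"] += 1
            stats["max_depth"] = max(stats["max_depth"], queue.qsize())
        await queue.put(None)  # end-of-stream sentinel

    async def consumer():
        while True:
            item = await queue.get()
            if item is None:
                break
            await asyncio.sleep(0.001)  # simulate slow processing
            stats["consumed"] += 1

    await asyncio.gather(producer(), consumer())
    return stats


stats = asyncio.run(run_pipeline())
print(stats)
```

The trade-off is in `maxsize`: a larger value absorbs short bursts, a smaller one caps memory more tightly.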
4. Token-Level Output Control
4.1 Token Counts When Streaming
In a non-streaming call, token counts come back directly in response.usage. A streaming call does not return usage by default. Passing stream_options={"include_usage": True} makes the API send one extra final chunk that carries usage and has an empty choices list:
async def stream_with_usage():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        stream=True,
        stream_options={"include_usage": True}  # the key parameter
    )
    prompt_tokens = 0
    completion_tokens = 0
    async for chunk in stream:
        # Content chunk
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
        # Usage chunk (the last chunk; its choices list is empty)
        if chunk.usage:
            prompt_tokens = chunk.usage.prompt_tokens
            completion_tokens = chunk.usage.completion_tokens
    print(f"\n\nInput tokens: {prompt_tokens}")
    print(f"Output tokens: {completion_tokens}")
    print(f"Total: {prompt_tokens + completion_tokens}")
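The reason for checking `chunk.choices` before indexing becomes clearer with hand-built stand-ins for the SDK's chunk objects. This is only a sketch: the `SimpleNamespace` objects mimic the attribute layout used above, `summarize` is a hypothetical helper, and the token numbers are made up.

```python
from types import SimpleNamespace as NS


def summarize(chunks):
    """Return (text, usage) from an iterable of chat-completion-like chunks."""
    parts, usage = [], None
    for chunk in chunks:
        # The usage chunk has an empty choices list, so guard before indexing.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
        if getattr(chunk, "usage", None):
            usage = chunk.usage
    return "".join(parts), usage


# Hand-built stand-ins for SDK chunk objects (illustrative values only).
chunks = [
    NS(choices=[NS(delta=NS(content="Hi"))], usage=None),
    NS(choices=[NS(delta=NS(content=None))], usage=None),
    NS(choices=[], usage=NS(prompt_tokens=12, completion_tokens=3)),
]
text, usage = summarize(chunks)
print(text, usage.prompt_tokens + usage.completion_tokens)
```

Indexing `chunks[-1].choices[0]` directly would raise `IndexError`, which is the failure the guard avoids.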
4.2 Limiting Output Length
Sometimes generation should stop once the output reaches a certain length:
async def stream_with_limit(max_chars=500):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write an essay"}],
        stream=True
    )
    char_count = 0
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            char_count += len(content)
            if char_count > max_chars:
                print(f"\n[Reached the {max_chars}-character limit, stopping]")
                # Call close() to shut down the stream
                await stream.close()
                break
            print(content, end="", flush=True)
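A variant of the same idea is to cut the last chunk so that exactly `max_chars` characters are emitted, packaged as a wrapper around any async text stream. The sketch below uses a hypothetical `fake_stream` instead of a model; `limit_chars` is likewise an illustrative name, and with a real SDK stream the caller would still close the underlying stream after the wrapper stops.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(chunks) -> AsyncIterator[str]:
    """Hypothetical stand-in for a model stream."""
    for chunk in chunks:
        await asyncio.sleep(0)
        yield chunk


async def limit_chars(stream: AsyncIterator[str], max_chars: int) -> AsyncIterator[str]:
    """Yield text from `stream` until exactly `max_chars` characters were emitted."""
    remaining = max_chars
    if remaining <= 0:
        return
    async for text in stream:
        piece = text[:remaining]  # trim the chunk that crosses the limit
        remaining -= len(piece)
        if piece:
            yield piece
        if remaining <= 0:
            return  # stop pulling from the upstream stream


async def main() -> str:
    out = []
    async for piece in limit_chars(fake_stream(["abcd", "efgh", "ijkl"]), 6):
        out.append(piece)
    return "".join(out)


result = asyncio.run(main())
print(result)
```

Whether to trim the final chunk or drop it entirely is a product decision; trimming gives a predictable length, dropping avoids cutting a word in half.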
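4.3 Triggering Actions on Keywords
Detect specific keywords in the streamed output and trigger an action (for example, highlighting when a code block is detected):
import re

async def stream_with_keyword_detection():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a Python sort function and explain it"}],
        stream=True
    )
    buffer = ""  # the current, still incomplete line
    in_code_block = False
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            buffer += content
            # Inspect complete lines only, so a fence or language tag that is
            # split across chunks is still recognized
            while "\n" in buffer:
                line, buffer = buffer.split("\n", 1)
                fence = re.match(r"\s*```(\w*)", line)
                if not fence:
                    continue
                if not in_code_block:
                    # Code block starts
                    lang = fence.group(1) or "text"
                    print(f"\n[Code block start: {lang}]")
                else:
                    # Code block ends
                    print("\n[Code block end]")
                in_code_block = not in_code_block
    # A closing fence with no trailing newline is still in the buffer
    if in_code_block and buffer.strip().startswith("```"):
        print("\n[Code block end]")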
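Fence detection is easier to verify when it is separated from the network call. Below is a minimal sketch of a line-based detector that tolerates fences and language tags split across chunks; `FenceDetector` and its event tuples are hypothetical names, not part of any SDK.

```python
import re
from typing import List, Tuple

FENCE = re.compile(r"\s*```(\w*)")


class FenceDetector:
    """Track Markdown code fences in text that arrives in arbitrary chunks."""

    def __init__(self) -> None:
        self._line = ""  # current incomplete line
        self.in_code_block = False

    def feed(self, text: str) -> List[Tuple[str, str]]:
        """Consume a chunk and return ("start", lang) / ("end", "") events."""
        events = []
        self._line += text
        while "\n" in self._line:
            line, self._line = self._line.split("\n", 1)
            match = FENCE.match(line)
            if not match:
                continue
            if self.in_code_block:
                events.append(("end", ""))
            else:
                events.append(("start", match.group(1) or "text"))
            self.in_code_block = not self.in_code_block
        return events


detector = FenceDetector()
events = []
# The opening fence and its language tag are deliberately split across chunks.
for chunk in ["Here:\n``", "`py", "thon\nprint(1)\n", "```", "\nDone.\n"]:
    events.extend(detector.feed(chunk))
print(events)
```

Because events fire only when a line is complete, the "start" event arrives slightly after the fence text itself, which is the price of getting the language tag right.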
5. Error Handling and Retries
5.1 Error Types in Streaming Requests
import asyncio
from openai import (
    AsyncOpenAI,
    APITimeoutError,
    APIConnectionError,
    RateLimitError,
    InternalServerError,
)

async def stream_with_retry(prompt, max_retries=3):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1",
        timeout=30.0  # request timeout in seconds
    )
    for attempt in range(max_retries):
        started = False  # True once any content has reached the caller
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                stream=True
            )
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    started = True
                    yield chunk.choices[0].delta.content
            return  # success, stop retrying
        except RateLimitError as exc:
            # 429: rate limited, back off exponentially
            error, wait = exc, 2 ** attempt
        except APITimeoutError as exc:
            # Timeout: retry immediately
            error, wait = exc, 0
        except APIConnectionError as exc:
            # Connection error: check the relay's status
            error, wait = exc, 1
        except InternalServerError as exc:
            # 5xx: server-side error
            error, wait = exc, 2
        if started:
            # Retrying now would replay text the caller already received
            raise error
        print(f"\n[{type(error).__name__}, retrying in {wait}s]")
        await asyncio.sleep(wait)
    raise RuntimeError(f"Still failing after {max_retries} attempts")
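The backoff schedule itself can be tested in isolation. The sketch below is a generic retry helper under stated assumptions: `retry_async` is a hypothetical name, it retries on the built-in `ConnectionError` and `TimeoutError` rather than SDK exceptions, and the `sleep` parameter is injectable so the example records delays instead of waiting.

```python
import asyncio
import random


async def retry_async(func, *, max_retries=3, base=1.0, cap=30.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=asyncio.sleep):
    """Call `func()` until it succeeds, backing off exponentially between tries."""
    for attempt in range(max_retries):
        try:
            return await func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(cap, base * 2 ** attempt)
            # Up to 10% jitter so many clients do not retry in lockstep.
            await sleep(delay + random.uniform(0, delay * 0.1))


calls = {"n": 0}
delays = []


async def flaky():
    """Fails twice with a simulated connection drop, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated drop")
    return "ok"


async def record_sleep(seconds):
    delays.append(seconds)  # record the delay instead of actually waiting


result = asyncio.run(retry_async(flaky, sleep=record_sleep))
print(result, calls["n"], [round(d) for d in delays])
```

Capping the delay and adding jitter are common refinements over a bare `2 ** attempt`; the exact constants here are arbitrary.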
5.2 Resuming a Broken Stream
If the stream drops partway through, you can issue a new request and ask the model to continue from where it stopped. The model is not guaranteed to pick up at the exact character, so it may repeat or rephrase some text:
async def stream_with_resume(prompt, max_chars=10000):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    collected = ""
    retries = 0
    while len(collected) < max_chars and retries < 3:
        try:
            # If partial content exists, ask the model to continue from it
            messages = [{"role": "user", "content": prompt}]
            if collected:
                messages = [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": collected},
                    {"role": "user", "content": "Please continue"}
                ]
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                stream=True
            )
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    content = chunk.choices[0].delta.content
                    collected += content
                    print(content, end="", flush=True)
            break  # finished normally
        except Exception as e:
            retries += 1
            print(f"\n[Stream dropped, retry {retries}/3: {e}]")
            await asyncio.sleep(2)
    return collected
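When a resumed reply repeats the tail of what was already received, the duplicate can be trimmed before appending. This is a heuristic sketch only: `merge_continuation` is a hypothetical helper, it matches exact character overlap, and the `min_overlap` threshold is an assumption meant to avoid trimming on coincidental short matches.

```python
def merge_continuation(collected: str, continuation: str,
                       min_overlap: int = 4, max_overlap: int = 200) -> str:
    """Append `continuation`, dropping a prefix that repeats the tail of `collected`."""
    limit = min(max_overlap, len(collected), len(continuation))
    # Try the longest candidate overlap first.
    for size in range(limit, min_overlap - 1, -1):
        if collected.endswith(continuation[:size]):
            return collected + continuation[size:]
    return collected + continuation  # no overlap found: plain append


merged = merge_continuation(
    "Quicksort picks a pivot and",
    " pivot and partitions the list.",
)
print(merged)
```

Exact matching will not catch a model that rephrases the repeated part, so this only covers the simplest case.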
6. Performance Tips
6.1 Connection Pool Reuse
import httpx
from openai import AsyncOpenAI

# Create a reusable HTTP client
http_client = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=100,
        max_keepalive_connections=20
    ),
    timeout=httpx.Timeout(30.0, connect=5.0)
)
client = AsyncOpenAI(
    api_key="your-api-key",
    base_url="https://api.moyu.info/v1",
    http_client=http_client  # reuse the connection pool
)
6.2 Concurrent Streaming Requests
Issue several streaming requests at once and merge the output:
async def concurrent_streams(prompts: list[str]):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )

    async def single_stream(prompt, index):
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        result = ""
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                result += chunk.choices[0].delta.content
        return index, result

    # Run concurrently
    tasks = [single_stream(p, i) for i, p in enumerate(prompts)]
    results = await asyncio.gather(*tasks)
    # Print in order (gather() already preserves task order; the sort is a safeguard)
    results.sort(key=lambda x: x[0])
    for _, text in results:
        print(text)
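With many prompts, it is often worth capping how many streams are open at once. The sketch below shows one way to do that with `asyncio.Semaphore`; `fake_stream` and `bounded_streams` are hypothetical names, and the fake stream simply echoes the prompt so the example runs offline.

```python
import asyncio


async def fake_stream(prompt: str):
    """Hypothetical stand-in for one streaming completion."""
    for word in ("echo:", prompt):
        await asyncio.sleep(0.001)
        yield word + " "


async def bounded_streams(prompts, limit=2):
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}

    async def one(prompt):
        async with semaphore:  # at most `limit` streams are open at once
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
            try:
                parts = [part async for part in fake_stream(prompt)]
                return "".join(parts).strip()
            finally:
                state["active"] -= 1

    # gather() returns results in the order the coroutines were passed in.
    results = await asyncio.gather(*(one(p) for p in prompts))
    return results, state["peak"]


results, peak = asyncio.run(bounded_streams(["a", "b", "c", "d", "e"]))
print(results, peak)
```

A limit below the provider's rate limit reduces 429 responses at the cost of some total latency.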
7. Complete Example: A Streaming Chat Backend for a UI
Putting the earlier pieces together gives a complete streaming chat backend:
# app.py - complete streaming chat service
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
import json

app = FastAPI()

# Client configuration
# Works with a direct connection or through a relay service
# This example uses the 魔芋AI relay (OpenAI-compatible protocol)
client = AsyncOpenAI(
    api_key="your-api-key",
    base_url="https://api.moyu.info/v1",
    # Relay registration: https://www.moyu.info/register?aff=CRB8
    timeout=60.0
)

class ChatRequest(BaseModel):
    message: str
    model: str = "gpt-4o-mini"

@app.post("/chat")
async def chat(req: ChatRequest):
    async def generate():
        try:
            stream = await client.chat.completions.create(
                model=req.model,
                messages=[
                    {"role": "system", "content": "You are a technical assistant"},
                    {"role": "user", "content": req.message}
                ],
                stream=True,
                stream_options={"include_usage": True}
            )
            completion_tokens = 0
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    data = {"content": chunk.choices[0].delta.content}
                    yield f"data: {json.dumps(data)}\n\n"
                if chunk.usage:
                    completion_tokens = chunk.usage.completion_tokens
            # Final event: signal completion and report output tokens
            yield f"data: {json.dumps({'done': True, 'tokens': completion_tokens})}\n\n"
        except Exception as e:
            yield f"data: {json.dumps({'error': str(e)})}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

# Run with: uvicorn app:app --reload
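On the receiving side, a client has to fold this endpoint's events back into text. The sketch below assumes the three payload shapes emitted by the service above (`content`, `done` with `tokens`, and `error`); `read_chat_events` is a hypothetical helper that accepts any iterable of lines, so it can be fed from whatever HTTP client is in use.

```python
import json
from typing import Iterable, Tuple


def read_chat_events(lines: Iterable[str]) -> Tuple[str, int]:
    """Fold the /chat endpoint's `data:` lines into (full_text, tokens)."""
    parts, tokens = [], 0
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        event = json.loads(line[len("data:"):])
        if "error" in event:
            raise RuntimeError(event["error"])
        if event.get("done"):
            tokens = event.get("tokens", 0)
            break
        parts.append(event.get("content", ""))
    return "".join(parts), tokens


# Lines framed the way the server above frames them (values are illustrative).
body = (
    f"data: {json.dumps({'content': 'Hello'})}\n\n"
    f"data: {json.dumps({'content': ', world'})}\n\n"
    f"data: {json.dumps({'done': True, 'tokens': 4})}\n\n"
)
text, tokens = read_chat_events(body.splitlines())
print(text, tokens)
```

A UI would render each `content` piece as it arrives rather than joining at the end; the join here just keeps the example checkable.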
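8. Summary
Implementing streaming output involves four layers:
- Protocol layer: understand the SSE format and correctly parse data: lines and the [DONE] marker
- Async layer: consume the stream with async for, and handle cancellation and backpressure correctly
- Control layer: token counting, length limits, keyword detection
- Fault-tolerance layer: timeout retries, resuming broken streams, connection pool reuse
With these in place you can build stable, reliable streaming AI applications. The code in this article uses the OpenAI-compatible protocol and works whether you connect directly or through any compatible relay service. Questions are welcome in the comments.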