Enabling Responses API and MCP on vLLM
The OpenAI Responses API is the successor to the Chat Completions API, designed to support agentic workflows with built-in tool use, multi-turn conversation management, and streaming. We have implemented the Responses API in vLLM, enabling any model served by vLLM to participate in agentic pipelines that call tools, execute code, search the web, and reason through complex tasks – all through a single POST /v1/responses endpoint.
This blog post covers:
- The Responses API implementation in vLLM: endpoint design, streaming and non-streaming modes, and the full set of supported features
- MCP (Model Context Protocol) integration: how vLLM connects to external tool servers and executes tool calls during generation
- Two context architectures: HarmonyContext for GPT-OSS models and ParsableContext for all other models
The Responses API
Responses API is a modern interface for interacting with large language models that unifies text generation, multimodal inputs, and tool use into a single API primitive. Introduced as the successor to earlier interfaces like Chat Completions and Assistants, it provides a flexible abstraction for building agentic applications—allowing models to generate structured outputs, call tools, maintain conversation state, and integrate external data sources in one request. The API treats a “response” as the fundamental unit of interaction, combining inputs, model reasoning, tool calls, and outputs into a structured object. Developers can learn more about the official specification in the OpenAI Responses API documentation (https://developers.openai.com/api/reference/resources/responses ). At the same time, efforts like OpenResponses Initiative (https://www.openresponses.org/ ) aim to define an open, provider-agnostic standard inspired by this interface, enabling interoperable tooling and reducing vendor lock-in across LLM platforms.
Endpoint Overview
vLLM exposes three endpoints under the Responses API:
| Endpoint | Method | Description |
|---|---|---|
/v1/responses |
POST | Create a new response (streaming or non-streaming) |
/v1/responses/{response_id} |
GET | Retrieve a stored response |
/v1/responses/{response_id}/cancel |
POST | Cancel a background response |
The primary endpoint is POST /v1/responses. It accepts a ResponsesRequest with an input (a string or a list of conversation items), an optional instructions field for system messages, and a tools list for tool definitions. The stream flag determines whether the response is returned as a single JSON object or as a stream of Server-Sent Events (SSE).
Non-Streaming Mode
In non-streaming mode, vLLM generates the full response before returning it. The response includes the model’s output items (text messages, function calls, reasoning content), token usage statistics, and a final status (completed, incomplete, or failed).
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
"input": "What is the capital of France?",
}'
Streaming Mode
When stream: true, vLLM emits events as SSE with monotonically increasing sequence numbers. The event lifecycle follows a structured pattern:
response.createdandresponse.in_progress– the response object is initialized- For each output item:
response.output_item.added– a new output item begins (message, reasoning, function_call, etc.)- Content-specific delta events (e.g.,
response.output_text.deltafor text tokens) - Content done events (e.g.,
response.output_text.done) response.output_item.done– the output item is finalized
response.completed– the full response with usage statistics
This event structure matches the OpenAI Responses API specification, making vLLM a drop-in replacement for clients already using the Responses API.
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
"input": "Explain quicksort in one paragraph.",
"stream": true
}'
See https://www.openresponses.org/specification#streaming for more details on the streaming protocol and the types of events that are streamed.
Supported Features
The vLLM Responses API supports a broad set of features:
Tool Calling (Function and MCP) Tools of type function can be defined in the tools list. The model’s output is parsed for tool calls using configurable tool parsers (Hermes, Llama, Mistral, etc.). The tool_choice parameter supports "auto", "none", "required", or a named function. When the model emits a function call, it is returned as a function_call output item with streaming events response.function_call_arguments.delta and response.function_call_arguments.done.
TODO
Reasoning. The reasoning parameter (with an effort field) enables chain-of-thought reasoning. Reasoning content is tracked separately from regular output and appears as ResponseReasoningItem output items. Streaming emits response.reasoning_text.delta and response.reasoning_text.done events, allowing clients to display the model’s thinking process in real time (see https://github.com/vllm-project/vllm/pull/29947)
Structured Output. The structured_outputs field supports JSON Schema-constrained generation. When a JSON schema is provided, vLLM enforces the schema during decoding using guided generation, ensuring the output is valid JSON conforming to the specified schema. When a choice is specified, vLLM will only output in final output from the options listed. (see https://github.com/vllm-project/vllm/pull/33709).
Logprobs. When include contains "message.output_text.logprobs", the response includes per-token log probabilities. The top_logprobs parameter controls how many top alternatives are returned per token position.
Background Mode. Setting background: true (with store: true) queues the response for asynchronous generation. The response is returned immediately with status "queued" and can be polled or retrieved later via GET /v1/responses/{response_id}. Background responses can be cancelled via the cancel endpoint.
vLLM-Specific Extensions. Beyond the standard API, vLLM adds parameters for priority (request scheduling priority), cache_salt (prefix cache isolation), seed (deterministic sampling), repetition_penalty, custom stop sequences, and enable_response_messages (returns raw prompt and output token IDs for debugging).
Debugging We also have implemented the ability to return raw input and output tokens for responsesAPI. (See https://github.com/vllm-project/vllm/pull/29549)
MCP: Model Context Protocol Integration
The Model Context Protocol (MCP) allows LLMs to call external tools during generation, with vLLM handling the tool calling instead of function tools, in which the client is responsible for handling tool calls. vLLM implements MCP as a first-class feature of the Responses API: when a model generates a tool call, vLLM intercepts it, calls the appropriate MCP tool server, and feeds the result back to the model for the next turn of generation – all within a single API request.
Built-in Tools
vLLM supports three categories of built-in tools:
Web Search (web_search_preview). Enables the model to search the web during generation. Streaming events follow: response.web_search_call.in_progress -> response.web_search_call.searching -> response.web_search_call.completed. The search results are injected back into the conversation for the model to synthesize.
Code Interpreter (code_interpreter). Enables the model to write and execute Python code in a sandboxed Docker environment. Streaming events include response.code_interpreter_call_code.delta for the generated code and response.code_interpreter_call.completed for the execution result.
Container (container). Enables the model to execute shell commands in a stateful Docker container, supporting arguments like cmd, workdir, env, and timeout.
Tool Server Architecture
vLLM provides two ToolServer implementations:
MCPToolServer: Connects to external MCP-compatible tool servers over SSE. Multiple servers can be specified via comma-separated URLs. Each server exposes its tools via the MCP protocol, and vLLM discovers available tools at startup via session.list_tools(). Tool sessions are created per-request with unique session IDs.
# Starting vLLM with an MCP tool server
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--tool-server-url http://localhost:3001/sse
Agentic Loop
The core of MCP integration is the agentic loop in _generate_with_builtin_tools. This loop:
- Generates tokens from the model
- Checks if the model requested a tool call (
need_builtin_tool_call()) - If yes, calls the tool via the MCP session (
call_tool()) - Appends the tool result to the conversation context
- Renders the updated conversation as a new prompt
- Repeats from step 1
This loop continues until the model produces a final response without requesting a tool call, or until the max_tool_calls limit is reached. The entire multi-turn interaction happens server-side within a single API request.
MCP Streaming Events
When streaming is enabled with MCP tools, vLLM emits fine-grained events for each tool call:
response.mcp_call.in_progress -- tool call begins
response.mcp_call_arguments.delta -- argument tokens stream in
response.mcp_call_arguments.done -- arguments are complete
response.mcp_call.completed -- tool execution result is available
This allows clients to display the model’s tool interactions in real time, showing what tool is being called, with what arguments, and what the result was.
Context Architecture: HarmonyContext vs. ParsableContext
A key design decision in the Responses API is how to manage the conversation state during multi-turn tool-calling loops. vLLM implements two context architectures to support different model families.
HarmonyContext (GPT-OSS Models)
HarmonyContext is designed for GPT-OSS models that use OpenAI’s Harmony message format. These models use a channel-based parsing system where the model’s output is split into channels (analysis for reasoning, commentary, and final for the actual response). The context tracks messages in the Harmony Message format and uses the Harmony tokenizer’s render_for_completion() to produce token IDs for the next turn.
Key characteristics:
- Uses
openai_harmonymessage types (Author,Message,Role,StreamState,TextContent) - Tool recipients are identified by message
recipientfield (e.g.,browser.search,python,container.exec) - Token rendering uses the Harmony encoding’s stop tokens for assistant actions
- Per-turn token metrics (input, output, cached, tool output tokens) are tracked for accurate usage reporting
StreamingHarmonyContext extends this for token-by-token streaming, processing each token through the Harmony parser and tracking parser state transitions to emit the correct streaming events.
ParsableContext (All Other Models)
ParsableContext is the context for non-GPT-OSS models (Llama, Mistral, Qwen, etc.). It uses vLLM’s standard chat template system to render conversations and parses tool calls from the model output using configurable tool parsers.
Key characteristics:
- Uses
ResponseInputOutputItemtypes from the OpenAI SDK (e.g.,ResponseFunctionToolCall,ResponseFunctionToolCallOutputItem) - Tool calls are identified by the
namefield matching built-in tool names (code_interpreter,web_search_preview,container) - Prompt rendering uses vLLM’s chat template system via
_render_next_turn() - Supports the
VLLM_USE_EXPERIMENTAL_PARSER_CONTEXTenvironment variable to enable this context
SimpleContext
For non-tool-calling scenarios, SimpleContext provides a lightweight context that accumulates raw text and token IDs without any parsing overhead. It is the default for models that do not have tool use enabled.
Choosing the Right Context
The context is selected automatically based on the model type and request configuration:
| Condition | Context Used |
|---|---|
| GPT-OSS model, streaming | StreamingHarmonyContext |
| GPT-OSS model, non-streaming | HarmonyContext |
| Non-GPT-OSS, tools enabled, experimental parser | ParsableContext |
| Non-GPT-OSS, no tools or simple request | SimpleContext |
Token Usage Tracking
The Responses API provides detailed token usage information in the response, including:
input_tokens: Total prompt tokens across all turnsoutput_tokens: Total generated tokensinput_tokens_details.cached_tokens: Tokens served from the prefix cacheoutput_tokens_details.reasoning_tokens: Tokens spent on reasoning (chain-of-thought)
For multi-turn tool-calling interactions, the TurnMetrics class tracks per-turn metrics. Tool output tokens are calculated as the difference between consecutive turns’ prompt sizes minus the previous turn’s output, capturing the token cost of tool results injected between turns.
Getting Started
To use the Responses API with vLLM:
# Basic serving
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct
# With tool calling support
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--enable-auto-tool-choice \
--tool-call-parser hermes
# With MCP tool server
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--tool-server-url http://localhost:3001/sse
Then use the OpenAI Python SDK to make requests:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1")
# Non-streaming
response = client.responses.create(
model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
input="What is the capital of France?",
)
print(response.output_text)
# Streaming
stream = client.responses.create(
model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
input="Explain quicksort step by step.",
stream=True,
)
for event in stream:
if event.type == "response.output_text.delta":
print(event.delta, end="", flush=True)
# With function calling
response = client.responses.create(
model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
input="What is the weather in San Francisco?",
tools=[{
"type": "function",
"name": "get_weather",
"description": "Get the current weather for a location.",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}],
)
Future Work
- Offloading response storage & API Layer: Currently, stored responses are held in memory with no eviction. We would like to support offloading responsesAPI state management to a third party database, and potentially offload the API layer outside of the core vLLM engine (see https://github.com/vllm-project/vllm/issues/26934)
- Expanded tool support: Add MCP support for all tools. (See https://github.com/vllm-project/vllm/issues/30115)
- Open Responses Conformity: OpenResponses (https://www.openresponses.org/ ), launched in January 2026, is an open initiative to standardize a vendor-neutral API for LLM interactions based on the Responses API abstraction, covering structured outputs, tool calls, multimodal inputs, and reasoning traces. For vLLM, supporting OpenResponses enables compatibility with a growing ecosystem of agent frameworks and SDKs while giving users a portable interface that works across both hosted APIs and open-source model deployments.
To see more details about the future work and explore opportunities to contribute, please see this vLLM feature development map: https://github.com/vllm-project/vllm/issues/34857
Acknowledgements
This work was a collaboration across the vLLM community. Thanks to all contributors who helped design and implement the Responses API and MCP integration.
Meta: Andrew Xia, Daniel Salib, Ye Hu, Alec Solder, Ye (Charlotte Qi)
vLLM: Chauncey Jiang