- ---
- title: "Multimodal Tool Outputs"
- description: "Return images and files from your tools."
- icon: "image"
- ---
- ## What This Feature Unlocks
- Returning images and files from tools lets agents run feedback loops over modalities other than text.
- For example, instead of dumping all the data into an agent and hoping for the best, you can generate a visualization or analyze PDF reports and let the agent provide insights based on that output, just like a real data analyst.
- This saves context window space and unlocks autonomous agentic workflows for many new use cases:
- ## New Use Cases
- <CardGroup cols={2}>
- <Card title="Software Development" icon="code">
- Agents can check websites autonomously and iterate until all elements are properly positioned, enabling them to tackle complex projects without manual screenshot feedback.
- </Card>
- <Card title="Brand Asset Generation" icon="paintbrush">
- Provide brand guidelines, logos, and messaging, then let agents iterate on image and video generation (including Sora 2) until outputs fully match your expectations.
- </Card>
- <Card title="Screen-Aware Assistance" icon="eye">
- Build agents that help visually impaired individuals navigate websites or create customer support agents that see the user's current webpage for better assistance.
- </Card>
- <Card title="Data Analytics" icon="chart-area">
- Generate visual graphs and analyze PDF reports, then let agents provide insights based on these outputs without overloading the context window.
- </Card>
- </CardGroup>
- ## Output Formats
- ### Images (PNG, JPG)
- To return an image from a tool, use any of the following:
- 1. Use the `ToolOutputImage` class.
- 2. Return a dict with the `type` set to `"image"` and either `image_url` (URL or data URL) or `file_id`.
- 3. Use our convenience `tool_output_image_from_path` function.
- ```python
- from agency_swarm import BaseTool, ToolOutputImage, ToolOutputImageDict
- from agency_swarm.tools.utils import tool_output_image_from_path
- from pydantic import Field
-
-
- class FetchGalleryImage(BaseTool):
-     """Return a static gallery image."""
-
-     detail: str = Field(default="auto", description="Level of detail")
-
-     def run(self) -> ToolOutputImage:
-         return ToolOutputImage(
-             image_url="https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
-             detail=self.detail,
-         )
-
-
- class FetchGalleryImageDict(BaseTool):
-     """Dict variant of the same image output."""
-
-     detail: str = Field(default="auto", description="Level of detail")
-
-     def run(self) -> ToolOutputImageDict:
-         return {
-             "type": "image",
-             "image_url": "https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
-             "detail": self.detail,
-         }
-
-
- class FetchLocalImage(BaseTool):
-     """Load an image from disk using the helper."""
-
-     path: str = Field(default="examples/data/landscape_scene.png", description="Image to publish")
-
-     def run(self) -> ToolOutputImage:
-         return tool_output_image_from_path(self.path, detail="auto")
- ```
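The dict form also accepts a data URL in `image_url`, which can help when the image exists only in memory. The following is a minimal sketch using only the standard library; `image_dict_from_bytes` is a hypothetical helper name (it is not part of agency_swarm), and the returned keys simply mirror the dict shown above.

```python
import base64


def image_dict_from_bytes(data: bytes, mime_type: str = "image/png", detail: str = "auto") -> dict:
    """Hypothetical helper: wrap raw image bytes in the dict form shown above."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image",
        # A data URL embeds the base64 payload directly in the string.
        "image_url": f"data:{mime_type};base64,{encoded}",
        "detail": detail,
    }
```

A tool's `run` method could return this dict directly, in the same way as the `FetchGalleryImageDict` example.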
- ### Files (PDF)
- Similarly, to return a file from a tool, use `ToolOutputFileContent` directly or one of the helper functions:
- ```python
- from agency_swarm import BaseTool, ToolOutputFileContent
- from agency_swarm.tools.utils import tool_output_file_from_path, tool_output_file_from_url
- from pydantic import Field
-
-
- class FetchReferenceReport(BaseTool):
-     """Return a reference PDF hosted remotely."""
-
-     source_url: str = Field(
-         default="https://raw.githubusercontent.com/VRSEN/agency-swarm/main/examples/data/sample_report.pdf",
-         description="Remote file to share",
-     )
-
-     def run(self) -> ToolOutputFileContent:
-         return ToolOutputFileContent(file_url=self.source_url)
-
-
- class FetchLocalReport(BaseTool):
-     """Return a report stored on disk."""
-
-     path: str = Field(default="examples/data/sample_report.pdf", description="Local file path")
-
-     def run(self) -> ToolOutputFileContent:
-         return tool_output_file_from_path(self.path)
-
-
- class FetchRemoteReport(BaseTool):
-     """Return a remote file using the helper."""
-
-     archive_url: str = Field(default="https://example.com/document.pdf", description="File to expose")
-
-     def run(self) -> ToolOutputFileContent:
-         return tool_output_file_from_url(self.archive_url)
- ```
- <Note>
- When you choose `file_data`, include `filename` to hint a download name; URL-based outputs rely on the remote server metadata instead.
- </Note>
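As a rough illustration of the note above, the sketch below pairs inline file contents with a filename. It assumes that `file_data` carries the file contents as base64 text, which this page does not state explicitly, and `inline_file_fields` is a hypothetical helper name rather than a library function.

```python
import base64


def inline_file_fields(data: bytes, filename: str) -> dict:
    """Hypothetical helper: build the two fields for an inline file output."""
    return {
        # Assumption: file_data expects base64-encoded file contents.
        "file_data": base64.b64encode(data).decode("ascii"),
        # The filename hints a download name, as described in the note.
        "filename": filename,
    }
```

If `ToolOutputFileContent` accepts `file_data` and `filename` as keyword arguments, the result could be unpacked into it with `ToolOutputFileContent(**inline_file_fields(data, "report.pdf"))`.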
- <Warning>
- `tool_output_file_from_path` only supports PDF files.
- </Warning>
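Because the helper only supports PDFs, it may be worth checking a path before passing it along. The sketch below is one possible guard using only the standard library; `has_pdf_header` and `looks_like_pdf` are hypothetical names, and the check relies on the fact that PDF files begin with the bytes `%PDF-`.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def has_pdf_header(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def looks_like_pdf(path: str) -> bool:
    """Return True if the path is an existing .pdf file with a PDF header."""
    p = Path(path)
    if p.suffix.lower() != ".pdf" or not p.is_file():
        return False
    with p.open("rb") as fh:
        return has_pdf_header(fh.read(len(PDF_MAGIC)))
```

A tool could call `looks_like_pdf(self.path)` first and return a plain text error to the agent when the check fails, instead of letting the helper raise.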
- <Tip>
- **Need to load local files without custom logic?** Use the built-in [`LoadFileAttachment`](/core-framework/tools/built-in-tools#loadfileattachment) tool instead of creating a custom tool. It handles both images and PDFs and uses these same utility functions under the hood.
- </Tip>
- ### Combining Multiple Outputs
- To return multiple outputs, return a list from `run`.
- ```python
- from agency_swarm import BaseTool, ToolOutputFileContent, ToolOutputImage, ToolOutputText
-
-
- class PrepareShowcase(BaseTool):
-     """Return rich media and a short description."""
-
-     teaser_a: str = "https://example.com/teaser-a.png"
-     teaser_b: str = "https://example.com/teaser-b.png"
-     report_id: str = "file-report-123"
-
-     def run(self) -> list:
-         return [
-             ToolOutputImage(image_url=self.teaser_a),
-             ToolOutputImage(image_url=self.teaser_b),
-             ToolOutputText(text="Gallery updated: Teaser A and Teaser B now live."),
-             ToolOutputFileContent(file_id=self.report_id),
-         ]
- ```
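When a list contains several images, a text entry that names them in order can help the agent refer to each one. The sketch below shows one way to build such a caption; `caption_for` is a hypothetical helper and the caption format is only an illustration.

```python
def caption_for(names: list[str]) -> str:
    """Hypothetical helper: list image names in the order they are returned."""
    parts = [f"Image {index}: {name}" for index, name in enumerate(names, start=1)]
    return "; ".join(parts)
```

The resulting string could be passed as the `text` of a `ToolOutputText` entry placed alongside the images in the returned list.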
- ## Complete Example (Chart generation tool)
- Here's a complete example using `BaseTool`:
- ```python
- import base64
- import io
-
- import matplotlib.pyplot as plt
- from agency_swarm import Agent, BaseTool, ToolOutputImage
- from pydantic import Field
-
-
- class GenerateChartTool(BaseTool):
-     """Generate a bar chart from data."""
-
-     data: list[float] = Field(..., description="Data points for the chart")
-     labels: list[str] = Field(..., description="Labels for each data point")
-
-     def run(self) -> ToolOutputImage:
-         """Generate and return the chart as a base64-encoded image."""
-         # Create the chart
-         fig, ax = plt.subplots()
-         ax.bar(self.labels, self.data)
-
-         # Render the figure to an in-memory PNG and encode it as base64
-         buf = io.BytesIO()
-         fig.savefig(buf, format="png")
-         plt.close(fig)  # Release this figure so repeated calls do not accumulate memory
-         image_base64 = base64.b64encode(buf.getvalue()).decode("utf-8")
-
-         # Return in multimodal format
-         return ToolOutputImage(image_url=f"data:image/png;base64,{image_base64}")
-
-
- # Create an agent with the tool
- agent = Agent(
-     name="DataViz",
-     instructions="You generate charts and visualizations for data analysis.",
-     tools=[GenerateChartTool],
- )
- ```
- <Note>
- `function_tool` decorators and `BaseTool` classes both support multimodal outputs in the exact same way.
- </Note>
- ```python
- from agency_swarm import ToolOutputImage, function_tool
-
-
- @function_tool
- def fetch_gallery_image() -> ToolOutputImage:
-     return ToolOutputImage(
-         image_url="https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
-         detail="auto",
-     )
- ```
- ## Tips & Best Practices
- - Base64-encoded images can be large. Use file references for large content.
- - Compress screenshots and other visuals before returning them to cut token usage without sacrificing clarity.
- - Include the image names in your textual response whenever you return more than one image so the agent can reference them unambiguously.
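To make the first tip concrete: base64 encodes every 3 raw bytes as 4 characters, so a payload grows by roughly one third. The sketch below estimates the encoded size and decides whether to inline an image; `MAX_INLINE_BYTES` is a hypothetical cutoff chosen for illustration, not a limit defined by the framework.

```python
import math

# Hypothetical cutoff; tune it for your model and token budget.
MAX_INLINE_BYTES = 512 * 1024


def base64_length(raw_size: int) -> int:
    """Length in characters of the padded base64 encoding of raw_size bytes."""
    return 4 * math.ceil(raw_size / 3)


def should_inline(raw_size: int, limit: int = MAX_INLINE_BYTES) -> bool:
    """Return True when the base64 payload stays within the limit."""
    return base64_length(raw_size) <= limit
```

When `should_inline` returns `False`, a file reference such as `file_id` or a URL is likely the better choice.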
- ## Real Examples
- - [TBD: Include repo from YouTube video]
- - [`examples/multimodal_outputs.py`](https://github.com/VRSEN/agency-swarm/blob/main/examples/multimodal_outputs.py)