---
title: "Multimodal Tool Outputs"
description: "Return images and files from your tools."
icon: "image"
---

## What This Feature Unlocks

Returning images and files from your tools enables real agentic feedback loops on completely new modalities.

For example, instead of dumping all the data into an agent and hoping for the best, you can have a tool generate a visualization or load a PDF report, then let the agent provide insights based on that output, just like a real data analyst.

This saves context window space and unlocks autonomous agentic workflows for many new use cases.

## New Use Cases

<CardGroup cols={2}>
  <Card title="Software Development" icon="code">
    Agents can check websites autonomously and iterate until all elements are properly positioned, enabling them to tackle complex projects without manual screenshot feedback.
  </Card>
  <Card title="Brand Asset Generation" icon="paintbrush">
    Provide brand guidelines, logos, and messaging, then let agents iterate on image and video generation (including Sora 2) until outputs fully match your expectations.
  </Card>
  <Card title="Screen-Aware Assistance" icon="eye">
    Build agents that help visually impaired individuals navigate websites or create customer support agents that see the user's current webpage for better assistance.
  </Card>
  <Card title="Data Analytics" icon="chart-area">
    Generate visual graphs and analyze PDF reports, then let agents provide insights based on these outputs without overloading the context window.
  </Card>
</CardGroup>

## Output Formats

### Images (PNG, JPG)

To return an image from a tool, use one of these options:

1. Use the `ToolOutputImage` class.
2. Return a dict with `type` set to `"image"` and either `image_url` (URL or data URL) or `file_id`.
3. Use the convenience function `tool_output_image_from_path`.

```python
from agency_swarm import BaseTool, ToolOutputImage, ToolOutputImageDict
from agency_swarm.tools.utils import tool_output_image_from_path
from pydantic import Field


class FetchGalleryImage(BaseTool):
    """Return a static gallery image."""

    detail: str = Field(default="auto", description="Level of detail")

    def run(self) -> ToolOutputImage:
        return ToolOutputImage(
            image_url="https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
            detail=self.detail,
        )


class FetchGalleryImageDict(BaseTool):
    """Dict variant of the same image output."""

    detail: str = Field(default="auto", description="Level of detail")

    def run(self) -> ToolOutputImageDict:
        return {
            "type": "image",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
            "detail": self.detail,
        }


class FetchLocalImage(BaseTool):
    """Load an image from disk using the helper."""

    path: str = Field(default="examples/data/landscape_scene.png", description="Image to publish")

    def run(self) -> ToolOutputImage:
        return tool_output_image_from_path(self.path, detail="auto")
```
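
The dict form described above also accepts a data URL in `image_url`, which is useful when the image exists only in memory. The following is a minimal sketch using only the standard library; `image_bytes_to_output_dict` is a hypothetical helper name, not part of `agency_swarm`.

```python
import base64


def image_bytes_to_output_dict(
    data: bytes, mime_type: str = "image/png", detail: str = "auto"
) -> dict:
    """Build the dict form of an image output from in-memory bytes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Data URLs carry the MIME type followed by the base64-encoded payload.
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image",
        "image_url": f"data:{mime_type};base64,{encoded}",
        "detail": detail,
    }


# The 8-byte PNG signature stands in for real image bytes here.
output = image_bytes_to_output_dict(b"\x89PNG\r\n\x1a\n")
print(output["image_url"])
```

A tool's `run` method could return this dict directly, in the same way as `FetchGalleryImageDict`.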

### Files (PDF)

Similarly, to return a file from a tool, use the `ToolOutputFileContent` class or one of the helper functions:

```python
from agency_swarm import BaseTool, ToolOutputFileContent
from agency_swarm.tools.utils import tool_output_file_from_path, tool_output_file_from_url
from pydantic import Field


class FetchReferenceReport(BaseTool):
    """Return a reference PDF hosted remotely."""

    source_url: str = Field(
        default="https://raw.githubusercontent.com/VRSEN/agency-swarm/main/examples/data/sample_report.pdf",
        description="Remote file to share",
    )

    def run(self) -> ToolOutputFileContent:
        return ToolOutputFileContent(file_url=self.source_url)


class FetchLocalReport(BaseTool):
    """Return a report stored on disk."""

    path: str = Field(default="examples/data/sample_report.pdf", description="Local file path")

    def run(self) -> ToolOutputFileContent:
        return tool_output_file_from_path(self.path)


class FetchRemoteReport(BaseTool):
    """Return a remote file using the helper."""

    archive_url: str = Field(default="https://example.com/document.pdf", description="File to expose")

    def run(self) -> ToolOutputFileContent:
        return tool_output_file_from_url(self.archive_url)
```

<Note>
  When you pass `file_data`, also set `filename` to suggest a download name; URL-based outputs rely on the remote server metadata instead.
</Note>
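
As a rough illustration of pairing `file_data` with `filename`, the sketch below reads a file and prepares both values. It assumes `file_data` takes base64-encoded text; that encoding is an assumption here, so confirm the expected format against the library before relying on it. `file_output_kwargs` is a hypothetical helper name.

```python
import base64
import tempfile
from pathlib import Path


def file_output_kwargs(path: str) -> dict:
    """Return `file_data` and `filename` values for a file on disk.

    Hypothetical helper. Assumes `file_data` expects base64 text.
    """
    file_path = Path(path)
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    # Reuse the on-disk name as the suggested download name.
    return {"file_data": encoded, "filename": file_path.name}


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.pdf"
    sample.write_bytes(b"%PDF-1.4 demo")
    kwargs = file_output_kwargs(str(sample))

print(kwargs["filename"])
```

If the assumption holds, these values could be passed along as `ToolOutputFileContent(**kwargs)` inside a tool's `run` method.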

<Warning>
  `tool_output_file_from_path` only supports PDF files.
</Warning>

<Tip>
  **Need to load local files without custom logic?** Use the built-in [`LoadFileAttachment`](/core-framework/tools/built-in-tools#loadfileattachment) tool instead of creating a custom tool. It handles both images and PDFs and uses these same utility functions under the hood.
</Tip>

### Combining Multiple Outputs

Return multiple outputs by returning a list from `run`.

```python
from agency_swarm import BaseTool, ToolOutputFileContent, ToolOutputImage, ToolOutputText


class PrepareShowcase(BaseTool):
    """Return rich media and a short description."""

    teaser_a: str = "https://example.com/teaser-a.png"
    teaser_b: str = "https://example.com/teaser-b.png"
    report_id: str = "file-report-123"

    def run(self) -> list:
        return [
            ToolOutputImage(image_url=self.teaser_a),
            ToolOutputImage(image_url=self.teaser_b),
            ToolOutputText(text="Gallery updated: Teaser A and Teaser B now live."),
            ToolOutputFileContent(file_id=self.report_id),
        ]
```
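
When a list contains several images, a text entry that names them in order gives the agent something to refer to. Below is a minimal standard-library sketch of building such a caption next to the image dicts; `build_labeled_gallery` is a hypothetical helper, and the URLs are placeholders.

```python
def build_labeled_gallery(images: dict[str, str]) -> tuple[list[dict], str]:
    """Return image dicts plus a caption naming each image in order.

    Hypothetical helper, not part of agency_swarm.
    """
    # One dict-form image output per URL, in insertion order.
    outputs = [{"type": "image", "image_url": url} for url in images.values()]
    # Number the names so the caption matches the order of the images.
    caption = "Images in order: " + ", ".join(
        f"{index}. {name}" for index, name in enumerate(images, start=1)
    )
    return outputs, caption


image_outputs, caption = build_labeled_gallery(
    {
        "Teaser A": "https://example.com/teaser-a.png",
        "Teaser B": "https://example.com/teaser-b.png",
    }
)
print(caption)
```

Inside a tool, the caption could be wrapped as `ToolOutputText(text=caption)` and returned in the same list as the images.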

## Complete Example (Chart Generation Tool)

Here's a complete example using `BaseTool` that renders a bar chart with matplotlib and returns it as a base64 data URL:

```python
import base64
import io

import matplotlib.pyplot as plt
from agency_swarm import Agent, BaseTool, ToolOutputImage
from pydantic import Field


class GenerateChartTool(BaseTool):
    """Generate a bar chart from data."""

    data: list[float] = Field(..., description="Data points for the chart")
    labels: list[str] = Field(..., description="Labels for each data point")

    def run(self) -> ToolOutputImage:
        """Generate and return the chart as a base64-encoded image."""
        # Create the chart
        fig, ax = plt.subplots()
        ax.bar(self.labels, self.data)

        # Render this specific figure to an in-memory PNG
        buf = io.BytesIO()
        fig.savefig(buf, format="png")
        # Close the figure explicitly so repeated calls do not leak memory
        plt.close(fig)

        # Convert to base64
        image_base64 = base64.b64encode(buf.getvalue()).decode("utf-8")

        # Return in multimodal format
        return ToolOutputImage(image_url=f"data:image/png;base64,{image_base64}")


# Create an agent with the tool
agent = Agent(
    name="DataViz",
    instructions="You generate charts and visualizations for data analysis.",
    tools=[GenerateChartTool],
)
```

<Note>
  `function_tool` decorators and `BaseTool` classes both support multimodal outputs in the exact same way.
</Note>

```python
from agency_swarm import ToolOutputImage, function_tool


@function_tool
def fetch_gallery_image() -> ToolOutputImage:
    return ToolOutputImage(
        image_url="https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
        detail="auto",
    )
```

## Tips & Best Practices

- Base64 encoding inflates a payload by roughly a third, so inline images can be large. Use file references (a `file_id` or a URL) for large content.
- Compress screenshots and other visuals before returning them to cut token usage without sacrificing clarity.
- Include the image names in your textual response whenever you return more than one image so the agent can reference them unambiguously.
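
To make the first tip concrete, the sketch below estimates the base64 size of a payload and compares it with an inline budget. The `MAX_INLINE_CHARS` value and both function names are hypothetical; pick a limit that suits your model and context window.

```python
import base64
import math

# Hypothetical budget for inline payloads, in base64 characters.
MAX_INLINE_CHARS = 200_000


def base64_length(raw_size: int) -> int:
    """Length of the padded base64 encoding of raw_size bytes."""
    # Every 3 input bytes become 4 output characters, padded to a multiple of 4.
    return 4 * math.ceil(raw_size / 3)


def should_inline(raw_size: int, limit: int = MAX_INLINE_CHARS) -> bool:
    """True when the encoded payload fits within the inline budget."""
    return base64_length(raw_size) <= limit


payload = bytes(300_000)  # stand-in for a 300 KB screenshot
# Sanity check: the estimate matches the real encoder.
assert base64_length(len(payload)) == len(base64.b64encode(payload))
print(base64_length(len(payload)), should_inline(len(payload)))
```

The estimate ignores the short `data:image/png;base64,` prefix of a data URL. When `should_inline` returns `False`, returning a URL or `file_id` keeps the payload out of the tool result.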

## Real Examples

- [TBD: Include repo from YouTube video]
- [`examples/multimodal_outputs.py`](https://github.com/VRSEN/agency-swarm/blob/main/examples/multimodal_outputs.py)