What is Multi-Modal Prompting?

Multi-modal prompting is a technique that enables AI models to process and respond to multiple types of input media, such as combining text with images, audio, or documents. Unlike traditional text-only prompts, multi-modal prompting allows for richer interactions by incorporating visual context, enabling applications like image analysis, document processing, and visual reasoning.

Why Use Multi-Modal Prompting?

  • Enhanced Context: Visual content provides information that’s difficult to express in text alone
  • Improved Accuracy: Models can “see” what they’re analyzing rather than relying on descriptions
  • Complex Reasoning: Combines visual perception with textual reasoning for sophisticated tasks
  • Natural Interaction: Mimics human ability to process multiple sensory inputs simultaneously
  • Application Versatility: Enables new use cases like visual QA, document analysis, and content moderation
  • Reduced Ambiguity: Images provide concrete references that minimize misinterpretations

Basic Implementation in Latitude

Here’s a simple multi-modal prompting example using Latitude:

Image Analysis
---
provider: OpenAI
model: gpt-4o
temperature: 0.7
parameters:
  input_image:
    type: image
    description: "Image to analyze"
  analysis_request:
    type: text
    description: "What aspect of the image to analyze"
---

# Image Analysis Assistant

I'll help you analyze the provided image in detail.

## Image:
{{ input_image }}

## Analysis Request:
{{ analysis_request }}

## Comprehensive Analysis:
I'll examine the image carefully and provide insights based on your request.

In the Playground, the input_image parameter will automatically be configured with an image upload button, while the analysis_request will be a standard text field. This configuration is based on the parameter types specified in the YAML header.

Advanced Implementation with Multiple Media Types

This example shows how to work with both images and text in a more sophisticated workflow:

---
provider: OpenAI
model: gpt-4o
temperature: 0.5
parameters:
  product_image:
    type: image
    description: "Product image to analyze"
  product_description:
    type: text
    description: "Text description of the product"
---

<step>
# Initial Image Analysis

## Image:
{{ product_image }}

## Task:
Analyze this product image thoroughly. Focus on:
1. What type of product it is
2. Key visible features
3. Visual design elements
4. Target demographic suggested by the styling
5. Overall quality impression

## Initial Analysis:
</step>

<step>
# Comparative Assessment

## Product Image:
{{ product_image }}

## Product Description:
{{ product_description }}

## Comparison Task:
Compare the actual product image with the provided product description.
Identify:
1. Features mentioned in the description that are clearly visible in the image
2. Features mentioned but not clearly visible
3. Visual aspects not mentioned in the description
4. Any inconsistencies between the image and description

## Comparative Analysis:
</step>

<step>
# Final Recommendations

Based on the image analysis and comparison with the description:

## Product Enhancement Suggestions:
1. Product presentation improvements
2. Description alignment recommendations
3. Target audience considerations
4. Visual marketing opportunities

## Final Recommendations:
</step>

Setting Up Multi-Modal Parameters

Latitude offers several ways to configure and use multi-modal parameters:

Parameter Configuration in YAML

You can define parameter types directly in your prompt’s configuration YAML:

parameters:
  product_image:
    type: image
    description: "Product image to analyze"
  data_file:
    type: file
    description: "Document to process"
  text_input:
    type: text
    description: "Standard text prompt"

Parameter Input Methods in the Playground

When testing multi-modal prompts in the Playground, you can use these input methods:

  1. Manual Upload: Directly upload images and files through the parameter input fields
  2. Dataset Integration: Load images from a dataset for batch testing across multiple visual examples
  3. History Reuse: Access previously used images from your parameter history

For detailed information, see the Playground Parameter Input Methods guide.

Working with Multi-Modal Inputs in the Playground

When testing multi-modal prompts in the Latitude Playground, you’ll encounter specific controls for image and file inputs:

  1. Image Parameters:

    • Click the upload button to select an image from your device
    • Optionally, use the image preview to verify you’ve selected the correct file
    • For vision models, images will be appropriately encoded and embedded in the prompt

  2. File Parameters:

    • Upload PDF documents or other supported file types
    • The Playground will process these files according to the provider’s requirements
    • Some providers may have file size or type limitations

  3. Dataset Testing:

    • Create datasets that include image or file URLs for batch testing
    • Test your multi-modal prompts across a variety of visual inputs
    • Compare performance with different visual content types

For complex multi-step prompts with visual components, you can use the Playground to observe how the model processes visual information at each step of the chain. This is particularly useful for debugging visual reasoning flows.

Document Processing Implementation

This example demonstrates how to process document files (PDFs) using multi-modal capabilities:

---
provider: Anthropic
model: claude-3-opus-20240229
temperature: 0.2
parameters:
  input_document:
    type: file
    description: "PDF or document file to analyze"
  analysis_request:
    type: text
    description: "What to extract or analyze from the document"
  extraction_fields:
    type: text
    description: "Optional comma-separated list of fields to extract"
    required: false
---

# Document Analysis Assistant

I'll help you extract and analyze information from the provided document.

## Document:
{{ input_document }}

## Analysis Request:
{{ analysis_request || "Please summarize the key points of this document." }}

## Document Analysis:
I'll carefully review the document and provide a thorough analysis based on your request.

{{ if extraction_fields }}
## Specific Information Extraction:
I'll extract the following specific fields from the document:
{{ for field in extraction_fields }}
- {{ field }}
{{ endfor }}
{{ endif }}

Best Practices for Multi-Modal Prompting

Advanced Techniques

Visual Chain-of-Thought

Implement a step-by-step visual reasoning process:

Visual CoT
---
provider: OpenAI
model: gpt-4o
temperature: 0.4
parameters:
  problem_image:
    type: image
    description: "Visual problem or puzzle to solve"
  reasoning_task:
    type: text
    description: "Specific reasoning task to perform on the image"
---

# Visual Reasoning Task

I'll solve this visual problem step by step.

## Image:
{{ problem_image }}

## Task:
{{ reasoning_task }}

## Visual Chain-of-Thought:
1. **Initial Observation**: First, I'll carefully observe what's in the image...
2. **Identify Key Elements**: Next, I'll identify the specific elements relevant to this problem...
3. **Analyze Relationships**: Then, I'll analyze how these elements relate to each other...
4. **Apply Reasoning**: Based on these observations, I'll apply reasoning to solve the task...
5. **Final Answer**: Therefore, my conclusion is...

Multi-Image Comparison

Compare and analyze multiple images simultaneously:

---
provider: OpenAI
model: gpt-4o
temperature: 0.3
parameters:
  first_image:
    type: image
    description: "First image for comparison"
  second_image:
    type: image
    description: "Second image for comparison"
  comparison_request:
    type: text
    description: "Specific aspects to compare between images"
    default: "Compare these two images and describe key differences and similarities."
---

# Multi-Image Comparison Analysis

I'll compare the provided images and analyze the differences and similarities.

## Image 1:
{{ first_image }}

## Image 2:
{{ second_image }}

## Comparison Task:
{{ comparison_request || "Compare these two images and describe key differences and similarities." }}

## Comparative Analysis:
### Similarities:
[I'll list the common elements between both images]

### Differences:
[I'll detail the key differences I observe]

### Significance:
[I'll explain the importance of these differences in the context of your request]

Visual Information Extraction

Extract structured data from visually rich content:

Visual Extraction
---
provider: OpenAI
model: gpt-4o
temperature: 0.2
type: json
parameters:
  source_image:
    type: image
    description: "Image containing information to extract"
  extraction_request:
    type: text
    description: "What specific information to extract from the image"
schema:
  type: object
  required: ["extracted_data", "confidence_level"]
  properties:
    extracted_data:
      type: object
      description: "Data extracted from the image according to the extraction request"
    confidence_level:
      type: string
      enum: ["very low", "low", "medium", "high", "very high"]
      description: "Confidence level in the extracted information"
    reasoning:
      type: string
      description: "Brief explanation of how the extraction was performed and any challenges encountered"
---

# Visual Information Extraction

Extract specific information from the provided image.

## Image:
{{ source_image }}

## Extraction Request:
{{ extraction_request }}

## Analysis:
I'll examine the image carefully to extract the requested information and provide it in a structured format.

Multi-modal prompting works well when combined with other prompting techniques:

  1. Chain-of-Thought: Break down visual reasoning into explicit steps for complex image analysis.

  2. Self-Consistency: Generate multiple interpretations of an image and select the most consistent one.

  3. Template-Based Prompting: Use templates to standardize visual analysis across different images.

  4. Retrieval-Augmented Generation: Combine image analysis with retrieved textual information for contextual understanding.

  5. Few-Shot Learning: Provide examples of image-text pairs to guide the model’s visual interpretation.

Real-World Applications

Multi-modal prompting is particularly valuable in these domains:

  • Content Moderation: Analyzing images for policy violations or inappropriate content
  • E-commerce: Automated product photo analysis, comparison, and description generation
  • Healthcare: Reviewing medical images alongside patient records (with appropriate regulatory compliance)
  • Document Processing: Extracting information from forms, receipts, and ID documents
  • Accessibility: Generating detailed image descriptions for vision-impaired users
  • Education: Creating interactive learning experiences with visual elements
  • Quality Control: Inspecting products or materials for defects or compliance issues

Advanced Configuration for Multi-Modal Parameters

Parameter Definition Options

Parameters in multi-modal prompts can be configured with several advanced options:

parameters:
  product_image:
    type: image                  # Parameter type: image, file, or text
    description: "Product image" # Help text shown in Playground
  document_file:
    type: file
    description: "PDF document to analyze"
  extraction_fields:
    type: text
    description: "Fields to extract (comma separated)"
    default: "date, invoice number, total amount"

Parameter Types for Multi-Modal Inputs

Latitude supports these parameter types for multi-modal prompting:

  1. Image Parameters (type: image):

    • Supported by models with vision capabilities (e.g., GPT-4o, Claude 3)
    • Rendered in the prompt using {{ parameter_name }}
    • Appears as an image upload button in the Playground
    • Most models support common formats: JPEG, PNG, WebP, GIF
  2. File Parameters (type: file):

    • Supported by models with document processing capabilities
    • Different providers support different file types:
      • Claude: PDF documents
      • GPT-4o: Various document formats
    • Enables document analysis, data extraction, and PDF processing
  3. Text Parameters (type: text):

    • Standard text input that can be used alongside multi-modal inputs
    • Can contain instructions for processing the visual content

For complete details on parameter configuration, refer to the Configuration guide.

Multi-Modal Content in PromptL

When working with multi-modal content in Latitude’s PromptL syntax, you can reference images and files directly in your messages:

# Image Analysis

<user>
Please analyze this image: <content-image src="{{ image_parameter }}" />
What can you tell me about it?
</user>

<assistant>
I'll analyze the image you provided.
...
</assistant>

Or for document content:

<user>
Review this document: <content-file src="{{ document_parameter }}" />
Summarize the key points.
</user>

This explicit content tag syntax is an alternative to direct variable insertion and works well in more complex prompt structures.