Vision Models

Vision models are AI systems designed to understand and analyze visual information from images or documents. These models can perform tasks such as image classification, object detection, and visual question answering. This guide explains how to use vision models through the API for both image and document inputs.

Understanding Vision Models

Vision models use advanced neural networks to process and interpret visual data. They can:

  • Identify objects, people, and scenes in images
  • Analyze text and layouts in documents
  • Answer questions about visual content
  • Generate descriptions of images

Supported Models and Input Types

Different models support various combinations of input types. Here's a breakdown:

Models supporting Text, Images, and Documents:

  • Anthropic Claude 3 Sonnet
  • Anthropic Claude 3.5 Sonnet
  • Anthropic Claude 3 Haiku
  • Anthropic Claude 3 Opus

Models supporting Text and Images only:

  • OpenAI GPT-4o
  • OpenAI GPT-4o Mini

When using these models, ensure you're providing the appropriate input types as supported by each model.
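
If your application routes requests to more than one model, it can help to validate inputs before sending anything. The sketch below is illustrative only; the SUPPORTED_INPUTS table and check_inputs helper are not part of the API, they simply mirror the lists above.

# Illustrative lookup of the input types each model supports,
# mirroring the lists above. Adjust to match your deployment.
SUPPORTED_INPUTS = {
    "Anthropic Claude 3 Sonnet": {"text", "image", "document"},
    "Anthropic Claude 3.5 Sonnet": {"text", "image", "document"},
    "Anthropic Claude 3 Haiku": {"text", "image", "document"},
    "Anthropic Claude 3 Opus": {"text", "image", "document"},
    "OpenAI GPT-4o": {"text", "image"},
    "OpenAI GPT-4o Mini": {"text", "image"},
}

def check_inputs(model, content_types):
    # Fail fast if the request includes a content type the model cannot handle.
    unsupported = set(content_types) - SUPPORTED_INPUTS[model]
    if unsupported:
        raise ValueError(f"{model} does not support: {', '.join(sorted(unsupported))}")

check_inputs("OpenAI GPT-4o", ["text", "image"])      # fine
# check_inputs("OpenAI GPT-4o", ["text", "document"]) # raises ValueError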

Input Modalities

Images

Vision models can accept various image formats as input, including JPEG, PNG, WebP, and GIF. Images can be provided to the API in base64-encoded format.

Documents

For document analysis, vision models can work with CSV, DOC, DOCX, HTML, Markdown, PDF, Plain text, XLS, and XLSX files.

Documents should also be converted to a base64 string before being sent to the API.
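
A simple way to apply these lists in code is to classify each file before building a request. The helper below is only an illustrative sketch; the extension sets mirror the image and document formats listed above.

from pathlib import Path

# Illustrative extension sets mirroring the supported formats listed above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}
DOCUMENT_EXTENSIONS = {".csv", ".doc", ".docx", ".html", ".md",
                       ".pdf", ".txt", ".xls", ".xlsx"}

def classify_file(path):
    # Return "image" or "document" based on the file extension.
    ext = Path(path).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "image"
    if ext in DOCUMENT_EXTENSIONS:
        return "document"
    raise ValueError(f"Unsupported file format: {ext}")

print(classify_file("report.pdf"))  # document
print(classify_file("photo.jpg"))   # image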

Running Vision Inference Requests

To use a vision model, you'll need to structure your API request to include both the image/document data and any text prompts or instructions. The image or document data should be converted to a base64 string and included in the base64String property, not in the text property. Here's an example of how to format a request:

{
  "model": "anthropic.claude-3-haiku-20240307-v1:0",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "base64String": "iVBORw0KGgoAAAANSUhEUgAAAfgA..." // Base64-encoded image data
        },
        {
          "type": "text",
          "text": "Describe this image"
        }
      ]
    }
  ],
  // ... rest of the request body ...
}
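
If you build this request programmatically, the body is an ordinary JSON document. Here is a minimal Python sketch that constructs the same payload; how the payload is sent (SDK or HTTP client) depends on your setup and is not shown:

import json

# Placeholder for the base64-encoded image data; the next section shows
# how to produce this string from a file.
image_base64 = "iVBORw0KGgoAAAANSUhEUgAAAfgA..."

payload = {
    "model": "anthropic.claude-3-haiku-20240307-v1:0",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "base64String": image_base64},
                {"type": "text", "text": "Describe this image"},
            ],
        }
    ],
    # Add any other request parameters your deployment requires here.
}

print(json.dumps(payload, indent=2))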

Converting Images or Documents to Base64

Before sending an image or document to the API, you need to convert it to a base64 string. Here are examples of how to do this, first from the command line:

base64 -i image.jpg | tr -d '\n' > image_base64.txt
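
And here is the same conversion in Python, using only the standard library; the function name encode_file_to_base64 is just an illustrative choice:

import base64

def encode_file_to_base64(path):
    # Read the file as raw bytes and return its base64 representation as a
    # UTF-8 string with no newlines, ready for the base64String property.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_base64 = encode_file_to_base64("image.jpg")
document_base64 = encode_file_to_base64("report.pdf")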

These approaches work for both images and documents (such as PDFs). Include the resulting base64 string in the base64String property of your API request, as shown in the request example above.

Best Practices

  • Ensure your images are clear and well-lit for best results.
  • Be specific in your text prompts to guide the model's analysis.
  • Experiment with different parameter settings to optimize output.

Conclusion

Vision models open up exciting possibilities for AI-powered image and document analysis. By following this guide, you can start leveraging these powerful tools in your applications.
