What Is Multimodal Image Embedding?

Multimodal Image Embedding extracts images from PDFs and Word (DOCX) documents during training, generates vector embeddings for each image, and stores them in the same vector index as your text chunks. When users ask questions, the system can retrieve relevant images (diagrams, screenshots, charts) alongside text, and surface them in answers.

Key points:

  • Images and text share the same vector space and index.
  • Only the default embedding provider supports image embedding (OpenAI text-embedding-3-large does not).
  • Available on Trial, Small, Medium, Large, and Enterprise plans.

Prerequisites

  • Plan eligibility: Trial, Small, Medium, Large, or Enterprise. Mini and Starter plans do not include multimodal embeddings.
  • Supported file types: PDF and DOCX only.

How to Enable

1 Start Training

Go to AI Agents, select the agent, ensure document tags and files are ready to train, then click Train (or Retrain).

2 Enable the Checkbox

In the AI Training Confirmation dialog, check Multimodal Image Embedding. Optionally enable Smart Image-Aware Chunking to link images to text chunks with captions. Click Start Training.

Note: The checkbox is only shown when the training set includes PDF or DOCX files. It is disabled if your plan does not include multimodal embeddings.

What Happens During Training

  1. Image extraction: Images are extracted from each PDF and DOCX page.
  2. Image preparation: Each image is validated and, if it exceeds 2048px or 2MB, resized to fit within those limits. Supported formats: PNG and JPEG.
  3. Embedding: Images are sent to the embedding service in batches of up to 6 per request.
  4. Storage: Each image embedding is stored in the agent's vector collection with modality: 'image'.
  5. Display: A DocumentImage record is created so the image can be shown in chat when a relevant chunk is returned.
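The preparation and batching steps above can be sketched as follows. This is an illustrative sketch only; the helper names, data shapes, and the `needs_resize` flag are assumptions for the example, not the product's actual implementation.

```python
# Illustrative sketch of image preparation and batching during training
# (helper names and data shapes are assumptions, not the real pipeline).

MAX_DIMENSION = 2048          # px limit from the docs
MAX_BYTES = 2 * 1024 * 1024   # 2MB limit from the docs
BATCH_SIZE = 6                # images per embedding request
SUPPORTED_FORMATS = {"PNG", "JPEG"}

def prepare(image):
    """Validate format and flag images that exceed the size limits."""
    if image["format"] not in SUPPORTED_FORMATS:
        return None  # skipped: unsupported format
    needs_resize = (
        max(image["width"], image["height"]) > MAX_DIMENSION
        or image["size_bytes"] > MAX_BYTES
    )
    return {**image, "needs_resize": needs_resize}

def batches(items, size=BATCH_SIZE):
    """Group prepared images into embedding-request batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Example: 13 valid images become requests of 6, 6, and 1.
images = [
    {"format": "PNG", "width": 800, "height": 600, "size_bytes": 50_000}
    for _ in range(13)
]
prepared = [p for p in map(prepare, images) if p is not None]
requests = batches(prepared)
print([len(b) for b in requests])  # [6, 6, 1]
```

Each batch would then be sent to the embedding service, and each resulting vector stored with modality: 'image' alongside your text chunks.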

Multimodal vs. Smart Image-Aware Chunking

Multimodal Image Embedding embeds images as vectors so they can be retrieved by semantic search. Smart Image-Aware Chunking links images to text chunks with captions so they appear when the linked text is relevant. You can enable both.
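The difference between the two features can be sketched at query time: multimodal embedding returns images directly from semantic search, while Smart Image-Aware Chunking brings images along with a linked text chunk. The data shapes below are hypothetical, chosen only to illustrate the distinction.

```python
# Hypothetical sketch of how each feature surfaces images in an answer
# (result and link shapes are assumptions, not the actual schema).

def collect_images(results, chunk_image_links):
    """Gather images from a mixed retrieval result set.

    - Multimodal Image Embedding: image vectors are retrieved directly
      (entries with modality == 'image').
    - Smart Image-Aware Chunking: images linked to a retrieved text
      chunk are surfaced with it.
    """
    images = []
    for r in results:
        if r["modality"] == "image":
            images.append(r["image_id"])  # retrieved by semantic search
        else:
            # images captioned/linked to this text chunk, if any
            images.extend(chunk_image_links.get(r["chunk_id"], []))
    return images

results = [
    {"modality": "image", "image_id": "img-7"},   # direct hit
    {"modality": "text", "chunk_id": "chunk-3"},  # text hit with a link
]
links = {"chunk-3": ["img-9"]}
print(collect_images(results, links))  # ['img-7', 'img-9']
```

With both features enabled, an answer can include images found either way, which is why enabling both gives the broadest image coverage.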

Verifying It Worked

After training, the Training Result modal shows an Images embedded count if multimodal was enabled. If no images were found in your documents, the count is 0.

Learn More

To learn more about the benefits of multimodal image embedding, see our Multimodal Image Embedding insight. For training tips, see AI Agent Training and Document Best Practices.