Video to Text

Extract text descriptions from videos using AI

What does this node do?

The Video to Text node uses AI models to analyze video content and produce text descriptions, summaries, or structured data. It can describe what happens in a video, transcribe spoken content, or extract specific information based on your prompt.

Common uses:

  • Generate detailed video descriptions for cataloging or accessibility
  • Transcribe and summarize video content
  • Extract specific information from videos for downstream processing
  • Analyze video scenes for content moderation or tagging

Quick setup

Add the Video to Text node

Find it in AI Nodes → Video to Text

Provide the video

Connect a video source via the input_media port — accepts videos, URLs, images, text, or arrays

Choose your model

Select the model to use for analysis. The provider is set automatically based on the model you pick.

Write your prompt

Tell the AI what to extract, describe, or summarize from the video

Configuration

Input

input_media media

The media to analyze. Accepts videos, URLs, images, text, or arrays of these types. This input is optional: you can instead reference media through {{variables}} in the prompt.

Required fields

modelName string required

The LLM model to use for video analysis. Choose a model that supports vision/video input (e.g., GPT-4o, Gemini Pro).
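Because modelName is the only required field, a quick check before running a workflow can catch a missing value early. The following is a minimal sketch, assuming the node configuration is available as a plain dictionary keyed by the field names documented here. The dictionary form, the `missing_required_fields` helper, and the `"gpt-4o"` identifier are illustrative assumptions, not part of the product.

```python
# Field names taken from this page; the dict representation is an assumption.
REQUIRED_FIELDS = ("modelName",)


def missing_required_fields(config: dict) -> list[str]:
    """Return the required field names that are absent or blank."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(config.get(name) or "").strip()
    ]


config = {
    "modelName": "gpt-4o",  # assumed identifier; use the name shown in the model picker
    "prompt": "Describe everything that happens in this video",
}

print(missing_required_fields(config))
```

An empty list means every required field is present; any names in the list need a value before the node can run.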

Optional fields

llmProvider string

The LLM provider to use (e.g., OpenAI, Google). Automatically set based on the selected model.

prompt string

Custom instructions for the AI describing what to extract or generate from the video. Supports {{variables}} to inject dynamic values from other nodes.

Examples:

  • “Describe everything that happens in this video”
  • “Transcribe the spoken dialogue in this video”
  • “List all products shown in this video with timestamps”
  • “Summarize the key points discussed in this presentation”
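To picture how {{variables}} behave, it can help to think of the prompt as a template that is filled in before it reaches the model. The sketch below is only an illustration, assuming variables are simple names in double braces. `render_prompt` is a hypothetical helper, and the way the node actually resolves variables may differ.

```python
import re


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with its value.

    Placeholders with no matching value are left unchanged so that
    a missing variable is easy to spot in the final prompt.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", substitute, template)


prompt = render_prompt(
    "List all {{item_type}} shown in this video with timestamps",
    {"item_type": "products"},
)
print(prompt)
```

Leaving unknown placeholders in place is a design choice for the sketch: it makes an unresolved variable visible instead of silently producing a prompt with a gap.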

Output

The node outputs a single string containing the AI-generated text.

{
  "output": "The video shows a 30-second product demo. A person unboxes a wireless keyboard, connects it via Bluetooth, and types a few sentences to demonstrate the key feel. The packaging is minimal with a white box and the brand logo visible at 0:05."
}
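If the result is consumed outside the workflow, for example in a script that receives the JSON shown above, the text can be read from the output key. This is a minimal sketch, assuming the payload arrives as a JSON string in exactly that shape. `extract_text` is a hypothetical helper.

```python
import json


def extract_text(payload: str) -> str:
    """Parse the node's JSON result and return the generated text.

    Returns an empty string when the "output" key is missing or not a string.
    """
    data = json.loads(payload)
    text = data.get("output", "")
    return text.strip() if isinstance(text, str) else ""


raw = '{"output": "The video shows a 30-second product demo."}'
description = extract_text(raw)
print(description)
```

Checking for an empty result here gives downstream steps a simple way to detect the "empty or generic response" case described under Common issues.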

Version: 2.0

Examples

Generate a video description

Describe a marketing video for content cataloging:

Workflow:

  1. Static Video — Provide the video file
  2. Video to Text — Prompt: “Describe this video in detail, including the setting, actions, and any visible text or branding”
  3. LLM — Rewrite the description for a specific audience or format

Summarize a presentation recording

Extract key points from a recorded meeting or presentation:

Workflow:

  1. Google Drive Reader — Select the video from Drive
  2. Video to Text — Prompt: “Summarize the main topics discussed in this video. List each topic with a brief description”
  3. Notion Database Writer — Save the summary to Notion for team reference

Best practices

  • Choose the right model. Video analysis requires models with vision capabilities. Not all LLMs support video input — check model documentation.
  • Be specific in your prompt. The more precise your instructions, the better the output. Instead of “describe the video”, ask for exactly what you need (timestamps, people, actions, text).
  • Consider video length. Longer videos take more time and tokens to process. If you only need specific moments, pull those frames with Extract Video Frame first and analyze them instead of the full video.

Common issues

The node returns an empty or generic response

Cause: The selected model may not support video input, or the video format is not recognized.

Solution: Verify the model supports vision/video analysis. Try a different model (e.g., GPT-4o or Gemini Pro). Ensure the video file is in a supported format (MP4, MOV, WebM).
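A quick way to rule out an unsupported format is to check the file extension before sending the video to the node. The sketch below assumes the three formats listed above and works on both file paths and URLs. `has_supported_extension` is a hypothetical helper, and an extension check does not confirm that the file's contents are actually valid.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats named in this section; the check itself is only a heuristic.
SUPPORTED_EXTENSIONS = {".mp4", ".mov", ".webm"}


def has_supported_extension(source: str) -> bool:
    """Return True if the path or URL ends in a supported video extension."""
    # urlparse drops any query string, so "demo.mp4?token=1" still matches.
    path = urlparse(source).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(has_supported_extension("https://example.com/videos/demo.mp4?token=1"))
print(has_supported_extension("recording.avi"))
```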

The output is inaccurate or misses key details

Cause: The prompt is too vague, or the video is too long for the model to process entirely.

Solution: Refine your prompt to be more specific about what to extract. For long videos, consider extracting key frames first with Extract Video Frame and processing them individually with Image to Text.
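When splitting a long video into frames, one simple approach is to sample evenly spaced moments across its duration. The sketch below is an assumption about how you might choose those moments, not a description of how Extract Video Frame selects frames. `sample_timestamps` is a hypothetical helper, and how the resulting timestamps are passed to the node is not covered here.

```python
def sample_timestamps(duration_seconds: float, count: int) -> list[float]:
    """Return `count` timestamps at the midpoints of equal-length segments.

    Using midpoints avoids the very first and last instants of the video,
    which are often black frames or title cards.
    """
    if duration_seconds <= 0 or count <= 0:
        return []
    segment = duration_seconds / count
    return [round((i + 0.5) * segment, 2) for i in range(count)]


# Four sample points across a 60-second clip.
print(sample_timestamps(60, 4))
```

Each sampled frame can then be described on its own with Image to Text, and the individual descriptions combined in a later step.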