Video to Text

The Video to Text node uses a vision-capable LLM to analyze a video and produce a text description, transcription, or extracted information based on your prompt.

Figure: Video to Text node on the canvas, with a video media input and an editable analysis prompt.

What does the Video to Text node do?

The Video to Text node sends a video to a vision-capable LLM (such as GPT-4o or Gemini) along with a custom prompt, and returns a text response. It can describe scenes, transcribe spoken content, summarize a recording, or extract structured information from the visuals.

Common use cases:

  • Generating detailed video descriptions for cataloging or accessibility.
  • Transcribing and summarizing webinars, interviews, or meeting recordings.
  • Extracting structured information (products shown, on-screen text, timestamps) from a video for downstream nodes.
  • Tagging or moderating video content based on visual scenes.

Quick setup

Follow these steps to add and configure the Video to Text node in your workflow:

Add the node to the canvas

Open the Node Library, go to AI > Video, then drag and drop the Video to Text node onto your workspace.

Connect the video source

Connect the input_media port (on the left of the node) to a node that produces a video — for example a Static Video, a Google Drive reader, or any node returning a video file or URL. The input also accepts images, text, strings, URLs, or arrays of these.

Pick a vision-capable model

In the settings, select the LLM Provider (e.g. OpenAI, Google) and then a Model that supports video input (e.g. GPT-4o, Gemini Pro). Only models compatible with this node are listed.

Write the prompt

In the prompt field, describe what the model should produce. You can inject values from previous nodes with the {{variable}} syntax. The prompt field is required.

Connect the output

Connect the output port (on the right) to the next node. Define the receiving variable name in that next node to use the generated text.

Configuration parameters

Figure: Video to Text settings panel listing the provider, vision model, framing, and transcription options.

The node configuration combines an input port for the video, the model selection, and a free-form prompt that drives the analysis.

Required fields

Name string required default: Video to Text

Node name — Short identifier for this node in the canvas (e.g. “Describe demo video”). Useful for debugging and reading workflow logs.

Description string required default: Extract text descriptions from videos using AI.

Node description — A short phrase describing the role of this node in the workflow.

modelName llm required

Model — The LLM used for video analysis. Must be a vision/video-capable model (e.g. GPT-4o, Gemini Pro). Only compatible models are shown in the dropdown.

prompt string required

Instructions — Free-form instructions describing what the AI should extract, describe, or summarize from the video. Supports {{variable}} placeholders to inject values from upstream nodes. The node fails validation if this field is empty.
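
To make the {{variable}} substitution concrete, here is a minimal Python sketch of how placeholders could be filled in before the prompt reaches the model. The helper name render_prompt and the sample values are hypothetical, and the platform's actual template engine is not documented here, so treat this as an illustration of the idea rather than its implementation.

```python
import re

# Hypothetical helper: illustrates {{variable}} substitution, not the platform's real engine.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with the matching upstream value."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly instead of sending a half-filled prompt to the model.
            raise KeyError(f"No upstream value for placeholder '{name}'")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)


prompt = render_prompt(
    "Summarize this video for {{audience}} in {{language}}.",
    {"audience": "new customers", "language": "English"},
)
print(prompt)
```

Raising an error on an unknown placeholder is a design choice made for this sketch: it surfaces a missing upstream connection early instead of passing a literal {{name}} to the model.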

Optional fields

input_media media

Video input — The video to analyze. Accepts videos, URLs, images, text, strings, or arrays of these. Optional: you can also reference a media variable directly inside the prompt with {{my_video}}.

llmProvider string

LLM provider — Provider associated with the selected model (e.g. OpenAI, Google). Set automatically when you pick a model; you usually don’t edit it directly.

Tip

In version 2.0, the legacy “Video Files” and “URLs” inputs were merged into a single input_media port. Connect any video file node or URL node to that same input.

What does the node output?

The node returns a single string containing the LLM response generated from the video and the prompt.

How to use the output

In Draft & Goal you don’t need to look up a system-generated variable name. To use the result:

  1. Draw a connection from the Video to Text node’s output.
  2. Connect it to the next node’s input.
  3. In that next node, create and name your own variable (for example, video_summary). The generated text is injected into it automatically.

output string

The text generated by the LLM in response to your prompt and the input video.

{
  "output": "The video shows a 30-second product demo. A person unboxes a wireless keyboard, connects it via Bluetooth, and types a few sentences to demonstrate the key feel. The packaging is minimal with a white box and the brand logo visible at 0:05."
}
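
Inside a workflow the text is injected into your variable automatically, so no code is needed. If you ever handle a payload of this shape yourself, a minimal Python sketch for reading it could look like the following. The payload is a shortened copy of the sample above, and video_summary is only an example variable name.

```python
import json

# Shortened copy of the sample payload; only the "output" key comes from the node.
raw = '{"output": "The video shows a 30-second product demo."}'

result = json.loads(raw)
video_summary = result["output"]  # example variable name, pick your own

print(video_summary)
```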

Usage examples

Example 1: Describe a marketing video for a content catalog

Generate a rich description of a promotional video and rewrite it for a specific channel.

Workflow:

  1. Static Video — provides the video file.
  2. Video to Text — Prompt: Describe this video in detail, including the setting, on-screen actions, spoken dialogue, and any visible text or branding.
  3. LLM — rewrites the description for a target audience (e.g. social media caption, product page paragraph).

Example 2: Summarize a recorded presentation

Pull the key points out of a long meeting or webinar recording.

Workflow:

  1. Google Drive — selects the video file from Drive.
  2. Video to Text — Prompt: Summarize the main topics discussed in this video. For each topic, give a 1–2 sentence description and an approximate timestamp.
  3. Notion Database Writer — saves the summary to a Notion database for the team.

Common issues

The node returns an empty or generic response

Cause: The selected model does not actually support video input, or the file format is not recognized by the provider.

Solution: Pick a model explicitly listed as vision/video-capable (e.g. GPT-4o, Gemini Pro). Make sure the input file is in a common format (MP4, MOV, WebM). If you’re passing a URL, check that it is publicly reachable.
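
As a quick pre-flight check, the format advice above can be sketched in Python. The helper has_common_video_extension is hypothetical and only inspects the file name, not the actual container or codec, so a file can pass this check and still be rejected by a provider. It also does not test whether a URL is publicly reachable.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats named in the solution above; a given provider may accept more or fewer.
COMMON_VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm"}


def has_common_video_extension(path_or_url: str) -> bool:
    """Return True when the file name ends in .mp4, .mov, or .webm (case-insensitive)."""
    # urlparse separates out the query string and fragment, so signed URLs still work.
    path = urlparse(path_or_url).path
    return PurePosixPath(path).suffix.lower() in COMMON_VIDEO_EXTENSIONS


print(has_common_video_extension("https://example.com/media/demo.MP4?token=abc"))
print(has_common_video_extension("recording.avi"))
```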

The output misses key details or feels too shallow

Cause: The prompt is too vague, or the video is too long for the model to process in detail end-to-end.

Solution: Make the prompt more specific (timestamps, named entities, sections to focus on). For long videos, extract a few key frames first with Extract Video Frame and analyze them with Image to Text, then aggregate the results.
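
When choosing which key frames to extract from a long video, one simple option is to spread them evenly across the recording. The Python sketch below computes such timestamps. The helper sample_timestamps is hypothetical, and whether Extract Video Frame accepts timestamps in seconds is an assumption to verify against that node's documentation.

```python
def sample_timestamps(duration_s: float, count: int) -> list:
    """Return `count` timestamps (in seconds) at the midpoints of equal segments."""
    if duration_s <= 0 or count <= 0:
        raise ValueError("duration_s and count must be positive")
    segment = duration_s / count
    # Midpoints skip the very first and last frames, which may be blank or title cards.
    return [round(segment * (i + 0.5), 2) for i in range(count)]


# Four frames from a two-minute video.
print(sample_timestamps(120, 4))
```

Evenly spaced sampling is only a starting point. If you already know where the important moments are, name those timestamps directly.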

Validation error: 'requires instructions to be configured'

Cause: The prompt field is empty.

Solution: Always fill in the prompt — it is required even when the video clearly suggests what to do. State explicitly what kind of output you expect (description, transcription, list, JSON, etc.).

Best practices and pitfalls

Tip

Be explicit about the output shape. If a downstream node expects JSON, ask for JSON in the prompt (e.g. “Return a JSON object with keys summary, topics, timestamps”) and pair this node with JSON Path Extractor to consume it cleanly.
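
Even with an explicit prompt, a model may return malformed JSON or omit a key, so it can help to validate the text before relying on it. The Python sketch below checks for the three keys from the example prompt. The helper parse_model_json is hypothetical, and the handling of a Markdown code fence around the JSON is an assumption about how some models format their replies.

```python
import json

REQUIRED_KEYS = {"summary", "topics", "timestamps"}  # keys from the example prompt
FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_json(text: str) -> dict:
    """Parse the node's text output as JSON and check that the expected keys exist."""
    cleaned = text.strip()
    # Assumption: some models wrap JSON in a Markdown code fence; strip it if present.
    if cleaned.startswith(FENCE):
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data


reply = '{"summary": "Product demo", "topics": ["unboxing"], "timestamps": ["0:05"]}'
print(parse_model_json(reply)["summary"])
```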

Warning

Video analysis is significantly slower and more expensive than text analysis. Test on a short clip before running it over a large dataset, and prefer Extract Video Frame + Image to Text when you only need information from a specific moment.

How does it fit into a workflow?

Video to Text is typically the bridge between a video source and any text-based downstream processing.

graph LR
    Source[Static Video / Google Drive] --> V2T[Video to Text<br/>analyzes video]
    V2T --> Extractor[JSON Path Extractor]
    Extractor --> LLM[LLM<br/>final formatting]
    LLM --> Writer[Notion / Sheets Writer]