Video to Text
Extract text descriptions from videos using AI
What does this node do?
The Video to Text node uses AI models to analyze video content and produce text descriptions, summaries, or structured data. It can describe what happens in a video, transcribe spoken content, or extract specific information based on your prompt.
Common uses:
- Generate detailed video descriptions for cataloging or accessibility
- Transcribe and summarize video content
- Extract specific information from videos for downstream processing
- Analyze video scenes for content moderation or tagging
Quick setup
Add the Video to Text node
Find it in AI Nodes → Video to Text
Provide the video
Connect a video source via the input_media port, which accepts videos, URLs, images, text, or arrays of these types
Choose your model
Select the LLM model and provider to use for analysis
Write your prompt
Tell the AI what to extract, describe, or summarize from the video
Configuration
Input
input_media media The media to analyze. Accepts videos, URLs, images, text, or arrays of these types. This input is optional — you can also reference media via variables in the prompt.
Required fields
modelName string required The LLM model to use for video analysis. Choose a model that supports vision/video input (e.g., GPT-4o, Gemini Pro).
Optional fields
llmProvider string The LLM provider to use (e.g., OpenAI, Google). Automatically set based on the selected model.
prompt string Custom instructions for the AI describing what to extract or generate from the video. Supports {{variables}} to inject dynamic values from other nodes.
Examples:
- “Describe everything that happens in this video”
- “Transcribe the spoken dialogue in this video”
- “List all products shown in this video with timestamps”
- “Summarize the key points discussed in this presentation”
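The {{variables}} substitution described above can be sketched as a simple templating pass. This is an illustrative sketch, not the platform's implementation; the variable names and values are hypothetical:

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from `variables`.

    Unknown placeholders are left intact so a missing upstream value
    is easy to spot in the rendered prompt.
    """
    def substitute(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# Hypothetical values injected from upstream nodes:
prompt = render_prompt(
    "Summarize {{video_title}} in {{max_words}} words or fewer.",
    {"video_title": "the onboarding demo", "max_words": 50},
)
print(prompt)  # Summarize the onboarding demo in 50 words or fewer.
```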
Output
The node outputs a single string containing the AI-generated text, delivered as the output field of a JSON object:
{
"output": "The video shows a 30-second product demo. A person unboxes a wireless keyboard, connects it via Bluetooth, and types a few sentences to demonstrate the key feel. The packaging is minimal with a white box and the brand logo visible at 0:05."
}
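Downstream code can consume the result like any JSON payload. A minimal sketch, assuming the node's result arrives as a JSON object shaped like the example above:

```python
import json

# The node's result as a JSON payload (structure taken from the example above;
# the description text here is shortened for illustration).
payload = '{"output": "The video shows a 30-second product demo."}'

description = json.loads(payload)["output"]
print(description)  # The video shows a 30-second product demo.
```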
Version: 2.0
Examples
Generate a video description
Describe a marketing video for content cataloging:
Workflow:
- Static Video — Provide the video file
- Video to Text — Prompt: “Describe this video in detail, including the setting, actions, and any visible text or branding”
- LLM — Rewrite the description for a specific audience or format
Summarize a presentation recording
Extract key points from a recorded meeting or presentation:
Workflow:
- Google Drive Reader — Select the video from Drive
- Video to Text — Prompt: “Summarize the main topics discussed in this video. List each topic with a brief description”
- Notion Database Writer — Save the summary to Notion for team reference
Best practices
- Choose the right model. Video analysis requires models with vision capabilities. Not all LLMs support video input — check model documentation.
- Be specific in your prompt. The more precise your instructions, the better the output. Instead of “describe the video”, ask for exactly what you need (timestamps, people, actions, text).
- Consider video length. Longer videos take more time and tokens to process. If you only need a specific section, use Extract Video Frame first.
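To gauge the cost of a long video before sending it, you can estimate how many frames an even-sampling pass would produce and the rough token budget they imply. The sampling rate and per-frame token figure below are illustrative assumptions, not platform constants:

```python
def sampled_frame_count(duration_s: float, frames_per_second: float = 0.5) -> int:
    """Number of frames an even-sampling pass would extract."""
    return max(1, int(duration_s * frames_per_second))

def estimated_frame_tokens(duration_s: float,
                           frames_per_second: float = 0.5,
                           tokens_per_frame: int = 255) -> int:
    """Rough token budget for the sampled frames (assumed per-frame cost)."""
    return sampled_frame_count(duration_s, frames_per_second) * tokens_per_frame

# A 10-minute recording sampled at one frame every 2 seconds:
print(sampled_frame_count(600))     # 300
print(estimated_frame_tokens(600))  # 76500
```

Numbers like these make it obvious when a recording is worth trimming to the relevant section before analysis.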
Common issues
The node returns an empty or generic response
Cause: The selected model may not support video input, or the video format is not recognized.
Solution: Verify the model supports vision/video analysis. Try a different model (e.g., GPT-4o or Gemini Pro). Ensure the video file is in a supported format (MP4, MOV, WebM).
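A quick pre-flight check on the filename can catch the unsupported-format case before the node runs. A minimal sketch, with the supported list taken from the troubleshooting note above:

```python
from pathlib import Path

# Supported container formats, per the troubleshooting note above.
SUPPORTED_FORMATS = {".mp4", ".mov", ".webm"}

def is_supported_video(filename: str) -> bool:
    """True when the file extension matches a supported container format."""
    return Path(filename).suffix.lower() in SUPPORTED_FORMATS

print(is_supported_video("demo.MP4"))  # True
print(is_supported_video("demo.avi"))  # False
```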
The output is inaccurate or misses key details
Cause: The prompt is too vague, or the video is too long for the model to process entirely.
Solution: Refine your prompt to be more specific about what to extract. For long videos, consider extracting key frames first with Extract Video Frame and processing them individually with Image to Text.