I’ve been eyeing up Nvidia’s latest release in the generative AI world. The Cosmos models for text-to-video and image-to-video generation look pretty promising from the samples I’ve seen so far.
Both text-to-video and image-to-video models work natively in ComfyUI and I’m excited to test them out.
Here’s a little more info on them, and then I’ll share my early results from testing them out below:
– There are 7B and 14B versions of these models. The 7B version is the one I think most of us will want to focus on given typical consumer hardware.
– These models are non-distilled, which sets them apart from most other recent text-to-video and image-to-video releases.
– The VAE in this release is remarkably efficient: it can encode and decode on a 12GB VRAM card without using any tiling in the process (see the quick latent-size calculation after this list).
– There’s also a new sampler called res_multistep that comes with this release. Apparently it can also be used with other models like Hunyuan, which is widely considered the current best open-source video model.
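To get a feel for why the VAE can stay within 12GB, here’s a back-of-envelope latent-size calculation. I’m assuming the “cv8x8x8” in the VAE filename denotes 8x temporal and 8x8 spatial compression; that’s my reading of the name, not something Nvidia spells out.

```python
# Back-of-envelope latent dimensions for a 121-frame, 1280 x 704 clip,
# assuming "cv8x8x8" means 8x temporal and 8x8 spatial compression
# (my interpretation of the filename, not confirmed by Nvidia).
frames, height, width = 121, 704, 1280

latent_frames = (frames - 1) // 8 + 1  # 16 latent frames
latent_height = height // 8            # 88
latent_width = width // 8              # 160

print(latent_frames, latent_height, latent_width)  # 16 88 160
```

Per channel, that works out to roughly a 480x reduction in the number of values compared to raw pixels, which would go a long way toward explaining the modest VRAM needs.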
How to use the Cosmos models in ComfyUI
1. Install ComfyUI
2. Download the needed models and place them in the appropriate folders as follows (there’s a quick file-check sketch after the steps):
oldt5_xxl_fp8_e4m3fn_scaled.safetensors should be placed in: ComfyUI/models/text_encoders/
cosmos_cv8x8x8_1.0.safetensors should be placed in: ComfyUI/models/vae/
and for text-to-video:
Cosmos-1_0-Diffusion-7B-Text2World.safetensors should be placed in: ComfyUI/models/diffusion_models/
and for image-to-video or video-to-video:
Cosmos-1_0-Diffusion-7B-Video2World.safetensors should be placed in: ComfyUI/models/diffusion_models/
3. Restart ComfyUI if it is running.
4. Drag and drop one of the following workflows into your ComfyUI:
Example text-to-video workflow JSON file
Example image-to-video workflow JSON file
5. Re-select the models under the “Load Diffusion Model”, “Load CLIP”, and “Load VAE” nodes so they point to the correct files on your drive.
6. Adjust the positive and negative text prompts to your liking.
With the Nvidia Cosmos models, the longer and more detailed the prompt, the better it seems to work (there’s a hypothetical example prompt after the steps).
7. Adjust the KSampler settings if needed. Note that 704 x 704 is the lowest resolution the model can handle, and a frame count other than 121 tends to produce poor results, for whatever reason (there’s a quick settings check after the steps).
8. Click “Generate” and wait for your results.
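Before moving on: if you want to double-check that the files from step 2 landed in the right places, here’s a minimal Python sketch. It assumes a default ComfyUI folder layout sitting in the current working directory; adjust COMFYUI_ROOT if yours lives elsewhere.

```python
# Quick sanity check that the downloaded models ended up in the right
# ComfyUI folders (assumes a default install in the current directory).
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")  # adjust to match your install location

expected = {
    "text_encoders": ["oldt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "vae": ["cosmos_cv8x8x8_1.0.safetensors"],
    "diffusion_models": [
        "Cosmos-1_0-Diffusion-7B-Text2World.safetensors",
        "Cosmos-1_0-Diffusion-7B-Video2World.safetensors",
    ],
}

for folder, filenames in expected.items():
    for name in filenames:
        path = COMFYUI_ROOT / "models" / folder / name
        print(f"{'OK     ' if path.exists() else 'MISSING'} {path}")
```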
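As for step 6, here’s a rough idea of the level of detail that seems to help. The wording below is entirely my own invention, not an official Nvidia example.

```python
# Hypothetical example of a long, detailed Cosmos prompt (my own wording).
positive_prompt = (
    "A slow cinematic dolly shot moving through a rain-soaked neon-lit "
    "alley at night, reflections shimmering on the wet asphalt, steam "
    "rising from a street-food stall, shallow depth of field, "
    "photorealistic, smooth and steady camera motion"
)
negative_prompt = "blurry, low quality, distorted, flickering, watermark"
```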
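And for the constraints in step 7, here’s a tiny helper you could use to sanity-check your settings before kicking off an hour-long generation. The function name and checks are just my own sketch of the constraints noted above.

```python
# Warn about settings that the Cosmos models reportedly handle poorly.
def check_cosmos_settings(width: int, height: int, frames: int) -> None:
    if width < 704 or height < 704:
        print("Warning: 704 x 704 is the lowest resolution the model can handle.")
    if frames != 121:
        print("Warning: frame counts other than 121 tend to produce poor results.")

check_cosmos_settings(width=1280, height=704, frames=121)  # no warnings
```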
Here are a few early test results that I got with Nvidia Cosmos
My first impression of the Cosmos video models is that they are slow to generate. On my first run I used the 7B text-to-video model and generated a 5-second 1280 x 704 video at 24fps, and it took 59 minutes total on my RTX 3060 with 12GB VRAM.
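To put that in per-frame terms, a rough back-of-envelope calculation (assuming the 121-frame clip length mentioned above):

```python
# Rough per-frame cost of that first run on an RTX 3060 (12GB VRAM).
total_seconds = 59 * 60  # 59 minutes
frames = 121             # the clip length the model prefers
print(f"~{total_seconds / frames:.0f} seconds per frame")  # ~29 seconds per frame
```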
Here is the output of that generation:
A masterpiece. Just kidding… it put out decent image quality, but the movement is clearly pretty terrible.
Now, how about the image-to-video model, a.k.a. the “Video2World” model?
I used a DALL-E 3 image I had hanging around and gave it a shot. Here’s what I got from my first generation:
This generation looks a bit better. It matched the input image fairly well, and the movement was a little more interesting, though it didn’t create anything overly impressive. It took 60 minutes to generate a clip of the same length as before.
First impressions of Nvidia Cosmos video models
Judging by these test generations, it’s still clear that the open-source video models are not even close to matching the closed, paid models such as Kling.
Kling 1.6 in professional mode, especially, is on an entirely different level from the output above. And that isn’t even factoring in how long Cosmos takes: spending an hour to generate a 5-second clip of this quality just isn’t practical.
Here’s hoping the open-source video space will be blessed with a clear winner of a model in the near future, because in its current state, Nvidia Cosmos just isn’t it.
I leave you with a couple of recent generations I did using Kling 1.6 in professional mode, for comparison’s sake.