Text-to-Video Model LOCALLY Tutorial (Mochi-1)
Introduction
Text-to-video models have made significant strides recently, but many still find them elusive. While OpenAI has demonstrated projects like Sora and Meta has announced their text-to-video model, neither has been widely accessible yet. However, a few weeks ago, a company called Genmo AI released an open-source text-to-video model named Mochi-1. In this article, I'll guide you through the process of getting it up and running on your local computer, showcasing some examples of what you can create with it.
Getting Started with Mochi-1
Mochi-1 has quickly gained recognition as a state-of-the-art open-source video generation model. Stunning examples produced by the model include wine pouring into a glass, a person standing under street lights in the rain, and a lightning strike. The results are impressive and demonstrate what this tool is capable of.
Running Mochi-1 Locally
Many users assume that text-to-video models must be hosted in the cloud, but that's not true. I managed to run Mochi-1 locally using my Dell Precision Tower, equipped with two RTX A6000 GPUs (though I typically only used one). I also want to acknowledge Dell for partnering with me on this tutorial. You'll find links to the Dell computer and additional resources in the description below.
To run Mochi-1, we'll use Comfy UI. Its node-based interface may look intimidating at first, but getting it up and running is manageable. Here's how to install everything:
Download Comfy UI:
- Visit the Comfy UI GitHub page (links provided in the description).
- Scroll down to find the "Installing Comfy UI" link and click it.
- Use the direct download link and save the zip to a location of your choice, such as the desktop.
Unzip Comfy UI:
- Right-click and select 'extract all' to decompress the files.
- Open the folder and double-click on the "run Nvidia GPU" file. This assumes you have an Nvidia GPU.
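If you prefer the terminal to double-clicking, the launch step looks roughly like this. This is a minimal sketch that assumes a PowerShell window, that the zip was extracted to a folder named ComfyUI_windows_portable on your desktop (the exact folder name may differ depending on the release you downloaded), and that the "run Nvidia GPU" file is the portable build's run_nvidia_gpu.bat:

    # adjust the path to wherever you extracted the zip
    cd ~\Desktop\ComfyUI_windows_portable
    # launches the Comfy UI server; open the local URL it prints in your browser
    .\run_nvidia_gpu.bat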
Install Comfy UI Manager:
- Go to the Comfy UI Manager GitHub page (links in description).
- Click the green code button and copy the URL.
- Open your terminal and cd into the extracted Comfy UI folder.
- From there, move into the custom_nodes folder and run git clone followed by the URL you copied (example commands just after this list).
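Put together, the terminal part of the Manager install looks roughly like this. This is a sketch rather than an exact recipe: the folder layout assumes the portable build (where the custom nodes folder lives under ComfyUI\custom_nodes), and the angle-bracket placeholder stands for the URL you copied from the Comfy UI Manager GitHub page.

    # from inside the extracted Comfy UI folder
    cd ComfyUI\custom_nodes
    # paste the URL you copied from the green Code button
    git clone <ComfyUI-Manager-repository-URL>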
Load Mochi Wrapper:
- Again, visit the GitHub page for the Comfy UI Mochi wrapper.
- Copy the URL, then run git clone inside the custom_nodes folder exactly as before (again, see the example below).
- Restart Comfy UI by double-clicking the "run Nvidia GPU" file so the new wrapper nodes are loaded.
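As a rough sketch, assuming you are still inside the custom_nodes folder from the previous step and have copied the Mochi wrapper's URL from its GitHub page (the placeholder stands in for that URL):

    # paste the Mochi wrapper URL in place of the placeholder
    git clone <Mochi-wrapper-repository-URL>
    # then close Comfy UI and double-click run_nvidia_gpu.bat again
    # so the new nodes are picked up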
Install Additional Nodes:
- Open the Manager, search for "ComfyUI Video Helper Suite", and install it; it provides the nodes used to combine the generated frames into a video file.
- Also install "KJNodes" the same way.
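The Manager is the easiest route, but if you prefer the terminal, these two node packs can also be cloned into custom_nodes in the same way as the Manager and the Mochi wrapper. The placeholders below stand for the URLs of the Video Helper Suite and KJNodes repositories, which you can find on GitHub:

    cd ComfyUI\custom_nodes
    git clone <Video-Helper-Suite-repository-URL>
    git clone <KJNodes-repository-URL>
    # restart Comfy UI afterwards so the new nodes load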
Run Your First Video:
- Open one of the example workflow files bundled with the Mochi wrapper. A smaller example (such as the 49-frame one) is a good starting point.
- Modify the prompt to your liking, such as "nature video of a red panda eating bamboo in front of a waterfall".
- Click "Q prompt" to generate the video.
Once set up, you can experiment with prompts and settings. Adjust frame rates, video lengths, and other output formats according to your preferences.
Examples and Results
I tested a couple of prompts, including “panda eating bamboo” and “kid riding a bike.” The results varied: both prompts produced impressive footage, though some imperfections were evident because I was using a quantized version of the model.
The process might take a few minutes, especially when downloading the model for the first time, but it's a gratifying experience to see your prompts come to life. The capabilities of text-to-video generation on such accessible hardware are exciting, and I believe this is just the beginning.
Conclusion
I hope this tutorial empowers you to explore the world of local text-to-video generation with Mochi-1. Feel free to drop any questions in the comments or join our Discord for additional support. My thanks again to Dell and Nvidia for facilitating this exciting project. If you enjoyed this article, consider subscribing for more in-depth tutorials!
Keywords
- Text-to-video model
- Mochi-1
- Genmo AI
- Comfy UI
- Local installation
- Nvidia GPU
- Video generation
FAQ
Q: What is Mochi-1?
A: Mochi-1 is an open-source text-to-video model released by Genmo AI, known for its impressive video generation capabilities.
Q: Do I need cloud hosting to run text-to-video models?
A: No, you can run models like Mochi-1 locally on sufficiently powerful hardware, eliminating the need for cloud hosting.
Q: What system requirements do I need for running Comfy UI and Mochi-1?
A: A workstation with an Nvidia GPU is ideal, particularly models like RTX A6000, as they provide the necessary processing power.
Q: Can I adjust the video output settings in Comfy UI?
A: Yes, Comfy UI allows you to customize various settings such as frame rate, video length, resolution, and quantization.
Q: Where can I find support if I face issues while installing?
A: You can drop your questions in the comments section or join the Discord community for further assistance.