AI news
February 13, 2025

ByteDance's New Goku+ Video Model Could Unleash A New Wave of AI Influencers

The age of fake social media influencers is coming.

by Jim Clyde Monge

Would you agree if I said that the age of fake influencers is coming?

Well, the truth is, it’s actually here.

According to a study from Influencer Marketing Hub, 31.7% of brands think that virtual influencers have an advantage over human influencers because they have “more control over messaging.” A further 29.1% said 24/7 availability was the biggest advantage offered by AI influencers.

AI-powered platforms that let you create images of attractive female influencers and turn them into realistic videos are now highly accessible. Some even offer their services for free.

However, from my personal experience, AI-generated videos of people are still hit-and-miss in terms of realism. Most, if not all, video models still struggle with maintaining motion coherence.

Recently, the University of Hong Kong officially launched the Goku video generation model, developed in collaboration with ByteDance. This new model aims to produce some of the most realistic, TikTok-esque videos, which are perfect if you're aiming to create an AI influencer.

What is Goku?

Goku is a family of rectified-flow Transformer models for joint image and video generation. It aims for industry-grade performance, integrating meticulous data curation, careful model design, and an advanced flow formulation to achieve high-quality visual generation.

Goku supports multiple generation tasks:

  • 🎬 Text-to-Video Generation
  • 🖼️ Image-to-Video Generation
  • 🎨 Text-to-Image Generation

Goku has a variant called Goku+, which lets you directly create virtual digital-human videos. With Goku+, you can turn text into extremely realistic human videos that, according to the authors, outperform current methods.

Screenshot from Goku’s video examples

It even produces videos longer than 20 seconds, with steady hand movements and very expressive facial and body actions.

How The Image-To-Video Works

Goku's image-to-video (I2V) generation follows a widely adopted strategy: the first frame of each video clip serves as a reference image. Here's a breakdown of the process:

  1. Reference Image as a Condition: The initial image is used as an additional condition for generating the video.
  2. Token Concatenation: The image tokens corresponding to the reference image are broadcasted and then concatenated with the noised video tokens along the channel dimension.
  3. Preserved Pre-trained Knowledge: To leverage pre-existing knowledge, a single MLP (Multi-Layer Perceptron) layer is introduced for channel alignment. The rest of the model architecture remains the same as the Goku-T2V (text-to-video) model.
  4. Fine-tuning: The Goku-I2V model is fine-tuned using approximately 4.5 million text-image-video triplets from various domains to ensure generalization. Despite using only 10,000 fine-tuning steps, the model can animate the reference image and maintain alignment with the accompanying text.
  5. Results: The generated videos display high visual quality and temporal coherence, capturing the semantic details described in the text.
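Steps 1–3 above can be sketched in a few lines of numpy. This is a minimal illustration of the tensor manipulation, not the actual Goku code: the token shapes (T frames, N spatial tokens, C channels) and the random initialization are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token shapes: T video frames, N spatial tokens per frame, C channels.
T, N, C = 16, 64, 128

# Step 1: tokens from the reference image act as the conditioning signal.
image_tokens = rng.standard_normal((1, N, C))

# Noised video tokens for all T frames.
video_tokens = rng.standard_normal((T, N, C))

# Step 2: broadcast the reference tokens across every frame, then
# concatenate with the noised video tokens along the channel dimension.
broadcast = np.broadcast_to(image_tokens, (T, N, C))
conditioned = np.concatenate([broadcast, video_tokens], axis=-1)  # (T, N, 2C)

# Step 3: a single MLP layer (here, one linear projection) maps 2C channels
# back down to C, so the rest of the pre-trained T2V backbone stays unchanged.
W = rng.standard_normal((2 * C, C)) * 0.02
aligned = conditioned @ W  # (T, N, C)

print(aligned.shape)  # (16, 64, 128)
```

The key design choice is that only this small alignment layer is new; everything downstream reuses the Goku-T2V weights, which is why the fine-tuning stage (step 4) converges in relatively few steps.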

Take a look at these examples:

Examples from Goku’s whitepaper

The reference images are shown on the leftmost columns. Keywords are highlighted in red text.

To ensure that Goku produces high-quality videos, the model is trained on a dataset that is visually appealing, contextually relevant, and diverse.

The data curation pipeline consists of five main stages:

  1. Image and video collection
  2. Video extraction and clipping
  3. Image and video filtering
  4. Captioning
  5. Data distribution balancing
Goku video model data curation pipeline

This pipeline ensures that the video clips used for training are of high visual quality. Aesthetic models score each clip's keyframes, and only photorealistic, visually rich clips are retained.

For instance, clips at around 480 x 864 resolution are discarded if their aesthetic score falls below 4.3, while for resolutions exceeding 720 x 1280, the threshold is raised to 4.56.
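A rough sketch of that resolution-dependent filter might look like the snippet below. The two thresholds (4.3 and 4.56) come from the paper; the clip records and field names are invented for illustration.

```python
def passes_aesthetic_filter(width: int, height: int, score: float) -> bool:
    """Keep a clip only if its aesthetic score meets the resolution-dependent bar."""
    if width * height > 720 * 1280:
        return score >= 4.56   # stricter bar for clips exceeding 720 x 1280
    return score >= 4.3        # baseline bar for lower-resolution clips

# Hypothetical clip records with pre-computed aesthetic scores.
clips = [
    {"w": 480,  "h": 864,  "score": 4.5},   # kept
    {"w": 480,  "h": 864,  "score": 4.1},   # discarded (below 4.3)
    {"w": 1080, "h": 1920, "score": 4.4},   # discarded (below 4.56)
    {"w": 1080, "h": 1920, "score": 4.8},   # kept
]

kept = [c for c in clips if passes_aesthetic_filter(c["w"], c["h"], c["score"])]
print(len(kept))  # 2
```

Raising the bar for higher resolutions makes sense: large frames expose more detail, so only the most visually polished high-resolution clips are worth their training cost.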

You can learn more about the technical details in this whitepaper.

Example Use Cases

The examples below showcase the capability of Goku+ to generate hyper-realistic videos of products and AI influencers:

  • Example #1: Advertising Scenario
  • Example #2: Product and Human Interaction
  • Example #3: Turn Product Image To Video Clip
  • Example #4: Text-To-Video

Example #1: Advertising Scenario

In the example below, Goku+ demonstrates its capability to generate videos ideal for advertising self-care products. The visual style closely mirrors the dynamic, fast-paced aesthetics found on popular platforms like TikTok.

Goku video model examples

Did they actually scrape millions of TikTok videos and use them as training data? If so, did they even get permission from the uploaders?

Example #2: Product and Human Interaction

In this example, the model does a great job of creating videos where a person interacts naturally with a product. The videos feel like a friendly explainer or a casual demo where everything just flows effortlessly.

Think about using an AI influencer to do live selling for you. How cool would it be to have someone who never gets tired of talking and answering questions?

Example #3: Turn Product Image To Video Clip

This feature is probably one of the most practical: turning a static product image into a lively video clip. Instead of setting up a full video shoot, you just take one image and let Goku+ bring it to life with subtle movements and engaging details.

Goku video model examples

It’s a huge time-saver, especially for online sellers who need dynamic content fast. However, whether Goku+ can maintain this level of consistency with the reference image remains to be seen.

How To Access Goku?

Right now, Goku exists only as a research paper, and no publicly accessible demo has been released. I highly recommend keeping an eye on their GitHub and Hugging Face pages to stay up to date with future releases.

Final Thoughts

Goku+ is seriously impressive on paper—those example videos look fantastic. But these are cherry-picked highlights, and we won’t know the real deal until the public demo drops. Once we see it in action across all types of content, we’ll really get a sense of whether it can deliver consistent, high-quality performance.

Another big question on my mind is how the training data was actually gathered. Did they really scrape TikTok videos for this? If that’s the case, it raises some valid concerns about privacy and permissions. And then there’s the matter of ByteDance’s involvement—what's in it for them?

The possibility that these AI influencers could eventually be integrated into TikTok is pretty wild, and it opens up a whole new debate about the future of digital content and influencer marketing.

What’s your take on AI influencers? I’d love to know your thoughts in the comments section.

References:

[1] Saiyan-World, “Goku,” GitHub Repository. Available: https://github.com/Saiyan-World/goku?tab=readme-ov-file.

[2] “arXiv:2502.04896,” arXiv Preprint. Available: https://arxiv.org/abs/2502.04896.

