DeepSeek's R1 model has been making headlines globally for the past few days. It's an open-source and affordable alternative to OpenAI's o1 model. Yet, even before the buzz around R1 has settled, the Chinese startup has unveiled another open-source AI image model called Janus-Pro.
DeepSeek says Janus-Pro 7B outperforms OpenAI's Dall-E 3 and Stable Diffusion in several benchmarks. But is it really that good? Does it live up to the claims, or is this just another model riding the AI hype?
Let's find out.
In simple terms, Janus-Pro is a powerful AI model that can understand images and text and can also create images from text descriptions.
Janus-Pro is an enhanced version of the Janus model, designed for unified multimodal understanding and generation. It builds on Janus with an optimized training strategy, expanded training data, and larger model sizes. It also delivers more stable outputs for short prompts, with improved visual quality, richer details, and the ability to generate simple text.
Take a look at some examples below:
Prompt: The face of a beautiful girl
The newer model is also better at rendering text.
Prompt: A clear image of a blackboard with a clean, dark green surface and the word "Hello" written precisely and legibly in the center with bold, white chalk letters.
The Janus-Pro series includes two model sizes, 1 billion and 7 billion parameters, demonstrating the scalability of the visual encoding and decoding method. Both models generate images at a resolution of 384 × 384.
The model is released under a permissive license that allows both academic and commercial use.
Janus-Pro uses separate visual encoding methods for multimodal understanding and visual generation tasks. This design aims to mitigate conflicts between these two tasks and improve overall performance.
For multimodal understanding, Janus-Pro uses the SigLIP encoder to extract high-dimensional semantic features from images, which are then mapped to the LLM's input space via an understanding adaptor.
For visual generation, the model uses a VQ tokenizer to convert images into discrete IDs, which are then mapped to the LLM's input space via a generation adaptor.
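To make the decoupled design a bit more concrete, here's a rough sketch of the data flow. The class names, method names, and dimensions below are purely illustrative (they are not the actual modules in the Janus codebase); the point is just to show how the two visual paths feed the same LLM through separate adaptors:

```python
import torch.nn as nn

# A purely illustrative sketch of the decoupled design (class names, dimensions,
# and method names are hypothetical, not the actual modules in the Janus codebase).
class DecoupledVisualFrontend(nn.Module):
    def __init__(self, siglip_encoder, vq_tokenizer, siglip_dim, codebook_size, llm_dim):
        super().__init__()
        self.siglip_encoder = siglip_encoder      # understanding path: continuous semantic features
        self.vq_tokenizer = vq_tokenizer          # generation path: discrete image token IDs
        self.understanding_adaptor = nn.Linear(siglip_dim, llm_dim)     # SigLIP features -> LLM input space
        self.generation_adaptor = nn.Embedding(codebook_size, llm_dim)  # VQ code IDs -> LLM input space

    def embed_for_understanding(self, image):
        # Used when the LLM reasons about an image (captioning, VQA, meme explanation)
        feats = self.siglip_encoder(image)         # (batch, n_patches, siglip_dim)
        return self.understanding_adaptor(feats)   # (batch, n_patches, llm_dim)

    def embed_for_generation(self, image):
        # Used on the generation side: the image becomes a sequence of discrete IDs
        # that the LLM predicts autoregressively, then embeds for the next step
        ids = self.vq_tokenizer.encode(image)      # (batch, seq_len) integer codebook IDs
        return self.generation_adaptor(ids)        # (batch, seq_len, llm_dim)
```

Because each path has its own encoder and adaptor, the features used to understand an image don't have to be the same ones used to reconstruct it, which is the conflict this design tries to avoid.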
In text-to-image instruction following, Janus-Pro-7B scores 0.80 on the GenEval benchmark, outperforming other models such as OpenAI's Dall-E 3 and Stability AI's Stable Diffusion 3 Medium.
Additionally, Janus-Pro-7B achieves a score of 84.19 on DPG-Bench, surpassing all other methods and demonstrating its ability to follow dense instructions for text-to-image generation.
According to DeepSeek's own benchmarks, both Dall-E 3 and Stable Diffusion 3 Medium score lower on GenEval and DPG-Bench.
But I take these numbers with a grain of salt because of how the sample images look. The best way to verify the claims is to run my own tests. Let's take a look at some examples below:
Prompt: A photo of a herd of red sheep on a green field.
Prompt: A beautiful 35 year old woman of average build wearing a pink tulle dress sits on the ground in front of the Eiffel Tower. Soft light illuminates her face as she poses for a photo with Paris in the background in Chanel style. Her shoulder length brown hair is styled in loose waves that fall to one side.
Prompt: An image of a little boy holding a white board with the text "AI is awesome!"
Based on the examples above, Dall-E 3 clearly performs better than Janus-Pro. The faces and body proportions in Janus-Pro's outputs are noticeably off, and the text rendering examples suggest it struggles in that area as well.
That said, it's possible I'm missing something: there might be specific parameters or fine-tuning required to improve the results. However, with the default settings, the outputs feel underwhelming.
By the way, if you're simply looking for the best AI image generator, I highly recommend using Flux Pro 1.1 Ultra in Flux Labs AI. It's the same tool I used for the cover image of this article.
The Flux image model from Black Forest Labs is hands down the best out there. It's open-weight, so you can fine-tune it with your own custom images.
DeepSeek released Janus-Pro to the public on HuggingFace to support a broader and more diverse range of research within both academic and commercial communities.
Note that the Janus-Pro model with 7 billion parameters takes up almost 15 GB of disk space.
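If you'd rather grab the weights ahead of time, the huggingface_hub client makes that a one-liner. Here's a minimal sketch (the repo ID is the public one on HuggingFace; keep the storage note above in mind):

```python
from huggingface_hub import snapshot_download

# Pre-download the Janus-Pro-7B weights (roughly 15 GB) into the local HF cache.
# Pass local_dir="..." if you prefer an explicit folder over the default cache.
local_path = snapshot_download(repo_id="deepseek-ai/Janus-Pro-7B")
print("Model downloaded to:", local_path)
```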
If you don't want to run the model on your own hardware, you can simply run the Gradio demo on HuggingFace.
It's pretty easy to use. Simply describe the image you'd like to generate and click the "Generate image" button. The speed of generation depends on the number of users concurrently using the app.
You can also try the multimodal understanding feature by uploading an image and asking the AI to explain it. Here's an example:
Prompt: Explain this meme
This image is a humorous comparison between two types of visual encoding methods, using the popular "buff Doge vs. Cheems" meme format.

1. **Decoupling Visual Encoding (Buff Doge)**:
   - The left side of the image shows a muscular, buff Doge, representing a strong and robust method of visual encoding. This could symbolize a more powerful or efficient approach to encoding visual information, such as deep learning models or advanced neural networks that can handle complex visual tasks.
2. **Single Visual Encoder (Cheems)**:
   - The right side of the image shows a smaller, less muscular Doge sitting down, representing a simpler or less powerful method of visual encoding. This could symbolize a traditional or less advanced approach, such as a single neural network or a basic encoding method that may not perform as well as more complex systems.

### Overall Meaning:
The meme humorously contrasts the strength and efficiency of two different methods of visual encoding. The buff Doge represents a more advanced, powerful, and robust method, while the Cheems represents a simpler, less powerful one. This could be interpreted as a commentary on the evolution of visual encoding techniques, with deep learning and neural networks being seen as more advanced and effective compared to traditional methods.
It's accurate, and I see it as a great tool for autocaptioning or adding alt texts to photos.
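If you want to script this understanding step yourself instead of going through the demo, the snippet below is a sketch adapted from the image-understanding example in the Janus repository. Exact helper names (like load_pil_images) may vary between versions, so treat it as a starting point rather than gospel:

```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
).to(torch.bfloat16).cuda().eval()

# Ask the model to explain a local image (e.g. the meme above, saved as meme.png).
conversation = [
    {
        "role": "<|User|>",
        "content": "<image_placeholder>\nExplain this meme",
        "images": ["meme.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# Load the referenced images and batch everything for the model.
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# Encode the image and merge it into the text embeddings.
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# Generate the answer with the language model.
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
```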
For developers, you can also download the model and run the full text-to-image pipeline locally. Here's an example inference code snippet for generating an image from text:
```python
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# a single-turn conversation holding the text prompt
conversation = [
    {
        "role": "<|User|>",
        "content": "A stunning princess from kabul in red, white traditional clothing, blue eyes, brown hair",
    },
    {"role": "<|Assistant|>", "content": ""},
]

# wrap the conversation in the chat template and append the image-start tag
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag


@torch.inference_mode()
def generate(
    mmgpt: MultiModalityCausalLM,
    vl_chat_processor: VLChatProcessor,
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 16,           # number of images generated per prompt
    cfg_weight: float = 5,             # classifier-free guidance scale
    image_token_num_per_image: int = 576,
    img_size: int = 384,
    patch_size: int = 16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # duplicate every sample: even rows keep the prompt (conditional branch),
    # odd rows are padded out to act as the unconditional branch for CFG
    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    # autoregressively sample the discrete image tokens one at a time
    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None,
        )
        hidden_states = outputs.last_hidden_state

        # logits over the VQ codebook, combined with classifier-free guidance
        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)

        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # feed the sampled token back in for both the conditional and unconditional rows
        next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(dim=1)

    # decode the discrete tokens back into pixels with the VQ decoder
    dec = mmgpt.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size // patch_size, img_size // patch_size],
    )
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)  # map [-1, 1] -> [0, 255]

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    # save every image in the batch
    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
        PIL.Image.fromarray(visual_img[i]).save(save_path)


generate(
    vl_gpt,
    vl_chat_processor,
    prompt,
)
```
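And if you'd like a local version of the Gradio demo mentioned earlier, here's a minimal sketch that wraps the generate() function above. It assumes vl_gpt, vl_chat_processor, and generate() are already defined in the same session, and it simply reloads the images that generate() writes to the generated_samples folder:

```python
import glob
import gradio as gr
import PIL.Image

def text_to_images(user_prompt: str):
    # Rebuild the chat-formatted prompt for the new user input, then reuse generate() above.
    conversation = [
        {"role": "<|User|>", "content": user_prompt},
        {"role": "<|Assistant|>", "content": ""},
    ]
    sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
        conversations=conversation,
        sft_format=vl_chat_processor.sft_format,
        system_prompt="",
    )
    generate(vl_gpt, vl_chat_processor, sft_format + vl_chat_processor.image_start_tag)
    # generate() saves its batch to generated_samples/; load the files back for display.
    return [PIL.Image.open(p) for p in sorted(glob.glob("generated_samples/img_*.jpg"))]

demo = gr.Interface(
    fn=text_to_images,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Gallery(label="Generated images"),
    title="Janus-Pro-7B text-to-image (local)",
)
demo.launch()
```

Keep in mind that with the default parallel_size of 16, each click generates a full batch of images, so expect it to take a while on a single GPU.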
I understand the hype around this new image model. People claim that it's a good alternative to Dall-E 3, but I don't agree. I've tried Janus-Pro myself, and the quality of the images isn't as impressive as I expected.
One key limitation is the restricted input resolution of 384 × 384. Additionally, the relatively low resolution for text-to-image generation, combined with reconstruction losses from the vision tokenizer, can result in images that lack the level of detail many users might expect.
That said, the rapid emergence of open-source models like Janus-Pro signals that DeepSeek is already positioning itself as a formidable disruptor in the AI race. Despite the current quality limitations, their push for accessible, open innovation is no doubt leaving industry leaders scrambling to adapt.