Tutorials
March 26, 2024

Constructing AI Dreamscapes


by Toni Ramchandani

Imagine you’re an artist, but instead of paint, your medium is the fabric of reality itself. With a few strokes of your pen — or a couple of sentences — you sketch out a world, a playground of imagination where every element bends to your creative will. Now, envision an AI that takes this sketch, this snippet of text, and breathes life into it, crafting an interactive realm that you can step into and explore. This isn’t the plot of a sci-fi novel; it’s the reality brought forth by Genie, a groundbreaking AI developed by Google DeepMind.

Source — Google DeepMind

Genie is not just a program; it’s a wish-granting entity for the digital age, turning text, drawings, and photos into dynamic, explorable worlds. What once required the concerted effort of teams of programmers and artists, Genie conjures with the ease of a magician pulling rabbits out of a hat. But Genie’s sorcery is rooted in zeros and ones: it is an 11-billion-parameter model trained, without supervision, on the vast wilderness of internet videos.

As you stand at the precipice of this brave new world, ask yourself: What kind of universe will you create? Will it be a serene landscape where the grass whispers secrets to the wind, or a bustling metropolis where the architecture pulses with untold stories? With Genie, the power of creation is at your fingertips, offering a glimpse into the future of interactive AI where the only limit is your imagination.

Introduction

As the dawn of generative AI casts its light upon the technological landscape, we witness the emergence of models that breathe life into the very essence of creativity and novelty. Powered by the revolutionary transformer architectures and bolstered by the relentless march of hardware capabilities, we now stand at the precipice of a new digital horizon. The convergence of scaled models and expansive datasets has endowed machines with the ability to craft coherent, conversational language, and fabricate images that captivate our aesthetic senses from simple text prompts. These developments, once the domain of speculative fiction, have rapidly transitioned into the tactile realm of human-machine collaboration.

Yet, within this burgeoning realm, there lies a vast gulf — a space between the interactivity and engagement offered by the likes of language models such as ChatGPT and the immersive experiences we yearn for. The question then arises: What if the vast ocean of internet videos could serve as a crucible for not only generating novel images or sequences but entire interactive experiences?

This is where Genie enters the narrative — a paradigm shift in generative AI that heralds the creation of generative interactive environments from mere textual or visual prompts. Genie, trained on a dataset of unprecedented scale, encompassing over 200,000 hours of publicly available internet gaming videos, transcends the traditional boundaries of generative models. With no reliance on action or text annotations, Genie is a system that responds to user input on a frame-by-frame basis, guided by a learned latent action space.

At the heart of Genie’s architecture is the spatiotemporal (ST) transformer, a core design choice that unifies all components of this generative behemoth. Genie also utilizes a novel video tokenizer, and its dynamics are shaped by a causal action model. The integration of video tokens and latent actions through MaskGIT for autoregressive frame prediction is a testament to the model’s ingenuity.

Scaling analysis of Genie’s architecture reveals a graceful ascent in performance with increased computational resources, culminating in an 11-billion-parameter model trained on a curated set of 30,000 hours of internet gameplay videos from a myriad of 2D platformer games. This foundation world model sets the stage for Genie’s versatility.

To demonstrate the universality of Genie’s approach, we also witness its application to the domain of action-free robot videos, yielding a generative environment replete with consistent latent actions. Moreover, the latent actions extrapolated from internet videos hold the potential to infer policies from unseen videos, a capability that could be the cornerstone for training generalist agents.

Thus, Genie emerges as a new class of generative model, controllable on a frame-by-frame basis and poised to redefine the landscape of generative AI. This exploration into Genie’s realm is not just an academic endeavor; it is a foray into the future of interactive experiences, where the boundaries between creator and creation blur into a canvas of endless potential.

Methodology

Within the boundless realms of generative AI, Genie stands as a paradigmatic shift, a model that transmutes video-only data into rich, interactive experiences. At its core, Genie is a tale of transformation, a methodology that begins with a foundation laid by the Vision Transformer (ViT) and evolves into a memory-efficient narrative capable of handling the vast lexical expanse of video content.

Picture the challenges posed by transformers, with their quadratic memory costs, when faced with videos potentially containing tens of thousands of tokens. Genie rises to this challenge by adopting a spatiotemporal (ST)-transformer architecture, a sophisticated dance of balancing model capacity with computational constraints, inspired by the seminal work of Xu et al.

In this realm, tokens do not simply attend to all others in indiscriminate fashion. Instead, the ST-transformer wields 𝐿 spatiotemporal blocks, an intricate layering of spatial and temporal attention layers, cascading into a feed-forward network. The spatial layer’s self-attention spans across the 1 × 𝐻 × 𝑊 tokens within each time step, while the temporal layer’s gaze traverses 𝑇 × 1 × 1 tokens through time’s corridor.

The magic of Genie’s efficiency lies in its architectural alchemy. The computation complexity, primarily shouldered by the spatial attention layer, scales linearly rather than quadratically with the number of frames. This ingenious design allows Genie to perform video generation with consistent dynamics over prolonged interactions with a mere fraction of the computational cost.
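To make the saving concrete, here is a back-of-the-envelope comparison in Python; the frame count and per-frame token grid are illustrative assumptions, not figures from the paper.

    # Rough attention-cost comparison: full attention vs. the ST factorization.
    T = 16          # frames in the context window (illustrative)
    N = 20 * 18     # tokens per frame, e.g. a 20 x 18 latent grid (illustrative)

    full_attention_pairs = (T * N) ** 2            # every token attends to every other token
    st_attention_pairs = T * N ** 2 + N * T ** 2   # per-frame spatial layers + per-position temporal layers

    print(f"full attention pairs: {full_attention_pairs:,}")   # 33,177,600
    print(f"ST attention pairs:   {st_attention_pairs:,}")     # 2,165,760

The spatial term, which dominates, grows only linearly with the number of frames, and that is what keeps long interactions affordable.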

The ST block within Genie departs from the standard transformer block: it keeps a single FFW layer after both the spatial and temporal attention layers, forsaking the post-spatial FFW so that other components can be scaled up instead, a choice that noticeably improved results.
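For readers who think in code, here is a minimal PyTorch sketch of one such block: spatial attention within each frame, causally masked temporal attention across frames, and a single feed-forward layer at the end. The class name, dimensions, and layer choices are assumptions made for illustration, not DeepMind’s implementation.

    import torch
    import torch.nn as nn

    class STBlock(nn.Module):
        """One ST block: spatial attention, causal temporal attention, a single FFW.
        A rough sketch of the structure described above, not the original code."""

        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, x):                      # x: (batch, T, H*W, dim)
            b, t, s, d = x.shape

            # Spatial attention: the 1 x H x W tokens of each frame attend to one another.
            xs = x.reshape(b * t, s, d)
            h = self.norm1(xs)
            xs = xs + self.spatial_attn(h, h, h)[0]

            # Temporal attention: each spatial position attends causally across the T frames.
            xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
            h = self.norm2(xt)
            causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
            xt = xt + self.temporal_attn(h, h, h, attn_mask=causal)[0]

            # A single feed-forward layer after both attention layers (no post-spatial FFW).
            xt = xt + self.ffw(self.norm3(xt))
            return xt.reshape(b, s, t, d).transpose(1, 2)

Stacking 𝐿 such blocks gives the backbone that Genie reuses across all of its components.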

In Genie’s universe, three key components are the stars that guide its journey:

  • A latent action model (LAM) that infers the latent action 𝒂 between frames,
  • A video tokenizer converting raw frames into discrete tokens 𝒛, and
  • A dynamics model that, given the past frame tokens and a latent action, predicts the video’s next frame.

This triad is trained in two phases, following the autoregressive video generation tradition. Once the video tokenizer is trained, the dynamics model orchestrates the future of the video, taking cues from the discrete tokens provided by the tokenizer and the subtle influences of the latent actions inferred by the latent action model. Together, they generate the next frames in an iterative symphony of prediction.

The latent action model (LAM) is a pivotal player in the Genie ensemble, allowing for controllable video generation by conditioning each future frame on the previous frame’s action. But actions are often unlabelled in the wild landscapes of internet videos, and here Genie improvises by learning latent actions in a fully unsupervised manner, eschewing the need for costly annotations.

Imagine an encoder that gazes upon the frames past and present, distilling from them a continuous stream of latent actions. A decoder then takes these whispers of past movements and latent actions, weaving them together to predict the next frame. The training of this model unfolds through the lens of a VQ-VAE-based objective, constraining the potential actions to a discrete set of codes, enabling both human playability and robust controllability.
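A minimal PyTorch sketch of that idea follows. For brevity it operates on flattened frame vectors and uses simple linear encoders and decoders, whereas Genie’s encoder is an ST-transformer over the whole sequence; every name and size here is an illustrative assumption.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentActionModel(nn.Module):
        """Sketch of the latent-action idea: encode a frame transition into a continuous
        action, snap it to one of a few VQ codes, and make a decoder reconstruct the
        next frame from the previous frame plus that code."""

        def __init__(self, frame_dim: int = 64, n_actions: int = 8, action_dim: int = 16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(2 * frame_dim, 128), nn.ReLU(), nn.Linear(128, action_dim))
            self.codebook = nn.Embedding(n_actions, action_dim)   # the small discrete action set
            self.decoder = nn.Sequential(nn.Linear(frame_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, frame_dim))

        def forward(self, prev_frame, next_frame):
            # Encode the transition between consecutive frames into a continuous action.
            a_cont = self.encoder(torch.cat([prev_frame, next_frame], dim=-1))

            # Vector-quantize: snap to the nearest codebook entry (the VQ-VAE step).
            idx = torch.cdist(a_cont, self.codebook.weight).argmin(dim=-1)
            a_disc = self.codebook(idx)
            a = a_cont + (a_disc - a_cont).detach()    # straight-through estimator

            # The decoder sees only the previous frame and the latent action.
            recon = self.decoder(torch.cat([prev_frame, a], dim=-1))
            recon_loss = F.mse_loss(recon, next_frame)
            vq_loss = F.mse_loss(a_cont, a_disc.detach()) + F.mse_loss(a_disc, a_cont.detach())
            return idx, recon_loss + vq_loss

Because only a handful of codes exist, each one tends to absorb a consistent behavior across frames, which is what makes the learned actions playable by a human.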

The ST-transformer architecture extends its tendrils into the latent action model, which takes the entire video as input and, under a causal mask, elegantly generates the latent actions for every frame. This allows Genie to handle the full narrative arc of the video in one cohesive sequence.

Compressing the vastness of videos into discrete tokens is the realm of the video tokenizer. It takes frames of video as input and, using the alchemy of VQ-VAE, transmutes them into a compact representation. Unlike other works that focus solely on spatial compression, Genie’s video tokenizer employs the ST-transformer in both encoder and decoder phases, capturing the temporal dynamics and enhancing the quality of video generation.
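Conceptually, the tokenizer’s role can be sketched like this, with a per-patch linear encoder and decoder standing in for the ST-transformer versions; the codebook size and dimensions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class VideoTokenizer(nn.Module):
        """Sketch of the tokenizer's job: map each frame to a grid of discrete token ids
        via a learned codebook (VQ-VAE), so later stages work on tokens, not pixels."""

        def __init__(self, patch_dim: int = 48, vocab_size: int = 1024, embed_dim: int = 32):
            super().__init__()
            self.encoder = nn.Linear(patch_dim, embed_dim)      # stand-in for the ST-transformer encoder
            self.codebook = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.Linear(embed_dim, patch_dim)      # stand-in for the ST-transformer decoder

        def encode(self, patches):                              # patches: (T, H*W, patch_dim)
            z = self.encoder(patches)
            dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
            return dists.argmin(dim=-1)                         # (T, H*W) discrete token ids

        def decode(self, token_ids):                            # token ids back to patch features
            return self.decoder(self.codebook(token_ids))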

The dynamics model, a decoder-only MaskGIT transformer, is the final act in Genie’s methodological play. It takes the baton from the tokenized video and latent actions, generating predictions for the upcoming frames. It’s a model that learns from both the past and the potential future, predicting the next scenes with a cross-entropy loss that finely tunes its performance to the true trajectory of the video.

In Genie’s world, the latent actions are not mere afterthoughts; they are central to the generation process, treated as additive embeddings that enrich the model’s controllability and responsiveness.
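A toy version of that masked-prediction training step, with a stock transformer layer standing in for the ST-transformer backbone, might look as follows; the vocabulary size, masking rate, and layer sizes are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicsModel(nn.Module):
        """Sketch of MaskGIT-style training: embed video tokens, add the latent action
        as an additive embedding, mask part of the next frame, and predict the masked
        tokens with a cross-entropy loss."""

        def __init__(self, vocab_size: int = 1024, n_actions: int = 8, dim: int = 64):
            super().__init__()
            self.token_embed = nn.Embedding(vocab_size + 1, dim)    # +1 for the [MASK] token
            self.action_embed = nn.Embedding(n_actions, dim)        # latent actions as additive embeddings
            self.backbone = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)  # stand-in backbone
            self.head = nn.Linear(dim, vocab_size)
            self.mask_id = vocab_size

        def loss(self, prev_tokens, next_tokens, action_id, mask_rate: float = 0.5):
            # Randomly mask a fraction of the next frame's tokens (MaskGIT-style training).
            mask = torch.rand_like(next_tokens, dtype=torch.float) < mask_rate
            corrupted = next_tokens.masked_fill(mask, self.mask_id)

            # Embed past tokens and the (masked) future tokens, then add the action embedding.
            x = self.token_embed(torch.cat([prev_tokens, corrupted], dim=1))
            x = x + self.action_embed(action_id).unsqueeze(1)

            logits = self.head(self.backbone(x))[:, prev_tokens.shape[1]:]  # next-frame predictions
            return F.cross_entropy(logits[mask], next_tokens[mask])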

As we pivot to Genie’s action-controllable video generation at inference time, we invite the user to prompt the model with an initial image. This image, tokenized and combined with user-specified latent actions, becomes the seed from which the dynamics model grows the subsequent frames, each iteration a step into a new world crafted by the user’s desires.
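The interaction loop itself is short enough to sketch directly. The two stand-in functions below simply return random tokens so the skeleton runs end to end; in the real system the dynamics model samples the next frame’s tokens iteratively in the MaskGIT style, and the tokenizer’s decoder renders them back into pixels.

    import torch

    vocab_size, tokens_per_frame = 1024, 16
    tokenize = lambda image: torch.randint(vocab_size, (1, tokens_per_frame))                 # stand-in tokenizer
    predict_next = lambda history, action: torch.randint(vocab_size, (1, tokens_per_frame))   # stand-in dynamics model

    prompt_image = torch.zeros(1, 3, 64, 64)     # the user's starting image (placeholder)
    history = [tokenize(prompt_image)]           # tokenized prompt frame

    for step in range(8):
        action = step % 8                        # in practice, the latent action the user chooses each step
        next_tokens = predict_next(torch.cat(history, dim=1), action)
        history.append(next_tokens)              # each generated frame joins the context for the next one

    print(len(history), "frames of tokens generated")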

The beauty of Genie lies not just in its technical prowess but in its promise for the future — a future where generative AI transcends the screen to create interactive, dynamic narratives that respond to our input and imagination.

Examples

Conclusion and Future Work

In the innovative realm of generative AI, Genie emerges as a transformative force, illustrating the power of technology to turn the fantastical into the tangible. Genie enables the creation and exploration of virtual worlds, crafted from the imagination and made real through the lens of a video dataset comprising 2D platformer games. It’s a tool that promises to extend the boundaries of play, learning, and creativity for all ages.

Despite its novel capabilities, Genie is not without limitations. The current iteration, like its autoregressive transformer relatives, can sometimes predict futures that border on the fantastical rather than the feasible. The challenge of sustaining coherent environments over longer sequences persists, given its current temporal reach of 16 frames. Additionally, the operational speed of Genie sits at around 1 FPS, necessitating advancements for seamless real-time interactions.

Looking ahead, Genie’s potential as a foundation for future research is immense. The prospect of a model trained on a wider array of Internet videos offers the exciting possibility of simulating a broader diversity of environments. Particularly intriguing is the potential application in reinforcement learning, where Genie could be instrumental in developing agents with a more generalized skill set, adapting to varied and complex scenarios.

Broader Impact

Genie holds the promise of democratizing game development, providing a platform where individuals can generate their own interactive experiences. This could significantly enhance creative expression, especially for children, who could construct and inhabit worlds born from their own stories and drawings. With the progression of this technology, the gaming industry may find in Genie a powerful ally for game creation, enabling designers to push the boundaries of world-building.

Caution and Responsibility

Given the exploratory nature of Genie and its potential for broad application, the decision to withhold the trained model checkpoints and the training dataset has been made. This ensures that any future releases or applications of Genie are conducted responsibly, with a focus on safety, ethics, and the nurturing of creativity.

Reproducibility and Community Engagement

To bridge the gap between the capabilities of Genie and the wider research community’s resources, a scaled-down version of the model has been made available. This version is designed to be more accessible, allowing for experimentation and further development on less advanced hardware. It provides an opportunity for a more diverse group of researchers to contribute to the ongoing dialogue surrounding Genie’s architecture and its applications, fostering a collaborative effort towards a shared vision for the future of interactive environments.