HDiT creates HD images in a compute-efficient manner.
Image generative tasks like text-to-image are either slow or resource-intensive.
HDiT attempts to solve this issue.
The folks at Stability AI, along with some other researchers, developed a variation on the vision transformer architecture, which they call the Hourglass Diffusion Transformer (⏳ HDiT). Not sure about you, but to me it does kinda look like an hourglass lying on its side.
With this new architecture, compute scales linearly with pixel count instead of quadratically.
What does that mean for us mere mortals?
It means we get to generate high-resolution images with less time and less compute.
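To get a feel for why that matters, here's a rough back-of-the-envelope sketch in plain Python (my own illustrative numbers, not the paper's exact configuration): with global self-attention, cost grows with the square of the token count, so going from 256px to 1024px images blows the cost up by ~256×, while a linear-scaling architecture only grows it ~16×.

```python
# Back-of-the-envelope: how attention cost grows with image resolution.
# PATCH is an assumed patch size for illustration, not the paper's setting.

PATCH = 4  # pixels per patch side (assumption)

def tokens(resolution: int) -> int:
    """Number of patch tokens for a square image of the given resolution."""
    return (resolution // PATCH) ** 2

def quadratic_cost(resolution: int) -> int:
    """Global self-attention: cost grows ~ tokens^2."""
    n = tokens(resolution)
    return n * n

def linear_cost(resolution: int) -> int:
    """Linear-scaling architecture: cost grows ~ tokens."""
    return tokens(resolution)

for res in (256, 512, 1024):
    print(f"{res}px: {tokens(res)} tokens, "
          f"quadratic cost x{quadratic_cost(res) // quadratic_cost(256)}, "
          f"linear cost x{linear_cost(res) // linear_cost(256)}")
```

Running it shows the gap widening fast: at 1024px the quadratic model pays 256× the 256px cost, the linear one only 16×.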
[Sample images generated with the new HDiT models; see the project's GitHub repo.]
I’ve heard some people say that transformers are a dead end for generative AI given their computational complexity, so it’s interesting to see researchers trying out hybrid takes on the transformer architecture, or doing away with it altogether, as RWKV and Mamba did.
And then eventually, some day, maybe, we’d be able to run tasks like text-to-image, super-resolution (low-res to high-res), or even text-to-video quickly on our laptops, or even on our phones!
Definitely looking forward to that!