The Impossible Science of Stable Diffusion

Gideon Greenspan
11 min read · Nov 17, 2022

Last time I felt like this was 2013. After hearing about bitcoin a few times, I devoted a day to reading the paper and learning how it works. By late afternoon, the penny had dropped, as had my jaw. Bitcoin had achieved something which I’d assumed was impossible — digital money without an intermediary. Soon after, I co-founded a company in the blockchain space, which I continue to run today. Despite the excessive hype and crypto chaos, bitcoin’s implications are continuing to play out, for good or otherwise.

Nine years later, Stable Diffusion made me feel the same. Sure, we’ve all been hearing about artificial intelligence (AI) for years and feeling its consequences — the Siri digital assistant (which still asks me stupid questions), concerns over TikTok’s algorithm (although content recommendation is nothing new), and the promise (any time now!) of self-driving vehicles. But to me, Stable Diffusion is different. It does something that seems impossible. And it does it amazingly well, for free, on today’s computers.

For those who don’t know, Stable Diffusion is one of several state-of-the-art AI image generators, alongside DALL-E, Imagen and Midjourney. These tools can be used in a number of different ways, such as filling in gaps in a photo. But the most compelling application is creating brand new art from a plain English sentence called a “prompt”. For example, all of the images in this article were generated in a few seconds from the sentence: “beautiful digital art created by artificial human intelligence”. Don’t like the results? No problem, keep trying, again and again, until you do.

Unlike its competitors, Stable Diffusion is open source and doesn’t have crazy computing requirements. It can run on a regular home computer with a $500 graphics card, or a server that can be rented for a few dollars per day. This accessibility is what makes Stable Diffusion so remarkable. You can already download unofficial apps for Windows, Mac and iPhone or try it for free online (there’s sometimes a digital queue). So I’m guessing we’re just a few years away from AI art becoming a standard feature on our computers and phones, like web browsing and video playback are today.
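
To give a sense of just how accessible it is: with a suitable graphics card, generating an image takes only a handful of lines of Python using the open-source diffusers library. The library and the model checkpoint below are my own choices for illustration, not part of any official app:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the trained model weights (a few GB) and move them to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate an image from the same prompt used for this article's pictures.
prompt = "beautiful digital art created by artificial human intelligence"
image = pipe(prompt).images[0]
image.save("ai_art.png")
```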

Like bitcoin before it, Stable Diffusion totally triggered my inner geek. So over the past couple of months, I’ve been going down the rabbit hole, studying the paper and many of its most important references. I originally thought of trying to write a detailed explanation of the technology, in terms understandable to a layperson. But this turns out to be a daunting task, since there is such a huge amount of material to summarize. (Having said that, I’m a sucker for punishment and might try one day!)

In the meantime, here’s a brief and (almost) readable summary of the core elements of the Stable Diffusion system:

  • A “language model” which converts a sentence into a grid of numbers, capturing the meaning of each word in the context of the sentence as a whole. A good language model knows that the word “running” means something different in “my nose is running” and “my child is running”, even if the phrases are grammatically identical. Stable Diffusion uses language models created by others — Google’s BERT in the paper and OpenAI’s CLIP in the release. No need to reinvent this part of the wheel.
  • An “autoencoder” which can convert an image into a more compact representation, and a corresponding decoder which goes the other way. This is a type of compression, but it’s nothing like JPEG, which focuses on local similarities in color and shade. Instead, the autoencoder captures key properties of large areas of the image, based on how humans perceive them. For example, it identifies blue plates or fluffy dog fur.
  • A “diffusion model” which turns a messy image into a slightly more meaningful one. By applying this model repeatedly, a completely random grid of pixels can be gradually transformed into something real. Stable Diffusion’s key innovation is applying diffusion to an image’s compact representation, rather than its full set of pixels.
  • The diffusion model runs in a structure called a “U-Net”, which incorporates the language model’s output to produce an image matching the prompt provided. A U-Net is good at focusing on both local and global aspects of the image. For example, it can ensure that a dog only has one tail, and that this tail is shaped correctly.
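
To make the division of labor concrete, here is a rough sketch of how those four pieces are wired together, using lower-level components from the open-source diffusers and transformers libraries. The checkpoint name, step count and several real-world details (such as classifier-free guidance) are my own simplifications for illustration, not the official pipeline:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint, for illustration
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")   # language model
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")                      # autoencoder
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")             # U-Net
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

prompt = "beautiful digital art created by artificial human intelligence"

# 1. Language model: the sentence becomes a grid of numbers, one row per token.
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]

# 2. Start from a completely random grid: the compact (latent) representation.
latents = torch.randn(1, unet.config.in_channels, 64, 64)
scheduler.set_timesteps(50)
latents = latents * scheduler.init_noise_sigma

# 3. Diffusion in the U-Net: repeatedly turn the messy grid into a slightly
#    more meaningful one, guided by the language model's output.
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. Autoencoder's decoder: expand the compact representation into full pixels.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample  # 0.18215 is the model's latent scale
# (Converting this tensor into a viewable PNG is left out for brevity.)
```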

So there you go: language model, autoencoder, diffusion model and U-Net. That’s a lot of technical names for some very clever stuff. But here comes the wild bit. Unlike most of the software we use today, none of these elements was programmed by human beings. Instead, they were trained, using vast amounts of data collected from the Internet. So we don’t have a clue how they actually work.

To be less flippant, we do understand the overall system’s input and output, otherwise we couldn’t send in some text or pull out an image. But the real magic happens in the many “hidden layers” between. Each layer takes the results from one or more previous layers, performs some calculations on them, and passes the answer on. And what are these calculations? Trillions of sums, mainly multiplications and additions, performed on multi-dimensional grids containing billions of numbers. In other words, far too much math for humans to follow or comprehend.
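
As a toy illustration of what a single layer does (with tiny sizes and random numbers standing in for the learned values):

```python
import numpy as np

# One "hidden layer": multiply the previous layer's numbers by a grid of
# learned weights, add learned offsets, then zero out the negatives (ReLU).
# The sizes here are tiny stand-ins; real layers involve millions of weights.
x = np.random.randn(768)         # numbers arriving from the previous layer
W = np.random.randn(1024, 768)   # the layer's learned weights (random here)
b = np.random.randn(1024)        # the layer's learned offsets (random here)

y = np.maximum(W @ x + b, 0.0)   # the layer's answer, passed on to the next layer
```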

These systems are trained using a vast corpus of information, nudging them repeatedly until they give good results. Take for example the diffusion model, whose job is to de-randomize an image. The training starts by taking an image (easy), degrading that image with some random noise (easy), challenging the model to undo the damage (hard), then comparing the results with the original (easy). Based on the results, the model’s calculations are tweaked, so it will do slightly better on the same image next time (easy-ish). Then this whole process is repeated, millions of times, using a different image each time. All the easy parts are done by regular code written by humans, but the model has to solve the hard bit all on its own. And here’s where the magic happens. If we design our model well, and train it for long enough, it learns to do what we want, to a remarkable degree of accuracy.
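
Here is a toy sketch of what one of those millions of training rounds might look like. The noise recipe is heavily simplified, and in the real system the model is asked to predict the added noise rather than the clean image (which amounts to the same comparison); the names model, optimizer and images are stand-ins, not real components:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, images):
    batch_size = images.shape[0]
    t = torch.rand(batch_size, 1, 1, 1)          # how badly to degrade each image
    noise = torch.randn_like(images)             # easy: make some random noise
    noisy = (1 - t) * images + t * noise         # easy: degrade the images
    predicted_noise = model(noisy, t)            # hard: the model tries to undo the damage
    loss = F.mse_loss(predicted_noise, noise)    # easy: compare with what really happened
    optimizer.zero_grad()
    loss.backward()                              # easy-ish: nudge the model's numbers so it
    optimizer.step()                             #   does slightly better next time
    return loss.item()
```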

Now, training needs a lot of data. For example, Google’s language model was trained on 3 billion words, taken from BookCorpus (self-published books) and the English Wikipedia. The main part of Stable Diffusion was trained on 2 billion images and their associated English captions, trawled from the Internet by an organization called LAION. The whole process took almost a month, at an estimated cost of $600,000 in equipment hire and electricity. That may sound like a lot, but don’t forget that training is a one-time thing. For less than a million dollars, Stable Diffusion has put the power of AI art into the hands of billions of people.

It’s fun to compare this to the process of educating a human artist. In the US, it costs around $300,000 to raise a child, plus another $300,000 for a four-year degree at a top art school. That makes a total of $600,000, exactly the same as training Stable Diffusion. And how many images will our budding 22 year old artist see before graduating? Well, humans perceive around 30 images per second and are awake for an average of 16 hours per day. Do the math and you get to… 14 billion images, about 7 times more than Stable Diffusion. Of course, children spend most of their time looking at bedroom walls, books and classrooms, rather than online images (or so their parents hope). So the comparison is imperfect, but still provides food for thought.
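
For the curious, here is the back-of-the-envelope arithmetic behind that figure:

```python
# Rough numbers from the paragraph above.
images_per_second = 30
waking_hours_per_day = 16
years_until_graduation = 22

images_seen = images_per_second * 3600 * waking_hours_per_day * 365 * years_until_graduation
print(f"{images_seen:,}")  # 13,875,840,000 -- roughly 14 billion, about 7x LAION's 2 billion
```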

Stable Diffusion is called a “deep learning” system, because of the many hidden calculation layers between its input and output. The design of these systems is inspired by the brain of humans and other mammals, in which billions of neurons are organized into somewhat identifiable layers, and are connected in trillions of ways. The calculations in deep learning loosely follow the way in which neurons stimulate and inhibit each other’s activity. And the training process is modeled on our caricatured understanding of biological learning, in which “neurons that fire together wire together”. In both deep learning and organic brains, it appears that knowledge and skills emerge from complex, impenetrable chaos.

There’s another way in which Stable Diffusion’s architecture can be compared with biology — its modular design. Over the past century, we’ve identified many brain regions which focus on particular tasks, such as vision, speech and muscle memory. Similarly, Stable Diffusion connects three discrete subsystems together, each of which has a specialized function. But why do we need this modularity? Why not just train one big system to convert prompts directly into images, without the intermediate stages and representations?

The answer is that, theoretically, we could. But it would require even more calculations and a far larger training set of images. By designing a system which captures the internal structure of a problem, we can solve that problem with less time and money. But this doesn’t mean that Stable Diffusion’s design was obvious or easy. Finding an effective AI architecture for a particular application takes years of intuition, experience, trial and error. AI systems have evolved over time, with hundreds of thousands of engineers and researchers testing different ideas to see what works. I take my hat off to the team behind Stable Diffusion, and I’m sure they’d pass on the compliment to their predecessors, peers and rivals.

When confronted with the power of a system like Stable Diffusion, it’s easy to get carried away with futuristic thoughts — about software corrupting our minds, the end of human employment or a singularity of runaway technological development. So it’s important to remember that these systems are still only performing computation, breaking no new ground in terms of the fundamentals of Computer Science. Sure, this is computation using a new paradigm, where the size of the program (or its data) is vast, and the input (billions of labelled images) is larger still. In the same way, a modern supercomputer is far more powerful than an ancient PC. But if we’re patient enough, there’s nothing one can do which the other cannot.

While keeping that sober approach in mind, it’s interesting to consider how Stable Diffusion and related technologies might affect computing and society over the next few decades. Let’s start with the obvious and gradually dial up the speculation:

  • Like every previous computer technology, Stable Diffusion will get better, faster and cheaper. New versions will appear, generating even more compelling images at higher resolutions. Graphics cards will drop in price and cloud-based services will proliferate, until we can generate images in real-time as we type and edit our prompts. The only way this doesn’t happen is if Stable Diffusion is superseded by something better.
  • Over the past few decades, computer users have made the transition from punch cards to command lines, mice to touch screens, search engines to voice control. Each advance brought the computer closer to a natural human way of interacting, but we’ve still had to learn something on our side — which commands to enter, how to point and click, which phrases Siri understands. This time will be no different — users will need to perfect the art of prompt writing, to guide AI to generate the content they desire. For Stable Diffusion, guides, examples and books about prompting have already appeared.
  • The technology behind Stable Diffusion will rapidly be extended to video, for which there are already prototypes from Google and Meta. Of course, going from static images to moving video turns a 2-dimensional problem into a 3-dimensional one, with the additional axis of time. But the transition is easier than it seems, since videos contain far more redundancy than images. With the exception of the occasional “hard cut”, they don’t change much from one frame to the next.
  • The legal status of generative AI will be highly contested. Human artists talk openly about their influences, without being accused of copying. So if an AI learns a model from billions of images, then uses that model to create something new, surely it’s no less original? Well, maybe. But what if Stable Diffusion creates a picture which looks just like an existing work? Since we don’t understand its internal calculations, how do we decide if this is coincidence or plagiarism? Even worse, Stable Diffusion can be explicitly prompted to use a particular artist’s style, leaving the imitated artist feeling far from flattered. In the programming space, an AI assistant called Copilot is already being sued for reproducing copyrighted code without attribution. Expect more to come.
  • Stable Diffusion is a stunning tool and will undoubtedly find its uses. But I don’t see it as economically disruptive, since the graphic design market is estimated at under 0.05% of global GDP. But what about video, music or games? Will AI transform the $3 trillion entertainment industry? And then there’s the workplace. Will Office 2032 draft spreadsheets and presentations for us, based on a few descriptive sentences?
  • Instead of just lowering costs, AI may drastically increase the amount of content being produced and consumed, by delivering music and videos personalized to an individual’s tastes. So forget about the dangers of TikTok addiction. Imagine a service which generates entertaining videos for you, in real time, learning your preferences from your swipes. Or a service which delivers the news with the exact angle and bias that suits your prejudices. If you think filter bubbles cause political polarization, what happens when every single person has their own “alternative facts”?
  • Generative AI relies on two preceding technologies, neither of which was built with it in mind. First is the “graphics processing unit” or GPU, whose development was driven by gaming and cryptocurrencies. As luck would have it, deep learning just happens to require the same sorts of calculations, in the same stupendous quantities. Second is the World Wide Web. If billions of people weren’t online, uploading and labeling their creations, then we couldn’t collect enough data to train these systems. All this makes me wonder: If generative AI goes mainstream, empowering an explosion of hybrid human-machine creativity, what unimaginable new technology will be built on the results?

Undoubtedly, lots of questions remain. But it’s safe to say that we’re at a crucial juncture in the history of computing, where the boundaries of the possible are shifting before our eyes. Bitcoin aside, the past few decades have seen many similar moments, including the first GUI, computer video, web browser and smartphone. Each of these ushered in a tsunami of innovation whose consequences were impossible to predict. My bet is that Stable Diffusion and generative AI will do the same again.


Gideon Greenspan

Developer, entrepreneur and lecturer. PhD Computer Science, MA Philosophy. Founder of MultiChain, Web Sudoku, Copyscape, Family Echo. From London to Tel Aviv.