Adventures in Generating

Since late 2022 I have been generating social media profile images using a custom-trained model based on Stable Diffusion 1.5. Almost monthly, I would create a new (sometimes themed) image to post to Facebook, Discord, and Google. The following images were generated:

The first few were stylized, and most people who asked about them wanted to know what phone app I used and what the ‘filter’ was called. The final image, though, gets different questions. Before I can explain that it is not really me, and that I am not really being ‘blinded by the light’, people want to know where I was, or why I am not smiling. Anyone who has played around with Stable Diffusion knows that a ‘perfect’ image is hard to achieve and requires a lot of ‘prompt engineering’, the term for iteratively building the text description of the image you want generated. For every good image there are many bad ones. Stylized images are easier in a lot of ways; the uncanny valley matters less when you want someone to look like they are from a comic book. Still, strange artifacts can show up. Take a look at some of these images, and notice the duplicated limbs and features.

Depending on certain poses, prompts, and configurations, I would get ‘close-but-no-cigar’ results…
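
A big part of taming those artifacts is the prompt itself, and especially the negative prompt. If you would rather script this than click around a web UI, here is a rough sketch using the diffusers library; the model name and both prompts are placeholders, not my exact settings:

```python
from diffusers import StableDiffusionPipeline
import torch

# Load a Stable Diffusion 1.5 checkpoint (swap in a custom-trained model here).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The prompt describes what you want; the negative prompt lists what to avoid,
# which helps push back against things like extra limbs and duplicated features.
image = pipe(
    prompt="portrait of a man in a comic book style, bold ink lines, vibrant colors",
    negative_prompt="extra limbs, extra fingers, duplicated features, blurry, deformed",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

image.save("profile.png")
```

A negative prompt is not a guarantee of a clean result, but it tends to cut down on the duplicated-limb kind of image considerably.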

I took all of the training images (images of my dumb face so Stable Diffusion can learn what I look like) in my kitchen. Notice that in the simpler images above, white cabinetry and kitchen-related things sprawl behind me… If I remember correctly, those images came from the second or third version of the model. I made four separate models with varying configuration settings. Below is an example of how the relationship between the different models and the CFG scale affects the result. CFG stands for ‘Classifier-Free Guidance’, and the scale controls how strictly Stable Diffusion follows the prompt: lower values give it more creative freedom, higher values pin it to the prompt. When generating a person from a trained model, a higher CFG scale can sometimes compensate for poor-quality training images, or for a model that was trained with fewer steps. Training steps effectively equate to the amount of effort and time Stable Diffusion spends learning the features of a subject while building the model. For example, check out the following images:

Generally speaking, the ‘mg_person_2000_h_v1’ model (my naming conventions are not great…) gives the best results at low CFG scale values. The ‘mg_person_v3_1000_h’ model isn’t awful, but some features are a little exaggerated (in a very unflattering way…). I learned a lot by fiddling with settings, and while I eventually started to produce decent, passably realistic images, I still made a few stylized ones too!
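
If you want to run this kind of comparison yourself, a sketch with the diffusers library might look like the following; the model path, the ‘mg_person’ token, and the prompt are placeholders rather than my actual setup:

```python
from diffusers import StableDiffusionPipeline
import torch

# Load one of the custom-trained checkpoints (this path is just a placeholder).
pipe = StableDiffusionPipeline.from_pretrained(
    "./models/mg_person_2000_h_v1",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "photo of mg_person standing in a kitchen, natural lighting"

# Sweep the CFG scale: low values let the model improvise, high values
# force it to follow the prompt (and the trained likeness) more strictly.
for cfg in (3.0, 7.0, 11.0, 15.0):
    image = pipe(
        prompt,
        guidance_scale=cfg,
        num_inference_steps=30,
        generator=torch.Generator("cuda").manual_seed(42),  # same seed for a fair comparison
    ).images[0]
    image.save(f"cfg_{cfg:.0f}.png")
```

Keeping the seed fixed is what makes the comparison meaningful: with the same starting noise, the only thing changing from image to image is how hard the CFG scale pushes the result toward the prompt.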

So, what’s next? Video. I was playing around with Deforum and was able to generate some decent videos (though I cannot find them). I did find some of the image-to-video examples I made. Effectively, a recorded driver video is used to animate a still image:

I may make some videos on my process, but a quick Google search can get you started if you want to do this too. I run all of this locally on my GPU, but you can use Colab notebooks if your GPU doesn’t meet the requirements.
