Real or Fake? Testing the latest AI Image Generators

AI-based text-to-image generation has been the most sensational and controversial AI technology breakthrough we’ve seen so far. In seconds you can bring to life a high-fidelity photo of a cat wearing a cape riding a unicorn through the asteroid belt, or any number of other images. All you need is a well-trained AI model, a few hundred GPUs, and a creative idea.

Most companies have a constant need for new graphics – for ads, presentations, internal and external communications. When you’re authoring documents and presentations, sometimes you need an image to illustrate your message. Searching the web or digging through photo libraries for the right image takes a lot of time, and you often settle for whatever you can get. This is where text-to-image generators can add value.

Text-to-image generators are not, and should not be, a replacement for creative staff and design agencies. Instead, quickly generating images can improve business agility. For example, I have often needed an illustration quickly when the design team was overloaded, or when I was having trouble communicating my creative vision.

OK. So what are my options?

There are hundreds of image generation services, but most of these are frontend wrappers for one of the “foundation models”. At the time of writing, there are seven leading foundation models: OpenAI DALL-E, Flux, Meta Emu, Adobe Firefly, Stable Diffusion, Midjourney, and Google Gemini Imagen. These are the models I tested, with a simple text-to-image request: “A couple buying a car”. I evaluated each generator by asking: “Does the image look real enough to be used in a presentation?”

Here are the results, and my commentary. Each image links to a high-fidelity version if you want to take a closer look.

OpenAI DALL-E 3

My impression of DALL-E 3 was: the composition has a great layout, with smiling actors finalizing their car purchase. But then I noticed: the eyes and teeth are unnatural and appear painted on. The fingers on the female actor’s left hand are distorted, and the agent appears to have no mouth. Microsoft Bing Image Creator also uses DALL-E behind the scenes, and it produces similar images, with the same eye, teeth, and finger problems.

Verdict: FAKE

Flux.1 Dev

A similar composition to DALL-E and Bing. Again, at first glance, the image looks OK, but on closer inspection the female actor’s left hand has six fingers. In my testing, Flux has other reality problems – in another attempt, the car’s front seats face each other instead of facing forward.

Verdict: FAKE

Meta Emu

The composition is good. You see reflections of the actors in the car’s surface, and both actors have the correct number of fingers. But the male actor’s beard, and both actors’ faces, look “digital” and fake. Plus, both actors have wide grins but glassy eyes, as if they are dead inside.

Verdict: FAKE AND UNSETTLING

Adobe Firefly

The composition is good, but the detail is the worst of all the AI generators I tested. Both of the female actor’s hands look badly mangled. I don’t know what happened to the male actor’s hand, but there is nothing at the end of his sleeve. This image was generated with Firefly 3. An earlier version, Firefly 2, sometimes generated people with extra arms. Firefly 4 will be publicly available soon; hopefully these problems will be fixed.

Verdict: VERY FAKE

Stable Diffusion

Most of the details are correct, and there are some fun cars in the showroom. The composition isn’t great, but what bugs me most is that the couple seems more interested in the ceiling than in buying a car. Plus, the fingers of the male actor’s right hand are unnaturally long.

Verdict: FAKE

Midjourney

The composition is good. As with every other generator, the male actor has a beard. The couple look natural, each has the right number of fingers, and I get the sense they have found the car they are looking for. This could be a real photograph.

Verdict: COULD BE REAL

Google Gemini (Imagen)

The composition is good. The couple look natural and relaxed. The female actor appears to be directing the male actor where to sign. Bonus: everyone has the right number of fingers. This could be a real photograph.

Gemini Imagen has made rapid advances. Imagen 1 had the familiar mangled-hands problems we see with other generators. Imagen 2 improved, but still looked artificial. Imagen 3 fixes these problems, and renders more realistic interactions, lighting, and reflections.

Verdict: COULD BE REAL

Are we there yet?

Yes and no. All of these text-to-image generators produce results that weren’t possible a year ago, and if you try a few times you’ll eventually generate something that is good enough. For me, if I needed an image of a couple buying a car, both Midjourney 6.1 and Google Imagen 3-002 immediately generated results I could use; the others have some catching up to do – the hype doesn’t yet meet reality. I’ll post updates as new models become available.

Getting to a 1+1=3 result

In my testing, I used a very simple “prompt” with each generator. A prompt is your creative direction to the generator, and today it’s the most common way for people and AI to work together as collaborative partners, instead of serving under a robotic overlord.

Understanding “prompt engineering” is how you improve the results. The two images below show a simple example. Using Google Gemini, I modified the prompt to “A couple buying a car, laughing as their dog explores the car.” Twenty seconds later, Gemini delivered the two images below, and I like the results a lot more. I am a visual person: I like seeing tension in a photograph, and I like when an image helps tell a story. Short of hiring a professional designer, a strong understanding of prompt engineering can help you produce images that credibly add value to a story or presentation.
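If you generate images programmatically rather than through a chat interface, this kind of prompt refinement can be scripted. The helper below is a hypothetical sketch, not any generator’s actual API: it simply composes a base subject with optional scene details, the same way I layered the dog onto the original prompt.

```python
def build_prompt(subject, details=(), style=None):
    """Compose a text-to-image prompt from a base subject,
    optional scene details, and an optional style cue.
    (Hypothetical helper, for illustration only.)"""
    parts = [subject, *details]
    if style:
        parts.append(f"{style} style")
    return ", ".join(parts)

# The simple prompt used in the tests above:
print(build_prompt("A couple buying a car"))

# The refined prompt, with added narrative detail:
print(build_prompt("A couple buying a car",
                   details=["laughing as their dog explores the car"]))
```

The point is not the code itself, but the habit it encodes: start with a plain subject, then iterate by adding details that create tension or tell a story, and compare the results.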

Searching the internet or organizing a photoshoot with a dog that likely won’t cooperate would have taken a lot longer; an obvious nonstarter. But there are new creative choices for business users to navigate. Users must develop the ability to discern when original design, original photography, image generation, or stock photography is the most appropriate option.

In business, there is a place for text-to-image generators, but good results aren’t guaranteed by the generator alone. This is a new concept for most business people. We have grown accustomed to assessing software for what it does for us, rather than what we can do in concert with it.

For all the talk about which model generates the "best" images, the average user will not fully appreciate that this is an act of co-creation. Regardless of the model you choose, the onus is still on the human user to bring out the best in the model. That's both a problem and an opportunity.

I will be covering how we can improve results with better integration and prompting soon.