Generative AI and combinatorial optimization, part 2: Why text-based image generators can’t render text properly

Language and chess are two of the best-known domains for combinatorial reasoning, and they share the fundamental property of all combinatorial domains: their discrete building blocks differ in quality, which means there is no in-between.

Oliver Beige
5 min read · Apr 24, 2024

Language, like chess, is one of the major domains of combinatorial optimization. Letters are structured combinations of lines and curves, words are structured combinations of letters, sentences are structured combinations of words. Texts are structured combinations of sentences.

Combinatorial optimization is one of the two shapes of discrete optimization. The other, integer optimization, was famously born when the US Navy complained that its consultants shouldn't propose building 1.3 aircraft carriers.

Integer optimization restricts the solution space to whole numbers, but at least the in-between is meaningful. We probably wouldn’t build one third of an aircraft carrier or hire three fifths of a worker, but we could deploy one worker three fifths of the time.

Integer optimization can be approximated by using standard constrained optimization and then rounding up or down. Every once in a while this produces wildly incorrect answers, but at least it’s a coherent way of proceeding, since the integers are on the same number line as the reals. So interpolating, which is what machine learning algorithms do a lot (albeit in high-dimensional vector spaces), is a meaningful activity that produces meaningful answers, even if they are excluded from the solution set.
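A tiny (hypothetical) knapsack instance makes the hazard concrete: the relaxed fractional optimum takes 8/9 of an item, and rounding that fraction either up (infeasible) or down (value 60) misses the true integer optimum of 100. A minimal sketch, with numbers chosen purely for illustration:

```python
from itertools import combinations

# Hypothetical knapsack instance, chosen to make rounding fail.
values   = [60, 50, 50]
weights  = [10, 9, 9]
capacity = 18

def lp_relaxation(values, weights, capacity):
    """Exact fractional optimum of the 0/1 knapsack relaxation:
    greedily fill by value/weight ratio (Dantzig's bound)."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    x = [0.0] * len(values)
    room = capacity
    for i in order:
        take = min(1.0, room / weights[i])
        x[i] = take
        room -= take * weights[i]
        if room <= 0:
            break
    return x

def integer_optimum(values, weights, capacity):
    """Exact integer optimum by brute force (fine for three items)."""
    best = 0
    for r in range(len(values) + 1):
        for subset in combinations(range(len(values)), r):
            if sum(weights[i] for i in subset) <= capacity:
                best = max(best, sum(values[i] for i in subset))
    return best

x = lp_relaxation(values, weights, capacity)   # [1.0, 0.888..., 0.0]
rounded_down = [int(xi) for xi in x]           # [1, 0, 0] -> value 60
print(integer_optimum(values, weights, capacity))  # 100: items 1 and 2
```

Rounding the fractional item up exceeds the capacity; rounding it down forfeits 40% of the achievable value. The interpolated answer is coherent, just occasionally very wrong.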

This process of interpolation does not work in combinatorial optimization simply because combinatorial optimization deals with qualities rather than quantities. The items to be combined differ in kind, not in degree.

This means the in-between does not exist, is ill-defined, or simply produces completely misleading answers. What is the midpoint between an A and an E? If you go by the alphabet, the midpoint between the first and the fifth letter is the third, C, which probably wouldn’t work as a useful substitute in most scenarios. Bookprinters were more ingenious and invented ligatures such as Æ for a small set of letter combinations. But what is the ligature of R and B?
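The alphabet midpoint can be computed literally, by averaging code points — a quantitative operation applied to a qualitative domain, which is exactly why its answers are useless:

```python
# "Midpoint" of two letters via their Unicode code points:
# arithmetically well-defined, semantically meaningless.
def letter_midpoint(a: str, b: str) -> str:
    return chr((ord(a) + ord(b)) // 2)

print(letter_midpoint("A", "E"))  # "C": a valid average, a useless substitute
print(letter_midpoint("R", "B"))  # "J": the "ligature" of R and B, per the number line
```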

A musician trying to choose between Berlin, Munich and Cologne for a tour stop might triangulate the answer on a map and end up in the quaint Thuringian village of Lengfeld. For all its bucolic charm, in this case interpolation might defeat the purpose of attracting a large audience in a big city.
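The triangulation is easy to reproduce. Using approximate coordinates (and ignoring the Earth's curvature, which is harmless at this scale), the flat centroid of the three cities lands in rural southern Thuringia:

```python
# Approximate (lat, lon) coordinates of the three candidate tour stops.
cities = {
    "Berlin":  (52.52, 13.41),
    "Munich":  (48.14, 11.58),
    "Cologne": (50.94,  6.96),
}

# Flat centroid: a perfectly sensible interpolation of three quantities...
lat = sum(c[0] for c in cities.values()) / len(cities)
lon = sum(c[1] for c in cities.values()) / len(cities)

# ...that answers the wrong question for a choice among three qualities.
print(round(lat, 2), round(lon, 2))  # ~50.53 N, 10.65 E
```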

This quandary of applying a quantitative transformation to a quality comes to the fore when turning text into images. While large language models do quite well at producing syntactically and semantically correct text (though they tend to struggle with veracity), LLM-based image generators like DALL-E 3.5 (used for the examples here) struggle on a much more obvious level, similar to the chessboard examples in the first installment.

Unlike the chess example, where the image generator had no problem imagining a fox playing chess against a badger despite the incongruity of the scenario, and where the poorly rendered chessboard was merely incidental to the main image, here the prompted word itself forms the figure (all prompts simply asked for “a sign that says _____”).

Somewhere in the process, likely where the LLM interprets the textual prompt, the generator added context that often produced remarkable visualizations of the sign itself: richly detailed, with few exceptions perfectly plausible, and often even fitting the meaning of the prompted word, no matter how abstract.

And then in almost all cases it failed to render the prompted word itself.

With few exceptions, the words (drawn from an old GRE vocabulary list) came back with simple spelling mistakes, with letter duplications, omissions, and swaps the most frequent alterations. Less frequently, the renderings also included homegrown ligatures.
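Duplications, omissions, and swaps are precisely the edit operations counted by the (restricted) Damerau–Levenshtein distance, which gives one way to quantify how far a rendering strays from the prompt. A minimal sketch, with hypothetical word pairs standing in for the actual renderings:

```python
# Restricted Damerau-Levenshtein (optimal string alignment) distance:
# counts insertions, deletions, substitutions, and adjacent swaps.
def osa_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[m][n]

# Hypothetical misrenderings, each one edit away from the prompt:
print(osa_distance("perfidious", "perfidiuos"))   # 1 (swap)
print(osa_distance("perfidious", "perfidous"))    # 1 (omission)
print(osa_distance("perfidious", "perfiddious"))  # 1 (duplication)
```

That the distances stay small is what keeps the misspelled signs guessable rather than arbitrary.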

Arguably, if confronted with the four images produced by DALL-E 3.5 via Bing Copilot, in most cases the viewer should be able to guess what the originally prompted word was, so we are more in “uncanny valley” territory than in any situation where the output is arbitrary. But even though AI-based language and image generators have made great strides at astounding speed over the last year, combinatorial problems continue to flummox even the best models.

As someone who got his start in machine learning in the early 1990s trying to implement combinatorial optimization on then-primitive neural networks, I have to admit I am not surprised.
