GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.

Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another.
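To make the training objective concrete, below is a minimal sketch of an autoregressive transformer over a single concatenated text-and-image token stream, not OpenAI's actual implementation. The vocabulary sizes, the 256/1024 text-to-image token split, the model dimensions, and all names (`TinyAutoregressiveTransformer`, `text_tokens`, `image_tokens`) are illustrative assumptions; PyTorch is assumed as the framework.

```python
# Hedged sketch of the objective described above: text and image tokens are
# concatenated into one stream of up to 1280 tokens, and a causally masked
# transformer is trained by maximum likelihood to predict each token from
# its prefix. All sizes and names here are assumptions, not DALL·E's.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192     # assumed vocabulary sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB          # one shared vocabulary for the stream
MAX_LEN = 1280                            # text + image tokens, per the post
D_MODEL, N_HEAD, N_LAYER = 256, 8, 4      # toy model dimensions

class TinyAutoregressiveTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEAD, dim_feedforward=4 * D_MODEL, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        b, t = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(t, device=tokens.device))
        # Causal mask: each position attends only to earlier positions, so
        # the model generates the tokens one after another at sampling time.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))

# Text and image tokens form a single stream; image tokens are offset so
# the two vocabularies do not collide. The 256/1024 split is an assumption.
text_tokens = torch.randint(0, TEXT_VOCAB, (2, 256))
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 1024)) + TEXT_VOCAB
stream = torch.cat([text_tokens, image_tokens], dim=1)  # (batch, 1280)

model = TinyAutoregressiveTransformer()
logits = model(stream)
# Maximum likelihood: cross-entropy of each token given all preceding ones.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), stream[:, 1:].reshape(-1))
loss.backward()
```

Because the text tokens come first in the stream, conditioning on a caption at sampling time amounts to fixing the text prefix and letting the model generate the remaining image tokens one at a time.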