AI Digital Painting: When to Move from AIGC to AIAD?
The Evolution of AI Digital Painting Technology
Over the past year or so, large models have driven visible breakthroughs in many fields, and AI-based image and text processing has gradually begun to change the lives of ordinary people. AI painting is a typical example: on the one hand we can feel the power of AI, and on the other we have to rethink our relationship with it.
Let’s first look at the history of this wave of AI painting technology development.
In January 2021, OpenAI announced DALL-E, kicking off this wave of text-to-image technology.
In October 2021, the open-source text-to-image tool Disco Diffusion was released, and a number of products based on it have emerged since.
In July 2022, DALL-E 2, OpenAI's online AI image-generation service, entered public beta.
In August 2022, Stability AI open-sourced Stable Diffusion, the most usable open-source model available and the foundation of many commercial products such as NovelAI. On October 18, Stability AI announced the completion of a $101 million seed round at a valuation of $1 billion.
Since the release of these important open-source models, low-threshold AI painting platforms built on top of them have repeatedly gone viral on social networks. Products such as Midjourney earlier in the year and the recently released Yijia AI have attracted many people with no technical background to take part in AI creation and discussion, and have become traffic magnets for many commercial channels.
Discussing AI painting and design at this point in time
Beyond curiosity and speculation, many practitioners in related industries have been discussing the impact of AIGC (Artificial Intelligence Generated Content) on various fields, including art creation and design. Since early AI models were better at generating imaginative, artistic works than realistic scenes, much of the discussion focused on questions like "Can AI creations be called art?" and "How can AI painting assist creative generation?"
At that time, the quality of AI-generated realistic work was still quite limited, so there was little discussion around fields related to environmental design (architecture, planning, landscape, urban design, etc.). With the rapid iteration of the models, however, barely a year later the latest models already perform well on realistic-style work, and previously criticized problems such as distorted figures have been solved one by one.
I first wanted to write this article in September; for various reasons it dragged on for two months, but fortunately nothing has changed so dramatically in that time as to make it old news. Here I would like to discuss what AI digital painting technology means for the environmental design industry at this point in time, what new workflows or ways of working can be implemented now, and when, given the current maturity of the technology, AIGC can move toward AIAD (Artificial Intelligence Aided Design).
Technical details about training models, etc.
I don't want to go too deep into technical details in this article, because I want to explore the general case rather than the performance of a specific model with specific parameters. So I will only note here what is technically relevant to my own experiments.
First of all, the model I use in most cases is Stable Diffusion deployed on Colab, early on without a GUI and more recently with a Web UI, which is considerably easier to use. If you want to try it, you can refer to the following:
AI Digital Painting stable-diffusion
Best & Easiest Way to Run stable-diffusion for Free (WebUI)
fast-stable-diffusion Github
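If you would rather call the model from a script than through the WebUI above, a minimal text2img sketch using the Hugging Face diffusers library looks roughly like this (this is an illustrative setup, not the exact notebook I use; the checkpoint name and parameters are common defaults rather than recommendations):

```python
# Minimal text2img sketch with diffusers (illustrative; not the WebUI setup linked above).
# In Colab, install first: !pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

# "runwayml/stable-diffusion-v1-5" is one commonly used public checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # Colab's free GPU is enough for 512x512 generation

image = pipe(
    "a waterfront city at dusk, aerial view, photorealistic",
    num_inference_steps=30,  # fewer steps = faster, slightly rougher results
    guidance_scale=7.5,      # how strongly the image should follow the prompt
).images[0]
image.save("out.png")
```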
Next, let me briefly explain why I use Stable Diffusion.
The main reason is that it is completely free, and with Colab you don't even need to spend your own compute, unlike Midjourney, NovelAI, and similar services, which impose limits or require payment.
Second, although it is a little harder to deploy than ready-to-use platforms, the learning cost is negligible for a productivity tool, and deployment is a one-time effort for long-term use.
Finally, although different models each have their advantages, and at the time of my testing Midjourney's output quality was probably preferable, that advantage mostly shows up when generating a small number of images that happen to match our needs without iterative debugging. This matters less in production: at this stage we do not expect the AI to produce finished work directly (environmental design does not deliver its results as single images anyway). What I value is the creativity and usable imagery that emerge from rapid iteration and modification, and for that, deploying your own model and generating dozens or hundreds of images in batches is probably the more practical approach.
How AI helps us design today
How does text2img alone work?
The first step in drawing with AI is of course text2img. As the name suggests, it generates an image from text: you input a text description (prompt) of a certain length, the AI first interprets the text through a semantic model, and then generates an image based on that interpretation.
The strength of this kind of generation is that it maximizes the creativity of the AI model. Often the person writing the prompt has no specific picture in mind at the start, but wants the AI to produce some unexpected results. Many models have official prompt-writing guidelines, but they basically cover three aspects: subject content, elements/features, and style. Because the expression is textual, very specific descriptions such as spatial orientation are hard for the AI to understand precisely; another reason is that, to ensure the plausibility and quality of the generated images, the text content is effectively resampled during generation, so not everything written will necessarily influence the result. Many prompts are therefore written not to describe everything in detail, but to let the AI know what you want through clearly oriented words. For example, no word in architecture-related prompts online is more magical than "Zaha": four letters are enough to generate fantastical buildings beyond your imagination.
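As a concrete illustration of the subject / elements / style structure, a prompt can simply be assembled from those three parts. The sketch below reuses the `pipe` object from the earlier diffusers example; the wording is my own assumption rather than any official guideline:

```python
# Assembling a prompt from the three aspects mentioned above (wording is illustrative).
subject  = "a futuristic high-rise district by the river"             # subject content
elements = "glass towers, elevated walkways, ferries, lush terraces"  # elements / features
style    = "in the style of Zaha Hadid, photorealistic, golden hour"  # style
prompt = ", ".join([subject, elements, style])

# `pipe` is the StableDiffusionPipeline from the earlier sketch.
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("prompt_test.png")
```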
Because of this, text generation is best used to find inspiration: input some vague intentions and elements for mass generation, then select a few decent results as references for design or presentation. This approach is in fact already used in other design fields such as wedding design and clothing design; environmental design simply demands more realism.
So what kind of images can AI generate now? If you browse a site like Lexica, you can see there are already quite a lot of good city-related works, though most of them lean toward CG art and are not very realistic. Some of these images already have quite good texture, such as a set of future-city images generated with reference to the examples on Lexica.
The results above show that AI at this stage already performs quite well in terms of realism: at a cursory glance the overall image does not look obviously wrong, although many flaws appear if you zoom in and look closely. This ability to be "basically non-contradictory" is important for generating realistic-looking images, because it means the spatial relationships, lighting, and shadows are essentially correct. Even if the details still look "wrong" under close inspection, most people do not examine images with a geometric eye. It is precisely this "basically non-contradictory" quality that opens up the possibilities of AI painting.
To test the AI's possibilities and creativity in more complex scene generation, I went on to try batch generation with "waterfront city" as the theme.
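A rough sketch of how such a batch can be produced: keep the prompt fixed and vary only the random seed, so each image explores a different composition (the prompt, counts, and file names below are illustrative, and `pipe` is the pipeline from the earlier sketch):

```python
# Batch text2img sketch: same prompt, different seeds, to explore compositions.
import torch

prompt = "a waterfront city, mixed-use blocks along the river, pedestrians, photorealistic"
for seed in range(24):  # a couple dozen candidates per run
    generator = torch.Generator("cuda").manual_seed(seed)  # fixing the seed makes a result reproducible
    image = pipe(prompt, generator=generator,
                 num_inference_steps=30, guidance_scale=7.5).images[0]
    image.save(f"waterfront_{seed:02d}.png")  # review the whole set afterwards and keep a few
```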
The compositions of the images generated from the same text in a batch are very diverse: although the overall style is similar, many types of scenes appear, which shows that detailed textual description only loosely constrains the images, while the randomness of generation can bring new possibilities in the early creative stage of a scheme. Conversely, the generated images are too inconsistent to read as a set: each image is random and unrelated to the others, which keeps pure AI generation from meeting the needs of a scheme and its downstream process.
Is img2img a better choice?
If a designer were to imagine an ideal way to interact with AI, it would most likely be to sketch a design and have the AI develop it into a complete scheme; after all, the traditional design process relies more on image information than on text. The img2img function in existing models can be explored for this purpose. Unlike text2img, it requires an additional input image for the AI to learn from before generating the result, so the text prompt and the input image together shape the final output.
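A minimal img2img sketch with diffusers, for reference (not the exact setup I used; the file name and the strength value are assumptions):

```python
# img2img sketch: an input image and a prompt jointly steer the generation.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe_i2i = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical export of a sketch or massing model.
init = Image.open("massing_model.png").convert("RGB").resize((768, 512))

result = pipe_i2i(
    prompt="riverside cultural center, realistic materials, overcast daylight",
    image=init,
    strength=0.6,       # closer to 0 keeps the input, closer to 1 nearly ignores it
    guidance_scale=7.5,
).images[0]
result.save("developed.png")
```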
After understanding the basic operation of this function, one thing we would like to try as designers is: can we input a hand-drawn sketch of the scheme or a massing model from modeling software and let the AI help us develop it further?
I tried using relatively simple massing models and analysis diagrams as input. While I did get some unexpected results, they seemed hard to combine with concrete needs in a production environment. Notably, the AI did not generate realistic renderings but rather a style similar to the diagrams themselves, apparently because it learns from every aspect of the input image equally.
We all understand that a picture consists of color elements (color, material, shading, lighting) and morphological elements (form, structure, spatial relationships). What we actually expected was for the AI to focus on the general formal relationships in the sketch and then flesh out the specific materials, scenery, and fine-grained style. However, diffusion does not seem to distinguish between color and structure: when we want it to learn the formal structure (i.e., the scheme), it tends to over-learn the color and under-learn the form, as if it were helping us redesign the scheme rather than render it.
Given these characteristics, the requirements on the input image are higher: as noted above, it should guide the AI in both color and structure. At first I thought of hand-drawn sketches to fix both, but I did not pursue this far given my own drawing level, and another problem is that painted sketches are easily learned by the AI into a style closer to cartoons or paintings. So I turned to transferring the colors of another image onto a white massing model to produce a composite, and then feeding the image back through several iterations to steer the direction of generation.
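The iterative form described here can be sketched as a simple loop that feeds each output back in as the next input while lowering the strength, so early rounds make large changes and later rounds only refine (the strength schedule and file names are assumptions; `pipe_i2i` is the img2img pipeline from the earlier sketch):

```python
# Iterative img2img sketch: reuse each output as the next input, reducing strength per round.
from PIL import Image

current = Image.open("white_model_with_colors.png").convert("RGB").resize((768, 512))
prompt = "waterfront pier district, realistic facades, evening light"

for strength in (0.6, 0.45, 0.3):  # assumed schedule: big changes first, refinement later
    current = pipe_i2i(prompt=prompt, image=current,
                       strength=strength, guidance_scale=7.5).images[0]
    # Inspect each round and stop, or adjust the prompt, once the direction looks right.
current.save("iterated.png")
```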
The final image does look like a rendering, but another flaw is still evident: the influence of the input image is global, while we often want the AI to develop only part of it. For example, in the image above I wanted a variety of variations for the key buildings on the pier, while the surrounding environment should not change much in form or layout. This requires a model that lets us set the learning intensity separately for different areas of the image.
Fortunately, the models now provide such a feature: the input image can be masked, and the intensity of the masked parts controlled separately. Midjourney already supports controlling the learning intensity directly through the transparency of the painted mask, though I have not used this control in depth yet, as it may require some iteration. Still, this feature can already cover some common needs in the design process.
In this example, the content generated inside the mask fits the surrounding environment well, even though effects such as glass reflections are still quite rough, and it illustrates the possibility of the process described above. Back when Stable Diffusion was first open-sourced, there were already Photoshop plugins that could generate partial results from a mask, a base image, and a text description directly in PS, and then combine multiple parts into a complete work. A workflow like this is worth promoting if we want AI to be useful in a formal production environment.
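For reference, a sketch of this masked workflow using the diffusers inpainting pipeline rather than a PS plugin (the checkpoint, file names, and mask convention are assumptions):

```python
# Inpainting sketch: only the white area of the mask is regenerated, the rest is kept.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe_inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

base = Image.open("pier_view.png").convert("RGB").resize((512, 512))
mask = Image.open("pier_mask.png").convert("RGB").resize((512, 512))  # white = redraw, black = keep

result = pipe_inpaint(
    prompt="landmark pavilion on the pier, glass and timber, matching the surrounding light",
    image=base,
    mask_image=mask,
    guidance_scale=7.5,
).images[0]
result.save("pier_inpainted.png")
```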
So is there a workflow that is relatively more mature and usable today?
In my recent experience, using AI to generate intent images from reference images is a relatively usable scenario. Given the drawbacks mentioned earlier (high randomness, weak learning of fine structure, and the need for the input image to carry both color and structure), we can directly input a small-scale scene (not a bird's-eye view) rendering or photo as the reference image and let the AI generate an intent image with a similar composition and color scheme.
When the input image itself is of high quality, the output is also quite good, with enough variation while staying close to the original composition and scene. This approach solves two problems: first, if you cannot find a sufficiently close image, you can use one with a similar composition to generate the image you want; second, if the image you want to use has copyright problems, you can use AI to generate a close substitute and avoid infringement.
The AI's "basically non-contradictory" ability can also help you turn an image with an outrageous tone into a realistic one that passes a first glance. Here I use a waterfront shopping street as an example: after adjusting the tone of the original image, adding elements by painting over it, and inverting it into a night scene, I fed the image in to generate different scenes.
A short summary
Much of the above comes from my own attempts; the parts I did not implement are reasonable speculation. To summarize the characteristics of current AI painting techniques and how we can exploit them:
The most prominent advantage of this generation of image models over the previous one (GAN-based models) is that they can already generate images that are "basically non-contradictory", which is a very important capability. Generating an image with a diffusion model now takes a few tens of seconds, whereas a modeling-and-rendering process would take tens or even hundreds of times longer. Given this capability and the models' increasing speed, the main scenario for AI in environmental design at this stage is generating a large number of images for inspiration or intent in a short time (roughly a hundred or more images in half an hour when generating in batches at a few tens of seconds each), and its relatively strong creativity (randomness) can bring designers more ideas.
Finer, more demanding generation is not impossible today, but as mentioned earlier it places high requirements on the input control conditions, and it may take designers some time to learn "how to talk to the AI". In my view, finer control is also the most important direction for the future development, and especially the industrialization, of AIGC technology, and design aids built on these well-performing models specifically for environmental designers may appear before long.
Finally, although the first two points are relatively optimistic, it is undeniable that environmental design works in three-dimensional space, while these AIGC models learn from two-dimensional images. Their results can be useful in parts of the work, but that does not change the fact that they cannot understand space. These tools can change the way we work to some extent, but they are not the source of fundamental change.
How will we move towards AIAD?
Since I have not studied most of these technologies in depth, what follows are just a few thoughts based on my intuitive understanding of them.
First, the obvious thought is that since the diffusion model works so well in 2D, it could be migrated to 3D and trained on 3D data. That would solve one of the fundamental problems mentioned earlier: if what is generated is not a picture but a 3D model, then a single model can be used to produce multiple renderings, guaranteeing that they all come from one scheme. There have in fact been some preliminary attempts in this direction, and judging by how quickly diffusion models have developed, usable models may not be far off.
Second, from the perspective of the environmental design profession, we do not actually need a particularly complex 3D generation algorithm. Generative design has been studied for many years and already has many mature solutions, with plenty of commercial products based on rules, on data, or on GAN models. Where AI drawing can help is in letting us see more possibilities in presentation: current scheme-generation products may ship with some quick-visualization features, but most still rely on the traditional "modeling-rendering" process, and neither the efficiency nor the results are satisfactory. For example, commercial atmosphere and crowd activity are not part of the scheme itself but matter a great deal to how it is presented; in the traditional process they are still handled through modeling or post-production, and AI drawing may play a better role in such scenarios in the future.
Finally, back to nearer-term considerations. The existing models mentioned above can in fact already meet some needs of the production process, but on the one hand there is a lack of guidance and education, so traditional designers may not know how to understand and use AI tools; on the other hand, tools built specifically for this field are not yet available. Many domestic tools based on Stable Diffusion have improved ease of use through preset styles and the like; for our field we could likewise tune preset styles, fine-tune models with more relevant images, and design interactions better suited to designers' daily work, so that this immediate step goes faster. The reality, however, is that the market of designers in a single industry is limited, so commercial developers may not be strongly motivated; designers should perhaps be more proactive in communicating and exploring, to be ready for the changes to come.
Although my conclusion above is that these image-generation models do not bring transformative change to the environmental design industry, technology is not developing only at this end of the spectrum. Quite a few technologies are gradually maturing in less visible areas, and we may not be that far from the changes of AIAD.