The conversation surrounding artificial intelligence (AI) and its role in art often vacillates between two polar extremes: one perceives AI as generating a deluge of low-quality “slop,” while the other fears it could lead to the wholesale automation of all creative work. However, a more constructive approach might involve viewing AI as a potential collaborator, enhancing human creativity rather than replacing it.
Yet, visual artists engaging with text-to-image tools have encountered significant challenges when trying to effectively direct AI’s creative output. For instance, asking an AI to create an image of a house is relatively straightforward. In contrast, directing it to produce a specific vision—a red house with four front-facing windows, a chimney, and ivy climbing up the left side—can prove to be an uphill battle. The lack of precision and predictability in current AI systems often frustrates artists who seek to translate detailed ideas into visual representations.
Experts in fields like computer science, cognitive psychology, and education at Stanford University believe they have found a pathway to improve AI’s ability to truly augment human creativity. Funded by a Hoffman-Yee Research Grant from the Stanford Institute for Human-Centered AI (HAI), these scholars aim to foster a shared conceptual grounding between humans and generative AI. Their goal is to streamline the collaborative process for producing high-quality visual content ranging from illustrations to animations.
“While the models seem amazing, they are terrible collaborators,” states Maneesh Agrawala, a Stanford professor of computer science and co-principal investigator for the project. He emphasizes that creators often find themselves in the dark regarding what the AI will produce in response to specific text prompts. For example, asking for a “suburban single-family home” could yield an unexpected and undesired outcome like a modern duplex.
Deciphering the Human Creative Process
The Stanford team is tackling this complex issue from two complementary avenues. First, they are conducting experiments to comprehend how humans effectively collaborate while creating visual content. They’ve analyzed numerous chat logs and sketches from creative tasks to glean insights into how individuals communicate when working together.
“If we want to build AI systems that understand how humans think during creative projects, we should start by learning as much as we can from the way that people establish common conceptual ground with each other,” says Judith Fan, an assistant professor of psychology at Stanford. She notes that while not everyone communicates in the same manner, there remains a fundamental expectation of understanding in collaboration.
Building AI Tools that Understand Creators
The second approach involves constructing open-source AI tools informed by research on human creative communication. For instance, the tool ControlNet teaches text-to-image diffusion models about layout and spatial composition. It supports two complementary steps, blocking and detailing, that echo the artistic process of roughing out a composition before refining the details. Current AI models often struggle to grasp arrangement and pose; ControlNet aims to bridge this gap by letting creators guide the AI toward their artistic vision.
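The team’s own code isn’t shown here, but a minimal sketch of the general technique, conditioning a diffusion model on a rough “blocking” sketch using the open-source diffusers library, might look like the following; the specific model checkpoints are illustrative assumptions rather than the researchers’ setup.

```python
# Minimal sketch: conditioning a text-to-image diffusion model on a rough
# "blocking" sketch via ControlNet, using Hugging Face diffusers.
# The checkpoints below are illustrative assumptions, not the Stanford team's code.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# A scribble-conditioned ControlNet learns to respect the layout of a rough sketch.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# "Blocking": a rough sketch fixes where the house, windows, and chimney go.
blocking_sketch = load_image("house_blocking_sketch.png")

# "Detailing": the text prompt fills in color, materials, and style within that layout.
image = pipe(
    "a red house with four front-facing windows, a chimney, and ivy on the left side",
    image=blocking_sketch,
    num_inference_steps=30,
).images[0]
image.save("red_house.png")
```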
Another innovation, FramePack, facilitates the generation of long, multi-scene videos from text prompts, enabling rich storytelling. The tool prioritizes the frames and scenes that matter most to the narrative, mirroring how a human storyteller would naturally allocate attention.
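FramePack’s actual implementation isn’t reproduced here; as a purely hypothetical sketch of the underlying idea, splitting a fixed context budget across scenes in proportion to their narrative importance, it might resemble the following (the function and weighting scheme are invented for illustration).

```python
# Hypothetical sketch of frame/scene prioritization: a fixed "context budget"
# is split across scenes in proportion to their narrative importance.
# This is NOT FramePack's API; names and numbers are illustrative only.
from dataclasses import dataclass

@dataclass
class Scene:
    description: str
    importance: float  # 0.0-1.0, how central the scene is to the story

def allocate_context(scenes: list[Scene], total_tokens: int) -> dict[str, int]:
    """Split a fixed token budget across scenes by narrative importance."""
    total_importance = sum(s.importance for s in scenes) or 1.0
    return {
        s.description: int(total_tokens * s.importance / total_importance)
        for s in scenes
    }

scenes = [
    Scene("establishing shot of the red house", 0.2),
    Scene("ivy creeping up the left wall in a storm", 0.5),
    Scene("closing shot at dusk", 0.3),
]
print(allocate_context(scenes, total_tokens=4096))
# -> the storm scene receives the largest share of the model's attention budget
```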
A third exploration revolves around neuro-symbolic AI, which melds neural networks with symbolic reasoning to improve transparency and mitigate the “black box” limitations of traditional AI models. The researchers are developing a visual scene coding language that converts natural language instructions into executable code, allowing creators to maintain oversight and make iterative adjustments as the AI generates 3D scenes.
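The team’s scene coding language isn’t detailed in this article; the snippet below is a hypothetical illustration of the general pattern, in which the AI emits a small, human-readable scene program that the creator can inspect and edit before it is rendered (all names and structures here are invented).

```python
# Hypothetical illustration of a "visual scene coding" pattern: the AI emits a
# small, editable program describing the scene instead of an opaque generation.
# Names and structure are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    kind: str                              # e.g., "house", "ivy"
    position: tuple[float, float, float]
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneProgram:
    objects: list[SceneObject] = field(default_factory=list)

    def add(self, obj: SceneObject) -> SceneObject:
        self.objects.append(obj)
        return obj

# What an AI might emit for "a red house with ivy climbing the left side":
scene = SceneProgram()
house = scene.add(SceneObject("house", (0, 0, 0),
                              {"color": "red", "windows": 4, "chimney": True}))
ivy = scene.add(SceneObject("ivy", (-2.0, 0, 0),
                            {"attached_to": "house", "side": "left"}))

# Because the scene is ordinary code, the creator can adjust it directly
# instead of re-prompting, e.g., changing the window count:
house.attributes["windows"] = 3
```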
Reimagining Educational Content
The promise of establishing a shared conceptual framework between humans and AI could yield transformative applications in various fields such as design, simulation, animation, robotics, and education. Agrawala notes that the Stanford team is collaborating with the gaming platform Roblox to empower players to create unique 3D objects from text prompts while adhering to built-in game constraints. This approach ensures that creators can engage in imaginative endeavors without inadvertently producing content that disrupts the gaming experience, such as weapons in nonviolent scenarios.
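Roblox’s creation pipeline isn’t described here in detail; a hypothetical sketch of constraint checking, in which generated content is validated against a game’s rules before it enters the world, could look like this (the tags, limits, and function are invented for illustration).

```python
# Hypothetical sketch: validating AI-generated objects against built-in game
# constraints before they are added to the world. The tags, rules, and
# function names are invented for illustration; this is not Roblox's API.

FORBIDDEN_TAGS = {"weapon", "explosive"}   # e.g., disallowed in a nonviolent game
MAX_SIZE = 50.0                            # arbitrary world-unit size cap

def is_allowed(generated_object: dict) -> bool:
    """Reject generated content that violates the game's constraints."""
    if FORBIDDEN_TAGS & set(generated_object.get("tags", [])):
        return False
    if generated_object.get("size", 0.0) > MAX_SIZE:
        return False
    return True

proposal = {"name": "ivy-covered cottage", "tags": ["building", "decor"], "size": 12.0}
if is_allowed(proposal):
    print(f"Adding {proposal['name']} to the scene")
else:
    print("Generation rejected: violates game constraints")
```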
Beyond gaming, the researchers envision a future where individuals across skill levels—from casual hobbyists to professional visual artists—can seamlessly articulate their ideas using a blend of natural language, example content, code snippets, and other expressive modalities. It’s an exciting leap towards democratizing creativity through collaboration with AI.
“We’re serious about equipping the broader creative community with the tools they need to communicate with AI effectively,” adds Fan.
Interested in diving deeper? Watch the research team discuss their findings from the recent Hoffman-Yee Symposium at Stanford HAI.


