Title: Multiplayer Modular Protocols for Generative Aesthetics

Team member names: Case Miller, Darren Zhu

Summary:

The first mainstream generative AI art product – Midjourney – was built on top of a seemingly unlikely user interface: Discord. However, the frenetic, collective nature of Discord chat as the venue for text-to-image prompting enabled both rapid human-to-human feedback (in the form of shared prompts and images) and reinforcement learning from human feedback (in the form of user-selected upscaled images serving as preference labels).

Yet, Midjourney’s reliance on textual language as the dominant interface for generating images has limited its tunability; DALL-E and other centralized generative image and video tools (e.g. Runway, Pika, etc) make similar trade-offs favoring usability over customizability. As a result, a distinct monistic genAI aesthetic has begun percolating across online content, finding itself everywhere from boomer Facebook pages to peer-reviewed scientific figures.

On the other end of the spectrum has been the emergence of ComfyUI, an open-source node-based editor for building highly bespoke generative image and video workflows. ComfyUI allows easy fine-tuning of models (usually Stable Diffusion-based) with custom-built image LoRAs (Low-Rank Adaptations). In addition, decentralized (and often unaffiliated) sets of researchers, developers, and artists continuously contribute new nodes based on the latest papers and demos – these span new animation and motion modules (e.g. AnimateDiff), spatial conditioning controls (e.g. ControlNets), real-time samplers (e.g. LCM), and many other emerging capabilities.
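
To make the modularity concrete: a workflow is essentially a small directed graph of typed nodes whose outputs feed downstream inputs, and which gets serialized to JSON for sharing. The sketch below is an illustrative, simplified TypeScript model of such a graph – the node types and field names are our own shorthand, not ComfyUI's exact export schema.

```typescript
// Illustrative sketch of a node-graph workflow: a checkpoint feeds a LoRA,
// which conditions a sampler. Names and fields are simplified stand-ins,
// not ComfyUI's actual export schema.
type NodeID = string;

interface WorkflowNode {
  type: string;                            // e.g. "CheckpointLoader", "LoraLoader", "KSampler"
  params: Record<string, string | number>; // low-level knobs a user would otherwise hand-tune
  inputs: Record<string, NodeID>;          // edges: which upstream node feeds each input slot
}

type Workflow = Record<NodeID, WorkflowNode>;

// A minimal text-to-image chain with one LoRA applied.
const workflow: Workflow = {
  "1": { type: "CheckpointLoader", params: { ckpt: "sd_xl_base_1.0" }, inputs: {} },
  "2": { type: "LoraLoader", params: { lora: "my_style_lora", strength: 0.8 }, inputs: { model: "1" } },
  "3": { type: "CLIPTextEncode", params: { text: "a foggy harbor at dawn" }, inputs: { clip: "2" } },
  "4": { type: "KSampler", params: { steps: 20, cfg: 7, seed: 42 }, inputs: { model: "2", positive: "3" } },
};

// Today, collaboration usually means serializing this graph and sending the file around.
console.log(JSON.stringify(workflow, null, 2));
```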

However, these nodes often face interoperability and reproducibility challenges across different workflows and hardware. Furthermore, sharing and collaborating on workflows typically involves exporting a JSON file containing the full node configuration, sending it around, and re-loading it by hand. Most workflows run too slowly on consumer-grade GPUs, and the products for running ComfyUI in the cloud remain relatively underdeveloped.

Thus, we propose a prototype that transforms ComfyUI into a real-time, cloud-based multiplayer tool – one that ideally combines the usability and shareability of text-based prompting in centralized tools like Midjourney with the customizability and modularity of ComfyUI’s existing node ecosystem. Multiplayer collaborative frameworks like yjs, partykit.io, reflect.net, flume.dev, etc. make it much easier to build web apps on top of conflict-free replicated data types (CRDTs). LLMs can be layered on top of each individual node to relieve the end user of having to understand and hand-tune every laborious low-level parameter, while still providing more granular control than one all-encompassing text prompt.
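
As a rough sense of what the collaboration layer could look like, the sketch below keeps the node graph in a shared yjs CRDT document so that concurrent edits from different collaborators merge automatically, instead of being passed around as JSON files. The WebSocket endpoint and room name are placeholders, and the node shape mirrors the illustrative one from the earlier sketch.

```typescript
// Minimal sketch of a multiplayer workflow graph backed by a yjs CRDT.
// The server URL and room name below are placeholders, not a real deployment.
import * as Y from "yjs";
import { WebsocketProvider } from "y-websocket";

const doc = new Y.Doc();

// Sync the document with collaborators through a relay (placeholder endpoint).
const provider = new WebsocketProvider("wss://example.invalid/collab", "workflow-room", doc);

// Each node lives in a shared map keyed by node ID; concurrent edits to
// different nodes (or different fields of the same node) merge automatically.
const nodes = doc.getMap<Y.Map<unknown>>("nodes");

function upsertNode(id: string, type: string, params: Record<string, unknown>) {
  doc.transact(() => {
    let node = nodes.get(id);
    if (!node) {
      node = new Y.Map<unknown>();
      nodes.set(id, node);
    }
    node.set("type", type);
    node.set("params", params);
  });
}

// One collaborator tweaks the sampler while another adjusts a LoRA strength;
// both edits converge on every peer without overwriting each other.
upsertNode("4", "KSampler", { steps: 30, cfg: 6.5, seed: 42 });

// React to remote changes, e.g. to re-render the canvas or re-queue a preview.
nodes.observeDeep(() => {
  console.log("workflow updated:", nodes.toJSON());
});
```

A per-node LLM assistant would simply be another writer into this same shared map – proposing parameter values for a single node from a short natural-language hint – which is what we mean by layering LLMs on top of individual nodes rather than on one all-encompassing prompt.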

We hope that such a tool can enable a pluralistic aesthetic ecosystem to flourish: one that subverts the current individualized premium mediocrity of generative AI in favor of high-variance, anti-mimetic, collaborative creativity.

Q&A:

1. What is the existing target protocol you are hoping to improve or enhance?

Most broadly – generative aesthetics mediated by emerging tools; more specifically – ComfyUI (an open-source, generative node-based editor)

2. What is the core idea or insight about potential improvement you want to pursue?

ComfyUI benefits from the customizability and modularity of being an open-source node-based editor that is primarily run locally, but it lacks the usability and shareability of much simpler, cloud-hosted text-prompt tools (e.g. Midjourney, Runway)

3. What is your discovery methodology for investigating the current state of the target protocol?

Mix of a) expert interviews (we are both personally close with folks who are building Midjourney, as well as with various artists, filmmakers, and studios using these tools) and b) sentiment analysis / text scraping of Discord and Reddit communities that generate and share generative image and video outputs and workflows
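
As a rough illustration of the scraping side, the sketch below pulls recent posts from a public subreddit's JSON listing and counts mentions of workflow-sharing pain points; the subreddit name and keywords are just examples, and Discord would instead require a bot with the relevant server permissions.

```typescript
// Rough sketch: fetch recent posts from a public subreddit JSON listing
// for lightweight sentiment/keyword analysis. Subreddit and keywords are examples.
interface RedditPost {
  title: string;
  selftext: string;
  score: number;
}

async function fetchRecentPosts(subreddit: string, limit = 50): Promise<RedditPost[]> {
  const res = await fetch(
    `https://www.reddit.com/r/${subreddit}/new.json?limit=${limit}`,
    { headers: { "User-Agent": "workflow-research-script/0.1" } }
  );
  const json = await res.json();
  // Reddit's listing wraps each post in { data: {...} }.
  return json.data.children.map((c: any) => ({
    title: c.data.title,
    selftext: c.data.selftext,
    score: c.data.score,
  }));
}

async function main() {
  const posts = await fetchRecentPosts("comfyui");
  const mentions = posts.filter((p) =>
    /workflow|json|share|collab/i.test(`${p.title} ${p.selftext}`)
  );
  console.log(`${mentions.length}/${posts.length} recent posts mention workflow sharing`);
}

main();
```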

4. In what form will you prototype your improvement idea? Eg: Code, reference design implementation, draft proposal shared with experts for feedback, A/B test of ideas with a test audience, prototype hardware, etc.

  • May: Report that analyzes 1) existing node-based editors (e.g. Unreal, TouchDesigner, Nuke, etc), 2) multiplayer visual tools (Figma, Miro, Framer, etc), and 3) generative image and video tools (Midjourney, Runway, Leonardo, etc)
  • June: Wireframe several versions of real-time multiplayer ComfyUI tool
  • July - August: Prototype real-time multiplayer ComfyUI tool

5. How will you field-test your improvement idea? Eg: run a restricted pilot at an event, simulation, workshop, etc.

  • May - June: Share report and wireframes with different studios and online generative art communities
  • July - August: Run small pilots and jam sessions with the tool

6. Who will be able to judge the quality of your output? Ideally name a few suitable judges.

  • Members of the Banodoco discord (e.g. @POM)
  • ComfyUI / AnimateDiff communities (Yuwei Guo, Ceyuan Yang)
  • Stable Diffusion researchers (Robin Rombach, Patrick Esser)
  • @cerspence (creator of the first high-quality text-to-video model, Zeroscope)
  • Founders of various centralized generative image and video platforms (Midjourney, DALL-E, Pika, Runway, Ideogram, etc)

7. How will you publish and evangelize your improvement idea?

Share our report, open-source code, and tool with both IRL artists/filmmakers/studios and Reddit / Discord communities where we have existing strong ties

8. What is the success vision for your idea?

Short-term – sharing / discussion of our report and adoption of the tool (particularly for more elaborate or emergent workflows that require multiplayer collaboration); long-term – a kind of aesthetic Turing test, where generative media has become so heterogeneous that it is no longer immediately identifiable as AI stock imagery or video
