
Spatial reasoning
Understand the scene before touching a single pixel.

JoyAI-Image-Edit / JD Open Source
JoyAI-Image-Edit unifies image understanding and generation so that object movement, camera control, long-text rendering, and scene consistency can all be handled as grounded transformations.
Current status
The product surface is not open yet. Join the waiting list to get first access when the editing experience is ready.
- 8B multimodal reasoning model
- 16B diffusion transformer for pixel-level generation
- Unified interface for understanding, generation, and editing
Project
Checkpoint
License
Core architecture
JoyAI-Image couples an 8B multimodal language model with a 16B multimodal diffusion transformer. The shared interface keeps spatial intent attached as the pipeline moves from instruction parsing to image manipulation.
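To make the hand-off between the two stages concrete, here is a minimal, hypothetical sketch of that shape of pipeline. Every name in it (`EditOp`, `parse_instruction`, `apply_ops`, the toy keyword parsing) is an assumption for illustration only, not the JoyAI-Image API: the reasoning model is stood in for by a toy parser that decomposes an instruction into grounded operations, and the diffusion stage by a stub that consumes them.

```python
from dataclasses import dataclass, field

# Hypothetical illustration only -- none of these names come from the
# actual JoyAI-Image interface. The point is the shape of the hand-off:
# instruction -> grounded edit operations -> pixel-level stage.

@dataclass
class EditOp:
    """One grounded spatial operation passed between the two stages."""
    kind: str                       # e.g. "move", "rotate", "camera"
    target: str                     # object or region the op is grounded to
    params: dict = field(default_factory=dict)

def parse_instruction(instruction: str) -> list[EditOp]:
    """Toy stand-in for the 8B reasoning stage: decompose an
    instruction into region-grounded operations."""
    text = instruction.lower()
    ops = []
    if "move" in text:
        ops.append(EditOp("move", target="cup", params={"dx": 40, "dy": 0}))
    if "rotate" in text:
        ops.append(EditOp("rotate", target="cup", params={"degrees": 90}))
    return ops

def apply_ops(ops: list[EditOp]) -> list[str]:
    """Toy stand-in for the 16B generation stage: acknowledge each op
    instead of actually rendering pixels."""
    return [f"{op.kind}:{op.target}" for op in ops]

ops = parse_instruction("Move the cup to the right, then rotate it.")
print(apply_ops(ops))
```

The design point the sketch tries to capture is that spatial intent survives the hand-off as structured data (kind, target, parameters) rather than being flattened back into a free-text prompt.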

Capabilities

Spatial reasoning
Instruction decomposition, relational grounding, and region-level understanding let edits follow the actual structure of the image.

Spatial editing
Object movement, object rotation, and camera control are treated as grounded spatial operations, not vague style prompts.
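"Grounded" here means an edit is expressed against an explicit region rather than as a style prompt. A minimal sketch of what that implies geometrically, with purely illustrative helpers (box format and function names are assumptions, not the JoyAI-Image-Edit API):

```python
import math

# Illustrative only: a grounded "object move" is a transform of an
# explicit region, here an (x0, y0, x1, y1) box in pixel coordinates.

def move_box(box, dx, dy):
    """Translate a region by (dx, dy) -- the geometric core of a
    grounded object-move operation."""
    x0, y0, x1, y1 = box
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)

def rotate_point(p, center, degrees):
    """Rotate a point about a center; used to re-ground a region's
    corners after an object-rotation operation."""
    theta = math.radians(degrees)
    x, y = p[0] - center[0], p[1] - center[1]
    return (center[0] + x * math.cos(theta) - y * math.sin(theta),
            center[1] + x * math.sin(theta) + y * math.cos(theta))

print(move_box((10, 10, 50, 50), dx=40, dy=0))  # region shifted right
```

Because the operation carries explicit coordinates, a follow-up instruction ("now rotate it") can refer to the same region instead of re-describing it in words.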

Long-text rendering
JoyAI-Image-Edit is designed for multilingual text, structured poster layouts, and scenes where typography is part of the composition.
System strengths
The interface stays quiet on purpose. The proof comes from the research visuals, the architecture, and the operational claims.
Awakened spatial intelligence
Scene parsing and spatial relations stay attached to the edit pipeline, so language can refer to structure precisely.
Unified understanding + generation
An 8B MLLM and 16B MMDiT operate through one shared interface instead of behaving like disconnected tools.
Controllable visual transformations
Camera movement, object manipulation, and multi-view consistency are built into the model family from the start.
Research-driven training data
OpenSpatial, SpatialEdit, and long-text rendering data push the model toward grounded image editing rather than generic generation.
Training recipe
OpenSpatial supports scene reasoning, SpatialEdit sharpens instruction-guided manipulation, and long-text rendering data increases layout fidelity in image generation and editing tasks.

| Model | Primary task | Status |
|---|---|---|
| JoyAI-Image-Und | Multimodal understanding | Released |
| JoyAI-Image-Edit | Instruction-guided editing | Released |
| JoyAI-Image-Edit-Plus | Multi-image editing | Upcoming |
| JoyAI-Image | Text-to-image generation | Upcoming |
Waiting list
This site is meant to be a clear product signal, not a half-ready editor. Leave your email and we will contact you when the experience is available.
Request access