The Complete Stable Diffusion Guide: Installation, Models, ControlNet & Best Practices

Everything you need to master the most powerful open-source AI image generator. From first install to advanced ControlNet workflows, LoRAs, and professional-quality outputs.

Updated Mar 2026 · 25 min read · Beginner to Advanced

What is Stable Diffusion?

Stable Diffusion is a free, open-source AI image generation model created by Stability AI. It converts text prompts into highly detailed images using a latent diffusion process. Unlike cloud-based services such as Midjourney or DALL-E, Stable Diffusion runs locally on your own hardware, giving you unlimited generations, complete privacy, full model customization through LoRAs and embeddings, and zero ongoing subscription costs. It is the most flexible and extensible AI art tool available today.

Stable Diffusion is a deep learning model built on a technique called latent diffusion. Rather than working directly with full-resolution pixel data, it compresses images into a smaller latent space, applies the diffusion process there, and then decodes the result back to pixels. This architectural decision is what makes it possible to run on consumer-grade GPUs rather than requiring massive data center hardware.

The model was originally developed by CompVis at Ludwig Maximilian University of Munich, in collaboration with Stability AI and Runway. Since its public release in August 2022, Stable Diffusion has grown into the largest open-source AI art ecosystem in the world, with thousands of community-created models, extensions, and tools built around it.

What sets Stable Diffusion apart from competitors is its open-source nature. You can inspect the code, fine-tune the model on your own data, create custom LoRAs for specific styles or characters, and integrate it into production pipelines. The community has built an extraordinary ecosystem of fine-tuned checkpoints, each optimized for different aesthetic styles ranging from photorealism to anime to concept art.

How Latent Diffusion Works

The generation process starts with pure random noise in the latent space. A neural network called the U-Net is trained to predict and remove noise from this latent representation, guided by your text prompt, which has been encoded by a CLIP text encoder. Through a series of iterative denoising steps (typically 20-50), the model gradually transforms noise into a coherent image that matches your description. A VAE (Variational Autoencoder) then decodes the final latent representation into a full-resolution image.

Understanding this pipeline matters because each component can be swapped, tuned, or extended. Different VAEs produce different color profiles, different text encoders handle prompts differently, and the sampling method you choose affects both quality and generation speed.
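
To make the iterative idea concrete, here is a toy denoising loop in Python. It is purely illustrative: an oracle noise predictor stands in for the U-Net, and a single float stands in for the latent tensor, but the shape of the loop mirrors the real process of removing a scheduled fraction of the predicted noise at each step.

```python
import random

def toy_denoise(target: float, steps: int = 30, seed: int = 0) -> float:
    """Toy sketch of iterative denoising. A 'perfect' noise predictor
    removes a scheduled fraction of the remaining noise each step; the
    real model predicts noise with a U-Net operating on latent tensors."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)               # pure noise, like the initial latent
    for step in range(steps, 0, -1):
        predicted_noise = x - target       # oracle predictor, for illustration
        x = x - predicted_noise / step     # remove a scheduled fraction
    return x                               # a VAE would decode this to pixels
```

With a perfect predictor the loop converges exactly on the target; the real model only approximates the noise at each step, which is why step count and sampler choice affect quality.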

Installation: AUTOMATIC1111 & ComfyUI

Two interfaces dominate the Stable Diffusion ecosystem. Each serves a different kind of user, and many experienced artists use both depending on the task at hand.

AUTOMATIC1111 Web UI

AUTOMATIC1111 (commonly called A1111) is the most popular Stable Diffusion interface. It provides a traditional web-based UI with familiar form controls: text fields for prompts, sliders for settings, and dropdown menus for model selection. If you have used any web application before, you will feel at home with A1111.

A1111 Installation Steps

  1. Install Python 3.10.x from python.org. Avoid Python 3.12+ as some dependencies may have compatibility issues.
  2. Install Git from git-scm.com if not already present on your system.
  3. Clone the repository: Open a terminal and run git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
  4. Download a model checkpoint (such as SD 1.5 or SDXL) and place the .safetensors file in the models/Stable-diffusion/ folder.
  5. Launch the UI by running webui-user.bat (Windows) or webui.sh (macOS/Linux). The first launch will download remaining dependencies automatically.
  6. Open your browser to http://127.0.0.1:7860 and begin generating.

A1111 shines for its massive extension ecosystem. The built-in Extensions tab lets you install ControlNet, ADetailer (automatic face fixing), regional prompting, and hundreds of other tools with a single click. For users who want a straightforward path to generating images with advanced features, A1111 is the recommended starting point.
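
Beyond the browser interface, A1111 exposes a REST API when launched with the --api flag (add it to COMMANDLINE_ARGS in webui-user.bat, or pass it to webui.sh). A minimal sketch of calling the /sdapi/v1/txt2img endpoint follows; the payload fields shown are the commonly used ones, and your installation's /docs page lists the full schema:

```python
import base64
import json
from urllib import request

API_URL = "http://127.0.0.1:7860"  # default A1111 address; launch with --api

def build_txt2img_payload(prompt, negative="", steps=25,
                          width=512, height=512, cfg_scale=7.0):
    """Assemble the JSON body for A1111's /sdapi/v1/txt2img endpoint."""
    return {
        "prompt": prompt,
        "negative_prompt": negative,
        "steps": steps,
        "width": width,
        "height": height,
        "cfg_scale": cfg_scale,
    }

def generate(prompt, **kwargs):
    """POST the payload and decode the base64-encoded images that come back."""
    payload = build_txt2img_payload(prompt, **kwargs)
    req = request.Request(f"{API_URL}/sdapi/v1/txt2img",
                          data=json.dumps(payload).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        result = json.load(resp)
    return [base64.b64decode(img) for img in result["images"]]
```

This is the same API that powers many third-party frontends and automation scripts, which is part of why A1111 integrates so easily into larger workflows.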

ComfyUI

ComfyUI takes a fundamentally different approach. Instead of a traditional interface, it presents a node-based workflow canvas where you visually connect processing blocks. Each node represents a step in the pipeline: loading a model, encoding a prompt, running the sampler, decoding the latent, and saving the image. You wire these together by dragging connections between nodes.

ComfyUI Installation Steps

  1. Download ComfyUI from the GitHub releases page. Windows users can download the portable package which includes Python and all dependencies bundled together.
  2. Extract the archive to your preferred location. No separate Python installation is required for the portable version.
  3. Place model checkpoints in the models/checkpoints/ folder, LoRAs in models/loras/, and VAEs in models/vae/.
  4. Run run_nvidia_gpu.bat (Windows) or python main.py (macOS/Linux).
  5. Open your browser to http://127.0.0.1:8188 to access the node editor.

ComfyUI's node-based approach is more complex to learn but offers significant advantages. You can build reusable workflows, share them as JSON files, and achieve precise control over every step of the pipeline. ComfyUI also uses less VRAM than A1111 for equivalent operations, making it the better choice for GPUs with limited memory. Many professional and production workflows now run on ComfyUI.
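
Those shareable JSON workflows can also be queued programmatically: ComfyUI serves an HTTP API on the same port, and a graph exported with the API-format save option (available once dev mode is enabled in the settings) can be posted to its /prompt endpoint. A minimal sketch, with error handling omitted:

```python
import json
from urllib import request

COMFY_URL = "http://127.0.0.1:8188"  # default ComfyUI address

def load_workflow(path: str) -> dict:
    """Load a workflow graph exported in API format."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def build_queue_body(workflow: dict) -> bytes:
    """ComfyUI's /prompt endpoint expects the graph under the 'prompt' key."""
    return json.dumps({"prompt": workflow}).encode("utf-8")

def queue_workflow(workflow: dict) -> dict:
    """Queue the workflow for execution; the response includes a prompt_id."""
    req = request.Request(f"{COMFY_URL}/prompt",
                          data=build_queue_body(workflow),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

This is how ComfyUI ends up in production pipelines: the same graph you build visually becomes a batch job with a few lines of glue code.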

If you are brand new to Stable Diffusion, start with AUTOMATIC1111. Once you understand the fundamentals of prompting, models, and samplers, consider transitioning to ComfyUI for more advanced work.

Models: SD 1.5, SDXL & SD3

Stable Diffusion is not a single model but a family of model architectures, each with its own strengths and hardware requirements. Understanding the differences between them helps you choose the right base for your projects.

Stable Diffusion 1.5

SD 1.5 remains the most widely used base model thanks to its low hardware requirements and the enormous library of fine-tuned checkpoints built on top of it. It generates images at 512x512 pixels natively and can be upscaled afterward. Thousands of community checkpoints on CivitAI and Hugging Face are based on SD 1.5, covering every conceivable style from hyperrealistic photography to stylized anime.

SD 1.5 requires as little as 4GB of VRAM, making it accessible to users with older or budget GPUs. The trade-off is that its native resolution is relatively low, and it can struggle with complex compositions, accurate human anatomy, and detailed text rendering within images.

Stable Diffusion XL (SDXL)

SDXL represents a major leap in quality. It generates images at 1024x1024 pixels natively, produces dramatically better composition and anatomy, and handles complex prompts with greater fidelity. SDXL uses a dual text encoder system (CLIP ViT-L and OpenCLIP ViT-bigG), which gives it a deeper understanding of prompt semantics.

The cost is higher hardware requirements: you need at least 8GB of VRAM for comfortable generation, with 12GB recommended. SDXL also introduced a refiner model that can be run as a second pass to enhance fine details, though many fine-tuned SDXL checkpoints produce excellent results without it.

Stable Diffusion 3 (SD3)

SD3 is the latest architecture from Stability AI, built on a Multimodal Diffusion Transformer (MMDiT) rather than the U-Net used in previous versions. This transformer-based architecture brings superior text rendering capabilities, better prompt adherence, and improved handling of spatial relationships between objects.

SD3 comes in multiple sizes. SD3 Medium is suitable for consumer hardware with 12GB+ VRAM, while larger variants require professional GPU setups. The community fine-tune ecosystem for SD3 is growing but is still smaller than what exists for SD 1.5 and SDXL.

Feature              SD 1.5       SDXL              SD3
Native Resolution    512x512      1024x1024         1024x1024
Min VRAM             4 GB         8 GB              12 GB
Architecture         U-Net        U-Net (larger)    MMDiT (Transformer)
Text Rendering       Poor         Moderate          Strong
Community Models     Thousands    Hundreds          Growing
Prompt Adherence     Good         Very Good         Excellent
Speed                Fast         Moderate          Moderate
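
In practice the choice usually reduces to a decision on available VRAM. A small helper captures that rule of thumb; the thresholds are approximate, taken from the minimums above, and assume NVIDIA hardware:

```python
def recommend_model(vram_gb: float) -> str:
    """Rough base-model guidance by available VRAM.
    Thresholds are approximate rules of thumb, not hard limits."""
    if vram_gb >= 12:
        return "SD3 Medium or SDXL"   # enough headroom for either
    if vram_gb >= 8:
        return "SDXL"                 # native 1024x1024 generation
    if vram_gb >= 4:
        return "SD 1.5"               # runs on older and budget GPUs
    return "cloud GPU recommended"    # below practical local minimums
```

Community fine-tune availability cuts the other way: a 12 GB card can still be a good reason to run SD 1.5 when a specific checkpoint or LoRA you need only exists for that architecture.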

LoRA & Embeddings

One of Stable Diffusion's most powerful features is the ability to extend base models with lightweight add-ons that teach the model new concepts without replacing or retraining the entire checkpoint.

LoRAs (Low-Rank Adaptations)

LoRAs are small files (typically 10-200 MB) that modify a subset of the model's weights to introduce a new style, character, or concept. They are the primary method for customizing Stable Diffusion output. You can find thousands of community-created LoRAs on CivitAI, covering everything from specific artistic styles to individual characters to lighting techniques.

Using a LoRA is straightforward. In AUTOMATIC1111, you include the trigger keyword in your prompt and reference the LoRA file using the syntax <lora:filename:weight>, where weight controls how strongly the LoRA influences the output (typically 0.5 to 1.0). In ComfyUI, you add a LoRA Loader node between your checkpoint loader and the rest of the pipeline.

You can stack multiple LoRAs in a single generation. For example, you might combine a style LoRA for watercolor painting with a character LoRA for a specific subject, and a detail LoRA for enhanced textures. The key is balancing the weights so no single LoRA dominates the output.

Prompt Example: Using Multiple LoRAs
masterpiece, best quality, 1girl standing in a sunlit garden, flowing white dress, wind in hair, soft golden hour lighting, detailed flowers, depth of field, <lora:watercolor_style:0.7> <lora:detailed_eyes:0.4>
Negative: worst quality, low quality, blurry, deformed, extra limbs, bad anatomy, watermark, text, signature
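
The tag syntax in that example is mechanical enough to generate programmatically when stacking several LoRAs. A small hypothetical helper, assuming A1111's <lora:filename:weight> convention:

```python
def lora_tag(name: str, weight: float = 1.0) -> str:
    """Format one A1111-style LoRA reference, e.g. <lora:watercolor_style:0.7>."""
    return f"<lora:{name}:{weight}>"

def stack_loras(prompt: str, loras: list[tuple[str, float]]) -> str:
    """Append a stack of LoRA tags to a prompt, keeping weights explicit
    so no single LoRA silently dominates the output."""
    tags = " ".join(lora_tag(name, weight) for name, weight in loras)
    return f"{prompt} {tags}"
```

For example, stack_loras("1girl in garden", [("watercolor_style", 0.7), ("detailed_eyes", 0.4)]) reproduces the tag layout shown in the prompt above.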

Textual Inversions (Embeddings)

Textual inversions, also called embeddings, are even smaller than LoRAs (usually under 100 KB). Instead of modifying model weights, they teach the text encoder new "words" that represent specific concepts. They are most commonly used for negative prompts, where community-created embeddings like EasyNegative or BadDream encode a broad set of quality-reducing artifacts into a single trigger word, simplifying your negative prompt significantly.

ControlNet

ControlNet is arguably the most important extension in the Stable Diffusion ecosystem. It solves one of the fundamental challenges of AI image generation: precise spatial control. While text prompts can describe what you want in an image, they are inherently imprecise about where things should be positioned and how they should be posed.

ControlNet works by adding a parallel neural network that conditions the generation process on a control signal. This signal can be derived from an existing image or created from scratch. The most commonly used control types include:

  • Canny: follows edge maps extracted from a reference image
  • Depth: preserves the 3D layout of a scene via a depth map
  • OpenPose: transfers human body poses using a detected keypoint skeleton
  • Scribble: turns rough sketches into finished compositions
  • Lineart: renders on top of clean line drawings
  • Tile: adds detail during upscaling while preserving the original content

Prompt Example: ControlNet OpenPose
professional photograph of a dancer mid-leap, studio lighting, dramatic shadows, elegant pose, sharp focus, 85mm lens, f/2.8, award-winning photography [ControlNet: OpenPose | Weight: 1.0 | Reference: dance_pose.png]
Negative: cartoon, illustration, painting, blurry, low quality, deformed limbs

ControlNet models need to match your base model architecture. SD 1.5 ControlNet models will not work with SDXL checkpoints and vice versa. Always download the correct version for the base model you are using.

Image-to-Image (img2img)

Image-to-image generation starts with an existing image rather than pure noise. Instead of beginning the denoising process from random latent noise, the model encodes your input image into latent space and adds a controlled amount of noise, determined by the denoising strength parameter. It then runs the normal denoising process, guided by your text prompt.

The denoising strength controls how much the output deviates from the input. At 0.3, the output closely resembles the original with subtle style changes. At 0.7, the composition is loosely preserved but details change significantly. At 1.0, the input image is essentially ignored and you get a fresh generation.
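
Implementations differ in the details, but diffusers-style img2img pipelines map denoising strength to the number of schedule steps actually executed: at strength 0.5 with 30 steps, roughly only the last 15 steps run, starting from the partially noised input. A sketch of that mapping:

```python
def img2img_schedule(total_steps: int, strength: float) -> tuple[int, int]:
    """Map denoising strength to (steps actually run, steps skipped).
    Mirrors the diffusers-style convention where strength 1.0 runs the
    full schedule and lower strengths skip the early, noisiest steps."""
    run_steps = min(int(total_steps * strength), total_steps)
    return run_steps, total_steps - run_steps
```

This is why very low strengths can look underbaked: at strength 0.2 with 20 steps, only 4 denoising steps actually execute.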

Common img2img Use Cases

  • Style transfer: repaint a photo or render in a different artistic style
  • Sketch-to-image: turn a rough drawing into a finished composition
  • Variations: generate alternate takes on an image you already like
  • Detail passes: rerun an upscaled image at low denoising strength to sharpen fine detail

Prompt Example: img2img Style Transfer
oil painting in the style of Monet, impressionist brushwork, vibrant colors, outdoor garden scene, dappled sunlight, thick impasto technique, museum quality [img2img | Denoising Strength: 0.55 | Input: garden_photo.jpg]

Inpainting

Inpainting allows you to selectively regenerate portions of an image while keeping the rest untouched. You paint a mask over the area you want to change, write a prompt describing what should appear there, and the model fills in the masked region while seamlessly blending with the surrounding pixels.

Inpainting is indispensable for fixing common AI artifacts. Deformed hands, awkward facial features, unwanted objects, or inconsistent background elements can all be corrected with targeted inpainting passes. It transforms Stable Diffusion from a one-shot generator into an iterative editing tool.

Inpainting Tips for Best Results

  • Mask slightly beyond the flawed area so the model can blend the edges cleanly
  • Use the "Only masked" inpaint area setting for small fixes like hands and faces
  • Describe only what should appear in the masked region, not the entire image
  • Start with moderate denoising strength (0.4-0.6) for corrections; go higher only when replacing content entirely
  • Prefer an inpainting-specific checkpoint when one exists for your base model

Prompt Engineering Best Practices

Writing effective prompts for Stable Diffusion differs from prompting cloud-based models like Midjourney. Because SD checkpoints vary dramatically in how they interpret text, developing strong prompting habits is essential for consistent results.

Prompt Structure

A well-structured Stable Diffusion prompt typically follows this pattern: subject, action/pose, environment, lighting, camera/lens, style, quality modifiers. Placing the most important elements first gives them greater weight in the generation.

Prompt Example: Structured Photorealistic Prompt
portrait of an elderly Japanese craftsman in a woodworking workshop, focused expression, holding a hand plane, surrounded by wood shavings, warm tungsten lighting from a single window, shallow depth of field, Fujifilm X-T4, 56mm f/1.2, documentary photography, editorial quality, 8k
Negative: cartoon, anime, illustration, 3d render, blurry, deformed, extra fingers, poorly drawn hands, watermark

Quality Modifiers and Trigger Words

Many fine-tuned checkpoints respond to specific quality trigger words. Common ones include masterpiece, best quality, highly detailed, 8k, sharp focus, and professional. However, these are not universal. Always check the model card on CivitAI or Hugging Face for the checkpoint-specific trigger words and recommended settings.

Negative Prompts

Negative prompts are just as important as positive prompts in Stable Diffusion. They tell the model what to avoid. A good baseline negative prompt addresses common artifacts: worst quality, low quality, normal quality, lowres, blurry, deformed, extra limbs, bad anatomy, bad hands, watermark, text, signature, cropped. For even better results, use quality-focused embeddings like EasyNegative in your negative prompt.

Prompt Weighting

Both A1111 and ComfyUI support prompt weighting to emphasize or de-emphasize specific terms. In A1111, wrapping a term in parentheses increases its weight: (detailed eyes:1.3) gives that concept 30% more influence. You can also decrease weight: (background:0.7). This fine-grained control helps you steer the model toward your intended result without restructuring the entire prompt.
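
The weighting syntax is easy to generate with small helpers. One useful detail: in A1111, plain parentheses without an explicit number multiply attention by 1.1 per nesting level, so ((term)) is equivalent to (term:1.21). A hypothetical sketch of both conventions:

```python
def weighted(term: str, w: float) -> str:
    """Explicit A1111 attention syntax: (term:w)."""
    return f"({term}:{w})"

def paren_multiplier(depth: int) -> float:
    """Effective weight of `depth` levels of plain parentheses,
    each of which multiplies attention by 1.1 in A1111."""
    return round(1.1 ** depth, 4)
```

Note that ComfyUI parses the same (term:w) syntax but does not apply the 1.1-per-parenthesis rule the same way, so explicit numeric weights travel between the two UIs more predictably.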

Prompt Example: Weighted Fantasy Scene
(epic fantasy landscape:1.2), ancient stone castle on a cliff edge, (dramatic storm clouds:1.3), lightning in the distance, vast ocean below, volumetric fog, (cinematic lighting:1.1), matte painting style, concept art, artstation, 4k
Negative: photo, realistic, modern buildings, cars, text, watermark, low quality

Strengths & Weaknesses

Understanding where Stable Diffusion excels and where it falls short helps you decide when to use it versus alternative tools, and how to work around its limitations.

Strengths

  • Completely free and open-source
  • Runs locally with full privacy
  • Unlimited generations at no cost
  • Massive ecosystem of fine-tuned models
  • LoRA and embedding customization
  • ControlNet for precise spatial control
  • Inpainting and img2img workflows
  • Full API access for automation
  • Can be integrated into production pipelines
  • Active community and rapid development

Weaknesses

  • Requires capable GPU hardware
  • Steeper learning curve than cloud tools
  • Manual installation and maintenance
  • Base models weaker than Midjourney at aesthetic defaults
  • Text rendering limited (improving with SD3)
  • Anatomy errors more frequent than closed-source alternatives
  • Model/checkpoint management can be storage-intensive
  • Extension compatibility issues between versions

Video Tutorials

These curated video tutorials walk you through the setup process and core workflows visually. Watching alongside this guide will accelerate your learning significantly.

Stable Diffusion Beginner Setup Guide

Advanced Stable Diffusion Workflows

Frequently Asked Questions

What is Stable Diffusion?

Stable Diffusion is a free, open-source AI image generation model developed by Stability AI. It uses a latent diffusion process to transform text descriptions into detailed images. Unlike cloud-based tools such as Midjourney or DALL-E, Stable Diffusion can run entirely on your own computer, giving you full control over the generation process, unlimited generations, and complete privacy.

What hardware do I need to run Stable Diffusion?

For SD 1.5, you need at least an NVIDIA GPU with 4GB VRAM (GTX 1650 or better). For SDXL, 8GB VRAM is recommended (RTX 3060 or better). For SD3, 12GB+ VRAM is ideal. AMD GPUs work via DirectML on Windows but with reduced performance. Apple Silicon Macs can run Stable Diffusion through optimized backends like the MLX framework or the Core ML-based diffusers pipeline.

Is Stable Diffusion free to use?

Yes. Stable Diffusion is completely free and open-source. The model weights are freely available under permissive licenses, and community interfaces like AUTOMATIC1111 and ComfyUI are also free. Your only cost is the hardware to run it locally or, optionally, renting cloud GPU time through services like RunPod, Vast.ai, or Google Colab.

What is the difference between AUTOMATIC1111 and ComfyUI?

AUTOMATIC1111 (A1111) provides a traditional web-based interface with menus, sliders, and buttons, making it ideal for beginners and users who want a straightforward experience. ComfyUI uses a node-based workflow system where you visually connect processing steps, offering greater flexibility, better VRAM efficiency, and precise control over the generation pipeline. ComfyUI also supports workflow sharing via JSON files. Many users start with A1111 and transition to ComfyUI as they advance.

What is ControlNet and why do I need it?

ControlNet is an extension that gives you precise control over image composition by conditioning the generation on additional input signals such as edge maps, depth maps, pose skeletons, or scribbles. It solves the fundamental problem of text prompts being unable to reliably specify exact positions, poses, and spatial relationships. ControlNet is essential for character art, architectural visualization, and any workflow that requires compositional precision.

What are LoRAs?

LoRAs (Low-Rank Adaptations) are small, efficient model add-ons that teach Stable Diffusion new concepts, styles, or characters without replacing the entire base model. They are typically 10-200MB in size and can be mixed and matched. You can combine a style LoRA with a character LoRA and a detail LoRA in a single generation. LoRAs are the primary method for customizing Stable Diffusion output and thousands are freely available on CivitAI.

How does Stable Diffusion compare to Midjourney?

Midjourney produces polished, aesthetically pleasing images with minimal effort and requires no hardware or setup. Stable Diffusion requires more technical knowledge but offers unlimited free generations, full privacy, vastly greater customization through LoRAs and ControlNet, the ability to run offline, and complete control over the generation pipeline. Midjourney excels at quick, beautiful results; Stable Diffusion excels at specialized, production, and high-volume use cases.