
Abstract

Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user's natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.
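As a rough sketch of the modality-routing idea (not DialogGen's actual code), the MLLM in a MIDS first decides whether a turn calls for text or an image and, in the image case, emits an aligned drawing prompt for the T2I backend. The `mllm_chat` and `t2i_generate` callables below are hypothetical placeholders:

```python
# Hedged sketch of modality routing in a MIDS. `mllm_chat` and `t2i_generate`
# are hypothetical callables, not DialogGen's real API.
def respond(history, user_msg, mllm_chat, t2i_generate):
    """Route one dialogue turn to either a text reply or an image generation."""
    decision = mllm_chat(
        history + [user_msg],
        system="Answer with 'TEXT: <reply>' or 'IMAGE: <drawing prompt for the T2I model>'.",
    )
    if decision.startswith("IMAGE:"):
        drawing_prompt = decision[len("IMAGE:"):].strip()
        return {"modality": "image",
                "drawing_prompt": drawing_prompt,
                "image": t2i_generate(drawing_prompt)}
    return {"modality": "text",
            "text": decision.removeprefix("TEXT:").strip()}
```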

Paper: https://arxiv.org/abs/2403.08857

Code: https://github.com/tencent/HunyuanDiT

Demo: https://huggingface.co/spaces/multimodalart/HunyuanDiT

Project Page: https://dit.hunyuan.tencent.com/

Model Weights: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT

  • NVIDIA GPU with CUDA support is required.
  • It's only been tested on Linux with V100 and A100 GPUs.
  • A minimum of 11GB of VRAM is required; 32GB is recommended.


Hey folks, I've been training LoRAs for a while now, and have some scripts I really like that I've been working with. However, I realized I haven't been keeping up lately, so: is SDXL still the best for LoRAs? By that I mean that so far, 1.5 and standard SDXL have given me the most accurate quality.

My LoRAs seem to work fine with Lightning and Turbo models, but is there anything else I should look into? Have you made any major changes to your training setups in the last 6 months? Is sdxl_base still the best base model to train off of?
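For reference, here's roughly how I'm stacking a LoRA trained on sdxl_base onto a Lightning checkpoint with diffusers. The Lightning LoRA filename and the trailing-timestep scheduler setting are what I remember from the ByteDance/SDXL-Lightning model card, so treat them as assumptions:

```python
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler
from huggingface_hub import hf_hub_download

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Fuse the Lightning distillation LoRA first (filename as I recall it from the model card; verify).
pipe.load_lora_weights(
    hf_hub_download("ByteDance/SDXL-Lightning", "sdxl_lightning_4step_lora.safetensors")
)
pipe.fuse_lora()

# Lightning checkpoints expect "trailing" timestep spacing.
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

# Stack my own character LoRA (trained on sdxl_base) on top.
pipe.load_lora_weights("path/to/my_character_lora.safetensors", adapter_name="character")

image = pipe(
    "portrait photo of my character, soft window light",
    num_inference_steps=4, guidance_scale=0,
).images[0]
image.save("lightning_lora_test.png")
```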


Hello there, fellow lemmings!

As the title suggests, I've made an app that changes your wallpaper regularly based on the parameters you set. It uses AI Horde (which uses Stable Diffusion) to generate the images and you can set the interval at which your wallpaper changes (from 1 time per day to 48 times per day).

Other features:

  • completely free (in fact, it will be open source next week)
  • easy to use, but also offers advanced options
  • there's a help section with all the parameters explained
  • works on both phones and tablets

There's also an optional Premium subscription which puts you higher in the priority queue, but most of the time it's not needed - unless you're generating during the busiest times, the wait isn't really that long even without Premium. Note that if you're a user of the AI Horde, you can put your own API key in and get the same benefits as having Premium.

The open source version (released next week) will also allow (toggleable on or off) NSFW content, which is not possible in the Google Play version. It also doesn't contain the Premium subscription.

Also, please check https://dontkillmyapp.com and follow the steps for your phone vendor; this app needs to work in the background, and most vendors kill background apps. It will ask for a battery-saver exception on its own when you set the schedule for the first time, but for some vendors that's not enough.
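For anyone curious, the generation flow under the hood looks roughly like this. It's a hedged sketch against the public AI Horde API (endpoint and field names should be checked against the official docs; "0000000000" is the shared anonymous key):

```python
import time
import requests

API = "https://stablehorde.net/api/v2"
HEADERS = {
    "apikey": "0000000000",                       # anonymous key; your own key gives higher priority
    "Client-Agent": "wallpaper-sketch:1.0:example",
}

# 1) Submit an async generation request.
job = requests.post(f"{API}/generate/async", headers=HEADERS, json={
    "prompt": "a serene mountain lake at sunrise, scenic wallpaper, highly detailed",
    "params": {"width": 1024, "height": 576, "steps": 30},
}).json()

# 2) Poll until a worker has finished the job.
while not requests.get(f"{API}/generate/check/{job['id']}").json().get("done"):
    time.sleep(5)

# 3) Fetch the finished generation (a URL or base64 payload, depending on settings).
status = requests.get(f"{API}/generate/status/{job['id']}").json()
print(status["generations"][0]["img"])
```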


Do let me know what you think about the app! And if you want me to tag you when I make a release post for the open source version, let me know. Also, huge thanks to @db0@lemmy.dbzer0.com for creating and maintaining the AI Horde, which makes this (and other cool apps) possible!


Abstract

We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation. By incorporating a Lightning T2I branch with a standard diffusion one, PuLID introduces both contrastive alignment loss and accurate ID loss, minimizing disruption to the original model and ensuring high ID fidelity. Experiments show that PuLID achieves superior performance in both ID fidelity and editability. Another attractive property of PuLID is that the image elements (e.g., background, lighting, composition, and style) before and after the ID insertion are kept as consistent as possible. Code and models will be available at the GitHub repository linked below.

Paper: https://arxiv.org/abs/2404.16022

Code: https://github.com/ToTheBeginning/PuLID


This stuff moves so fast I really can't keep up and a lot of the research posted here goes a bit over my head. I'm looking for something that doesn't seem too out of the question given things like CLIPSeg. Is there some tool or library out there that will accept an image and a prompt and then generate a mask within the image that generally corresponds to the prompt?

For example, if I had a picture of an empty park and gave the prompt "little girl flying a kite" I should get back a mask vaguely in the shape of a child with a sort of blob mask in the sky for the kite. Of course from there I could use the mask to inpaint those things. I would really like to be able to layer an image kind of like Photoshop so it's not all-or-nothing and focus on one element at a time. I could do the masking manually but of course we all want fewer steps in our workflows.
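Something like this (a hedged sketch using the Hugging Face transformers port of CLIPSeg, model CIDAS/clipseg-rd64-refined) seems close to what I'm imagining. One caveat: it segments what's already in the image, so for things that aren't there yet (the girl and kite in an empty park) I'd probably only get weak "plausible region" blobs and might need a lower threshold or a hand-drawn mask:

```python
import torch
import numpy as np
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("empty_park.png").convert("RGB")
prompts = ["a little girl", "a kite in the sky"]

inputs = processor(text=prompts, images=[image] * len(prompts),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (num_prompts, 352, 352)

for prompt, logit in zip(prompts, logits):
    prob = torch.sigmoid(logit)
    mask = (prob > 0.4).float()                  # threshold is a guess; tune per image
    mask_img = Image.fromarray((mask.numpy() * 255).astype(np.uint8)).resize(image.size)
    mask_img.save(f"mask_{prompt.replace(' ', '_')}.png")  # feed this to inpainting
```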


Abstract

The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) in particular has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) it is hard to maintain the identity characteristics of reference portraits accurately, (2) the generated images lack aesthetic appeal, especially while enforcing identity retention, and (3) existing approaches are not simultaneously compatible with both LoRA-based and Adapter-based methods. To address these issues, we present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance. To address the loss of identity features, we introduce identity consistency reward fine-tuning, which uses feedback from face detection and recognition models to improve the preservation of generated identities. Furthermore, we propose identity aesthetic reward fine-tuning, leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on SD1.5 and SDXL diffusion models validate the effectiveness of our approach.
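As a hedged reading of the identity consistency reward: it can be sketched as cosine similarity between face embeddings of the reference portrait and the generated image. The paper uses its own face detection and recognition models; MTCNN and FaceNet from facenet-pytorch are stand-ins here, and a real feedback fine-tuning loop would backpropagate through the generated branch rather than just score it:

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                               # face detector/cropper (stand-in)
embedder = InceptionResnetV1(pretrained="vggface2").eval()  # face recognizer (stand-in)

def identity_reward(reference_img, generated_img):
    """Cosine similarity between face embeddings; 0.0 if either face is undetected.

    This only scores images; actual reward fine-tuning would keep gradients
    flowing through the generated image.
    """
    faces = [mtcnn(img) for img in (reference_img, generated_img)]
    if any(f is None for f in faces):
        return torch.tensor(0.0)
    with torch.no_grad():
        ref_emb, gen_emb = (embedder(f.unsqueeze(0)) for f in faces)
    return torch.nn.functional.cosine_similarity(ref_emb, gen_emb).squeeze()
```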

Paper: https://arxiv.org/abs/2404.15449

Code: https://github.com/Weifeng-Chen/ID-Aligner (coming soon)

Model: https://huggingface.co/ (coming soon)

Project Page: https://idaligner.github.io/


Abstract

Diffusion models have become a mainstream approach for high-resolution image synthesis. However, directly generating higher-resolution images from pretrained diffusion models will encounter unreasonable object duplication and exponentially increased generation time. In this paper, we discover that object duplication arises from feature duplication in the deep blocks of the U-Net. Concurrently, we pinpoint the extended generation times to self-attention redundancy in the U-Net's top blocks. To address these issues, we propose a tuning-free higher-resolution framework named HiDiffusion. Specifically, HiDiffusion contains Resolution-Aware U-Net (RAU-Net), which dynamically adjusts the feature map size to resolve object duplication, and engages Modified Shifted Window Multi-head Self-Attention (MSW-MSA), which utilizes optimized window attention to reduce computations. We can integrate HiDiffusion into various pretrained diffusion models to scale image generation resolutions even to 4096×4096 at 1.5-6× the inference speed of previous methods. Extensive experiments demonstrate that our approach can address object duplication and heavy computation issues, achieving state-of-the-art performance on higher-resolution image synthesis tasks.
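A hedged usage sketch with diffusers; the apply_hidiffusion import follows the repo's README as I recall it, so treat the exact API as an assumption and check the README:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from hidiffusion import apply_hidiffusion  # import path per the repo README; verify

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Patch the U-Net with RAU-Net and MSW-MSA, then sample above the native resolution.
apply_hidiffusion(pipe)
image = pipe("a highly detailed aerial photo of a coastal city",
             height=2048, width=2048, num_inference_steps=30).images[0]
image.save("hidiffusion_2048.png")
```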

Paper: https://arxiv.org/abs/2311.17528

Code: https://github.com/megvii-model/HiDiffusion

Colab Demo: https://colab.research.google.com/drive/1EiBn9lSnPZTU4cikRRaBBexs429M-qty?usp=drive_link

Project Page: https://hidiffusion.github.io/


Abstract

Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Current distillation techniques often dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii) ODE Trajectory Reformulation. However, these approaches suffer from severe performance degradation or domain shifts. To address these limitations, we propose Hyper-SD, a novel framework that synergistically amalgamates the advantages of ODE Trajectory Preservation and Reformulation, while maintaining near-lossless performance during step compression. Firstly, we introduce Trajectory Segmented Consistency Distillation to progressively perform consistent distillation within pre-defined time-step segments, which facilitates the preservation of the original ODE trajectory from a higher-order perspective. Secondly, we incorporate human feedback learning to boost the performance of the model in a low-step regime and mitigate the performance loss incurred by the distillation process. Thirdly, we integrate score distillation to further improve the low-step generation capability of the model and offer the first attempt to leverage a unified LoRA to support the inference process at all steps. Extensive experiments and user studies demonstrate that Hyper-SD achieves SOTA performance from 1 to 8 inference steps for both SDXL and SD1.5. For example, Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and +0.51 in Aes Score in the 1-step inference.
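A hedged sketch of using one of the released step-distilled LoRAs with diffusers; the repo filename and the low-CFG/step settings are assumptions based on the model card, which also recommends specific schedulers per step count:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from huggingface_hub import hf_hub_download

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Filename assumed from the ByteDance/Hyper-SD repo listing; verify on the model card.
pipe.load_lora_weights(
    hf_hub_download("ByteDance/Hyper-SD", "Hyper-SDXL-8steps-lora.safetensors")
)
pipe.fuse_lora()

# Distilled low-step regime: few steps, little or no CFG (exact settings per the model card).
image = pipe("a cinematic photo of a lighthouse at dusk",
             num_inference_steps=8, guidance_scale=0).images[0]
image.save("hyper_sdxl_8step.png")
```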

Paper:

Hugging Face Repo: https://huggingface.co/ByteDance/Hyper-SD

T2I Demo: https://huggingface.co/spaces/ByteDance/Hyper-SDXL-1Step-T2I

Scribble Demo: https://huggingface.co/spaces/ByteDance/Hyper-SD15-Scribble

Project Page: https://hyper-sd.github.io/


Abstract

The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.
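The paper's two-step checkpoint isn't linked yet, but the few-step LCM inference pattern it builds on looks roughly like this in diffusers, shown with the publicly available SimianLuo/LCM_Dreamshaper_v7 as a stand-in rather than the authors' BK-SDM model:

```python
import torch
from diffusers import DiffusionPipeline

# Stand-in LCM-distilled checkpoint to illustrate few-step sampling; not the paper's model.
pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7",
                                         torch_dtype=torch.float16).to("cuda")

# LCM-style models need only a handful of steps; guidance is handled via an embedded scale.
image = pipe("a photo of a red bicycle leaning on a brick wall",
             num_inference_steps=4, guidance_scale=8.0).images[0]
image.save("lcm_4step.png")
```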

Paper: https://arxiv.org/abs/2404.11925


tl;dr: We can control components of generated images with a pretrained diffusion model. We use this to generate various perceptual illusions.

Abstract

Given a factorization of an image into a sum of linear components, we present a zero-shot method to control each individual component through diffusion model sampling. For example, we can decompose an image into low and high spatial frequencies and condition these components on different text prompts. This produces hybrid images, which change appearance depending on viewing distance. By decomposing an image into three frequency subbands, we can generate hybrid images with three prompts. We also use a decomposition into grayscale and color components to produce images whose appearance changes when they are viewed in grayscale, a phenomenon that naturally occurs under dim lighting. And we explore a decomposition by a motion blur kernel, which produces images that change appearance under motion blurring. Our method works by denoising with a composite noise estimate, built from the components of noise estimates conditioned on different prompts. We also show that for certain decompositions, our method recovers prior approaches to compositional generation and spatial control. Finally, we show that we can extend our approach to generate hybrid images from real images. We do this by holding one component fixed and generating the remaining components, effectively solving an inverse problem.
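A hedged sketch of the composite noise estimate for the low/high-frequency (hybrid image) case, assuming the two noise estimates come from a pretrained diffusion model conditioned on the "far away" and "up close" prompts inside a standard sampler loop; the Gaussian filter parameters are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def gaussian_lowpass(x, kernel_size=33, sigma=6.0):
    """Gaussian blur as a low-pass filter; parameters are illustrative, not the paper's."""
    coords = torch.arange(kernel_size, dtype=x.dtype, device=x.device) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).view(1, 1, 1, -1)
    k2d = (g * g.transpose(-1, -2)).expand(x.shape[1], 1, kernel_size, kernel_size)
    return F.conv2d(x, k2d, padding=kernel_size // 2, groups=x.shape[1])

def composite_noise_estimate(eps_far, eps_near):
    """Keep the low frequencies of one conditional noise estimate and the highs of another.

    eps_far:  noise estimate conditioned on the prompt meant to appear from far away
    eps_near: noise estimate conditioned on the prompt meant to appear up close
    """
    return gaussian_lowpass(eps_far) + (eps_near - gaussian_lowpass(eps_near))
```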

Paper: https://arxiv.org/abs/2404.11615

Code: https://github.com/dangeng/visual_anagrams

Project Page: https://dangeng.github.io/factorized_diffusion/

Testing Stable Diffusion 3

Tonight I tested out SD3 via the API. I sadly do not have as much time as with my Stable Cascade review, so there will only be four images for most tests (instead of 9), and no SD2 controls.
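For context, a hedged sketch of the kind of API call involved (endpoint path and field names are recalled from Stability's v2beta stable-image API and should be checked against the current docs; not necessarily the exact script used for these tests):

```python
import requests

resp = requests.post(
    "https://api.stability.ai/v2beta/stable-image/generate/sd3",
    headers={"Authorization": "Bearer YOUR_API_KEY", "Accept": "image/*"},
    files={"none": ""},  # forces multipart/form-data, which this endpoint expects
    data={
        "prompt": "A horse riding an astronaut",
        "model": "sd3",
        "aspect_ratio": "1:1",
        "output_format": "png",
    },
)
resp.raise_for_status()
with open("sd3_test.png", "wb") as f:
    f.write(resp.content)
```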

Test #1: Inversion of expectations.

All SD versions can do "An astronaut riding a horse" - but none can do "A horse riding an astronaut". Can SD3?

Mmm, nope.

(You'll note that I also added to the prompt, "In the background are the words 'Stable Diffusion Now Does Text And Understands Spatial Relationships'", to test text as well)

Trying to help it out, I tried changing the first part of the prompt to: "A horse riding piggyback on the shoulders of an astronaut" . No dice.

As for the text aspect, we can see that it's not all there. That said, it's way better than SD2, and even better than Stable Cascade. With shorter / simpler text, it's at the level where you may get it right, or at least close enough that simple edits can fix it.

Test #2: Spatial relationships

Diffusion models tend not to understand spatial relationships, i.e. how elements are oriented relative to each other. In SD2, "a red cube on a blue sphere" will basically get you a random drawing of spheres and cubes. Stable Cascade showed maybe slightly better success, but not much. Asking for a cube on top of a sphere is particularly punishing for the models, since that's not something you'll see much in training data.

Here I asked for a complex scene:

"A red cube on a blue sphere to the left of a green cone, all sitting on top of a glass prism."

So we can see that none of the runs got it exactly right. But they all showed a surprising degree of "understanding" of the scene. This is definitely greatly improved over earlier models.

Test 3: Linguistic parsing.

Here I give it a query that famously trips up most diffusion models: "A room that doesn't contain an elephant. There are no elephants in the room." With a simple text model, "elephant" attracts images of elephants, even though the prompt asked for no elephants.

And SD3? Well, it fails too.

One might counter with, "Well, that's what negative prompts are for", but that misses the point - the point is whether the text model actually has a decent understanding of what the user is asking for. It does not.

Test 4: Excessive diversity

Do we avoid the "Gemini Scenario" here? Prompting: "The US founding fathers hold a vigorous debate."

That's a pass. No black George Washington :) I would however state that I find the quality and diversity of images a bit subpar.

Test 5: Insufficient diversity

What about the opposite problem - stereotyping? I prompted with "A loving American family." And the results were so bad that I ran a second set of four images:

The fact that there's zero diversity is the least of our worries. Look at those images - ugh, KILL IT WITH FIRE! They're beyond glitchy. This is like SD1.5-level glitchiness. It's extremely disappointing to see this in SD3, and I can't help but think that this must be a bug.

Test 6: Many objects

Another test was to see how the model could cope with having to draw many unique items in a scene. Prompt: "A clock, a frog, a cow, Bill Clinton, six eggs, an antique lantern, a banana, two seashells, a window, an HP printer, a poster for Star Wars, a broom, and an Amazon parrot."

The results are... meh. Mostly right, but all have flaws. And all the same style.

Test 7: Aesthetics and foreign characters

I did a final test on a useful use-case: creating an ad / cover page style image with nice aesthetics. In this case, an ad / cover for beets, with Icelandic text. Prompt: "A dramatic chiaroscuro photograph ad of a freshly picked beet under a spotlight, with dramatic text behind it that reads only 'Rauðrófur'." I also used a high aspect ratio.

As far as ads / cover images go, I think the aesthetics are quite workable. Foreign text though... I guess that's too ambitious. The eth (ð) is a total non-starter, and it also omitted the accent on the ó. I also did another test (not pictured here) of tomatoes with the text "Tómatar". The accent over the ó was only present in 2 of the 8 images, and one of those added a second accent over the second "a" for no reason.

Conclusions

  1. Aesthetics can be quite good, but not always. Needs more experimentation to see whether it's better than Stable Cascade (which is lovely in practice).

  2. Diversity... I'd put it lower than SD2 but higher than Stable Cascade (which is terrible in terms of diversity).

  3. Text: not all the way there, but definitely getting into "workable" territory.

  4. Prompt understanding: Spatial awareness is significantly improved, but it really doesn't "understand" prompts fully yet.

  5. Diversity: could use a bit more work.

  6. Glitchiness: VERY, at least when it comes to shots like the families.
