where's the scheduler config?
the pipeline cannot load without it.
Added. For now, the scheduler still needs to be created manually, as demonstrated in https://github.com/HiDream-ai/HiDream-I1/blob/main/inference.py for the Dev and Fast models.
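A minimal sketch of what that manual creation looks like, assuming the FlowMatchEulerDiscreteScheduler from diffusers; the exact scheduler class and shift value should be taken from inference.py rather than from this snippet:

```python
# Sketch only (not the exact upstream code): the scheduler class and shift
# value below are assumptions; take the real ones from inference.py.
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler(
    num_train_timesteps=1000,
    shift=3.0,  # placeholder; inference.py sets the per-variant value
)

# then hand it to the pipeline when loading, e.g.
#   pipe = <pipeline class>.from_pretrained(..., scheduler=scheduler)
# or replace it on an already-loaded pipeline:
#   pipe.scheduler = scheduler
```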
i'm adding support to SimpleTuner, where that abstraction already exists. thank you for adding the scheduler config.
however, can you also add text_encoder_4 and tokenizer_4? they will be needed for upstreaming official Diffusers support.
We are unable to add the Llama 3.1 weights, which are text_encoder_4. The model weights need to be accessed at Meta's Llama 3.1-8B-Instruct repository: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.
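As a rough sketch of the workaround, the gated weights can be pulled locally (after accepting Meta's license and logging in with `huggingface-cli login`) and passed to the pipeline yourself. The pipeline class name and the text_encoder_4 / tokenizer_4 keyword arguments below are assumptions based on this thread, not a confirmed API:

```python
# Sketch, not official usage: load the gated Llama 3.1 weights yourself and
# pass them to the pipeline as the fourth text encoder / tokenizer.
# The pipeline class and the text_encoder_4 / tokenizer_4 kwargs are
# assumptions here.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from diffusers import HiDreamImagePipeline  # present in newer diffusers; older setups use the class from this repo

llama_repo = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer_4 = AutoTokenizer.from_pretrained(llama_repo)
text_encoder_4 = LlamaForCausalLM.from_pretrained(
    llama_repo,
    output_hidden_states=True,  # the text-encoding path uses hidden states, not logits
    torch_dtype=torch.bfloat16,
)

pipe = HiDreamImagePipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Dev",
    tokenizer_4=tokenizer_4,
    text_encoder_4=text_encoder_4,
    torch_dtype=torch.bfloat16,
)
```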
oh no. why not? it seems like the whole repo here can be gated due to following the Llama license, as your model is a derivative of it?
The only thing bad about this model is the use of LLAMA 3.1 instead of the other Apache 2.0 models.
it's not the only bad thing, but it's certainly the current primary obstacle to making use of this model.
but yeah, regarding license, it doesn't make sense that these weights are MIT licensed when they're a derivative of Llama 3.1 which requires the Meta Community License be followed.
From what I gather from Ostris, it's not that difficult to adapt these weights to another LLM. Based on his findings with Flux, it's very cheap to finetune a model to use an LLM for text encoding. I wonder if we could do a similar thing with this model: drop T5 and the CLIPs and just use an Apache 2.0/MIT-licensed LLM. Ostris's next Flex version will supposedly use an LLM, so it's definitely possible.
where did you get that idea? from what I've seen, he's never retrained a model to use an LLM. one time he tried to train an adapter that matched ELLA, but I don't think it was ever released.
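to be concrete about what an adapter like that means: a small trained module that maps the LLM's hidden states into the text-embedding space the frozen diffusion model already expects. a rough hypothetical sketch, much simpler than ELLA's actual connector:

```python
# Hypothetical sketch of the adapter idea: project an LLM's hidden states into
# the text-embedding space the diffusion model was originally trained on, so
# only this small module gets trained while both large models stay frozen.
# ELLA's real connector is considerably more involved than this MLP.
import torch
import torch.nn as nn

class LLMEmbedAdapter(nn.Module):
    def __init__(self, llm_dim: int = 4096, target_dim: int = 4096, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, target_dim),
        )

    def forward(self, llm_hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, llm_dim) -> (batch, seq_len, target_dim)
        return self.proj(llm_hidden_states)
```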
Llama 3.1 is a decoder-only architecture; it doesn't produce "meaningful embeds" the way T5 does. see the Sana research paper that the HiDream authors referenced.
when you open up a conversation with Llama 3.1 8B Instruct and, without any other context, feed it your image caption, see what it prints out. that's the embed you're going to be training the model on.
when I try it, it just spews out commentary about the text. some kind of analysis is performed, but it doesn't contain direct information about the incoming caption. it hallucinates, making up possibilities and assumptions.
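here's roughly the test I mean, if anyone wants to reproduce it (a sketch; the caption is just an example):

```python
# Reproduce the check described above: hand a raw caption to Llama 3.1 8B
# Instruct with no system prompt and look at what it generates. The generated
# text illustrates the decoder-only behaviour; a diffusion model would be
# trained on the hidden states behind it, not on this output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

caption = "a corgi wearing a red scarf, standing in fresh snow"  # example caption
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": caption}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```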
also, i wouldn't suggest it's "very cheap" to finetune a model onto a completely different embedding distribution from a new LLM. that is essentially worse than retraining from scratch: it takes a lot longer, because the model's biases are heavily tied to the structure of Llama 3.1 8B's embeddings. "very cheap" is relative, maybe, but certainly misleading. it's not something you can do in a few days; it takes weeks, especially the "cheaper" you go with compute.
"Very cheap" was definitely meant as relative to training from scratch. Ostris tested with Flux and was able to produce ok results with just 3k steps. I don't know how different Hidream is to Flux, though it certainly will take a good amount of compute to adapt it to a LLM. Here are his findings https://x.com/ostrisai/status/1890123073728496121
interestingly, he never released that training code or the weights. since nothing has come of it since then, I'd assume something was wrong with the tests he ran and they were not in fact doing what he expected; the same way he had the height and width flipped for a while, so that his webp images were stretched in funky ways and he didn't know.
here's his Flex2 architecture implementation. it uses T5 and CLIP: https://github.com/ostris/ai-toolkit/commit/5365200da18a4f05173aa13252cf82487ab97de8#diff-346da60d06da25f45042766f128dfe1b42f84ebfb44ec6e5b49540daf2336ed6R118
We might be able to use this? https://huggingface.co/OpenSciLM/Llama-3.1_OpenScholar-8B (an AllenAI/Meta/university collaboration; they don't say anything about adhering to the Llama 3.1 license, so it seems like a true Apache 2.0 finetune of Llama 3.1). There are also some other finetunes from universities and small companies under Apache 2.0/MIT licenses. Almost all of these finetunes also have a paper on arXiv. It might be that they are able to change the license when significant research is involved.