Jade Qwen 3 4B

A systems programming Qwen finetune.

Jade

Model description

This model was finetuned on synthetic conversations generated with Qwen 3 8B, Qwen 3 4B, and Qwen 3 30B A3B.

The first set of data used to generate conversations was taken from these books.

Books

The result was a high-quality grammar, logic, rhetoric, and math dataset.

The next set of data was a small collection of documentation hosted as mdBooks.

The next set of data was a small set of source code repositories.

The source code conversations are available here.

The next set of data was the documentation for each command supported by tealdeer.

The last set of data was every manpage currently installed on my MacBook Air (M2).
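
The exact extraction pipeline isn't described in this card; as a minimal sketch, dumping the installed manpages to plain text on macOS could look like the following, where the output directory and filename scheme are assumptions:

```python
import subprocess
from pathlib import Path

out_dir = Path("manpages_txt")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

# `man -k .` (apropos) lists every installed page as "name(section) - description".
listing = subprocess.run(["man", "-k", "."], capture_output=True, text=True).stdout

for line in listing.splitlines():
    name = line.split("(", 1)[0].split(",")[0].strip()
    if not name:
        continue
    # Render the page and strip backspace overstriking so only plain text remains.
    text = subprocess.run(f"man {name} | col -b", shell=True,
                          capture_output=True, text=True).stdout
    if text:
        (out_dir / f"{name}.txt").write_text(text)
```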

The mdBook and tealdeer-supported command docs were combined with data generated from Apple (aarch64)-related systems programming books. The synthetic conversations are available here.

The manpages conversations are available here.

A mix of CoT and /nothink prompts was used when generating the datasets, but the sets skew towards /nothink.

All of the datasets are one-shot prompts, except for the source code conversations, which ask two follow-up questions after the initial prompt.
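
The generation scripts aren't included here; a minimal sketch of producing a one-shot /nothink conversation (plus the follow-up turns used for the source code set) against a locally served Qwen 3 model over an OpenAI-compatible API, where the endpoint URL, model name, and prompts are all assumptions:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM or llama.cpp) hosting a Qwen 3 model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "Qwen/Qwen3-30B-A3B"  # illustrative; 8B and 4B generators were also used


def ask(messages: list[dict]) -> str:
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    return reply.choices[0].message.content


def one_shot(question: str) -> list[dict]:
    """Single-turn conversation; '/nothink' asks Qwen 3 to skip its reasoning trace."""
    messages = [{"role": "user", "content": f"{question} /nothink"}]
    messages.append({"role": "assistant", "content": ask(messages)})
    return messages


def with_follow_ups(question: str, follow_ups: list[str]) -> list[dict]:
    """Source-code conversations add two follow-up turns after the initial prompt."""
    messages = one_shot(question)
    for follow_up in follow_ups:
        messages.append({"role": "user", "content": f"{follow_up} /nothink"})
        messages.append({"role": "assistant", "content": ask(messages)})
    return messages
```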

I first merged all of these datasets together in order and finetuned a knowledge LoRA adapter with Axolotl on a single H200 for 3 epochs, bringing the eval loss down from 0.9643 to 0.62.
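
Axolotl is configured through YAML rather than Python, so this is only a rough sketch of the kind of LoRA setup involved, written with plain PEFT/Transformers; the rank, alpha, dropout, and target modules are assumptions, not the actual config:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

# Illustrative hyperparameters; the real run used Axolotl's trainer on an H200 for 3 epochs.
lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trained
```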

I then generated an identity dataset, using GPT-4o and Claude Sonnet to employ some degree of wit and sarcasm, and finetuned an identity LoRA adapter with Axolotl on a single H200 for 30 epochs, bringing the eval loss down from 7.7014 to 2.3335.

Finally, I merged these adapters into Qwen 3 4B using the DARE TIES merging method, weighting the knowledge adapter at 1.5 and the identity adapter at 0.5 and setting the density to 0.9. I tried every combination of weights and densities to get the one-shot identity questions to trigger witty responses, but could not do so with a merged adapter. They do, however, work splendidly when using the LoRA adapter separately.
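
The card doesn't say which tool performed the merge (mergekit is a common choice for DARE TIES). As a hedged sketch, an equivalent adapter-level merge with PEFT's add_weighted_adapter, using the weights and density above and hypothetical adapter paths, could look like:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)

# Hypothetical local paths for the two LoRA adapters described above.
model = PeftModel.from_pretrained(base, "loras/jade-knowledge", adapter_name="knowledge")
model.load_adapter("loras/jade-identity", adapter_name="identity")

# DARE-TIES combination with the weights (1.5 / 0.5) and density (0.9) quoted above.
model.add_weighted_adapter(
    adapters=["knowledge", "identity"],
    weights=[1.5, 0.5],
    adapter_name="jade",
    combination_type="dare_ties",
    density=0.9,
)
model.set_adapter("jade")
merged = model.merge_and_unload()  # fold the combined adapter into the base weights
merged.save_pretrained("jade_qwen3_4b_merged")
```

Using the identity adapter on its own, as mentioned above, amounts to skipping the merge and loading only that adapter on top of the base model.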

The result is a fast Qwen 3 model that seems to retain the updated knowledge base it was trained on while lacking a lot of the personality I hoped for. I'm currently researching ways to weight the identity data more optimally. I've also noticed the model can get a little manpage-obsessed, with a focus on Perl (unfortunately), since the bulk of the manpages on my system (me, a developer who doesn't use Perl tooling; it's remarkable how much of what we do touches Perl at some point) are Perl documentation.

I've made 8-bit and 4-bit MLX quantizations of this bf16 model available.
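
For the MLX quantizations, here is a minimal conversion-and-use sketch with mlx-lm; the local paths and prompt are placeholders, and the actual quantized repos linked from this card should be substituted:

```python
# Apple Silicon only; `pip install mlx-lm`.
from mlx_lm import convert, load, generate

# Quantize the bf16 weights to 4-bit MLX format (8-bit works the same with q_bits=8).
convert("dougiefresh/jade_qwen3_4b", mlx_path="jade-4bit-mlx", quantize=True, q_bits=4)

model, tokenizer = load("jade-4bit-mlx")
messages = [{"role": "user", "content": "When should I reach for kqueue instead of select? /nothink"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```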

❤️

