J C

dark-pen

AI & ML interests

None yet

Recent Activity

liked a dataset 36 minutes ago
eaddario/imatrix-calibration
reacted to eaddario's post with 🔥 36 minutes ago
Tensor-wise (TWQ) and layer-wise quantization (LWQ) are now available in llama.cpp! As of version b5125, users can perform TWQ, whereby you quantize a whole tensor at a specific level, or LWQ, by choosing specific layers per tensor.

The new --tensor-type option enables llama-quantize to apply user-defined quant levels to any combination of allowed tensors (i.e. tensors with two or more dimensions) and layer numbers, with support for regex patterns. For example, to TWQ the Attention Value tensor you would use --tensor-type attn_v=q6_k, and to perform LWQ you'd use something like --tensor-type "\.([0-9]|1[01257]|31)\.attn_v=q4_k"

In the next few days/weeks I'll update the models in my HF repo (and will add some others), but https://huggingface.co/eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF and https://huggingface.co/eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF have already been LWQed. For reference, compared to the naive Q4_K_M model, the LWQ Qwen-7B is almost 11% smaller (4.68GB vs 4.18GB) with only a 0.35% penalty on PPL!

I'll update the https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca post to explain the process in detail, but in the meantime the following links provide some background:
- Changes to llama-quantize: https://github.com/ggml-org/llama.cpp/pull/12511
- TWQ & LWQ tests: https://github.com/ggml-org/llama.cpp/discussions/12741
- Modified llama-imatrix (not yet merged) used to generate imatrix statistics to guide the TWQ and LWQ process: https://github.com/ggml-org/llama.cpp/pull/12718
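A minimal command-line sketch of the two modes, assuming llama.cpp build b5125 or later; the model file names are hypothetical and the --tensor-type values mirror the examples in the post:

# TWQ: quantize every Attention Value (attn_v) tensor at Q6_K;
# all other tensors follow the base Q4_K_M recipe
./llama-quantize --tensor-type attn_v=q6_k model-f16.gguf model-twq.gguf q4_k_m

# LWQ: quantize attn_v at Q4_K only in layers 0-12, 15, 17 and 31
# (the regex matches the layer number in the tensor name)
./llama-quantize --tensor-type "\.([0-9]|1[01257]|31)\.attn_v=q4_k" model-f16.gguf model-lwq.gguf q4_k_m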
liked a dataset about 12 hours ago
yzwang/X2I-in-context-learning

Organizations

None yet

dark-pen's activity

upvoted an article 11 days ago

Hugging Face and Cloudflare Partner to Make Real-Time Speech and Video Seamless with FastRTC
