Model files for prefill benchmarks on Apple Neural Engine:

https://docs.google.com/spreadsheets/d/1OCxn730D5h8rvS2IHsSi0UBYbsP_lV-W-0uVdVDCvIk

Converted with ANEMLL 0.3.0-Alpha: https://github.com/Anemll/Anemll/releases

For faster processing, change `mode="kmeans"` to `mode="uniform"` in `llama_converter.py` line 402.

Example export:

```bash
./anemll/utils/convert_model.sh \
  --model ~/Models/HF/Llama-3.1-Nemotron-Nano-8B-v1 \
  --output ~/Models/ANE/anemll-Nemotron-8B-ch4-b512-w512 \
  --context 512 \
  --batch 512 \
  --lut1 "" \
  --lut2 4 \
  --lut3 "" \
  --chunk 4 --restart 4
```
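The kmeans-vs-uniform trade-off behind the `llama_converter.py` tweak can be illustrated with a toy 1-D LUT quantizer. This is only an illustrative sketch, not ANEMLL's or Core ML's actual palettization code; the function names and parameters are hypothetical. A uniform LUT just spaces 2^bits levels evenly over the weight range (cheap, one pass), while k-means iteratively refits the levels to the weight distribution (better fit, slower conversion):

```python
import numpy as np

def uniform_lut(weights, bits=4):
    """Uniform LUT: 2**bits evenly spaced levels between min and max (fast, no clustering)."""
    levels = 2 ** bits
    lut = np.linspace(weights.min(), weights.max(), levels)
    # Map each weight to its nearest LUT entry.
    idx = np.abs(weights[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx

def kmeans_lut(weights, bits=4, iters=10):
    """1-D k-means LUT: refit levels to the weight distribution (slower conversion)."""
    levels = 2 ** bits
    lut = np.linspace(weights.min(), weights.max(), levels)  # uniform init
    for _ in range(iters):
        idx = np.abs(weights[:, None] - lut[None, :]).argmin(axis=1)
        for k in range(levels):
            members = weights[idx == k]
            if members.size:  # keep centroid if its cluster is empty
                lut[k] = members.mean()
    idx = np.abs(weights[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx

# Toy weight tensor standing in for one layer's weights.
w = np.random.default_rng(0).normal(size=2048).astype(np.float32)
lut_u, idx_u = uniform_lut(w, bits=4)
lut_k, idx_k = kmeans_lut(w, bits=4)
mse_u = float(np.mean((w - lut_u[idx_u]) ** 2))
mse_k = float(np.mean((w - lut_k[idx_k]) ** 2))
```

Since k-means starts from the uniform LUT and each Lloyd iteration cannot increase the squared error, `mse_k <= mse_u`; uniform mode trades that extra accuracy for much faster conversion, which is acceptable for prefill benchmarking.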

Source model: https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1

Chunks: https://huggingface.co/anemll/ANEMLL-Prefill-bench
