Model files for prefill benchmarks on the Apple Neural Engine:
https://docs.google.com/spreadsheets/d/1OCxn730D5h8rvS2IHsSi0UBYbsP_lV-W-0uVdVDCvIk
ANEMLL 0.3.0-Alpha: https://github.com/Anemll/Anemll/releases

For faster processing, change mode="kmeans" to mode="uniform" in llama_converter.py (line 402).

Example export:

./anemll/utils/convert_model.sh \
    --model ~/Models/HF/Llama-3.1-Nemotron-Nano-8B-v1 \
    --output ~/Models/ANE/anemll-Nemotron-8B-ch4-b512-w512 \
    --context 512 \
    --batch 512 \
    --lut1 "" \
    --lut2 4 \
    --lut3 "" \
    --chunk 4 --restart 4
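The mode switch above targets the LUT (palettization) step of the conversion: "kmeans" clusters weights iteratively, while "uniform" places LUT entries on an even grid, which computes much faster at some accuracy cost. A minimal sketch of what that change looks like, assuming llama_converter.py drives coremltools' weight palettization (the function and variable names below are illustrative, not the actual code at line 402):

import coremltools as ct
import coremltools.optimize.coreml as cto

def palettize_for_ane(mlmodel: ct.models.MLModel, nbits: int = 4) -> ct.models.MLModel:
    # Illustrative sketch, not the literal contents of llama_converter.py.
    # "uniform" replaces the slower "kmeans" clustering; nbits=4 corresponds
    # to the --lut2 4 setting in the export command above.
    op_config = cto.OpPalettizerConfig(mode="uniform", nbits=nbits)
    config = cto.OptimizationConfig(global_config=op_config)
    return cto.palettize_weights(mlmodel, config)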
Source model: https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
Chunks: https://huggingface.co/anemll/ANEMLL-Prefill-bench