```bash
git clone https://github.com/kIshizaki-sci/AutoAWQ.git
pip install -U transformers
pip install -e ./AutoAWQ
```

In [1]:
from awq import AutoAWQForCausalLM
import torch
import transformers
from transformers import AutoProcessor

In [2]:
print('torch version : ', torch.__version__)
print('transformers version : ', transformers.__version__)

torch version :  2.4.1+cu124
transformers version :  4.51.3


In [3]:
quant_path = 'kishizaki-sci/Llama-4-Scout-17B-16E-Instruct-AWQ'

In [4]:
%%time
model = AutoAWQForCausalLM.from_quantized(quant_path, torch_dtype=torch.float16, use_cache=True, device_map='auto')
processor = AutoProcessor.from_pretrained(quant_path)

config.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]



Fetching 25 files:   0%|          | 0/25 [00:00<?, ?it/s]

llama4_inference.ipynb:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

model-00002-of-00013.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/138 [00:00<?, ?B/s]

model-00001-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00004-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00005-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

model-00003-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00006-of-00013.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00007-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00008-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00009-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00010-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00011-of-00013.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00012-of-00013.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00013-of-00013.safetensors:   0%|          | 0.00/2.79G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/1.13M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/128 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/448 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/237k [00:00<?, ?B/s]

You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.


Replacing MoE Block...:   0%|          | 0/48 [00:00<?, ?it/s]

Replacing layers...: 100%|██████████| 48/48 [00:07<00:00,  6.18it/s]


CPU times: user 1h 44min 26s, sys: 26min 26s, total: 2h 10min 53s
Wall time: 28min 55s


In [5]:
!nvidia-smi

Sat Apr 19 22:53:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:87:00.0 Off |                    0 |
| N/A   30C    P0             71W /  400W |   60415MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [6]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What does means the torsion free in the general relativit?"},
        ]
    },
]

In [7]:
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.model.device)

In [8]:
%%time
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
)

CPU times: user 4min 28s, sys: 3min 39s, total: 8min 7s
Wall time: 8min 8s


In [9]:
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)

A question from the realm of differential geometry and general relativity!

In general relativity, "torsion-free" refers to a property of a connection on a manifold, specifically in the context of Riemannian geometry.

**Torsion** is a measure of how much a connection "twists" or "turns" a vector as it is parallel-transported around a closed loop. In other words, it's a measure of how much the connection deviates from being "flat" or "Euclidean".

A **torsion-free connection**, also known as a **symmetric connection**, is a connection that has zero torsion. This means that when you parallel-transport a vector around a closed loop, it returns to its original orientation, without any twisting or turning.

In mathematical terms, a torsion-free connection satisfies the following condition:

$$\Gamma^i_{jk} = \Gamma^i_{kj}$$

where $\Gamma^i_{jk}$ are the Christoffel symbols of the second kind, which define the connection.

In general relativity, the Levi-Civita connection is a fundamental 