---
base_model: llm-jp/llm-jp-3-13b
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
license: apache-2.0
language:
- en
---

# Uploaded model

- **Developed by:** erikomaru
- **License:** apache-2.0
- **Finetuned from model:** llm-jp/llm-jp-3-13b

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

# Sample use

The script below uses the QLoRA adapter trained with the Unsloth library to generate outputs for the ELYZA-tasks-100-TV benchmark tasks. It assumes the adapter has been uploaded to Hugging Face. The base model is loaded with 4-bit NF4 quantization to keep memory usage low, the fine-tuned LoRA adapter is attached with PEFT, and the model is then switched to Unsloth's fast inference mode. The adapter is intended for Japanese natural language processing tasks that require precise instruction following.

```python
# Import necessary libraries
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import PeftModel
from unsloth import FastLanguageModel
import torch
from tqdm import tqdm
import json
import re

# Set your Hugging Face token and the model/adapter IDs
HF_TOKEN = "your-token"  # Replace with your token
model_id = "llm-jp/llm-jp-3-13b"          # Base model
adapter_id = "erikomaru/llm-jp-3-13b-it"  # Fine-tuned LoRA adapter

# Step 1: Configure 4-bit quantization settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Step 2: Load the base model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN,
)

# Step 3: Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, token=HF_TOKEN)

# Integrate the LoRA adapter into the base model
model = PeftModel.from_pretrained(model, adapter_id, token=HF_TOKEN)

# Step 4: Load the dataset
datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""

# Switch the model to inference mode
FastLanguageModel.for_inference(model)

# Step 5: Run inference on the dataset
results = []
for data in tqdm(datasets):
    input = data["input"]

    # Construct the prompt (指示 = instruction, 回答 = answer)
    prompt = f"""### 指示
{input}
### 回答
"""

    # Tokenize the input
    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(tokenized_input)

    # Generate output
    with torch.no_grad():
        outputs = model.generate(
            tokenized_input,
            attention_mask=attention_mask,
            max_new_tokens=100,
            do_sample=False,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,
        )[0]
    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)

    results.append({"task_id": data["task_id"], "input": input, "output": output})

# Step 6: Save results to a JSONL file
jsonl_id = re.sub(".*/", "", adapter_id)
with open(f"./{jsonl_id}-outputs.jsonl", "w", encoding="utf-8") as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)  # ensure_ascii=False for handling non-ASCII (Japanese) characters
        f.write("\n")
```
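
Because the adapter was trained with Unsloth, it can alternatively be loaded in one step with `FastLanguageModel.from_pretrained`, which handles the 4-bit quantization internally so no explicit `BitsAndBytesConfig` is needed. The following is a minimal sketch under assumptions not stated above: the `max_seq_length` value and the sample prompt are illustrative only.

```python
# Alternative loading path via Unsloth (minimal sketch).
# NOTE: max_seq_length and the sample prompt are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="erikomaru/llm-jp-3-13b-it",  # LoRA adapter repo; the base model is resolved from its config
    max_seq_length=2048,                     # assumption: adjust to the context length you need
    dtype=None,                              # auto-detect (bfloat16 on recent GPUs)
    load_in_4bit=True,                       # 4-bit quantization, matching the script above
    token="your-token",                      # your Hugging Face token
)
FastLanguageModel.for_inference(model)       # enable Unsloth's faster inference mode

# Same prompt format as the script above; the task text is an illustrative example.
prompt = "### 指示\n日本の有名な観光地を3つ挙げてください。\n### 回答\n"
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```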