qt-spyro-hf committed
Commit 823317d · verified · 1 Parent(s): 3dde0a6

Upload README.md

Files changed (1)
  1. README.md +3 -4
README.md CHANGED
@@ -18,16 +18,15 @@ By accessing this model, you are agreeing to the Llama 2 terms and conditions of

  ## Usage:

- CodeLlama-13B-QML is a medium-sized Language Model that requires significant computing resources to perform with inference (response) times suitable for automatic code completion. Therefore, it should be used with GPU accelerator, either in cloud environment such as AWS, Google Cloud, Microsoft Azure, or local.
+ CodeLlama-13B-QML is a medium-sized Language Model that requires significant computing resources to perform with inference (response) times suitable for automatic code completion. Therefore, it should be used with a GPU accelerator, either in the cloud environment such as AWS, Google Cloud, Microsoft Azure, or locally.

  Large Language Models, including CodeLlama-13B-QML, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building AI systems.

  ## How to run CodeLlama-13B-QML in cloud deployment:

- The configuration depends on your chosen cloud technology.
+ The configuration depends on the chosen cloud technology.

- Running a CodeLlama-13b-QML in the cloud requires working with Docker and vLLM for optimal performance. Make sure all required dependencies are installed (transformers, accelerate and peft modules). Use bfloat16 precision. The setup leverages the base model from Hugging Face (requiring an access token) combined with adapter weights from the repository. Using vLLM enables efficient inference with an OpenAI-compatible API endpoint, making integration straightforward. vLLM serves as a highly optimized backend that implements request batching and queuing mechanisms, providing excellent serving optimization. The deployment container should be run on an instance with GPU accelerator.
- The configuration has been thoroughly tested on Ubuntu 22.04 LTS running NVIDIA driver with A100 80GB GPUs, demonstrating stable and efficient performance.
+ Running a CodeLlama-13b-QML in the cloud requires working with Docker and vLLM for optimal performance. Make sure all required dependencies are installed (transformers, accelerate and peft modules). Use bfloat16 precision. The setup leverages the base model from Hugging Face (requiring an access token) combined with adapter weights from the repository. Using vLLM enables efficient inference with an OpenAI-compatible API endpoint, making integration straightforward. vLLM serves as a highly optimized backend that implements request batching and queuing mechanisms, providing excellent serving optimization. The docker container should be run on an instance with GPU accelerator. The configuration has been thoroughly tested on Ubuntu 22.04 LTS running NVIDIA driver with A100 80GB GPUs, demonstrating stable and efficient performance.

  ## How to run CodeLlama-13B-QML in ollama:

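Note on the cloud deployment paragraph in this change: it says vLLM serves the model through an OpenAI-compatible API endpoint. A minimal client sketch for such an endpoint might look like the following. The host, port, and model name are assumptions made for illustration and are not taken from the README. The prompt format the model expects is also not covered here.

```python
import json
import urllib.request

# Assumption: a vLLM server is already running and exposes its
# OpenAI-compatible API at this address (hypothetical host and port).
BASE_URL = "http://localhost:8000/v1"

# Assumption: hypothetical name under which the model or adapter is served.
MODEL_NAME = "codellama-13b-qml"


def build_completion_request(prompt, model=MODEL_NAME, max_tokens=128, temperature=0.0):
    """Build the JSON payload for the OpenAI-style completions route."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, base_url=BASE_URL, token=None, timeout=60):
    """Send one completion request and return the generated text."""
    payload = build_completion_request(prompt)
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed if the server was started with an API key.
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style completion responses carry the text in choices[0].text.
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires a reachable server; the QML snippet is only an example prompt.
    print(complete("import QtQuick\n\nRectangle {\n    width: 200\n"))
```

Only the standard library is used so the sketch stays independent of any particular client package. Since the server handles batching and queuing, a client like this can send requests one at a time.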