Models API
xCloud allows you to easily fine-tune and deploy a set of open-source models. The difference between deployments and the Models API is that with the latter you don't need to write custom inference code or worry about optimizations and related details; we manage all of that for you. Your only task is to configure the fine-tuning job (for example, the number of epochs and the data you want to use) and the deployment.
The currently supported models, which you can deploy directly or use as base models for your fine-tunings, are:
- xCloud-LLaMA-V2-7B: it uses meta-llama/Llama-2-7b-hf from HuggingFace
- xCloud-LLaMA-V2-13B: it uses meta-llama/Llama-2-13b-hf from HuggingFace
- xCloud-LLaMA-V2-7B-Chat: it uses meta-llama/Llama-2-7b-chat-hf from HuggingFace
- xCloud-LLaMA-V2-13B-Chat: it uses meta-llama/Llama-2-13b-chat-hf from HuggingFace
In this tutorial we will fine-tune the xCloud-LLaMA-V2-7B
model on the Dolly dataset, an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
1. Finetuning
1.1. Prepare your dataset
Your first task is to convert the Dolly dataset into the xCloud format. To do that, follow the steps below. You can check the format itself in the info boxes further down.
- Load the Dolly dataset using the HF datasets library and write a function to format it.
from datasets import load_dataset
dataset = load_dataset("databricks/databricks-dolly-15k")
def format_samples(sample):
    sample["input"] = "### Instruction ###\n{}\n\n{}".format(sample['instruction'], sample['context'])
    sample["output"] = "Response: {}".format(sample['response'])
    return sample
- Get a random sample and print it to confirm that the format looks the way you expect.
from random import randrange
example = format_samples(dataset['train'][randrange(len(dataset['train']))])
print(example)
- Finally, format your dataset and save it to your local file system in JSONL format.
# Format dataset
format_dataset = dataset.map(
    format_samples,
    remove_columns=dataset["train"].column_names,
)
# Save it in JSONL format
format_dataset['train'].to_json("dolly_xcloud.jsonl")
For unsupervised fine-tuning, the file that you provide should be a JSONL file with the field text in every JSON document. Below you can find an example:
{"text": "This is an example"}
{"text": "Unsupervised fine-tuning example"}
For instruction fine-tuning, the file you supply must be in JSONL format, containing input and output fields in each JSON document. The input field represents the input you'd feed to the model during inference, while the output field should reflect the model's intended inference output. Below you can find an example:
{"input":"### Instruction ###\nWhich is a species of fish? Tope or Rope\n\n","output":"Response: Tope"}
{"input":"### Instruction ###\nWhy can camels survive for long without water?\n\n","output":"Response: Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."}
1.2. Configure the fine-tuning job
- Upload the dataset and create the fine-tuning job.
from xcloud import ModelsAPIFinetuning, ModelsAPIFinetuningType, ModelsAPIModelFamily, ModelsAPIClient, Location
finetuning_name = "my-first-finetuning"
model_name = "my-first-model"
dataset_id = ModelsAPIClient.upload_dataset(dataset_path="dolly_xcloud.jsonl")
finetuning = ModelsAPIFinetuning(
    finetuning_name=finetuning_name,
    model_name=model_name,
    finetuning_type=ModelsAPIFinetuningType.INSTRUCTION_FINETUNING,
    model_family=ModelsAPIModelFamily.LLAMA_V2_7B,
    num_epochs=1,
    dataset_id=dataset_id,
    location=Location.AZURE_EAST_US
)
finetuning = ModelsAPIClient.create_finetuning(finetuning=finetuning)
We currently support the following compute regions:
- Location.GCP_US_CENTRAL_1
- Location.AZURE_EAST_US
- Check the status of the fine-tuning job.
finetuning = ModelsAPIClient.get_finetuning_by_name(finetuning_name=finetuning_name)
print(finetuning.status)
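If you prefer to block until the fine-tuned model is ready instead of re-running this check by hand, a minimal polling sketch along these lines can be used. Importing Status from xcloud and the sleep interval are assumptions; get_model_by_name and the Status.READY / Status.NOT_READY values are described in the Models section below:
import time
from xcloud import ModelsAPIClient, Status  # importing Status from xcloud is an assumption
# Poll the fine-tuned model until it is ready to be deployed.
# Note: if the fine-tuning job fails, the model stays NOT_READY, so you may
# also want to inspect finetuning.status inside the loop.
while True:
    model = ModelsAPIClient.get_model_by_name(model_name=model_name)
    if model.status == Status.READY:
        break
    time.sleep(60)  # arbitrary polling interval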
Below you can find the properties of a fine-tuning object.
- finetuning_name: the name for your fine-tuning job.
- model_name: the name of your fine-tuned model.
- finetuning_type: you can select between ModelsAPIFinetuningType.UNSUPERVISED_FINETUNING and ModelsAPIFinetuningType.INSTRUCTION_FINETUNING (an unsupervised example is sketched after this list).
- model_family: the base model that will be used for the fine-tuning. You can select between:
  - ModelsAPIModelFamily.LLAMA_V2_7B: it uses meta-llama/Llama-2-7b-hf from HuggingFace
  - ModelsAPIModelFamily.LLAMA_V2_13B: it uses meta-llama/Llama-2-13b-hf from HuggingFace
  - ModelsAPIModelFamily.LLAMA_V2_7B_CHAT: it uses meta-llama/Llama-2-7b-chat-hf from HuggingFace
  - ModelsAPIModelFamily.LLAMA_V2_13B_CHAT: it uses meta-llama/Llama-2-13b-chat-hf from HuggingFace
- num_epochs: the number of epochs for which you want to train your model.
- dataset_id: the ID of the dataset used for the fine-tuning.
- metadata: contains information related to the creation, cancellation, and deletion of the finetuning job.
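As a hedged sketch of the other fine-tuning type, an unsupervised job looks like the instruction one above, only with finetuning_type set to UNSUPERVISED_FINETUNING and a dataset in the text-field format. The file name and the job/model names below are hypothetical:
# Sketch of an unsupervised fine-tuning job, assuming a JSONL file where
# every document has a "text" field (see the format note in section 1.1).
unsup_dataset_id = ModelsAPIClient.upload_dataset(dataset_path="my_corpus.jsonl")  # hypothetical file
unsup_finetuning = ModelsAPIFinetuning(
    finetuning_name="my-unsupervised-finetuning",
    model_name="my-unsupervised-model",
    finetuning_type=ModelsAPIFinetuningType.UNSUPERVISED_FINETUNING,
    model_family=ModelsAPIModelFamily.LLAMA_V2_7B,
    num_epochs=1,
    dataset_id=unsup_dataset_id,
    location=Location.AZURE_EAST_US
)
unsup_finetuning = ModelsAPIClient.create_finetuning(finetuning=unsup_finetuning)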
2. Models
You can list the existing base models and the fine-tuned ones. To do so, run the following Python code:
models = ModelsAPIClient.get_models()
print(models)
You can also get a specific model.
model = ModelsAPIClient.get_model_by_name(model_name=model_name)
print(model)
You can delete a model at any moment. Take into account that if you delete a model that has running fine-tunings or deployments linked to it, all of them will also be deleted.
model = ModelsAPIClient.delete_model(model_name=model_name)
A model object has the following properties:
- workspace_id: workspace in which the model was created.
- model_name: the model name.
- model_type: it can be one of these 2 values: ModelsAPIModelType.FINETUNED_MODEL and ModelsAPIModelType.BASE_MODEL.
- status: if it is Status.READY then this model is ready to be deployed. If the model status is Status.NOT_READY it means that the model is still being fine-tuned or that the fine-tuning job has failed.
- metadata: contains information related to the creation, cancellation, and deletion of the model.
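As an illustration of how these properties can be used together, the short sketch below lists only the fine-tuned models that are ready to be deployed. Attribute access on the returned model objects and the Status import are assumptions based on the properties listed above:
from xcloud import ModelsAPIClient, ModelsAPIModelType, Status  # Status import is an assumption
# Keep only fine-tuned models that are ready to be deployed.
models = ModelsAPIClient.get_models()
ready_finetuned = [
    m for m in models
    if m.model_type == ModelsAPIModelType.FINETUNED_MODEL and m.status == Status.READY
]
print([m.model_name for m in ready_finetuned])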
3. Deployments
3.1. Configure the deployment
- Create the deployment.
from xcloud import ModelsAPIDeployment, ModelsAPIClient, DeploymentSpecs, Scaling, Location
deployment_name = "my-first-deployment"
deployment = ModelsAPIDeployment(
    deployment_name=deployment_name,
    model_name=model_name,
    deployment_specs=DeploymentSpecs(
        authentication=True,
        scaling=Scaling(min_replicas=1, max_replicas=1)
    ),
    location=Location.AZURE_EAST_US
)
deployment = ModelsAPIClient.create_deployment(deployment)
We currently support the following compute regions:
- Location.GCP_US_CENTRAL_1
- Location.AZURE_EAST_US
- Get your deployment.
from xcloud import ModelsAPIClient
deployment = ModelsAPIClient.get_deployment_by_name(deployment_name)
A deployment object has the following properties:
- workspace_id: workspace in which the deployment was created.
- deployment_name: the deployment name.
- status: status of the deployment.
- deployment_specs: deployment configurations like the autoscaling, auth and dynamic batching.
  - batcher (dynamic batching): you can define the maximum batch size and waiting time (in milliseconds). The server will wait for a maximum of max_latency to create a batch of up to max_batch_size samples.
  - scaling: you can set the min_replicas that will always be running, and the max_replicas that the deployment will scale up to based on the chosen metric (see the sketch after this list). Two available metrics for autoscaling deployments are: SCALE_METRIC.CONCURRENCY, which represents the number of simultaneous requests that each replica can process (with a target_scaling_value of 1, if there are 3 concurrent requests, the deployment will scale up to 3 replicas), and SCALE_METRIC.RPS, which sets a target for requests per second per replica.
  - authentication: when set to True, an API key will be generated for the deployment, which must be used for making inferences.
- inference: information needed to do the inference.
  - ready_endpoint: to check if the deployment is ready to start receiving inferences.
  - infer_endpoint: the endpoint where the inferences should be sent.
  - api_key: API key in case the authentication is enabled.
- metadata: contains information related to the creation, cancellation, and deletion of the deployment.
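To make the scaling and batching options concrete, below is a hedged sketch of a deployment that enables autoscaling and dynamic batching. The Batcher and SCALE_METRIC imports, and the exact keyword arguments on DeploymentSpecs and Scaling, are assumptions based on the property descriptions above, not confirmed API signatures:
# Sketch only: Batcher, SCALE_METRIC and the keyword arguments below are
# assumptions derived from the property descriptions above.
from xcloud import ModelsAPIDeployment, DeploymentSpecs, Scaling, Batcher, SCALE_METRIC, Location
autoscaled_deployment = ModelsAPIDeployment(
    deployment_name="my-autoscaled-deployment",
    model_name=model_name,
    deployment_specs=DeploymentSpecs(
        authentication=True,
        scaling=Scaling(
            min_replicas=1,
            max_replicas=3,
            scale_metric=SCALE_METRIC.CONCURRENCY,  # scale on concurrent requests per replica
            target_scaling_value=1,
        ),
        batcher=Batcher(max_batch_size=8, max_latency=100),  # batch up to 8 requests, wait up to 100 ms
    ),
    location=Location.AZURE_EAST_US
)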
- Wait until the deployment is ready.
ModelsAPIClient.wait_until_deployment_is_ready(deployment_name=deployment_name)
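Alternatively, you can check readiness yourself through the ready_endpoint listed in the deployment's inference information. The sketch below simply treats an HTTP 200 response as ready, which is an assumption about the endpoint's behavior:
import requests
# Assumption: the ready endpoint returns HTTP 200 once the deployment can serve requests.
# The API key header may not be required for this endpoint; it is included here for safety.
ready_response = requests.get(
    url=deployment.inference.ready_endpoint,
    headers={"x-api-key": deployment.inference.api_key},
)
print("Deployment ready:", ready_response.status_code == 200)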
- Do the inference.
import requests
response = requests.post(
    url=deployment.inference.infer_endpoint,
    headers={
        "x-api-key": deployment.inference.api_key
    },
    json={
        "instances": ["### Instruction ###\nGenerate a joke\n\nResponse:"]
    }
)
response.raise_for_status()
response.json()
- You can delete the deployment if you no longer want to access it.
deleted_deployment = ModelsAPIClient.delete_deployment(deployment_name=deployment_name)
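Finally, if you also want to remove the fine-tuned model created in this tutorial, you can delete it with the call shown in the Models section. Remember that this also removes any fine-tunings or deployments still linked to it:
# Clean up the fine-tuned model as well (this also removes linked
# fine-tunings and deployments, as described in the Models section).
deleted_model = ModelsAPIClient.delete_model(model_name=model_name)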