

Models API

xCloud lets you easily fine-tune and deploy a selection of open-source models. The difference between deployments and the Models API is that with the latter you don't need to write custom inference code or worry about optimizations and related concerns; we manage all of that on your behalf. Your only task is to configure the fine-tuning job (for example, the number of epochs and the data you want to use) and the deployment.

The currently supported models, which you can deploy directly or use as base models for your fine-tunings, are:

  • xCloud-LLaMA-V2-7B: it uses meta-llama/Llama-2-7b-hf from HuggingFace
  • xCloud-LLaMA-V2-13B: it uses meta-llama/Llama-2-13b-hf from HuggingFace
  • xCloud-LLaMA-V2-7B-Chat: it uses meta-llama/Llama-2-7b-chat-hf from HuggingFace
  • xCloud-LLaMA-V2-13B-Chat: it uses meta-llama/Llama-2-13b-chat-hf from HuggingFace

In this tutorial we will fine-tune the xCloud-LLaMA-V2-7B model on the Dolly dataset, an open-source dataset of instruction-following records generated by thousands of Databricks employees across several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

1. Finetuning

1.1. Prepare your dataset

Your first task is to convert the Dolly dataset into the xCloud format. To do so, follow the steps below; the format itself is described in the info box at the end of this section.

  1. Load the Dolly dataset using the HF datasets library and write a function to format it.
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k")

def format_samples(sample):
    sample["input"] = "### Instruction ###\n{}\n\n{}".format(sample['instruction'], sample['context'])
    sample["output"] = "Response: {}".format(sample['response'])

    return sample
  2. Get a random sample and print it to confirm that the format looks right.
from random import randrange

example = format_samples(dataset['train'][randrange(len(dataset['train']))])
print(example)
  3. Finally, format your dataset and save it to your local file system in JSONL format.
# Format dataset
format_dataset = dataset.map(
    format_samples,
    remove_columns=dataset["train"].column_names,
)

# Save it in JSONL format
format_dataset['train'].to_json("dolly_xcloud.jsonl")
info
xCloud dataset format for unsupervised fine-tuning

The file that you provide should be a JSONL file with a text field in every JSON document. Below you can find an example:

{"text": "This is a example"} 
{"text": "Unservised fine-tuning example"}
xCloud dataset format for instruction fine-tuning

The file you supply must be in JSONL format, with input and output fields in each JSON document. The input field is the input you would feed to the model at inference time, while the output field should be the model's intended output. Below you can find an example:

{"input":"### Instruction ###\nWhich is a species of fish? Tope or Rope\n\n","output":"Response: Tope"}
{"input":"### Instruction ###\nWhy can camels survive for long without water?\n\n","output":"Response: Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."}

1.2. Configure the fine-tuning job

  1. Upload the dataset and create the fine-tuning job.

from xcloud import ModelsAPIFinetuning, ModelsAPIFinetuningType, ModelsAPIModelFamily, ModelsAPIClient, Location

finetuning_name = "my-first-finetuning"
model_name = "my-first-model"
dataset_id = ModelsAPIClient.upload_dataset(dataset_path="dolly_xcloud.jsonl")

finetuning = ModelsAPIFinetuning(
    finetuning_name=finetuning_name,
    model_name=model_name,
    finetuning_type=ModelsAPIFinetuningType.INSTRUCTION_FINETUNING,
    model_family=ModelsAPIModelFamily.LLAMA_V2_7B,
    num_epochs=1,
    dataset_id=dataset_id,
    location=Location.AZURE_EAST_US
)

finetuning = ModelsAPIClient.create_finetuning(finetuning=finetuning)
info

We currently support the following compute regions:

  • Location.GCP_US_CENTRAL_1
  • Location.AZURE_EAST_US
We highly advise opting for the Location.AZURE_EAST_US region when using A100 (80GB) GPUs due to ongoing high demand issues in GCP.
  2. Check the status of the fine-tuning job.
finetuning = ModelsAPIClient.get_finetuning_by_name(finetuning_name=finetuning_name)
print(finetuning.status)
info

Below you can find the properties of a fine-tuning object.

  • finetuning_name: the name of your fine-tuning job.
  • model_name: the name of your fine-tuned model.
  • finetuning_type: you can select between ModelsAPIFinetuningType.UNSUPERVISED_FINETUNING and ModelsAPIFinetuningType.INSTRUCTION_FINETUNING.
  • model_family: the base model that will be used for the fine-tuning. You can select between:
    • ModelsAPIModelFamily.LLAMA_V2_7B: it uses meta-llama/Llama-2-7b-hf from HuggingFace
    • ModelsAPIModelFamily.LLAMA_V2_13B: it uses meta-llama/Llama-2-13b-hf from HuggingFace
    • ModelsAPIModelFamily.LLAMA_V2_7B_CHAT: it uses meta-llama/Llama-2-7b-chat-hf from HuggingFace
    • ModelsAPIModelFamily.LLAMA_V2_13B_CHAT: it uses meta-llama/Llama-2-13b-chat-hf from HuggingFace
  • num_epochs: the number of epochs for which to train your model.
  • dataset_id: the ID of the dataset used for the fine-tuning.
  • metadata: contains information related to the creation, cancellation, and deletion of the fine-tuning job.
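If you prefer to block until the job finishes instead of checking manually, a small polling loop works. This is a minimal sketch: the terminal status values used below are assumptions for illustration, not values documented here, so substitute the statuses your jobs actually report.

import time

from xcloud import ModelsAPIClient

while True:
    finetuning = ModelsAPIClient.get_finetuning_by_name(finetuning_name=finetuning_name)
    print(f"Fine-tuning status: {finetuning.status}")

    # "SUCCEEDED" and "FAILED" are assumed terminal states for illustration.
    if str(finetuning.status) in {"SUCCEEDED", "FAILED"}:
        break

    time.sleep(60)  # check once a minute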

2. Models

You can list the existing base models and your fine-tuned ones. To do so, run the following Python code:

models = ModelsAPIClient.get_models()
print(models)

You can also retrieve a specific model:

model = ModelsAPIClient.get_model_by_name(model_name=model_name)
print(model)

You can delete the model at any time. Keep in mind that if you delete a model that has running fine-tunings or deployments linked to it, those will be deleted as well.

model = ModelsAPIClient.delete_model(model_name=model_name)
info

A model object has the following properties:

  • workspace_id: workspace in which the model was created.
  • model_name: the model name.
  • model_type: one of two values, ModelsAPIModelType.FINETUNED_MODEL or ModelsAPIModelType.BASE_MODEL.
  • status: if it is Status.READY, the model is ready to be deployed. If it is Status.NOT_READY, the model is either still being fine-tuned or the fine-tuning job has failed.
  • metadata: contains information related to the creation, cancellation, and deletion of the model.
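Before creating a deployment, you may want to confirm that the fine-tuned model has reached Status.READY. A minimal sketch, assuming Status can be imported from the xcloud package alongside the other names used in this tutorial:

from xcloud import ModelsAPIClient, Status  # Status import path is an assumption

model = ModelsAPIClient.get_model_by_name(model_name=model_name)

if model.status == Status.READY:
    print(f"{model.model_name} is ready to be deployed")
else:
    # NOT_READY means the fine-tuning is still running or it has failed
    print(f"{model.model_name} is not ready yet (status: {model.status})")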

3. Deployments

3.1. Configure the deployment

  1. Create the deployment.
from xcloud import ModelsAPIDeployment, DeploymentSpecs, Scaling, Location, ModelsAPIClient

deployment_name = "my-first-deployment"

deployment = ModelsAPIDeployment(
    deployment_name=deployment_name,
    model_name=model_name,
    deployment_specs=DeploymentSpecs(
        authentication=True,
        scaling=Scaling(min_replicas=1, max_replicas=1)
    ),
    location=Location.AZURE_EAST_US
)

deployment = ModelsAPIClient.create_deployment(deployment)
info

We currently support the following compute regions:

  • Location.GCP_US_CENTRAL_1
  • Location.AZURE_EAST_US
We highly advise opting for the Location.AZURE_EAST_US region when using A100 (80GB) GPUs due to ongoing high demand issues in GCP.
  2. Get your deployment.
from xcloud import ModelsAPIClient

deployment = ModelsAPIClient.get_deployment_by_name(deployment_name)
info

A deployment object has the following properties:

  • workspace_id: workspace in which the deployment was created.
  • deployment_name: the deployment name.
  • status: status of the deployment.
  • deployment_specs: deployment configuration such as autoscaling, authentication, and dynamic batching (see the sketch after this list).
    • batcher (dynamic batching): you can define the maximum batch size and waiting time (in milliseconds). The server will wait for at most max_latency to build a batch of up to max_batch_size samples.
    • scaling: you can set the min_replicas that will always be running and the max_replicas up to which the deployment will scale based on the chosen metric. The two available metrics for autoscaling deployments are:
      • SCALE_METRIC.CONCURRENCY: represents the number of simultaneous requests that each replica can process. With a target_scaling_value of 1, if there are 3 concurrent requests, the deployment will scale up to 3 replicas.
      • SCALE_METRIC.RPS: sets a target for requests per second per replica.
    • authentication: when set to True, an API key will be generated for the deployment, which must be used for making inferences.
  • inference: information needed to perform inference.
    • ready_endpoint: to check if the deployment is ready to start receiving inferences.
    • infer_endpoint: the endpoint where the inferences should be sent.
    • api_key: API key in case the authentication is enabled.
  • metadata: contains information related to the creation, cancellation, and deletion of the deployment.
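As a concrete illustration of the deployment_specs options described above, here is a sketch of a deployment configured with dynamic batching and concurrency-based autoscaling. The Batcher class and the scale_metric and target_scaling_value parameter names are assumptions inferred from the property descriptions; check their exact import paths and signatures against your xcloud client before using this.

from xcloud import ModelsAPIDeployment, DeploymentSpecs, Scaling, Batcher, SCALE_METRIC, Location

# Batcher and the scaling parameters below follow the properties described
# above (max_batch_size, max_latency in ms, replica counts, scale metric);
# their exact names are assumptions for illustration.
autoscaled_deployment = ModelsAPIDeployment(
    deployment_name="my-autoscaled-deployment",
    model_name=model_name,
    deployment_specs=DeploymentSpecs(
        authentication=True,
        # Wait at most 100 ms to build a batch of up to 8 samples
        batcher=Batcher(max_batch_size=8, max_latency=100),
        # Keep 1 replica warm, scale up to 4 when each replica handles
        # more than 1 concurrent request
        scaling=Scaling(
            min_replicas=1,
            max_replicas=4,
            scale_metric=SCALE_METRIC.CONCURRENCY,
            target_scaling_value=1,
        ),
    ),
    location=Location.AZURE_EAST_US,
)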
  3. Wait until the deployment is ready.
ModelsAPIClient.wait_until_deployment_is_ready(deployment_name=deployment_name)
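Alternatively, you can poll the deployment's ready_endpoint yourself. A minimal sketch, assuming the endpoint simply returns a successful HTTP status once the deployment can accept inferences and that the x-api-key header also applies to it when authentication is enabled:

import time

import requests

from xcloud import ModelsAPIClient

deployment = ModelsAPIClient.get_deployment_by_name(deployment_name)

while True:
    # Assumption: a 2xx response from ready_endpoint means the deployment is ready.
    response = requests.get(
        url=deployment.inference.ready_endpoint,
        headers={"x-api-key": deployment.inference.api_key},
    )
    if response.ok:
        print("Deployment is ready")
        break
    time.sleep(10)  # try again in 10 seconds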
  4. Run an inference.
import requests

response = requests.post(
    url=deployment.inference.infer_endpoint,
    headers={
        "x-api-key": deployment.inference.api_key
    },
    json={
        "instances": ["### Instruction ###\nGenerate a joke\n\nResponse:"]
    }
)

response.raise_for_status()

print(response.json())
  5. Delete the deployment when you no longer need it.
deleted_deployment = ModelsAPIClient.delete_deployment(deployment_name=deployment_name)