Models API
xCloud allows you to easily fine-tune and deploy a selection of open-source models. The difference between deployments and model APIs is that with the latter you don't have to write custom inference code or worry about optimizations and related details: we manage all of that on your behalf. Your only task is to configure the fine-tuning job (for example, the number of epochs and the data you want to use) and the deployment.
The currently supported models that you can deploy or use as base models for your fine-tunings are:
- xCloud-LLaMA-V2-7B: it uses meta-llama/Llama-2-7b-hf from HuggingFace
- xCloud-LLaMA-V2-13B: it uses meta-llama/Llama-2-13b-hf from HuggingFace
- xCloud-LLaMA-V2-7B-Chat: it uses meta-llama/Llama-2-7b-chat-hf from HuggingFace
- xCloud-LLaMA-V2-13B-Chat: it uses meta-llama/Llama-2-13b-chat-hf from HuggingFace
In this tutorial we will fine-tune the xCloud-LLaMA-V2-7B
model on the Dolly dataset, an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
1. Fine-tuning
1.1. Prepare your dataset
Your first task is to convert the Dolly dataset into the xCloud format by completing the following steps. You can check the format itself in the info boxes further down.
- Load the Dolly dataset using the HF datasets library and write a function to format it.
from datasets import load_dataset
dataset = load_dataset("databricks/databricks-dolly-15k")
def format_samples(sample):
    sample["input"] = "### Instruction ###\n{}\n\n{}".format(sample['instruction'], sample['context'])
    sample["output"] = "Response: {}".format(sample['response'])
    return sample
- Get a random sample and print it to confirm that the format looks as expected.
from random import randrange
example = format_samples(dataset['train'][randrange(len(dataset['train']))])
print(example)
- Finally, format your dataset and save it in your local file system in the JSONL format.
# Format dataset
format_dataset = dataset.map(
    format_samples,
    remove_columns=dataset["train"].column_names,
)
# Save it in JSONL format
format_dataset['train'].to_json("dolly_xcloud.jsonl")
For unsupervised fine-tuning (ModelsAPIFinetuningType.UNSUPERVISED_FINETUNING), the file that you provide should be a JSONL file with the field text in every JSON document. Below you can find an example:
{"text": "This is an example"}
{"text": "Unsupervised fine-tuning example"}
For instruction fine-tuning (ModelsAPIFinetuningType.INSTRUCTION_FINETUNING), the file you supply must be in JSONL format, containing input and output fields in each JSON document. The input field represents the input you'd feed to the model during inference, while the output field should reflect the output you expect the model to generate. Below you can find an example, followed by a small sanity check you can run locally:
{"input":"### Instruction ###\nWhich is a species of fish? Tope or Rope\n\n","output":"Response: Tope"}
{"input":"### Instruction ###\nWhy can camels survive for long without water?\n\n","output":"Response: Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."}
1.2. Configure the fine-tuning job
- Upload your dataset to S3. In this guide we will use the following S3 base path to save all the artifacts: s3://xcloud/on_premise_model_apis/. A possible upload command is sketched below.
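One way to upload the file is with boto3, reusing the AWS credentials from your environment. The bucket and key below are derived from the base path above; adjust them if your layout differs.
import boto3

s3 = boto3.client("s3")  # Uses AWS credentials from your environment
s3.upload_file(
    Filename="dolly_xcloud.jsonl",
    Bucket="xcloud",
    Key="on_premise_model_apis/dolly_xcloud.jsonl",
)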
- Create the fine-tuning job. You will have to specify the path where you saved your dataset (dataset_cloud_path), where you want to save the fine-tuning code that will be automatically uploaded to your S3 path (saving_finetuning_code_cloud_path), where you want to save the fine-tuned model (saving_model_cloud_path), the credentials to access S3 (credentials) and finally your cloud link (link_name).
from xcloud import (
    OnPremiseModelsAPIFinetuning,
    ModelsAPIFinetuningType,
    ModelsAPIModelFamily,
    OnPremiseModelsAPIClient,
    Credentials,
    Cloud
)
import os

finetuning_name = "my-first-finetuning"
link_name = "my-link"

finetuning = OnPremiseModelsAPIFinetuning(
    finetuning_name=finetuning_name,
    finetuning_type=ModelsAPIFinetuningType.INSTRUCTION_FINETUNING,
    model_family=ModelsAPIModelFamily.LLAMA_V2_7B,
    num_epochs=1,
    dataset_cloud_path="s3://xcloud/on_premise_model_apis/dolly_xcloud.jsonl",
    saving_finetuning_code_cloud_path="s3://xcloud/on_premise_model_apis/code",
    saving_model_cloud_path="s3://xcloud/on_premise_model_apis/finetuned_model",
    credentials=Credentials(
        cloud=Cloud.AWS,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        aws_region=os.environ["AWS_DEFAULT_REGION"]
    ),
    link_name=link_name
)

finetuning = OnPremiseModelsAPIClient.create_finetuning(finetuning=finetuning)
- Check the status of the fine-tuning job.
finetuning = OnPremiseModelsAPIClient.get_finetuning_by_name(finetuning_name=finetuning_name)
print(finetuning.status)
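Fine-tuning can take a while, so you may prefer to poll the status periodically instead of checking it by hand. The sketch below relies only on get_finetuning_by_name; the terminal status values it checks for are assumptions, so adjust them to match the SDK's actual status values.
import time

while True:
    finetuning = OnPremiseModelsAPIClient.get_finetuning_by_name(finetuning_name=finetuning_name)
    print(finetuning.status)
    # Hypothetical terminal states; check the SDK's status values for the real names
    if str(finetuning.status).endswith(("SUCCEEDED", "FAILED", "CANCELLED")):
        break
    time.sleep(60)  # Poll once a minute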
Below you can find the properties of a fine-tuning object.
- finetuning_name: the name for your fine-tuning job.
- finetuning_type: you can select between ModelsAPIFinetuningType.UNSUPERVISED_FINETUNING and ModelsAPIFinetuningType.INSTRUCTION_FINETUNING.
- model_family: the base model that will be used for the fine-tuning. You can select between:
  - ModelsAPIModelFamily.LLAMA_V2_7B: it uses meta-llama/Llama-2-7b-hf from HuggingFace
  - ModelsAPIModelFamily.LLAMA_V2_13B: it uses meta-llama/Llama-2-13b-hf from HuggingFace
  - ModelsAPIModelFamily.LLAMA_V2_7B_CHAT: it uses meta-llama/Llama-2-7b-chat-hf from HuggingFace
  - ModelsAPIModelFamily.LLAMA_V2_13B_CHAT: it uses meta-llama/Llama-2-13b-chat-hf from HuggingFace
- num_epochs: the number of epochs to train your model for.
- dataset_cloud_path: the cloud path of the dataset used for the fine-tuning.
- saving_finetuning_code_cloud_path: S3 path where the fine-tuning code will be automatically uploaded.
- saving_model_cloud_path: S3 path where the fine-tuned model will be saved.
- credentials: the credentials to access S3. The credentials are not stored in our databases, which is why they will be empty when you fetch a fine-tuning job.
- link_name: compute cluster link in which the fine-tuning job will be executed.
- status: status of the fine-tuning job.
- metadata: contains information related to the creation, cancellation, and deletion of the fine-tuning job.
2. Deployments
2.1. Configure the deployment
- Create the deployment.
from xcloud import (
    OnPremiseModelsAPIDeployment,
    DeploymentSpecs,
    Scaling,
    ModelsAPIModelFamily,
    Credentials,
    Cloud
)
import os

deployment_name = "my-first-deployment"

deployment = OnPremiseModelsAPIDeployment(
    deployment_name=deployment_name,
    deployment_specs=DeploymentSpecs(
        authentication=True,
        scaling=Scaling(min_replicas=1, max_replicas=1)
    ),
    model_family=ModelsAPIModelFamily.LLAMA_V2_7B,
    model_cloud_path="s3://xcloud/on_premise_model_apis/finetuned_model",
    credentials=Credentials(
        cloud=Cloud.AWS,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        aws_region=os.environ["AWS_DEFAULT_REGION"]
    ),
    link_name="test-link"
)
from xcloud import OnPremiseModelsAPIClient
deployment = OnPremiseModelsAPIClient.create_deployment(deployment)
- Get your deployment.
from xcloud import OnPremiseModelsAPIClient
deployment = OnPremiseModelsAPIClient.get_deployment_by_name(deployment_name)
A deployment object has the following properties:
- workspace_id: workspace in which the deployment was created.
- deployment_name: the deployment name.
- status: status of the deployment.
- model_family: the model family (type) that will be deployed. It is needed to apply specific optimizations to that model.
- model_cloud_path: S3 path containing the model weights. You can use the weights of a fine-tuned model. If this property is None, the default weights (depending on the model family) will be used.
- credentials: the credentials to access S3. The credentials are not stored in our databases, which is why they will be empty when you fetch a deployment.
- link_name: compute cluster link where the deployment will be running.
- deployment_specs: deployment configuration options such as autoscaling, authentication and dynamic batching (a configuration sketch is shown after this property list).
  - batcher (dynamic batching): you can define the maximum batch size and waiting time (in milliseconds). The server will wait for a maximum of max_latency to create a batch of up to max_batch_size samples.
  - scaling: you can set the min_replicas that will always be running, and the max_replicas that will scale up based on the chosen metric. Two available metrics for autoscaling deployments are:
    - SCALE_METRIC.CONCURRENCY: represents the number of simultaneous requests that each replica can process. With a target_scaling_value of 1, if there are 3 concurrent requests, the deployment will scale up to 3 replicas.
    - SCALE_METRIC.RPS: sets a target for requests per second per replica.
  - authentication: when set to True, an API key will be generated for the deployment, which must be used for making inferences.
- inference: information needed to run inference.
- ready_endpoint: to check if the deployment is ready to start receiving inferences.
- infer_endpoint: the endpoint where the inferences should be sent.
- api_key: API key in case the authentication is enabled.
- metadata: contains information related to the creation, cancellation, and deletion of the deployment.
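As a reference, here is how those deployment_specs options might fit together. The min_replicas/max_replicas arguments follow the example above; the autoscaling metric, target value and dynamic-batching arguments are shown commented out because their exact class and parameter names in the SDK are assumptions.
from xcloud import DeploymentSpecs, Scaling

deployment_specs = DeploymentSpecs(
    # API-key authentication for the inference endpoint
    authentication=True,
    scaling=Scaling(
        min_replicas=1,   # Replicas that are always running
        max_replicas=3,   # Upper bound when scaling up
        # Hypothetical autoscaling arguments; check the SDK for the real names:
        # scale_metric=SCALE_METRIC.CONCURRENCY,
        # target_scaling_value=1,
    ),
    # Hypothetical dynamic-batching config (max_batch_size / max_latency in ms):
    # batcher=Batcher(max_batch_size=8, max_latency=100),
)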
- Wait until the deployment is ready.
OnPremiseModelsAPIClient.wait_until_deployment_is_ready(deployment_name=deployment_name)
- Do the inference.
import requests

response = requests.post(
    url=deployment.inference.infer_endpoint,
    headers={
        # API key header, required because authentication is enabled
        "x-api-key": deployment.inference.api_key
    },
    json={
        "instances": ["### Instruction ###\nGenerate a joke\n\nResponse:"]
    }
)
response.raise_for_status()
print(response.json())
- You can delete the deployment if you no longer want to access it.
deleted_deployment = OnPremiseModelsAPIClient.delete_deployment(deployment_name=deployment_name)