Create a Managed Inference Job (vLLM)
Create a vLLM Managed Inference Job with the CLI, then confirm it.
Create a Managed Inference Job from the CLI. Answer the prompts or pass flags, and CosmicAC deploys the language model behind an OpenAI-compatible chat endpoint.
Prerequisites
You need the following before you start:
- A running CosmicAC deployment. See Installation.
- The CosmicAC CLI installed and configured. See Install the CLI.
Steps
Create the job
CosmicAC recommends serving parameters, environment variables, and hardware for each supported model. If you do not follow these recommendations, the model can fail to deploy. See Recommended model parameters.
Create the job interactively by answering prompts, or non-interactively by passing flags.
Start the interactive job setup:
cosmicac jobs createSelect Managed Inference (vLLM) as the job type.
Set these fields:
- Job name — a name to identify the job.
- Tags — comma-separated labels for the job.
- Location — the region where the job runs.
- GPU type — the GPU to use. The CLI lists the GPUs available in your location.
- GPU count — the number of GPUs.
- Model — the model to serve. Select a listed model, or enter a Hugging Face model ID to bring your own.
- Data type — the numeric precision the model runs at.
- Quantisation — how to compress the model weights.
- Tensor parallel — how many GPUs to split the model across.
- GPU memory utilization — the fraction of GPU memory to use.
- Max model length — the maximum context length.
- Max concurrent sequences — the maximum requests handled at once.
- Reasoning parser — the parser for reasoning output.
- Video & image input — whether the model accepts multimodal input.
- Endpoint name — a name for the endpoint, used in its URL path.
- Replicas — how many copies of the model to run.
- Require Authorization header — whether callers must send an API key. See Create an API key.
- Root disk size — the VM root disk size in GB. Choose 250, 500, or 1000.
- Environment variables — optional variables passed to the serving container.
You can serve any model that vLLM supports. Browse the Hugging Face model hub or the vLLM supported models list to find one.
To serve a speech-to-text model instead, see Create a Managed Inference Job (Parakeet).
The Job configuration reference describes each field and its CLI flag.
Confirm the deployment
List your jobs to confirm CosmicAC created the job:
cosmicac jobs listThe job appears in the table with its ID, name, tags, and status. Wait for it to provision. The endpoint is ready to serve once its status is running.