@@ -0,0 +1,75 @@
{{- if .Values.backendRuntime.enabled -}}
To be honest, I think this is a bit redundant: the vLLM CPU image is maintained separately (see https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo), so for now it seems we need a separate backend runtime to support it. 🤔
root@VM-0-13-ubuntu:/home/ubuntu# kubectl get pods
NAME READY STATUS RESTARTS AGE
qwen3-0--6b-0 1/1 Running 0 24m
root@VM-0-13-ubuntu:/home/ubuntu# kubectl get pods -oyaml | grep image
image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5
imagePullPolicy: IfNotPresent
image: inftyai/model-loader:v0.0.10
imagePullPolicy: IfNotPresent
image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5
imageID: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo@sha256:36444ac581f98dc4336a7044e4c9858d5b3997a7bda1e952c03af8b6917c8311
image: docker.io/inftyai/model-loader:v0.0.10
imageID: docker.io/inftyai/model-loader@sha256:b67a8bb3acbc496a62801b2110056b9774e52ddc029b379c7370113c7879c7d9
root@VM-0-13-ubuntu:/home/ubuntu# kubectl port-forward svc/qwen3-0--6b-lb 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
root@VM-0-13-ubuntu:/home/ubuntu# curl -X POST http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "qwen3-0--6b",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
]
}' | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1436 100 1273 100 163 111 14 0:00:11 0:00:11 --:--:-- 393
{
"id": "chatcmpl-7f5a02b2bd964760832cdf7f5e0a104d",
"object": "chat.completion",
"created": 1751164548,
"model": "qwen3-0--6b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": null,
"content": "<think>\nOkay, the user is asking, \"Who are you?\" So, I need to respond appropriately. Let me start by recalling the previous interactions. The user mentioned they are asking about my identity, so I should confirm my name and provide a brief description.\n\nI should make sure to be friendly and offer further assistance. Maybe mention that I am a language model, but also highlight that I can help with various tasks. It's important to keep the response straightforward and conversational.\n\nWait, should I use a specific name? The user might be expecting a name, so I should include that. Also, avoid any technical jargon. Keep the tone natural and helpful. Let me put that together.\n</think>\n\nI am a language model developed by OpenAI. I can help with a wide range of tasks, from answering questions to providing information. How can I assist you today?",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 12,
"total_tokens": 190,
"completion_tokens": 178,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
root@VM-0-13-ubuntu:/home/ubuntu#
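For anyone who wants to repeat this check from a script, here is a minimal Python sketch of the same request. It assumes the port-forward above is still active on localhost:8080 and reuses the model name from the session. The helper names (`build_request`, `split_reasoning`, `ask`) are hypothetical and not part of llmaz or vLLM. Because the response above returns `reasoning_content` as null and leaves the `<think>` block inside `content`, the sketch also separates the two on the client side.

```python
import json
import re
import urllib.request

# Assumption: the service is reachable through the port-forward shown above.
URL = "http://localhost:8080/v1/chat/completions"

# Matches the first <think>...</think> block, including newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    )
    return urllib.request.Request(
        URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty without a <think> block."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[: match.start()] + content[match.end():]).strip()
    return reasoning, answer


def ask(model: str, prompt: str) -> tuple[str, str]:
    """Send one user message and return (reasoning, answer)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        payload = json.load(resp)
    return split_reasoning(payload["choices"][0]["message"]["content"])


# Usage (needs the port-forward to be running):
#   reasoning, answer = ask("qwen3-0--6b", "Who are you?")
#   print(answer)
```

The generous timeout is deliberate: the curl run above took about 11 seconds for a short answer on CPU.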
Seeing that other projects have a vLLM CPU example, I think we can also integrate it into our backendruntime.
/kind feature
The image seems to support only the x86 architecture: https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html?h=cpu#pre-built-images
Yes, according to the docs, vLLM CPU currently supports only specific architectures 🤔
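Given that limitation, a quick way to see which nodes could run the pre-built image is to filter on the architecture each node reports. This is only a sketch: it assumes `kubectl` is configured for the target cluster, and `amd64_nodes` and `load_nodes` are hypothetical helper names. Kubernetes reports x86-64 as `amd64` in `status.nodeInfo.architecture`.

```python
import json
import subprocess


def amd64_nodes(node_list: dict) -> list[str]:
    """Names of nodes whose reported architecture is amd64 (x86-64)."""
    return [
        item["metadata"]["name"]
        for item in node_list.get("items", [])
        if item.get("status", {}).get("nodeInfo", {}).get("architecture") == "amd64"
    ]


def load_nodes() -> dict:
    """Fetch the node list as JSON from the current kubectl context."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


# Usage (needs cluster access):
#   print(amd64_nodes(load_nodes()))
```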
app.kubernetes.io/name: backendruntime
app.kubernetes.io/part-of: llmaz
app.kubernetes.io/created-by: llmaz
name: vllmcpu
I think we can move it to the examples instead of making it part of the default template.
That seems more reasonable; I will change it. In the examples, we can give users another way to run vLLM (on CPU).
78af4c3 to a482905
cc @kerthcet
/assign @kerthcet
friendly ping @kerthcet
@googs1025 would you mind rebasing the code?
Signed-off-by: googs1025 <googs1025@gmail.com>
98083d0 to 5754ae3
done 😄
What this PR does / why we need it
Which issue(s) this PR fixes
Fixes #
Special notes for your reviewer
Does this PR introduce a user-facing change?