Speculative decoding accelerates inference by using a smaller draft model to propose multiple candidate tokens, which the original model then verifies in parallel in a single forward pass, reducing latency without changing output quality. Here, we use alamios/DeepSeek-R1-DRAFT-Qwen2.5-0.5B as the draft model for DeepSeek R1 32B. This reduces throughput at high concurrency by about 10%, but at low concurrency it roughly doubles inference speed (~100% faster).
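As a minimal sketch of how this pairing might be wired up with vLLM's offline API: the argument names below (`speculative_model`, `num_speculative_tokens`) follow vLLM's v0.6-era interface and have since moved into a `speculative_config` dict in newer releases, and the target checkpoint name `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` is an assumption about which "DeepSeek R1 32B" checkpoint is meant.

```python
# Minimal sketch: speculative decoding with vLLM's offline API.
# Assumptions: v0.6-style keyword arguments; newer vLLM versions
# take these via a speculative_config dict instead. The target
# checkpoint name is assumed, not confirmed by the source.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # assumed target model
    speculative_model="alamios/DeepSeek-R1-DRAFT-Qwen2.5-0.5B",  # draft model
    num_speculative_tokens=5,  # draft tokens proposed per verification step
    tensor_parallel_size=2,    # adjust to your GPU count
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same setup can be exposed as a server; in v0.6-era vLLM the equivalent flags were `--speculative-model` and `--num-speculative-tokens` on `vllm serve`, though, again, flag names vary by version.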