ubicloud/config/ai_models.yml
Junhao Li fad8fa7cbf Enable speculative decoding for DeepSeek R1 32B
Speculative decoding accelerates inference by using a smaller draft model
to generate multiple token candidates, which are then efficiently verified
by the original model in parallel, reducing latency while maintaining quality.
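The draft-then-verify loop described above can be sketched in a few lines of toy Python. This is illustrative only, not the vLLM or Ubicloud implementation: the draft proposes k tokens, the target checks them, and the longest agreeing prefix (plus one correction or bonus token from the target) is accepted per target call.

```python
# Toy sketch of greedy speculative decoding (illustrative, not a real engine).
# A cheap "draft" model proposes k tokens; the "target" model verifies them
# and keeps the longest agreeing prefix, so several tokens can be accepted
# per target-model call.

def speculative_step(draft, target, prefix, k=4):
    """Propose k tokens with `draft`, then verify against `target` greedily.

    Returns the tokens accepted this step: the verified prefix plus either
    the target's correction token (on first mismatch) or a bonus token
    (when all k proposals are accepted).
    """
    # Draft phase: autoregressively propose k candidate tokens.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: in a real engine this is one batched forward pass.
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)  # target's correction ends the step
            break
    else:
        accepted.append(target(ctx))   # bonus token: all k were accepted
    return accepted

# Demo with deterministic stand-ins: the target continues 1, 2, 3, ...;
# the draft agrees except it gets every 3rd token wrong.
target = lambda ctx: len(ctx) + 1
draft = lambda ctx: len(ctx) + 1 if (len(ctx) + 1) % 3 else -1

seq = []
while len(seq) < 8:
    seq += speculative_step(draft, target, seq, k=4)
print(seq[:8])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Even with the draft wrong on every third token, each step still accepts three tokens per target call here, which is the source of the latency win when the server is not already saturated.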

Here, we use alamios/DeepSeek-R1-DRAFT-Qwen2.5-0.5B as the draft model
for DeepSeek R1 32B. This reduces performance at high concurrency by
about 10%, but roughly doubles inference speed (~100% improvement) at
low concurrency.
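The diff to ai_models.yml itself is not shown; under a vLLM-style backend the entry might look roughly like the fragment below. All field names and the tuning value are illustrative assumptions, not the file's actual schema.

```yaml
# Hypothetical fragment -- the real ai_models.yml schema may differ.
- model_name: DeepSeek-R1-Distill-Qwen-32B        # hypothetical entry name
  engine: vllm
  engine_params:
    speculative_model: alamios/DeepSeek-R1-DRAFT-Qwen2.5-0.5B
    num_speculative_tokens: 5                     # hypothetical tuning value
```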
2025-03-06 19:48:34 -05:00
