Speculative decoding accelerates inference by using a smaller draft model to propose multiple candidate tokens, which the original model then verifies in parallel in a single forward pass, reducing latency without changing output quality. Here, we use alamios/DeepSeek-R1-DRAFT-Qwen2.5-0.5B as the draft model for DeepSeek R1 32B. This reduces throughput at high concurrency by about 10%, but at low concurrency it roughly doubles inference speed (~100% faster).
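As a minimal sketch of how this pairing might be wired up with vLLM's offline API: the argument names below (`speculative_model`, `num_speculative_tokens`) follow vLLM's v0.6-era interface and have since moved into a `speculative_config` dict in newer releases, and the target checkpoint name `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` is an assumption about which "DeepSeek R1 32B" checkpoint is meant.

```python
# Minimal sketch: speculative decoding with vLLM's offline API.
# Assumptions: v0.6-style keyword arguments; newer vLLM versions
# take these via a speculative_config dict instead. The target
# checkpoint name is assumed, not confirmed by the source.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # assumed target model
    speculative_model="alamios/DeepSeek-R1-DRAFT-Qwen2.5-0.5B",  # draft model
    num_speculative_tokens=5,  # draft tokens proposed per verification step
    tensor_parallel_size=2,    # adjust to your GPU count
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same setup can be exposed as a server; in v0.6-era vLLM the equivalent flags were `--speculative-model` and `--num-speculative-tokens` on `vllm serve`, though, again, flag names vary by version.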