Lab notebook: Edit completion #1
I continue to be interested in late-2024-era edit completion, the “Fill in the Middle” (FIM) models. You know, what Copilot used to do, back before it started generating “mini diffs.” Why?
- I like and use agentic workflows. But as many people are realizing, it’s super easy to lose track of what’s happening in your code, with terrible consequences. So I want to reinvest in human-in-the-loop tools, too. (And slightly weaker agentic models, but more on that later.)
- The new-school edit completion offered by Copilot and Zed’s Zeta2 actually slows me down. It overlays diffs on my buffer, which is visually disorienting at speed. And it proposes edits further from the current cursor, which take me longer to mentally process. Personally, the new style feels like hunt-and-peck. The older style felt like really fast touch typing.
Mind you, I’m a very specific sort of user. I want to know how my code works. I want my code to be clean. And I can read a half-page code completion in moments, thanks to way too many years of reading PRs.
Initial experiments
All experiments performed in Zed, which does less post-processing of the raw model output than some tools. All evaluations are purely subjective.
New-school models (generating diffs). Zeta2 is honestly pretty underwhelming right now. The completions are very generic, and Zeta2 seems to be bad at taking the surrounding context into account. It will complete a function, sure. But I'd swap Zeta2 for late-2024 Copilot in a heartbeat.
Old-school models (FIM, inserting at cursor). Let’s go down the list so far:
- ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0: The classic, default choice. This isn't terrible, and it gives more context-aware completions than Zeta2. But it's generations old, and I want to know if anything is new and shiny.
- mradermacher/Seed-Coder-8B-Base-i1-GGUF:Q6_K: This is the raw base that went into Zeta2, I think? It doesn't seem to be useful in Zed, because the inserted text feels pretty raw. This might work better in a smarter harness. But I'm dropping it for now.
- JetBrains/Mellum-4b-base-gguf:Q8_0: Downloaded, but not yet tested.
- unsloth/Qwen3.6-35B-A3B-GGUF:IQ4_XS: This is unexpectedly good! Worth further experimentation.
Refining Qwen3.6 35B A3B: Changing order from PSM to SPM
Qwen typically uses FIM ("Fill in the Middle") completion, which relies on three magic tokens:
/// Qwen FIM prefix marker.
const PRE: &str = "<|fim_prefix|>";
/// Qwen FIM suffix marker.
const SUF: &str = "<|fim_suffix|>";
/// Qwen FIM middle marker (model generates after this).
const MID: &str = "<|fim_middle|>";
We have two possible flavors. The original is "PSM" completion, "prefix, suffix, middle":
{PRE}{prefix}{SUF}{suffix}{MID}
But since the prefix grows with each keystroke, every token after it ({SUF} and the entire suffix) falls out of the server's prompt cache and has to be re-processed. We could get much better caching with "SPM" order:
{SUF}{suffix}{PRE}{prefix}{MID}
Here, we can cache everything up to the final {MID} token, and resume generation with a longer prefix. Whooo, speed!
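To make the caching argument concrete, here's a minimal sketch of the two layouts (my own code, not Zed's), using the constants above:

```rust
/// PSM order: typing at the cursor grows `prefix`, so every token after
/// it ({SUF}, the suffix, {MID}) must be re-processed on each keystroke.
fn psm_prompt(prefix: &str, suffix: &str) -> String {
    format!("{PRE}{prefix}{SUF}{suffix}{MID}")
}

/// SPM order: the stable suffix comes first, so each keystroke only
/// appends tokens near the end, and the server's prompt cache stays warm.
fn spm_prompt(prefix: &str, suffix: &str) -> String {
    format!("{SUF}{suffix}{PRE}{prefix}{MID}")
}
```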
But Zed doesn’t support SPM completion, only PSM. So I fired up a copy of Claude Code (as one does), and asked, “Hey, write me a Rust proxy server (using my standard conventions) that intercepts /completion, and translates PSM to SPM please.”
Results: Extremely disappointing. SPM format confuses Qwen3.6 35B A3B pretty badly. But then I thought, “Hey, even if we’re running in /completion mode, this is still an instruction-tuned model. Can we prompt it?” One unscientific tweak later:
You are a code-completion tool. You receive input in
fim_suffix+fim_prefix+fim_middle order, and your job
is to generate what the user would be likely to type
next. When in doubt, keep it short. Think of this like
generating a diff in agentic coding mode. You're trying
to insert the right text to make a working program that
does what the user wants. If there's no obvious next
step, generate nothing.
{SUF}{suffix}{PRE}{prefix}{MID}
This is still pretty bad, but it’s better. You can tell it’s trying to be an SPM autocompleter, though it’s still the worst of the bunch.
Possible next steps:
- What if we modify the proxy to transform `/completion` into a `/chat/completions` request, with a real prompt, real text inputs, and a tool for `insert_at_cursor(text)`? Can we access more of the model's intelligence? (Sketched below.)
- Qwen3.6 35B A3B is small enough to fine-tune! We could look up FIM completion data sets, and try to create a LoRA adapter. We could even use something like tree-sitter to generate custom completion examples. Would that give us something useful?
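For the first idea, the request might look roughly like this. A sketch only: the message wording, the `insert_at_cursor` schema, and the OpenAI-style `tools`/`tool_choice` fields are my assumptions, not a tested design.

```rust
use serde_json::{json, Value};

/// Hypothetical: wrap a FIM request as a /chat/completions call that asks
/// the model to invoke an `insert_at_cursor(text)` tool.
fn chat_completion_request(prefix: &str, suffix: &str) -> Value {
    json!({
        "messages": [
            { "role": "system",
              "content": "You are a code-completion tool. Call insert_at_cursor \
                with the text the user would most likely type next." },
            { "role": "user",
              "content": format!(
                  "Code before cursor:\n{prefix}\n\nCode after cursor:\n{suffix}"
              ) }
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "insert_at_cursor",
                "description": "Insert text at the user's cursor position.",
                "parameters": {
                    "type": "object",
                    "properties": { "text": { "type": "string" } },
                    "required": ["text"]
                }
            }
        }],
        "tool_choice": "required"
    })
}
```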
I also notice that FIM-style models are notoriously bad at choosing a good stopping place. This can be fixed with a lot of regexes. But what if our fine-tuning data took care to demonstrate good stopping places?
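Concretely, the regex pile boils down to heuristics like these. A minimal sketch; the blank-line rule and the line cap are arbitrary choices on my part, not tuned values:

```rust
/// Trim a raw FIM completion at a plausible stopping place: stop at the
/// first blank line, and never emit more than `max_lines` lines.
fn trim_completion(raw: &str, max_lines: usize) -> String {
    let mut kept = Vec::new();
    for line in raw.lines() {
        if line.trim().is_empty() {
            break; // A blank line often means the model has wandered off.
        }
        kept.push(line);
        if kept.len() >= max_lines {
            break; // Don't let the model rewrite half the file.
        }
    }
    kept.join("\n")
}
```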