From QK-clip to Muon clip
Clipping attention logits for MLA with shared rotary keys.
Here are some of the things I've worked on in the past: projects, short papers, write-ups, etc.
Approximating the serial pre-norm block.
On singular token spaces.
Small but faithful subsets of large point sets.
Play chess alone or with a friend.
Communication deficiencies in optimization.