The story is 3.10 to 3.11: a 1.39x speedup on n-body, for free. That's the Faster CPython project -- adaptive specialization of bytecodes, inline caching, zero-cost exceptions. 3.13 squeezed out a bit more. 3.14 gave some of it back -- a minor regression on these benchmarks.
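You can watch the adaptive specialization happen with the standard `dis` module. This is a minimal sketch: the function `add` and the warm-up loop are illustrative, and the exact specialized opcode names vary by CPython version.

```python
import dis
import sys

def add(a, b):
    return a + b

# Warm up: on 3.11+ the specializing interpreter rewrites hot
# bytecodes in place after a few executions.
for _ in range(1000):
    add(1, 2)

# With adaptive=True (available since 3.11), dis shows the
# specialized forms, e.g. BINARY_OP_ADD_INT instead of the
# generic BINARY_OP.
if sys.version_info >= (3, 11):
    dis.dis(add, adaptive=True)
else:
    dis.dis(add)
```

On a warmed-up 3.11+ interpreter the generic `BINARY_OP` is replaced by an int-specialized variant, which is exactly the mechanism behind the n-body win: the hot loop's arithmetic stops re-dispatching on operand types every iteration.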
Finally, here’s a snapshot of the current overall Arena ranking of top 10 models.
But MXU utilization tells the real story. Even with block=128, flash attention’s MXU utilization is only ~20% vs standard’s ~94%. Flash has two matmuls per tile: Q_tile @ K_tile.T = (128, 64) @ (64, 128) and weights @ V_tile = (128, 128) @ (128, 64). Both have inner dimension ≤ d=64 or block=128, so the systolic pipeline runs for at most 128 steps through a 128-wide array. Standard attention’s weights @ V is (512, 512) @ (512, 64) — the inner dimension is 512, giving the pipeline 512 steps of useful work. That single large matmul is what drives standard’s ~94% utilization.
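The pipeline-step argument above can be sketched as a toy cost model. This is our simplification, not a measured profile: assume each tile matmul pays a fixed fill/drain overhead on the order of the array width (128 steps here), so an inner dimension K gives roughly K / (K + 128) useful cycles. The model only captures the trend, not the measured ~20% and ~94%.

```python
# Toy cost model (assumption, not from hardware docs): a 128-wide
# systolic array spends ~`depth` steps filling and draining per tile,
# so an (M, K) @ (K, N) tile does K steps of peak-rate work out of
# roughly K + depth total steps.

def toy_utilization(k_inner: int, depth: int = 128) -> float:
    """Fraction of pipeline steps doing useful multiply-accumulates."""
    return k_inner / (k_inner + depth)

# Flash attention tiles: inner dimension is d=64 or block=128.
print(f"flash,    K=64:  {toy_utilization(64):.0%}")
print(f"flash,    K=128: {toy_utilization(128):.0%}")
# Standard attention's weights @ V: inner dimension 512.
print(f"standard, K=512: {toy_utilization(512):.0%}")
```

The ordering matches the observation: flash's short inner dimensions never amortize the pipeline overhead, while standard attention's single K=512 matmul does.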