Anthropic Performance Challenge

---
Cycles
---
Speedup

Kernel Code

Benchmarks

SLOT LIMITS

Output

📚 Tutorial

Quick Start

The Goal

Minimize the cycle count. The baseline runs in ~147,000 cycles. Top solutions achieve <1,500 cycles.

The Machine

A VLIW SIMD processor that can execute multiple operations per cycle:

Key Insight

The baseline uses ~1 slot per cycle. Pack independent operations together!

Operations

// ALU
kb.add("alu", ["+", dest, a, b])
kb.add("alu", ["^", dest, a, b])
kb.add("alu", ["%", dest, a, b])

// Load
kb.add("load", ["const", dest, value])
kb.add("load", ["load", dest, addr])
kb.add("load", ["vload", dest, addr])

// Store
kb.add("store", ["store", addr, src])

// Flow
kb.add("flow", ["select", d, cond, a, b])

Packing Operations

Instead of kb.add() one at a time, build instruction bundles:

// One cycle, multiple ops:
kb.instrs.push({
    "alu": [
        ["+", a, b, c],
        ["*", d, e, f]
    ],
    "load": [
        ["load", g, h]
    ]
});

Vectors (SIMD)

Process 8 items at once with VLEN=8:

// Vector add (8 elements)
kb.add("valu", ["+", vDest, vA, vB])

// Load 8 consecutive values
kb.add("load", ["vload", vDest, addr])

Tips