Systems of Ashish
PLE LayerLookup
Gemma-style Per-Layer Embeddings trained from scratch. Every transformer layer gets its own tiny token table, streamed per token from a memory-mapped file. Beats the dense baseline at matched inference VRAM: 149.88 vs 160.39 perplexity.
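A minimal sketch of the lookup idea: each layer owns a small embedding table in one memory-mapped file, and only the rows for the current tokens are paged in. The file name, shapes, dtype, and layout below are illustrative assumptions, not the project's actual format.

```python
# Sketch: per-layer token embeddings streamed from a memory-mapped file.
# num_layers, vocab_size, ple_dim, and the file layout are toy assumptions.
import numpy as np
import torch

num_layers, vocab_size, ple_dim = 4, 256, 8  # toy sizes (assumed)

# Write a toy table file: one small embedding table per transformer layer.
table = np.random.rand(num_layers, vocab_size, ple_dim).astype(np.float32)
table.tofile("ple_tables.bin")

# Memory-map the file so only the rows actually indexed get paged in.
mm = np.memmap("ple_tables.bin", dtype=np.float32, mode="r",
               shape=(num_layers, vocab_size, ple_dim))

def ple_lookup(layer_idx: int, token_ids: torch.Tensor) -> torch.Tensor:
    """Fetch this layer's embeddings for the given tokens; copies only those rows."""
    rows = np.asarray(mm[layer_idx][token_ids.numpy()])
    return torch.from_numpy(rows)

tokens = torch.tensor([1, 42, 7])
vec = ple_lookup(layer_idx=2, token_ids=tokens)
print(vec.shape)  # torch.Size([3, 8])
```

Because the tables live on disk and are indexed per token, resident VRAM/RAM stays roughly constant as vocabulary or layer count grows, which is the point of the matched-VRAM comparison above.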
pytorch · python · transformers · mmap · architecture