GIM: Improved Interpretability for Large Language Models Paper • 2505.17630 • Published May 23, 2025 • 1
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models (a minimal SAE sketch follows this list). • 37 items • Updated 18 days ago • 18
Accumulating Context Changes the Beliefs of Language Models Paper • 2511.01805 • Published Nov 3, 2025 • 2
🧩 Word games Collection A collection of resources for word games in various languages • 16 items • Updated Sep 24, 2025 • 2
Latent Reasoning in LLMs as a Vocabulary-Space Superposition Paper • 2510.15522 • Published Oct 17, 2025 • 3
Interpreting Language Models Through Concept Descriptions: A Survey Paper • 2510.01048 • Published Oct 1, 2025 • 2
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? Paper • 2507.08802 • Published Jul 11, 2025 • 1
Hallucination Probes Collection https://arxiv.org/abs/2509.03531 • 5 items • Updated Oct 15, 2025 • 2
RelP: Faithful and Efficient Circuit Discovery via Relevance Patching Paper • 2508.21258 • Published Aug 28, 2025 • 3
Exploring Environments Hub: Your Language Model needs better (open) environments to learn Article • Published Sep 4, 2025 • 29
Apertus LLM Collection Democratizing Open and Compliant LLMs for Global Language Environments: 8B and 70B open-data, open-weights models, multilingual in >1000 languages • 4 items • Updated Oct 1, 2025 • 318
CRISP: Persistent Concept Unlearning via Sparse Autoencoders Paper • 2508.13650 • Published Aug 19, 2025 • 16
Rethinking Crowd-Sourced Evaluation of Neuron Explanations Paper • 2506.07985 • Published Jun 9, 2025 • 1
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors Paper • 2505.11770 • Published May 17, 2025 • 2
Persona Vectors: Monitoring and Controlling Character Traits in Language Models Paper • 2507.21509 • Published Jul 29, 2025 • 32
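Two entries above (the SAE collection and CRISP) center on sparse auto-encoders. As a point of reference, here is a minimal sketch of a single-layer sparse auto-encoder in PyTorch, assuming the common design of an overcomplete ReLU encoder, a linear decoder, and an L1 sparsity penalty on the latent activations; all dimensions and coefficients are illustrative and not taken from any model in the collection.

```python
# Minimal sparse auto-encoder (SAE) sketch. Sizes and the L1 coefficient
# below are hypothetical, chosen only to make the example runnable.
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)  # LM activations -> features
        self.decoder = nn.Linear(d_latent, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

# Training objective: reconstruction error plus an L1 penalty that pushes
# each input to activate only a few latent features.
sae = SparseAutoEncoder(d_model=768, d_latent=768 * 8)  # hypothetical sizes
x = torch.randn(32, 768)                                # stand-in for LM activations
x_hat, f = sae(x)
l1_coeff = 1e-3                                         # hypothetical coefficient
loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
loss.backward()
```

The overcomplete latent (here 8x the input width) plus the L1 term is what yields the sparse, interpretable feature dictionary these collections study; released SAEs differ mainly in dictionary size, sparsity scheme, and which model layer they are trained on.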