GIM: Improved Interpretability for Large Language Models Paper • 2505.17630 • Published May 23, 2025 • 1
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models (a minimal SAE sketch follows this list). • 37 items • Updated 18 days ago • 18
Accumulating Context Changes the Beliefs of Language Models Paper • 2511.01805 • Published Nov 3, 2025 • 2
🧩 Word games Collection A collection of resources for word games in various languages • 16 items • Updated Sep 24, 2025 • 2
Latent Reasoning in LLMs as a Vocabulary-Space Superposition Paper • 2510.15522 • Published Oct 17, 2025 • 3
Interpreting Language Models Through Concept Descriptions: A Survey Paper • 2510.01048 • Published Oct 1, 2025 • 2
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? Paper • 2507.08802 • Published Jul 11, 2025 • 1
Hallucination Probes Collection https://arxiv.org/abs/2509.03531 • 5 items • Updated Oct 15, 2025 • 2
RelP: Faithful and Efficient Circuit Discovery via Relevance Patching Paper • 2508.21258 • Published Aug 28, 2025 • 3
Exploring Environments Hub: Your Language Model needs better (open) environments to learn Article • Published Sep 4, 2025 • 29
Apertus LLM Collection Democratizing Open and Compliant LLMs for Global Language Environments: 8B and 70B open-data, open-weights models, multilingual in >1000 languages • 4 items • Updated Oct 1, 2025 • 318
CRISP: Persistent Concept Unlearning via Sparse Autoencoders Paper • 2508.13650 • Published Aug 19, 2025 • 16
Rethinking Crowd-Sourced Evaluation of Neuron Explanations Paper • 2506.07985 • Published Jun 9, 2025 • 1
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors Paper • 2505.11770 • Published May 17, 2025 • 2
Persona Vectors: Monitoring and Controlling Character Traits in Language Models Paper • 2507.21509 • Published Jul 29, 2025 • 32
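Two entries above (the SAE collection and CRISP) center on sparse auto-encoders. As a point of reference, here is a minimal sketch of a single-layer sparse auto-encoder in PyTorch, assuming the common design of an overcomplete ReLU encoder, a linear decoder, and an L1 sparsity penalty on the latent activations; all dimensions and coefficients are illustrative and not taken from any model in the collection.

```python
# Minimal sparse auto-encoder (SAE) sketch. Sizes and the L1 coefficient
# below are hypothetical, chosen only to make the example runnable.
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)  # LM activations -> features
        self.decoder = nn.Linear(d_latent, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

# Training objective: reconstruction error plus an L1 penalty that pushes
# each input to activate only a few latent features.
sae = SparseAutoEncoder(d_model=768, d_latent=768 * 8)  # hypothetical sizes
x = torch.randn(32, 768)                                # stand-in for LM activations
x_hat, f = sae(x)
l1_coeff = 1e-3                                         # hypothetical coefficient
loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
loss.backward()
```

The overcomplete latent (here 8x the input width) plus the L1 term is what yields the sparse, interpretable feature dictionary these collections study; released SAEs differ mainly in dictionary size, sparsity scheme, and which model layer they are trained on.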