Detecting LLM weight corruption and semantic drift

I’ve been running experiments on detecting two things in open-weight LLMs:

  1. Weight tampering: controlled noise injection (1% to 100%) into the model weights
  2. Semantic drift: how models respond to content outside the training distribution

The method I developed originated from building anomaly detection for edge devices with hard constraints (low RAM, real-time). I tested whether the same principles apply to LLMs.
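
For concreteness, here is a minimal sketch of the weight-tampering setup in experiment 1. The exact injection scheme isn't described in this post; the sketch assumes zero-mean Gaussian noise scaled by each weight matrix's own standard deviation, with the "noise level" as the relative magnitude.

```python
import torch
from transformers import AutoModelForCausalLM

def inject_noise(model, level: float):
    """Add zero-mean Gaussian noise to every weight matrix.

    level: 0.05 for "5% noise", 1.0 for "100% noise", interpreted here as
    noise std relative to each tensor's own std (an assumption, not
    necessarily the exact scheme used in the experiments).
    """
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() >= 2:  # weight matrices only; skip biases and norms
                p.add_(torch.randn_like(p) * p.std() * level)
    return model

# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# tampered = inject_noise(model, level=0.05)
```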


Models tested

All models from HuggingFace:

  • google/gemma-7b
  • mistralai/Mistral-7B-v0.1
  • swiss-ai/Apertus-8B-2509
  • EleutherAI/pythia-6.9b
  • microsoft/phi-2
  • tiiuae/falcon-7b

Results: Weight Integrity

Metric: Effective Rank of weight matrices (higher = more damaged)

Model          Intact   5% Noise   20% Noise   100% Noise   Range
gemma-7b        362.2      398.3       409.3        412.2   +50.0
Mistral-7B      381.6      412.0       411.8        412.1   +30.5
phi-2           385.0      409.7       412.0        412.0   +27.0
Apertus-8B      386.6      389.3       394.5        407.7   +21.2
pythia-6.9b     398.3      411.3       411.9        411.7   +13.4
falcon-7b       402.1      411.0       411.9        411.9    +9.9

Finding: Effective rank rises with the noise level and saturates near its maximum for most models. Unlike a binary hash check, which only tells you whether the weights changed, it also reflects the degree of gradual tampering.
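
For reference, a sketch of how an effective-rank score could be computed. The post doesn't spell out the exact definition used; this assumes the standard entropy-of-singular-values formulation (Roy & Vetterli, 2007), and the module path in the usage comment is only an example.

```python
import torch

def effective_rank(W: torch.Tensor) -> float:
    """exp(H(p)), where p is the normalized singular-value distribution of W
    (Roy & Vetterli, 2007). Higher value = flatter spectrum = more noise-like."""
    s = torch.linalg.svdvals(W.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

# Example: score a single projection matrix (module path varies per model)
# W = model.model.layers[0].self_attn.q_proj.weight
# print(effective_rank(W))
```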

Results: Semantic Drift (OOD Detection)

Metric: Uniformity score of activations (higher = further from training distribution)

Test categories:

  • L0: Known facts (Berlin, Einstein, water)
  • L1: Known fiction (Hogwarts, Gandalf)
  • L2: Unknown tokens (gibberish)

Model          L0 (facts)   L2 (gibberish)   Range   Pattern
gemma-7b            0.258            0.825   +0.57   ✓ Staircase
Apertus-8B          0.470            0.978   +0.51   ✓ Staircase
Mistral-7B          0.319            0.804   +0.49   ✓ Staircase
pythia-6.9b         0.486            0.609   +0.12   ✓ Staircase
phi-2               0.47             0.58    +0.11   ✓ Staircase
falcon-7b*          0.319            0.264   -0.06   ✓ Inverted

*Falcon requires extraction at layer 8 instead of the last layer, with inverted polarity.

Finding: The further a prompt sits from the training distribution, the stronger the signal. This is OOD detection, not hallucination detection (plausible false facts would not trigger it).
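
The uniformity score itself isn't defined in this post; the sketch below is one plausible reading, scoring how dispersed the normalized token activations of a prompt are at the chosen extraction layer (1 minus the mean pairwise cosine similarity, so more dispersed activations score higher).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def uniformity_score(model, tokenizer, text: str, layer: int = -1) -> float:
    """1 - mean pairwise cosine similarity of token activations at `layer`.
    More dispersed ("more uniform") activations -> higher score."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0]                  # (seq_len, hidden_dim)
    h = torch.nn.functional.normalize(h, dim=-1)
    sim = h @ h.T                                    # pairwise cosine similarities
    n = sim.shape[0]
    off_diag_mean = (sim.sum() - n) / (n * (n - 1))  # exclude the diagonal
    return (1.0 - off_diag_mean).item()

# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# uniformity_score(model, tok, "The capital of Germany is Berlin.")  # L0 prompt
# uniformity_score(model, tok, "zxqv blorf kltp wvvnx")              # L2 prompt
```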

Cross-Model Validation

6/6 models validated (100%) across 6 organizations: Google, Mistral AI, Swiss AI, EleutherAI, Microsoft, TII.


Key insight

The method appears universal across architectures. The signal exists in all tested models.

Only two parameters vary per model (see the sketch after this list):

  • Extraction layer (most: last layer, Falcon: layer 8)
  • Polarity (most: normal, Falcon: inverted)
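
In config form, that per-model adaptation might look like this (layer and polarity values taken from the results above; the function name is only illustrative):

```python
# Per-model probe settings from the results above; -1 = last layer.
PROBE_CONFIG = {
    "tiiuae/falcon-7b": {"layer": 8,  "invert": True},
    "default":          {"layer": -1, "invert": False},  # all other tested models
}

def drift_signal(model_id: str, raw_score: float) -> float:
    cfg = PROBE_CONFIG.get(model_id, PROBE_CONFIG["default"])
    return -raw_score if cfg["invert"] else raw_score
```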

Want to test it?

If you’re curious whether this works on a model you’re working with:

Send me a HuggingFace model ID – I’ll run the analysis and share the results.

Email: [email protected] | Twitter: @waiter_no1


Looking for

  • Feedback on methodology
  • Independent replication attempts
  • Collaboration on extending this

Full methodology details are available on request.

