I’ve been running experiments on detecting two things in open-weight LLMs:
- Weight tampering: Controlled noise injection (1% to 100%) into model weights
- Semantic drift: How models respond to content outside training distribution
The method I developed grew out of anomaly detection for edge devices under hard constraints (low RAM, real-time); I tested whether the same principles carry over to LLMs.
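To make the tampering setup concrete, here is a minimal sketch of the kind of perturbation I mean. It is illustrative only: how the percentage maps onto noise magnitude in the actual experiments may differ, and the helper name and model choice are just examples.

```python
import torch
from transformers import AutoModelForCausalLM

def inject_noise(model: torch.nn.Module, level: float) -> None:
    """Add Gaussian noise to every weight tensor, scaled by `level`
    times that tensor's own standard deviation (in place)."""
    with torch.no_grad():
        for param in model.parameters():
            if param.numel() < 2:  # skip scalars; std() is undefined there
                continue
            param.add_(torch.randn_like(param) * param.std() * level)

# Example: the "5% noise" condition on one of the tested models.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b", torch_dtype=torch.float32)
inject_noise(model, level=0.05)
```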
Models tested
All models from HuggingFace:
- google/gemma-7b
- mistralai/Mistral-7B-v0.1
- swiss-ai/Apertus-8B-2509
- EleutherAI/pythia-6.9b
- microsoft/phi-2
- tiiuae/falcon-7b
Results: Weight Integrity
Metric: Effective Rank of weight matrices (higher = more damaged)
| Model | Intact | 5% Noise | 20% Noise | 100% Noise | Δ (Intact → 100%) |
|---|---|---|---|---|---|
| gemma-7b | 362.2 | 398.3 | 409.3 | 412.2 | +50.0 |
| Mistral-7B | 381.6 | 412.0 | 411.8 | 412.1 | +30.5 |
| phi-2 | 385.0 | 409.7 | 412.0 | 412.0 | +27.0 |
| Apertus-8B | 386.6 | 389.3 | 394.5 | 407.7 | +21.2 |
| pythia-6.9b | 398.3 | 411.3 | 411.9 | 411.7 | +13.4 |
| falcon-7b | 402.1 | 411.0 | 411.9 | 411.9 | +9.9 |
Finding: The signal increases monotonically with the noise level, so it detects and grades gradual tampering, whereas a binary hash check reveals nothing about the extent of the damage.
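For anyone who wants to replicate the weight-side metric: a common definition of effective rank (Roy & Vetterli, 2007) is the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that definition; my exact estimator and the choice of which matrices to aggregate over are in the methodology notes and may differ.

```python
import torch

def effective_rank(weight: torch.Tensor) -> float:
    """Effective rank of a 2-D weight matrix: exp(H(p)) with p_i = sigma_i / sum_j sigma_j."""
    s = torch.linalg.svdvals(weight.detach().float())    # singular values
    p = s / s.sum()                                       # normalize to a distribution
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return torch.exp(entropy).item()

def mean_effective_rank(model: torch.nn.Module) -> float:
    """Average effective rank over all 2-D weight matrices in the model."""
    ranks = [effective_rank(p) for name, p in model.named_parameters()
             if p.ndim == 2 and "weight" in name]
    return sum(ranks) / len(ranks)
```

The intuition: random noise spreads energy evenly across singular directions, so a damaged matrix drifts toward the score of pure noise, which matches how all models converge near the same ceiling at 100% noise.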
Results: Semantic Drift (OOD Detection)
Metric: Uniformity score of activations (higher = further from training distribution)
Test categories:
- L0: Known facts (Berlin, Einstein, water)
- L1: Known fiction (Hogwarts, Gandalf)
- L2: Unknown tokens (gibberish)
| Model | L0 (facts) | L2 (gibberish) | Δ (L0 → L2) | Pattern |
|---|---|---|---|---|
| gemma-7b | 0.258 | 0.825 | +0.57 | ✓ Staircase |
| Apertus-8B | 0.470 | 0.978 | +0.51 | ✓ Staircase |
| Mistral-7B | 0.319 | 0.804 | +0.49 | ✓ Staircase |
| pythia-6.9b | 0.486 | 0.609 | +0.12 | ✓ Staircase |
| phi-2 | 0.47 | 0.58 | +0.11 | ✓ Staircase |
| falcon-7b* | 0.319 | 0.264 | -0.06 | ✓ Inverted |
*Falcon requires layer 8 instead of the last layer, and its signal polarity is inverted.
Finding: The further a prompt lies from the training distribution, the stronger the signal. This is OOD detection, not hallucination detection: a plausible-sounding false fact would not trigger it.
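For the activation side, the sketch below shows the extraction pipeline plus an illustrative uniformity measure (the Wang & Isola, 2020 formulation on the unit sphere). To be clear, the scoring function here is a stand-in I chose for illustration; the exact uniformity score in my experiments is part of the methodology write-up.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def uniformity_score(prompt: str, model, tokenizer, layer: int = -1, t: float = 2.0) -> float:
    """Score how uniformly the prompt's token activations spread on the unit sphere
    at the chosen layer (higher = more uniform = less familiar structure)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0]          # (seq_len, hidden_dim) for this prompt
    h = F.normalize(h, dim=-1)               # project token activations onto the unit sphere
    d2 = torch.cdist(h, h).pow(2)            # pairwise squared distances between tokens
    return -torch.log(torch.exp(-t * d2).mean()).item()

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")
mdl = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")
print(uniformity_score("The capital of Germany is Berlin.", mdl, tok))  # L0: known fact
print(uniformity_score("xq zvlorth prenk vubbledy snarx", mdl, tok))    # L2: gibberish
```

For Falcon, the same extraction would use layer 8 and the resulting score would be flipped in sign, per the note above.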
Cross-Model Validation
6/6 models validated (100%) across 6 organizations: Google, Mistral AI, Swiss AI, EleutherAI, Microsoft, TII.
Key insight
The method appears universal across architectures. The signal exists in all tested models.
Only two parameters vary per model (a minimal config sketch follows the list):
- Extraction layer (most models: last layer; Falcon: layer 8)
- Polarity (most models: normal; Falcon: inverted)
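Concretely, the whole per-model configuration fits in a small lookup table like the one below (my own illustrative layout; layer -1 means "last hidden layer", polarity -1 means the score is flipped). Everything else in the pipeline is shared.

```python
# Illustrative per-model configuration; all other pipeline settings are identical.
DETECTOR_CONFIG = {
    "google/gemma-7b":           {"layer": -1, "polarity": +1},
    "mistralai/Mistral-7B-v0.1": {"layer": -1, "polarity": +1},
    "swiss-ai/Apertus-8B-2509":  {"layer": -1, "polarity": +1},
    "EleutherAI/pythia-6.9b":    {"layer": -1, "polarity": +1},
    "microsoft/phi-2":           {"layer": -1, "polarity": +1},
    "tiiuae/falcon-7b":          {"layer": 8,  "polarity": -1},  # earlier layer, inverted signal
}
```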
Want to test it?
If you’re curious whether this works on a model you’re working with:
Send me a HuggingFace model ID – I’ll run the analysis and share the results.
Email: [email protected] | Twitter: @waiter_no1
Looking for
- Feedback on methodology
- Independent replication attempts
- Collaboration on extending this
Full methodology details are available on request to anyone seriously interested.
