# math-embed
A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on combinatorics and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on SPECTER2 and trained using knowledge-graph-guided contrastive learning.
## Performance
Benchmarked on mathematical paper retrieval (108 queries, 4,794 paper chunks):
| Model | MRR | NDCG@10 |
|---|---|---|
| math-embed (this model) | 0.816 | 0.736 |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |
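For reference, MRR and NDCG@10 can be computed from per-query relevance lists as sketched below. This is a generic illustration with toy data, not the benchmark harness used for the table above.

```python
import numpy as np

def mrr(ranked_relevance):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_10(ranked_relevance):
    """NDCG@10 with binary relevance labels."""
    scores = []
    for rels in ranked_relevance:
        dcg = sum(g / np.log2(i + 2) for i, g in enumerate(rels[:10]))
        ideal = sorted(rels, reverse=True)[:10]
        idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return float(np.mean(scores))

# Toy example: binary relevance of the ranked results for two queries
runs = [[0, 1, 0, 0], [1, 0, 0, 0]]
print(mrr(runs))  # (1/2 + 1/1) / 2 = 0.75
```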
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RobBobin/math-embed")

# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]
query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```
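Retrieval is then a cosine-similarity ranking over the embeddings. The sketch below uses small toy vectors in place of real model output so it runs standalone; with the model, you would pass the `query_embs` and `doc_embs` from the snippet above.

```python
import numpy as np

# Toy stand-ins for query_embs / doc_embs (real vectors are 768-dim)
query_embs = np.array([[1.0, 0.0], [0.6, 0.8]])
doc_embs = np.array([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])

# L2-normalize so a dot product equals cosine similarity
q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
sims = q @ d.T  # shape (n_queries, n_docs)

# Document indices ranked best-first for each query
ranking = np.argsort(-sims, axis=1)
print(ranking[0])  # [0 2 1]
```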
## Matryoshka dimensions

Trained with Matryoshka Representation Learning, so you can truncate embeddings to smaller dimensions (512, 256, 128) with graceful degradation:

```python
# Use 256-dim embeddings for faster retrieval
embs = model.encode(texts)
embs_256 = embs[:, :256]
```
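One caveat not shown above: after truncation the vectors are no longer unit-length, so re-normalize before using dot products as cosine similarity. A minimal numpy sketch with random stand-in embeddings:

```python
import numpy as np

# Random stand-ins for model output; real embeddings come from encode()
embs = np.random.default_rng(0).normal(size=(4, 768))

# Truncate to the first 256 dims, then L2-renormalize so that
# dot products are cosine similarities again
embs_256 = embs[:, :256]
embs_256 = embs_256 / np.linalg.norm(embs_256, axis=1, keepdims=True)
```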
## Training

### Method
- Loss: MultipleNegativesRankingLoss + MatryoshkaLoss
- Training data: 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
  - Direct pairs: concept name/description → chunks from that concept's source papers
  - Edge pairs: cross-concept pairs from knowledge-graph edges (e.g., "generalizes", "extends")
- Base model: `allenai/specter2_base` (SciBERT pre-trained on 6M citation triplets)
### Configuration
- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]
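Conceptually, the two losses combine as follows: MultipleNegativesRankingLoss scores each anchor against every positive in the batch (the diagonal holds the true pairs), and MatryoshkaLoss averages that loss over each truncated prefix. The numpy sketch below illustrates the math only; the actual training used the sentence-transformers implementations.

```python
import numpy as np

def in_batch_loss(a, p, scale=20.0):
    """Cross-entropy over scaled in-batch cosine similarities:
    anchor i should match positive i against all other positives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # (batch, batch), true pairs on the diagonal
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(anchors, positives, dims=(768, 512, 256, 128)):
    """Average the base loss over truncated prefixes so that short
    prefixes remain useful retrieval vectors on their own."""
    return float(np.mean([in_batch_loss(anchors[:, :d], positives[:, :d])
                          for d in dims]))
```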
## Model lineage

```
BERT (Google, 110M params)
└─ SciBERT (Allen AI, retrained on scientific papers)
   └─ SPECTER2 base (Allen AI, + 6M citation triplets)
      └─ math-embed (this model, + KG-derived concept-chunk pairs)
```
## Approach
The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.
This is a form of knowledge distillation: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.
## Limitations
- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- 256-token context window (standard for BERT-based models)
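A practical consequence of the 256-token window is that full papers must be split into chunks before embedding (the benchmark above indexes 4,794 paper chunks). A naive word-based chunker is sketched below; the word budget and overlap are illustrative guesses, not the preprocessing actually used for this model.

```python
def chunk_text(text, max_words=180, overlap=30):
    """Split text into overlapping word windows. 180 words is a rough
    proxy for staying under 256 tokens; real tokenizers vary."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        if window:
            chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break
    return chunks

paper = " ".join(f"w{i}" for i in range(400))
print(len(chunk_text(paper)))  # 3 chunks for a 400-word text
```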
## Citation
See the accompanying paper: *Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*.