
Param-1-7B-MoE

Param-1-7B-MoE is a multilingual large language model developed under the Param-1 family as part of BharatGen – A Suite of Generative AI Technologies for India. With 7 billion parameters and a Mixture of Experts (MoE) architecture, the model is designed to better understand and generate text across English, Hindi, and 14 additional Indian languages.

The model is pretrained from scratch with a strong focus on linguistic diversity, cultural context, and large-scale multilingual representation relevant to India.


Key Highlights

  • 7B parameter Mixture of Experts (MoE) language model
  • Multilingual: English, Hindi + 14 Indian languages
  • Trained on 4 trillion tokens
  • Uses 64 specialized experts, dynamically activated per token
  • Supports context lengths of up to 4096 tokens
  • Designed as a pretrained (PT) base model for downstream fine-tuning

Supported Languages

In addition to English and Hindi, the model has been trained on data from the following 14 Indian languages:

  • Assamese
  • Bengali
  • Gujarati
  • Kannada
  • Maithili
  • Malayalam
  • Marathi
  • Nepali
  • Oriya
  • Punjabi
  • Sanskrit
  • Sindhi
  • Tamil
  • Telugu

This broad language coverage enables better performance in region-specific applications and improves inclusivity across India’s linguistic landscape.


Model Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bharatgenai/Param-1-7B"

# Load the tokenizer and model; trust_remote_code=True on the model allows any
# custom modelling code bundled with the checkpoint to run.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

# Tokenize the prompt and move it to the same device as the model.
prompt = "Your prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a completion of up to 300 new tokens.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
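
Since the model is multilingual, the same pipeline can be prompted in any of the supported languages. The snippet below is a minimal sketch that reuses the tokenizer and model objects loaded above; the Hindi prompt is purely illustrative and not an official example.

# Illustrative Hindi prompt (reuses the tokenizer/model from the snippet above).
prompt_hi = "भारत की राजधानी"  # "The capital of India"
inputs_hi = tokenizer(prompt_hi, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_hi = model.generate(
        **inputs_hi,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False
    )

print(tokenizer.decode(output_hi[0], skip_special_tokens=True))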

Model Architecture

  • Architecture: Transformer (Decoder-only) with Mixture of Experts (MoE)
  • Number of parameters: ~7B
  • Active parameters: 1.04B
  • Number of experts: 64
  • Expert routing: Token-level sparse activation, Top-K = 8 experts (see the routing sketch after this list)
  • Maximum sequence length: 4096
  • Positional embeddings: Rotary (RoPE)
  • Attention mechanism: Optimized attention with modern activation techniques
  • Precision: bf16-mixed
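
For intuition, here is a minimal, self-contained PyTorch sketch of token-level top-k expert routing as described in the list above. It illustrates the general technique, not the Param-1-7B-MoE implementation; the TopKMoE class name, hidden sizes, and expert MLP shape are assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Token-level top-k routing over a bank of expert MLPs (illustrative only)."""
    def __init__(self, d_model=1024, d_ff=2048, num_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)  # per-token expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (num_tokens, d_model)
        scores = self.router(x)                            # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep the top 8 experts per token
        weights = F.softmax(weights, dim=-1)               # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # mix the chosen experts' outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens that picked expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 10 token embeddings through the sparse expert layer.
layer = TopKMoE()
print(layer(torch.randn(10, 1024)).shape)  # torch.Size([10, 1024])

Because only 8 of the 64 experts fire for each token, the parameters active per forward pass are a small fraction of the total, which is why the card lists ~7B total but ~1.04B active parameters.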

Training Data

The model is pretrained on a large-scale multilingual corpus totaling 4 trillion tokens.

Dataset Composition

PT-1: Pre-Training Phase 1

  • English + Hindi: ~`500B` tokens
  • Code + Math: ~`500B` tokens
  • 14 Indian languages: ~`1.0T` tokens

PT-2: Pre-Training Phase 2

Note: All figures represent net token counts. Multilingual, SFT, RL, and replay data are re-expressions of the same underlying knowledge and are not additive.

  • General Knowledge: ~`470B` tokens

  • Technical Knowledge (Research + STEM + Philosophy + India-focused research): ~`270B` tokens

  • Education, Exams, Domain Specialisation, Math, Benchmarks & Code: ~`330B` tokens

  • Conversational Knowledge (Forums, Q&A): ~`10B` tokens

  • Indic Multilingual Expansion (16 languages; synthetic + OCR + translation + personas): ~`920B` tokens

  • Alignment & Stabilization (within budget):

    • SFT-style: ~`45B` tokens
    • RL / Preference / Safety: ~`4.9B` tokens
    • PT-1 replay (low-resource stabilization): ~`95B` tokens
    • LLM Self-Identity & BharatGen knowledge: ~`0.1B` tokens

Total PT-2 Tokens: ~`2.0T`

Together with the ~`2.0T` tokens of PT-1 (500B English + Hindi, 500B code + math, and 1.0T Indian-language tokens), this accounts for the full 4 trillion token pretraining corpus.

The data mixture was curated and balanced using CLIMB (Clustering-based Iterative Data Mixture Bootstrapping), an advanced data filtering and mixing technique from NVIDIA. This ensures high-quality training signals and fair representation across languages.
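
As a rough intuition for how a clustering-based mixture search works, the sketch below clusters document embeddings and then iteratively proposes cluster-sampling weights, keeping whichever mixture scores best under a proxy metric. This is a toy illustration only, not NVIDIA's CLIMB implementation; the proxy_score callback, cluster count, and Dirichlet perturbation scheme are assumptions made for the example.

import numpy as np
from sklearn.cluster import KMeans

def search_mixture(doc_embeddings, proxy_score, n_clusters=16, n_iters=10, n_candidates=8):
    """Cluster the corpus, then iteratively search for cluster-sampling weights
    that maximize a proxy quality score (e.g. a small proxy model's validation metric)."""
    rng = np.random.default_rng(0)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)

    best_w = np.full(n_clusters, 1.0 / n_clusters)   # start from a uniform mixture
    best_score = proxy_score(labels, best_w)

    for _ in range(n_iters):
        for _ in range(n_candidates):
            w = rng.dirichlet(best_w * 50 + 1)       # perturb the current best mixture
            s = proxy_score(labels, w)
            if s > best_score:
                best_w, best_score = w, s
    return labels, best_w

# Toy usage: random embeddings and a dummy proxy score that favors balanced mixtures.
emb = np.random.default_rng(1).normal(size=(1000, 32))
toy_score = lambda labels, w: -np.abs(w - 1.0 / len(w)).sum()
labels, weights = search_mixture(emb, toy_score)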


Training Details

  • Training framework: NVIDIA NeMo
  • Training infrastructure: Yotta Shakti Cloud
  • Hardware: NVIDIA H100 GPUs

Limitations

  • This is a pretrained base model and may require fine-tuning for instruction-following or chat-based applications.
  • Model outputs may reflect biases present in large-scale multilingual web data.
  • Performance may vary across low-resource domains and specialized tasks.

License

This model is released under the BharatGen non-commercial license.

Please refer to the LICENSE file for detailed terms and conditions.

