📗 SPECTER2–FAPESP Cluster (Multiclass Classification on FAPESP Grande Área do Conhecimento, Level 1)

This model is a fine-tuned version of allenai/specter2_base on a dataset built from public FAPESP research project records (see Training and evaluation data below). It achieves the following results on the evaluation set:

  • Loss: 1.2011
  • Accuracy: 0.8212
  • Precision Micro: 0.8212
  • Precision Macro: 0.8176
  • Recall Micro: 0.8212
  • Recall Macro: 0.8198
  • F1 Micro: 0.8212
  • F1 Macro: 0.8178

Model description

This model is a fine-tuned version of SPECTER2 (allenai/specter2_base) adapted for multiclass classification across the 8 Grande Áreas do Conhecimento of FAPESP.

The model accepts the title, abstract, or title + abstract of a research project and assigns it to exactly one of the eight areas (e.g., Linguistics, Literature and Arts; Health Sciences; Biological Sciences).

Key characteristics:

  • Base model: allenai/specter2_base
  • Task: multiclass document classification
  • Labels: 8 Cluster Areas
  • Activation: softmax
  • Loss: CrossEntropyLoss
  • Output: the single best-matching FAPESP Cluster Area

FAPESP's Clusters represent broad disciplinary domains designed for high-level categorization of research and innovation (R&I) documents.
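
The single-label decision (softmax over eight logits, then argmax) can be sketched in plain Python. The logit values below are illustrative, not actual model outputs; the label list matches the eight areas reported in the evaluation table.

```python
import math

# The 8 FAPESP Grande Áreas the classifier chooses between
AREAS = [
    "Agronomical Sciences", "Applied Social Sciences", "Biological Sciences",
    "Engineering", "Health Sciences", "Humanities",
    "Linguistics, Literature and Arts", "Physical Sciences and Mathematics",
]

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (label, probability) of the single best-matching area."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return AREAS[best], probs[best]

# Illustrative logits, e.g. for a genomics abstract
label, prob = predict([0.1, -1.2, 3.4, 0.5, 1.9, -0.8, -2.0, 0.2])
```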

Intended uses & limitations

This multiclass model is suitable for:

  • Assigning publications to top-level scientific disciplines
  • Enriching metadata in:
    • repositories
    • research output systems
    • funding and project datasets
    • bibliometric dashboards
  • Supporting scientometric analyses such as:
    • broad-discipline portfolio mapping
    • domain-level clustering
    • modeling research diversification
  • Classifying documents when only title/abstract is available

The model supports inputs such as:

  • title only
  • abstract only
  • title + abstract (recommended)
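
A minimal helper for assembling the input text from whichever fields are available. The function name and the plain-space joining convention are assumptions for illustration, not taken from the training code.

```python
def build_input_text(title=None, abstract=None):
    """Combine available fields into one input string.

    Title + abstract is the recommended input; either field alone
    is also accepted.
    """
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    if not parts:
        raise ValueError("need at least a title or an abstract")
    return " ".join(parts)
```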

Limitations

  • Documents spanning multiple fields must be forced into one label—an inherent limitation of multiclass classification.
  • The training labels come from FAPESP-funded project records, not manual expert annotation.
  • Not suitable for:
    • downstream tasks requiring multilabel outputs
    • WoS Categories or ASJC Areas (use separate models)
    • clinical or regulatory decision-making

Predictions should be treated as field-level disciplinary metadata.

Training and evaluation data

The training and evaluation dataset was constructed from publicly available FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) research project records. These records cover funded research projects and scholarships across all scientific domains in Brazil.

The dataset was assembled using the following CSV downloads provided by FAPESP:

  • Auxílios em andamento (ongoing research grants)
  • Auxílios concluídos (completed research grants)
  • Bolsas no Brasil em andamento (ongoing domestic scholarships)
  • Bolsas no Brasil concluídas (completed domestic scholarships)
  • Bolsas no exterior em andamento (ongoing international scholarships)
  • Bolsas no exterior concluídas (completed international scholarships)

Each record contains metadata such as project titles, abstracts, funding type, and scientific classifications.
From these files, the following fields were extracted and standardized:

  • Title (English)
  • Abstract (English)
  • Grande Área do Conhecimento (major scientific domain)
  • Área do Conhecimento (field of study)

Only entries containing at least one English component (title or abstract) were retained.
Scientific areas were normalized and mapped to a controlled English taxonomy to ensure consistency and comparability across records.
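
The normalization step might look like the sketch below. The Portuguese labels follow the standard CNPq Grande Área names, but the exact mapping table used for this dataset is an assumption.

```python
# Illustrative (not exhaustive) mapping from FAPESP's Portuguese
# "Grande Área do Conhecimento" labels to the controlled English taxonomy.
GRANDE_AREA_EN = {
    "Ciências Agrárias": "Agronomical Sciences",
    "Ciências Sociais Aplicadas": "Applied Social Sciences",
    "Ciências Biológicas": "Biological Sciences",
    "Engenharias": "Engineering",
    "Ciências da Saúde": "Health Sciences",
    "Ciências Humanas": "Humanities",
    "Linguística, Letras e Artes": "Linguistics, Literature and Arts",
    "Ciências Exatas e da Terra": "Physical Sciences and Mathematics",
}

def normalize_area(raw):
    """Map a raw FAPESP label to the English taxonomy, ignoring case and whitespace."""
    cleaned = " ".join(raw.split())
    for pt, en in GRANDE_AREA_EN.items():
        if cleaned.casefold() == pt.casefold():
            return en
    raise KeyError(f"unmapped area: {raw!r}")
```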

The final dataset consists of labeled scientific text samples distributed across all eight domains, providing a corpus suitable for supervised classification.

Training procedure

Preprocessing

  • Input text constructed from the abstract
  • Tokenization using the SPECTER2 tokenizer
  • Maximum sequence length: 512 tokens
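
A tokenization sketch matching the settings above, using the base model's tokenizer from the Hugging Face Hub (the sample abstract is illustrative):

```python
from transformers import AutoTokenizer

# SPECTER2 ships with a BERT-style tokenizer; truncation caps inputs at 512 tokens.
tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")

abstract = "We study multiclass classification of research abstracts. " * 100
enc = tokenizer(abstract, truncation=True, max_length=512)
```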

Model

  • Base model: allenai/specter2_base
  • Classification head: linear layer → softmax
  • Loss: CrossEntropyLoss
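
A sketch of the head and loss, assuming a 768-dimensional pooled encoder output (SPECTER2 is BERT-base sized). The random tensor stands in for the encoder; softmax is applied implicitly inside CrossEntropyLoss.

```python
import torch
from torch import nn

hidden_size, num_labels = 768, 8

# Linear classification head over the pooled encoder representation
head = nn.Linear(hidden_size, num_labels)
loss_fn = nn.CrossEntropyLoss()

pooled = torch.randn(4, hidden_size)   # stand-in for encoder output (batch of 4)
labels = torch.tensor([0, 3, 5, 7])    # gold Grande Área indices
logits = head(pooled)                  # shape: (4, 8)
loss = loss_fn(logits, labels)
```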

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 10

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision Micro | Precision Macro | Recall Micro | Recall Macro | F1 Micro | F1 Macro |
|--------------:|------:|-----:|----------------:|---------:|----------------:|----------------:|-------------:|-------------:|---------:|---------:|
| 0.5662 | 1.0 | 3807 | 0.5312 | 0.8141 | 0.8141 | 0.8121 | 0.8141 | 0.8063 | 0.8141 | 0.8087 |
| 0.4099 | 2.0 | 7614 | 0.5368 | 0.8182 | 0.8182 | 0.8118 | 0.8182 | 0.8177 | 0.8182 | 0.8141 |
| 0.2703 | 3.0 | 11421 | 0.6214 | 0.8173 | 0.8173 | 0.8112 | 0.8173 | 0.8176 | 0.8173 | 0.8125 |
| 0.1604 | 4.0 | 15228 | 0.8681 | 0.8259 | 0.8259 | 0.8265 | 0.8259 | 0.8219 | 0.8259 | 0.8230 |
| 0.1161 | 5.0 | 19035 | 1.0616 | 0.8229 | 0.8229 | 0.8207 | 0.8229 | 0.8214 | 0.8229 | 0.8205 |
| 0.0864 | 6.0 | 22842 | 1.2011 | 0.8212 | 0.8212 | 0.8176 | 0.8212 | 0.8198 | 0.8212 | 0.8178 |

Evaluation results

| | precision | recall | f1-score | support |
|---|----------:|-------:|---------:|--------:|
| Agronomical Sciences | 0.848943 | 0.805158 | 0.826471 | 349 |
| Applied Social Sciences | 0.745152 | 0.890728 | 0.811463 | 302 |
| Biological Sciences | 0.835052 | 0.826531 | 0.830769 | 686 |
| Engineering | 0.836036 | 0.890595 | 0.862454 | 521 |
| Health Sciences | 0.828283 | 0.833333 | 0.8308 | 492 |
| Humanities | 0.891648 | 0.816116 | 0.852211 | 484 |
| Linguistics, Literature and Arts | 0.855346 | 0.85 | 0.852665 | 160 |
| Physical Sciences and Mathematics | 0.872576 | 0.807692 | 0.838881 | 390 |
| accuracy | | | 0.838357 | 3384 |
| macro avg | 0.839129 | 0.840019 | 0.838214 | 3384 |
| weighted avg | 0.841008 | 0.838357 | 0.838523 | 3384 |

Framework versions

  • Transformers 4.57.1
  • Pytorch 2.8.0+cu126
  • Datasets 3.6.0
  • Tokenizers 0.22.1