I created a benchmark to evaluate the quality of Russian language output in LLMs. Details:
- A set of 100 (default)/250/500 questions covering general chat and creative writing domains.
- LLM-as-a-Judge, but with clear criteria for marking answers.
- Focuses on errors typical of LLMs in Russian, such as mixed grammatical genders, characters from other alphabets, and made-up words.
- Everything is under an open license!
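One of the listed error classes, characters from other alphabets, can even be flagged mechanically. A minimal sketch in Python (my own illustration, not code from the benchmark):

```python
import re

# Flag words that mix Cyrillic and Latin letters, one of the typical
# LLM errors in Russian mentioned above. Hypothetical helper, not the
# benchmark's actual checker.
MIXED_WORD = re.compile(r"\b(?=\w*[а-яё])(?=\w*[a-z])\w+\b", re.IGNORECASE)

def mixed_alphabet_words(text: str) -> list[str]:
    """Return words containing both Cyrillic and Latin letters."""
    return MIXED_WORD.findall(text)
```

The two lookaheads require at least one Cyrillic and one Latin letter inside the same word; purely Russian or purely English words pass untouched.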
Analysis of results:
- The best models are still closed-source ones, such as Sonnet 4.5, Gemini, and GPT-4o. However, some open models come very close.
- GPT-5 is terrible; I thought it would do better.
- Among open models, Gemma-3-27b-it and Vistral-24B are unrivaled.
- Ruadapt significantly reduces errors compared to Qwen.
- Qwen3 and GPT-oss are very bad, even worse than I expected.
- Qwen3-Next is better than Qwen3; it seems they added Russian to the training data.
- DeepSeek V3 makes few errors, but V3.2-Exp is almost twice as bad.
Collection of 206,204 Public Domain multimedia files featuring:
- Comprehensive metadata: title, description, creator name, keywords, original page URL, and more.
- Contains various media types including images, clip art, artwork, fonts, videos, and TV shows.
- All content explicitly released into the public domain under the CC0 license.
- Organized in a single train split with 206,204 entries.
reacted to prithivMLmods's post with 🔥 · 8 months ago
Dropping downstream tasks with newly initialized parameters and weights that support domain-specific image-classification post-training, based on the SigLIP-2 models: Patch-16/224, Patch-16/256, and Patch-32/256. For more details, please refer to the respective model cards: 🤗
I am fascinated by models learning from prompts and rewards alone - no example answers needed, unlike in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...
I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.
Choosing an original problem forced me to:
- Think about the problem setting
- Generate data
- Choose the right base model
- Design reward functions (and experience reward hacking)
- Run multiple rounds of training, hoping that my model would learn something.
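A reward for this task could, for example, score how many of the requested events end up placed without time overlaps. A minimal sketch under that assumption (my illustration, not the author's actual reward function):

```python
# Sketch: reward = fraction of requested events the model placed
# without overlapping time slots. Names and shapes are hypothetical.

def overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two (start, end) intervals intersect."""
    return a[0] < b[1] and b[0] < a[1]

def schedule_reward(requested: list[str], placed: list[tuple[str, int, int]]) -> float:
    """Score a model-produced schedule in [0, 1]."""
    if not requested:
        return 0.0
    valid: list[tuple[str, int, int]] = []
    for name, start, end in placed:
        if (
            name in requested
            and all(name != n for n, _, _ in valid)  # no duplicate placements
            and start < end
            and not any(overlap((start, end), (s, e)) for _, s, e in valid)
        ):
            valid.append((name, start, end))
    return len(valid) / len(requested)
```

A graded (rather than binary) reward like this gives the policy a learning signal even for partially correct schedules, which matters early in GRPO training; priorities could be folded in as per-event weights.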
Collection of 3,655,810 Scalable Vector Graphics (SVG) icons featuring:
- Sourced from SVGFind across diverse categories & styles
- Includes metadata: unique ID, title, tags, data pack, and license information
- Contains minified SVG markup for direct use or processing
- Organized into splits based on license type (Creative Commons: 3,645,444 icons, Public Domain: 10,366 icons)
With over 3.6 million icons, this appears to be the largest SVG dataset on Hugging Face to date. If you're aware of a larger SVG collection, please let me know and I'll update this post to reference it.
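Since the minified markup is well-formed XML, individual icons can be inspected with the standard library alone. A small sketch (my own example, not part of the dataset tooling):

```python
import xml.etree.ElementTree as ET

def svg_summary(markup: str) -> dict:
    """Extract basic facts from a (minified) SVG string."""
    root = ET.fromstring(markup)
    return {
        "viewBox": root.get("viewBox"),
        # children of <svg>, excluding the root element itself
        "elements": sum(1 for _ in root.iter()) - 1,
    }

# A tiny hand-written icon standing in for a dataset row's markup field.
icon = '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M4 4h16v16H4z"/></svg>'
```

This kind of pass is useful for filtering the corpus by complexity (element count) or canvas size before training or rendering.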
reacted to ImranzamanML's post · 8 months ago
In this work, we tackle some major challenges in Arabic multi-label emotion classification, especially the issues of class imbalance and label correlation that often hurt model performance, particularly for minority emotions.
Our approach:
- Stacked contextual embeddings from fine-tuned ArabicBERT, MarBERT, and AraBERT models.
- A meta-learning strategy that builds richer representations.
- A hybrid loss function combining class weighting, label correlation matrices, and contrastive learning to better handle class imbalance.
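As a rough illustration of the class-weighting component alone (my own sketch; the paper's full hybrid loss additionally includes the label-correlation and contrastive terms):

```python
import math

def weighted_bce(y_true: list[int], y_prob: list[float], pos_weight: list[float]) -> float:
    """Class-weighted binary cross-entropy over one multi-label example.

    pos_weight upweights positives of rare (minority) emotion labels,
    which is one way the class-imbalance issue above can be addressed.
    """
    if not y_true:
        return 0.0
    eps = 1e-7
    total = 0.0
    for t, p, w in zip(y_true, y_prob, pos_weight):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(w * t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)
```

Raising a label's `pos_weight` makes missed positives of that label costlier, pushing recall up on minority emotions at some precision cost on the majority ones.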
Extensive experiments show significant improvements across Precision, Recall, F1-Score, Jaccard Accuracy, and Hamming Loss. The hybrid loss function in particular helped close the gap between majority and minority classes!
We also performed ablation studies to break down each component's contribution, and the results consistently validated our design choices.
This framework isn't just for Arabic; it offers a generalizable path for improving multi-label emotion classification in other low-resource languages and domains.
Big thanks to my co-authors: Muhammad Azeem Aslam, Wang Jun, Nisar Ahmed, Li Yanan, Hu Hongfei, Wang Shiyu, and Xin Liu!
Would love to hear your thoughts on this work!
Collection of 217,510 Scalable Vector Graphics (SVG) icons featuring:
- Sourced from SVGRepo.com across diverse categories & styles
- Includes metadata: title, tags, source collection, and specific license
- Contains minified SVG markup for direct use or processing
- Organized into splits based on individual icon license (e.g., MIT, CC0, Apache)
Collection of 536,231 question-answer pairs featuring:
- Human-posed questions and machine-generated responses for SFT
- Bilingual content in Russian and English with linked IDs
- Derived from 739k+ real user queries, primarily educational topics
- Includes unique IDs and machine-generated category labels
This dataset provides a resource for supervised fine-tuning (SFT) of large language models, cross-lingual research, and understanding model responses to diverse user prompts. Released to the public domain under CC0 1.0 license.
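For the SFT use case, each record maps naturally onto the chat-message format most trainers consume. A sketch with assumed field names `question`/`response` (the dataset's actual column names may differ):

```python
def to_chat(example: dict) -> list[dict]:
    """Convert one QA record into a user/assistant message pair,
    the shape commonly fed to chat-template SFT pipelines.
    Field names are assumptions, not the confirmed schema."""
    return [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["response"]},
    ]
```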
Tutorial 🔥 Training a non-English reasoning model with GRPO and Unsloth
I wanted to share my experiment with training reasoning models in languages other than English/Chinese.
Using Llama 3.1 8B as base, GRPO trainer from trl, and Unsloth optimizations, I got a working prototype in Bulgarian after ~5 hours on an L40S GPU. The approach should work for any language where the base model has some pre-training coverage.
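For a non-English run like this, a natural first reward is language consistency of the generated reasoning. A minimal sketch with the list-in/list-out shape trl's GRPO reward functions use (my assumption about the setup, not necessarily the tutorial's exact reward):

```python
def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Cyrillic block,
    a crude proxy for 'this completion is in Bulgarian'."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)

def language_reward(completions: list[str], **kwargs) -> list[float]:
    """One score per completion, as a GRPO trainer expects."""
    return [cyrillic_ratio(c) for c in completions]
```

In practice this would be combined with a correctness reward, since rewarding the script alone invites reward hacking (fluent-looking but wrong Bulgarian).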
(Probably) the first "longCoT" dataset for the Russian language, created via DeepSeek-R1.
- Prompts taken from the Sky-T1 dataset and translated via Llama3.3-70B.
- Answers and reasoning generated by DeepSeek-R1 (685B).
- 16.4K samples in total, ≈12.4K Russian-only (in the rest, either the answer or the reasoning is in English).
- Languages in the answers and reasoning are labeled using fasttext.
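With per-field language labels, extracting the Russian-only subset is a one-liner. A sketch with hypothetical column names (the real schema may differ):

```python
# 'answer_lang' / 'reasoning_lang' are assumed field names for the
# fasttext labels described above, not the confirmed schema.
def russian_only(samples: list[dict]) -> list[dict]:
    """Keep samples whose answer and reasoning are both Russian."""
    return [
        s for s in samples
        if s["answer_lang"] == "ru" and s["reasoning_lang"] == "ru"
    ]
```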
Dataset highlights:
- 182,405 presentations from ppt4web.ru, a platform for storing and viewing presentations covering a wide range of educational materials
- Primarily in Russian, with content in English, Kazakh, Ukrainian, and Belarusian
- Each entry includes: URL, title, download URL, and filepath
- Contains original PPTX files (converted from PPT for consistency) in addition to metadata
- Data covers a broad spectrum of educational topics and subjects
- Dedicated to the public domain under the Creative Commons Zero (CC0) license
The dataset can be used for analyzing educational presentation content across subjects and languages, for text classification, and for information retrieval. It is particularly valuable for examining trends in education, teaching methodologies, and presentation materials across academic disciplines, and the inclusion of the original files enables in-depth analysis of the presentation formats and structures commonly used in educational settings.
reacted to averoo's post with 🔥 · about 1 year ago
Hello, researchers! I've tried to make reading HF Daily Papers easier and built a tool that writes reviews with LLMs like Claude 3.5, GPT-4o, and sometimes FLUX.
- Classification by topics
- Sorting by publication date and HF addition date
- Syncing every 2 hours
- Hosted on GitHub
- English, Russian, and Chinese
- Top by week/month (in progress)
Introducing the чатгпт-в-россии.рф (in English, something like chatgpt-in-russia[.]rf) Q&A Dataset - nyuuzyou/chatgpt-in-russia-qa
Dataset highlights:
- 628,186 question-answer pairs from чатгпт-в-россии.рф, a Russian question-answering website
- Monolingual content in Russian
- Each entry includes: URL, question, and response
- Data reflects user-generated questions and language-model-generated answers
- Licensed under Creative Commons Zero (CC0) for unrestricted use
The dataset can be used to analyze trends in how AI is used to answer questions in Russia, as well as to examine language patterns and topic distributions.
reacted to nyuuzyou's post with ❤️ · over 1 year ago
Dataset highlights:
- Metadata for 580,977 image files from rule34.world
- Monolingual content: English tags and metadata
- Each entry includes: URL, image URL, filepath, tags, and like count
- Data reflects images available on the Rule34.world platform up to August/September 2024
- Licensed under Creative Commons Zero (CC0) for unrestricted use

This dataset offers a unique window into online anime and art communities, particularly those focused on adult content. It provides opportunities for analyzing tagging trends, image popularity, and content patterns in user-generated art platforms.
This dataset contains high-quality images and tags, making it a great source of data for training LoRA models.
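For the tagging-trend analysis mentioned above, a frequency count over the tags field is the obvious starting point. A sketch assuming a whitespace-delimited or list-valued `tags` field (the real schema may differ):

```python
from collections import Counter

def top_tags(rows: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Most frequent tags across entries. Accepts either a
    space-separated string or a list in the 'tags' field
    (an assumption about the schema, not a confirmed format)."""
    counts: Counter = Counter()
    for row in rows:
        tags = row["tags"]
        counts.update(tags.split() if isinstance(tags, str) else tags)
    return counts.most_common(n)
```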