A collection of SMILES datasets, canonicalized with RDKit and augmented with randomized SMILES at a 33% rate, intended for training robust molecular ML models on diverse inputs.
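The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline used to build the datasets: the function names `canonicalize` and `augment`, the `fraction` and `seed` parameters, and the reading of "33%" as "each molecule is written as a randomized SMILES with probability 0.33" are all assumptions made for the example.

```python
import random

from rdkit import Chem


def canonicalize(smiles):
    """Return the RDKit canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def augment(smiles_list, fraction=0.33, seed=0):
    """Canonicalize every entry, then rewrite a random subset as randomized SMILES.

    Assumption: `fraction` is the per-molecule probability of emitting a
    randomized (non-canonical) SMILES instead of the canonical one.
    Unparseable entries are dropped.
    """
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducible selection
    out = []
    for smi in smiles_list:
        can = canonicalize(smi)
        if can is None:
            continue
        if rng.random() < fraction:
            mol = Chem.MolFromSmiles(can)
            # doRandom=True asks RDKit for a randomized atom ordering
            out.append(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        else:
            out.append(can)
    return out
```

A randomized SMILES encodes the same molecule as its canonical form, so passing any augmented entry back through `canonicalize` should recover the canonical string. That round trip is a convenient sanity check when loading the data.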