Abstract:
Objective This study proposes a clustering framework for Chinese materia medica (CMM) based on a large language model (LLM), aiming to explore potential compatibility patterns among CMMs from the semantic perspective of CMM property theory.
Methods First, a CMM property knowledge base was constructed based on Chinese Materia Medica, including 567 commonly used CMMs characterized by four properties, five flavors, and meridian tropism. Then, 49 CMMs derived from 10 prescriptions for Zangdu (脏毒, pathogenic toxins) recorded in Waike Zhengzong (《外科正宗》, Orthodox Manual of External Medicine) and Yangke Xinde Ji (《疡科心得集》, Collected Insights on Ulcer Medicine) were selected as the experimental dataset. Five semantic representation methods—One-Hot, Word2Vec, Bidirectional Encoder Representations from Transformers (BERT), Beijing Academy of Artificial Intelligence General Embedding (BGE), and Qwen—were applied to encode CMM property information into vector representations. Subsequently, t-distributed Stochastic Neighbor Embedding (t-SNE) was used for nonlinear dimensionality reduction on high-dimensional semantic vectors, followed by k-means clustering (k = 7). Clustering performance was evaluated using the Silhouette Score (SS), Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CHI).
Results The Qwen-based clustering method, CMM-EmbedCluster, achieved the highest SS (0.607 4) and CHI (158.057 2), as well as the lowest DBI (0.499 5), indicating better cluster separation and compactness than the other four representation methods. Visualization of the clustering results showed that the clusters were well separated in the low-dimensional space, with strong inter-cluster discrimination and high intra-cluster functional consistency. Further interpretability analysis revealed stable structural differences among clusters in terms of four properties, five flavors, and meridian tropism, forming functional partitions consistent with CMM property theory.
Conclusion CMM-EmbedCluster utilizes an LLM to achieve semantic-level representation and clustering of CMMs within the framework of CMM property theory, providing support for exploring potential compatibility patterns among CMMs from the perspective of CMM property semantics.
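As an illustration of the simplest baseline named in the Methods (One-Hot encoding) and of the Silhouette Score used for evaluation, the following minimal sketch encodes CMM property records as multi-hot vectors and scores a given cluster assignment. It is not the authors' implementation: the vocabulary lists, the record layout, the function names, and the four placeholder herbs (`herb_a` to `herb_d`) are assumptions made for illustration, and the score is computed directly on the encoded vectors rather than after t-SNE and k-means (k = 7) as in the study.

```python
from math import sqrt

# Assumed vocabulary for the three attribute groups (illustrative, not the paper's).
PROPERTIES = ["cold", "cool", "warm", "hot"]
FLAVORS = ["sour", "bitter", "sweet", "pungent", "salty"]
MERIDIANS = [
    "lung", "large_intestine", "stomach", "spleen", "heart", "small_intestine",
    "bladder", "kidney", "pericardium", "triple_energizer", "gallbladder", "liver",
]
VOCAB = PROPERTIES + FLAVORS + MERIDIANS


def encode(record):
    """Return a multi-hot vector: 1.0 where the herb carries the attribute."""
    tags = {record["property"], *record["flavors"], *record["meridians"]}
    unknown = tags - set(VOCAB)
    if unknown:
        raise ValueError(f"unknown attribute(s): {sorted(unknown)}")
    return [1.0 if term in tags else 0.0 for term in VOCAB]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    scores = []
    for i, (v, lab) in enumerate(zip(vectors, labels)):
        # a: mean distance to the other members of the same cluster.
        same = [euclidean(v, w) for j, (w, l) in enumerate(zip(vectors, labels))
                if l == lab and j != i]
        if not same:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(v, w) for w, l in zip(vectors, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Placeholder records (hypothetical herbs, not taken from the knowledge base).
RECORDS = [
    {"name": "herb_a", "property": "cold", "flavors": ["bitter"],
     "meridians": ["heart", "stomach"]},
    {"name": "herb_b", "property": "cold", "flavors": ["bitter"],
     "meridians": ["heart", "liver"]},
    {"name": "herb_c", "property": "warm", "flavors": ["pungent"],
     "meridians": ["lung", "spleen"]},
    {"name": "herb_d", "property": "warm", "flavors": ["pungent", "sweet"],
     "meridians": ["lung", "stomach"]},
]

VECTORS = [encode(r) for r in RECORDS]
GOOD = silhouette_score(VECTORS, [0, 0, 1, 1])  # groups herbs with similar properties
BAD = silhouette_score(VECTORS, [0, 1, 0, 1])   # mixes cold/bitter with warm/pungent
```

In this toy setting, the assignment that groups herbs with shared properties scores higher than the one that mixes them, which is the behavior the SS is meant to capture. A dense embedding from a language model would replace `encode` while leaving the evaluation step unchanged.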