CMM-EmbedCluster:一种融合大语言模型与中药药性理论的中药聚类框架

CMM-EmbedCluster: a clustering framework for Chinese materia medica based on a large language model and Chinese materia medica property theory

  • 摘要:
    目的 本研究提出一种基于大语言模型的中药聚类框架,旨在从中药药性理论的语义层面挖掘中药间的潜在配伍规律。
    方法 首先,基于《中药学》建立包含四气、五味及归经的567个中药的药性知识数据库。其次,基于《外科正宗》和《疡科心得集》中记载的10首脏毒相关方剂,提取其中49味中药作为实验数据集,采用One-Hot、Word2Vec、双向 Transformer 编码表示(BERT)、北京智源人工智能研究院通用嵌入(BGE)、Qwen五种语义表示方法对药性知识进行向量化表示。最后,利用t分布随机近邻嵌入(t-SNE)算法对高维语义向量进行非线性降维,并基于k-means算法(k = 7)完成聚类分析。采用轮廓系数(SS)、戴维斯-博尔丁指数(DBI)和卡林斯基-哈拉巴斯指数(CHI)对聚类性能进行评价。
    结果 基于Qwen的聚类方法CMM-EmbedCluster取得了最高的SS (0.607 4)和CHI(158.057 2),以及最低的DBI (0.499 5),表明其在类间分离度和类内紧密性方面优于其他方法。中药聚类结果的可视化显示,各聚类在低维空间中具有良好的分离性,类间区分度较高,且类内中药在功能属性上表现出较高一致性。进一步的聚类结果可解释性分析表明,不同聚类在四气、五味及归经等方面呈现出稳定的结构性差异,形成与中药药性理论相一致的功能分区特征。
    结论 CMM-EmbedCluster有效融合大语言模型与中药药性理论,实现中药的语义级表示与聚类,为从中药药性语义角度探索中药间潜在配伍规律提供了支持。

     

    Abstract:
    Objective This study proposes a clustering framework for Chinese materia medica (CMM) based on a large language model (LLM), aiming to explore potential compatibility patterns among CMMs from the semantic perspective of CMM property theory.
    Methods First, a CMM property knowledge base was constructed from Chinese Materia Medica (《中药学》), covering 567 CMMs, each characterized by its four properties, five flavors, and meridian tropism. Then, 49 CMMs drawn from 10 prescriptions for Zangdu (脏毒, pathogenic toxins) recorded in Waike Zhengzong (《外科正宗》, Orthodox Manual of External Medicine) and Yangke Xinde Ji (《疡科心得集》, Collected Insights on Ulcer Medicine) were selected as the experimental dataset. Five semantic representation methods were applied to encode CMM property information as vectors: One-Hot, Word2Vec, Bidirectional Encoder Representations from Transformers (BERT), Beijing Academy of Artificial Intelligence General Embedding (BGE), and Qwen. Subsequently, t-distributed Stochastic Neighbor Embedding (t-SNE) was used for nonlinear dimensionality reduction of the high-dimensional semantic vectors, followed by k-means clustering (k = 7). Clustering performance was evaluated using the Silhouette Score (SS), Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CHI).
    Results The Qwen-based clustering method, CMM-EmbedCluster, achieved the highest SS (0.607 4) and CHI (158.057 2), as well as the lowest DBI (0.499 5), indicating better inter-cluster separation and intra-cluster compactness than the other methods. Visualization of the clustering results showed that the clusters were well separated in the low-dimensional space, with strong inter-cluster discrimination and high functional consistency among the CMMs within each cluster. Further interpretability analysis revealed stable structural differences among clusters in terms of four properties, five flavors, and meridian tropism, forming functional partitions consistent with CMM property theory.
    Conclusion CMM-EmbedCluster utilizes an LLM to achieve semantic-level representation and clustering of CMMs within the framework of CMM property theory, providing support for exploring potential compatibility patterns among CMMs from the perspective of CMM property semantics.
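
The two simplest pieces of the pipeline described in Methods, the One-Hot baseline encoding and the Silhouette Score, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The herb names, property values, and cluster labels below are invented placeholders rather than entries from the paper's knowledge base, and the LLM embedding, t-SNE, and k-means steps are omitted; the hand-assigned labels merely stand in for a clustering result.

```python
from math import dist

# Hypothetical property records, invented for illustration only
# (not taken from the paper's 567-entry knowledge base).
RECORDS = {
    "herb_a": {"nature": ["cold"], "flavor": ["bitter"], "meridian": ["heart", "liver"]},
    "herb_b": {"nature": ["cold"], "flavor": ["bitter"], "meridian": ["heart", "stomach"]},
    "herb_c": {"nature": ["warm"], "flavor": ["pungent", "sweet"], "meridian": ["spleen", "lung"]},
    "herb_d": {"nature": ["warm"], "flavor": ["pungent"], "meridian": ["spleen", "lung"]},
}


def build_vocabulary(records):
    """Collect every (field, value) pair seen in the records, in sorted order."""
    return sorted(
        {(field, value)
         for props in records.values()
         for field, values in props.items()
         for value in values}
    )


def one_hot(props, vocabulary):
    """Encode one record as a 0/1 vector with one slot per vocabulary entry."""
    present = {(field, value) for field, values in props.items() for value in values}
    return [1.0 if entry in present else 0.0 for entry in vocabulary]


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: coefficient is defined as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


vocabulary = build_vocabulary(RECORDS)
vectors = [one_hot(props, vocabulary) for props in RECORDS.values()]
labels = [0, 0, 1, 1]  # hand-assigned labels standing in for k-means output
score = silhouette_score(vectors, labels)
print(f"{len(vocabulary)} features, silhouette = {score:.3f}")
```

A score near 1 means each item lies much closer to its own cluster than to the nearest other cluster, which is the sense in which a higher SS indicates better separation and compactness. In a fuller reproduction the vectors would come from an embedding model and be reduced with t-SNE before clustering.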

     
