Visualization analysis and clinical interpretability evaluation of artificial intelligence-assisted tongue diagnosis

刘梨会; 胡凯文; 周亚娜

doi:10.1016/j.dcmed.2026.05.004

Abstract:
Objective To map the research landscape of artificial intelligence (AI)-assisted tongue diagnosis through bibliometric analysis and to quantify its diagnostic accuracy and clinical interpretability through a diagnostic test accuracy (DTA) meta-analysis.
Methods For the bibliometric analysis, the Web of Science Core Collection (WoSCC) was searched for English-language articles and reviews on AI-assisted tongue diagnosis published between January 1, 2014 and December 31, 2025. The records were analysed using Bibliometrix, VOSviewer, and CiteSpace across several dimensions: annual publication output and disciplinary distribution, journal and citation characteristics, country/region and institutional collaboration, author networks, keyword co-occurrence, and keyword burst detection. For the DTA meta-analysis, four databases (Scopus, PubMed, Web of Science, and China National Knowledge Infrastructure [CNKI]) were searched in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA) guidelines. Sensitivity and specificity were pooled using a bivariate random-effects model with a hierarchical summary receiver operating characteristic (HSROC) curve, with subgroup analyses by disease category, AI model architecture, and sample-size strata. Methodological quality was assessed with the Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) tool, and publication bias was evaluated by Deeks’ funnel plot asymmetry test.
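As a minimal sketch of the per-study quantities that a bivariate random-effects model typically takes as input, the block below computes logit-transformed sensitivity and specificity, their approximate within-study variances, and back-transformed Wald intervals from a single 2×2 table. It is not the authors' code: the function name `study_summary`, the 0.5 continuity correction, and the example counts are all assumptions made for illustration, and the between-study pooling step itself is not shown.

```python
import math


def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: maps a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def study_summary(tp, fp, fn, tn, z=1.96):
    """Per-study sensitivity and specificity on the logit scale.

    Hypothetical helper: returns the point estimate, its logit, the
    approximate variance of the logit, and a back-transformed Wald CI.
    """
    # Assumed convention: add 0.5 to every cell if any cell is zero,
    # so that the logit and its variance stay finite.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (c + 0.5 for c in (tp, fp, fn, tn))

    out = {}
    for name, hit, miss in (("sensitivity", tp, fn), ("specificity", tn, fp)):
        est = hit / (hit + miss)
        var = 1.0 / hit + 1.0 / miss  # approximate variance of logit(est)
        half = z * math.sqrt(var)
        out[name] = {
            "estimate": est,
            "logit": logit(est),
            "logit_var": var,
            "ci": (inv_logit(logit(est) - half), inv_logit(logit(est) + half)),
        }
    return out


# Made-up counts for one hypothetical study (not taken from the paper).
example = study_summary(tp=90, fp=7, fn=10, tn=93)
print(example["sensitivity"]["estimate"], example["specificity"]["estimate"])
```

Working on the logit scale keeps the back-transformed interval inside (0, 1), which is one common reason hierarchical diagnostic-accuracy models are specified on that scale.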
Results A total of 198 publications met the bibliometric eligibility criteria. Annual output increased 24.5-fold (from 2 in 2014 to 49 in 2025), with the period 2022 – 2025 alone accounting for 65.2% of all publications. China contributed approximately 83.5% of all institutional affiliations, with Shanghai University of Traditional Chinese Medicine and Jiatuo Xu being the most productive institution and author, respectively. Keyword analysis identified four thematic clusters (AI and deep-learning architectures, image processing and segmentation, traditional Chinese medicine (TCM)-specific applications, and disease-specific applications) and a temporal evolution from traditional machine learning to deep learning and then to transformer-based, explainable, and multimodal AI architectures. Sixteen studies (14 755 participants) covering metabolic and hepatic disorders, oncological and oral lesions, cardiovascular risk, diabetes, and other clinical applications were included in the DTA meta-analysis. The pooled sensitivity was 90.3% (95% confidence interval [CI]: 86.7% – 93.1%) and the pooled specificity was 93.0% (95% CI: 90.6% – 94.7%); the area under the summary receiver operating characteristic (SROC) curve (AUC) was 0.961. Heterogeneity was substantial (I² = 95.8% for sensitivity; I² = 92.1% for specificity). Subgroup performance was broadly consistent across disease categories, AI architectures, and sample-size strata, and Deeks’ test indicated no significant publication bias (P = 0.258).
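The descriptive bibliometric figures reported above can be cross-checked with simple arithmetic. The short sketch below uses only numbers stated in the abstract; the count of 2022 – 2025 publications is not reported directly, so the value computed here is an implied approximation, and the variable names are illustrative.

```python
# Figures stated in the abstract.
pubs_2014 = 2
pubs_2025 = 49
total_pubs = 198
share_2022_2025 = 0.652  # reported as 65.2%

# Ratio of 2025 output to 2014 output.
fold_change = pubs_2025 / pubs_2014

# Implied (not reported) number of publications in 2022 - 2025.
implied_recent = round(share_2022_2025 * total_pubs)

# Recompute the share from the implied count to confirm consistency.
recomputed_share = round(100 * implied_recent / total_pubs, 1)

print(fold_change, implied_recent, recomputed_share)
```

The ratio reproduces the reported 24.5-fold figure, and an implied count of about 129 publications rounds back to the reported 65.2% share.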
Conclusion AI-assisted tongue diagnosis has progressed rapidly and shows pooled diagnostic performance comparable to established screening modalities, supporting its potential as a complementary and easily accessible decision-support tool.
