Research Updates
Articles below are published ahead of final publication in an issue. Please cite articles in the following format: authors, (year), title, journal, DOI.


Health text simplification: An annotated corpus for digestive cancer education and novel strategies for reinforcement learning.

Publication date: 2024 Oct
Authors: Md Mushfiqur Rahman, Mohammad Sabik Irbaz, Kai North, Michelle S Williams, Marcos Zampieri, Kevin Lybarger
Source: JOURNAL OF BIOMEDICAL INFORMATICS

Abstract:

The reading level of health educational materials significantly influences the understandability and accessibility of the information, particularly for minoritized populations. Many patient educational resources surpass widely accepted standards for reading level and complexity. There is a critical need for high-performing text simplification models for health information to enhance dissemination and literacy. This need is particularly acute in cancer education, where effective prevention and screening education can substantially reduce morbidity and mortality.

We introduce Simplified Digestive Cancer (SimpleDC), a parallel corpus of cancer education materials tailored for health text simplification research, comprising educational content from the American Cancer Society, the Centers for Disease Control and Prevention, and the National Cancer Institute. The corpus includes 31 web pages with corresponding manually simplified versions and consists of 1183 annotated sentence pairs (361 train, 294 development, and 528 test). Utilizing SimpleDC and the existing Med-EASi corpus, we explore Large Language Model (LLM)-based simplification methods, including fine-tuning, reinforcement learning (RL), reinforcement learning with human feedback (RLHF), domain adaptation, and prompt-based approaches. Our experimentation encompasses Llama 2, Llama 3, and GPT-4. We introduce a novel RLHF reward function featuring a lightweight model adept at distinguishing between original and simplified texts, which enables training on unlabeled data.

Fine-tuned Llama models demonstrated high performance across various metrics. Our RLHF reward function outperformed existing RL text simplification reward functions. The results underscore that RL/RLHF can achieve performance comparable to fine-tuning and can improve the performance of fine-tuned models. Additionally, these methods effectively adapt out-of-domain text simplification models to a target domain. The best-performing RL-enhanced Llama models outperformed GPT-4 in both automatic metrics and manual evaluation by subject matter experts.

The newly developed SimpleDC corpus will serve as a valuable asset to the research community, particularly in patient education simplification. The RL/RLHF methodologies presented herein enable effective training of simplification models on unlabeled text and the utilization of out-of-domain simplification corpora.

Copyright © 2024 Elsevier Inc. All rights reserved.
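To make the reward idea concrete, the sketch below is a minimal, hypothetical reconstruction rather than the authors' implementation: it trains a lightweight original-vs-simplified classifier on (unpaired) example sentences and uses the classifier's probability of the "simplified" class as a reward that can be computed on unlabeled model outputs during RL. The class name SimplicityReward, the toy sentences, and the choice of TF-IDF plus logistic regression are illustrative assumptions; the paper's actual reward function and training data differ.

```python
# Minimal sketch (assumed, not the paper's code): a lightweight classifier
# that distinguishes original from simplified health text, whose predicted
# probability of "simplified" serves as an RL/RLHF-style reward signal.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class SimplicityReward:
    def __init__(self, original_texts, simplified_texts):
        # Fit TF-IDF + logistic regression on unpaired examples of
        # original (label 0) and simplified (label 1) sentences.
        texts = list(original_texts) + list(simplified_texts)
        labels = [0] * len(original_texts) + [1] * len(simplified_texts)
        self.clf = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2), min_df=1),
            LogisticRegression(max_iter=1000),
        )
        self.clf.fit(texts, labels)

    def __call__(self, candidate: str) -> float:
        # Reward = estimated probability that the candidate reads as
        # simplified text; computable on unlabeled generations.
        return float(self.clf.predict_proba([candidate])[0, 1])


if __name__ == "__main__":
    # Toy, illustrative training sentences (not from SimpleDC).
    originals = [
        "Colorectal carcinoma screening via colonoscopy is recommended for asymptomatic adults.",
        "Adenomatous polyps may undergo malignant transformation over time.",
    ]
    simplified = [
        "Doctors suggest a colonoscopy to check for colon cancer even if you feel fine.",
        "Some growths in the colon can turn into cancer over time.",
    ]
    reward = SimplicityReward(originals, simplified)
    print(reward("This test helps find colon cancer early."))
```

In practice such a scorer would sit alongside other reward terms (e.g., meaning preservation) inside the RL loop; the point of the sketch is only that the classifier needs labeled original and simplified text, not aligned sentence pairs, which is what allows training the simplification policy on unlabeled data.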