Evaluation of AI-generated versus registered dietitian-authored nutrition responses: a cross-sectional study
Highlight box
Key findings
• Artificial intelligence (AI)-generated nutrition responses scored significantly higher than registered dietitian (RD)-authored responses on clinical quality, empathy, and overall performance. Nearly all AI responses reached the “acceptable” threshold (≥4 on a 5-point Likert scale), while few RD responses did.
What is known and what is new?
• Previous studies in medicine have shown that AI-generated responses can be rated more empathetic than physician responses.
• This study is the first to compare AI and RD responses in nutrition counseling. While AI achieved higher perceived quality and empathy, AI-generated responses were more linguistically complex and less readable, extending prior literature that has documented risks such as occasional inaccuracies or “hallucinations”.
What is the implication, and what should change now?
• AI could augment nutrition counseling by providing consistent, empathetic, and high-quality written responses. Integration into clinical practice should preserve clinician oversight and ensure accuracy, readability, and patient trust.
Introduction
The growing integration of artificial intelligence (AI) in healthcare has sparked considerable interest in its potential to enhance patient communication, reduce clinician workload, and expand access to high-quality health information (1). Large language models (LLMs) such as ChatGPT, released in late 2022, can generate fluent and contextually appropriate responses to a wide range of user queries (2-4). While these capabilities have been rapidly adopted in consumer and professional settings, their role in clinical communication is still being evaluated.
AI applications have already transformed healthcare delivery, from diagnostics to virtual patient care (1). Beyond these applications, AI is recognized as a driver of precision medicine, integrating genomic, clinical, and behavioral data to advance personalized health care (5). Recent studies suggest that AI-generated responses may be favorably perceived in online health information contexts. Ayers et al. found that ChatGPT responses to patient questions on a public medical forum were preferred over physician responses in nearly 80% of cases, with higher ratings for both quality and empathy (6). These findings raise important questions about the ability of AI models to replicate or even surpass human communication in domains that traditionally rely on professional judgment and interpersonal sensitivity. ChatGPT, in particular, has been highlighted as a promising conversational tool, offering efficiency alongside concerns about bias, hallucination, and ethical use in health contexts (2). Others have noted that while ChatGPT can demonstrate surface-level empathy, this does not always translate into authentic, context-sensitive communication (7).
Empathy in the context of nutrition and dietetic practice is fundamental to providing effective patient-centered care. A recent scoping review emphasized that empathetic communication enhances dietitians’ ability to deliver higher-quality and more impactful nutrition counselling (8). Given its central role, evaluating empathy alongside clinical quality is essential when comparing human- and AI-generated nutrition responses.
In the field of nutrition, communication is particularly nuanced. Effective responses must not only convey evidence-based dietary recommendations but also show empathy and adaptability to individual concerns (8). Registered dietitians (RDs) play a critical role in providing personalized and compassionate nutrition advice (9), yet the growing demand for digital health content has created a need for scalable support tools (10). Recent reviews of AI in clinical nutrition and dietetics highlight emerging applications in malnutrition screening, dietary assessment, and chatbot-assisted counseling, while also warning of risks such as bias, lack of accountability, and potential dehumanization (11). Whether LLMs can reliably deliver high-quality and empathetic responses in this context remains unclear. Recent work has explored the feasibility of AI systems specifically designed for nutrition counseling. Sun et al. reported that an AI dietitian model achieved performance comparable to licensed dietitians on the Chinese Registered Dietitian Exam, suggesting potential for LLMs to deliver clinically relevant dietary guidance (12). However, the applicability of these systems to diverse populations and real-world counseling remains uncertain. Recent commentary highlights both opportunities and risks of ChatGPT for credentialed dietitians, noting potential to streamline communication but also risks of misinformation and diminished patient contact (13). Similarly, reviews of AI in healthcare communication report that LLMs like ChatGPT can enhance clarity and empathy in patient interactions, though accuracy concerns and hallucinations remain significant challenges (14). Most prior evaluations of ChatGPT in health care have been broad and not nutrition-specific, underscoring the importance of studies tailored to this domain (15).
This study aimed to evaluate the performance of an advanced LLM (ChatGPT-4o, OpenAI) against responses authored by licensed RDs. Specifically, the perceived clinical quality and empathy of responses were assessed and compared using blinded evaluations by independent RDs. The goal of this research was to explore the potential role of AI in supplementing human communication in nutrition counseling and identify areas where human expertise remains essential. Despite increasing interest in AI for healthcare communication, no prior study has directly compared AI- and RD-authored nutrition responses in terms of both clinical quality and empathy, a critical gap that this study seeks to address. We present this article in accordance with the STROBE reporting checklist (available at https://mhealth.amegroups.com/article/view/10.21037/mhealth-2025-70/rc).
Methods
Study design
This blinded cross-sectional study employed independent evaluations by licensed professionals to compare the quality and empathy of nutrition responses authored by RDs versus those generated by an AI LLM (ChatGPT-4o). The study used purposive and snowball sampling to recruit eight practicing RDs, four assigned to evaluate AI-generated responses and four to evaluate RD-authored responses. Each evaluator rated 100 unique question-response pairs, with one response per question. A power calculation was conducted to detect a 10-percentage-point difference between groups in the proportion of responses meeting the prespecified rating threshold, assuming 80% power and a significance level of 0.05. This yielded a requirement of 389 total evaluations. With each question-response pair counted as one evaluation and each participant scoring 100 responses, the sample was adequately powered using four evaluators per group. Because each evaluation required approximately two hours and participation was voluntary and uncompensated, evaluators were divided into two independent groups to minimize fatigue-related bias. All evaluators were licensed and actively practicing in the United States and were blinded to the study purpose and the source of each response. Although interrater reliability could not be calculated across groups because evaluators assessed distinct response sets, methodological consistency was maintained through identical survey instruments, standardized instructions, and uniform scoring criteria (4). A priori power analysis ensured an adequate within-group sample size to support valid comparison of group-level outcomes despite evaluator separation. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study protocol was reviewed and approved by the Institutional Review Board of Florida International University (No. IRB-24-0447). Informed consent was obtained from all evaluator participants.
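For transparency, the sample-size logic can be reconstructed with a standard two-proportion power calculation. The minimal Python sketch below assumes, for illustration, baseline proportions of 50% versus 60% (the manuscript reports only the 10-percentage-point difference); under that assumption it returns roughly 194 evaluations per group, close to the 389 total reported.

```python
# Illustrative reconstruction of the two-proportion power calculation.
# Assumption (not stated in the paper): baseline proportions of 0.50 vs. 0.60,
# i.e., a 10-percentage-point difference, two-sided alpha = 0.05, power = 0.80.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.50, 0.60)   # Cohen's h for two proportions
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, power=0.80, alpha=0.05, alternative="two-sided"
)
print(round(n_per_group), round(2 * n_per_group))  # ~194 per group, ~388 in total
```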
Data collection
Researchers independently searched for and compiled nutrition question-and-answer forums, identifying publicly available websites where responses were authored by verified RDs. A total of eight eligible forums that met this criterion were included to build the initial pool of RD-authored responses. These included “Ask the Dietitian”, Healthline’s Q&A with PlateJoy dietitians, Houston Family Nutrition, UAMS Nutrition and Hospitality Services, EatFresh’s “Ask the Dietitian”, Clinical Nutrition Services, Cancer Center Nutrition Support, and Nutrition Talk. The questions were divided among three researchers (K.A., M.N., P.H.), who independently screened the forums and selected eligible questions. Questions were eligible if they (I) originated from publicly accessible online nutrition Q&A forums with verifiable RD authorship; (II) addressed consumer- or patient-facing nutrition topics; (III) provided sufficient contextual information to allow a professional dietetic response; and (IV) were written in English. Posts were excluded if they were incomplete, promotional, duplicated across platforms, focused on the dietetics profession rather than nutrition counseling, written in a language other than English, or if RD authorship could not be verified. These inclusion and exclusion criteria were applied consistently across all forums to ensure that each question-answer pair reflected an authentic public nutrition inquiry within dietetic scope of practice. All RD-authored responses were obtained exclusively from platforms that explicitly identified contributors as licensed RDs. Selected forums required credential verification prior to author participation (e.g., hospital-based nutrition services, academic extension programs, or professional dietetics platforms listing RD credentials). Only responses where RD authorship could be clearly verified were included; posts without confirmed RD credentials were excluded. To ensure topic diversity and capture a broad yet representative spectrum of nutrition concerns, forums with multiple categories were stratified by topic (e.g., weight management, diabetes, gastrointestinal health, sports nutrition), and the first five questions within each category were screened prior to random selection. This approach was intended to prevent overrepresentation of a single topic area while preserving the natural distribution of public nutrition inquiries. As a result, the final sample reflected commonly encountered nutrition questions; this design prioritizes representativeness of high-frequency public concerns and may not generalize to rare or highly specialized nutrition scenarios. A total of 247 unique questions and corresponding answers were collected. The question-answer pairs were compiled into a Google Doc, and 100 questions were randomly selected using a custom JavaScript function. Each selected question was entered verbatim into a fresh ChatGPT-4o (OpenAI, San Francisco, CA, USA; August 2024 snapshot) session without system priming, follow-up prompts, or exposure to RD-authored responses, in order to minimize potential data bleed and reduce the influence of prior contextual information. Hereafter, these are referred to as AI-generated responses.
Responses were generated between September 10 and September 20, 2024 (Eastern Standard Time). A complete log of prompts and timestamps is provided in Supplementary File 1 (available online: https://cdn.amegroups.cn/static/public/mhealth-2025-70-1.docx).
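The random selection step was implemented with a custom JavaScript function in Google Docs; a minimal Python equivalent of the same sampling logic, with a hypothetical input file and seed, is sketched below.

```python
# Minimal sketch of the question-sampling step; the study used a custom
# JavaScript function in Google Docs, so the file name and seed are illustrative.
import random

with open("rd_question_answer_pairs.txt", encoding="utf-8") as f:
    # Assumes one question-answer pair per line (247 pairs in total).
    pairs = [line.strip() for line in f if line.strip()]

random.seed(2024)                              # hypothetical seed for reproducibility
sampled_pairs = random.sample(pairs, 100)      # draw 100 pairs without replacement
```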
The question-answer pairs were entered into two Google Forms: one containing only the AI-generated responses and the other containing only the RD-authored responses. Google Forms with custom JavaScript were used to standardize survey formatting and export data to an Excel file. Both surveys asked evaluators (n=8) to rate responses using a 5-point Likert scale (very poor, poor, acceptable, good, very good) for quality, and (not empathetic, slightly empathetic, moderately empathetic, empathetic, very empathetic) for empathy, as well as to assign a single overall score out of 100. Definitions for quality, empathy, and overall score were included in the prompt at the beginning of the survey. Specifically, evaluators were instructed as follows: “Carefully read each patient question (Q) and the corresponding answer (A). Your task is to evaluate the quality and empathy of the response and then provide an overall score. Quality refers to the clinical accuracy of the information, ensuring it aligns with current medical knowledge, as well as its readability for ease of understanding. A high-quality response directly answers the question, is clear, easy to interpret, and provides valuable, actionable information. Rate quality on a 5-point Likert scale: 1 = very poor; 2 = poor; 3 = moderate; 4 = good; 5 = very good. Empathy refers to the response’s ability to acknowledge and respect the patient’s concerns and emotions, demonstrating understanding and support in a compassionate manner. An empathetic response makes the patient feel heard, reassured, and cared for. Rate empathy on a 5-point Likert scale: 1 = not empathetic; 2 = slightly empathetic; 3 = moderately empathetic; 4 = empathetic; 5 = very empathetic. After rating quality and empathy, assign a single overall percentage score between 1 and 100, representing your holistic assessment of the response.”
In addition to domain-specific ratings, evaluators assigned a single overall percentage score (0–100) to reflect their holistic judgment of each response. This score was intended to capture the evaluator’s integrated assessment of clinical quality, clarity, and empathetic communication rather than serving as an independent construct. The overall percentage score was analyzed descriptively and used in secondary analyses to examine how evaluators weighted quality and empathy when forming holistic judgments of response performance.
Statistical analysis
Responses from the Google Form surveys were automatically recorded in linked Excel spreadsheets and imported for analysis. All statistical analyses were performed using Python within a Jupyter Notebook environment. Descriptive statistics were computed for quality, empathy, and overall scores, including means and standard deviations (SDs). All analyses were applied identically to AI-generated and RD-authored responses unless otherwise specified. Two-tailed independent t-tests were conducted to compare mean scores between AI-generated and RD-authored responses, as the study did not pre-specify the direction of expected differences between response types. To assess relationships between quality and empathy within each group, Pearson correlation coefficients were calculated. Pearson correlation coefficients were also calculated to examine associations between the overall percentage score (0–100) and the quality and empathy ratings within each response type. Although results are described directionally in the abstract for clarity of reporting, all inferential analyses were conducted using two-tailed tests to allow detection of differences in either direction and to avoid directional bias.
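As an illustrative sketch only, the comparisons described above could be reproduced in Python along the following lines; the file and column names are assumptions, not the study’s actual variable names.

```python
# Sketch of the main comparisons; file and column names are illustrative
# ('group' = "AI" or "RD"; 'quality' and 'empathy' on 1-5; 'overall' on 0-100).
import pandas as pd
from scipy import stats

df = pd.read_excel("evaluations.xlsx")        # hypothetical export from Google Forms
ai = df[df["group"] == "AI"]
rd = df[df["group"] == "RD"]

# Two-tailed independent t-tests comparing mean scores between response types
for outcome in ["quality", "empathy", "overall"]:
    t, p = stats.ttest_ind(ai[outcome], rd[outcome])
    print(f"{outcome}: t={t:.2f}, p={p:.4f}")

# Pearson correlations within each response type (quality vs. empathy shown)
for label, grp in [("AI", ai), ("RD", rd)]:
    r, p = stats.pearsonr(grp["quality"], grp["empathy"])
    print(f"{label}: r={r:.2f}, p={p:.4f}")
```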
Proportion tests were performed to compare the percentage of responses rated as 4 or higher (≥4) on a 5-point Likert scale (1 = very poor/not empathetic, 5 = very good/very empathetic) for both quality and empathy between AI-generated and RD-authored responses. A score of ≥4 was defined a priori as the threshold for acceptable performance, reflecting responses judged by the RD evaluators as “good/empathetic” or “very good/very empathetic”. These comparisons assessed the relative frequency of responses meeting this defined performance threshold. Kernel density plots were generated to visualize the distribution of scores and illustrate differences in central tendency and variability between groups.
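A similar sketch for the threshold comparisons and kernel density plots is shown below; the text does not specify the exact proportion-test implementation, so a two-proportion z-test is used here for illustration, with assumed column names.

```python
# Sketch of the >=4 threshold comparison and kernel density plots; the column
# names and the choice of a two-proportion z-test are assumptions.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_excel("evaluations.xlsx")            # hypothetical export
ai = df[df["group"] == "AI"]
rd = df[df["group"] == "RD"]

for outcome in ["quality", "empathy"]:
    count = [(ai[outcome] >= 4).sum(), (rd[outcome] >= 4).sum()]
    nobs = [len(ai), len(rd)]
    z, p = proportions_ztest(count, nobs)         # compare % rated >=4 between groups
    print(f"{outcome} >=4: z={z:.2f}, p={p:.4f}")

# Kernel density plots of quality scores by response type
sns.kdeplot(ai["quality"], label="AI-generated")
sns.kdeplot(rd["quality"], label="RD-authored")
plt.xlabel("Quality score (1-5)")
plt.legend()
plt.show()
```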
A sensitivity analysis was conducted to examine whether longer responses authored by RDs were associated with improved ratings. The data were subset using both the median and 75th percentile word count thresholds to assess any changes in the proportion of responses rated ≥4 for quality and empathy. For readability outcomes, including Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease Score (FRES), and syllables per word, Mann-Whitney U tests were applied given the non-normal distribution of scores.
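The word-count sensitivity analysis and the non-parametric readability comparisons could likewise be sketched as follows, again with illustrative file and column names.

```python
# Sketch of the word-count sensitivity analysis and a Mann-Whitney U comparison
# of one readability outcome; file and column names are illustrative.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_excel("evaluations.xlsx")            # hypothetical export
rd = df[df["group"] == "RD"].copy()
ai = df[df["group"] == "AI"]

rd["word_count"] = rd["response_text"].str.split().str.len()

for label, cutoff in [("median", rd["word_count"].median()),
                      ("75th percentile", rd["word_count"].quantile(0.75))]:
    subset = rd[rd["word_count"] > cutoff]
    share = (subset["quality"] >= 4).mean()
    print(f"> {label} ({cutoff:.0f} words): n={len(subset)}, quality >=4 = {share:.0%}")

# Non-normal readability outcomes compared with a Mann-Whitney U test
u, p = mannwhitneyu(ai["fkgl"], rd["fkgl"], alternative="two-sided")
print(f"FKGL: U={u:.0f}, p={p:.4f}")
```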
Readability metrics
To further assess accessibility, responses were analyzed for readability using five established metrics: (I) FKGL (corresponding to the U.S. school grade level required for comprehension); (II) FRES (0–100 scale, with ≥60 considered accessible for general populations); (III) average words per sentence; (IV) average syllables per word; and (V) total word count. These indices provide complementary measures of sentence complexity, vocabulary difficulty, and content load. Public health readability standards from the CDC and NIH, which recommend that health materials be written at or below a 6th- to 8th-grade reading level, guided the interpretation of results (16,17). Group comparisons across these metrics used independent t-tests, except for FKGL, FRES, and syllables per word, which were compared with Mann-Whitney U tests given their non-normal distributions, as described above.
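The five readability indices can be computed with an off-the-shelf library such as textstat; the study does not report which implementation was used, so the sketch below is illustrative only.

```python
# Sketch of per-response readability scoring; the exact implementation used in
# the study is not reported, so the textstat library here is an assumption.
import textstat

def readability_profile(text: str) -> dict:
    words = text.split()
    sentences = max(textstat.sentence_count(text), 1)
    return {
        "fkgl": textstat.flesch_kincaid_grade(text),          # U.S. grade level
        "fres": textstat.flesch_reading_ease(text),           # 0-100, higher = easier
        "words_per_sentence": len(words) / sentences,
        "syllables_per_word": textstat.syllable_count(text) / max(len(words), 1),
        "word_count": len(words),
    }

# Usage example with a short illustrative response
print(readability_profile("Aim for half of your plate to be vegetables at each meal."))
```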
Results
The study sample included 100 matched question-response pairs, each containing an RD-authored and an AI-generated response. Independent evaluations by licensed RDs provided ratings of clinical quality, empathy, and overall performance. Responses were also analyzed for readability using the FKGL, FRES, average syllables per word, words per sentence, and word count (16,17). These metrics provided complementary insights into clarity, vocabulary complexity, and content load, forming the basis for subsequent analyses, including comparisons of mean scores, threshold proportions, sensitivity tests, and correlations.
Overall ratings
Across all three prespecified outcomes (quality, empathy, and overall holistic score), AI-generated responses received substantially higher ratings than RD-authored responses (Table 1). On average, AI responses clustered at the upper end of the 5-point scale, while RD-authored responses were distributed more broadly with lower central tendencies (Figure 1). Kernel density plots demonstrate this separation graphically, showing a tight peak for AI at higher scores and a flatter distribution for RD-authored responses.
Table 1
| Outcome | AI-generated responses | RD-authored responses | P value |
|---|---|---|---|
| Quality Score (1–5 Likert scale) | 4.48±0.31 | 2.56±0.76 | <0.001 |
| Empathy Score (1–5 Likert scale) | 4.62±0.37 | 3.21±0.62 | <0.001 |
| Overall Score (0–100)† | 91.10±5.38 | 66.83±14.71 | <0.001 |
Data are presented as mean ± standard deviation. P values from independent (two-sample) t-test comparing AI-generated vs. RD-authored responses rated by 8 blinded RD evaluators. †, score between 0–100, as a holistic assessment incorporating both quality and empathy. AI, artificial intelligence; RD, registered dietitian.
Threshold analyses
Using a predefined cutoff of ≥4 on the Likert scale (classified as good/very good for quality and empathetic/very empathetic for empathy), nearly all AI-generated responses met these thresholds, compared with only a small minority of RD-authored responses (Table 2, Figure 2). Thus, evaluators were several times more likely to assign AI-generated content ratings in the highest categories.
Table 2
| Outcome | AI-generated responses, % (95% CI) | RD-authored responses, % (95% CI) | P value |
|---|---|---|---|
| Quality ≥4 | 96 (92–100) | 3 (0–6) | <0.001 |
| Empathy ≥4 | 97 (94–100) | 14 (7–21) | <0.001 |
Ratings were made on a 5-point Likert scale; scores ≥4 classified as acceptable/high. Denominator n=100 matched questions. P values from McNemar tests on binary outcomes (AI vs. RD). AI, artificial intelligence; CI, confidence interval; RD, registered dietitian.
Response length and readability
Total word count did not differ meaningfully between AI-generated and RD-authored responses (265±105 vs. 241±176 words; P=0.24). Longer RD responses (those exceeding the median or even the 75th percentile of length) were not associated with improved quality or empathy ratings (Table 3). Sensitivity analyses confirmed that extending response length did not increase the likelihood of RD responses meeting the ≥4 threshold. In contrast, AI responses met the threshold in 96–97% of cases across the full sample (Table 2, Figure 2). Additional readability analyses showed that RD-authored responses scored higher on the FRES and used a simpler vocabulary (Figure 3), while AI-generated responses had shorter sentences (13.8±4.3 vs. 17.6±5.1 words per sentence; P<0.001). Both groups were written at approximately a 10th-grade level according to the FKGL (10.17±2.72 vs. 10.24±2.31; P=0.84), which exceeds CDC and NIH recommendations for public health communication (16,17). Overall, RD-authored responses aligned more closely with CDC readability standards (16).
Table 3
| Subset of RD-authored responses | Word count cutoff | Sample size (n) | Quality ≥4, % (95% CI) | P value | Empathy ≥4, % (95% CI) | P value |
|---|---|---|---|---|---|---|
| > Median word count | 187 | 51 | 4 (−1 to 10) | 0.74 | 16 (6–27) | 0.73 |
| > 75th percentile | 303 | 25 | 4 (−4 to 12) | 0.81 | 20 (4–36) | 0.47 |
Values are the percentage of RD responses rated ≥4 (acceptable) for quality or empathy, stratified by word count. P values from z-tests comparing each subgroup with the overall RD sample (n=100). Ratings were made on a 5-point Likert scale, with scores ≥4 classified as “good/empathetic” or “very good/very empathetic”. CI, confidence interval; RD, registered dietitian.
Correlations
To further clarify how evaluators formed holistic judgments, we examined correlations between quality, empathy, and the overall percentage score within each response type. For RD-authored responses, overall scores were strongly correlated with quality (r=0.79, P<0.001) and moderately correlated with empathy (r=0.36, P<0.001), indicating that evaluators considered both domains when assigning holistic scores, with greater weight placed on clinical quality. In contrast, for AI-generated responses, overall scores were strongly correlated with quality (r=0.80, P<0.001) but showed no significant association with empathy (r=−0.08, P=0.43), suggesting that evaluators’ holistic assessments of AI responses were driven almost entirely by perceived clinical quality (Table 4). Visual inspection of the kernel density plots (Figure 1), however, suggests that a small number of low-scoring outliers contributed to the absence of a statistically significant correlation between empathy and the overall score for AI-generated responses, and that empathy still contributed to evaluators’ holistic assessments.
Table 4
| Response type | Variables compared | Pearson r | 95% CI | P value |
|---|---|---|---|---|
| RD-authored responses | Quality vs. empathy | 0.37 | 0.18 to 0.53 | <0.001 |
|  | Quality vs. overall score | 0.79 | 0.70 to 0.85 | <0.001 |
|  | Empathy vs. overall score | 0.36 | 0.17 to 0.52 | <0.001 |
| AI-generated responses (ChatGPT-4o) | Quality vs. empathy | −0.10 | −0.29 to 0.10 | 0.32 |
|  | Quality vs. overall score | 0.80 | 0.72 to 0.86 | <0.001 |
|  | Empathy vs. overall score | −0.08 | −0.27 to 0.12 | 0.43 |
Pearson correlation coefficients were calculated to examine relationships between quality, empathy, and overall percentage scores within each response type. Overall scores represent evaluators’ holistic assessments (0–100). AI, artificial intelligence; CI, confidence interval; RD, registered dietitian.
Quality and empathy ratings were not significantly correlated for AI-generated responses (r=−0.10, P=0.32), suggesting that evaluators judged these domains independently when reviewing AI output. By contrast, RD responses showed a moderate positive correlation between quality and empathy (r=0.37, P<0.001), indicating that for human-authored content, perceptions of accuracy and empathy tended to rise or fall together.
Discussion
The primary objective of this study was to compare the clinical quality and empathy of AI-generated versus RD-authored nutrition responses using independently scored evaluations by licensed RDs. The results show a consistent and substantial advantage for AI-generated responses across all measured domains in the written format. Because “quality” was operationalized to include clinical accuracy as well as communication features such as clarity and readability, higher AI scores may partly reflect stronger alignment with these predefined rating criteria rather than an objective superiority in clinical judgment. On average, AI-generated responses received higher clinical quality ratings, were perceived as more empathetic, and achieved stronger overall scores. Notably, these differences were not attributable to length, as there was no significant difference in word count between the two groups. Sensitivity analyses further confirmed that longer RD-authored responses, whether above the median or the 75th percentile in word count, did not show improved ratings, while AI-generated responses consistently achieved high scores across the full sample. These findings indicate that differences in response ratings were not explained by response length alone. At the same time, these findings should be interpreted cautiously, as written evaluations may not fully capture the contextual judgement, tailoring, and relational skills that dietitians bring to real-world interactions. However, our readability analysis revealed an interesting counterpoint: RD-authored responses were easier to read according to the FRES and contained simpler vocabulary, aligning more closely with CDC standards for public communication (16,17). In contrast, AI-generated responses used more complex word choices that lowered readability scores. This likely reflects how LLMs are trained on extensive, often formal text sources such as books, academic papers, and web data, leading them to favor semantically rich but less accessible phrasing (3). Both groups, however, were written at approximately a 10th-grade level based on the FKGL, exceeding CDC and NIH recommendations of a 6th–8th grade level for public health materials (16,17). This highlights a trade-off between perceived communication quality and accessibility, suggesting that while AI excels at producing polished, empathetic text, additional refinement may be needed to ensure readability for diverse patient populations.
Kernel density plots revealed a tightly clustered distribution of high scores among AI-generated responses, suggesting that LLMs can produce polished, uniform outputs with a high degree of reliability and consistency in written communication. This uniformity, however, also raises questions about adaptability and nuance. The greater variability among RD-authored responses may reflect individualized approaches to communication that are not easily captured by rating scales. Consistent with this, Clay et al. emphasize that although LLMs often outperform humans on clarity and empathy, their tendency toward occasional inaccuracies underscores the need for human oversight (14). This underlines the value of professional experts: where AI provides consistency, human dietitians contribute adaptability, judgement, and context-sensitive nuance that remain critical to safe practice. It remains essential to assess whether the consistently high scores of AI-generated responses reflect meaningful clinical advantages or simply an ability to meet perceived surface-level expectations (5,18). This pattern aligns with findings showing that ChatGPT (GPT-3.5) responses were often perceived as more empathetic than human comparators (19), and supports the growing recognition of empathy as a cornerstone of effective dietetic practice (8). Similar concerns have been raised elsewhere, where LLMs were found to mimic empathetic phrasing but lacked the depth of genuine clinician-patient interactions (7).
Quality and empathy scores were not significantly associated for AI-generated responses, likely due to a ceiling effect, but were moderately correlated among RD-authored responses. This suggests that when RDs deliver high-quality content, they also tend to be perceived as more empathetic, potentially reflecting a more authentic, integrated communication style. This perspective aligns with broader views in precision medicine, where AI is envisioned as a tool to support personalization of care by augmenting, rather than replacing, professional expertise (5). Comparable work has also shown ChatGPT demonstrating empathetic abilities in controlled settings, such as rephrasing sentences to convey specific emotions with high accuracy (≈92%) and generating parallel emotional responses in standardized dialogue prompts, though these abilities were accompanied by a strong bias toward positive affect and questionnaire scores that fell below healthy human benchmarks (19). Some evaluations show that ChatGPT can display empathetic abilities in structured settings, though these may not reflect authentic patient-centered communication (7).
The sensitivity analysis showed that longer RD-authored responses were not associated with improved ratings, underscoring that effective communication is not simply a matter of length. This finding suggests opportunities for targeted training in online health communication, where clarity, empathy, and precision are critical (10). AI has also been applied in dietetics education, where virtual patient simulations have been used to help students practice empathy and communication, highlighting its role in training as well as patient-centered care (20). These findings are consistent with prior work showing that much online health information exceeds recommended readability thresholds (21), and that AI-generated patient materials in other domains, such as total knee arthroplasty, also fall short of accessibility targets (22). Thus, while AI can enhance written communication, limitations in accessibility and contextual depth highlight the continued importance of human practitioners in ensuring equitable, patient-centered care.
These findings align with recent evidence from other clinical domains. Ayers et al. reported that AI-generated responses (GPT-3.5) to patient medical questions were preferred over physician responses by lay evaluators, particularly due to higher empathy ratings (6). Similarly, Yan et al. found that ChatGPT could deliver patient education for inflammatory bowel disease that was comparable to physician-authored material, though the study emphasized that AI should supplement, not replace, professional experts (23). ChatGPT responses were consistently rated as more empathetic than human comparators in controlled evaluations (19). Similar trends have been observed in evaluations of GPT-4, which demonstrated strong performance on medical challenge problems and was praised for generating polished, uniform responses (18). Other studies note that while ChatGPT shows promise in health communication, many evaluations lack validation in nutrition-specific contexts (15). However, high empathy ratings may reflect linguistic polish rather than authentic, context-sensitive communication, underscoring the need for careful consideration of AI’s role in clinical nutrition counseling (5,18). For example, Naja et al. found that while ChatGPT outputs for diabetes and metabolic syndrome management were often rated clear and well-structured, they frequently omitted critical guideline-based recommendations (e.g., energy deficit targets, nutrient distributions, and physical activity advice), highlighting the gap between perceived clarity and clinical accuracy (24). Others have cautioned that ChatGPT outputs can suffer from hallucinations, bias, and ethical limitations (25). This concern is echoed in the dietetics literature, where Chatelan et al. warned that ChatGPT, while producing fluent responses, often introduces inaccuracies and may undermine the value of credentialed practitioners if used uncritically (13). Evaluations in other clinical domains have also found that ChatGPT responses, while fluent, may be incomplete or unsafe (26). Editorial perspectives have similarly emphasized that LLMs should be designed to augment, rather than replace, professional expertise, and that safeguards are required to prevent inappropriate or unsafe reliance on AI outputs in place of expert guidance (27).
Strengths and limitations
A key strength of this study lies in the use of blinded, independent evaluations by licensed RDs, which minimizes potential bias and strengthens the clinical relevance of findings. By incorporating real-world RD-authored responses, we provide a naturalistic benchmark for comparison with AI-generated responses. Unlike prior evaluations such as Ayers et al. (6), which relied on laypersons to judge the perceived helpfulness and empathy of responses, our study used licensed RDs as evaluators. This ensured that ratings reflected not only tone and readability but also clinical accuracy and adherence to dietetic standards. In addition, our analysis evaluated multiple dimensions of communication (clinical quality, empathetic tone, readability, and sensitivity to response length), providing a comprehensive assessment of communication performance.
Several limitations should be acknowledged. The RD-authored responses were sourced from public forums, created under real-world time constraints, and were not generated with knowledge of future evaluation, conditions that differ from the AI-generated responses produced specifically for this study. Importantly, lower quality ratings among some RD-authored responses should not be interpreted as reflecting a lack of professional expertise. Rather, these responses were written for public-facing forums under contextual constraints such as brevity, informal tone, and communication norms tailored to a lay audience, without the expectation of formal evaluation. In addition, because the study focused on commonly encountered public nutrition questions, the findings may not generalize to rare, highly specialized, or low-frequency nutrition concerns, including emerging or rapidly evolving topics. Readability metrics, while informative, capture surface-level features of text and may not fully reflect how diverse populations comprehend or act on information. Although the number of evaluators was modest, the study was adequately powered based on a priori calculations; nonetheless, larger and more diverse evaluator samples in future work could strengthen generalizability. Because evaluators provided structured ratings without free-text annotations, the study did not systematically capture specific examples or typologies of factual errors or hallucinations, limiting direct comparison of error patterns between AI-generated and RD-authored responses. In addition, our focus on written communication alone does not capture the dynamic interpersonal interactions present in live consultations. While our study highlights differences in clinical quality and empathy, it does not address the potential long-term impacts on patient outcomes, which remains an important avenue for future research.
This study was conducted using ChatGPT-4o, reflecting the state of the technology in 2024. Earlier studies employed versions such as GPT-3.5 or GPT-4, which differ in training data, reasoning ability, and reliability. Since then, newer models (e.g., GPT-5, released in 2025) have been introduced, and their performance may differ from what we observed. These rapid advances highlight both the timeliness and limitations of single-model evaluations, underscoring the need for ongoing research as LLM capabilities evolve.
In addition to these methodological considerations, structural limitations of AI chatbots must be acknowledged. Chatelan et al. highlighted recurring problems such as fabricated references, verbose or generic responses, and lack of transparency in sourcing (13). Similar concerns have been raised in medicine, where GPT-4 has been shown to produce convincing but occasionally inaccurate or misleading responses (18), and reviews caution that ChatGPT outputs may be verbose, biased, or ethically problematic if applied without oversight (2). In the field of dietetics specifically, Atwal emphasized that while AI tools hold promise for dietary assessment and counselling, they also pose risks of bias, lack of transparency, and potential dehumanization if integrated uncritically (11). These issues could influence perceptions of quality and empathy in real-world dietetics practice, even if they were not directly captured in our evaluation. Evidence from other contexts underscores both the promise and the risks of AI-driven nutrition counseling. Sun et al. evaluated an AI dietitian trained for type 2 diabetes management and found that it achieved expert-level performance on the Chinese RD exam and produced recommendations largely consistent with professional guidelines (12). However, the system also displayed limitations, including omission of culturally specific foods and inconsistent detail in dietary recommendations. These findings parallel our observation that AI-generated responses, while highly rated for quality and empathy, may not always capture the nuance and contextual accuracy required for individualized nutrition counseling.
Conclusions
AI-generated responses in nutrition counseling outperformed RD-authored responses in perceived clinical quality, empathy, and overall performance when evaluated in written form by licensed RDs. These findings highlight the potential of LLMs like ChatGPT-4o to support high-quality, empathetic communication in digital health settings. At the same time, lower readability and lack of contextual nuances suggest that AI should be viewed as a complement rather than a replacement for dietetic expertise. Future research should explore how AI can be integrated into clinical workflows, including real-time interactions and hybrid models where healthcare professionals leverage AI tools to enhance, rather than substitute, human expertise. Including evaluations from members of the general public in future studies could provide broader insight into how lay audiences perceive clarity, empathy, and usefulness of AI-generated responses, offering perspectives beyond professional reviewers. As newer versions of ChatGPT and other LLMs continue to evolve, ongoing evaluations will be essential to understand how advances in model design influence performance in clinical nutrition contexts.
Acknowledgments
The authors gratefully acknowledge the RDs who kindly participated in the study.
Footnote
Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://mhealth.amegroups.com/article/view/10.21037/mhealth-2025-70/rc
Data Sharing Statement: Available at https://mhealth.amegroups.com/article/view/10.21037/mhealth-2025-70/dss
Peer Review File: Available at https://mhealth.amegroups.com/article/view/10.21037/mhealth-2025-70/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://mhealth.amegroups.com/article/view/10.21037/mhealth-2025-70/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The protocol was approved by the Institutional Review Board of Florida International University (No. IRB-24-0447). Informed consent was obtained from all evaluator participants.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Al Kuwaiti A, Nazer K, Al-Reedy A, et al. A Review of the Role of Artificial Intelligence in Healthcare. J Pers Med 2023;13:951. [Crossref] [PubMed]
- Ray PP. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys Syst 2023;3:121-54.
- Minaee S, Mikolov T, Nikzad N, et al. Large Language Models: A Survey. arXiv. 2025. arXiv: 2402.06196.
- Chang Q, Chen F, Chen Y, et al. 2025 Expert consensus on retrospective evaluation of large language model applications in clinical scenarios. Intell Med 2025;5:318-30.
- Johnson KB, Wei WQ, Weeraratne D, et al. Precision Medicine, AI, and the Future of Personalized Health Care. Clin Transl Sci 2021;14:86-93. [Crossref] [PubMed]
- Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183:589-96. [Crossref] [PubMed]
- Luo MJ, Bi S, Pang J, et al. A large language model digital patient system enhances ophthalmology history taking skills. NPJ Digit Med 2025;8:502. [Crossref] [PubMed]
- de Graaff E, Bennett C, Dart J. Empathy in Nutrition and Dietetics: A Scoping Review. J Acad Nutr Diet 2024;124:1181-205. [Crossref] [PubMed]
- Ogata B, Carney LN. Academy of Nutrition and Dietetics: Revised 2022 Standards of Practice and Standards of Professional Performance for Registered Dietitian Nutritionists (Competent, Proficient, and Expert) in Pediatric Nutrition. J Acad Nutr Diet 2022;122:2134-2149.e50. [Crossref] [PubMed]
- Chen J, Lieffers J, Bauman A, et al. The use of smartphone health apps and other mobile health (mHealth) technologies in dietetic practice: a three country study. J Hum Nutr Diet 2017;30:439-52. [Crossref] [PubMed]
- Atwal K. Artificial intelligence in clinical nutrition and dietetics: A brief overview of current evidence. Nutr Clin Pract 2024;39:736-42. [Crossref] [PubMed]
- Sun H, Zhang K, Lan W, et al. An AI Dietitian for Type 2 Diabetes Mellitus Management Based on Large Language and Image Recognition Models: Preclinical Concept Validation Study. J Med Internet Res 2023;25:e51300. [Crossref] [PubMed]
- Chatelan A, Clerc A, Fonta PA. ChatGPT and Future Artificial Intelligence Chatbots: What may be the Influence on Credentialed Nutrition and Dietetics Practitioners? J Acad Nutr Diet 2023;123:1525-31. [Crossref] [PubMed]
- Clay TJ, Da Custodia Steel ZJ, Jacobs C. Human-Computer Interaction: A Literature Review of Artificial Intelligence and Communication in Healthcare. Cureus 2024;16:e73763. [Crossref] [PubMed]
- Wang L, Wan Z, Ni C, et al. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J Med Internet Res 2024;26:e22769. [Crossref] [PubMed]
- Centers for Disease Control and Prevention. Simply Put: A guide for creating easy-to-understand materials. 2010. Available online: https://stacks.cdc.gov/view/cdc/11938
- National Institutes of Health. Plain Language at NIH [updated February 21, 2025]. 2025. Available online: https://www.nih.gov/institutes-nih/nih-office-director/office-communications-public-liaison/clear-communication/plain-language-nih
- Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med 2023;388:1233-9. [Crossref] [PubMed]
- Schaaff K, Reinig C, Schlippe T. Exploring ChatGPT's empathic abilities. arXiv 2023. arXiv: 2308.03527v3.
- Barker LA, Moore JD, Cook HA. Generative Artificial Intelligence as a Tool for Teaching Communication in Nutrition and Dietetics Education-A Novel Education Innovation. Nutrients 2024;16:914. [Crossref] [PubMed]
- Friedman DB, Hoffman-Goetz L, Arocha JF. Readability of cancer information on the internet. J Cancer Educ 2004;19:117-22. [Crossref] [PubMed]
- Lower K, Lin JY, Jenkin D, et al. Comparing the Quality and Readability of ChatGPT-4-Generated vs. Human-Generated Patient Education Materials for Total Knee Arthroplasty. Cureus 2025;17:e86491. [Crossref] [PubMed]
- Yan Z, Liu J, Fan Y, et al. Ability of ChatGPT to Replace Doctors in Patient Education: Cross-Sectional Comparative Analysis of Inflammatory Bowel Disease. J Med Internet Res 2025;27:e62857. [Crossref] [PubMed]
- Naja F, Taktouk M, Matbouli D, et al. Artificial intelligence chatbots for the nutrition management of diabetes and the metabolic syndrome. Eur J Clin Nutr 2024;78:887-96. [Crossref] [PubMed]
- Iqbal U, Tanweer A, Rahmanti AR, et al. Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis. J Biomed Sci 2025;32:45. [Crossref] [PubMed]
- Soddu M, De Vito A, Madeddu G, et al. Assessing the Accuracy, Completeness and Safety of ChatGPT-4o Responses on Pressure Injuries in Infants: Clinical Applications and Future Implications. Nurs Rep 2025;15:130. [Crossref] [PubMed]
- Will ChatGPT transform healthcare? Nat Med 2023;29:505-6. [Crossref] [PubMed]
Cite this article as: Ayres K, Nadery M, Henfridsson P. Evaluation of AI-generated versus registered dietitian-authored nutrition responses: a cross-sectional study. mHealth 2026;12:17.

