212 S. Ito et al.: Radioprotection 2025, 60(3), 211–220
In recent years, artificial intelligence (AI) chatbots such as ChatGPT, Copilot, and Gemini have been released, and various studies of them have been conducted in the medical field (Ayers et al., 2023; Decker et al., 2023; Pan et al., 2023). In the area of nuclear disaster prevention, one report suggests that AI chatbots could serve as a decision support tool for humans in radiological emergency response (Chandra and Chakraborty, 2024). It is recommended that patient information materials be readable at a 6th to 8th grade level or lower (Centers for Disease Control and Prevention (U.S.) et al., 2009; Cotugna et al., 2005; Weiss, 2003). However, that report did not assess whether the information produced by AI chatbots is easier to understand and act upon than that on a web page. Moreover, the previous study did not assess the reading level of the text. Therefore, it is not certain whether AI chatbots can be a useful source of information in the event of a nuclear disaster.
The aim of this study was to evaluate the understandability, actionability, and readability of texts related to nuclear disasters produced by an AI chatbot, and to investigate whether it could be a useful source of information in the event of a nuclear disaster. We also investigated whether the AI chatbot could produce texts that were easy to understand at the 6th grade level. To evaluate the AI chatbot output, we compared it to Google Search results. It is important to note that AI chatbots may generate content that includes misinformation, and verifying the reliability of the information they provide is crucial. Additionally, it is essential to understand that AI chatbots do not bear responsibility for the outcomes or interpretations arising from the generated content. This study does not assess the reliability of the generated text but focuses solely on its understandability, actionability, and readability.
2 Methods
2.1 Information generation and website selection
In this study, we conducted a systematic quantitative content analysis of online materials using a cross-sectional design. From March 25 to April 25, 2024, web pages were searched using the Google search engine, and text was generated by three AI chatbots: ChatGPT-3.5 (Chat Generative Pretrained Transformer, model 3.5; OpenAI, San Francisco, CA, USA), Microsoft Copilot (Microsoft Corporation, WA, USA), and Google Gemini (Alphabet Inc., CA, USA). In addition, Ito had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. The keyword selection process was as follows. First, Google Trends was used to identify the top 25 most frequently searched keywords in Fukushima Prefecture during the week of March 11 to March 18, 2011. Second, from those top 25 keywords, we extracted the five keywords related to the nuclear power plant accident: "nuclear power plant," "radioactivity," "radiation," "nuclear power," and "Chernobyl." Third, Google Trends was used to search for each of these five keywords in Fukushima Prefecture during the same week, and the top 25 most frequently searched words for each keyword were extracted. A two-stage extraction was used because single-stage extraction with Google Trends yields a large number of keywords related to the Great East Japan Earthquake and only a small number related to the nuclear power plant accident. In addition, proper nouns such as company and individual names were excluded.
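The two-stage keyword extraction described above can be sketched as follows. This is only an illustration, not the procedure as actually run: `top_queries` and `is_nuclear_related` are hypothetical callables standing in for a Google Trends lookup and the authors' manual judgment, and the removal of duplicates across seed keywords is an assumption that the text does not state.

```python
from typing import Callable, Iterable, List, Optional


def two_stage_keywords(
    top_queries: Callable[[Optional[str]], List[str]],
    is_nuclear_related: Callable[[str], bool],
    proper_nouns: Iterable[str] = (),
    limit: int = 25,
) -> List[str]:
    """Return stage-2 keywords for a fixed region and week.

    top_queries(None) is assumed to return the overall top queries, and
    top_queries(seed) the top queries associated with a seed keyword.
    Both callables are hypothetical placeholders.
    """
    excluded = set(proper_nouns)

    # Stage 1: overall top queries, narrowed to accident-related seeds.
    seeds = [q for q in top_queries(None)[:limit] if is_nuclear_related(q)]

    # Stage 2: top queries for each seed, minus proper nouns.
    keywords: List[str] = []
    for seed in seeds:
        for query in top_queries(seed)[:limit]:
            # Dropping duplicates across seeds is an assumption.
            if query not in excluded and query not in keywords:
                keywords.append(query)
    return keywords
```

Passing the lookup in as a callable keeps the sketch independent of any particular Google Trends client.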
First, we describe the procedure for Google. The extracted keywords were entered into Google Search, and the results were evaluated in the order in which the search engine returned them. Because search results are sorted by relevance to the keyword, lower-ranked results (those appearing lower on the search results page) were considered less relevant to the keyword. According to a previous study, when the general public uses search engines to look for health-related information, the average number of web pages viewed is about eight, and 66% of users reported viewing fewer than five documents (Jansen and Spink, 2006). Based on this, eight web pages were extracted per keyword. Among the search results, videos, advertisements, broken links, inappropriate content, sites requiring subscriptions or fees, pages consisting only of external link lists, pages consisting only of numbers, and pages for professionals were excluded from the evaluation.
Next, we describe the sentence generation procedure for the AI chatbots. Two prompts were used for sentence generation: "Tell me about that keyword" (normal level) and "Tell me about that keyword at a 6th grade level" (6th grade level). To ensure that information from the search history did not influence the results, the search history was erased before prompting the chatbots. Because the output can differ even when the same keyword is used, each AI chatbot was prompted four times for each keyword. Responses that contained obviously incorrect information regarding the search results were excluded from the evaluation.
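The prompting design described above (three chatbots, two prompt levels, and four repetitions per keyword) can be laid out as a simple plan. The sketch below is illustrative only: the template wording with a `{keyword}` placeholder is an assumed reading of "Tell me about that keyword," and `build_prompt_plan` is a hypothetical helper. It does not call any chatbot.

```python
from itertools import product

CHATBOTS = ["ChatGPT-3.5", "Microsoft Copilot", "Google Gemini"]

# Assumed templates: the keyword is substituted for "that keyword".
PROMPT_TEMPLATES = {
    "normal": "Tell me about {keyword}",
    "6th grade": "Tell me about {keyword} at a 6th grade level",
}

REPEATS = 4  # each chatbot was prompted four times per keyword


def build_prompt_plan(keywords):
    """List every (chatbot, keyword, level, attempt) combination."""
    plan = []
    for chatbot, keyword, (level, template) in product(
        CHATBOTS, keywords, PROMPT_TEMPLATES.items()
    ):
        for attempt in range(1, REPEATS + 1):
            plan.append(
                {
                    "chatbot": chatbot,
                    "keyword": keyword,
                    "level": level,
                    "attempt": attempt,
                    "prompt": template.format(keyword=keyword),
                }
            )
    return plan
```

Under these assumptions, a single keyword yields 3 × 2 × 4 = 24 prompts.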
This study was not subjected to ethical review because it did not involve any human subjects or interventions.
2.2 Evaluators
A researcher in the field of nuclear disaster prevention read each document and independently scored all documents except those whose content was clearly inappropriate. Next, a physician who is not a specialist in radiation or nuclear hazards scored a randomly selected 20% of the texts from each condition group using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) (Furukawa et al., 2022; Shoemaker et al., 2014). In cases of disagreement, consensus was reached through discussion.
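Drawing a random 20% of texts from each condition group for the second rater is a stratified sample, which can be sketched as below. The function name, the fixed seed, and the choice to round up to at least one text per group are assumptions made for illustration. The text does not specify how the random selection was carried out.

```python
import math
import random
from collections import defaultdict


def sample_for_second_rater(documents, fraction=0.2, seed=0):
    """Pick a random fraction of document IDs within each condition group.

    documents: iterable of (doc_id, group) pairs.
    Rounding up and the fixed seed are assumptions, not from the study.
    """
    rng = random.Random(seed)

    by_group = defaultdict(list)
    for doc_id, group in documents:
        by_group[group].append(doc_id)

    sample = {}
    for group, ids in by_group.items():
        # round() guards against floating-point noise before ceil().
        k = max(1, math.ceil(round(fraction * len(ids), 9)))
        sample[group] = sorted(rng.sample(ids, k))
    return sample
```

Sampling within each group, rather than from the pooled set, ensures that every condition is represented in the subset scored by both raters.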
2.3 Understandability and actionability
The understandability and actionability of the content of each website were evaluated using the Japanese version of PEMAT-P (Furukawa et al., 2022; Shoemaker et al., 2014). The Japanese version of PEMAT-P consists of 23 items and assesses the following: "content," "word choice and style," "use of numbers," "organization," "layout and design," and "use of visual aids." Evaluators were required to respond to each item by selecting either 0 (disagree) or 1 (agree). The understandability and actionability scores each range from 0% to 100%, with higher scores indicating higher levels of perceived comprehensibility and ease of actionability. The cutoff value was set at 70% for both scores. As for the text generated by AI chatbots, only the fourth result was evaluated.
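As a worked illustration of the scoring, a PEMAT-style percentage is the share of applicable items rated 1 (agree). The sketch below makes two assumptions that the text does not state: items marked not applicable (represented here as `None`) are left out of the denominator, and a score equal to the cutoff counts as meeting it. The function names are hypothetical.

```python
from typing import Optional, Sequence

CUTOFF = 70.0  # percent, applied to both scores


def pemat_score(ratings: Sequence[Optional[int]]) -> float:
    """Percentage of applicable items rated 1 (agree).

    None marks an item treated as not applicable (an assumption here).
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("no applicable items to score")
    if any(r not in (0, 1) for r in applicable):
        raise ValueError("ratings must be 0 (disagree) or 1 (agree)")
    return 100.0 * sum(applicable) / len(applicable)


def meets_cutoff(score: float, cutoff: float = CUTOFF) -> bool:
    # Treating the cutoff as inclusive (>=) is an assumption.
    return score >= cutoff
```

For example, a text rated agree on three of four applicable items scores 75%, which is above the 70% cutoff.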