214 S. Ito et al.: Radioprotection 2025, 60(3), 211–220
3.1 Understandability and actionability
Tables 2 and 3 show the results of comparisons of scale scores for the AI chatbots and Google Search. Table 4 shows the comparison of PEMAT-P item scores. First, we discuss the percentage of materials that met the cutoff score of 70 or higher in PEMAT-P (Tab. 2). With regard to understandability, the 6th grade level texts from Copilot (n = 22, 71.0 %) and Gemini (n = 26, 92.9 %) had significantly higher percentages of materials scoring 70 or higher, while Google Search had a significantly lower percentage (n = 58, 32.8 %). Conversely, with regard to actionability, no materials from the AI chatbots and only one from Google Search scored 70 or higher, with no significant differences between the groups. Next, we discuss the results of comparisons among the PEMAT-P scale scores (Tab. 3). For understandability, the 6th grade level texts from Gemini had the highest scale scores (M = 84.1, SD = 7.4), significantly higher than those of the normal level ChatGPT-3.5 (M = 63.8, SD = 13.3) and Copilot (M = 71.2, SD = 10.9), the 6th grade level ChatGPT-3.5 (M = 70.4, SD = 8.5), and Google Search (M = 60.8, SD = 17.9). For actionability, the scores were significantly higher for normal level Gemini texts (M = 6.9, SD = 16.3) than for Google Search (M = 1.4, SD = 8.4), but were low in both groups.
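To illustrate how a scale score relates to the cutoff of 70, here is a minimal sketch. It assumes the usual PEMAT scoring convention: each applicable item is rated 1 (agree) or 0 (disagree), items marked not applicable are excluded, and the score is the percentage of applicable items rated 1. The function names and the example ratings are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence

CUTOFF = 70.0  # cutoff used in the text ("70 or higher")


def pemat_score(ratings: Sequence[Optional[int]]) -> float:
    """Return the percentage of applicable items rated 1 (agree).

    Each rating is 1 (agree), 0 (disagree), or None (not applicable).
    Not-applicable items are left out of the denominator.
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("no applicable items to score")
    return 100.0 * sum(applicable) / len(applicable)


def meets_cutoff(score: float, cutoff: float = CUTOFF) -> bool:
    """True when the score is at or above the cutoff."""
    return score >= cutoff


# Hypothetical ratings for a single material (assumption, not study data):
# 8 applicable items, 6 of them rated "agree", 2 items not applicable.
example = [1, 1, 0, None, 1, 1, 0, 1, None, 1]
score = pemat_score(example)
print(score, meets_cutoff(score))
```

Under this convention a material can only reach 70 if at least 70 % of its applicable items are rated "agree", which is why near-zero actionability scores leave almost no material above the cutoff.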
Table 4 shows the results of the PEMAT-P item score comparison among AI chatbots and Google Search. The following items from Google Search had significantly lower percentages of applicable materials than the other groups.
3.2 Readability
The percentages of jReadability difficulty levels are given in Table 2. Gemini at the normal level and Copilot and Gemini at the 6th grade level had significantly higher percentages of responses rated very readable to somewhat difficult. Conversely, ChatGPT-3.5 and Copilot at the normal level and Google Search had significantly lower percentages of such responses. The results of the comparison of jReadability difficulty scores (continuous variables) and sentence length are given in Table 3. For the jReadability score, the normal level Gemini (M = 1.7, SD = 0.5) and 6th grade level Copilot (M = 2.7, SD = 1.2) and Gemini (M = 2.9, SD = 0.9) had significantly lower scores than the normal level ChatGPT-3.5 (M = 1.7, SD = 0.5), Copilot (M = 2.1, SD = 1.0), and Google Search (M = 1.5, SD = 3.0).
4 Discussion
In this study, we evaluated the understandability, actionability, and readability of radiation-related texts generated by AI chatbots and investigated whether AI chatbots can be a useful tool for disseminating radiation-related disaster prevention information. The results showed that, compared with the web pages returned by Google Search, the texts produced by the AI chatbots were easier to understand, used less difficult Japanese, and stated the purpose of the document more clearly. The results also suggested that when the prompt “Please teach me at a 6th grade level” was included, the difficulty level of the Japanese text often decreased compared to text generated without this prompt. The
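Table 2. Comparison of item scores among artificial intelligence chatbots and web search (categorical variable).
Each cell shows n / % / AR. ChatGPT-3.5, Copilot, and Gemini are the artificial intelligence chatbots (normal level and 6th grade level); Google is the web search.

| Item | ChatGPT-3.5 (normal level) | Copilot (normal level) | Gemini (normal level) | ChatGPT-3.5 (6th grade level) | Copilot (6th grade level) | Gemini (6th grade level) | Google (web search) |
|---|---|---|---|---|---|---|---|
| PEMAT (> 70): Understandability | 12 / 40.0 / −0.9 | 17 / 53.1 / 0.6 | 17 / 58.6 / 1.2 | 19 / 65.5 / 1.9 | 22 / 71.0 / 2.7 | 26 / 92.9 / 4.9 | 58 / 32.8 / −5.7 |
| PEMAT (> 70): Actionability | 0 / 0.0 / −0.3 | 0 / 0.0 / −0.3 | 0 / 0.0 / −0.3 | 0 / 0.0 / −0.3 | 0 / 0.0 / −0.3 | 0 / 0.0 / −0.3 | 1 / 0.6 / 1.0 |
| jReadability difficulty level * | 10 / 8.0 / −7.0 | 32 / 23.9 / −3.1 | 69 / 55.2 / 4.9 | 42 / 34.7 / −0.3 | 74 / 55.6 / 5.2 | 73 / 56.2 / 5.2 | 38 / 21.5 / −4.4 |
| Very readable | 0 / 0.0 / −0.4 | 0 / 0.0 / −0.4 | 1 / 0.8 / 2.6 | 0 / 0.0 / −0.4 | 0 / 0.0 / −0.4 | 0 / 0.0 / −0.4 | 0 / 0.0 / −0.5 |
| Readable | 0 / 0.0 / −2.3 | 2 / 1.5 / −1.4 | 9 / 7.2 / 2.3 | 0 / 0.0 / −2.3 | 14 / 10.5 / 4.6 | 8 / 6.2 / 1.7 | 1 / 0.6 / −2.4 |
| Neutral | 0 / 0.0 / −3.7 | 16 / 11.9 / 1.4 | 29 / 23.2 / 6.2 | 1 / 0.8 / −3.3 | 13 / 9.8 / 0.5 | 20 / 15.4 / 2.9 | 3 / 1.7 / −3.7 |
| Somewhat difficult | 10 / 8.0 / −4.4 | 14 / 10.4 / −3.8 | 30 / 24.0 / 0.2 | 41 / 33.9 / 2.9 | 47 / 35.3 / 3.5 | 45 / 34.6 / 3.3 | 34 / 19.2 / −1.5 |
| Difficult | 66 / 52.8 / 1.3 | 73 / 54.5 / 1.8 | 48 / 38.4 / −2.2 | 73 / 60.3 / 3.0 | 55 / 41.4 / −1.5 | 54 / 41.5 / −1.4 | 79 / 44.6 / −0.8 |
| Very difficult | 47 / 37.6 / 8.6 | 25 / 18.7 / 2.0 | 8 / 6.4 / −2.4 | 6 / 5.0 / −2.9 | 0 / 0.0 / −4.9 | 3 / 2.3 / −4.0 | 36 / 20.3 / 3.1 |
| Not measurable (too difficult) | 2 / 1.6 / −1.3 | 4 / 3.0 / −0.4 | 0 / 0.0 / −2.3 | 0 / 0.0 / −2.3 | 4 / 3.0 / −0.4 | 0 / 0.0 / −2.4 | 24 / 13.6 / 7.9 |
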
Since jReadability can evaluate at most 20,000 characters, we evaluated only the first 20,000 characters of each text. * The total for Very readable to Somewhat difficult is listed.
A χ² test was performed to calculate adjusted residuals (AR); a difference is significant if the adjusted residual is greater than 1.96 or less than −1.96 (p < 0.05).
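
The adjusted residuals in the footnote can be illustrated with a short sketch. It uses the standard formula for adjusted standardized residuals in a contingency table: (observed − expected) / sqrt(expected × (1 − row share) × (1 − column share)). The counts of materials meeting the understandability cutoff come from Table 2. The group totals are an assumption, back-calculated from the reported n and % (for example, 12 / 40.0 % gives 30), so the output should land close to the published AR values but may not match every decimal.

```python
from math import sqrt


def adjusted_residuals(table):
    """Adjusted standardized residuals for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            variance = (
                expected
                * (1 - row_totals[i] / n)
                * (1 - col_totals[j] / n)
            )
            out.append((observed - expected) / sqrt(variance))
        result.append(out)
    return result


groups = [
    "ChatGPT-3.5 (normal)", "Copilot (normal)", "Gemini (normal)",
    "ChatGPT-3.5 (6th grade)", "Copilot (6th grade)", "Gemini (6th grade)",
    "Google Search",
]
# Materials meeting the understandability cutoff (n column of Table 2).
met = [12, 17, 17, 19, 22, 26, 58]
# Assumption: group sizes inferred from n and %, not stated in this section.
totals = [30, 32, 29, 29, 31, 28, 177]
not_met = [t - m for t, m in zip(totals, met)]

ar = adjusted_residuals([met, not_met])
for name, value in zip(groups, ar[0]):
    flag = "significant" if abs(value) > 1.96 else "not significant"
    print(f"{name}: AR = {value:.2f} ({flag})")
```

A value near the ±1.96 threshold can change classification with small differences in rounding or in the assumed group sizes, so borderline cells should be read from the published table, not from this sketch.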