Table III. Intra-rater agreement within each rater (A, B and C) and inter-rater agreement between all raters during test occasions 1 and 2 for Fugl-Meyer Assessment Upper Extremity (FMA-UE) subscale B, C and D items, sums and the total score A–D

                                              Intra-rater agreement (PA %)  Inter-rater agreement (PA %)
                                              Rater A   Rater B   Rater C   Test occasion 1   Test occasion 2
B. WRIST
Stability at 15° dorsiflexion (elbow at 90°)  90        94        89        98                98
Repeated wrist flexion (elbow at 90°)         92        91        89        97                97
Stability at 15° dorsiflexion (elbow at 0°)   85        88        83        95                95
Repeated wrist flexion (elbow at 0°)          87        88        89        97                97
Circumduction                                 90        91        97        93                93
SUM B, range 0–10p                            66        74        72        89                89
SUM B, 1 point                                88        87        87        97                97
C. HAND
Mass flexion                                  97        100       100       100               100
Mass extension                                97        100       100       100               100
Hook grasp                                    90        94        89        100               100
Thumb adduction                               92        97        94        97                100
Opposition/pincer grasp                       85        91        92        93                100
Cylinder grasp                                92        100       94        97                98
Spherical grasp                               92        100       97        98                100
SUM C, range 0–14p                            75        89        74        93                98
SUM C, 1 point                                95        97        92        98                100
D. COORDINATION/SPEED
Tremor                                        87        94        92        93                93
Dysmetria                                     82        88        83        90                92
Time                                          85        88        97        98                98
SUM D, range 0–6p                             67        71        78        85                85
TOTAL A–D, range 0–66p                        33 (RC)a  45        46        67                75
TOTAL A–D, 1 point difference                 60        61        61        80                83
TOTAL A–D, 2 points difference                70        66        68        87                88
TOTAL A–D, 3 points difference                78        79        82        97                93

a Statistically significant disagreement (absolute value of RP/RC ≥ 0.1 and 95% CI does not include 0).
PA: percentage of agreement; RP: relative position; RC: relative concentration.
measured as RV, were all close to zero. Exact RP and RC values along with 95% CIs are displayed in Tables SIII–IV¹.
The PA was above 90% for all items between the raters (Tables II and III). Full agreement (100%) was observed for reflex activity (A.I.), adduction/internal rotation and elbow extension (A.II), stability of the wrist (B), and mass extension and flexion of the hand, hook grasp, thumb adduction, pincer grasp and spherical grasp (C) in at least 1 of the test occasions. At the subscale and total score level, the PA varied between 67% and 93% on the first test occasion and between 75% and 98% on the second, indicating improved agreement on the second test occasion.
An 80% PA was reached for subscale A and the total score A–D when a 1-point difference between the raters was accepted.
DISCUSSION
This study demonstrated that the FMA-UE is a reliable clinical instrument for the evaluation of upper extremity motor function early after stroke. Only one item (forearm pronation within synergies) in the inter-rater reliability testing and 4 items (elbow extension, forearm pronation within synergies, shoulder flexion to 90°, normal reflex activity) out of 33 in the intra-rater reliability testing showed statistically significant
systematic disagreements, either in relative position
or in concentration. A systematic shift towards higher
scores on the second test occasion within the same rater
was observed for some items and for the total score.
In addition, the intra- and inter-rater agreement was high (79–100%) for all single items, which supports the use of single items from the FMA-UE. The 70% level of intra-rater agreement was also reached for subscale C, but a 1-point difference was needed for subscales B and D, and a 3-point difference for subscale A and the total score A–D. Inter-rater agreement was above 80% for subscales B, C and D, and only a 1-point difference was needed for subscale A and the total score A–D to reach this level of agreement.
This study is the first to investigate the item-level
intra- and inter-rater reliability of the FMA-UE in a
relatively large sample of patients early after stroke.
Previous studies have largely evaluated reliability in relatively small samples and used statistical methods, such as ICC, which are less suitable for ordinal data (33). However, a recent study used weighted kappa statistics and reported high item-level reliability when the FMA-UE was scored from video recordings (19). Weighted kappa is a commonly used measure of agreement, but it fails to identify systematic disagreements and ignores the rank-invariant properties of ordinal data. It also assumes that the raters have equal skill level, which