Reliability of Fugl-Meyer Assessment of upper extremity in stroke
means that systematic disagreements are ignored (33,
34). In addition, the weighted kappa value depends on
the choice of weights and is sensitive to the number of
categories, which means that the value increases when
the number of categories decreases (33).
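The dependence of weighted kappa on the choice of weights can be illustrated with a minimal sketch. The ratings below are hypothetical 3-level item scores (0, 1, 2) invented for illustration, not data from the study, and the function name `weighted_kappa` and its `power` parameter are assumptions of this sketch. It uses the standard disagreement-weight form of the statistic with linear (`power=1`) and quadratic (`power=2`) weights.

```python
def weighted_kappa(ratings_1, ratings_2, n_categories, power=1):
    """Weighted kappa with disagreement weights (|i - j| / (k - 1)) ** power.

    power=1 gives linear weights, power=2 gives quadratic weights.
    Ratings are integer category indices in the range 0..n_categories-1.
    """
    if len(ratings_1) != len(ratings_2) or not ratings_1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_1)
    k = n_categories

    # Observed proportions for every pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_1, ratings_2):
        observed[a][b] += 1.0 / n

    # Marginal proportions for each set of ratings.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores from two raters on a 3-level item (illustration only).
rater_a = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
rater_b = [0, 1, 1, 1, 2, 1, 0, 2, 2, 0]

kappa_linear = weighted_kappa(rater_a, rater_b, n_categories=3, power=1)
kappa_quadratic = weighted_kappa(rater_a, rater_b, n_categories=3, power=2)
print(round(kappa_linear, 2), round(kappa_quadratic, 2))
```

For these invented ratings the same set of paired scores gives approximately 0.52 with linear weights and 0.67 with quadratic weights, which is the kind of weight dependence described above.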
The current study used a statistical method parti-
cularly designed for ordinal data. This method makes
it possible to measure type and extent of observed
inter- and intra-rater disagreements in terms of sys-
tematic and non-systematic disagreements (32). The
disagreements caused by random individual variability
between- and within-raters were negligibly small and
non-significant in the current study. This means that the
extent of uncertainty in interpretation of the FMA-UE
scale categories among raters was small. This small
individual variability or low level of noise can be con-
sidered as a sign of good quality of the FMA-UE scale
properties and/or that the heterogeneity among raters
was small. In this study all raters had several years’
clinical experience and underwent formal training on
FMA assessment organized by the experts in the field
prior to data collection (21). All raters had also taken
part in the translation process of the scale, which might
have positively influenced the results (21). The fact that
each item of the FMA-UE is scored on only 3 levels can
also increase the likelihood of reaching high reliability,
which can be considered a strength of the scale. The
coarser scoring might consequently reduce
the sensitivity of the scale to change, but the use of
the total score will, in turn, reduce this risk.
The systematic disagreements observed were few
and occurred mostly in the intra-rater testing performed
over 2 consecutive days. These disagreements were
predominantly caused by a systematic shift of the
scores towards a higher score on the second test occasion.
Since spontaneous recovery can be expected
to occur during the first 10 days post stroke, it is likely
that the observed disagreements were caused mainly by
spontaneous recovery along with a possible learning
effect. However, extra attention should be paid to
standardization and training for the specific items
that showed systematic disagreements (e.g. shoulder
flexion 0–90°, elbow extension and pronation within
synergies as well as normal reflex activity). This is an
example of how a rank invariant method, as used in the
current study, can be used as guidance for identifying
the problematic items of a scale.
In addition to high agreement observed at the item
level, good intra- and inter-rater reliability was seen
at the subscale and total score level. For the inter-rater
reliability, a 70% agreement of the FMA-UE total score
was reached when a difference of only one point between
raters’ scores was accepted. Similarly, for the intra-rater
testing, a 3-point difference in the FMA-UE total score
was needed to reach the 70% agreement within raters.
These numbers can be used as guidance for estimating
the expected variance in scorings of the total score
between several trained and experienced assessors in
the early subacute stage after stroke. The suggested
minimum clinical difference for the FMA-UE is 3.6
points (35) and minimal clinically important difference
is 9–10 points (36) in the subacute stage after stroke.
In the current study, good agreement for the FMA-UE
total score was reached with lower values for all paired
comparisons. This might be explained by joint training
with experts and identification of problem areas during
the translation and cultural adaptation process of the
scale prior to data collection (21).
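The percentage agreement figures above can be read as the share of patients whose two total scores differ by no more than an accepted number of points. A minimal sketch of that calculation follows, assuming this reading; the scores are hypothetical values invented for illustration (not study data), the function names are this sketch's own, and the maximum total of 66 points used as the search limit is an assumption stated here rather than in the text.

```python
def agreement_within(scores_a, scores_b, tolerance):
    """Percentage of paired total scores differing by at most `tolerance` points."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and of equal length")
    hits = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return 100.0 * hits / len(scores_a)


def smallest_tolerance(scores_a, scores_b, target_percent=70.0, max_score=66):
    """Smallest accepted point difference that reaches the target agreement.

    max_score=66 is an assumed upper limit for the total score.
    Returns None if the target is never reached.
    """
    for tolerance in range(max_score + 1):
        if agreement_within(scores_a, scores_b, tolerance) >= target_percent:
            return tolerance
    return None


# Hypothetical total scores from two raters for ten patients (illustration only).
totals_rater_a = [60, 45, 66, 30, 52, 12, 66, 48, 20, 58]
totals_rater_b = [61, 45, 66, 33, 52, 14, 66, 47, 20, 55]

exact = agreement_within(totals_rater_a, totals_rater_b, tolerance=0)
within_one = agreement_within(totals_rater_a, totals_rater_b, tolerance=1)
needed = smallest_tolerance(totals_rater_a, totals_rater_b)
print(exact, within_one, needed)
```

With these invented scores, exact agreement is 50%, agreement within one point is 70%, and one point is therefore the smallest accepted difference that reaches the 70% level.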
The current study aimed to include a representative
cohort of patients with stroke who would be candida-
tes for sensorimotor assessment using either upper or
lower extremity FMA. As a result, however,
approximately 22% of the included patients received
the maximum score on the FMA-UE. This selection bias
might have influenced the study results by reducing
the variability of the possible scores in these patients,
and making it more likely to achieve a high intra- and
inter-rater reliability. This study included patients from
the Central Military Hospital, which is free of cost for
those serving in the armed forces. Due to the previous
civil war, traumatic injuries have dominated, but stroke
is increasing. This is the cause of the relatively low
recruitment rate of 3.5/month. However, almost half of
the cohort are women, the mean age is representative
of middle-income countries, and the cohort varies
in sociodemographic background. A possible
methodological limitation of the study could be that the
time between test occasions 1 and 2 was only one day,
which might have caused a recall bias for the raters. To
minimize this bias a longer time between assessments
could have been used, but this would in turn
increase the likelihood of spontaneous recovery
occurring. Here, a video-based assessment could have been a
possible solution, although scoring from video is not
common in clinical practice and the results would only
be valid for video-based assessments. In the current
study the assessments were done by 3 physiotherapists,
which is relevant considering clinical practice. In clinical
acute settings it is common that several therapists per-
form the testing. In these situations continuing training
becomes even more important.
Standardized testing protocols and training proce-
dures for FMA-UE are crucial in order to reach suf-
ficient accuracy in scorings and agreement between
assessors at different time-points (14, 37). A recent
study showed, however, that the FMA-UE was mo-
dified in 12 out of 79 studies (11). This heterogeneity
limits the ability to pool data from different studies and
J Rehabil Med 51, 2019