Journal of Rehabilitation Medicine 51-9 | Page 33

Reliability of Fugl-Meyer Assessment of upper extremity in stroke

means that systematic disagreements are ignored (33, 34). In addition, the weighted kappa value depends on the choice of weights and is sensitive to the number of categories, which means that the value increases when the number of categories decreases (33). The current study used a statistical method designed specifically for ordinal data. This method makes it possible to measure the type and extent of observed inter- and intra-rater disagreements in terms of systematic and non-systematic disagreements (32). The disagreements caused by random individual variability between and within raters were negligibly small and non-significant in the current study. This means that the extent of uncertainty in the interpretation of the FMA-UE scale categories among raters was small. This small individual variability, or low level of noise, can be considered a sign of good quality of the FMA-UE scale properties and/or of small heterogeneity among raters. In this study all raters had several years' clinical experience and underwent formal training on FMA assessment, organized by experts in the field, prior to data collection (21). All raters had also taken part in the translation process of the scale, which might have positively influenced the results (21). The fact that each item of the FMA-UE is scored on only 3 levels can also increase the likelihood of reaching high reliability, which can be considered a strength of the scale. The more approximate scoring might consequently reduce the sensitivity of the scale to change, but the use of the total score will, in turn, reduce this risk. The systematic disagreements observed were few and occurred mostly in the intra-rater testing performed over 2 consecutive days. These disagreements were predominantly caused by a systematic shift of the scores towards a higher score on the second test occasion.
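The dependence of weighted kappa on the choice of weights, noted above, can be illustrated with a minimal sketch. The ratings below are invented for illustration and are not data from this study; the function name `weighted_kappa` and its parameters are hypothetical, and the 3-level coding (0, 1, 2) simply mirrors the item scoring of the FMA-UE.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories, weights="linear"):
    """Cohen's weighted kappa for two raters scoring the same subjects.

    Ratings are integers in 0 .. n_categories - 1.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    if n_categories < 2:
        raise ValueError("need at least two categories")

    n = len(ratings_a)
    k = n_categories

    # Cross-tabulate the paired ratings (rows: rater A, columns: rater B).
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement_weight(i, j):
        # 0 on the diagonal, 1 for the largest possible disagreement.
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed_disagreement = sum(
        disagreement_weight(i, j) * observed[i][j]
        for i in range(k) for j in range(k)
    ) / n
    # Disagreement expected by chance, from the two marginal distributions.
    expected_disagreement = sum(
        disagreement_weight(i, j) * row_totals[i] * col_totals[j]
        for i in range(k) for j in range(k)
    ) / (n * n)
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when only one category is used")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical item scores (0, 1 or 2) from two raters for ten patients.
rater_a = [0, 0, 1, 1, 2, 2, 2, 1, 0, 2]
rater_b = [0, 1, 1, 1, 2, 2, 1, 1, 0, 0]

kappa_linear = weighted_kappa(rater_a, rater_b, 3, weights="linear")
kappa_quadratic = weighted_kappa(rater_a, rater_b, 3, weights="quadratic")
print(kappa_linear, kappa_quadratic)
```

The same paired ratings give two different kappa values depending only on the weighting scheme, which is the sensitivity described in the text. The sketch does not implement the rank-invariant method used in the study.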
Since spontaneous recovery can be expected to occur during the first 10 days post stroke, it is likely that the observed disagreements were caused mainly by spontaneous recovery along with a possible learning effect. However, extra attention should be paid to standardization and training for the specific items that showed systematic disagreements (e.g. shoulder flexion 0–90°, elbow extension and pronation within synergies, as well as normal reflex activity). This is an example of how a rank-invariant method, as used in the current study, can serve as guidance for identifying the problematic items of a scale. In addition to the high agreement observed at the item level, good intra- and inter-rater reliability was seen at the subscale and total score level. For the inter-rater reliability, 70% agreement on the FMA-UE total score was reached when a difference of only one point between raters' scores was accepted. Similarly, for the intra-rater testing, a 3-point difference in the FMA-UE total score was needed to reach 70% agreement within raters. These numbers can be used as guidance for estimating the expected variance in scorings of the total score between several trained and experienced assessors in the early subacute stage after stroke. The suggested minimum clinical difference for the FMA-UE is 3.6 points (35) and the minimal clinically important difference is 9–10 points (36) in the subacute stage after stroke. In the current study, good agreement for the FMA-UE total score was reached with lower values for all paired comparisons. This might be accounted for by joint training with experts and identification of problem areas during the translation and cultural adaptation process of the scale prior to data collection (21). The current study aimed to include a representative cohort of patients with stroke who would be candidates for sensorimotor assessment using either the upper or lower extremity FMA.
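The tolerance-based agreement figures reported above (70% agreement within 1 point between raters, and within 3 points within raters) can be illustrated with a short sketch. It assumes that agreement is counted as the share of patients whose two total scores differ by no more than the accepted number of points. The function names and the total scores below are hypothetical and are not data from this study.

```python
def agreement_within(scores_a, scores_b, tolerance):
    """Percentage of paired scores that differ by at most `tolerance` points."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally long, non-empty score lists")
    hits = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return 100.0 * hits / len(scores_a)


def smallest_tolerance(scores_a, scores_b, target_percent=70.0):
    """Smallest accepted point difference that reaches the target agreement."""
    max_diff = max(abs(a - b) for a, b in zip(scores_a, scores_b))
    for tolerance in range(max_diff + 1):
        if agreement_within(scores_a, scores_b, tolerance) >= target_percent:
            return tolerance
    # Reached only if target_percent is above 100.
    raise ValueError("target agreement cannot be reached")


# Hypothetical FMA-UE total scores (0-66) from two raters for ten patients.
rater_1 = [66, 60, 45, 30, 12, 66, 58, 40, 22, 51]
rater_2 = [66, 61, 45, 33, 12, 65, 58, 38, 22, 51]

exact_agreement = agreement_within(rater_1, rater_2, tolerance=0)
needed_tolerance = smallest_tolerance(rater_1, rater_2, target_percent=70.0)
print(exact_agreement, needed_tolerance)
```

In this invented example, exact agreement falls short of 70%, while accepting a difference of one point is enough to pass it. The same calculation applies to intra-rater data by passing one rater's scores from the two test occasions.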
This, however, resulted in approximately 22% of the included patients receiving the maximum score on the FMA-UE. This selection bias might have influenced the study results by reducing the variability of the possible scores in these patients, making it more likely to achieve high intra- and inter-rater reliability. This study included patients from the Central Military Hospital, which is free of cost for those serving in the armed forces. Due to the previous civil war, traumatic injuries have dominated, but stroke is increasing. This explains the relatively low recruitment rate of 3.5/month. However, almost half of the cohort were women, the mean age is representative of middle-income countries, and the cohort varies in sociodemographic background. A possible methodological limitation of the study is that the time between test occasions 1 and 2 was only one day, which might have caused recall bias among the raters. To minimize this bias, a longer time between assessments could have been used, but this would in turn have increased the likelihood of spontaneous recovery occurring. Here, a video-based assessment could have been a possible solution, although scoring from video is not common in clinical practice and the results would only be valid for video-based assessments. In the current study the assessments were done by 3 physiotherapists, which is relevant to clinical practice. In acute clinical settings it is common for several therapists to perform the testing. In these situations continuing training becomes even more important. Standardized testing protocols and training procedures for the FMA-UE are crucial in order to reach sufficient accuracy in scorings and agreement between assessors at different time-points (14, 37). A recent study showed, however, that the FMA-UE was modified in 12 out of 79 studies (11). This heterogeneity limits the ability to pool data from different studies and

J Rehabil Med 51, 2019