Content validity and test–retest reliability with principal component analysis of the translated Malay four-item version of Paffenbarger physical activity questionnaire

Purpose – This study aimed to establish the construct validity of the Malay version of the Paffenbarger physical activity questionnaire (PPAQ) by adapting the original questionnaire to suit the local context.
Design/methodology/approach – The PPAQ was adopted and translated into the Malay language and modified to reach good content agreement among a panel of experts. A total of 65 participants aged 22–55 years old, fluent and literate in the Malay language, were selected. Principal component analysis (PCA) was used to investigate construct validity. Reliability of this adapted instrument was analyzed according to types of variables.
Findings – The panel of experts reached a consensus that the final four items chosen in the adapted Malay version of PPAQ were valid and supported by a good content validity index (CVI). In total, two domains consonant with the operational domain definition were identified by PCA. Based on scores from intensity and duration of exercise, the study further divided the group into those who were physically active and those who chose unstructured physical activity. Relative reliability after a 14-day interval demonstrated moderate strength of agreement with an acceptable range of measurement error.
Research limitations/implications – PPAQ has been used worldwide but was less familiar in the local context. The Malay four-item PPAQ will provide a locally validated version of the physical activity questionnaire. In addition, the authors have improved the original PPAQ by dividing the question items into two distinct domains, which will effectively identify those who are physically active and those who are involved in unplanned exercise. Nevertheless, further research is recommended in bigger and heterogeneous samples along with a number of reliability tests.
Practical implications – To the authors’ knowledge, this is the first study to assess the internal structure of the four-item version of PPAQ. This analysis successfully identified two components with eigenvalues of more than one in the Malay four-item PPAQ. Based on this, the authors were able to separate the population into two groups: physically active and unplanned exercise (involved in unstructured exercise). The ability of the validated questionnaire to divide the population into various intensities of physical activity is a novel one, which may be useful in many public health studies where high intensity of physical activity, and hence greater energy expenditure, is associated with increased longevity, better health benefit and improved cognitive function.
Social implications – In addition, the second domain “unplanned exercise” was successfully grouped together. The implication of the unplanned exercise component is to identify the pool of the population with active lifestyle awareness who choose unstructured exercise instead of vigorous and formal exercising. Even though the intensity and duration of incidental exercise do not reach public health recommendations, it has been shown that a preferred healthier lifestyle is positively associated with better cognition in later life.
Originality/value – The adapted Malay version of PPAQ has sound psychometric properties and could assist in differentiating groups of population based on their physical activity.

© Fazlisham Binti Ghazali, Siti Nurhafizah Saleeza Ramlee, Najib Alwi and Hazuan Hizan. Published in Journal of Health Research. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode
Declaration of conflicting interests: The author(s) declared no potential conflicts of interest concerning the research, authorship and/or publication of this article.
Funding support for this study was provided by a grant from the Cyberjaya University College of Medical Sciences: Grant number CRG /01/03/2018.
Received 27 November 2019; revised 4 March 2020 and 25 March 2020; accepted 31 March 2020. Journal of Health Research, Emerald Publishing Limited. e-ISSN: 2586-940X; p-ISSN: 0857-4421. DOI 10.1108/JHR-11-2019-0269. The current issue and full text archive of this journal is available on Emerald Insight at: https://www.emerald.com/insight/2586-940X.htm


Introduction
Producing an accurate measurement of physical activity is important for detecting important health associations or effects. Moreover, the choice of an appropriate physical activity measurement tool depends upon the application for which it is intended [1]. We aimed to develop a reliable tool for physical activity measurement to be adapted to primary care in the Malaysian setting.
The Paffenbarger physical activity questionnaire (PPAQ) has been developed to suit the changing terms and guidelines for physical health. The PPAQ was developed by Dr. Ralph Seal Paffenbarger to assess physical activity via questionnaires [2]. Since then, it has been extensively tested for its reliability and validity in large population studies. The current format of PPAQ consists of eight questions that measure not only sedentary lifestyle but also energy expenditure through a physical activity index [3]. A recent study showed that PPAQ is more adept at capturing vigorous activity as it uses more descriptive terms and proper physiological definitions of physical activity intensity [4].
This study aimed to translate and validate the PPAQ which has been used in the Common Cold Project [5] to provide a reliable questionnaire to measure the level of physical activity adapted to the local primary care setting.

Study design
To validate the questionnaire, a cross-sectional study was conducted in selected private hospitals in the area of Hulu Langat, Selangor, Malaysia. A total of 65 participants who were staff at the respective hospitals and were literate and fluent in the Malay language were selected using convenience sampling. Subjects were instructed to answer the modified Malay version of PPAQ, which took about 15-20 minutes to complete. All recruited participants gave consent prior to completing the questionnaire on two occasions, 14 days apart.

Ethical consideration
This study was approved by the Ethics Committee of the Cyberjaya University College of Medical Sciences (Reference number CUCMS/CRERC/FR/023). Permission to carry out the research was granted by the General Manager and Chief Executive Officer of the respective hospitals.

Sample size
The sample size calculation for this study was based on the formula suggested by Viechtbauer [6] for studies of a similar nature:
$$n = \frac{\ln(1 - \gamma)}{\ln(1 - \pi)}$$
where $\gamma$ is the desired level of confidence and $\pi$ is the probability that a problem occurs in a given participant.
It was anticipated that problems that might occur would be minor, such as nonresponses or item misinterpretation. Hence, it was decided that, if such difficulties presented themselves with at least π = 0.05 probability (i.e. in at least 1 out of 20 participants), it would be good to detect this problem during the validation process. Accordingly, from the above equation, 60 participants needed to be screened to achieve 95% confidence that one or more such problem cases would be encountered.
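The calculation can be checked numerically. The sketch below assumes the pilot-study formula n = ln(1 − γ) / ln(1 − π), with γ the confidence level and π the per-participant problem probability; the function name `pilot_sample_size` is illustrative only. Under these assumptions the result rounds up to 59, which is in line with the roughly 60 participants quoted above.

```python
import math


def pilot_sample_size(pi: float, confidence: float = 0.95) -> int:
    """Smallest n for which a problem occurring with probability `pi` per
    participant is seen at least once with the requested confidence."""
    if not 0.0 < pi < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("pi and confidence must lie strictly between 0 and 1")
    # P(no problem among n participants) = (1 - pi)**n must not exceed 1 - confidence
    n = math.log(1.0 - confidence) / math.log(1.0 - pi)
    return math.ceil(n)


# Values taken from the text: pi = 0.05, 95% confidence
n_required = pilot_sample_size(pi=0.05, confidence=0.95)
print(n_required)
```

Recruiting 65 participants, as in this study, therefore exceeds the minimum implied by the formula.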
2.4 Measure and procedure
2.4.1 Instrument modification and operational domain definition. It was imperative that the translated version of the measurement was clear to respondents, and that they perceived the same meaning as the researchers intended. Therefore, in this adapted questionnaire, the content was developed and forward translated to Malay through an expert review.
2.4.1.1 Question 1. In this section, participants were asked if they engaged in any REGULAR physical activity that was long enough to work up a sweat. If the answer was yes, the next question requested them to detail the number of times per week. Physical activity was defined as any bodily movement produced by skeletal muscle that required energy. Translated into the Malay language this was "Aktiviti fizikal didefinasikan sebagai pergerakan badan yang memerlukan tenaga". The term skeletal muscle, translated as "otot rangka" in Malay, was removed because this medical term is rarely used among the nonmedical Malay population.
The word REGULAR in the original Paffenbarger questionnaire was replaced with the more specific term "engaged at least once a week," which was translated into "sekurang-kurangnya sekali seminggu" in Malay. By establishing consistency and frequency in a week, it would be possible to distinguish the physically active from the more sedentary.
Sweating is commonly associated with physical endurance with a significant linear relationship between sweat excretion and physical intensity [7]. Sweating sooner or more profusely has been a good indicator of physical activity intensity [8]. A question to assess the physical activity that induces sweating would identify those who are physically fit.
2.4.1.2 Questions 2 and 3. These questions assessed the subject's lifestyle by identifying how many stairs they climbed up each day and the distances they walked on average. Sperandio's study showed that walking less than 500 meters per week was the best predictor of physical inactivity [9] as it provided a better research metric for epidemiologic research and better public health targets than walking duration [10].
On the other hand, climbing stairs, an underrated exercise, has been proven to benefit an individual's health [11, 12] and predicted longevity [13] as well as lowering blood pressure and improving fitness [14]. There is no universal consensus about the ideal number of stairs, but 8900-9900 steps per week are recommended [15].
2.4.1.3 Question 4.
(1) Seven-day recall
This question was about sports or recreational activity in the past week. The seven-day recall contrasted with the original Paffenbarger physical activity question requesting details of such activity in the past year. Due to limitations in human memory, it was deemed best to keep the reporting interval relatively short. Kjellsson's experiment showed that the overall level of recall error increased with the length of the recall period [16]. Masse and de Niet's literature reviews showed that seven-day Paffenbarger physical activity questionnaire recall can be validly ranked to identify those who are physically active and is sensitive enough to detect changes in physical activity behavior [17]. Therefore, this modified questionnaire required only a short-term recall of seven days.
(2) Restriction to sports and recreational activities
In this study, the specific activity of any sports or recreational activities was used as a heading under question 4.
2.4.1.4 Questions 5 and 6. The effectiveness of public health campaigns depends on people knowing the intensity, duration and frequency of the physical activity they perform [18]. The WHO suggested adults aged 18-64 should do at least 150 minutes of physical activity at moderate intensity or 75 minutes at vigorous intensity throughout the week to achieve the desired health outcomes. Also, for reasons of practicality, raw data of all components of this complex behavior, which include the type (intensity), duration and frequency of physical activity, were converted into energy expenditure, i.e. metabolic equivalents of task (METs). Therefore, under questions 5 and 6, we asked about the frequency and duration of physical activity performed.
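As a rough illustration of how frequency and duration answers can be compared with the WHO thresholds quoted above, the sketch below checks a week of self-reported activity. It is not the scoring procedure used in this study: the function name is hypothetical, and counting one vigorous minute as two moderate minutes is an assumption made here to combine the 150-minute and 75-minute thresholds.

```python
def meets_who_guideline(moderate_min_per_week: float,
                        vigorous_min_per_week: float) -> bool:
    """Return True if weekly activity reaches 150 moderate-equivalent minutes.

    Assumption: one minute of vigorous activity counts as two minutes of
    moderate activity, so 75 vigorous minutes equal 150 moderate minutes.
    """
    if moderate_min_per_week < 0 or vigorous_min_per_week < 0:
        raise ValueError("minutes must be non-negative")
    equivalent = moderate_min_per_week + 2.0 * vigorous_min_per_week
    return equivalent >= 150.0


# Hypothetical respondent: 3 sessions of 30 minutes at moderate intensity
# plus 1 session of 30 minutes at vigorous intensity in the past week
sessions_moderate = 3 * 30
sessions_vigorous = 1 * 30
print(meets_who_guideline(sessions_moderate, sessions_vigorous))
```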

Translation and back translation.
It is imperative that the translated version of the measurement was well understood by respondents, and that they perceived the same meaning as the researchers intended. Hence, the modified PPAQ was translated into the Malay language by a sports scientist (HH) who was well versed in both the Malay and English languages. The questionnaire was then back translated into English by an independent professional translator. Another independent professional translator reviewed the back-translated version against the original PPAQ and concluded that no further modification was necessary.
An accredited professional translator then checked the Malay translated version to ensure the terms used were correct and culturally appropriate. The final Malay version was harmonized for any language errors by all the experts until an acceptable translation was developed.

Content validity.
A total of four professional bilingual senior sports science lecturers with over five years' research study in the English language medium were requested to determine if the items fully and sufficiently represented the targeted domain.
All four specialists were initially contacted by email and phone. They were provided with a formal invitation letter from Cyberjaya University, including details of the research and instructions. Attached to this was a set of questionnaires in the Malay language with an empty box for them to score each domain on a Likert scale.
The four experts rated the content validity of each test in relation to the five tasks in the rating protocol. The scale was scored as follows: 1 = not relevant; 2 = somewhat relevant; 3 = quite relevant and 4 = highly relevant. Grades 3 and 4 were considered acceptable. Apart from assessing the content, the four experts were invited to comment in more detail in boxes on the side of each question.
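The rating scale above lends itself to an item-level content validity index, i.e. the share of experts who rate an item 3 or 4. The sketch below is a minimal illustration; the ratings shown are hypothetical and are not the panel's actual scores.

```python
def item_cvi(ratings, relevant_from=3):
    """Item-level CVI: proportion of experts rating the item as relevant
    (3 = quite relevant or 4 = highly relevant on the 4-point scale)."""
    if not ratings:
        raise ValueError("at least one expert rating is required")
    relevant = sum(1 for rating in ratings if rating >= relevant_from)
    return relevant / len(ratings)


# Hypothetical ratings from four experts for two items
full_agreement = item_cvi([4, 3, 4, 4])     # every expert rates 3 or 4
partial_agreement = item_cvi([4, 2, 3, 4])  # one expert rates below 3
print(full_agreement, partial_agreement)
```

An item reaches an I-CVI of 1 only when every expert on the panel rates it 3 or 4.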
2.4.4 Subjects' understanding of the modified questionnaire (cognitive interview). The final Malay version was pretested on ten respondents randomly picked from the public who fulfilled the criteria of being fair-minded and literate in Malay. They were aged between 20 and 40 years old with an equal mix of genders. The objective was to identify any wording or grammatical errors that might affect the comprehension of the respondents. This also included an examination of respondents' cognitive ability to recall the information, an assessment of whether the format and wording elicited appropriate responses, and whether respondents gave socially desirable answers.
Subjects were instructed to share their thoughts about each question and to describe their thought processes before answering each question. Participants were also invited to suggest alternative wording or sentences if they wished. In this session, the examiner read out the questions, and the subjects answered with minimal interference from the examiner.
At the end of the session, participants were requested to provide more feedback about the length of questions and their clarity. All ten participants agreed that the questions were reasonable, and they were able to recall events pertinent to the questions asked.
2.4.5 Test-retest reliability. Participants were informed that they were required to complete the questionnaire twice, 14 days apart. The researcher was present during the completion period to assist participants if required. All 65 volunteers completed the test-retest assessment.

Sociodemographic data
A total of 65 respondents aged 22 to 55 years took part, with a mean age of 29.49 years and a standard deviation (SD) of 5.54 years. Females and Malays were dominant in gender and ethnicity. Most participants had studied beyond secondary school, with 47.9% studying beyond further education level for a further 3.5 years.

Statistical analysis
In the content validity tests, the content validity index (CVI) was used to analyze agreement between the four experts judging the relevance of the question items. For the construct validity test, principal component analysis (PCA) was done using SPSS version 19. For the reliability test, the analysis was divided into two parts, i.e. analysis of continuous data and of categorical data. For continuous data, the intraclass correlation coefficient (ICC), paired t-test and Bland-Altman plot were used to examine the agreement between the two tests at two different times. In addition, the standard error of measurement (SEM), minimal detectable change (MDC) and minimal important difference (MID) were used to demonstrate the absolute reliability of the questionnaire. Agreement between categorical data was assessed using weighted kappa.
3.2.1 Validity test.
3.2.1.1 Content validity index. A panel of four experts reached a consensus that the final items in all six questions were valid to be used. An item-level CVI (I-CVI) was computed by dividing the number of experts giving a rating of 3 or 4 (relevant) by the total number of experts; all items achieved an I-CVI of 1, as presented in Table 1.
3.2.1.2 Construct validity. Construct validity was assessed using confirmatory and exploratory factor analysis, with a factor loading of 0.4 or more considered good.
(1) KMO and Bartlett's test of sphericity
Bartlett's test of sphericity resulted in 0.707, which reached statistical significance, supporting the factorability of the correlation matrix [19]. The null hypothesis could be rejected, and the alternative hypothesis that there may be a statistically significant interrelationship between variables was accepted. Hence, factor analysis was considered an appropriate technique for further analysis of the data.
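For readers who want to reproduce these two factorability checks outside SPSS, the sketch below computes Bartlett's chi-square statistic and the overall KMO measure from a correlation matrix with NumPy. The 4 × 4 matrix is a hypothetical example and not the study's data; the p-value would be obtained by comparing the statistic with a chi-square distribution on the returned degrees of freedom.

```python
import numpy as np


def bartlett_sphericity(corr, n_obs):
    """Bartlett's test of sphericity: chi-square statistic and degrees of freedom."""
    corr = np.asarray(corr, dtype=float)
    p = corr.shape[0]
    # slogdet avoids underflow when the determinant is close to zero
    _, log_det = np.linalg.slogdet(corr)
    chi_square = -((n_obs - 1) - (2 * p + 5) / 6.0) * log_det
    dof = p * (p - 1) // 2
    return chi_square, dof


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.asarray(corr, dtype=float)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)  # partial correlations from the inverse matrix
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + q2)


# Hypothetical correlation matrix for four items (illustrative only)
example_corr = [
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.3, 1.0],
]
chi_square, dof = bartlett_sphericity(example_corr, n_obs=65)
kmo_value = kmo(example_corr)
print(round(chi_square, 2), dof, round(kmo_value, 3))
```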
(2) Confirmatory and exploratory factor analysis
From this analysis (Table 2), two components were identified with eigenvalues of more than 1.0, suggesting that dividing the questionnaire into two components was most appropriate.
In further analysis, orthogonal rotation (varimax) was used to delineate the two components further, on the assumption that what was explained by one factor was independent of information from other factors. Factor rotation made the components easier to interpret.
The rotated component matrix sorted the six variables into two overlapping groups, each with a factor loading of 0.4 or more. There were blanks in the matrix where weights fell below the cutoff (Table 3). The factor column represented the rotated factors that were extracted out of the total factors. These are the core factors, which will be used as the final factors after data reduction.
The first component suggests that the mode of intensity and duration are highly correlated with each other, which explains about 51% of the variability in the performance of this physical activity questionnaire. The second component, consisting of the number of stairs climbed and walking distance per day, explained 22% of the variance from PCA, as presented in Table 2. Surprisingly, the second component successfully grouped together the two question items on stairs climbed per day and walking distance per day, with a higher factor loading.
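The eigenvalue-greater-than-one rule and the proportion of variance explained can be illustrated with a short NumPy sketch. The correlation matrix below is hypothetical and chosen only so that two components emerge; it does not reproduce the 51% and 22% reported above, and varimax rotation is not included.

```python
import numpy as np


def pca_eigen_summary(corr):
    """Eigenvalues (descending), variance explained and number of components
    retained under the eigenvalue > 1 rule, from a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    explained = eigenvalues / eigenvalues.sum()   # the sum equals the number of items
    n_retained = int(np.sum(eigenvalues > 1.0))
    return eigenvalues, explained, n_retained


# Hypothetical correlation matrix for four items (illustrative only)
example_corr = [
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.3, 1.0],
]
eigenvalues, explained, n_retained = pca_eigen_summary(example_corr)
print(np.round(eigenvalues, 2), np.round(explained, 3), n_retained)
```

In this example the two retained components together account for 72.5% of the total variance.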
(3) Internal consistency test
Cronbach's alpha was used to measure the internal consistency of the scale. From the factor analysis calculated earlier, two components were extracted from this scale (Tables 2 and 4).
In this analysis, all items in component 1 had a moderately high corrected item-scale correlation. On the other hand, there was no correlation at all between climbing stairs and walking distances in the second component, which is expected considering that the two questions were not considered related to each other. The final Malay version PPAQ kept all the questions on the grounds that it makes clinical sense to retain them in their respective components.
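A minimal sketch of Cronbach's alpha is given below, using only the Python standard library. The scores are hypothetical; the second example shows that two uncorrelated items give an alpha of zero, which is the situation described for the second component.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of respondent scores
    in the same respondent order."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical scores for four respondents
alpha_identical = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])     # perfectly related items
alpha_uncorrelated = cronbach_alpha([[1, 2, 1, 2], [1, 1, 2, 2]])  # unrelated items
print(round(alpha_identical, 3), round(alpha_uncorrelated, 3))
```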
3.2.2 Reliability test. An agreement between continuous data, i.e. walking distances per day and stairs climbed per day of Malay version PPAQ at two different times of measurement were analyzed using ICC (two-way random effects, absolute agreement and single rater) for relative reliability, paired t-test and Bland-Altman diagram for systemic bias. Furthermore, the Bland-Altman plot was useful to provide the limit of agreement and to detect outliers possibly caused by errors of measurement [20]. SEM, MDC and MID were used to estimate minimal scores that are not due to error [21]. In contrast, categorical data in this reliability test were examined by using weighted kappa, which is more helpful to provide strength of agreement between two measures.  (

(1) Paired t-test
There were no significant differences in means over the 14-day interval, with both p-values > 0.05; hence the null hypothesis that there was no statistically significant difference between the two tests was not rejected.
(2) Bland-Altman plot
Potential error of measurement was further analyzed using the Bland-Altman diagram, which addresses whether there is any systematic difference between two sets of measurements and identifies possible outliers (see Figure 1). Each sample was represented on the graph by the mean value of the two assessments (x-axis) and the difference between the two assessments (y-axis). The mean difference was the estimated bias, and the SD of the differences measured the fluctuations around this mean (outliers being beyond ±1.96 SD of the differences).
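The quantities behind the paired t-test and the Bland-Altman diagram all derive from the per-participant differences, as the following standard-library sketch shows. The data are hypothetical; the limits of agreement are taken as bias ± 1.96 SD, and the t statistic would be compared with a t distribution on n − 1 degrees of freedom to obtain a p-value.

```python
import math
from statistics import mean, stdev


def bland_altman(test, retest):
    """Bias, limits of agreement, paired t statistic and outlier indices."""
    diffs = [b - a for a, b in zip(test, retest)]
    means = [(a + b) / 2 for a, b in zip(test, retest)]  # x-axis of the plot
    bias = mean(diffs)
    sd = stdev(diffs)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    t_stat = bias / (sd / math.sqrt(len(diffs)))
    outliers = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return {"means": means, "diffs": diffs, "bias": bias,
            "limits": (lower, upper), "t": t_stat, "outliers": outliers}


# Hypothetical test and retest scores for five participants
result = bland_altman([10, 12, 9, 15, 11], [11, 12, 10, 14, 12])
print(round(result["bias"], 2), [round(v, 2) for v in result["limits"]],
      round(result["t"], 2), result["outliers"])
```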
(1) Standard error of measurement (SEM) and minimal detectable change (MDC90)
The findings demonstrate that although test-retest reliability (relative reliability) for the clinical tests was excellent, there was still a substantial degree of variability of performance for individual participants from one test session to the next (absolute reliability). The SEM and MDC90 were calculated to objectify these findings. SEM was calculated based on the formula:
$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - \mathrm{ICC}}$$
SEM was based on the assumption of normal distribution, so probabilities of the normal curve could be applied to SEM values. There is a 68% probability that repeated responses for climbing stairs and walking distances will be within ±37.7 and ±572.6 of the mean score on the first day of assessment, respectively. Thus, a 96% probability that  Table 3. Both MDC90 values for climbing stairs and walking distances were out of range from changes of means across the two time points. The overlap of the test-retest scores with the interval of the MDC90 value indicated that the changes were likely due to random measurement error.
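A brief numerical sketch of these two indices follows. It assumes the conventional definitions SEM = SD × √(1 − ICC) and MDC90 = 1.645 × √2 × SEM; the SD and ICC values are hypothetical and are not those of this study.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from the sample SD and the ICC."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def mdc90(sem: float) -> float:
    """Minimal detectable change at 90% confidence.

    1.645 is the two-sided 90% z value; sqrt(2) accounts for measurement
    error being present in both the test and the retest score.
    """
    return 1.645 * math.sqrt(2.0) * sem


# Hypothetical inputs: SD = 100 units, ICC = 0.75
sem_value = sem_from_icc(sd=100.0, icc=0.75)
mdc_value = mdc90(sem_value)
print(round(sem_value, 1), round(mdc_value, 1))
```

A change between two sessions that is smaller than the MDC90 cannot be distinguished from random measurement error at this confidence level.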
(2) Minimal important difference (MID)
This study is the first to determine the measurement error of PPAQ, which is an indication of the accuracy of the measurement instrument. COSMIN guidelines proposed that the interpretation of SEM should be based on the value of the MID [22]. However, the true purpose of MID, which is to represent the smallest change in score that is considered a relevant outcome, was not utilized here. Instead, MID was used to assess statistical reliability, i.e. measurement errors relying on other statistical measures like SD, SEM and effect size [23].
Note: 1SEM: 1 × standard error of measurement; SD: standard deviation; ICC: intraclass correlation coefficient.
3.2.2.4 Determine reliability of categorical data.
(1) Weighted kappa
The observed percentage of agreement is the proportion of ratings where the raters agree, and the expected percentage is the proportion of agreements expected to occur by chance as a result of the raters scoring randomly. Hence, kappa is the proportion of agreements observed between raters after adjusting for the proportion of agreements that takes place by chance [24], using the formula
$$\kappa = \frac{P_o - P_c}{1 - P_c}$$
where $P_o$ = observed agreement and $P_c$ = proportion of agreement by chance.
We were able to generate values of kappa as shown in Tables 5-7. Many scholars agreed that it is important to retain the hierarchical nature of the categories.
Therefore, for further analysis of the ordinal data, weighted kappa was used to reflect the seriousness of disagreements, as shown in Table 5. In this analysis, quadratic weighting was preferred over linear because the variation coefficients of the former increase with the number of categories, making it a more desirable weighting scheme given the hierarchical nature of the categories.
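A minimal sketch of weighted kappa for two administrations of an ordinal item is shown below. It assumes categories coded 0 to k − 1 and disagreement weights of (|i − j| / (k − 1)) raised to the power 2 for quadratic or 1 for linear weighting; the response vectors are hypothetical.

```python
def weighted_kappa(first, second, n_categories, weighting="quadratic"):
    """Weighted kappa for two sets of ordinal ratings coded 0..n_categories-1."""
    if len(first) != len(second) or not first:
        raise ValueError("both rating lists must be non-empty and of equal length")
    k, n = n_categories, len(first)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    power = 2 if weighting == "quadratic" else 1
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical test and retest categories
kappa_perfect = weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], n_categories=3)
kappa_chance = weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], n_categories=2)
print(kappa_perfect, kappa_chance)
```

With quadratic weights, a disagreement two categories apart is penalized four times as heavily as a disagreement between adjacent categories.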

Discussion
This study aimed to translate PPAQ into the Malay language. The Malay version PPAQ had good test-retest reliability and internal structure. The panel of experts reached a consensus that the final items in both domains were valid to be used, with the item CVI reaching total mutual agreement.
To our knowledge, this is the first study to assess the internal structure of PPAQ. Our analysis successfully identified two components with eigenvalues of more than one in the Malay version PPAQ. The ability of the validated questionnaire to divide the population into various intensities of physical activity is a novel one, which may be useful in many public health studies where high intensity of physical activity, and hence greater energy expenditure, is associated with increased longevity, better health benefit and improved cognitive function.
In addition, the second domain "unplanned exercise" was successfully extracted with Q2 and Q3 grouped under principal component analysis.
Analysis of measurement errors in this study was divided into two parts according to the type of variables. For continuous data, which form the unplanned exercise component, we found that the self-reported Malay version PPAQ has fair relative reliability over a 14-day interval.

Limitations of this study
The Malay version of PPAQ will provide a locally validated version of the physical activity questionnaire. Future studies in bigger and heterogeneous samples along with more reliability tests are encouraged to evaluate the validity of this instrument against more objective measures, for example an accelerometer, as in this study we only measured the reliability and content validity of the translated version of PPAQ. These future studies are particularly important in view of the limitations of subjective measurement in accurately identifying those who need further recommendations for health activity.

Table 6. Distribution-based estimates of the minimal important difference (MID)
Table 7. Proportions of agreement of physical activity index scores, physical intensity, frequency and duration

Conclusion
The PPAQ instrument has been used worldwide but is less familiar in the local region of Malaysia. The lack of a translated version and psychometric analysis makes this study imperative as a starting point for further research. Our statistical analysis successfully identified and delineated two major components in accordance with our operational domain definition with fair internal consistency. Hence, the six items were compressed into a four-item questionnaire. Further research is recommended in bigger and heterogeneous samples along with more reliability tests.