Heewon Yang, CTRS, Ph.D.
Leisure Studies, School of Exercise, Leisure, & Sport
Kent State University
P.O. Box 5190-0001, Kent OH 44242
(330) 672-0218 (O), 672-4160 (Fax).
The purpose of this study was to establish the reliability of the Smiley Face Assessment Scale administered at Camp Koinonia, an outdoor camping program for children with multiple disabilities in eastern Tennessee (the Knoxville area). The subjects for this study were 100 campers attending the camp during April 6-11, 1997. The Smiley Face Assessment Scale, which is a self-report instrument with a pictorial response system (five faces), was used to measure the degree of the campers' satisfaction with their camp experiences. The scale was completed by each camper twice (test-retest) with the help of his or her counselor. Pearson's correlation coefficient was used to examine the relationships between the results of the whole group of campers, each cabin group, male and female campers, and anonymous and non-anonymous responses on the two consecutive days. Most of the data provided in this study supported the hypothesis that the Smiley Face Assessment Scale is a reliable instrument to use with the participants of Camp Koinonia (16 of the 17 correlations in this study were significant at the p < .01 level).
Key Words: Smiley Face Assessment Scale, reliability, Likert-type scale, correlation coefficient.
The Smiley Face Assessment Scale (SFAS) is one of a number of attitude assessment scales used primarily to measure the affective domain of children. The scale is a Likert-type self-report assessment instrument with a pictorial response system (Henerson, Morris, & Fitz-Gibbon, 1987). The SFAS has been adopted and used for years by the staff at Camp Koinonia, which is a one-week camp program for children with multiple disabilities in eastern Tennessee (the Knoxville area), to determine the outcome of the individual camper's camp experience. Few studies have provided adequate data on the reliability and validity of self-report evaluations completed by people with disabilities. However, some researchers have reported that even people with mental retardation have the ability to make valid and reliable responses to self-report instruments (Dattilo, Hoge, & Malley, 1996).
Without the assurance of reliable and valid test results, the results of an evaluation are dubious. Reliable results of an evaluation and its documentation provide valuable information for decisions and revisions in the program. They also provide a means of accountability in the programs and a better understanding of how such programs improve a participant's quality of life. In addition, they can be the basis for the recruitment of volunteers, community awareness, and the determination of whether program goals and participant learning objectives are being met (Hayes & Brown, 1997).
Reliability can be the hallmark of a good measuring instrument, and measurement precision is a crucial prerequisite for a meaningful evaluation. Included in the following sections are the background of the SFAS, factors related to reliability coefficients, and methods of establishing reliability.
Background of the Smiley Face Assessment Scale
The development of attitude scales was due primarily to the work of Thurstone, Likert, and Guttman (Aiken, 1976). The Likert scale developed by R. Likert is a method of obtaining information pertinent to affective variables. Each item is followed by a five-response continuum, with answers such as strongly agree, agree, undecided, disagree, and strongly disagree. Hopkins and Stanley (1981) stated that Likert scales are very flexible and can be constructed more easily than most other types of attitude scales. They further suggested that pictorial response scales are sometimes more effective for assessing attitudes, especially for children, and they illustrated a seven-point rating scale that uses a pictorial response continuum.
Henerson, Morris, and Fitz-Gibbon (1987) noted that the major problems hindering the assessment of children's attitudes are that they have short attention spans, an inability to understand questions, and difficulty keeping their places. To help children understand what is expected of them, response options are sometimes presented in picture form; four- or five-face response sets are typical examples. Similarly, Thomas and Nelson (1996) observed that a task made fun and enjoyable is more likely to elicit a consistent performance, which can be accomplished through the use of cartoon figures, encouragement, and rewards.
Strengths and Weaknesses of the SFAS
Since the SFAS is an application of the Likert scale, its strengths and weaknesses can be examined in light of those of Likert-type scales in general. Some of the strengths of Likert-type scales are as follows. Likert-type scales may save time compared to interviews and other inventories, and subjects can be reached through the use of mailed questionnaires or in-class questionnaires in the schools (Wrightstone, 1956). Aiken (1976) stated that Likert-type scales do not require the use of experts, are easier to use, and exhibit good reliability. Bell, McDonell, and Winter (1992) also described the strengths of the Likert-type scale as follows: (a) it forces a participant to give a clear positive or negative answer, and (b) it produces items suitable for rapid response and analysis. Additionally, Cunningham, Farrow, Davies, and Lincoln (1995) reported that a Likert-type self-report had good reliability and content validity. With the exception of the extremes on the depressive spectrum (i.e., atypical, neurotic, or hysteroid forms), self-rating scales have been shown to have a high degree of reliability (Bell et al., 1992).
Wiig (1995) supported the importance of self-assessments. He argued that if one of the clinical-educational objectives is for a student to take charge of himself or herself, it is important to listen to the student's self-description. He further stated that the importance of allowing the student to assess himself or herself cannot be overemphasized. Moreover, Henerson et al. (1987) maintained that children have short attention spans and sometimes do not understand the questions; an instrument like the SFAS is therefore needed to help children understand the questions and what is expected of them when they answer.
There are weaknesses, however, in Likert-type scales. Hopkins and Stanley (1981) discussed four kinds of problems in affective measurement: (a) fakability, (b) self-deception, (c) semantic difficulties, and (d) criterion inadequacy. Self-report attitude rating scales also tend to be more sensitive to the subjective distress of the client. Furthermore, the use of a self-rating scale may be limited by many variables, such as socioeconomic, educational, cultural, and linguistic variables, as well as by the type of disorder the client has (Suttle, 1985).
Henerson et al. (1987) stated that questionnaires and attitude-rating scales do not provide the flexibility of interviews, and that if the questions are interpreted differently by respondents, the validity of the information is jeopardized. Moreover, although Likert-type scales are quantifiable, subtle nuances and shades of meaning apparent in the rich data provided by field interviews may be lost (Russon & Koehly, 1995).
There is also another problem. Many children will answer the way they think others expect them to respond, the so-called "social-desirability response" (Suttle, 1985). According to Sullivan (1995), the results of attitude-rating scales can also be influenced by variables such as threats to the status of the respondents, and the respondents' impulse to respond in a socially desirable way may partially misrepresent their true feelings.
Factors Related to Reliability Coefficients
Wrightstone (1956) indicated some possible sources of variation in psychological measurements. They are: (a) the actual differences in abilities and skills among individuals as they relate to the psychological characteristic being assessed or measured; (b) the differences in the ability to take a specific test (i.e., the ability to comprehend directions or the effects of taking previous tests); (c) the differences associated with chance factors (i.e., fluctuations in performance, memory, or reasoning, as well as the fortunate selection of answers by guessing, and an individual’s unique knowledge of particular facts); (d) the temporary nature of such things as health, energy, fatigue, motivation, and emotional tension; and (e) external conditions such as heat, light, ventilation, noise, broken pencils, and interference.
Hopkins and Stanley (1981) proposed some additional factors that affect reliability: (a) test sophistication (a general know-how in test taking) can affect test performance; (b) practice (the examinee's previous performance on the test or its equivalent); (c) coaching; (d) anxiety and motivation; (e) acquiescence (a tendency to choose a positive answer); (f) a tendency to select non-technical options; (g) the qualities of the examiner (e.g., his or her temperament and gender); (h) the nature of the answer sheet (e.g., whether it is a test booklet or a separate answer sheet); (i) the time of day of the test (e.g., 10 A.M. or 9 P.M.); and (j) guessing and cheating.
Other factors Henerson et al. (1987) pointed out have to do with variations in the conditions of the administration of tests from one test to the next, ranging from distractions such as unusual outside noise to inconsistencies in the administration of the instrument such as oversights in giving directions. Differences in the methods of scoring or interpretation of the results were also viewed as possible influences on the reliability coefficient.
Types of Reliability-Estimating Methods
Test-retest (Determining stability). In general, there are four major types of reliability coefficients: (a) test-retest reliability, (b) alternate-form reliability, (c) split-half reliability, and (d) inter-rater reliability (Henerson et al., 1987). In the test-retest method, the same test is administered to the same individuals after an intervening period of time, and the correlation between the scores on the two tests is computed in order to obtain the coefficient (Wrightstone, 1956). This procedure takes into account measurement error caused by different administration times. However, since the same test is administered on both occasions, the error due to different samples of test items is not reflected in a test-retest coefficient (Aiken, 1976).
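For reference (this is the standard definition rather than anything specific to the sources above), the test-retest coefficient is simply the Pearson product-moment correlation between the first-administration scores x and the second-administration scores y:

\[
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

where x_i and y_i are examinee i's scores on the two administrations and n is the number of examinees.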
Test-retest coefficients are usually higher than alternate-form reliability coefficients because the latter permit a fresh sample of items from the same content universe (Hopkins & Stanley, 1981). Henerson et al. (1987) noted one problem with test-retest reliability: when a test is re-administered to the same group within a few days or weeks of the first administration, it is unclear how much of what the examinees remember from the first administration carries over to the second. Also, a long period of time between administrations of a test raises the possibility that the true skills and attitudes of the examinees will have changed. Accordingly, Thomas and Nelson (1996) cautioned that the interval between the tests cannot be so long that actual changes in ability, maturation, or learning will have taken place.
Constructing alternate forms. The preferred method of determining a reliability coefficient is to administer to the same individuals two parallel (equivalent) forms of the test, that is, forms that contain different questions but can reasonably be assumed to be equivalent. If the forms are not equivalent in difficulty and content, the correlation coefficient will not provide a true estimate of the reliability of the test (Wrightstone, 1956). Unlike the test-retest procedure, this method takes into account the variance of error introduced by using different samples of items (Aiken, 1976). Thomas and Nelson (1996), however, indicated that this method is rarely used with physical performance tests because it is difficult to construct two different sets of good physical test items.
Obtaining internal consistency. In general, two methods are used to estimate internal consistency: the split-half method and the Kuder-Richardson method. In both of these methods, the same general traits, abilities, or characteristics of a homogeneous nature should be measured, and the test should not be speeded. In the split-half method, the coefficient will be seriously underestimated if the halves of the test are not closely equivalent in difficulty and content. The split-half method is best used with instruments that have many items, ones that can be paired. For example, an instrument with homogeneous items, such as a general vocabulary test, is well suited to a split-half estimate.
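One step is worth making explicit here (a standard correction, stated for completeness rather than drawn from the sources above): the correlation between the two halves estimates the reliability of a test only half as long as the full instrument, so it is customarily stepped up with the Spearman-Brown formula,

\[
r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}}
\]

where r_half is the correlation between the scores on the two halves.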
The Kuder-Richardson method can be used for items scored dichotomously (i.e., right or wrong) and represents an average of all possible split-half reliability coefficients. Only one test administration is required, and no correlation between separate administrations is calculated for the Kuder-Richardson method.
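For completeness, the standard Kuder-Richardson formula 20 (KR-20) for a test of k dichotomous items is

\[
KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_t^{2}}\right)
\]

where p_i is the proportion of examinees answering item i correctly, q_i = 1 - p_i, and sigma_t^2 is the variance of the total scores.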
Thomas and Nelson (1996) added two more methods: the same-day test-retest and coefficient alpha (Cronbach's alpha). A test-retest on the same day results in a higher reliability coefficient than does a test-retest on separate days. Coefficient alpha is used with items that have various point values, such as essay tests and scales with responses of "strongly agree," "agree," and so on. This method is the one most commonly used in estimating the reliability of standardized tests.
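To make the computation concrete, the following is a minimal sketch of coefficient alpha in Python, using synthetic data shaped like the SFAS (14 items scored 1 to 5); the function name and all data below are illustrative assumptions, not material from the study.

    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        # scores: an (n_respondents x k_items) array of item scores.
        # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
        k = scores.shape[1]
        item_variances = scores.var(axis=0, ddof=1)      # variance of each item
        total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Illustrative only: 30 synthetic respondents answering 14 items on a 1-5 scale.
    rng = np.random.default_rng(seed=1)
    scores = rng.integers(1, 6, size=(30, 14))
    print(f"alpha = {cronbach_alpha(scores):.2f}")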
Inter-tester reliability. The test environment and the behavior of the person being tested may vary from time to time. The perceptions of the person doing the reporting may also fluctuate. The best way to demonstrate that one's work has been minimally contaminated by inconsistency in the "human instrument" is to use more than one person to do at least a portion of the interviews or observations. In addition, extensive training and careful instruction of the people conducting the test or responding to the reports-of-others questionnaire are necessary (Henerson et al., 1987). Thomas and Nelson (1996) indicated that it is possible to assess a number of sources of variance in one analysis, that is, variance caused by testers, trials, days, subjects, and error.
The test-retest method was used in the present study to determine the reliability of the Smiley Face Assessment Scale. The Pearson product-moment correlation coefficient was used to compute correlations between the results produced by all the camp participants, by male and female campers, by each cabin group, and under anonymous and non-anonymous conditions on the two consecutive days.
The levels of impairment of the 100 campers, by gender, were as follows:

    Level of Impairment    Slight    Moderate    Severe
    Male (n = 52)               8          28        16
    Female (n = 48)             9          30         9
The response rate for the 100 subjects for each evaluation was 100%. However, among the returned evaluation forms, 10 pairs of responses were excluded from the data analysis because they were either incomplete or unreadable. As a result, the scores of 90 campers (male, n = 44; female, n = 46) were analyzed.
For those who were unable to read or understand the questionnaire, a counselor's help was provided. The participants were asked to indicate which face best represented how they felt about the questions. Five responses expressing the participant's feelings (boring, okay, good, great, and terrific) were provided (see an example of the SFAS in Figure 1). Each response was transformed into a numerical score, from boring = 1 to terrific = 5. Across the 14 items, the maximum possible score was 70 and the minimum score was 14. The reliability and validity of this scale have not previously been reported. According to Hopkins and Stanley (1981), test-retest and split-half reliability estimates of .80 or above are common for a Likert scale.
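As a concrete illustration of the scoring rule just described (the descriptor-to-number mapping is from the text; the code itself is only a sketch):

    # Map each SFAS descriptor to its numerical score, then sum the 14 items.
    SCORE = {"boring": 1, "okay": 2, "good": 3, "great": 4, "terrific": 5}

    def total_score(responses):
        assert len(responses) == 14, "the SFAS has 14 items"
        return sum(SCORE[r] for r in responses)

    print(total_score(["terrific"] * 14))  # 70, the maximum possible score
    print(total_score(["boring"] * 14))    # 14, the minimum possible score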
Specifically, among the total 10 cabin groups, three male cabin groups (Cave Men, Disco Inferno, and T-Birds) and two female cabin groups (Flinstones and Pink Ladies) were randomly selected. All the campers in the five groups were asked to enter their names on the form on Thursday morning (April 10) but were asked not to enter their names on the second form, administered on April 11. In contrast, campers from the other five cabin groups, two male groups (Gangsters and Star Wars), and three female groups (Bell Bottom Babes, Hippie Chicks, and Flower Power) were asked not to enter their names on April 10 but were asked to enter them on April 11. Counselors were reminded of the administrations of the evaluation during the evening staff meeting on April 9.
Proper assistance for campers who had difficulty in completing the form was offered by the counselors. The completed forms were collected by each head counselor, who then returned them to the principal researcher.
The scores of each camper were used to determine:
1. The relationship between the results of the same instrument administered on two consecutive days to all the camp participants.
2. The relationship between the results of the same instrument administered on two consecutive days for each of the ten cabin groups.
3. The relationship between the results of the same instrument administered on two consecutive days for male campers and female campers.
4. The relationship between the results of the anonymous first evaluation and the non-anonymous (a term used here simply as the antonym of anonymous) second evaluation for male campers.
5. The relationship between the results of the non-anonymous first evaluation and the anonymous second evaluation for male campers.
6. The relationship between the results of the anonymous first evaluation and the non-anonymous second evaluation for female campers.
7. The relationship between the results of the non-anonymous first evaluation and the anonymous second evaluation for female campers.
SPSS for Windows 8.0 was used to obtain the correlation coefficients.
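The study itself used SPSS; purely as an illustration, an equivalent test-retest correlation can be computed in Python with scipy. The score values below are invented for the example and are not the study's data.

    from scipy.stats import pearsonr

    # Hypothetical day-1 and day-2 total scores for the same ten campers.
    day1 = [62, 58, 70, 45, 66, 51, 59, 68, 64, 55]
    day2 = [60, 57, 70, 48, 65, 50, 61, 69, 62, 54]

    r, p = pearsonr(day1, day2)
    print(f"test-retest r = {r:.2f}, p = {p:.4f}")

Applying the same computation to the whole group, to each cabin group, to the male and female subsets, and to the anonymity conditions would yield the 17 coefficients reported below.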
Table 5 shows the 17 correlation coefficients calculated for this study. The correlation coefficient between the results of the same instrument administered on two consecutive days to all the camp participants was .92 (p < .01). The correlation coefficient between the results of the same instrument administered on two consecutive days for each of the ten cabin groups ranged from r = .90 to r = .99 (p < .01) with the exception of the T-Birds, who had an outcome of r = .69. The correlation coefficients between the responses of the same instrument administered on the two consecutive days for male campers and female campers were .94 and .91 (p < .01) respectively. The male respondents showed a higher correlation than female respondents did.
The four correlation coefficients from the anonymity tests were all significant at the p < .01 level. The correlation coefficient between the anonymous first evaluation and the non-anonymous second evaluation for male respondents was r = .79, while the correlation between the non-anonymous first evaluation and the anonymous second evaluation for male respondents was
r = .95. In the case of female respondents, the correlation coefficient between the anonymous first evaluation and the non-anonymous second evaluation was r = .91 while the correlation between the non-anonymous first evaluation and the anonymous second evaluation was r = .93.
The highest standard deviations (11.06 and 11.34), found for the Cave Men (n = 10), indicated that campers in this group produced relatively frequent high and low scores on both evaluations; that is, there was a relatively large spread of scores around the mean among the campers in the Cave Men. In contrast, the T-Birds showed the lowest standard deviation (4.92). However, since there were only 6 respondents in that group, this may not be a meaningful statistic.
The female respondents at Camp Koinonia expressed slightly higher contentment with their camp experience than the male respondents did on both days (60.17 vs. 59.16 and 60.62 vs. 59.52). In the anonymity tests, male respondents felt better about their camp experiences on the second evaluation, regardless of whether it was anonymous. However, female respondents produced slightly higher scores when they were asked to enter their names on the form than when they were not. It may be that the female campers were attempting to respond in a socially desirable way.
The results of these evaluations indicated significant reliability coefficients for the campers as a whole, the 10 cabin groups, the male and female campers, and the anonymity conditions. Of the 17 correlations calculated in this study, only that of one cabin group (the T-Birds) was not significant (r = .69). Only 6 campers from the T-Birds provided scorable responses, and one raw score was an outlier. Therefore, the correlation for the T-Birds may not be meaningful, because the outlier might have unduly influenced the coefficient.
According to Hopkins and Stanley (1981), when the test-retest method is used to determine instrument reliability, a correlation of .80 or above is common for a Likert-type scale. The findings of this study revealed that 16 of the 17 correlation coefficients were significant at the p < .01 level, nearly all of them above .80. Therefore, it is recommended that the SFAS continue to be employed at Camp Koinonia.
The time interval between administrations in the test-retest method is a very important factor that can affect the results. It is doubtful that an interval of one day was enough to produce truly meaningful data for determining the reliability of the instrument. Re-administration to the same group within one day presents a problem because what the campers remember from the first administration may carry over to the second (a carry-over effect). However, a longer interval, such as several weeks or a month between the two evaluations, raises the possibility that a camper's true attitude may change. Thus, an appropriate time interval needs to be determined by future studies. Specifically, since every camper will have experienced all or most of the activities within the first day and a half, it may be desirable to administer the first evaluation on the second day (Tuesday evening) instead of Thursday morning, with the second evaluation conducted on the last day (Friday morning).
The current SFAS offers four positive descriptors (“okay,” “good,” “great,” and “terrific”) while only one negative descriptor, “boring,” is included. It may be possible that the descriptors were developed to induce more positive answers. Moreover, the five descriptors and their faces do not always match each other (three frowning faces and two smiling faces). Therefore, it would be beneficial to develop new descriptors to reflect balanced aspects of camp experiences in order to better assess the reliability of the SFAS.
According to the present researcher's observation, some campers responded very quickly to the evaluation, marking the most positive answer, "terrific." The fact that 16 respondents (18%) on the first day and 14 respondents (16%) on the second day scored 70 (the maximum score) may reflect this tendency to choose one particular answer. It may, therefore, be advisable to invert the order of the answer scale on some items to prevent campers from selecting one particular answer without considering the other options. An example of a modified SFAS question is presented in Figure 2. The redesigned instrument offers five balanced descriptors: "Lousy," "Boring," "Not Sure," "Good," and "Terrific." The five faces have also been amended, and the descriptors of five randomly selected items were inverted (running from "Terrific" to "Lousy") to counteract the tendency to choose one particular answer.
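If the five inverted items of the modified SFAS were adopted, their responses would need to be reverse-coded before the item scores are summed. A minimal sketch follows; the item indices are hypothetical, since the text does not identify which five items were inverted.

    # Reverse-code inverted items on a 1-5 scale: 5 becomes 1, 4 becomes 2, etc.
    INVERTED_ITEMS = {2, 5, 8, 11, 13}  # hypothetical positions of the inverted items

    def rescore(item_index, raw):
        return 6 - raw if item_index in INVERTED_ITEMS else raw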
Finally, coaching by a camper's counselor was another factor that might have influenced the test results. To limit counselors' coaching, more specific administration guidelines should have been introduced before the administration of the questionnaire. Specific instructions, such as "A counselor's grade is not based on a camper's evaluation," "Do not mention any specific descriptor to campers," and "Do not point at any specific descriptor while reading the questions," may be useful in preventing intentional or unintentional coaching by counselors.
In summary, this study tested the reliability of the Smiley Face Assessment Scale (SFAS) used at Camp Koinonia in 1997 through the use of the test-retest reliability method. The findings of this study indicated that the SFAS is a reliable instrument that can continue to be used in an outdoor camp setting for children with multiple disabilities. However, it was questionable whether the time interval between administrations (two consecutive days) was enough to produce truly meaningful data. In addition, the need to develop new descriptors that reflect balanced aspects of camp experiences was pointed out. As a result, a modified SFAS was proposed by the researcher. Further study needs to examine the reliability of the modified SFAS utilizing a different time interval.
References
Bell, N. A., McDonell, J. R., & Winter, J. (1992). A measure of children's prosocial tendencies: An initial validation of a self-report instrument. Journal of Social Service Research.
Brown, F. E. (1997). The relationship of activity implementation to camper outcomes as a function of activity programming for children with varying degrees of disability severity. Unpublished master's thesis, University of Tennessee, Knoxville.
Cunningham, R., Farrow, V., Davies, C., & Lincoln, N. (1995). Reliability of the assessment of communicative effectiveness in severe aphasia. European Journal of Disorders of Communication, 30.
Dattilo, J., Hoge, G., & Malley, S. M. (1996). Interviewing people with mental retardation: Validity and reliability strategies. Therapeutic Recreation Journal, 15, 163-178.
Green, N. L. (1995). Development of the Perception of Racism Scale. Image Journal of Nursing and Scholarship, 2, 141-146.
Hayes, G., & Brown, C. (1997). Camp Koinonia manual (Rev. ed.). Knoxville, TN: Graphic Creation.
Henerson, M. E., Morris, L. L., & Fitz-Gibbon, C. T. (1987). How to measure attitudes. Newbury Park, CA: Sage Publications.
Hopkins, K. D., & Stanley, J. C. (1972). Educational and psychological measurement and evaluation (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hopkins, K. D., & Stanley, J. C. (1981). Educational and psychological measurement and evaluation (6th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Russon, C., & Koehly, L. M. (1995). Construction of a scale to measure the persuasive impact of qualitative and quantitative evaluation reports. Evaluation and Program Planning, 18, 165-177.
Sullivan, J. P. (1995). An examination of the validity of the Profile of Mood States (POMS) in the assessment of mental health in… Unpublished master's thesis, University of Tennessee, Knoxville.
Suttle, N. S. (1985). Agreement in classification choices by school psychologists when evaluating children for special education… Unpublished doctoral dissertation, University of…
Thomas, J. R., & Nelson, J. K. (1996). Research methods in physical activity (3rd ed.). Champaign, IL: Human Kinetics.
Wiig, E. H. (1995). Seminars in speech and language. New York: Thieme Medical Publishers.
Wrightstone, W. (1956). Evaluation of modern education. New York: American Book Company.