HSA 801

Research Design, Measurement & Evaluation Supplementary Materials

Provided by Dr. Raoul A. Arreola

HSA 801 Fall 2005 Student Roster
 
 
Supplementary Handouts
 
 
Measurement/Evaluation Terms and Concepts
 
 

 

   
Measurement

Measurement is the process of systematically assigning numbers to objects or persons for the purpose of indicating differences among them in the degree to which they possess the characteristic being measured. The result of a measurement is, by definition, a number.

   
Nominal Measurement

Nominal measurement consists of assigning items to groups or categories. No quantitative information is conveyed and no ordering of the items is implied. Nominal scales are therefore qualitative rather than quantitative. Religious preference, race, and sex are all examples of nominal scales. Frequency distributions are usually used to analyze data measured on a nominal scale. The main statistic computed is the mode. Variables measured on a nominal scale are often referred to as categorical or qualitative variables.

The numbers in nominal measurement are assigned as labels and have no specific numerical value or meaning. For example, in referring to methods of transportation we might code automobiles = 1, airplanes = 2, boats = 3. This does not mean that three automobiles equal one boat. No form of mathematical computation may be performed on the numbers assigned as nominal measures; only the number of cases falling in each category can be counted.
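As a small illustration of the frequency distribution and the mode mentioned above, the following sketch tallies nominal codes for the transportation example. The ten responses are made up for illustration, and the names `LABELS`, `frequency_distribution`, and `category_mode` are hypothetical.

```python
from collections import Counter

# Nominal codes from the transportation example: the numbers are labels only.
LABELS = {1: "automobile", 2: "airplane", 3: "boat"}

# Made-up responses from ten hypothetical participants.
responses = [1, 3, 1, 2, 1, 3, 1, 2, 1, 3]


def frequency_distribution(codes):
    """Count how many cases fall into each category."""
    return Counter(LABELS[code] for code in codes)


def category_mode(codes):
    """Return the label of the most frequently occurring category."""
    return frequency_distribution(codes).most_common(1)[0][0]


freq = frequency_distribution(responses)
for label, count in freq.items():
    print(label, count)
print("mode:", category_mode(responses))
```

Note that the code never adds or averages the codes 1, 2, and 3; it only counts how often each label occurs.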

   
Ordinal Measurement

Measurements with ordinal scales are ordered in the sense that higher numbers represent higher values. However, the intervals between the numbers are not necessarily equal. For example, on a five-point rating scale measuring attitudes toward gun control, the difference between a rating of 2 and a rating of 3 may not represent the same difference as the difference between a rating of 4 and a rating of 5. There is no "true" zero point for ordinal scales since the zero point is chosen arbitrarily. The lowest point on the rating scale in the example was arbitrarily chosen to be 1. It could just as well have been 0 or -5.

No form of mathematical computation may be done with numbers representing ordinal measures. All that can be done with such measures is to make "greater than" or "less than" comparisons.

   
Interval Measurement

On interval measurement scales, one unit on the scale represents the same magnitude on the trait or characteristic being measured across the whole range of the scale. For example, if anxiety were measured on an interval scale, then a difference between a score of 10 and a score of 11 would represent the same difference in anxiety as would a difference between a score of 50 and a score of 51. Interval scales do not have a "true" zero point, however, and therefore it is not possible to make statements about how many times higher one score is than another. For the anxiety scale, it would not be valid to say that a person with a score of 30 was twice as anxious as a person with a score of 15. True interval measurement is somewhere between rare and nonexistent in the behavioral sciences. No interval-level scale of anxiety such as the one described in the example actually exists. A good example of an interval scale is the Fahrenheit scale for temperature. Equal differences on this scale represent equal differences in temperature, but a temperature of 30 degrees is not twice as warm as one of 15 degrees.

Interval measures may be added or subtracted, but may not be used in any computation requiring multiplication or division. The reason is that an interval measurement scale does not have a true "zero" value. On the Fahrenheit scale, for instance, zero degrees does not mean there is no heat at all.

   
Ratio Measurement

Numbers are assigned that have all the attributes of nominal, ordinal, and interval measures and, in addition, are based on a true "zero" point. Ratio scales are like interval scales except that they have true zero points. A good example is the Kelvin scale of temperature. This scale has an absolute zero. Thus, a temperature of 300 Kelvin is twice as high as a temperature of 150 Kelvin.

A "zero" value in a ratio measurement means there is a complete absence of the variable being measured. Any form of mathematical computation may be carried out on ratio measures.
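The contrast between the Fahrenheit (interval) and Kelvin (ratio) examples can be checked with a little arithmetic. The sketch below is only an illustration; the function name `fahrenheit_to_kelvin` and the particular temperatures compared are assumptions chosen to match the examples in the text.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Interval property: equal Fahrenheit differences are equal temperature differences.
diff_low = fahrenheit_to_kelvin(30.0) - fahrenheit_to_kelvin(15.0)
diff_high = fahrenheit_to_kelvin(60.0) - fahrenheit_to_kelvin(45.0)

# Dividing raw Fahrenheit readings gives 2.0, but that ratio is not meaningful.
naive_ratio = 30.0 / 15.0

# On the Kelvin scale, which has a true zero, the same two temperatures
# differ by only a few percent.
true_ratio = fahrenheit_to_kelvin(30.0) / fahrenheit_to_kelvin(15.0)

# Ratios of Kelvin readings are meaningful: 300 K is twice 150 K.
kelvin_ratio = 300.0 / 150.0

print(diff_low, diff_high)
print(naive_ratio, true_ratio, kelvin_ratio)
```

The two differences come out equal, while the ratio of the Fahrenheit readings (2.0) disagrees with the ratio of the same temperatures expressed in kelvins.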

   
Evaluation

Evaluation is the process of interpreting a measure (or aggregate of measures) by means of a specific value (or set of values) to determine whether the measure(s) represent a desirable or undesirable condition. The result of an evaluation is a judgment.

   
Reliability

The degree to which an instrument measures whatever it is measuring consistently. A reliable instrument will provide consistent measures of an object or person as long as there is no change in the object or person on the dimension or characteristic being measured. For example, suppose I get on my bathroom scale and the scale reads (unfortunately) 196 lbs. If I then get off, wait a minute, get back on, and it reads 196 lbs again, the bathroom scale may be said to be reliable. Reliability values can range from -1.00 through 0 to +1.00.
   
Split Halves Reliability

To determine the reliability of an instrument using the Split Halves method, the instrument is administered (measures are taken using the instrument). The instrument is then divided into two sets of items (top half and bottom half, or some random distribution of items into two groups) and the responses from the two groups are correlated. The correlation between the responses on the top (first) half and the bottom (second) half is the split-halves reliability index of the instrument. A split-half reliability of -1.00 means that those who scored highly on the first half scored poorly on the second half and vice versa. A split-half reliability of 0.00 means that there was no correlation between the first and second halves. A split-half reliability of 1.00 means that there was a perfect positive correlation between the first and second halves.
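The procedure just described can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the data are made up, the helper names `pearson_r` and `split_half_reliability` are hypothetical, the correlation is assumed to be the Pearson correlation, and the split used is the simple top half versus bottom half.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half_reliability(item_scores):
    """Correlate each respondent's total on the first half of the items
    with the total on the second half.

    item_scores: one row per respondent, one column per item.
    """
    half = len(item_scores[0]) // 2
    first_half = [sum(row[:half]) for row in item_scores]
    second_half = [sum(row[half:]) for row in item_scores]
    return pearson_r(first_half, second_half)


# Made-up data: five respondents, six items scored 0 (wrong) or 1 (right).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
]

reliability = split_half_reliability(scores)
print(round(reliability, 2))
```

With these made-up data, respondents who do well on the first three items also tend to do well on the last three, so the index is close to +1.00.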
   
Test-Retest Reliability

To determine the reliability of an instrument using the Test-Retest method, the instrument is administered to a specific group of individuals. Then, at a later time, the instrument is administered to the same group, making certain that nothing has happened to the group to affect the characteristic or dimension being measured. The data from the first administration of the instrument are correlated with the data from the second administration. This correlation is the Test-Retest reliability index of the instrument. A test-retest reliability of -1.00 means that those who scored highly on the first administration scored poorly on the second administration and vice versa. A test-retest reliability of 0.00 means that there was no correlation between the first and second administrations. A test-retest reliability of 1.00 means that there was a perfect positive correlation between the first and second administrations of the instrument.
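To make the three landmark values concrete, the sketch below correlates one made-up first administration with three different made-up second administrations. The scores and the helper name `pearson_r` are hypothetical, and the Pearson correlation is assumed as the correlation coefficient.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Made-up scores for the same five people on the first administration.
first = [10, 12, 15, 18, 20]

# Three hypothetical second administrations.
retest_similar = [11, 12, 16, 17, 21]    # nearly the same ordering and spacing
retest_reversed = [20, 18, 15, 12, 10]   # high scorers became low scorers
retest_unrelated = [14, 11, 17, 11, 14]  # no linear relation to the first scores

r_similar = pearson_r(first, retest_similar)
r_reversed = pearson_r(first, retest_reversed)
r_unrelated = pearson_r(first, retest_unrelated)

print(round(r_similar, 2), round(r_reversed, 2), round(r_unrelated, 2))
```

The first comparison gives an index near +1.00, the second gives -1.00, and the third gives essentially 0.00.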
   
Kuder-Richardson Reliability

In computing the reliability using the Kuder-Richardson approach, each and every item is considered an individual "measure". Then every possible pair of individual measures (items) is considered and the correlations are computed. The average of all such correlations is the reliability index of the instrument as computed by the appropriate Kuder-Richardson formula.
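The entry above describes the Kuder-Richardson approach conceptually and does not say which formula is meant. As one possible illustration, the sketch below computes the commonly cited Kuder-Richardson Formula 20 (KR-20) for items scored 0 or 1: KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), where k is the number of items, p is the proportion of respondents scoring 1 on an item, and q = 1 - p. The choice of KR-20, the use of the population variance (dividing by n), the made-up data, and the function name `kr20` are all assumptions.

```python
def kr20(item_scores):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) items.

    item_scores: one row per respondent, one column per item.
    Assumes the population variance (divide by n) of the total scores.
    """
    n = len(item_scores)
    k = len(item_scores[0])

    # Sum over items of p * q, where p is the proportion scoring 1 on the item.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n
        sum_pq += p * (1.0 - p)

    # Variance of the respondents' total scores.
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


# Made-up data: five respondents, six items scored 0 (wrong) or 1 (right).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
]

reliability = kr20(scores)
print(round(reliability, 2))
```

Unlike the split-halves index, this calculation uses every item at once and does not depend on how the instrument happens to be divided.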
   
Validity

A valid measurement instrument is one that, in fact, measures (reliably and with an acceptable degree of accuracy) what it was designed to measure. NOTE: A measurement instrument may be reliable and accurate but NOT valid. That is, it may be measuring SOMETHING reliably and accurately, but not the thing that was intended. For example, a bathroom scale may be a very reliable and accurate measure, but of WEIGHT, not HEIGHT or IQ.
   
Face Validity

The degree to which an instrument appears to measure what it is intended to measure. Face validity is usually determined by providing the instrument to 1) experts in the issue to be measured and 2) a sample of people of the type who will be completing the instrument, and asking for their judgment as to whether the instrument appears to measure the issue or characteristic of interest.
   
Criterion-Related Validity (predictive validity)

The degree to which the results of the instrument correlate with another measure that is an unquestioned measure of the issue or characteristic of interest. For example, the Stanford Binet Intelligence Test is a long, exhaustive measure that requires a high degree of training to administer and score, and it is a long-accepted measure of intelligence. If we develop a short, easy-to-score IQ test, we can determine its predictive or criterion-related validity by administering it to people who have already taken the Stanford Binet and then determining whether our instrument is highly related to, or would have predicted, their Stanford Binet scores.
   
Construct Validity

The degree to which the results of the instrument follow a pattern predicted by a model or theory. The degree to which a measure relates to other variables as expected within a system of theoretical relationships.
   
Content Validity

The degree to which a measure covers the range of meanings included within a concept. (Is every aspect or dimension that defines the concept being measured by some item or set of items?)
   
Variable

A variable is any measured characteristic or attribute that differs for different subjects. For example, if the height of 30 trees were measured, then height would be a variable.
   
Continuous Variable

A continuous variable is one for which any value is possible within the limits over which the variable ranges. For example, a person's height is a continuous variable ("height" can fall anywhere along the range of possible values).
   
Discrete Variable

A discrete variable is one that cannot take on all values within the limits of the variable. For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5. The variable cannot have the value 1.7.
   
Quantitative Variable

Quantitative variables are measured on an ordinal, interval, or ratio scale. If fifty-five-year-old subjects were asked to name their favorite actress, then the variable would be qualitative. If the time it took them to respond were measured, then the variable would be quantitative.
   
Qualitative Variable

Qualitative variables are measured on a nominal scale. If fifty-five-year-old subjects were asked to name their favorite actress, then the variable would be qualitative. If the time it took them to respond were measured, then the variable would be quantitative.
   
Independent Variable

Variables that are manipulated in research studies are referred to as independent variables. The dependent variable is the one that you expect to change as a function of an independent (or intervention) variable.

For example, in a study examining whether using a hand calculator improves learning statistics, student performance (as measured by a statistics test) would be the dependent variable, and using (or not using) a hand calculator would be the independent variable.

In other words, in a research study, a variable that changes as a function of some intervention (the independent variable) is the dependent variable.

   
Dependent Variable

When an experiment is conducted, some variables are manipulated by the experimenter and others are measured from the subjects. The former variables are called independent variables, or factors; the latter are called dependent variables or dependent measures. (See the Independent Variable discussion above.)
   
SCALES: General Discussion

This section is excerpted from Mehrens and Lehmann's "Measurement and Evaluation in Education and Psychology" 3rd edition, 1984, published by Holt, Rinehart & Winston.

Scales designed to measure attitudes (beliefs, perceptions, philosophical positions, orientation, prejudices, etc.) are classified in terms of their method of construction. There are three major procedures or techniques for constructing attitude scales: summated ratings such as the Minnesota Scale for the Survey of Public Opinion (Likert type); equal-appearing interval scales such as the Thurstone and Remmers scales (Thurstone type); and cumulative scales (Guttman type). In addition, the Semantic Differential, though not a type of scale construction, is also used.

These techniques differ primarily in their format: in the positioning of the statements or adjectives along a continuum versus only at the extremes; and whether or not the statements are cumulative (such as the Bogardus social distance scale). There are advantages and disadvantages associated with each of these techniques. For example, the Thurstone method places a premium on logic and empiricism in its construction, but unfortunately such an instrument is somewhat laborious to develop.

In the Likert, Thurstone, and Guttman methods, statements are written and assembled into a scale and the subject responds (either positively or negatively) to each statement. On the basis of the subject's responses, an inference is made about the respondent's attitude toward some object(s). In the Semantic Differential, the subject rates a particular attitude object(s) on a series of bipolar semantic scales such as good-bad, sweet-sour, strong-weak. Each of these approaches to constructing a scale is different. Each has its own advantages and limitations. Each of the techniques makes different assumptions about the kind of test items used and the information provided, even though there are some assumptions that are basic and common regardless of the method used. For example, each method assumes that subjective attitudes can be measured quantitatively, thereby permitting a numerical representation (score) of a person's attitude. Each method assumes that a particular test item has the same meaning for all respondents, and therefore a given score to a particular item will connote the same attitude. "Such assumptions may not always be justified but as yet, no measurement technique has been developed which does not include them."

   
Thurstone Scale

A way of measuring people's attitudes along a single dimension by asking them to indicate whether they agree or disagree with each of a large set of statements (e.g., 100) about that attitude. The statements are designed to be parallel in construction, with some worded toward one end of the scale and some toward the other end, and each expressing the attitude in a slightly different way.


This can be contrasted with a Likert scale, which asks someone to indicate their degree of agreement or disagreement with a single statement. For example, a Likert item would be: "Please rate on a scale of 1 (Strongly Disagree) to 4 (Strongly Agree) the statement: This software was easy to use."

The corresponding Thurstone scale would state this question in multiple ways, e.g.:
* I had trouble finding what I wanted.
* I liked how easy the software was.
* The software has many convenient features.
* The software was confusing.
* etc.


Finally, to choose the statements people respond to, you need to validate them. For instance, you would have expert judges (or pre-testing subjects) rate each statement on the extent to which it reflects either extreme of the attitude being measured.

   
Likert Scale

A rating scale measuring the strength of agreement with a clear statement. Often administered in the form of a questionnaire used to gauge attitudes or reactions.


For example:
Question: "I found the software easy to use..."
1 Strongly Disagree
2 Disagree
3 Agree
4 Strongly Agree
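The general discussion above classifies Likert-type scales as summated ratings. The sketch below shows one way such a total might be computed from the four response options listed; the three answers are made up, and the names `LIKERT_CODES` and `summated_score` are hypothetical.

```python
# Numeric codes for the four response options shown in the example above.
LIKERT_CODES = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Agree": 3,
    "Strongly Agree": 4,
}


def summated_score(answers):
    """Add up the numeric codes of one respondent's answers."""
    return sum(LIKERT_CODES[answer] for answer in answers)


# Made-up answers from one respondent to three hypothetical statements.
answers = ["Agree", "Strongly Agree", "Disagree"]
total = summated_score(answers)
print(total)
```

This simple sum assumes every statement is worded in the same direction, so that a higher code always indicates a more favorable attitude.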


   
Semantic Differential

A type of survey question in which respondents are asked to rate their opinion on a linear scale between two endpoints, typically with 7 levels. For example:
Please rate this software on the following dimensions:


easy to use 1 2 3 4 5 6 7 hard to use
-or-
easy to use 3 2 1 0 1 2 3 hard to use


   
Guttman Scale

The Guttman scale is a comparative scaling technique developed by researcher Louis Guttman in 1944.
In a Guttman scale, a unidimensional set of items is ranked in order, much like a Likert scale; items range from the least extreme to the most extreme position. It is implicit that those who agree with a more extreme position also agree with the less extreme positions preceding it. The scale score is obtained by summing all responses until the first negative response in the list.
The Guttman scale has become less popular in recent years, although it is still used occasionally.


Here is a hypothetical (extreme) example of the scale:

  1. Some children occasionally require physical restraint when unruly. (Least extreme)
  2. Slapping a child's hand is an effective discipline technique.
  3. Spanking is sometimes necessary to control children.
  4. Sometimes children require firm discipline with a belt or whip.
  5. Some children need a regular vigorous beating to keep them in line. (Most extreme)
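The scoring rule described above (summing responses until the first negative response) can be sketched as follows. The response patterns are made up, and the function names `guttman_score` and `is_cumulative` are hypothetical; the second function simply checks the implicit assumption that nobody agrees with a more extreme item after disagreeing with a less extreme one.

```python
def guttman_score(responses):
    """Count agreements up to, but not including, the first disagreement.

    responses: list of booleans ordered from the least extreme item
    to the most extreme item (True = agree, False = disagree).
    """
    score = 0
    for agreed in responses:
        if not agreed:
            break
        score += 1
    return score


def is_cumulative(responses):
    """Return True if no agreement occurs after the first disagreement."""
    first_negative = guttman_score(responses)
    return not any(responses[first_negative:])


# Made-up respondent who agrees with items 1-3 and rejects items 4-5.
typical = [True, True, True, False, False]
print(guttman_score(typical), is_cumulative(typical))

# Made-up respondent whose answers do not fit the cumulative pattern.
irregular = [True, False, True, False, False]
print(guttman_score(irregular), is_cumulative(irregular))
```

For the five-item example, a score of 3 would mean the respondent accepts statements 1 through 3 and rejects the more extreme statements 4 and 5.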


   
Bogardus Social Distance Scale

A Bogardus Social Distance Scale consists of a set of questions that increase in terms of the closeness of contact that the respondent may or may not want with members of another racial or ethnic group. The differences in intensity of contact presume that if the respondent is willing to accept a given kind of association, he or she would be willing to accept all those preceding it in the list of questions (those with lesser intensities). For example, a person willing to permit members of a different race or ethnicity to live in the neighborhood will surely accept them in the community or nation, but may or may not accept them as next-door neighbors or relatives. There is a logical structure of intensity inherent in the set of questions.
