Methodology: Overall Design

Because full implementation of class size reduction will take several years (see note 3), the evaluation should collect data annually for three years, beginning in the current 1997-98 school year. Archival data should also be used to establish pre-CSR baselines against which to measure change. For some topics (e.g., classroom practices), however, establishing such a baseline will not be possible. In these cases, the status at the beginning of the study and changes thereafter should be described. However, for those classrooms and schools where CSR was not fully implemented by the beginning of the current school year (i.e., 1997-98), baseline data can and should be collected.

To enable linking and aggregating information gathered at different levels of the system, a nested sampling design of districts, schools, and classrooms should be implemented. First, a stratified random sample of districts representing the state as a whole should be selected; then a sample of schools within those districts; and finally, samples of teachers and classrooms within those schools. Particular attention should be paid to the relationship between CSR and student outcomes, including achievement. To this end, achievement test scores of successive cohorts of fourth-grade students should be analyzed using a time series (post-test only) design (see note 4). Other analyses of student outcomes would complement this approach. The sampling plan adopted should have sufficient power to detect mean differences of 1/8th of a standard deviation and percentage differences at the median of ± 5% with an error rate of 5% or less when groups are being compared.
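To give a sense of scale for the 1/8th standard deviation target, the following is a minimal sketch of the usual normal-approximation sample size formula for comparing two group means. The plan states only the 5% error rate; the 80% power level and the optional design-effect multiplier for clustered samples are assumptions added for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80, design_effect=1.0):
    """Approximate sample size per group for a two-sided, two-sample z-test.

    effect_size   -- difference in means, in standard deviation units
    alpha         -- two-sided Type I error rate (5% in the plan)
    power         -- assumed here; the plan does not state a power level
    design_effect -- assumed multiplier for clustering (1.0 = simple random sample)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    return math.ceil(n * design_effect)


# Students needed in each comparison group to detect 1/8 SD
# under simple random sampling and the assumptions above.
print(n_per_group(1 / 8))
```

Because students are nested within classrooms, schools, and districts, the effective requirement would be larger than the simple-random-sample figure; the `design_effect` argument is one conventional way to allow for that.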

Data Collection

The data collection plan should be the one best suited to the questions listed above while minimizing the burden on respondents. Table 1 shows the data collection plan for the evaluation of CSR.

Existing Databases. Several existing state or district databases should be used to address questions relating to resource allocation and teacher preparation, including (1) the California Basic Educational Data System (CBEDS), which includes information about teachers’ backgrounds and personnel assignments, and district and school information, (2) district financial records maintained by the State Department of Education (SDE), and (3) the Cost-of-Education Index (CEI), which permits adjustments for geographical differences in costs.

Standardized Tests. The plan is to analyze data from the standardized, statewide achievement test, which is to begin in the spring of 1998, and which presumably will also be available on CBEDS. Data from these tests should be complemented with district and school records to monitor changes in other student outcomes, including attendance, retention/promotion in grade, transition rates into and out of special programs, and behavioral problems.

An oral reading test such as the NAEP Integrated Reading Performance Record (Pinnell et al., 1995) should also be administered to successive samples of fourth grade students beginning in the spring of 1998 through the spring of 2000 to ensure that more than reading proficiency is being assessed.

All students will be required to take the state’s new STAR test in English. As indicated above in discussing reading achievement, the test may not be sensitive enough to measure achievement among limited English proficient students who have not been taught solely in English. Hence, data from tests taken in languages other than English should also be analyzed as part of the evaluation plan. In this regard, many districts opt to administer tests in Spanish to Spanish-speaking students. For example, the SABE/2 test is a nationally normed Spanish-language standardized achievement test in reading and mathematics published by CTB/McGraw-Hill and is comparable to the commonly used English-language CTBS/4 test.

Finally, a reading readiness test (e.g., Woodcock-Johnson) should be administered as part of the evaluation to successive samples of fourth grade LEP students beginning in the spring of 1998 and continuing through the spring of 2000 in order to better assess pre-reading achievement gains for LEP students--gains which might not be detected by the new standardized reading test.

Mail Surveys. The evaluation plan should include four survey instruments, one each for district administrators, school administrators, teachers, and parents. The district survey should be sent to superintendents, but it should be organized by district function (finances, personnel, instruction, facilities) so that it can be delegated to staff with those particular administrative responsibilities. It should focus on questions about the use of facilities, changes in internal resource allocations, recruitment and hiring, teacher assignment practices, staff development, integration of CSR with the district’s planning efforts, parent involvement, and opportunities forgone as a result of class size reduction.

The school survey should be sent to principals and should include questions about classroom organization (resource allocation), the use of school facilities, changes in teacher assignments (staff development), support for new staff, the provision of equipment and materials, integration with other reform efforts and programs, and the involvement of parents in school activities.

The teacher survey should focus on professional development opportunities, curriculum coverage, classroom organization, access to materials, the effect of reforms on teaching and learning, contacts with parents, and selected instructional issues.

The parent survey should address parents’ participation in policy making, contact with their children’s school, support for learning, and satisfaction with class size reduction.

Case Studies, Observations, and Videotaping. Case studies should be conducted in a limited number of districts, schools within those districts, and classrooms within those schools to collect qualitative and process information that can be obtained only through open-ended interviews, field observations, and videotaping.

As appropriate to the entity studied (i.e., the district, school, or classroom), open-ended interviews should be conducted with administrators, principals, teachers, special program directors, school board members, union and parent representatives, as well as others. These interviews should cover issues related to district and school policies concerning class size reduction, including implementation (e.g., which grades to reduce first, facilities requirements, teacher hiring and assignment, allocation of resources), instructional support (e.g., staff development, teacher planning or collaboration, curriculum development), integration with other programs, and problems encountered and solutions adopted.

Classroom case studies should be the primary source of information for aspects of curriculum and instructional practices that are difficult to measure using surveys, including teachers’ approaches to curriculum topics, their expectations, and the use of reform-oriented instructional strategies. Specific data collection strategies should include teacher and principal interviews, teacher logs, classroom artifacts, observations, and the videotaping of actual classroom instruction.

Interviews with teachers should cover such areas as teaching background (e.g., years of experience, range of class sizes taught, credentials), curriculum and teaching practices, beliefs about the effects of class size reduction, perceptions of how practice changed with class size reduction (or, for teachers in larger classrooms, perceptions of how practice might change), type and quality of instructional supports, type and frequency of contact with parents, and perceptions of student outcomes associated with smaller classes.

The collection of logs and classroom artifacts (annotated assignments and examples of student work) should focus on the content of the curriculum and the embodiment of instructional goals. A sample of teachers should also be asked to keep a daily log that collects information on curriculum topic coverage and emphasis on teaching practices, student activities, grading and homework, texts, and equipment. Teachers should keep these logs one week out of every four according to a schedule to be developed at the beginning of the study. These logs would be supplemented with copies of curriculum materials, including classroom and homework assignments, assessments, and samples of student work from a random sample of five students per class. Teachers should be paid for participating in the study and carrying out the extensive data collection activities.
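The log schedule and the student work sample described above could be generated along the following lines. This is only an illustrative sketch: the plan leaves the actual schedule to be developed at the start of the study, so the rotation rule, the function names, and the roster shown here are all hypothetical.

```python
import random


def log_weeks(offset, n_weeks, cycle=4):
    """Weeks (numbered from 1) in which a teacher keeps a daily log.

    Each teacher logs one week out of every `cycle` weeks. `offset`
    (0 .. cycle-1) staggers teachers so that some are logging every week.
    The staggering rule is an assumption, not part of the plan.
    """
    return [w for w in range(1, n_weeks + 1) if (w - 1) % cycle == offset]


def sample_students(roster, k=5, seed=None):
    """Draw a simple random sample of k students from a class roster."""
    return random.Random(seed).sample(roster, k)


# Hypothetical 12-week term and a made-up roster of 20 students.
print(log_weeks(offset=0, n_weeks=12))
roster = [f"student_{i:02d}" for i in range(1, 21)]
print(sample_students(roster, k=5, seed=1998))
```

Fixing the seed makes the draw reproducible, which may help if the same five students' work is to be collected across several logging weeks.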

Observations and videotapes should be used to examine instructional interactions and teacher practice variables that cannot be captured in interviews or inferred from artifacts. During the first year, a mathematics lesson and a language arts lesson from each teacher should be videotaped, with similar lessons videotaped during the second year. This scheme is designed to balance the advantages and disadvantages of the two methods.


Mail surveys and case studies in a nested sample of districts, schools within these districts, and classrooms within these schools should be conducted as follows:

Mail Surveys. First, a stratified random sample of districts should be selected with probability proportional to size, i.e., the number of fourth grade classes each contains. In addition to size, stratifying variables should include median income, urbanicity (i.e., urban, suburban, rural), and share of students in the district with limited English proficiency. This approach ensures that the largest school districts in the state are in the sample and that the state’s diversity of settings and students is adequately represented.
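One way to picture selection with probability proportional to size within strata is the sketch below. It uses systematic sampling, which is a common technique but is an assumption here, since the plan does not name a selection procedure. The district names, strata, and class counts are invented, and only one stratifying variable is shown for brevity.

```python
import random


def pps_systematic(units, n, rng):
    """Systematic probability-proportional-to-size selection.

    units -- list of (name, size) pairs; size = number of fourth grade classes
    n     -- number of selection points
    A unit at least as large as the sampling interval is always selected,
    and may be hit more than once.
    """
    total = sum(size for _, size in units)
    step = total / n
    start = rng.random() * step          # random start in [0, step)
    points = [start + i * step for i in range(n)]
    chosen, cumulative, idx = [], 0.0, 0
    for name, size in units:
        cumulative += size
        while idx < n and points[idx] < cumulative:
            chosen.append(name)
            idx += 1
    return chosen


def stratified_pps(districts, per_stratum, seed=0):
    """Apply PPS selection separately within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for name, stratum, size in districts:
        strata.setdefault(stratum, []).append((name, size))
    # Repeated hits on a very large district collapse to one selection.
    return {s: sorted(set(pps_systematic(units, per_stratum, rng)))
            for s, units in strata.items()}


# Hypothetical districts: (name, urbanicity stratum, fourth grade classes).
districts = [
    ("A", "urban", 400), ("B", "urban", 120), ("C", "urban", 60),
    ("D", "suburban", 90), ("E", "suburban", 45), ("F", "suburban", 30),
    ("G", "rural", 12), ("H", "rural", 8), ("I", "rural", 5),
]
print(stratified_pps(districts, per_stratum=2, seed=1))
```

In this toy data, district A holds more than half of the urban classes, so it is selected whatever the random start, which mirrors the point that the largest districts end up in the sample.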

Second, within the districts selected above, a random sample of schools that contain grades K-4 should be selected, averaging three schools per district with a minimum of one school per district. The selection process should be the same as above, i.e., the probability of selection of classrooms in any one district should be proportional to the number of fourth grade classes in the district after stratifying for share of limited English proficiency in the schools.

Finally, a sample of teachers and parents should be selected from the universe of teachers and parents in the schools selected above. The parent sample should be drawn so that it is not limited to the most active or engaged parents.

The sampling strategy utilized must provide samples large enough to generalize reliably the findings to the state as a whole for all fourth graders as well as for major subgroups based on gender, region, urbanicity, ethnicity and percent of LEP students in the school. A survey completion rate of 80 percent for districts and schools, and 75 percent for teachers and parents is assumed for the plan.
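The assumed completion rates translate directly into how many surveys must be mailed to reach a given number of completed responses. The sketch below shows that arithmetic; the target counts are hypothetical placeholders, since the plan does not state required sample sizes, and the function name is an assumption.

```python
def surveys_to_mail(target_completes, completion_rate_pct):
    """Surveys to send so that the expected number completed meets the target.

    completion_rate_pct is a whole-number percentage (80 for districts and
    schools, 75 for teachers and parents in the plan). Integer ceiling
    division avoids floating-point rounding surprises.
    """
    return -(-target_completes * 100 // completion_rate_pct)


# Hypothetical targets for completed surveys.
print(surveys_to_mail(100, 80))    # districts or schools
print(surveys_to_mail(1005, 75))   # teachers or parents
```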

Case Studies. The districts, schools, and classrooms in which the detailed implementation and qualitative case studies would be conducted should be purposively selected from the sample of districts and schools as selected above. A nested design should also be used for this part of the project using classrooms generated as part of the overall sampling plan outlined above. The diversity of the state with respect to racial/ethnic composition should be represented in the case studies. At least two of the largest districts in the state, and at least two suburban and one rural district should be included.

Student Achievement Tests. Since the state achievement tests will be administered to all fourth grade students throughout the state, the plan calls for data for all classrooms in the state to be used (see Analysis, below). In addition, to permit analyses of student achievement within cohorts, all schools in the sampled districts that used the adopted state test in 1996-97, 1997-98, and 1998-99 should be selected (see Analysis, below). Based on current test use data in California, it is estimated that this subsample would include approximately 50 schools.

SABE/2 data should also be collected in four districts selected from those included in the overall sampling plan and, within each of those districts, in two schools. Districts and schools should be chosen on the basis of availability of SABE/2 test scores, size, region, and a high percentage of limited English proficient students whose primary language is Spanish. According to the California Department of Education, 132 districts in the state are using the SABE/2 test, including San Jose, San Francisco, Bakersfield, and Pasadena.

Finally, in a subsample of no fewer than 200 schools chosen from the schools included in the overall sampling design, no fewer than five students per school should be randomly selected to take an oral reading test in order to check whether high scores on the state’s new standardized reading achievement test correlate highly with actual reading ability. A second sample of no fewer than five students in each of these schools should be selected from special populations to take an individual-level reading readiness test.
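The validation step amounts to correlating each sampled student's score on the state reading test with the same student's oral reading score. A minimal sketch of that calculation follows, using the Pearson correlation coefficient as one reasonable choice; the plan does not specify a statistic, and the scores shown are invented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired scores for the same students.
state_test = [610, 585, 640, 598, 622, 575]
oral_reading = [3.5, 2.5, 4.0, 3.0, 3.5, 2.0]
print(round(pearson_r(state_test, oral_reading), 3))
```

A high positive coefficient would be consistent with the state test tracking actual reading ability; a low one would suggest the two measures capture different things.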

Data Analyses

Although the evaluation plan calls for the collection of different types of data (quantitative and qualitative), from a range of respondents (administrators, teachers, and parents), using a variety of methods (surveys, interviews, case studies), relating to a number of issues from resource allocation and classroom practices to student outcomes, the analysis plan presented below follows a simple overall strategy. Considerable effort should go into data preparation. In the case of quantitative data (such as state databases, surveys, fixed-choice interviews, and student outcome data), steps include data entry and verification, data reduction/simplification, the linking of comparable data from different sources, and the preparation of files for analysis. Similar functions would be performed for qualitative data, such as open-ended interviews, classroom artifacts, and logs, according to procedures described below.

Quantitative Data Analyses. Data analysis should provide valid and reliable results. The statewide distribution of information about each of the research questions should be examined, as well as differences in CSR effect variables between particular types of districts, schools, classrooms, and students. Relationships among the seven research issues should also be explored, for example associations between teacher experience and classroom practices, or between resource allocation and parent involvement. More complex relationships should also be examined, including questions about the relative contributions of multiple factors. In addition, changes over time should be examined by repeating much of the data collection on an annual basis. The approach to longitudinal analyses should be similar to the cross-sectional analyses just described.

To determine whether class size reduction is related to an increase in reading and mathematics scores, average achievement scores by schools on California’s new standardized test for successive fourth grade cohorts in the springs of 1998, 1999, and 2000 should be examined. This analysis should include all fourth grade students in California elementary schools who take the test. If CSR is having an impact on reading and mathematics achievement, a statistically significant improvement in scores in 1999 (compared to 1998) and again in 2000 (compared to 1999 and 1998) would be expected.
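As an illustration of the cohort comparison, the sketch below computes a paired t statistic on school-level mean scores from two successive springs. Pairing by school is an assumption made here for simplicity; the plan specifies school averages and successive cohorts but not a particular test. The scores are invented.

```python
import math
from statistics import mean, stdev


def paired_t(scores_earlier, scores_later):
    """Paired t statistic for school mean scores in two successive years.

    Both lists hold one mean per school, in the same school order.
    Returns (t, degrees_of_freedom).
    """
    diffs = [later - earlier
             for earlier, later in zip(scores_earlier, scores_later)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical fourth grade school means, spring 1998 vs. spring 1999.
spring_1998 = [50.0, 52.0, 48.0, 51.0]
spring_1999 = [52.0, 53.0, 51.0, 52.0]
t, df = paired_t(spring_1998, spring_1999)
print(round(t, 3), df)
```

The statistic would then be compared with the critical value for the stated 5% error rate. The same function could be applied to the 1999 and 2000 cohorts.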

Analyses should then be refined by including teacher characteristics (e.g., years in the classroom and teaching credentials) by school and by grade level in the model to determine whether they have a significant relationship with reading and mathematics achievement above and beyond their relationship with CSR measures. Finally, it is important to see whether any observed relationships between CSR and achievement hold for all regions, genders, racial and ethnic groups, and social classes. Therefore, all the analyses above should be repeated after dividing the sample using these variables.

One issue that needs to be addressed in the analysis of the successive fourth grade cohort study is the introduction of a new test. As teachers and students become familiar with the test, an apparent achievement gain is likely. However, this observed "gain" may result from their growing knowledge of the test format or from "teaching to the test," rather than from CSR. Although test scores may improve because of teaching to the test, research shows that this gain cannot be replicated when the same content area is measured by a second test (Linn and Kiplinger, 1995). Therefore, in addition to the analysis of outcome data for successive fourth grade cohorts, data from a sample of schools that used the test in 1995-96 and 1996-97 (i.e., before it was chosen as the STAR test) should be analyzed to estimate how much gain might be expected from teaching to the test (i.e., in the absence of CSR). These estimates should be noted when reporting results relating STAR scores to CSR, assuming the sample of districts that used the standardized test prior to its adoption is representative of the population of districts in the state.
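The reasoning in this paragraph can be expressed as simple arithmetic: estimate the year-to-year gain in schools that used the test before its adoption, then set it against the gain observed after adoption. The sketch below subtracts one from the other, which is one straightforward reading; the plan asks only that the familiarity estimate be noted alongside the results. All numbers and function names are hypothetical.

```python
from statistics import mean


def mean_gain(scores_year1, scores_year2):
    """Average school-level change between two years (same school order)."""
    return mean(b - a for a, b in zip(scores_year1, scores_year2))


def familiarity_adjusted_gain(observed_gain, familiarity_gain):
    """Observed gain net of the gain expected from test familiarity alone."""
    return observed_gain - familiarity_gain


# Hypothetical school means for early users of the test.
scores_1995_96 = [40.0, 45.0, 50.0]
scores_1996_97 = [41.0, 47.0, 51.5]
familiarity = mean_gain(scores_1995_96, scores_1996_97)

# Hypothetical statewide gain observed between spring 1998 and spring 1999.
observed = 4.0
print(familiarity, familiarity_adjusted_gain(observed, familiarity))
```

The adjustment is only as good as the assumption, stated in the text, that the early-user schools are representative of districts statewide.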

The relationship between CSR and oral reading ability for a subsample of fourth grade students should also be examined, using the same methods outlined above. Furthermore, the results should be compared with those generated by the 1994 NAEP oral reading assessment.

Given the large limited English proficiency (LEP) population in California, an important question is whether the relationship between CSR and achievement is the same for all students. Further, resources that might otherwise be used for special populations may now be used for CSR. Therefore, all of the analyses described above should be run separately as a function of LEP/non-LEP status.

In addition to paying particular attention to the achievement of LEP students in the general outcomes analysis, achievement data for LEP students who might be lost in English versions of standardized tests should be captured. SABE/2 data should be analyzed for schools in four large districts, as noted above. This analysis should assess whether the relationship between CSR and achievement gains is the same for LEP students tested in English and those tested in Spanish.

Finally, it is also important to examine the relationship of CSR to the reading and mathematics achievement of children with disabilities. Therefore, all achievement analyses should compare students with and without disabilities as a function of CSR. Analysis of the reading readiness data should also help determine the relationship between CSR and achievement for students with disabilities.

While reading achievement and readiness are core outcome variables in the evaluation plan, other important student outcomes may also be related to CSR. For example, the plan calls for a determination of whether referrals to special programs such as sheltered English courses, resource classes, and Reading Recovery and other reading programs have decreased. Changes in the number of students classified as special education, the promotion rates from third to fourth grade, and the number of students referred for disciplinary action should also be investigated. The same analytic strategies outlined above for examining the relationship between CSR and reading achievement should be used in examining the relationships between CSR and these variables.

Qualitative Data Analyses. The qualitative data analyses deserve special discussion. Interviews and field notes should be entered into a computer-based program for organizing text data (such as Ethnograph or NUD*IST). The coding method should be valid, reliable, and consistent across fieldworkers.
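Consistency of coding across fieldworkers is often checked by having two people code the same passages and computing an agreement statistic. The sketch below uses Cohen's kappa, which is a common choice but an assumption here, since the plan does not name a reliability measure. The code labels and the coded passages are invented.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same passages.

    Kappa compares observed agreement with the agreement expected by
    chance given each coder's own distribution of codes.
    """
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two fieldworkers to six interview passages.
coder_1 = ["hiring", "facilities", "hiring", "training", "facilities", "hiring"]
coder_2 = ["hiring", "facilities", "training", "training", "facilities", "hiring"]
print(cohens_kappa(coder_1, coder_2))
```

A value near 1 indicates close agreement; low values would suggest that the code definitions need to be tightened or the fieldworkers retrained before full coding begins.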

To illustrate relationships derived from the analysis of qualitative data, visual displays, such as matrices and narrative tables, should be developed. Cross-site analyses should also be conducted which draw conclusions using both quantitative survey and qualitative interview data. For example, student achievement data should be triangulated with quantitative data on teacher characteristics and qualitative data on the nature of professional development to further provide insight into potential relationships among variables.

Advisory Board

An integral part of the evaluation plan is an advisory board that represents the concerns of all interested parties, maintains the independence of the study, and helps interpret the results in ways that are meaningful to those groups. The board should help frame the study, establish priorities among research questions, review instruments, interpret results, and disseminate findings. The board should include representatives from:

  • State government;
  • Professional education organizations such as:
    • California School Boards’ Association;
    • Association of California School Administrators;
    • California Federation of Teachers, United Teachers of L.A.; and
    • California Parent Teacher Association;
  • Foundations;
  • Research and academic organizations such as:
    • UCLA Center for Research on Evaluation, Standards, and Student Testing (CRESST);
    • Commission on Future of Teaching and Learning;
    • California State University Institute for Education Reform; and
    • California Center for School Restructuring.


A minimum of four reports should be written as part of the evaluation. The first three are formative. The first is due no later than December 31, 1998, the second no later than December 31, 1999, and the third no later than December 31, 2000. Together they should cover analyses of data for each of the three years of the evaluation during which data are collected. Each of these annual formative reports should be structured around the seven major issues described above and should contain data analyses designed to answer the set of research questions associated with each. In addition, each of the reports should contain a set of recommendations for how California’s CSR program can be improved with respect to cost-efficiency and equity.

The fourth and final report should be summative and report on the overall findings from the evaluation. It, too, should be organized around the seven major issues described above. It should be completed by June 30, 2001.

3 The class size reduction program began in 1996-97, and its greatest expansion will occur during the school years 1996-97, 1997-98, and 1998-99. Most schools in California had reduced-size classes in one or two grade levels in 1996-97. Most will have smaller classes in two or three grade levels in 1997-98, and in three or four grade levels in 1998-99.

4 Data will come from a new state testing program, which will use the same commercial test in all schools throughout the state and is scheduled to begin in the spring of 1998.


