Technical Document


Data:
The data set used in this paper is the High School Longitudinal Study of 2009 (HSLS:09), a national longitudinal survey that follows individual high school freshmen through school and on to further educational pursuits and/or the work force. The sample comprised 944 schools, where administrative and support staff, over 23,000 students and their parents, and one math and one science teacher for each student were questioned. The schools were selected first (randomly), and 9th graders were then randomly selected within those schools. Students were first surveyed in the fall of 2009, the base year, through cognitive math and science tests and questions logging their experiences and aspirations. The survey follows a multilevel design, collecting information by questionnaire from multiple sources such as students, their parents, their teachers, their librarians, and their schools. Two and a half years later, in spring 2012 (11th grade), the same students were re-tested and a new round of information was collected. A short round of data collection focused on transcripts and college planning took place when these students graduated in the spring of 2013; however, those data are not yet available. The fourth data collection is scheduled for 2016 and will gather information on postsecondary education outcomes, earnings and employment, and postgraduate trajectories, among other topics. The study culminates in 2021, when these same students will be interviewed about their transition to the labor force, educational attainment, and future plans (NCES, 2009).

Model:
This project, drawing on the existing education literature, models true underlying high school math achievement as follows. Math achievement for student i, A_i, is a function of the student's school type (choice school or not), C_i; family background, F_i; student motivation, M_i; parent characteristics and support, P_i; and teacher quality, T_i. The hypothesized underlying achievement function can be expressed as:

  A_i = β0 + β1·C_i + β2·F_i + β3·M_i + β4·P_i + β5·T_i + ε_i        (1)

where ε_i is a mean-zero, normally distributed error term.

It is important to note here that this model captures aspects of the students themselves, their schools, and, arguably most importantly, their home lives. In the true model these vectors would contain accurately measured variables. In reality, however, many of the desired variables are unobserved or unmeasured. This leaves two options: omit the unobtainable variables and accept potentially substantial bias, or measure what we can and use proxies for the rest. If our proxies are good, we can still obtain unbiased and consistent estimates. This project fills each of these vectors with variables chosen with that goal in mind.

Dependent variable
Math scores were tested on two occasions: first in the base year of 2009 and again two and a half years later in 2012. The assessment is scored using math item response theory (IRT). IRT rates the difficulty of each item by comparing the likelihood that students answer some items correctly against others. Once the difficulty of all items has been set, each student's ability is estimated even if the actual assessments differ (NCES, 2009). This allows scores to be compared across different test forms. The scores are not integers because the IRT calculation yields the probability of a correct response for each item rather than a simple count of right and wrong answers.
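To make the scoring concrete, here is a hedged sketch using a two-parameter logistic (2PL) model, one common IRT specification; the exact item model and parameters used by HSLS:09 are not reproduced here. Item j has difficulty b_j and discrimination a_j, student i has ability θ_i, and the estimated number-right score sums the predicted probabilities over items, which is why it need not be an integer:

  P_ij = Pr(student i answers item j correctly) = 1 / (1 + exp(-a_j·(θ_i - b_j))),   estimated score_i = Σ_j P_ij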

The dependent variable is mathgain: the sample member's gain (or loss) in the math IRT estimated number-right score between the base year (9th grade) and the first follow-up (11th grade). This measure of test score gain is critical for controlling for innate cognitive ability. Because innate ability is time invariant, subtracting each student's own base-year score from their follow-up score differences out the inherent ability component for every student. This is a crucial step for most education analyses, because omitted variables correlated with the independent variables can cause substantial bias in the estimates. Using test score gains also moves beyond the limitations of static cross-sectional data and takes advantage of the longitudinal data structure.
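As a minimal illustration of the differencing step, the sketch below assumes the two IRT scores have already been loaded into a pandas data frame under hypothetical column names matching the descriptive statistics (mathBY, mathF1); the actual HSLS:09 variable codes differ.

```python
import pandas as pd

# Hypothetical file and column names; the public HSLS:09 file uses its own codes.
df = pd.read_csv("hsls09_students.csv")

# Gain score: 11th-grade (first follow-up) IRT score minus 9th-grade (base-year) score.
# Differencing removes any time-invariant student component, such as innate ability.
df["mathgain"] = df["mathF1"] - df["mathBY"]
```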

Independent Variables
-choice school
The main focus of this project is examining the effect that attending a choice school (charter, magnet, or voucher program) has on student achievement. Whether or not a student attends a choice school is measurable. choice_school is a variable equal to one if the school the student attends participates in a public school choice program, defined in the questionnaire as a magnet school, charter school, or school voucher program. The question is answered by a school administrator, in most cases the principal. This fact is important because many students, parents, and even teachers may not actually know how their school is classified in terms of choice programs.

In addition to this dummy variable, the project includes two interaction terms to dig deeper into the relationship between choice and student achievement. The literature suggests that choice schools might not affect student achievement as a whole, but they have been shown to benefit students of color and students from more disadvantaged families (Gleason et al., 2010). choice_stud_col is the interaction between choice and minority status, testing whether the effect of choice differs when the student identifies as Black, Hispanic, or mixed race. choice_ses tests whether attending a choice school has a differential effect for students from the bottom two socio-economic quartiles. Together these interactions help investigate key findings from the choice literature.
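A hedged sketch of how the two interaction terms could be constructed, continuing the data frame above; treating ses1 and ses2 as the bottom two SES quartile dummies is an assumption about the coding, not something reported in the text.

```python
# Assumed coding: stucol = 1 for Black, Hispanic, or multiracial students;
# ses1/ses2 = dummies for the bottom two SES quartiles (assumption).
df["low_ses"] = ((df["ses1"] == 1) | (df["ses2"] == 1)).astype(int)

# Interaction terms described in the text.
df["choice_stud_col"] = df["choice"] * df["stucol"]
df["choice_ses"] = df["choice"] * df["low_ses"]
```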

-family background
In order to control for family background, this project uses a socio-economic status construct. This composite variable is built from the parents'/guardians' education, occupation, and income. The environment outside of school is very important in modeling achievement. The seminal Coleman report highlights this fact, finding that roughly 80 percent of the variation in achievement was within schools and only 20 percent between schools, suggesting that student outcomes are determined more by differences between students than by the schools they attend (Coleman, 1966). It continues to be the case that SES is one of the strongest predictors of student achievement.

The race of the student is included in this paper as a control. stucol is equal to one if the student identifies as Black/African-American, Hispanic, or more than one race, and equal to zero otherwise.

Student ability is controlled for by using math gain scores as the dependent variable. Because innate ability is time invariant, differencing the 2009 and 2012 test scores removes the effect of each student's innate or inherent ability. For more information, see the dependent variable section.

-student motivation
In order to control for students' intrinsic motivation, which surely affects student achievement, two proxies are used. goodgrades is a dummy equal to one if the student strongly agreed with the statement that good grades matter to them, and zero if they agreed, disagreed, or strongly disagreed. Because the aim is to capture motivation, only students who strongly agreed with the statement are treated as motivated; this preserves more variation and distinguishes between students more effectively.

Second, this project includes a continuous variable, student_expect, that measures student expectations in terms of expected years of schooling. Students were asked how far they expected to go, choosing among 11 options: drop out of high school; graduate high school only; attend a 2-year college or university but not finish; graduate from a 2-year college or university; attend a 4-year college or university but not graduate; graduate from a 4-year college or university; start a master's degree; complete a master's degree; start a PhD, MD, or other professional degree; or complete a PhD, MD, or other professional degree. This variable is included to gauge and control for the effect of student motivation on achievement.
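One way to turn the categorical responses into expected years of schooling is a simple lookup table, sketched below. The response strings, the raw column name, and the year values assigned to each option are illustrative assumptions, not the project's actual coding (the descriptive statistics only show that the constructed variable ranges from 11 to 25).

```python
# Hypothetical mapping from response options to expected years of schooling.
years_of_school = {
    "drop out of high school": 11,
    "graduate high school only": 12,
    "attend a 2-year college, not finish": 13,
    "graduate from a 2-year college": 14,
    "attend a 4-year college, not graduate": 15,
    "graduate from a 4-year college": 16,
    "start a master's degree": 17,
    "complete a master's degree": 18,
    "start a PhD, MD, or professional degree": 20,
    "complete a PhD, MD, or professional degree": 25,
}
df["student_expect"] = df["student_expect_raw"].map(years_of_school)
```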

-parent characteristics
In order to account for selection bias, this project attempts to control for underlying differences between parents that might affect student achievement as well as whether a student attends a choice school. Nine variables from the parent questionnaire are included to do this. First, six dummy variables are coded equal to one if the parent responded yes and zero if no to the following questions: since the beginning of the school year, have you attended a general school meeting? a parent-teacher organization or association meeting? a parent-teacher conference? gone to a museum? served as a volunteer? participated in a school fundraiser?

Next, this project includes a continuous variable, parent_expect, that measures parent expectations for their student in terms of expected years of schooling. If parents have high expectations for their child, we expect them to be more involved and to help their student succeed more than parents with low expectations for their child's future educational attainment. Parents were asked how far they expected the student to go, choosing among the same 11 options described above for student_expect, and each option was assigned a number of years of schooling to create a continuous variable. The literature stresses the influence of labels and expectations on student performance.

The variable langnoteng, equal to one if the language spoken in the student's home is not English and zero otherwise, is included to control for the challenges students face when English is not spoken in the home.

Lastly, the dummy variable help_hw is included to control for parent involvement and help with school work. help_hw equals one if the parent helps their child with homework every week and zero if they never help or help less than weekly. It is worth noting that parents are not always reliable reporters of their own parenting practices; they often answer with an ideal rather than the reality. Taken together, the proxies above do a reasonable job of controlling for the unobserved parent characteristics we need to account for in order to move closer to a valid comparison.

-teacher quality
A variable that controls for teacher quality, or perceived teacher quality, is added to the independent variables. Teacher quality is very difficult, if not impossible, to measure. This paper includes a continuous variable, teach_expect, a composite developed by HSLS:09 that aims to capture a teacher's perception of their peers' expectations of students. The composite is built from questions about teachers' expectations of students' ideas, futures, and performance. On the premise that teachers can evaluate other teachers well, this project uses the composite to help control for teacher quality.

In addition, teach_expr, a continuous variable that measures the number of years a teacher has taught math is included to control for teacher quality. Experience and quality have been shown to be correlated (Murnane & Phillips, 1981; Klecker, 2002).

A dummy variable, teach_degree, is included to further capture teacher quality. A teacher's credentials and qualifications have been shown to be among the biggest factors determining teacher quality (Darling-Hammond, 2000). teach_degree equals one when a teacher has received an advanced degree and zero otherwise. The idea is that completing an advanced degree signals characteristics such as perseverance, extra effort, and hard work.

Finally, several dummy variables are included to account for student perceptions of their teachers. Students often have a good sense of whether they have a "good" or "bad" teacher, and these variables attempt to take advantage of that fact. Each dummy equals one if the student agreed or strongly agreed with the statement and zero if they disagreed or strongly disagreed. Dummies were made for the following questions: does your math teacher value and listen to student ideas? treat students with respect? think all students can be successful? make math interesting? and make math easy to understand? These are crude controls, but this project argues that together they do a reasonable job of accounting for teacher quality.
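A hedged sketch of the agree/strongly-agree recode, assuming the raw items are coded 1 = strongly agree through 4 = strongly disagree; both that coding and the "_raw" column names are assumptions.

```python
# Collapse 4-point agreement items into dummies: 1 = agree or strongly agree, 0 otherwise.
perception_items = ["tvalues", "ttreats", "tsuccess", "tlisten", "tinterest", "tmatheas"]
for item in perception_items:
    df[item] = (df[item + "_raw"] <= 2).astype(int)
```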

Descriptive statistics
Before Multiple Imputation

Variable         Observations   Mean     SD       Minimum    Maximum
mathBY           21444          40.18    11.97    15.8527    69.93
mathF1           20594          67.22    19.21    25.0057    155.1
choice           17754          0.209    0.406    0          1
ses              21444          0.054    0.78     -1.9202    2.88807
ses1             21444          0.16     0.366    0          1
ses2             21444          0.172    0.378    0          1
ses3             21444          0.197    0.398    0          1
ses4             21444          0.212    0.408    0          1
stucol           22409          0.363    0.481    0          1
studentexpect    16813          19.34    4.16     11         25
goodgrades       21062          0.592    0.491    0          1
anymeeting       15525          0.83     0.375    0          1
ptomeeting       15492          0.383    0.486    0          1
ptconf           15480          0.569    0.495    0          1
volenteer        15519          0.305    0.46     0          1
fundraise        15513          0.53     0.499    0          1
museum           15448          0.534    0.498    0          1
hwhelp           15711          0.483    0.499    0          1
pexpect          21658          11.06    13.04    11         25
langnoteng       15985          0.219    0.413    0          1
teachexpect      1524           0.118    0.954    -5.13      1.29
advdegree        17067          0.505    0.499    0          1
experience       17020          10.14    8.48     1          31
tvalues          18973          0.855    0.3351   0          1
ttreats          18964          0.914    0.28     0          1
tsuccess         18905          0.922    0.267    0          1
tlisten          18933          0.883    0.32     0          1
tinterest        19936          0.629    0.482    0          1
tmatheas         18939          0.74     0.438    0          1

After Multiple Imputation

Variable         Mean     Standard Error
mathBY           39.96    0.08
mathF1           66.42    0.13
choice           0.209    0.003
ses              0.041    0.005
ses1             0.159    0.002
ses2             0.174    0.002
ses3             0.202    0.002
ses4             0.215    0.002
stucol           0.364    0.002
studentexpect    19.11    0.003
goodgrades       0.587    0.033
anymeeting       0.7795   0.003
ptomeeting       0.322    0.01
ptconf           0.552    0.007
volenteer        0.254    0.011
fundraise        0.483    0.009
museum           0.462    0.006
hwhelp           0.501    0.009
pexpect          11.09    0.086
langnoteng       0.158    0.008
teachexpect      0.111    0.007
advdegree        0.504    0.003
experience       10.09    0.064
tvalues          0.851    0.002
ttreats          0.911    0.001
tsuccess         0.92     0.001
tlisten          0.881    0.002
tinterest        0.628    0.004
tmatheas         0.736    0.003

Missing data
A big issue with survey data is missing values. Sample members would sometimes leave answers blank or respond that they did not know. This would be acceptable if the values were missing at random; most often, however, there are systematic reasons why some people leave answers blank, which introduces bias into the model. One way to deal with this issue is imputation of the missing values, which uses other characteristics of the respondent to estimate them. The problem with single imputation is that it tames the data, shrinking outliers and reinforcing the means. In this project, multiple imputation is used to deal with missing data. Multiple imputation fills in the missing values but, because it creates several plausible values for each one, it reintroduces randomness and avoids the over-precision caused by standard single imputation. For all variables with missing values, the project uses Stata's implementation of the Markov chain Monte Carlo (MCMC) multiple imputation algorithm, which generates five plausible values for each variable based on the non-missing values of every other variable. The random seed used was 12061992. The analyses were replicated for each of the five imputed data sets, and the final coefficients and standard errors were combined using Rubin's rules. Multiple imputation is preferred here because it mitigates non-random missingness rather than simply dropping incomplete cases.
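The imputation itself was run in Stata (MCMC, five imputations, seed 12061992, pooled with Rubin's rules). As a rough analogue only, the sketch below uses the MICE implementation in Python's statsmodels, which imputes by chained equations rather than the MCMC algorithm used in the project; it fits the analysis model on each imputed data set and pools the estimates with Rubin's rules. The variable names are the hypothetical ones used in the earlier sketches.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.imputation.mice import MICE, MICEData

np.random.seed(12061992)  # same seed as the Stata run, in spirit only

# df holds the analysis variables, with missing entries coded as NaN.
imp = MICEData(df[["mathgain", "choice", "ses", "stucol", "studentexpect"]])

# Fit the analysis model on each of five imputed data sets and pool the
# coefficients and standard errors with Rubin's rules.
mice = MICE("mathgain ~ choice + ses + stucol + studentexpect", sm.OLS, imp)
results = mice.fit(n_burnin=10, n_imputations=5)
print(results.summary())
```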

Attrition
If we are using student test score gains, which are measured in 11th grade, should we be concerned that some students in the sample attended choice schools at the beginning of high school but then transferred, or vice versa? If so, we would expect bias in the results, because it would either under- or overestimate the true effect. Fortunately, HSLS:09 collects data on which students transfer. Looking at the conditional descriptive statistics, of the 3,290 students who were enrolled in choice programs, none had moved to a different school by 11th grade. It is worth noting that 334 sample members did not respond and that the question was no longer applicable for 90 respondents. We will assume this is not an issue.
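The transfer check described above amounts to a simple cross-tabulation; the sketch below assumes a transfer indicator (hypothetically named transferred_by_f1) is available alongside the choice dummy.

```python
import pandas as pd

# Rows: enrolled in a choice program in 9th grade; columns: transferred by 11th grade.
# Keep non-response / not-applicable codes visible rather than silently dropping them.
print(pd.crosstab(df["choice"], df["transferred_by_f1"], dropna=False))
```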

Econometric issues
To represent many of the variables in the underlying model, we used plug-in proxies because the true variables are unobservable. When a variable is unobserved, it is hard to know how strong the proxy really is. If a proxy is weak, it may not even belong in the model, which is one reason we may see little significance on some variables. This could be particularly true for some of the variables representing student motivation, parent characteristics, and teacher quality: the included variables are indicators of the underlying process, not the process itself.

Measurement error is another way to look at the proxy strength problem. Suppose some variable is measured with a certain amount of error. Unlike with many proxies, we care about the specific estimates of these variables, so we can treat this as measurement error. Under the classical errors-in-variables (CEV) assumption, where the measurement error is uncorrelated with the true variable, the estimates are biased toward zero (attenuated). This is an unfortunate reality of education research: many variables are unobserved, and good, strong data are rare. Measurement error of this kind is ultimately a data problem.
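For reference, the standard CEV attenuation result in the one-regressor case, where the observed measure is x_i = x*_i + e_i and Cov(x*_i, e_i) = 0, is:

  plim β̂1 = β1 · σ²(x*) / (σ²(x*) + σ²(e))

so the OLS estimate is shrunk toward zero by the reliability ratio; the larger the error variance, the stronger the attenuation.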

Another data problem, cut from the same cloth, is that the data at hand were not designed or collected with the specific questions of this paper in mind. For that reason the data are fit to the model rather than collected for it, which forces many compromises in the analysis.

Perhaps most influential on our statistical significance and explanatory leverage is the general complexity of student outcomes. Educational achievement is a highly stochastic process with a great deal of natural randomness. These are a few more reasons why education data are hard to work with.

Another issue could be clustering, which is common in education research. Students are nested within classrooms, schools, and communities. It is possible, even probable, that there are unobserved factors shared by students who, for example, sit in the same classroom with no heat, no computers, and a power plant directly outside. We cannot assume all observations are independent, because they share conditions at the school and classroom level. The school codes were suppressed in the public-use data file, so we were not able to fit an HLM or even cluster standard errors at the school level to account for the nested structure of the data.
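For completeness, this is how school-level cluster-robust standard errors would be specified if school identifiers were available; school_id is hypothetical, so this sketch could not actually be run on the public-use file.

```python
import statsmodels.formula.api as smf

# Keep rows that are complete for both the model variables and the cluster variable.
cols = ["mathgain", "choice", "choice_ses", "choice_stud_col", "ses", "stucol", "school_id"]
d = df[cols].dropna()

# Hypothetical: OLS with standard errors clustered at the school level.
model = smf.ols("mathgain ~ choice + choice_ses + choice_stud_col + ses + stucol", data=d)
fit = model.fit(cov_type="cluster", cov_kwds={"groups": d["school_id"]})
print(fit.summary())
```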

 Results Table
Students IRT Math Score Gains Estimated by Ordinary Least Squares

                  Model 1      Model 2      Model 3      Model 4      Model 5
choice            0.219        0.378        0.362        0.412        0.447
                  [.233]       [.297]       [.298]       [.299]       [.297]
choice_SES                     1.045*       1.068**      0.78         0.792
                               [.425]       [.422]       [.418]       [.417]
choice_race                    -0.894       -0.973*      -0.842       -0.843
                               [.479]       [.458]       [.452]       [.456]
ses                            4.109***     3.439***     3.197***     3.089***
                               [.125]       [.137]       [.160]       [.159]
stucol                         -1.01***     -1.183***    -1.32***     -1.299***
                               [.211]       [.204]       [.207]       [.207]
studentexpect                               0.328***     0.261***     0.253***
                                            [.037]       [.038]       [.038]
goodgrades                                  1.303***     1.051***     0.963***
                                            [.187]       [.192]       [.192]
anymeeting                                               0.489        0.494
                                                         [.308]       [.316]
ptomeeting                                               -0.509*      -0.517*
                                                         [.233]       [.232]
ptconf                                                   -0.233       -0.254
                                                         [.188]       [.195]
volenteer                                                0.755*       0.693
                                                         [.328]       [.329]
fundraise                                                0.470*       0.438
                                                         [.227]       [.228]
museum                                                   0.707**      0.693***
                                                         [.209]       [.211]
hwhelp                                                   -2.136***    -2.05***
                                                         [.228]       [.222]
pexpect                                                  0.044***     0.042***
                                                         [.007]       [.007]
langnoteng                                               2.258***     2.207***
                                                         [.259]       [.259]
teachexpect                                                           0.260*
                                                                      [.121]
advdegree                                                             0.401*
                                                                      [.181]
experience                                                            0.054***
                                                                      [.013]
tvalues                                                               0.980*
                                                                      [.422]
ttreats                                                               -0.039
                                                                      [.377]
tsuccess                                                              -0.379
                                                                      [.478]
tlisten                                                               0.647
                                                                      [.349]
tinterest                                                             -0.329
                                                                      [.223]
tmatheasy                                                             0.803***
                                                                      [.244]
constant          26.417       26.586***    19.647***    20.499***    18.55
                  [.108]       [.140]       [.716]       [.846]       [.844]
N                 23415        23415        23415        23415        23415

Legend: * p<0.05; ** p<0.01; *** p<0.001. Note: standard errors are clustered at the school level and displayed in brackets.

Works Cited
Darling-Hammond, Linda. "Teacher quality and student achievement." Education Policy Analysis Archives 8 (2000): 1.

Coleman, James S. Equality of Educational Opportunity. [Washington]: U.S. Dept. of Health, Education, and Welfare, Office of Education; [for Sale by the Superintendent of Documents, U.S. Govt. Print. Off., 1966.

Klecker, Beverly M. “The Relationship between Teachers’ Years-of-Teaching Experience and Students’ Mathematics Achievement.” (2002).

Murnane, Richard J., and Barbara R. Phillips. “What do effective teachers of inner-city children have in common?.” Social Science Research 10.1 (1981): 83-100.

United States. Department of Education. National Center for Education Statistics. EDAT. Web. <http://nces.ed.gov/surveys/hsls09/>.
