Wednesday, December 5, 2007

Step #3 - Simple Data Manipulation

OK, so you have all your data gathered from all sources, and now you need to define additional variables, or drop some data that is irrelevant for your sample. To do so, you will first need to think about the statistical analyses you will want to do later on. If your Econometric model is quadratic, for instance, or in other words, if you measure the effect x on y with both x and x-squared, you need to create a variable with x-squared. Also, it is very common to see log-level and log-log models, in which the natural log of some variables are run instead of the variables themselves. From the project's point of view, the data handling from now on will take place in parallel with the statistical analysis.

Variables which are simple functions of other variables
To create a variable, you should use the gen command. Its basic syntax is very simple:

gen <new variable> = <expression>

For example:
gen income_squared = income ^ 2
gen ln_income = log(income)
gen gdp_per_capita = gdp / population

Stata will generate a new variable, and for each of the observations, it will evaluate the expression specified according to the values in the variables we use in the expression. In the first example, for an observation with an income of 200, income_squared will contain 40,000, but for an observation with an income of 100, income_squared will contain 10,000.

You may use an expression that is independent of any variable. This way, the variable will be constant over all observations.


Variables which are determined by a condition
Now, suppose you want to create a variable like this:

(1)
educ_parent = father's education if the student is male and mother's education if the student is female.

(2)
education_level = 0 if education equals 0
education_level = 1 if education is higher than 0 and lower-or-equal than 6
education_level = 2 if education is higher than 6 and lower-or-equal than 9
education_level = 3 if education is higher than 9 and lower-or-equal than 12
education_level = 4 if education is higher than 12


In order to do this, you will need to use both the gen and the replace command. Let's first look at the replace command.

replace <existing variable> = <expression>

The only difference between replace and gen is that gen creates a new variable and puts values in it, while replace takes an already existing variable and puts values in it. Actually, if you will type help gen and help replace, you'll get to the same manual page.

Let's look at the first example again. We need to first generate a variable named educ_parent that will be equal to the father's education, for boys:

gen parent_educ = father_educ if boy == 1

Notice the condition we added. In many Stata commands, conditions can be specified so that the command will be run only on observations for which the condition holds. Notice the double equality sign in the condition as opposed to the single equality sign in the beginning of the statement. Put simply, the single sign determines a value while the double sign checks whether the values are equal. In other words, it's like the difference between saying "set x to be equal to y" (x=y) and saying "are x and y equal?" (x==y).
In the gen case, observations for which the condition does not hold, missing values will be assigned. In the replace case, these observations will simply stay with their previous values.
Therefore, going back to our example, all that is left now is to run the following command:

replace parent_educ = mother_educ if boy != 1

More on Conditions
Before we move on to the next example, here are two comments about conditions:


  1. You can combine several comments with logical operators (and, or, not). :
    • if x==y x==z // This means "if x equals y OR x equals z"
    • if boy==1 & immigrant == 1 // This means "if boy equals 1 AND immigrant equals 1"
    • if !(income>100000) // This means "if income is not greater than 100,000
    • if income<=100000 // This is equivalent to the previous line.
    • if income<=100000 & (profession == "Banker" profession == "Doctor") // This selects observations of people who are either bankers or doctors with income greater than 100,000. Note the parenthesis around the OR combination. If they weren't there, the condition would selected bankers with income greater than 100,000 and all doctors (regardless of income).


  2. Very Very Important - Missing values are a dangerous pitfall! Look at the following examples:
    • if income > 100000 // In Stata, observations with missing values will also fulfill the condition
    • if boy != 1 // If you meant girls, they will fulfill the condition, but also observations in which boy has a missing value

    In order to not to capture the observations with the missing values with the ones you are really looking for, try the following conditions respectively:

    • if income > 100000 & income != .
    • if boy != 1 & boy != .
    • if boy == 0 // This will do the same job, but it's shorter and more readable.




Back to the second example above. To create the variable education level, we will use a gen command followed by a sequence of replace commands. I will write two possible implementions that creates exactly the same variable:

// One possibility
gen education_level = 0 if education == 0
replace education_level = 1 if education > 0 & education <= 6
replace education_level = 2 if education > 6 & education <= 9
replace education_level = 3 if education > 9 & education <= 12
replace education_level = 4 if education > 12 & education != .


// Another possibility:
gen education_level = 0
replace education_level = 1 if education > 0
replace education_level = 2 if education > 6
replace education_level = 3 if education > 9
replace education_level = 4 if education > 12
replace education_level = . if education == .

TIP:

After defining a new variable, especially if it is not a simple one, it is always a good thing to tabulate it against the variable you are constructing it on:





Creating Dummy Variables with a Condition
Dummy variables, or indicator variables, are variables which have either the value of 0 or 1 (if they are not missing). For example, the variable is_working can have either 0 if not working or 1 if working. The creation of a dummy variable is made with a condition (but without if preceding it):

gen <dummy variable> = <condition>

Examples:

gen is_working = income_from_labor > 0
gen high_demand = demand > 600

Problem - Missing Values
Take a look at the second example. What if some observations have missing values in demand. In the example given, they will get 1 in high_demand. So you will probably say: "Hey, you already talked about it. I'll run it like this:"

gen high_demand = demand > 600 & demand != .

But this will make the value of high_demand to be 0 for observations with missing demand. Is this something you want to do? Well, sometimes. But in other times, you don't want to classify missing demand as "not high", but to keep high_demand missing too. In order to do that, you need to run the following command instead of the one above (note the difference):


gen high_demand = demand > 600 if demand != .


Now, why is that? Good question.
A condition - whether defining the dummy variable or specified after the if word - is actually a function that returns either 1 - if the condition is met - or 0 - if the condition is unmet. For each observations the value might be different. Once you specify a condition after an if word, Stata understands that the command is executed only on observations which meet the condition (for which the condition returns 1). But when you specify it after the = (remember the meaning of a single equality sign?), it tells Stata to put the value that the condition returns, 1 or 0, in the variable generated.

Getting Rid of Excess Data
You might want to drop some observations or variables that you don't need. That may improve running time, so you should consider whether you really need all the variables you have. You also want to reduce the complexity of the dataset and keep it as thin as possible.
Two opposite commands will do the job: drop and keep.

For both commands, you need to think first if you want to drop variables or observations. If you want to drop variables, the syntax is:

drop <variable(s) name(s)>
keep <variable(s) name(s)>

Examples:
drop gdp_per_capita_ppp gdp_per_capita_current // This will drop the variables specified and keep the rest
keep id school_id treatment average_score // This will keep the variables specified and drop the rest

If you want to keep or drop observations, the syntax is a bit different:

drop if <condition>
keep if <condition>

Examples
drop if sales_mil < 100 // This will drop all observations (companies for example) with sales less than $100M, and keep the rest of observations.
keep if problematic_education == 0 // This will keep observations with problematic_education == 0 and drop the rest (the problematic ones, as defined previously in the program)

Notice that the if here acts the same way as in the gen and replace commands: it tells Stata which observations are relevant to the command.

(Go on to Step #4)




No comments:

Post a Comment