Sunday, December 28, 2008

Step #6: Automation - Separating the Men from the Boys

Before we start, a clarification: when I talk about automation I mean all the ways to write a Stata program that runs commands, and saves their output, in a more-or-less structured way, without having to write every command out separately. In the Stata manual you will sometimes see the term Automation reserved for OLE automation, which makes Stata available to Microsoft Office programs. I will not deal with that kind of automation at all.

Intro: Why?
OK, so what is automation good for? We had a preview in the previous step: One thing we can do is calculate things based on the output of commands we ran, automatically and without looking at the output first: divide the mean by the standard deviation, add 1.96 standard errors to the mean, and so on. Of course, one way (the boys' way) is to run the command, and then calculate it with the di command, or in a spreadsheet or calculator. But the men don't do things manually. They tell their programs what to do.
Another thing we can do is avoid repeating similar commands. Instead of having 50 rows of the same reg command with different regressors or different samples, we can write a loop that does it in 5 rows. This is good not only for saving rain forests when you print out your code; it also gives your regressions a structure. When you need to change something in the command (report clustered standard errors, for example), you'll need to do it just once and not 50 times (or 48 times, accidentally forgetting to change 2 of the regressions).
Finally, what we can also do is to construct tables of the results we want to report. But we'll deal with this possibility in the following step.
When is it better to leave automation out? Probably when you need just a few regressions and you're not doing anything too time-consuming with the output. Think of automation as an investment you make in your program. It entails a fixed cost of thinking about the structure and implementing it, but above a certain threshold the benefits of having the program do most of the work for you outweigh that cost. What I used to do in many cases when I was starting to automate do-files was to first write the program simply, and then, when I saw that I was starting to repeat almost the same code (copying and pasting like there's no tomorrow), I started thinking about how to automate things.

Macros
A macro is a word (a name) that stands for a string (a set of characters). Whenever we write the macro in a Stata command, Stata first replaces it with the string that was set for it, and only after the replacement does it run the command.
Enough with the definitions, let's see a simple example:
To define a macro x that contains the value 6, run the following line (without the . in the beginning):
. local x = 6
Now Stata has assigned a place in memory called x and put 6 in it. Whenever we want to tell Stata to use this x we saved, we wrap its name in a backquote (usually the key in the top-left corner of the main keyboard, to the left of "1") and a single quote:
. di `x' + 5
11
What really happens, behind the scenes, is that Stata first sees the ` followed by the ', and looks between them for the name of a local we previously defined. Then it replaces this reference by the value that was saved:
First (as typed): di `x' + 5
Then (replacing `x' by 6): di 6 + 5
And only then when no `' are left in the command, Stata will run it and return 11.
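
The calculation mentioned in the intro (adding 1.96 standard errors to the mean) can be sketched with the same mechanism. This is only an illustration under my own assumptions: it uses the auto dataset that ships with Stata and its variable price, the macro names se and upper are made up for the example, and it relies on the r() results that summarize leaves behind.

```stata
sysuse auto, clear
sum price
* r(mean), r(sd) and r(N) are results left behind by summarize
local se = r(sd) / sqrt(r(N))
local upper = r(mean) + 1.96 * `se'
di "Mean plus 1.96 standard errors: `upper'"
```

Nothing here was typed in by hand after looking at the output; if the data change, the numbers follow.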

Note: if nothing was defined for y, then Stata will replace the `y' by nothing:
First (as typed): di `y' + 5
Then (replacing `y' by an empty string): di + 5
And then you will get an error because the di command can't handle "+ 5" as input. Note that you will not always get an error. If Stata has no problem running the command after replacing the macro by an empty string, then it will run. Many annoying bugs in your programs will stem from this problem.
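
Here is a small sketch of how such a silent bug can look. It assumes the auto dataset that ships with Stata; the macro name controls is just an example.

```stata
sysuse auto, clear
local controls "weight length"
* Typo below: the macro name is missing its final s, so it was never defined
reg price mpg `control'
* Stata ran this without complaint as: reg price mpg
di "Number of regressors actually used: " e(df_m)
```

The regression runs, but the controls you thought you included are not there.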


We can also put strings inside macros:
. local x = "Hi there, how are you?"
or
. local dependent_variable = "wage"

Why do we need the double quotes? Because otherwise Stata will think that what we put after the = sign is part of our command and not simply a value. Specifically, when we put words (instead of numeric values) after the = sign, Stata thinks we're referring to a variable in the dataset (if there is, indeed, a variable by that name, it will put the value of the first observation of this variable in the macro). Thus, the double quotes tell Stata that you don't want the value of what's inside wage, but rather want simply the name "wage" to be kept inside the macro dependent_variable.

Now, let's see how we can refer to these macros. Almost the same as when it was numeric:

. di "`x'"
Hi there, how are you?

. sum `dependent_variable'

Why did we use the double quotes for the first command but not for the second?
Remember what the macro does, it replaces the `macro' with what we've saved in it. So for the first line this would be:
First (as typed): di "`x'"
Then (replacing for the macro): di "Hi there, how are you?"
So if we had not put the double quotes, it would have been equivalent to running:
. di Hi there, how are you?
But Stata will then look for a variable named Hi and won't find it. We didn't intend Stata to look up a variable, but to simply display a string as it is.

However, when we ran the second line with the summarize (sum) command, we indeed wanted the command to treat wage not as just a word, but as a reference to the variable wage! In other words, we wanted to run:
. sum wage
(and not sum "wage")
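
To see both uses side by side, here is a sketch you can run as is. It assumes the auto dataset that ships with Stata, so the macro holds price instead of wage.

```stata
sysuse auto, clear
local dependent_variable "price"
* With double quotes: the content of the macro is displayed as plain text
di "The dependent variable is `dependent_variable'"
* Without double quotes: the content is treated as a variable name
sum `dependent_variable'
```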

A few final remarks before we move on to loops (if you're still wondering why we need all this, hang on).
  • local and global - you might have wondered why the command to define a macro is called local. In step #8, which hopefully will be written some day, we will deal with writing commands in Stata, and then it will matter. local means that the macro is defined within the program it was set in, and global means that all commands and programs can refer to the macro.
In any case, to define global, you will use the word global instead of local:
. global x = 6
But to refer to the global later, we do something a little different. We use ${macroname}
. di ${x}

  • Long strings - at least in Stata 9, there is a weird issue with defining macros for strings that are longer than 255 characters. You might wonder why you would ever need more than 255 characters, but it happens sometimes, and unless somebody told you, this can be one of the most annoying bugs (Stata might simply cut your string after 255 characters...). To avoid that, define string macros without the = operator:
. local mylongstring "This is my very long string, and since it is longer than 255 characters, I omitted the = in its definition. Looks strange, but this is how it works. Good luck!"
  • Predefined macros - Stata has some macros within it that you might find helpful. They're not exactly macros, when I come to think of them, but we treat them as such (but without the `'). For example, _N holds the number of observations in the dataset (try to run di _N), _pi holds the number pi. For a few more, you can look up help _variables
  • Nested macro reference - You can refer to a macro within another macro reference. What does that mean? Say you have one macro named x_a and another macro named x_b, you can define a macro named i and do the following:
local x_a = 800
local x_b = 43.2
local i = "b"
di `x_`i''
Note that it is not double-quote at the end of the di command but two single-quotes. What happens when Stata hits the di command is the following:
First (as typed): di `x_`i''
Then (replacing the innermost macro): di `x_b'
Then (replacing the next innermost macro): di 43.2
And then Stata will execute the command and show the number 43.2

This can be handy when you have several macros and you want to alternate referring to them (i decides which of the x's to use), which sometimes you need to do inside loops.
  • Extended functions - Ever wondered how to save a variable's label? Ever wanted to count how many words there are in a string? Maybe you didn't, but sometimes there are things you need to save to a macro and you have no idea how to do that. In some of these cases, you might find your answer in the extended functions. They work a bit differently (but just like egen, you get many different features with variations on the same command):
local <macro_name> : <extended_function>
Note that we use : instead of = which tells Stata we're not using the regular functions but the extended functions. For example, say you want to keep the label of the variable w2gef (usually questionnaire data will have cryptic variable names but hopefully informative labels) inside the macro w2gef_label:
local w2gef_label : variable label w2gef
Another example:
local x "This is my string. How many words are in it?"
local num_words : word count `x'
local sixth_word : word 6 of `x'
di "There are `num_words' words in x. The sixth of them is `sixth_word'"
The output will then be:
There are 10 words in x. The sixth of them is many
More on that in help extended_fcn
  • Saving and reusing - I once thought this part was obvious, but teaching Stata has taught me otherwise. So to make things clear... When you save with the "save" command or with the icon on the top left and so on, it just saves your data. It will not save the macros. Macros are part of programs. If you use the interpreter interface of Stata (the command line below the output window), then when you close Stata, your macros will disappear. If you want to reuse them the next time you open Stata, you have got to work with .do files. Stata comes with a do-file editor, but you can write do-files in any text editor. Make sure from now on you work with .do files. Of course, experimenting with commands in the interpreter is always worth doing, but in the end keep the commands you liked in a do-file.
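
A few of the remarks above can be tried together in one short do-file. This is just a sketch under my own assumptions: it uses the auto dataset that ships with Stata, and the names mpg_label and stub are made up for the example.

```stata
sysuse auto, clear
* Extended function: copy the label of the variable mpg into a local
local mpg_label : variable label mpg
di "mpg is labelled: `mpg_label'"
* A global, referred to with a dollar sign; the braces mark where its name ends
global stub "year"
di "${stub}1998"
* _N is used as is, without any quotes around it
di "The dataset has " _N " observations"
```

The braces matter in the second di: without them Stata would look for a global named stub1998, find nothing, and display an empty line.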

Loops (and Conditions)
The power of automation comes mainly from the ability to create a loop and repeat commands with it. The next subsections will show some examples.

* Before we start, the following examples sometimes have lines extend beyond the boundaries of the blog, so it cuts them to two. If you're not sure where each line ends and another one starts, copy the example to a text editor.


while


The simplest loop is the while. The syntax goes like this:
while <exp> {
    ...
}
where <exp> is a condition. If you remember, when we talked about creating dummy variables we said that a condition in Stata is an expression that is equal to 1 if the condition holds and 0 otherwise. The while command tells Stata to keep running the commands between the {} until <exp> is equal to 0 (that is, as long as the condition is satisfied).
When we dealt with dummy variables we usually constructed the condition on one of the other variables (and Stata checked the condition on the values of the variables for each observation: educ>12 for example). But when you deal with loops and other matters of flow control (that is, how your program runs contingent upon the situations it faces), the conditions will mainly deal with macros instead of variables.*

* There is no technical problem with referring to variables in the condition. The thing is that as opposed to conditions when creating variables - in which Stata goes through all the observations - here referring to a variable will give its value in the first observation only, because nothing tells Stata to go through all observations. If you want to refer specifically to the value of the variable in an observation other than the first, just refer to varname[observation_number]. You can experiment with the command di.

Anyway, here's an example:
local i = 1
while `i' <= 4 {
    di "counting `i'"
    di "good."
    local i = `i' + 1
}
This will output:
counting 1
good.
counting 2
good.
counting 3
good.
counting 4
good.
Note that the last row within the while loop increments the macro i. Each time, before the iteration is over, we increase i by 1. If we didn't do so, i would have stayed 1 and the condition would always be satisfied. To get out of the loop we need to make sure that after a finite number of iterations the condition is no longer satisfied.

But for examples like the one I gave, there is a better loop which is less cumbersome (the foreach/forvalues loop). While is good for situations in which you don't know in advance how many iterations you want.
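
As a sketch of such a situation, here is a made-up example in which we keep doubling a number until it passes 1000, without working out in advance how many rounds that takes. The macro names value and rounds are mine.

```stata
local value = 1
local rounds = 0
while `value' <= 1000 {
    * Double the value and count the round
    local value = `value' * 2
    local rounds = `rounds' + 1
}
di "Passed 1000 after `rounds' doublings (value is `value')"
```

This displays: Passed 1000 after 10 doublings (value is 1024)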

forvalues
When you know how many iterations you want, using a for loop is much better. The simplest for loop is the forvalues loop. Let's start with an example which will do the same thing as the example for the while loop:
forvalues i=1/4 {
    di "counting `i'"
    di "good."
}
Let's try to find the differences between the examples:
First and foremost, the condition from the while loop has changed to i=1/4. Second, the initialization of the macro i before the while loop and the incrementation before the end of the iteration are both gone. This is done by the simple i=1/4 which we wrote for the forvalues. It means that we are creating a loop that will start with i=1, then increase i by 1 until it reaches 4 (including 4). We can refer to i inside the loop or we can ignore it. The loop will run 4 times with each time having the next number for i.

More generally, our forvalues loop looks like this:
forvalues <loop_macro> = <range> {
    ...
}

In our example <loop_macro> was i and <range> was 1/4. Note that when we're in the forvalues context, 1/4 doesn't mean a quarter, but rather "from 1 until 4 in steps of 1". The range can be different both in terms of boundaries and in terms of steps. We can do this:
forvalues proportion = 0(0.05)1 { ...
Which will start with proportion=0, then the next iteration will have proportion=0.05, then the next one 0.1, and so on until proportion=1.

More on the possibilities of range in help forvalues.
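
Here is a short sketch of a range with a step other than 1. The macro names k and total are made up for the example; the loop goes over 2, 4, 6, 8 and 10 and adds them up.

```stata
local total = 0
forvalues k = 2(2)10 {
    di "k is now `k'"
    local total = `total' + `k'
}
di "The sum is `total'"
```

The last line displays: The sum is 30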

foreach
The foreach command is pretty versatile. In my experience, two of its versions are very common. The first and simplest one is this:
foreach <loop_macro> in <list> {
  ...
}
where <list> is simply a list of words (can also be numbers if you want) separated by white space. Let's see some examples:
foreach regressor in educ_mom educ_dad "educ_mom educ_dad" {
    reg wage educ `regressor'
}
The loop will run the following three regressions:
reg wage educ educ_mom
reg wage educ educ_dad
reg wage educ educ_mom educ_dad
Note that the double-quotes in the last expression are there to tell Stata we want it to treat the expression as one word (one iteration in which the whole string inside the double quotes is the value that is assigned to the macro regressor). In other words, use them when you don't want Stata to treat the space as a separator.
Another example:
foreach male_value in 0 1 {
    reg unemployed wage educ shock if male == `male_value'
}
This will run twice:
reg unemployed wage educ shock if male == 0
reg unemployed wage educ shock if male == 1
What if you want an additional regression for both males and females? Because macros are simply text substitutions before commands are run, there are quite a few possibilities to implement this. I would try to do the one which makes the code easiest to read. One possibility is doing it this way:
foreach male_cond in "male == 0" "male == 1" 1 {
    reg unemployed wage educ shock if `male_cond'
}
This will run the following three regressions:
reg unemployed wage educ shock if male == 0
reg unemployed wage educ shock if male == 1
reg unemployed wage educ shock if 1
The last 1 says that the condition is always satisfied. Thus, all observations (including, for example, those with a missing value in the variable male) will be in the last regression.

Now, besides lists of strings and numbers, we can tell foreach to iterate over variables only. This is good for two reasons: (1) You can refer to a group of many variables with just one word, and (2) if we're really interested in iterating over names of variables, we can get something which is usually absent in Stata - an error message if there is no such variable (error messages are definitely underrated - it is true you don't want any of them, but if you misspelled a variable's name, you probably want Stata to tell you).
How do we do it?
foreach <loop_macro> of varlist <varlist> {
    ...
}
For example (suppose the following variables exist in the loaded dataset: educ educ_dad educ_mom year1998 year1999 year2000 year2001 year2002):
foreach var_to_sum of varlist educ* year1998-year2002 {
    sum `var_to_sum'
}
The educ* will make the loop go through all variables whose names start with educ. Then, year1998-year2002 will make the loop go through all the variables from year1998 to year2002, in the order they are stored in the dataset.
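
If you don't have such a dataset at hand, the same idea can be sketched with the auto dataset that ships with Stata. The macro names v, lbl and n_vars are made up; the loop displays each variable's label and counts how many variables the varlist expanded to.

```stata
sysuse auto, clear
local n_vars = 0
* price-rep78 is a range in dataset order; t* matches names starting with t
foreach v of varlist price-rep78 t* {
    local lbl : variable label `v'
    di "`v': `lbl'"
    local n_vars = `n_vars' + 1
}
di "The loop went through `n_vars' variables"
```

Try misspelling one of the names (pricee, say): Stata stops with an error before the loop even starts, which is exactly what we wanted.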

As always, further details are to be found in help foreach

if
You are already familiar with the if condition most commands support. This if is meant to limit the execution of the command only to observations for which the condition is satisfied. As we said when we talked about the while loop, sometimes we would like conditions to control how our program flows. Those conditions are a bit different.

Let's do an example. Suppose you want to run the loop above which iterates over different samples: male, female and all. But when you run both males and females in the regression you want to add the male dummy as a regressor (this is sometimes called adding a main effect), or an interaction between the male dummy and a treatment variable. You only need to add those regressors to the "all-sample" iteration (actually you can put the regressors in the male-only and female-only regressions too and Stata will just drop those variables as they are multicollinear with the constant, but let's ignore this for the sake of the example). You can do something like
foreach male_cond in "male == 0" "male == 1" 1 {
    if "`male_cond'" == "1" {
       local add2reg "male maleXshock"
    }
    else {
       local add2reg ""
    }
    reg unemployed wage educ shock `add2reg' if `male_cond'
}
Note that I put double-quotes on both sides of the condition because if I hadn't, the first and second iterations would make the if command look like this:
if male == 0 == 1

Stata would then evaluate this as an expression involving the variable male (using its value in the first observation), which is not what you wanted. You wanted simply to compare the text of the condition to 1 (to catch the last iteration).
This example brings me to another point. Note that we wrote 9 rows of code for a loop that replaces 3 rows of simple regression commands. In many cases, simply writing the original regressions will do the job. In others you might be working in a greater framework, or you want to later add additional subsamples which will make it better to write the loop instead of the regressions themselves. Do your own calculation of whether complicating things with a loop (and inner conditions) is better than simply repeating your commands, however stupid it feels.

For further help (this time I'm going to surprise you), look up help ifcmd.
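
To see the two kinds of if next to each other, here is a sketch that assumes the auto dataset that ships with Stata. The macro name sample is made up. The first if decides which command runs; the if at the end of the sum command limits the observations.

```stata
sysuse auto, clear
local sample "foreign"
if "`sample'" == "foreign" {
    * The if at the end of this command restricts observations
    sum price if foreign == 1
}
else {
    sum price
}
di "Observations used: " r(N)
```

Change the first local to anything else and the else branch runs on the whole dataset.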

Additional issues for loops and conditions:
  • Nested loops - you can write a loop inside a loop. This will make the inner loop run anew for each iteration of the outer loop. This is where the whole thing really starts to pay off, because you can run many regressions and make it pretty readable, enabling easier changes in the specification when you need it. Here's an example
local control_vars "educ_dad educ_mom hh_income grade_5 grade_6"

foreach dep_var of varlist score pass_dummy admitted {
    foreach treatment of varlist hours_tutored tutored_dummy {
       foreach sample in "male == 0" "female == 1" "male == 0 & educ_dad < 12" {
          reg `dep_var' `treatment' `control_vars' if `sample'
       }
    }
}
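  • continue - if you want to exit a loop before it ends naturally (i.e., murder a loop?), you can use the continue, break command. Plain continue, without the break option, only skips the rest of the current iteration and moves on to the next one. Usually these will appear inside an inner if condition. This is very uncommon, though, and makes the code less readable.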
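
A small sketch of both forms of the continue command, with made-up macro names: plain continue skips the rest of the current iteration, and continue with the break option leaves the loop altogether.

```stata
local last = 0
forvalues i = 1/10 {
    if mod(`i', 2) == 1 {
        * Odd number: skip the rest of this iteration
        continue
    }
    if `i' > 6 {
        * Leave the loop entirely
        continue, break
    }
    di "even number `i'"
    local last = `i'
}
di "The last number displayed was `last'"
```

This displays the even numbers 2, 4 and 6 and then stops, even though the range goes up to 10.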


Summary
So we learned how to define macros and give our regressions a structure with loops (and nested loops). I hope by now you understand how this can contribute to your project. I think the last example - for the nested loop remark - makes it very clear. As we will see in the following steps, loops and macros can help us automate not only the statistical commands, but also how we save the output we're interested in and export it to nice tables (if reading logs of Stata isn't your favorite pastime activity).
