Thursday, December 6, 2007

Step #1 - Getting the data

The first encounter with Stata, in many cases, can be one of the most frustrating ones. You got the first project you need to work on your own, you searched the internet for data, or got it from someone, but it's not in Stata format. Transferring the data can be very annoying if you don't know how to do it. Maybe this example can help.

Suppose you need to construct a dataset for a research on OECD countries' Foreign Aid. You log into http://www.oecd.org/, go to Statistics, and get the data you want. Then you want to export the data from the website and it lets you do that to an Excel file. I got the following file:


There are, as you can see, four variables: Country, Year, ODA, and ODA as % of Gross National Income. A crucial point is to get the data right in the website, but I am ignoring that here. As a general advice, always ask yourself if the numbers make sense. If you won't ask yourself, others will ask you!

In any case, there are a few things we better do before we transfer the data. Keep in mind that we want to get to the point where the table has one header row with the variables names, and the rest of the rows represent observations:

  1. The country column have only one cell containing each country's name. The rest of the cells should be filled with the name too. Otherwise, this row - when it will be transformed to an observation in Stata - will have a missing value instead of the country's name. To solve this, select the cell with the country's name, click on the bottom-right corner of the selected rectangle and drag it down all the way to the last empty cell of this country. In my file, I also needed to unmerge the cells. Repeat this for all the countries.

  2. The cells in the year column have this weird green triangle on their top left corner. This means that the values there are stored as Text instead of as numbers. Excel recognized this pitfall and therefore warns us with the green triangle. We must reformat the values from text to numeric. Otherwise, Stata will not be able to perform any calculation on this variable. To convert the cells formats to numbers, first select the first cell, then drag the selection all the way down. Click on the small icon with the exclamation mark and click on "Convert to Number"

  3. Column C and Rows 1-6 are excessive. Copy, or type, the column headers for columns D and E in row 7, and delete rows 1-6 and column C.

  4. This is optional, but I would suggest to delete the DAC Countries, Total rows too, since they are not really observations which represent countries, and we want our dataset be country-year based.

Now, you got an Excel file that looks like the Stata file you want. Particularly, all numeric variables have values that are classified as numbers. To transfer them into Stata, you have three options:

  1. The safest, but most expensive - Use a data-transfer program like StatTransfer.
  2. The cheapest way - Copy & Paste.
  3. The third way - Export from Excel, then import to Stata.

Data-transfer Software

These kinds of programs will enable you to transfer data from, and to, different formats, easily and reliably. The problem is that they cost money, so if you're only into a small project, they probably don't pay off. If you are working on a university computer, maybe you have this software installed somewhere. For more details about StatTransfer, visit http://stattransfer.com/

Copy & Paste

This will work in most cases, but might be problematic sometimes. For example, if you have a column with percentage format. Go to the worksheet in Excel, select all the columns that you need (click on the first column's letter and drag to the last column). Copy the selection (Ctrl+C or right-click and then copy or any other way you can think of). Now launch Stata. In the command window type edit and press enter. A window with an empty matrix should appear. Now paste (Ctrl+V etc). Voila!



Export & Import

Go to the worksheet in Excel, select from the menu bar File->Save As. In the bottom of the window that opened you have "Save as type". Select either "Text (Tab delimited) (*.txt)" or "CSV (comma delimited) (*.csv)". Choose a name for the file and remember the folder you saved it in. You will need this file's full path (e.g "C:\Documents and Settings\bla\My Documents\OECD Aid.csv"). Now press on the Save button. The program might ask you whether you are sure you want to save in this format, select Yes.

Launch Stata. In the command line, write,

If you selected CSV as the "Save as type":

insheet using "C:\Documents and Settings\bla\My Documents\OECD Aid.csv", comma

If you selected Text as the "Save as type":

insheet using "C:\Documents and Settings\bla\My Documents\OECD Aid.txt", tab

You can use the Import & Export method with other Spreadsheet programs. As long as they can export the data in a tab- or comma-delimited format.



Before you celebrate

Now, make sure you got the variables right. A simple way is to type sum in the command window. Stata will report, for each variable, the number of observations, mean, standard error, minimum and maximum values. Look at these numbers and make sure they make sense. Make sure that all numberic variables have the number of observations in the dataset. Missing values or string values are not counted by the sum command, so it's a good way to see if you got too many of them.

Another way to make sure the data you have is right is the tabulate command, or in short: tab. Running this command with a variable: tab donor will print a histogram of the variable donor. This way, we can make sure each country appears 7 times. When I run tab, I usually run it with the missing option (or just m, for short). If I don't ask for that option, Stata will not count observations for which the value of the variable (country in our case) is missing.



As you can see, the countries' observations are fine, but look at the value at the bottom of the tabulation. This is the result of the last row in the Excel file we used. It was a note below the table that Stata mistakenly took as another observation. We should get rid of it. The easiest way to do it is to type edit in the command line, scroll all the way down, select the observation and click on the "Delete..." button at the top of the Data Editor, and ask it to delete this observation. Then, when you close the window, a popup comes up and asks you whether to accept the changes. Accept them, of course.

Run the tab command again to see you didn't ruin anything by accident.

If everything is all right, save the file. Either use the icon in the top bar, or just type:

save "C:\myfilename"

If there is already a file with that name in that folder, Stata will not let you save like this. You will need to type save "C:\myfilename", replace.

Congratulations!

(go on to Step #2)


No comments:

Post a Comment