Monday, December 3, 2007

Step #2 - Combine Multiple Datasets into One

In many cases, the data needed for the statistical analyses come from different sources. For example, if you want to analyze international growth, you might find economic indicators in a dataset of the World Bank, political indicators in think tanks such as Freedom House, and climate data in another dataset. Another case is when you have one dataset which is divided into multiple files. In this post I will try to elaborate a bit on how to make it work.

Types of Datasets Combinations

There are actually two main types of combinations:
  1. "Vertical" combination - You want to do this when you want to add observations from one file to another file. For instance, if you are working on a sports statistics project and you have data for players performance in four separate files, one for each year between 2001 and 2004. Another possibility is that the data is separated according to different leagues, groups, etc. As long as the variables in the files are the same and the only thing you need to do is to add observations, this is vertical combination. The command in Stata we will use is append. We will explore this command later.


  2. "Horizontal" combination - This is the kind of combinations in which you want to add variables, and not observations. The observations appear in both files (at least most of them), but in each file there is different information about them. For example, if we're dealing with high school students and we have one file with their personal information and grades, and another file with SAT scores only. If we have an identifying variable in both files (e.g Social Security Number), we can assign each student his/her SAT score. This example is a One-to-One matching. There are three types of matches of this kind:

    1. One-to-One matching: If the identifying variable which appears in the files is unique in both files, then it's a one-to-one match. Unique means that for each value of this variable, there is only one observation that contains it. In the figure below, country is the identifying variable. In both datasets, each country has only one observation.



    2. One-to-Many matching: If the identifying variable is unique in one file, but not unique in the other, then it's a one-to-many matching. This is very common when you have groups of observations in one file (the file with the identifying variable which is not unique), and information regarding each group in another file (the other file). The following figure will make it clearer:



    3. As you can see, one can group the individuals to housefolds. The household identifying variable (fam_ID) is common to both of the files. It is not unique in the individuals file, but it is unique in the households file. This enables Stata to assign the same value, of each of the households variables, to all the members of the household. Note that although we have a unique identifier for the individuals (indiv_ID), it is irrelevant for this merge of files.

    4. Many-to-Many matching: This is very rare. This is also problematic, since there is no unambiguous rule for the assignment of values from observations in one file to observations in the other file. I will not elaborate on this matching too much.

Commands Syntax

There are three commands you should know if you want to combine datasets: append, merge and joinby. All three of them combine the dataset currently in memory with data from a file you specify. We will name the data in memory "Master Data" and the data to combine from the specified file "Using Data". It will be clear why we use the word Using here.

Append

The append command does what we called "vertical" combination. It adds observations. It's syntax, in a simple form (for options not specified in this tutorial, you can always type help append in the command line in order to explore more about the command), goes like this:

append using <filename>

Example:

append using "C:\more_observations.dta"

append using "C:\more_observations" // (this is equivalent)

This will add the observations from the file C:\more_observations.dta to the data in memory. In case no extension is specified (i.e no .dta at the end of the filename), Stata assumes it's .dta, so you can omit it.

Now you understand why we call the data in C:\more_observations.dta "Using Data".

What happens if you have variables in the Master Data which do not exist in the Using Data? The observations from the Using Data will be assigned missing values in those variables. If there are additional variables in the Using Data which do not appear in the Master Data, the observations from the Master Data will have missing values in them.

Tip: Before you append, you might want to make sure you know the source file for each observations. For example, if you append 2008 data to 2007 data currently in memory, you might want to make sure you have the variable year in each of the datasets prior to the incorporation of the Using Data.

Merge

For "horizontal" combination of datasets you will need either merge or joinby. The difference between them is the method they use in order to do the merging, but in one-to-one or one-to-many merges, they give almost the same functionality. We will start with the merge command. The syntax, in its simplest form, is:

merge <identifying variable(s)> using <filename>

Examples:

(1)
use "D:\geography", clear // Assumes D:\geography.dta"
merge country using "D:\economy"
// Assumes "D:\economy.dta"

(2)
merge fam_id using "K:\households.dta"

(3)
merge state year using "K:\USA_data\precipitation.dta"

In the first example, Stata first loads observations from a file called geography and then matches them to observations in the economy.dta file. This will do what the figure in the one-to-one section above shows.
Note: what comes after the double forward-slash (//) will be ignored by Stata. It's used to make the code clearer to the human reader

In the second example, assume the individuals dataset is already in memory. I tried to do what the figure in the one-to-many section above shows. Notice that there is no difference in the syntax of the command. The only difference is in the structure of the files you are operating on.

In the third example, I wanted to show you can use more than one identifying variable. In case only combination of variables is unique (and you want to identify observations uniquely), you can specify both of them. In this example, suppose you have data on state-year basis (this is called Panel Data, because you have the same subjects reappearing in different instances) - let's say car accidents data (number of accidents, injuries, etc) and you need to add data about the weather conditions in that year, you need to tell Stata to make the match between the datasets according to both state and year.

Important: The merge command requires that both the Master and Using Data will be sorted by the identifying variables. If the Master Data isn't sorted, run sort <identifying variable(s)>before the merge command. If the Using Data isn't sorted, open it first (use <filename>, clear), then run the sort command, then save it (save <filename>, replace), open the Master Data and run the merge command. Here's an example:

use "D:\economy", clear
sort country
save "D:\economy", replace
use "D:\geography", clear
sort country
merge country using "D:\economy"

1) Since you saved D:\economy.dta in the third line, you will not need to open D:\economy.dta and sort it again in future runs.
2) If you are doing a one-to-one match (i.e if the identifying variable(s) are unique in both sets), you can run the merge command with the sort option. It will automatically sort the datasets within the merge command. The sort option will not work if the identifying variables are not unique.

The _merge variable:

The merge command automatically creates a variable named _merge, which contains information regarding the observation's existence in each of the two datasets. In the simple cases I mentioned above, it will contain, for each of the observations, one of the following values:
1 => the observation (the identifying variable(s) values) appeared only in the Master Data
2 => the observation (the identifying variable(s) values) appeared only in the Using Data
3 => the observation (the identifying variable(s) values) appeared in both datasets

It is up to you to decide what you want to do with each of the cases. In some projects you will not want observatios with the value 2 in the _merge variable. For example, take example 2 above. If you have households data in the Using data, but your interest is individuals (in the Master Data), you don't need observations with household data but without individuals that are linked to it. If you want to get rid of it, you can either type drop if _merge == 2 after the merge command, or, even better, run the merge command with the option nokeep. That is:
merge fam_id using "K:\households.dta", nokeep

You can also decide that observations in the Master Data that has no corresponding observations in the Using Data are irrelevant for your research. In that case, there is no special option for the merge command. So you need to add the command drop if _merge == 1 after the merge command.

Other options of interest

update and replace

What happens if you have some overlap between the variables in the files? Say, when you are merging data from the CIA World Factbook and the World Bank, you might have GNI in both datasets. If you specify none of them, Stata will keep the values that were in the Master Data (in memory). If you specify the options update replace (replace can't be specified without update), Stata will take, instead, the values that are in the Using Data and put them in place of the Master Data values. If you just type the update option (without replace), however, Stata will put the Using Data values only in observations where the Master Data values are missing.

So in case you have the same variable but different values, use neither option when you think the Master Data is more reliable. Use the update replace options if you think the Using Data is more reliable. If they are equally reliable, use just update.

If you specified the update option, _merge will contain 5 possible values:
1 => the observation (the identifying variable(s) values) appeared only in the Master Data
2 => the observation (the identifying variable(s) values) appeared only in the Using Data
3 => the observation (the identifying variable(s) values) appeared in both datasets and the values are the same in both
4 => the observation (the identifying variable(s) values) appeared in both datasets and the value in the Master Data is missing.
5 => the observation (the identifying variable(s) values) appeared in both datasets but the values in the datasets are not missing and not the same.

Examples:

merge country using "D:\Economy", update replace

merge id using "K:\second_version", update

keep

If you want only some variables to be merged, instead of all of them, you can specify keep().

Example:

merge country year using "F:\intl_health_stats.dta", keep(birth_rate death_rate)

unique, uniqmaster, uniqusing and sort

In order to make sure the one-to-one or one-to-many matches are really unambiguously defined, you can make sure the identifying variables are unique in either the Master Data (uniqmaster), Using Data (uniqusing) or both datasets (unique). It is really recommended to specify them, although it won't change the functionality. The main contribution of these options is to make Stata print an error and exit if what you think is unique is not really unique. The sort option can make the merge command sort the datasets on its own, but it is only possible if you're running a one-to-one match (in other words, sort implies unique).

More than one dataset

You can merge more than one file in one command. Instead of specifying one filename after using, you can add more filenames. Unless the nosummary option is specified, the command will create _merge1, _merge2, ... , _mergen variables in which the observation's value in each of the _mergek variables will be 1 if the k-th dataset had this observations and 0 otherwise. The _merge variable will still be there, but now the value 3 in it means that the observations appeared in at least one of the Using datasets.

Personally, I prefer running the merge command iteratively and adding one dataset at a time. It requires to drop the _merge variable each time, and it might take longer time, but I can better report and deal with the merging outcomes.

Joinby

The joinby command does almost the same job merge does, but its internal working is different, so there might be differences in terms of processing time. Its main difference arises when you're dealing with many-to-many matches, but it can be used for one-to-one and one-to-many matches too. The simple syntax is:

joinby <identifying variable(s)> using <filename>

Example:

joinby country using "D:\economy"

Unlike merge, the default of joinby is to drop all observations that do not appear in both datasets. In order to keep those observations, you need to use the unmatched() option. This option has four possible variations:

  • unmatched(none) - Keep none of the unmatched observations (this is the default)
  • unmatched(master) - Keep observations in Master Data that have no match in Using Data (but not vice versa)
  • unmatched(using) - Keep observations from Using Data that have no match in Master Data (but not vice versa)
  • unmatched(both) - Keep all unmatched observations, from both Using and Master Data

So if you want to do the same thing done in the first example of the merge command, use the following commang:

joinby country using "D:\economy", unmatched(both)

There is no need for the datasets to be sorted by the identifying variable(s), which is an advantage over merge.

The update and replace options are available for joinby too.

As I said, more details with:

help joinby

Many-to-Many Merge

Although I have never needed it, this is where merge and joinby will give you totally different results. The question is how to match values from one dataset to the other. I think the best way to explain the difference between the commands is graphically:

Now you can understand the meaning of the sentence describing the joinby command in the help reference: "Form all pairwise combinations within groups".

Conclusion

If you want to add observations: append.
If you want to add variables: merge or joinby

As always, before you celebrate, make sure you got the combination of the files right by looking at the means, counts, minimum and maximum values (sum command) and tabulations (tab command). Take a special look at the _merge variable. Look for missing values or other outlying observations. If you have too many of them, you might have made a mistake along the way. Browse the data a bit. See that the data merged correctly.

Don't forget to save the file (that is, if you don't want to rerun the merge command later).

(go on to Step #3)


No comments:

Post a Comment