Monday, August 10, 2009

Stata: Introduction

Introduction

This page is designed to give you a quick look at some basic tasks you may need to do in Stata. What it does cover is:
1. Overview of Stata.
2. Creating and using "log" and "do" files.
3. Data management.
4. File management.
5. Basic data description.
6. Basics in regression analysis.
There are some other resources you may find useful:
1. Stata manuals
2. Stata Technical Bulletin
3. The DSS web site, www.princeton.edu/~data/stata (the major source for this overview)
4. Stata’s web site, www.stata.com
5. Stata’s on-line tutorial (the command, ‘help tutorial’ will show you what to do).
You can click on the to see what the output from that particular command or commands look like, or what a particular window should look like in the PC version of Stata; use the "back" button on your browser to return to this page.
________________________________________
Overview
Stata is an interactive data analysis program which runs on a variety of platforms. The most current release, Stata 5.0, is available on the arizona UNIX servers. Stata is also installed on personal computers in a number of CIT and private clusters. There is also a version of Stata which is available for the Macintosh.
Stata can be used to enter and edit data interactively, and for both simple and complex statistical analysis. Commands can be executed one at a time at the Stata prompt, or groups of commands can be entered into do-files which can then be executed.
To invoke Stata 5.0 on the arizona UNIX servers, issue the following command at the UNIX prompt:
flagstaff.Princeton.EDU% stata
If you are working in a telnet session, Stata starts in that session and displays its dot prompt.
On PCs, you can invoke Stata by clicking on the Stata icon.
Features specific to the PC version are pointed out below with phrases such as "In Stata for PC" or "On a PC".
________________________________________


Data Management

Data management is perhaps the most important part of your work in Stata. It is how you get your data into a format in which you can run your analyses properly. Major mistakes have been the result of poor data management rather than poor analysis.



Reading Data

There are four basic ways of reading data into Stata: "use" is for reading data that have been saved in Stata format, "insheet" is for spreadsheets saved as "CSV" files from a package such as Excel, "infile" is for data in what are called "flat" files, and "dictionaries" are for data that have to be read in using special formats.

Use
The Stata use command reads data that has been saved in Stata format:
use [filename]
where filename is the name of the Stata file. For example, suppose you have a Stata file named "myfile.dta." You can read this file with the following command:
use myfile
Note that the ".dta" file extension is automatically appended to Stata files. You do not have to include the file extension on the use command. Also note that the file should be stored in your working directory. You can also read files from other directories, e.g.:
use \home\project\myfile
On a PC, use the "File/Open..." menu to read Stata-formatted data. Stata will automatically clear any existing data from memory.


Insheet



A spreadsheet from Excel cannot be brought into Stata without some editing and saving it as a ".csv" file.
The insheet command is very useful in reading in data from a spreadsheet. There are some conventions required though:

1. The first line should have Stata variable names (eight characters or less, not starting with a "special character" or number) and the second line begins the data.
2. Missing numeric data should be coded as an empty cell, not a space, dot, or any other non-numeric data. Often, 0, 9, or 99 is used to code missing numeric data; this is fine as long as these are not also valid values for that variable.
3. Commas in numbers or text are particularly problematic because Stata thinks they are a delimiter and will not read the data properly. You must remove the commas from numeric values before saving the file.
4. When some spreadsheets create a .csv file, they do not add commas to the end of a line if the cells at the end of that line are empty. This will confuse Stata, which relies on the commas to tell it where the values are. You can avoid this problem by adding another column of 1’s (or any other character) to your spreadsheet. You can drop this variable once you have read it into Stata.
5. The file must be specifically saved as a "comma separated values" file in Excel. You can do this by going to "File", then "Save As…", then choosing "comma separated values." Simply giving it an extension of ".csv" will not work. When you close the spreadsheet and Excel asks if you want to save the changes, say "No." This is counter-intuitive, but the changes it’s asking you about are the changes it needs to make the spreadsheet a regular Excel spreadsheet again.

Once you have completed the changes and you are in Stata, give the command:
insheet using filename.csv
to read in the file. If you get an error message about "wrong number of values," then the file does not have enough commas; see #4 above.
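As a minimal sketch of these conventions, assume a hypothetical file named sales.csv (the file name, variable names and values are made up for illustration). Its contents are shown as comments, followed by the commands that read and check it:

```stata
* Hypothetical contents of sales.csv (first line = variable names):
*   company,size,total
*   IBM,1,250
*   Acme,3,
* The empty field after the last comma on the Acme line is a missing value.

insheet using sales.csv
* Check that all three variables arrived with the expected types
describe
list
```

Running describe and list right after reading the file is a quick way to catch a numeric column that was read as a string.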

Infile
"infile" may be useful if you have downloaded a data file from the web and it conforms to these specifications:
1. The file should NOT have variable names on the first line.
2. Character variables that have spaces in them, such as full names, must be enclosed in quotes.
3. Numbers can have commas and minus signs, but not dollar or percent signs.
4. Infile assumes that the variables have spaces between them and that there are no blank spaces where it expects data (missing data needs to be represented by something).
The command to read a file with infile is:
infile var1 var2 var3 using mydata.raw
Essentially, insheet assumes that commas separate the variables, and infile assumes that blanks separate the variables. For example, suppose you have a raw data file named mydata.raw which contains the following fields:
1 55 4.5
2 23 3.2
3 34 3.4
4 52 7.1
5 41 2.9
You can use the following Stata command to read this file:
infile [varlist] using [filename]
where varlist is a list of variables and filename is the name of the file that contains the raw data. For example:
infile caseid age score using mydata.raw
where caseid, age and score are variable names and mydata.raw is the file that contains the data. It may be easier, though, to bring a file into Excel first, save it as a csv, then use insheet.
Moreover, you can use the infile command to read data files containing many variables or multiple records per observation, or more generally when you need to read data that have been saved with particular formats. In these cases, it is often much easier (and sometimes necessary) to read data using a Stata dictionary file, as described below.
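The quoting and missing-value conventions listed for infile can be seen in a small sketch. The file people.raw and its contents are hypothetical; the str20 prefix tells infile that name is a string of up to 20 characters:

```stata
* Hypothetical contents of people.raw (no variable names on the first line):
*   "Mary Smith" 34 3.4
*   "John Doe"   .  7.1
* A lone period marks the missing age so that no field is left blank.

infile str20 name age score using people.raw
list
```

Without the quotes, infile would treat Mary and Smith as two separate values and the remaining fields would shift into the wrong variables.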

Dictionary Files
Stata dictionary files can vary in their complexity but all of them basically specify which data file to read, where the variables are located in the file and what they are to be called. Optionally, labels can also be provided in the dictionary file. Once the dictionary file is written, it is accessed by issuing the Stata infile command:
infile using [dictionary_file]
where dictionary_file is the name of your Stata dictionary file.
In this example, the raw data are saved in a file named "latimes.raw" and the dictionary file is saved in a different file named "latimes.dct." To read the data, you would call the dictionary file:
infile using latimes.dct
Given below is a sample Stata dictionary file:
dictionary using latimes {
_column(13) wght %5f "Weight Value"
_column(22) groups %1f "Ethnic/racial grp"
_column(23) rdir %1f "Cntry in right dir.?"
_column(24) apbush %1f "Do You app. of Bush?"
_column(27) rvote %1f "Are you reg. to vote?"
_column(28) wvote %1f "Vote for/Lean toward?"
_column(29) whyvote %1f "Why this candidate?"
_newline
_column(48) location %1f "Area you live in?"
_column(49) views %1f "Views on political matters?"
_column(50) tparty %1f "Party views?"
_column(52) relign %1f "What religion?"
_column(55) faminc %1f "Family income"
_column(66) agerng %1f "Age range"
_column(69) educ %1f "Education"
_newline
}
The above dictionary starts with the words "dictionary using", which define the file as a Stata dictionary. The name of the file which contains the actual data appears immediately following the word "using". In this example, the data file is named "latimes.raw" and the name latimes appears in the dictionary file. The data file extension is assumed to be .raw unless a different file extension is specifically provided.
The actual data dictionary is enclosed within curly brackets ({ }). On each line of the dictionary, the _column command is used to move the pointer to the location where the data item starts. After the column locator is the variable name. After the variable name is the format with which the data item is to be read (see the manual for possible formats). After the format, a quoted text string can be provided for a variable label. If there is more than one line of data for each record, a _newline command must be given to advance the pointer to the next line in the data file.
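For comparison, a much smaller dictionary for the mydata.raw file from the Infile section might look like the sketch below. It assumes every line of that file is laid out exactly as in the listing (caseid in column 1, age in columns 3-4, score in columns 6-8); the dictionary file name mydata.dct and the labels are illustrative, not part of the original example.

```stata
dictionary using mydata.raw {
    _column(1)  caseid  %1f  "Case ID"
    _column(3)  age     %2f  "Age"
    _column(6)  score   %3f  "Score"
}
```

If this text is saved as mydata.dct, the command "infile using mydata.dct" would read the data. Each record occupies a single line here, so no _newline command is needed.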


How to Save Stata Files

You can use the Stata save command to save a file in Stata format:
save [filename]
where filename is the name of your Stata file. For example:
save myfile
will save a Stata file named "myfile.dta" in your working directory. This file can be read in Stata with the use command. Note that the ".dta" file extension is automatically appended to Stata files. You do not have to include the file extension on the use or save commands.
If you already have a Stata file named "myfile.dta" and wish to save an updated version of the file under the same name, then use the Stata save command with the replace option, as in:
save [filename], replace
where filename is the name of the file you wish to replace, e.g.:
save myfile, replace
To save an updated version of the active file, you can simply type:
save, replace
To save your file in some other directory, you can type:
save \home\project\myfile, replace
This command will destroy the previous version of your file so use the replace option only if you are certain that you will not need the older version of your file. There is no way to retrieve your original file once another file has written over it.
In Stata for PC, to save the file "myfile.dta" use the "File/Save As..." menu.


How to Increase Memory

Sometimes you may need to allocate additional memory for your Stata session, such as when you are working with a large file. If you receive this message from Stata:
no room to add more observations
then you should increase the amount of memory available to your Stata session. Here's how.
1. Find out how large the file is. First, issue the clear command to remove the file from memory. Then issue the desc using filename command:
desc using mydata.dta
At the top of the information listed is the size of the file, in bytes. There are 1,000 bytes in a kilobyte, and 1,000 kilobytes in a megabyte, so if the size is 11,000 then the file is 11 kilobytes. For example:
Contains data
obs: 1,248
vars: 104
size: 11,576,000
-------------------------------------------------------------------------------
This shows that the file has 1248 observations, 104 variables, and is just slightly over 11.5 megabytes.
2. From the Stata dot prompt, issue the command set memory to increase the amount of memory. For example, the following command allocates 12 megabytes of memory to the current Stata session:
set memory 12m
Set the memory to a number slightly larger than the size of the file you are trying to read.
3. Now read your data file.
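The arithmetic behind these steps can be done in Stata itself with the display command. This is only a sketch using the file size from the example output above, and it assumes a Stata release in which set memory applies, as in this overview:

```stata
* File size reported by "desc using", in bytes, converted to megabytes
display 11576000/1000000
* The result is about 11.6, so request a little more than that
set memory 12m
use mydata
```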
The memory and set memory commands are new features of Stata 5.0 for reporting, allocating and repartitioning memory. The memory and set memory commands in release 5.0 replace the following commands in release 4.0: the memsize command and the set commands for memsize, maxvar and maxobs.
• Use the Stata memory command to determine how much memory is currently being used and how much is available. For example, when the following command is issued at the Stata dot prompt:
memory
The following report will be displayed:
Total memory 2,097,144 bytes 100.00%

overhead (pointers) 0 0.00%
data 0 0.00%
------------
data + overhead 0 0.00%

programs, saved results, etc. 352 0.02%
------------
Total 352 0.02%

Free 2,096,792 99.98%
This report indicates that more than 2M bytes of memory are available for the current session, the default allocation in UNIX.
• Use the set memory command to allocate additional memory, if needed. The general syntax is:
set memory [memsize]
where memsize is the amount of memory requested. For example, issue the following command to request 4 MB of memory for your current Stata session:
set memory 4m
How to Reduce Memory Requirements

You may be able to reduce your memory requirements by saving your data more efficiently. You can use the compress command to reduce the amount of memory your data consumes.
This command does not create a compressed version of your file in the way that compression utilities do, such as zip or pkzip. Rather, the Stata command compress changes the data types used to store your variables such that each variable is stored optimally.
In numerous examples, memory requirements have been reduced by as much as 85% using the compress command.
1. Check the size of the original file using the UNIX ls command. This command can be issued from within a Stata session:
ls -l myfile.dta
where myfile.dta is the name of your Stata file. The -l (long) option displays a long listing of the file that includes its number of bytes. In Windows, you can use the DOS dir command:
dir myfile.dta
2. Open (use) your Stata file. For example:
use myfile
(Recall that the .dta file extension is assumed on the use command.)
3. Issue the compress command:
compress
Stata will consider changing the data type for each of your variables. A message will be displayed if the storage type is changed.
4. You need to save the new version of the file in order for the changes to remain in effect. Issue the Stata save command followed either by a new name for your Stata file or by the replace option. For example:
save myfile2
will save a new Stata file named myfile2.dta. You will have the original file, myfile.dta plus the new file.
Alternatively, the following command will overwrite the original file:
save, replace
This will save the data in Stata's work area as a file named myfile.dta, replacing your original data file. You should be certain that you want to overwrite your original Stata file before using the replace option.
5. Now for the fun part. Check the size of the new file using the UNIX ls command. This command can be issued from within a Stata session:
ls -l myfile2.dta
where myfile2.dta is the name of your Stata file. You may see a significant decrease in the size of your file. You need to compress your Stata file only once.
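The effect of compress can also be seen on a small made-up dataset, without any file on disk. The variable names id and age below are illustrative only:

```stata
clear
input id age
1 34
2 52
3 41
end
* input stores new numeric variables as float (4 bytes each) by default
describe
compress
* id and age hold small whole numbers, so compress stores them as byte
describe
```

Comparing the storage type column in the two describe listings shows which variables were changed.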
Examining your Data

Describe
Once you have the data in Stata, you will want to make sure that all the variables are there and that they are in the format you need. You can do this with the "describe" command. Describe, which can be abbreviated as simply "d," will provide basic information about the file and the variables. You don’t have to call the data into Stata to be able to describe it, though. The command:
d using myfile
will accomplish this. This can be very useful if the file is too large to fit into memory. Often, your dataset will be larger than the default and you will need to increase the amount of memory Stata uses. To do this, look at the "size" line in the describe output; this is the size of the file in bytes. Since 1,000,000 bytes is a megabyte, files larger than 4,000,000 will not be loaded into Stata unless you increase the memory with the "set memory" command:
set mem 10m
This example increases the memory to 10 megabytes. Make sure that you give yourself some extra memory so you can create new variables and/or add observations.

List
It is often useful to just look at the data without doing any kind of analysis. The "list" command, abbreviated as "l", will let you do this. Simply giving the "l" command will display all of the data on your screen; if you specify certain variables in the command, then only those variables will be printed:

l
l var1 var2 var3 - displays just var1, var2 and var3
In Stata for PC you can look directly at your data as if you were working with a spreadsheet (e.g. Excel), by using the "Stata Editor". You enter the editor by clicking on the "Editor" button. You can edit your data by highlighting the cell you want to change, entering the new value for that cell, and pressing "return". You can edit the variable name (and the variable label) by double clicking on the relevant column.
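With a large file you will rarely want to list everything. The sketch below builds a tiny made-up dataset and shows list restricted to a range of observations with "in" and to observations meeting a condition with "if":

```stata
clear
input var1 var2 var3
5 10 1
7 20 2
9 30 3
end
* All variables, first two observations only
list in 1/2
* Two variables, only where var1 exceeds 6
list var1 var2 if var1 > 6
```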
Creating Variables

Stata can store data as either numbers or characters. Stata will allow you to do most analyses only on numeric data. Sometimes, when you use insheet or infile, a numeric variable gets read in as a character, or string variable. This may be due to a ‘space’ character or a ‘.’ in one of the cells. Since Stata allows you to do analyses only on numeric variables, you will need to convert them to numeric data.
There are different types of numeric variables - byte, int, long, float and double - and the differences among them are simply how much space they take up in the file. In most cases, you will not need to concern yourself with these differences.

Encode
Use encode when the original variable is, indeed, a character variable (such as gender being coded as "m" and "f") and you need numbers instead. The encode command does not produce dummy variables; it just assigns numbers to each group defined by the character variable. In this example, gender was the original character variable and sex is the new numeric variable:
encode gender, gen(sex)
Generate
The "gen … real" command will create a numeric variable based on the original string variable. Use this if the original was defined as a character variable "by mistake." The following command will create a new, numeric variable var1n by converting the string variable, var1:
gen var1n=real(var1)
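A short sketch of this conversion on made-up data shows what happens to values that are not numbers. The three strings below are illustrative only:

```stata
clear
input str5 var1
"12"
"3.5"
"n/a"
end
* real() returns missing (.) for strings that do not look like numbers
gen var1n = real(var1)
list
```

Listing the old and new variables side by side makes it easy to see which cells caused the variable to be read as a string in the first place.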
Often, you will need to create new variables based on the ones you have already. The two most common ways of creating new variables are by using "generate" and "egen." Here are some examples of gen:
gen total= var1n + var2 + var3
gen cumtot=sum(total)
gen id=_n
In the first example, we generate a new variable called "total" which is simply the addition of var1n, var2, and var3. In the next example, we create a cumulative sum of total (the value of cumtot for a given observation is the sum of total for that observation and all previous observations). Finally, we create an "id" variable by simply setting it equal to "_n", which is Stata’s way of numbering the observations in a dataset. This id variable can be very useful when you need to sort your data in different ways (see sorting below). The variable "_n" can be used at any time, but you must remember that it refers to the number of the observation in the current order, not the original order. Here are some more examples of gen and _n:
gen total2=42.3 if _n==125
gen lagtot=total[_n-1]
In the first example we create total2 and set it equal to a specific value, 42.3, for a single observation, namely observation number 125; every other observation gets a missing value. In the second, we create what is called a "lagged" variable, lagtot, which is the value of total from the previous observation designated by the "[_n-1]." You can use any valid arithmetic expression inside the brackets. Here are two last examples:
gen price = total + lagtot if company=="IBM"
replace price = total + cumtot if company~="IBM"
Here, we generate a new variable, "price", which is the sum of total and lagtot, but only if the value of company is IBM. Note two things here: first is the double "=" after "if", and second, the use of the quotes around IBM. When you refer to string variables in equality expressions such as this, you must put the quotes around the value so Stata will not confuse it with a variable name. Had we left the quotes out, Stata would have thought we wanted to generate price for observations where the variable company equals the variable "IBM." In the second example we use "replace" instead of "gen." The reason for this is that in the first example, Stata creates values only for those observations fulfilling the "if" condition. All other observations get "missing" values. We want, then, to replace those missing values with something. The "~" in Stata means "not," so "~=" means "not equal to"; the "~" takes the place of the first "=" in "==."
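The behaviour of sum(), _n and [_n-1] is easiest to see on a few made-up observations. This sketch uses three illustrative values of total:

```stata
clear
input total
10
20
30
end
gen id = _n
* Running sum: 10, 30, 60
gen cumtot = sum(total)
* Previous observation's total: ., 10, 20
gen lagtot = total[_n-1]
list
```

The first observation has no previous observation, so its lagtot is missing. Any expression that uses lagtot, such as total + lagtot, is therefore missing for that observation as well.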

Dummy variables
Sometimes we need to generate a "dummy" variable, or variables. Stata makes this very easy:
tab size, gen(dummy)
Here, Stata will create a dummy variable for each value found in size. So, dummy1 = 1 if it is a big company, 0 otherwise; dummy2 = 1 if it is a medium company, 0 otherwise; dummy3 = 1 if it is a small company, 0 otherwise. You can check the new variables with list and desc.
If you want a dummy variable to indicate only a particular size category:
gen d_med = (size==2)
Here Stata will create a dummy variable such that: d_med = 1 if it is a medium company, 0 otherwise. In general, using this command will make Stata create a dummy variable equal to 1 for each observation where the expression in parentheses is true, and equal to zero otherwise.
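The two approaches treat missing values differently, which the following sketch on made-up data illustrates. The last observation has a missing size, and the name d_med2 is illustrative:

```stata
clear
input size
1
2
3
2
.
end
tab size, gen(dummy)
gen d_med = (size==2)
* For the observation with missing size, dummy2 is missing but d_med is 0
list
* To leave the dummy missing there as well, add an if condition
gen d_med2 = (size==2) if size ~= .
list
```

Whether a missing size should count as "not medium" or stay missing depends on the analysis, so it is worth deciding explicitly.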

Character Variables
Stata assumes that the variable you are creating is numeric unless you tell it otherwise. Sometimes, though, you will want to create a variable whose values are strings. Here’s how:
gen str5 name="John"
gen str10 team="Atlanta" in 1/5
As above, the values for string variables must be enclosed in quotes. You must also specify the largest number of characters, including spaces, that the variable will have in the "str#". In other words, if "John" is the longest value for the variable name, then you must specify name to have at least 4 characters. It’s fine if you specify more characters than the values will actually take, but if you specify less, you will either get an error message, or some of the characters will be cut off.
The "in" option simply tells Stata to apply the "gen" command to observations 1 to 5, inclusive, in the order they appear in the dataset. You can specify any consecutive range. The "in" option can also be used with numeric variables, as well as any other commands.
gen str3 yn="yes" if team=="Atlanta" & name=="john"
replace yn="no" if team=="New York" | team=="Boston"
Here we extend the use of the "if" option to include two expressions. As you may have guessed, the "&" means "and." The "|" means "or." One thing you must be wary of is that the values of character variables are case-sensitive, so "John" is not the same as "john." The first command above therefore assigns "yes" only where name is exactly "john".
gen str2 day=substr(code,1,2)
The substr function can be helpful; it specifies certain parts of a character variable to be used. The first argument specifies the variable to be picked apart (code), the second argument is the starting character, and the last argument is the number of characters to be pulled out. In this example, starting with the first character in the variable "code," two characters will be extracted. Because the result is a string, the new variable is declared with a "str#" type, as described above.
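A common follow-up is to turn the extracted characters into a number. The sketch below assumes a made-up six-character code whose first two characters are a day of the month; the variable names daystr and daynum are illustrative:

```stata
clear
input str6 code
"150793"
"020194"
end
* Characters 1-2 hold the day in this made-up code; keep them as a 2-character string
gen str2 daystr = substr(code,1,2)
* Convert the extracted text to a number
gen daynum = real(daystr)
list
```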


Date Variables
Stata stores dates as the number of elapsed days since January 1, 1960. There are two ways to create elapsed dates, depending on how your original variables are stored. If you have separate variables for month, day and year then use the mdy() function to create an elapsed date variable. If your original dataset already contains a date variable, then use the date() function. Once you have created an elapsed date variable, you will probably want to format it, as described below.
• Use the mdy() function to create an elapsed date variable when your original data contains separate variables for month, day and year. The month, day and year variables must be numeric. For example, suppose you are working with these data.
month day year
7 11 1948
1 21 1952
11 2 1994
8 12 1993
Use the following Stata command to generate a new variable named mydate:
gen mydate = mdy(month,day,year)
where mydate is an elapsed date variable, mdy() is a Stata function, and month, day, and year are the names of the variables that contain data for month, day and year, respectively. The mdy() function can put three variables together to create a new date variable. The variables do not have to be called month, day, and year, but they do have to follow that order. One more caveat is that the year must be stored as a four-digit number, otherwise Stata will not know what century you want. If your years are stored as two digits, you may need to do some more programming to add 1900 to the value first.
• Use the date() function to create an elapsed date variable when your original dataset contains a date variable. For example, suppose you have a string variable, olddate, that represents the date of some event.
olddate
7-11-92
3-24-93
9-16-94
12-18-1996
Use the following Stata command to generate a new variable named newdate:
gen newdate = date(olddate,"mdy")
where newdate is an elapsed date variable; date() is a Stata function; olddate is the name of the variable that contains a date; and "mdy" is the format for reading the data (i.e., month, day and year).
Other formats are available for the date() function including "dmy" and "ymd". The format must match the order in which the values appear in the original date variable. You can use only one format to read a given date variable since the month, day and year components must appear in the same order for every occurrence of the original date variable. However, you can represent the month, day and year components in a number of ways, for example:
somedate
7/11/92
Sept 24, 1993
9.16.94
December 18, 1996
The date() function can be used to read these data and create an elapsed date variable named newdate2:
gen newdate2 = date(somedate,"mdy")
• Use the format command to display elapsed dates as calendar dates. In the example given above, the elapsed date variable, mydate, has the following values, which represent the number of days before or after January 1, 1960.
month day year mydate
7 11 1948 -4191
1 21 1952 -2902
8 12 1993 12277
11 2 1994 12724
You can use the format command to display elapsed dates in a more customary way. For example:
format mydate %d
where mydate is an elapsed date variable and %d is the format which will be used to display values for that variable.
month day year mydate
7 11 1948 11jul48
1 21 1952 21jan52
8 12 1993 12aug93
11 2 1994 02nov94
Other formats are available to control the display of elapsed dates (see the manuals).
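The mdy() example can be reproduced end to end with a few lines, entering the same four dates by hand. Because elapsed dates are ordinary numbers, subtracting one from another gives the number of days between them, which the last command sketches:

```stata
clear
input month day year
7 11 1948
1 21 1952
8 12 1993
11 2 1994
end
gen mydate = mdy(month,day,year)
* Unformatted, mydate lists as -4191, -2902, 12277, 12724
list
format mydate %d
* Formatted, the same values list as calendar dates
list
* Days from August 12, 1993 to November 2, 1994: 447
display mdy(11,2,1994) - mdy(8,12,1993)
```

Formatting changes only how the values are displayed; the stored numbers stay the same.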

Here are some other useful functions for working with dates:
gen day = day(date)
gen weekday = dow(date)
gen month = month(date)
gen year = year(date)
All of these examples assume that the date variable is an elapsed date. The day() function returns the day of the month, e.g., 23. The dow() function returns a number from 0 to 6: 0 if it is a Sunday and 6 if it is a Saturday. month() returns a number from 1 to 12. year() returns the year as a four-digit number.
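These functions can be tried directly with display, without creating any variables. The sketch below uses July 11, 1948, one of the dates from the mdy() example:

```stata
* July 11, 1948 as an elapsed date: -4191
display mdy(7,11,1948)
* Day of the week: 0, because that date was a Sunday
display dow(mdy(7,11,1948))
* Day of the month, month and year: 11, 7 and 1948
display day(mdy(7,11,1948))
display month(mdy(7,11,1948))
display year(mdy(7,11,1948))
```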

Extended Generate (egen)
"egen", or "extended generate" is useful when you need a new variable that is the mean, median, etc. of another variable, for all observations or for groups of observations. Egen is also useful when you need to simply number groups of observations based on some classification variables. Here are some examples:
egen sumvar1 = sum(var1)
egen meanvar1= mean(var1), by(var3)
egen count = count(id), by(company)
egen group = group(month year)
In the first example, we simply create a variable whose value for each observation is equal to the sum of var1 for all observations (all observations will have the same value). The second example shows how to create a variable that is the mean of another variable for each group designated by var3 (all observations within a group will have the same value). The third example simply counts the number of non-missing id values in each company (this basically counts the number of observations you have for each company). The last example simply assigns a number to each of the groups created by the combination of month and year.
There are many functions that can be used with "extended generate". To have a brief summary, type:
help egen
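A small made-up dataset shows how the grouped egen functions fill in every observation of a group. The variable names nobs and grp are illustrative, and one value of var1 is deliberately missing:

```stata
clear
input str5 company var1
"IBM" 10
"IBM" 20
"Acme" 5
"Acme" .
end
gen id = _n
* Group mean: 15 for both IBM rows, 5 for both Acme rows (missing var1 is ignored)
egen meanvar1 = mean(var1), by(company)
* Non-missing id values per company: 2 for every row
egen nobs = count(id), by(company)
* One group number per company
egen grp = group(company)
list
```

Note that the Acme observation with a missing var1 still receives the group mean, because egen assigns the result to every observation in the group.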
Variable and Value labels

Now that we’ve created all these new variables, we’ll want some way of keeping track of what each one is and what their values mean. We can do this by creating variable labels and value labels. These labels are not necessary in Stata, they just make the output easier to read. Variable labels correspond to the variable names, whereas value labels correspond to the different values a variable may have. Here’s how to create them:
label variable var1 "The first variable"
This assigns a label to the variable, var1, so whenever the variable name var1 is displayed on the screen, the description "The first variable" will be displayed as well.
In Stata for PC, variable names and variable labels will automatically appear in the "Variables" window, and you can enter a variable name in the "Stata Command" window simply by clicking on it in the "Variables" window.
Variable labels must be 31 characters or less. Often, though, it’s more helpful to label the values a variable can take so you don’t have to memorize them or keep referring to a codebook.
label define grp 1 "Male" 2 "Female" 3 "N/A"
label values gender grp
Value labels correspond to the actual numeric values in the data. They are used so that the printout will show "Male," "Female," and "N/A" instead of "1," "2," and "3". First you have to "define" the label, then associate that label with a variable. Each label can be associated with more than one variable, so if there are several "yes/no/maybe" questions, you just have to define one value label and use it for all the questions. You may associate only one value label with a particular variable, however.
You can check the list and content of all value labels by typing:
label list
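The "yes/no/maybe" case mentioned above can be sketched as follows. The variables q1 and q2, the label name yesno and the data are made up for illustration:

```stata
clear
input q1 q2
1 2
2 2
3 1
end
label variable q1 "First survey question"
label define yesno 1 "Yes" 2 "No" 3 "Maybe"
* One value label can be attached to several variables
label values q1 yesno
label values q2 yesno
* list and tab now show Yes, No and Maybe instead of 1, 2 and 3
list
tab q1
label list
```

The underlying data are unchanged, so conditions still refer to the numbers, as in "list if q1==1".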
Keep, Drop, and Rename

Well, now we’ve created many new variables and converted some old ones. Since we no longer need all of these variables, we’ll want to eliminate some of the ones we don’t really need and perhaps rename some of the ones we keep. We can either keep the variable we are interested in:
keep code date var4
or we can drop those we don't need, and rename some of those we keep:
drop var1 var2 var3
rename code date
Here we drop the variables, var1, var2 and var3. Be careful! Once they are dropped, they can’t be picked up again unless you clear the data and use it again. You can drop as many variables as you want in one command. The same is true for "keep". The second example renames the variable "code" to "date" (this works only if no variable named "date" already exists in the data). You can rename only one variable at a time. You can also drop (or keep) observations:
drop if size==1
or
keep if size~=1
Either command will drop all observations whose size is equal to 1.
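Since dropped observations cannot be recovered without reading the data again, it can help to count them first. A minimal sketch on made-up data:

```stata
clear
input size
1
2
3
1
end
* How many observations would be removed? Here, 2
count if size==1
drop if size==1
* Two observations remain
count
```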
________________________________________


Log and Do Files
Log files

Log and do files are very useful. Logs keep a record of what commands you have issued and their results during your Stata session. Do files are good for long series of commands that may need to be "tweaked" to work properly. They are also necessary to replicate things that you have done on new or modified datasets.
You can create a log file with the command:
log using filename
Where filename is any name you wish to give the file. You will find it helpful to use names that will help you to remember what you did during that session. Stata will automatically append an extension of ".log" to the filename. By default, everything displayed on the screen will be recorded in the log file.
If you work for a long time, this file can get to be very large and have quite a bit of unnecessary output. So, if you are trying different things to "see what happens," you may want to start and stop the logging several times in one session. You can stop the logging with the command:
log close
If you turn the logging on and off during a session, you may need to use the append and/or replace options:
log using filename, append
log using filename, replace
The append option simply adds more information to the file, whereas the replace option erases anything that was already in the file. Be careful! It’s often helpful to put comments in the log file to help you remember why you did something. You can do this by simply preceding anything you type with a "*".
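As a rough sketch of how these options fit together, the lines below start a log, close it, and later reopen it with append. The file name session1.log is made up for illustration, and the extension is typed out explicitly so the sketch does not depend on the default.

```stata
* session1.log is a hypothetical file name
log using session1.log, replace
* comments typed like this one are recorded in the log as well
display 2+2
log close

* Reopen the same file and add to the end of it
log using session1.log, append
display 3*3
log close
```

Running the first line a second time with replace would wipe out everything recorded so far, which is why append is the safer choice when you pick up where you left off.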
But what happens if you forget to start a log file? You can still save what you have done by using a log file and the #review command. The #review command lists the most recent commands you have issued; you can use it at any time and specify how many commands to retrieve. After some minor editing and changing the extension to ".do", you can run the resulting file as a do file. Here’s an example:
log using iforgot - start a log file
#review 20 - list the last 20 commands
log close - close the log file
The file, iforgot.log, will have the last twenty commands you issued in it. You can then run this file as a do file with a new log started.
In Stata for PC you open a .log file by clicking on the "Log..." button. If you want to open an existing .log file, you have to choose whether to overwrite it or to append the output to its existing content. At this point a "Stata Log" window will pop up, and the contents of the .log file will be displayed there. Once the "Stata Log" window is open, every command you enter in the "Stata Command" window will have its output saved in the log file. By clicking on the "Log..." button you can choose either to keep the log file open and suspend saving the output in it, or to close the log file. You can also print your log file by opening it and using the "File/Print Log" menu.

Do Files

A "do" file is a set of commands just as you would type them in one-by-one during a regular Stata session. Any command you use in Stata can be part of a do file. Do files are very useful, particularly when you have many commands to issue repeatedly, or to reproduce results with minor or no changes.
You can use any editor you wish to create your do files. If you use pine for email, then you already know how to use pico. The only thing you need to remember is that if a command is longer than one line, then you need to use the #delimit command. In an interactive session, hitting the return key tells Stata to execute the command. In a .do file, there is a return at the end of every line, so you need a way of telling Stata that a command continues past the end of the line. Here is an example of a short do file we’ll call mydofile.do that you create in a text editor; the text between /* and */ is a comment that Stata ignores:
log using mydofile /* start a log file */
/* set the delimiter to ";" */
#delimit ;
use mydata; /* open the data file */
des; /* describe the file */
collapse (mean) var1 var2 var3 (median) medvar=var1, by(var4); /* collapse the data */
save mydata2; /* save the collapsed data */
/* set the delimiter back to the "enter" key */
#delimit cr
clear /* clear the data from memory */
log close /* stop logging */

To run this file in Stata, you would simply issue the command:
do mydofile
You don’t even need to exit Stata to create or edit the do file. The command:
!pico mydofile.do
will allow you to enter the text editor, pico, without exiting Stata. Once you save the file and exit from pico you are automatically returned to Stata. The ! character tells Stata that you want the command issued to UNIX (or to DOS if you are working with Stata for PC in Windows), rather than to Stata itself. Any command you would normally issue to UNIX (or to DOS) can be issued from within Stata this way. The "ls" (list files) and "cd" (change directory) commands do not have to be prefixed by the !.
Two other useful things are the Ctrl-r and Ctrl-c keys. You can use Ctrl-r to recall on the command line the commands you have issued. Sometimes the output from a command is very long and you don’t need to look at all of it; you can use Ctrl-c to stop the output.
In Stata for PC, run a .do file by using the "File/Do..." menu. You can edit it with any normal text editor.
________________________________________

File Management

Once you have all of your variables in the format you want, you’ll need to get the entire file in a format that will make it easier to use. You can do this by sorting, appending, merging and collapsing.

Sort
"sort" puts the observations in a data set in a specific order. Some procedures require the file to be sorted before they can work. You can sort a file based on more than one variable. One thing you must be careful of, though, is that Stata does not preserve the original order of observations that have the same values of the sort variables; their order within such ties is effectively random. This is why creating the id variable mentioned earlier is important. With the id variable, you will always be able to go back to the original order and start over.
sort month
sort month year
In Stata for PC you can sort data in the Stata Editor, by clicking on the variable you want to sort by, and then clicking the "Sort" button.
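Here is a small sketch of the id trick, using made-up data. The variable names month, year and sales are only for illustration; id is simply the observation number recorded before any sorting.

```stata
* Hypothetical toy data, typed in so the sketch is self-contained
clear
input month year sales
2 1991 30
1 1990 10
2 1990 20
1 1991 40
end

* Record the original order before sorting
gen id = _n

sort month year
list

* Go back to the order in which the data were entered
sort id
list
```

Because every observation has a different id, the last sort has no ties and always restores the original order.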

Append
Sometimes, you have more than one file of data which you need to analyze. One case may be that you have two files with the same variables but different observations. The other case is when you have two files with the same observations, but different variables.
Use append when you simply want to add more observations, in other words you already have data for 1990, and now you want to include new observations from 1991 for the same variables. For example:
use class90
append using class91
will add the observations from the file, class91 (what Stata refers to as the "using" dataset), to the end of class90 (what Stata refers to as the "master" dataset). Any variable that appears in only one of the two files will have missing values for the observations that come from the other dataset.
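The following sketch shows what happens to a variable that exists in only one of the files. The file names toy90 and toy91 and all variable names are invented for illustration.

```stata
* Build a hypothetical "using" file with an extra variable, absent
clear
input id score absent
3 70 2
4 85 0
end
save toy91, replace

* Build the hypothetical "master" file, which has no absent variable
clear
input id score
1 80
2 90
end
save toy90, replace

* Add the toy91 observations to the end of the data in memory
append using toy91
list
```

The result has four observations, and absent is missing for the two that came from toy90.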

Merge
If you have two files that have the same observations, but different variables, then you’ll want to "merge" them so you can use all of the variables at once. When you merge datasets, you are adding new variables to existing observations rather than adding observations to existing variables. There are two basic kinds of merges, a one-to-one and a match.
A one-to-one merge simply takes the two files and puts them side-by-side, regardless of whether the observations in each dataset are in the same order. A simple one-to-one merge isn’t very common, and isn’t recommended, even if you think all the observations match. Sometimes there may be a "glitch" that can throw off the order of the data. It’s just as easy to do a match-merge, and you’ll be able to check that all the observations matched correctly.
There are some "rules" when doing a match-merge. First, each dataset must have a "key" variable by which the observations can be matched - social security number is a good example. Second, both datasets must be sorted by this key variable. You can use more than one key variable if you want, such as month and year. These key variables must be of the same type (string or numeric) in both datasets. By default, if a variable is present in both datasets, then the values in the master dataset will remain unchanged. If the variables in the using dataset have the same names as ones in the master dataset, but represent additional information, then you will need to rename them in one of the datasets before you can merge them.
use class90
sort date
merge date using class91
In this example we simply match observations using the variable "date" as the key.
Sometimes, you will want to update the data you have, namely, fill in missing values. Using the update option will accomplish this:
merge date using class91, update
Values in the master dataset will be changed only if they are missing.
In some instances, you may want to replace non-missing values in the master dataset. Use the replace option in addition to the update option to do this:
merge date using class91, update replace
"replace" will not work by itself; it must always be used with update. Stata will not, under any circumstances, replace a non-missing value in the master dataset with a missing value from the using dataset. You must use the "replace" command discussed earlier to do this.
Stata will automatically create a variable called _merge which will indicate the results of the merge. Always check this variable to make sure that you got what you wanted. Here are what the possible values of _merge mean:
1 = Observations from the master dataset that did not match observations from the using dataset.
2 = Observations from the using dataset that did not match observations from the master dataset.
3 = Observations from both datasets that did match.
4 = Observations from both datasets that did match, missing values in the master dataset were updated.
5 = Observations from both datasets that matched, values in the master dataset disagree with those in the using dataset.
Usually, you will want all the observations to have a value of 3. Values of 4 and 5 occur only when you use the "update" option; for observations coded 5, the master values are kept unless you also specify "replace". If you need to merge another dataset, you have to rename or drop _merge first, otherwise you will get a "_merge already defined" error.
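To see the _merge codes in action, here is a minimal sketch of a match-merge on made-up data. The file name toyusing and the variables id, height and weight are hypothetical; the merge line uses the same syntax as the examples above.

```stata
* Hypothetical "using" file, sorted by the key and saved
clear
input id height
1 150
2 160
4 170
end
sort id
save toyusing, replace

* Hypothetical "master" data, also sorted by the key
clear
input id weight
1 50
2 60
3 70
end
sort id

merge id using toyusing

* Check how the observations matched
tab _merge
list if _merge~=3

* Drop _merge so that another merge can be done later
drop _merge
```

Here id 1 and 2 are found in both files (_merge is 3), id 3 only in the master data (_merge is 1), and id 4 only in the using file (_merge is 2).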


Collapse
"collapse" is used when you want to create a dataset containing the means, sums, etc., of the various groups in the data. One example might be when you have one dataset of monthly data and another of yearly data and you need to analyze both sets of information together.
collapse (mean) var1 var2, by(month)
This will create a dataset of the means of var1 and var2 by month. If you have twelve months in your original dataset, then you will have twelve observations in the collapsed dataset. You must be careful, though, because Stata will compute the statistics on a variable-by-variable basis. If one variable has more missing observations than another, the means (and any other statistics you request) will be based on a different number of observations. This is not always an acceptable practice. To avoid this, you must use the "cw" option for "casewise deletion." This means that Stata will drop any observation that does not have data for both var1 AND var2, thereby ensuring that all the statistics will be based on the same number of observations.
collapse (mean) var1 var2, by(month) cw
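One way to see the problem before deciding on "cw" is to ask collapse for the number of non-missing observations alongside the means. This is only a sketch on made-up data; the names n1 and n2 are invented for the counts.

```stata
* Hypothetical toy data with one missing value in var2
clear
input month var1 var2
1 10 1
1 20 .
2 30 3
2 40 4
end

* Means plus the number of non-missing observations behind each mean
collapse (mean) var1 var2 (count) n1=var1 n2=var2, by(month)
list
```

For month 1 the mean of var1 is based on two observations but the mean of var2 on only one, which is exactly the situation the "cw" option is meant to avoid.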
________________________________________


Basic Commands

Now that you have your data in a format you want, check it before doing any analyses. This can save you quite a bit of frustration later on.

Summarize
"sum", short for summarize, will give you the means, sd’s, etc. of the variables listed. If you don’t list any variables, it will give you the information for all numeric variables. If a variable you thought was numeric shows up as having 0 observations and a mean of 0, then, most likely, Stata still thinks it’s a character variable.
summ
summ var1 var2
The "detail" option gives you additional information about the distribution of the variable.
summ var1, detail

Inspect
"inspect" is another easy way to eyeball the distribution of a variable.
inspect var1

Tabulate
"tab", short for tabulate, will produce frequency tables. By specifying two variables, you will get a crosstab. There are other options to get the row, column and cell percentages as well as chi-square and other statistics; check the manual.
tab var1
tab var1 var2
tab var1 var2, row col chi2

"By" group processing
Sometimes you’ll want to run a command or analysis on different groups of observations. The "by variable:" subcommand is the same thing as running the command with separate "if" statements for each group. You must sort the data before you can use the "by:".
sort sex
by sex: summ var2
by sex: gen lag2=var2[_n-1]
Sometimes, this can produce more output on the screen than you would care to look at. To suppress output to the screen, but not to the log file, use "quietly."

quietly by var1: gen lag2=var2[_n-1]
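A small sketch of "by" processing on made-up data follows. Sorting on var2 as well as sex is an extra step added here so that the order within each group, and therefore the lagged values, is the same every time; the variable n is invented for illustration.

```stata
* Hypothetical toy data, typed in so the sketch is self-contained
clear
input sex var2
0 10
0 20
1 30
1 40
end

* Sort by group, and by var2 within group so the order is fixed
sort sex var2

* Within each group, take the value from the previous observation
by sex: gen lag2 = var2[_n-1]

* _N is the number of observations in the current group
by sex: gen n = _N
list
```

The first observation of each group has no previous observation within the group, so lag2 is missing there.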

________________________________________


Fancy commands





Macro variables
Sometimes you need to use many variables the same way many times. One example would be if you want to run regressions on different dependent variables using the same set of independent variables. This can mean a lot of typing. One way around this is to create a macro variable. A macro variable is simply a variable that has as its value a particular string. This string can be anything you specify: a list of variables, a particular command, or whatever. Whenever Stata comes across the macro variable, it will interpret it to mean whatever string you set the variable to.
local macvar var1 var2 lagtot lag2
reg depvar `macvar'
In this example, when Stata "sees" the macro variable macvar in the regression command, it replaces `macvar' with the string "var1 var2 lagtot lag2." Pay careful attention to the two different quotation marks: the first quote in `macvar' is the opening left quote (backquote), usually found under the ~. The second is the ordinary closing single quote (apostrophe), usually found under the ".
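Here is a minimal sketch you can run to watch the substitution happen. The data are made up, and the macro holds only two variable names to keep the toy dataset small.

```stata
* Hypothetical toy data so the regression has something to run on
clear
input depvar var1 var2
1 1 2
2 2 1
3 3 4
5 4 3
4 5 6
end

local macvar var1 var2

* Show what the macro expands to
display "`macvar'"

* The same as typing: reg depvar var1 var2
reg depvar `macvar'
```

A local macro defined inside a do file disappears when the do file finishes, so the definition and the commands that use it should be run together.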

For
Other times, you will want to perform the same command on several variables. Again, this can mean a lot of typing. The "for:" command can also be very useful in these situations. It can use several types of variable lists, and with a little ingenuity, can be very powerful. Here are a few examples:
for var1-var25: replace @=. if @==99
for var*: replace @=. if @==99
for 1-3, ltype(numeric): gen q@=0
for a b c, ltype(any): gen str2 @="x"
The @ in the for command represents the variable names in the list, one by one. In the first two examples, Stata would interpret the command as though you had typed replace var1=. if var1==99, replace var2=. if var2==99, and so on.
In any command, you can specify two kinds of lists:
1. var1-var25 (or date-q25) indicates all variables from the first through the second, inclusive, in the order in which the describe command lists them, even if some of them do not begin with "var".
2. var* indicates all variables starting with "var."
In the third example, the command will create three new variables called q1, q2, and q3 and set them all equal to zero. The last example creates three new string variables called a, b, and c, and sets them all equal to "x."
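If you are working in a newer release of Stata than the one described here, the same kind of loop is usually written with the "foreach" command rather than "for". The sketch below assumes such a release and uses made-up data; it is an alternative to the first example above, not the syntax this page documents.

```stata
* Hypothetical toy data in which 99 stands for "missing"
clear
input var1 var2 var3
1 99 3
99 2 99
end

* Assumes a release that has foreach; v is a local macro holding each name in turn
foreach v of varlist var1-var3 {
    replace `v' = . if `v' == 99
}
list
```

Inside the braces `v' plays the role that @ plays in the "for" command.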
________________________________________
Regression

Linear Regression Model
In Stata you can estimate Linear Regression Models using the "regress" command (shortcut reg). Here we suppose that you are familiar with linear regression models, and more in general with multiple regression analysis, and thus we don't provide any introduction to this subject.
Now, suppose your dependent variable is depvar and your independent variables are listed in varlist. Then the command to run the linear regression is:
reg depvar varlist
The first variable after the regress command is always the dependent variable (or left-hand-side variable), and the following list gives the relevant independent variables (or right-hand-side variables).
Suppose, for example, that you have a data set with individual observations on earnings, and you are interested in regressing the natural logarithm of average weekly earnings (lnw) on a sex dummy variable (sex: 0=male, 1=female), on a race dummy variable (race: 0=white, 1=non-white), on years of schooling (school), on age (age) and on a constant. Then you should type
reg lnw sex race school age
Notice that Stata automatically adds the "constant" (_cons) to the list of independent variables. If you want to exclude it, you can use the option nocons:
reg lnw sex race school age, nocons
As for many other commands in Stata, the regress command can be also applied to a subset of your data set by specifying an if (or in) statement. Suppose that you want to run a regression only for the male population:
reg lnw race school age if sex==0
Several options are available with the regress command (see help regress). Among the most useful is the robust option, which allows you to run the regression using a "robust" estimator for the variance (specifically, the Huber/White/sandwich estimator). This option is mainly used when the residuals are thought to be heteroskedastic:
reg lnw sex race school age, robust
Another option that may be useful is the one that allows you to run Instrumental Variable regressions (or 2SLS). Suppose you have a dependent variable y1 and a list of independent variables x1 x2 x3; and let's suppose that you want to instrument x3 with z1, x1 and x2. Then you can simply run the (2SLS) regression by typing:
reg y1 x1 x2 x3 (z1 x1 x2)
and you will get the results of the second stage in the 2SLS procedure.



Logit, Probit
When your dependent variable is dichotomous (zero-one), you can run a Linear Probability Model (using the regress command) or you might be interested in running a Logit or a Probit regression. Here we assume that you are familiar with the Logit and Probit models, as well as with maximum likelihood estimation. As a quick reminder, the only difference between the Linear Probability Model (LPM), the Logit model and the Probit model is the assumption that you make about the probability distribution of your dependent variable. Let y be your dependent variable (dummy: zero-one), and let x and z be your independent variables. Then the general specification in all these three models is
Prob (y = 1) = F (bx + cz), Prob (y = 0) = 1 - F (bx + cz)
where b and c are the coefficients of x and z, and F(.) is some function. The LPM assumes that F(a) = a. The Logit model assumes that F(.) is the logistic cumulative distribution function, while the Probit model assumes that F(.) is the standard normal cumulative distribution function.
To run a Logit regression in Stata, you simply type:
logit depvar varlist
where depvar is your dependent (dichotomous) variable, and varlist is the list of independent variables that enter the model. To run a Probit model, the command is similar:
probit depvar varlist
Suppose that you are trying to estimate the effectiveness of a new drug being tested on cancer patients. Your dependent variable is died (1 if patient died, 0 otherwise), and your independent variables are age (age of patient at start of experiment) and placebo (1 if drug is placebo). Then you obtain the logit and probit estimates by typing logit died age placebo and probit died age placebo, respectively.
Stata reports the maximum likelihood estimates for the original coefficients in both the probit and logit commands. You might be interested in other "coefficients". The dprobit command estimates a Probit model, but reports the change in the probability for an infinitesimal change in each independent variable instead of the coefficient for each independent variable. The logistic command estimates a Logit model but reports the odds ratios for each independent variable instead of the coefficients. After both these commands, you can obtain the original coefficients by simply typing (respectively) probit or logit.
The logistic command also offers several "post-estimation" commands that can be used to evaluate the goodness of fit of the model. These commands are entered after the estimation is performed, and always refer to the "last" logistic estimation performed. See help logistic for a list.
Finally, all the above commands (as in regress) can be applied only to a subset of the observations, can be run without constant, and can generate estimates using robust standard errors.


Prediction
After all estimation commands (i.e. reg, logistic, logit, dprobit, probit) several "predicted" values can be computed. The most important are the "predicted values for the dependent variable" and the "predicted residuals". In order to obtain the predicted values for the dependent variable for any model you have to generate a new variable (call it yhat) that associates to each observation its predicted value, based on the coefficients obtained from the estimation and on the particular values of the independent variables for that observation. Suppose that you run a linear regression and you want to estimate the predicted (ln) average weekly wage for each individual: then you type
reg lnw sex race school age
predict yhat
then a new variable called yhat will be generated by Stata (if such a variable already exists you will get an error message, and you will have to choose another name for the predicted values). In the same way you can build a new variable (call it res) with the predicted "residual" for each observation:
reg lnw sex race school age
predict res, residual
Notice that after the logistic command, several other "predictions" are possible using the lpredict command (see help lpredict).
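As a quick check of what predict produces after a linear regression, the sketch below verifies that the residual is just the observed value minus the predicted value. The data are made up, only two regressors are used to keep the toy dataset small, and the variable check is invented for illustration.

```stata
* Hypothetical toy data: log weekly earnings, schooling and age
clear
input lnw school age
5.1 10 25
5.4 12 30
5.9 16 28
6.2 16 45
5.6 12 50
6.0 14 38
end

reg lnw school age

* Predicted values and residuals from the regression just run
predict yhat
predict res, residual

* Observed minus predicted minus residual should be zero up to rounding
gen check = lnw - yhat - res
summ check
```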


Testing Linear Hypotheses
Here we assume that you are familiar with the theory of hypothesis testing. After each estimation, for each independent variable Stata automatically provides a t-test (for linear regressions) or a z-test (for Logit or Probit models) of the null hypothesis that the "true" coefficient is equal to zero. However, you can perform several other tests of linear hypotheses about the coefficients. Suppose you run a linear regression of (log) average weekly earnings on sex, race, schooling and age. You could test the hypothesis that the coefficient on race is equal to -.05 (i.e. earnings for non-whites are approximately 5% lower than for whites):
reg lnw sex race school age
test race=-.05
You could also be interested in testing hypotheses on several variables. For example you could be interested in testing the null hypothesis that the coefficients on sex and race are jointly equal to zero, or that they sum up to one:
reg lnw sex race school age
test sex race - tests the null that they are jointly equal to zero
test sex + race = 1
If you are interested in non-linear hypothesis testing, you can look at help testnl.
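The three kinds of linear tests described above can be tried on a small made-up dataset. This is only a sketch: the data and the hypothesized values are invented, and only two regressors are used to keep the example short.

```stata
* Hypothetical toy data: log weekly earnings, schooling and age
clear
input lnw school age
5.1 10 25
5.4 12 30
5.9 16 28
6.2 16 45
5.6 12 50
6.0 14 38
end

reg lnw school age

* A single coefficient equal to a given value
test school = .1

* Two coefficients jointly equal to zero
test school age

* A linear combination of coefficients equal to a given value
test school + age = 1
```

Each test command refers to the most recent estimation, so the regression has to be run first.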
