Introduction
The SAS system is available at NAU for Unix and Windows. These versions
of SAS are similar, but they have some important differences, which are
outlined in this document and in the machine-specific documents. This
document is a brief introduction to using SAS on the different systems
on campus. To use the full potential of SAS, you should acquire and read
the SAS Reference Manuals, which are available on the main NAU Stats Web
Page. Manuals may also be ordered through the NAU Bookstore.
SAS runs in two modes: batch (non-interactive) and interactive (under
Windows or X on Unix). Batch mode allows you to issue SAS commands from
a file. This file may contain your data and should contain all of the
statistical commands that you want to run on your data.
To start a SAS batch job on Unix, type:
sas command-file
where command-file is a file consisting of the SAS commands that you want
to execute.
This document covers execution in batch mode only. You will need to
know how to use the Unix operating system to use SAS in batch mode.
The Data Step
The data step is used to describe your data to the SAS system. It
is also used to create new variables from your raw data. The following
is a sample data step:
DATA A1 (LABEL='1984 General Social Survey');
INFILE gss;
input id race age sex marital satjob hapmar life postlife educ
degree paeduc
maeduc speduc income82 health hompop agewed sibs;
The header line for the data step begins with the keyword DATA. The next
item after DATA is the name of the data set that you will be creating.
This is a temporary data set and will be deleted after the run completes.
Options are supplied in parentheses at the end of the DATA statement.
In this example, a label for the data set is included.
In the data step, the data file is specified by the INFILE statement.
In the example given above, gss should be replaced by the name of the file
that contains the data. An example of this on Unix is as follows:
DATA A1 (LABEL='1984 General Social Survey');
INFILE "~wew/Files/gss84.dat";
Note that each statement (command, not line) in SAS must be terminated
with a semicolon (;). Please refer to the machine-specific documentation
for a description of how to use the INFILE statement.
The input statement is used to specify the variables on the input record.
This statement is used to control how the data is to be read in by SAS.
In the example above no special controls are given. The data points
in this example are separated by spaces and so will be read in, one variable
at a time. If your data does not have spaces between the data points
(variables) then special formatting options would need to be included.
The standard format for input is as follows (anything in brackets is optional):
INPUT variable [$] [startcolumn[-endcolumn]] [.decimals];
where variable is the name of the variable to be input, $ indicates
that the variable is character instead of numeric (the default),
startcolumn is the column where the variable begins, endcolumn is the
column where the variable ends, and .decimals is the default number of
decimal places for the variable. Data values may be either numeric or
character. Examples of character data include a person's name or the name
of a state or country. In our example the data is blank-separated, so we
do not need any special format controls.
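The column form of the INPUT statement described above can be sketched as follows. The data set name, variable names, column positions and data lines are all hypothetical. With .2 on income, a field that contains no explicit decimal point is read with two implied decimal places.

```sas
data people;
    /* name in columns 1-10, state in 12-13, income in 15-21 */
    /* .2 = two implied decimals when no decimal point is present */
    input name $ 1-10 state $ 12-13 income 15-21 .2;
    cards;
Smith      AZ 1234567
Jones      NM  987654
;
run;

proc print data=people;
run;
```

Here the first record would give income the value 12345.67, because the seven digits in columns 15-21 carry no decimal point of their own.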
It is also possible to read in multiple records per case. The forward
slash "/" is used to tell SAS to advance to the next line. For example
if the raw data has two records per case with ID, SEX and AGE on record
1 and questions one through five on record 2, we could use the following
command to read the data correctly:
input id sex age / q1 q2 q3 q4 q5;
Missing Data
Missing data in SAS is normally represented by either a space or a period
in your raw data set. If you are using list input (no column or special
input control), you must use a period to indicate a missing value.
If your missing values are coded as a special number (for example, 9), you
can use an IF statement to tell SAS that the code represents a missing value:
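A minimal sketch of such a statement, assuming the variable is named questn1 and that 9 is the code used for a missing answer (both are assumptions, as is the fileref surv, which must already be defined):

```sas
data survey;
    infile surv;                 /* assumed fileref for the raw data */
    input id questn1;
    /* Assumption: 9 was keyed in when no answer was given */
    if questn1 = 9 then questn1 = .;
run;
```

The period on the right-hand side is the standard SAS numeric missing value.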
You may also use the MISSING statement to indicate special missing values.
This will allow you to have special character codes to represent specific
types of missing values. For example, if you are performing a survey
and you want to track how many people refused to answer a question versus
how many were not home you could use the following:
data survey;
missing r x;
infile surv;
input id questn1;
The letters R and X in the raw data are now read as special missing values
(.R and .X) rather than as invalid data.
Creating and recoding variables
Creating new variables under SAS is a very simple procedure. SAS
allows standard algebraic equations to be used in the Data Step. For example,
if you have the radius of an object in your input and want to calculate
the circumference as a new variable, you could use the following:
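A minimal sketch of such an assignment, assuming the input variable is named radius and the new variable circum (both names and the data lines are illustrative), with pi approximated by a literal:

```sas
data circles;
    input radius;
    /* circumference = 2 * pi * r */
    circum = 2 * 3.14159 * radius;
    cards;
1
2.5
;
run;
```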
All of the standard arithmetic operators are included and also many comparison
operators (please refer to your SAS manual for a full explanation).
Comparison statements may also be used to create new variables. If
your data set has the day of the week as a character string and you wish
to create a numeric code for each day you could try the following:
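A minimal sketch of this kind of comparison, assuming the character variable is named day and the new numeric variable daynum (hypothetical names and data). The LENGTH statement is included because a character variable read with list input defaults to 8 characters, which would truncate 'Wednesday'. Character comparisons are case-sensitive.

```sas
data days;
    length day $ 9;              /* long enough for 'Wednesday' */
    input day $;
    if day = 'Monday' then daynum = 1;
    else if day = 'Tuesday' then daynum = 2;
    else if day = 'Wednesday' then daynum = 3;
    /* one comparison per remaining day would follow here */
    cards;
Monday
Wednesday
;
run;
```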
An IF statement would be necessary for each day of the week. You
may also use the IF statement to collapse or recode your data. Here
is an example for producing an ordinal range from an interval scale:
if 0<age<10 then age=1;
else if 10<=age<25 then age=2;
else if age>=25 then age=3;
Writing Raw Data Files
Creating a new raw data file from input data will be slightly different
on each of the SAS systems. In general you will need to use two statements
to create the new data file. The first statement is the FILE statement.
This statement is used to indicate the file reference (fileref) for the
output file. The second statement is the put statement. The
put statement "puts" an output record to the file referenced in the FILE
statement. If you want to create a subset of your whole dataset,
the if statement used in conjunction with a put may be used. The following
is an example:
file gssfem;
if sex=2 then put id race age educ income82;
This will create a new file containing the variables id, race, age, educ
and income82 for all cases where sex equals 2.
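Put together as a complete step, the two statements might be used as in the sketch below. The FILENAME path is hypothetical, and the data set a1 is assumed to exist from an earlier DATA step. DATA _NULL_ is used so that the step writes the raw file without creating another SAS data set.

```sas
filename gssfem 'gssfem.dat';    /* hypothetical output file */

data _null_;
    set a1;                      /* assumed existing data set */
    file gssfem;
    /* write one record for each case where sex equals 2 */
    if sex = 2 then put id race age educ income82;
run;
```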
Creating SAS Data Sets
A SAS data set is a special data file that SAS creates in its own format.
The file contains all of your data and information (variable names,
missing values, labels, etc.) about your data. A SAS data set is created
whenever you run a DATA step. The name given after the DATA keyword is
the name of the data set that you will create. By default this data set
is temporary and will be deleted when the program completes. To create
a permanent data set you must use a two-level name, that is, two names
separated by a period (e.g. food.prices). For machine-specific information
on creating permanent SAS data sets, please refer to your SAS manuals.
The following commands will create a new SAS data set:
data gss.d84;
infile gss84;
input id age sex income84;
Once a permanent SAS data set has been saved, the SET statement is used to
access it. For example, if you have a SAS data set named GSS.D84, you
would use the following to access that data:
data a1;
set gss.d84;
run;
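In a two-level name, the first level is a library reference (libref) that points to a directory, and it is normally assigned with a LIBNAME statement before it is used. A minimal sketch on Unix, in which both paths are hypothetical:

```sas
libname gss '~/sasdata';                /* hypothetical directory for permanent data sets */
filename gss84 '~/Files/gss84.dat';     /* hypothetical raw data file */

data gss.d84;                           /* permanent: stored in the gss library */
    infile gss84;
    input id age sex income84;
run;

data a1;                                /* temporary copy for this run */
    set gss.d84;
run;
```

The exact form of the path depends on the machine, so the machine-specific documentation remains the reference for it.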
Producing Frequency Tables
The FREQ procedure is used to produce one-way to n-way frequency
and crosstabulation tables. PROC FREQ will produce frequency counts and
percentages, chi-square tests, Fisher's exact test, and the phi and
Cramer's V statistics. The procedure has the following format:
PROC FREQ;
TABLES varlist;
[BY varlist];
where varlist is the list of variables to use. The BY statement allows
you to group your output by the variables listed. One-way tables
are produced by listing the variable names separated by spaces.
Figure 3 shows sample output for a one-way table.
To obtain a multiway table, an asterisk (*) is used to separate the
variable names. This produces a crosstabulation. Figure 4 shows output
from the following command:
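A crosstabulation request of this kind might look like the sketch below. The variables sex and race and the data set a1 are borrowed from the earlier examples as assumptions, so this is not necessarily the command that produced Figure 4.

```sas
proc freq data=a1;
    /* the asterisk requests a two-way table of sex by race */
    tables sex*race;
run;
```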
Producing Correlations
Correlations may be produced using PROC CORR. PROC CORR also produces
certain univariate statistics such as standard deviations, means
and sums. The format for the CORR procedure is:
PROC CORR;
VAR varlist;
where varlist is the list of variables to include in the correlation. You
may also use the BY statement to group your output. Figure 5 shows
output from the following command:
proc corr;
var sex income82;
Descriptive and Univariate Statistics
The following is a list of the statistics produced by PROC UNIVARIATE:
mean, sum, standard deviation, variance, skewness, kurtosis, sum of the
weights, maximum, minimum, range, quartiles, percentiles, median, mode,
signed ranks, and tests for normality. The format for this procedure is:
PROC UNIVARIATE;
VAR varlist;
where varlist is the list of variables to test. You may also group
the output with the BY statement. Figure 6 shows output for the following
command:
proc univariate;
var income82;
by sex;
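BY-group processing expects the data set to be sorted by the BY variables. A minimal sketch, assuming the data set a1 from the earlier examples, sorts first and then runs the procedure:

```sas
proc sort data=a1;
    by sex;                      /* order the cases by the BY variable */
run;

proc univariate data=a1;
    var income82;
    by sex;                      /* one set of statistics per value of sex */
run;
```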
Labeling Output
The following SAS command file shows an example of labeling your output.
proc format;
value abnum 1 = 'yes' 2 = 'no';
value $abchar 'h' = 'hi' 'g' = 'goodbye';
run;
data a1;
input a b $;
cards;
1 h
2 g
3 h
4 i
;
run;
proc freq;
tables a*b;
title 'test of numeric and character value transformations';
format a abnum.;
format b $abchar.;
run;
The following shows sample output from the above commands:
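test of numeric and character value transformations

Table of a by b

| a     | Statistic | goodbye | hi     | i      | Total  |
| yes   | Frequency |       0 |      1 |      0 |      1 |
|       | Percent   |    0.00 |  25.00 |   0.00 |  25.00 |
|       | Row Pct   |    0.00 | 100.00 |   0.00 |        |
|       | Col Pct   |    0.00 |  50.00 |   0.00 |        |
| no    | Frequency |       1 |      0 |      0 |      1 |
|       | Percent   |   25.00 |   0.00 |   0.00 |  25.00 |
|       | Row Pct   |  100.00 |   0.00 |   0.00 |        |
|       | Col Pct   |  100.00 |   0.00 |   0.00 |        |
| 3     | Frequency |       0 |      1 |      0 |      1 |
|       | Percent   |    0.00 |  25.00 |   0.00 |  25.00 |
|       | Row Pct   |    0.00 | 100.00 |   0.00 |        |
|       | Col Pct   |    0.00 |  50.00 |   0.00 |        |
| 4     | Frequency |       0 |      0 |      1 |      1 |
|       | Percent   |    0.00 |   0.00 |  25.00 |  25.00 |
|       | Row Pct   |    0.00 |   0.00 | 100.00 |        |
|       | Col Pct   |    0.00 |   0.00 | 100.00 |        |
| Total | Frequency |       1 |      2 |      1 |      4 |
|       | Percent   |   25.00 |  50.00 |  25.00 | 100.00 |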
Conclusion
Each machine that SAS runs on has specific commands and file access
methods that are unique to that machine. Documentation is available for
the systems on campus. If you have any questions, please call Academic
and Computing Services at x1511.