SPSS Practical 1 – Data Entry & Descriptive Statistics

In this exercise we will read a data file saved as an Excel file (.xls) into SPSS and then produce some descriptive statistics and appropriate graphs.

DATA ENTRY EXERCISE

development.xls is an Excel data file that contains data from a cross-sectional study of child development performed in 1995. Two hundred five-year-olds were recruited to the study and information was collected on their characteristics at study entry. These include physical characteristics such as height, developmental characteristics such as the Kaufman assessment battery for children, and socioeconomic variables such as mother’s qualifications. A full list of variables in the dataset is given below. Note the numerical codes for the categorical variables.
Group           Variable                                   Units / coding
Physical        Height                                     cm
Physical        Weight                                     kg
Physical        Head circumference                         cm
Physical        Sex                                        0 if male, 1 if female
Developmental   Kaufman assessment battery for children    A global measure of IQ. Must be >0. Normal range is 105 ± 15
Developmental   Number of pictures recognised              Out of 30 pictures shown
Socioeconomic   Mother’s qualifications                    GCSE / Higher than GCSE
Socioeconomic   Number of rooms in home
Socioeconomic   Parents’ marital status                    0 = Married, 1 = cohabiting, 2 = separated, 3 = single
Socioeconomic   Parents’ joint income                      0 = […]

In SPSS, click ‘File → Open → Data’. Change ‘Files of type’ to ‘Excel (.xls)’ and navigate to the location where you saved development.xls to open the data file for use. A prompt window will appear asking whether you wish variable names to be read from the top row. Make sure this box is ticked, because the first row of the Excel file contains the variable names.

TWO SEPARATE DATA VIEWS

After you have opened development.xls in SPSS, you will see the actual data. This is ‘data view’. The data structure is maintained: one row per child and one column per variable. Click on the tab at the bottom left corner of SPSS to switch to ‘variable view’. This view gives information about the dataset, such as the variable type, number of decimal places displayed and value labels. Note that SPSS has automatically set all variables to be scale (numerical) variables.

We must now correctly define our variables as either nominal
(categorical), ordinal (ordered categorical) or scale (numerical), and assign labels to our nominal (categorical) variables in variable view before we do anything else. This must be done whenever a data file has been read in from Excel, because SPSS needs to know the type of data in the file in order to treat it appropriately.

First we define Sex as a nominal variable. Look at the row corresponding to the sex variable and go to the ‘Measure’ column. Click on the word ‘Scale’ and select ‘Nominal’ from the drop-down list that appears. Define all the other nominal variables in the dataset in the same way.

To assign labels to the values of the sex variable, look at the row corresponding to the sex variable, go to the ‘Values’ column and click on the word ‘None’. Click on the button that appears in the cell. Type in the value 0 and the corresponding label ‘male’, then click ‘Add’. Similarly, enter the value 1 with the label ‘female’. Click ‘Add’, then ‘OK’. Add labels for the other categorical variables. Make sure all the numerical variables are defined as ‘Scale’ variables.

SUMMARISING CATEGORICAL DATA NUMERICALLY

We wish to obtain some description of the dataset. First of all, let’s look at the categorical variables: sex, mother’s qualifications, number of rooms in home, parents’ marital status and parents’ joint income. Click ‘Analyze → Descriptive Statistics → Frequencies’. In the box that appears you will see a list of the variables. Select the categorical variables and move them over to the box on the right using the arrow button. To summarise frequencies and percentages, click ‘OK’. Look at the results in the output viewer.

Suppose we want to know how many males have parents whose joint income is less than £20,000. Click ‘Analyze → Descriptive Statistics → Crosstabs’. Move ‘Sex’ into the ‘Row’ box and ‘parent_income’ into the ‘Column’ box. Click ‘OK’ and SPSS will cross-classify your data by the variables selected.
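The counting behind the Frequencies and Crosstabs tables can be sketched outside SPSS in a few lines of Python. This is only an illustration under stated assumptions: the records are made up rather than taken from development.xls, the parent_income codes are placeholders, and the helper names frequencies and crosstab are hypothetical.

```python
from collections import Counter

# Hypothetical (sex, parent_income) records; NOT the real development.xls data.
# sex: 0 = male, 1 = female (as in the variable list).
# parent_income: placeholder codes 0, 1, 2 used for illustration only.
records = [(0, 0), (0, 1), (1, 0), (0, 0), (1, 2), (1, 1), (0, 2), (1, 0)]


def frequencies(values):
    """Return {value: (count, percent)}, the content of one frequency table."""
    counts = Counter(values)
    total = len(values)
    return {v: (n, 100 * n / total) for v, n in sorted(counts.items())}


def crosstab(pairs):
    """Count each (row value, column value) combination."""
    return Counter(pairs)


sex_table = frequencies([sex for sex, _ in records])
table = crosstab(records)

for value, (n, pct) in sex_table.items():
    print(f"sex={value}: n={n} ({pct:.1f}%)")

# One cell of the cross-classification: males (sex 0) in income category 0.
print("males in income category 0:", table[(0, 0)])
```

Each cell of the cross-tabulation is simply the number of children with that combination of row and column values, so the cell counts add up to the total number of children.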
SUMMARISING CONTINUOUS DATA NUMERICALLY

To summarise continuous data, we want a measure of location (the mean or the median) and a corresponding measure of dispersion (the standard deviation or the quartiles, respectively).

To display the mean and SD, click ‘Analyze → Descriptive Statistics → Descriptives’. As before, you will see a variable list. Select the numerical variables you wish to describe, move them to the box on the right and click ‘OK’.

To display the median (50th percentile) and quartiles (25th and 75th percentiles), click ‘Analyze → Descriptive Statistics → Frequencies’. Choose the variable you are summarising and click on the ‘Statistics’ button. Tick the ‘Percentiles’ box, type 25 into the adjacent box and click ‘Add’. Now type 75 into the box and click ‘Add’ again. SPSS will display the 25th and 75th percentiles, which together give the IQR. Also tick the box next to ‘Median’ so that the median is reported. Click ‘Continue’, then ‘OK’ to get the output.
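To see what these summaries are made of, the same quantities can be computed with Python's standard statistics module. This is a minimal sketch, not part of the SPSS workflow: the head circumference values are invented for illustration, and software packages may use slightly different rules for interpolating quartiles, so results can differ a little from one program to another.

```python
import statistics

# Hypothetical head circumferences in cm; NOT the real development.xls values.
head_cm = [49.5, 50.0, 50.5, 51.0, 51.0, 51.5, 52.0, 52.5, 53.0]

# Location and dispersion for approximately normal data.
mean = statistics.mean(head_cm)
sd = statistics.stdev(head_cm)  # sample SD (n - 1 in the denominator)

# Location and dispersion for skewed data: median and quartiles.
# quantiles(..., n=4) returns the 25th, 50th and 75th percentiles.
q1, median, q3 = statistics.quantiles(head_cm, n=4)
iqr = q3 - q1

print(f"mean (SD): {mean:.2f} ({sd:.2f})")
print(f"median (IQR): {median} ({q1} to {q3}, width {iqr})")
```

The IQR is often reported as the pair of quartiles rather than as their difference, which is why both forms are printed here.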
MEAN AND STANDARD DEVIATION OR MEDIAN AND INTERQUARTILE RANGE?

The mean and standard deviation are a good summary of a continuous variable if it follows an approximately normal distribution.
You can verify whether this holds by producing a histogram of your data. Click ‘Graphs → Legacy Dialogs → Histogram’. Move the name of the variable you wish to summarise into the ‘Variable’ box and click ‘OK’. This should produce a histogram. Look at the shape for your chosen variable and decide whether you think it follows a normal distribution. If the data follow a normal distribution then the mean and SD are the best statistics to summarise this variable. If the variable does not follow a normal distribution then the median and IQR should be preferred.

SUMMARISING DATA GRAPHICALLY

Bar charts are not a good summary of categorical variables because they contain no more information than a table, and sometimes less (whereas a histogram is more informative than summary numbers). However, if you wish to produce a bar chart, go to ‘Graphs → Legacy Dialogs → Bar’. Select the categorical variable you wish to display and place it in the ‘Category Axis’ box.

For the final graphical summary we will compare two continuous variables using a scatterplot. Find the scatterplot dialog box in the ‘Graphs’ menu. To create a scatterplot, place the two variables you want to compare (for instance weight and height) on the horizontal (X) and vertical (Y) axes. Click ‘OK’.

QUESTIONS:

1. What is the number and % of females in this group?
2. How many children have parents who are separated?
3. What is the most common number of rooms in a house?
4. What is the mean (SD) for each of height and weight?
5. What is the median (IQR) head circumference?
6. Is the Kaufman score approximately normally distributed?
7. What is the mean (SD) Kaufman score?
8. From the scatterplot, what is the relationship between height and head circumference?