Most of the example programs on this web site and in the accompanying books use a small artificial data set named “mydata.” It’s small enough to enter easily but, more importantly, it lets you see the full effect of many different data management tasks. The data set is a pretend survey of students who attended workshops to learn statistical software. It records which workshop they took, their gender, and their responses to four questions:

q1 — The instructor was well prepared.

q2 — The instructor communicated well.

q3 — The course materials were helpful.

q4 — Overall, I found this workshop useful.

The values for the workshops are 1, 2, 3, and 4 for R, SAS, SPSS, and Stata, respectively. Only the R and SAS workshops appear in the smaller form of the data set.
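The workshop coding can be pictured as a simple lookup table. Here is a minimal sketch in Python, used only for illustration; the names WORKSHOP_LABELS and workshop_label are hypothetical and do not come from the site's own programs.

```python
# Numeric workshop codes and the software each one stands for,
# as described in the text: 1 = R, 2 = SAS, 3 = SPSS, 4 = Stata.
WORKSHOP_LABELS = {1: "R", 2: "SAS", 3: "SPSS", 4: "Stata"}


def workshop_label(code):
    """Return the workshop name for a numeric code.

    A missing or unknown code (represented here as None) stays missing.
    """
    return WORKSHOP_LABELS.get(code)


# The small data set only uses codes 1 and 2.
small_codes = [1, 2, 1, 2, 1, 2, 1, 2]
small_labels = [workshop_label(c) for c in small_codes]
```

Keeping the codes numeric and attaching labels separately mirrors the way the small data set stores workshop as 1 or 2 while the larger one displays names.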

Here is mydata. The unlabeled first column is the row (observation) number:

  workshop gender q1 q2 q3 q4
1        1      f  1  1  5  1
2        2      f  2  1  4  1
3        1      f  2  2  4  3
4        2     NA  3  1 NA  3
5        1      m  4  5  2  4
6        2      m  5  4  5  5
7        1      m  5  3  4  4
8        2      m  4  5  5  5

The letters “NA” stand for Not Available, or missing.
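To make the handling of missing values concrete, here is a minimal Python sketch that reads the eight rows of mydata and treats “NA” as missing. Python is used only for illustration, and the helper names read_mydata and mean_ignoring_missing are hypothetical; the row numbers are dropped and the values are written as comma-separated text for convenience.

```python
import csv
import io

# The eight observations of mydata, without the row-number column.
MYDATA_TEXT = """workshop,gender,q1,q2,q3,q4
1,f,1,1,5,1
2,f,2,1,4,1
1,f,2,2,4,3
2,NA,3,1,NA,3
1,m,4,5,2,4
2,m,5,4,5,5
1,m,5,3,4,4
2,m,4,5,5,5
"""

NUMERIC_COLUMNS = ("workshop", "q1", "q2", "q3", "q4")


def read_mydata(text):
    """Parse the text into a list of dicts, storing "NA" as None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in record.items():
            if value == "NA":
                row[name] = None          # missing value
            elif name in NUMERIC_COLUMNS:
                row[name] = int(value)
            else:
                row[name] = value         # gender stays a string
        rows.append(row)
    return rows


def mean_ignoring_missing(rows, name):
    """Average a numeric column, skipping observations where it is missing."""
    values = [row[name] for row in rows if row[name] is not None]
    return sum(values) / len(values)


mydata = read_mydata(MYDATA_TEXT)
```

With this representation, the fourth observation has None for both gender and q3, and mean_ignoring_missing(mydata, "q3") averages the seven remaining q3 values instead of failing on the missing one.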

The examples that use graphics or statistics use a longer version of this file named mydata100. It has 100 observations, contains values for all four software workshops, includes value labels for workshop and gender, and adds two variables, pretest and posttest. Its first six observations are:

  workshop gender q1 q2 q3 q4 pretest posttest
1        R Female  4  3  4  5      72       80
2     SPSS   Male  3  4  3  4      70       75
3       NA     NA  3  2 NA  3      74       78
4     SPSS Female  5  4  5  3      80       82
5    Stata Female  4  4  3  4      75       81
6     SPSS Female  5  4  3  5      72       77

Hi, in the above set for mydata, are the first two numeric columns covered by the “workshop” header (or column label)? Also, what does the value “5” in the question columns (q1, q2, q3, or q4) represent? Should the values only be 1, 2, 3, or 4? I am interested in this example and hope to learn more!