by Robert A. Muenchen
This article shows you how to enter data so that you can easily open in statistics packages such as R, SAS, SPSS, or jamovi (code or GUI steps below). Excel has some statistical analysis capabilities, but they often provide incorrect answers. For a comprehensive list of these limitations, see http://www.forecastingprinciples.com/paperpdf/McCullough.pdf and http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction.
Simple Data Sets
Most data sets are easy to enter using the following rules.
- All your data should be in a single spreadsheet of a single file (for an exception to this rule, see Relational Data Sets below.)
- Enter variable names in the first row of the spreadsheet.
- Consider the length of your variable names. If you know for sure what software you will use, follow its rules for how many characters names can contain. When in doubt, use variable names that are no longer than 8 characters, beginning with a letter. Those short names can be used by any software.
- Variable names should not contain spaces, but may use the underscore character.
- No other text rows such as titles should be in the spreadsheet.
- No blank rows should appear in the data.
- Always include an ID variable on your original data collection form and in the spreadsheet to help you find the case again if you need to correct errors. You may need to sort the data later, after which the row number in Excel would then apply to a different subject or sampling unit, making it hard to find.
- Position the ID variable in the left-most column for easy reference.
- If you have multiple groups, put them in the same spreadsheet along with a variable that indicates group membership (see Gender example below).
- Many statistics packages don’t work well with alphabetic characters representing categorical values. For example to enter political party, you might enter 1 instead of Democrat, 2 instead of Republican and 3 instead of Other.
- Avoid the use of special characters in numeric columns. Currency signs ($, €, etc.) can cause trouble in some programs.
- If your group has only two levels, coding them 0 and 1 makes some analyses (e.g. linear regression) much easier to do. If the data are logical, use 0 for false, and 1 for true.
If the data represent gender, it’s common to use 0 for female, 1 for male.
- For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, some software (like R) will read the column as character data so you will not be able to, for example, calculate the mean of a column without taking action to address the situation.
- You can enter dates with slashes (8/31/2018) and times with colons (12:15 AM). Note that dates are recorded differently across countries, so make sure you are using a format that matches your locale.
- For text analysis, you can enter up to 32K of text, or about 8 pages, in a single cell. However, if you cut & paste if from elsewhere, remove carriage returns first as they will cause it to jump to a new cell.
Relational Data Sets
Some data sets contain observations that are related in some way. They may be people who all live in the same home, or samples that all came from the same site. There may be higher levels of relations, such as students within classrooms, then classrooms within schools. Data that contains such relations (a.k.a. nesting) may be stored in a “relational” database, but those are harder to learn than spreadsheet software. Relational data can easily be entered as two or more spreadsheets and combined later during data analysis. This saves quite a lot of data entry as the higher level data (e.g. family house value, socio-economic status, etc.) only needs to be entered once, instead of on several lines (e.g. for each family member).
If you have such data, make sure that each data set contains a “key” variable that acts as a common ID number for family, site, school, etc. You can later read two files at a time and combine them matching on that key variable. R calls this combination a join or merge; SAS calls it a merge; and SPSS calls it Add Variables.
Example of a Good Data Structure
This data set follows all the rules for simple data sets above. Any statistics software can read it easily.
Excel Tips for Data Entry
- You can make sure your variable names are always visible at the top of your Excel spreadsheet by choosing View> Freeze Panes> Freeze Top Row. This helps you enter data in the proper columns.
- Avoid using Excel to sort your data. It’s too easy to sort one column independent of the others, which destroys your data! Statistics packages can sort data and they understand the importance of keeping all the values in each row locked together.
- If you need to enter a pattern of consecutive values such as an ID number with values such as 1,2,3 or 1001,1002,1003, enter the first two, select those cells, then drag the tiny square in the lower right corner as far downward as you wish. Excel will see the pattern of the first two entries and extend it as far as you drag your selection. This works for days of the week and dates too. You can create your own lists in Options>Lists, if you use a certain pattern often.
- To help prevent typos, you can set minimum and maximum values, or create a list of valid values. Select a column or set of similar columns, then go to the Data tab, then the Data Tools group, and choose Validation. To set minimum and maximum values, choose Allow: Whole Number or Decimals and then fill in the values in the Minimum and Maximum boxes. To create a list of valid values, choose Allow: List and then fill in the numeric or character values separated by commas in the Source box. Note that these rules only operate as you enter data, they will not help you find improper values that you have already entered.
- The gold standard for data accuracy is the dual entry method. With this method you actually enter all the data twice. Only this method can catch errors that are within the normal range of values, but still wrong. Excel can show you where the values differ. Enter the data first in Sheet1. Then enter it again using the exact same layout in Sheet2. Finally, in Sheet1 select all cells using CTRL-A. Then choose Conditional Formatting> New Rule. Choose “Use a formula to determine which cells to format,” enter this formula:
then click the Format button, make sure the Fill tab is selected, and choose a color. Then click OK twice. The inconsistencies between the two sheets will then be highlighted in Sheet1. You then check to see which entry was wrong and fix it. When you read the data into a statistics package, you will only need to read the data in Sheet1.
- When looking for data errors, it can be very helpful to display only a subset of values. To do this, select all the columns you wish to scan for errors, then click the Filter icon on the Data tab. A downward-pointing triangle will appear at the top of each column selected. Clicking it displays a list of the values contained in that column. If you have entered values that are supposed to be, for example, between 1 and 5 and you see 6 on this list, choosing it will show you only those rows in which you made that error. Then you can fix them. You can also use click on Number Filters to use simple logic to find, for example, all rows with values greater than 5. When you are finished, click on the filter icon again to turn it off.
Save your data frequently and make backup copies often. Don’t leave all your backup copies connected to a computer which would leave them vulnerable to attack by viruses. Don’t store them all in the same building or you risk losing all your hard work in a fire or theft. Get a free account at http://drive.google.com, http://dropbox.com, or http://onedrive.live.com and save copies there.
Steps for Reading Excel Data Into R
There are several ways to read an Excel file into R. Perhaps the easiest method uses the following commands. They read an excel file named mydata.xlsx into an R data frame called mydata. For examples on how to read many other file formats into R, see:
# Do this once to install: install.packages("readxl") # Each time you read a file, follow these steps library("readxl") mydata <- read_excel("mydata.xlsx") mydata
Steps for Reading Excel Data Into SPSS
- In SPSS, choose File> Open> Data.
- Change the “Files of file type” box to “Excel (*.xlsx)”
- When the Read Excel File box appears, select the Worksheet name and check the box for Read variable names from the first row of data, then click OK.
- When the data appears in the SPSS data editor spreadsheet, Choose File: Save as and leave the Save as type box to SPSS (*.sav).
- Enter the name of the file without the .sav extension and then click Save to save the file in SPSS format.
- Next time open the .sav version, you won’t need to convert the file again.
- If you create variable or value labels in the SPSS file and then need to read your data from Excel again you can copy them into the new file. First, make sure you use the same variable names. Next, after opening the file in SPSS, use Copy Data Properties from the Data menu. Simply name the SPSS file that has properties (such as labels) that you want to copy, check off the things you want to copy and click OK.
Steps for Reading Excel Data Into SAS
The code below will read an excel file called mydata.xlsx and store it as a permanent SAS dataset called sasuser.mydata. If your organization is considering migrating from SAS to R, I offer some tips here: http://r4stats.com/articles/migrate-to-r/.
proc import datafile="mydata.xlsx" dbms=xlsx out=sasuser.mydata replace; getnames=yes; run;
Steps for Reading Excel Data into jamovi
At the moment, jamovi can open CSV, JASP, SAS, SPSS, and Stata files, but not Excel. So you must open the data in Excel and Save As a comma separated value (CSV) file. The ability to read Excel files should be added to a release in the near future. For more information about the free and open source jamovi software, see my review here: