Pre Data Analysis Activities

Once you obtain survey results, the fun begins. Note, however, that analyzing and reporting survey data can be challenging and time-consuming. If results are not processed, analyzed, and reported properly, they may be misleading or inaccurate. There are several issues to consider and a few things to do before you begin the actual data analysis, interpretation, and reporting.

Before data analysis can take place, the information you obtained must be cleaned and organized. Cleaning data refers to the process of removing irrelevant data (as in the case where online surveys add variables to facilitate the survey's function), possibly de-identifying the responses (as required by IRB protocols), or coding open responses (see allowing "other" responses). Cleaning data is needed prior to examining response patterns and identifying incomplete surveys. The data may also require reorganizing (e.g., collapsing categories or performing summary calculations). For example, you may wish to report the percentage of those who "agree" or "strongly agree" with a statement in comparison to those who "disagree" or "strongly disagree." The precise cleaning needs will be determined by the purposes and reporting needs of the survey.
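As a rough illustration of collapsing categories, the sketch below groups five response options into three reporting categories and computes the percentage in each. The `COLLAPSE` mapping, the function name, and the sample responses are hypothetical and not taken from any particular survey; blank answers are assumed to be excluded from the denominator, which may not suit every reporting need.

```python
from collections import Counter

# Hypothetical mapping from five response options to three reporting categories.
COLLAPSE = {
    "strongly agree": "agree",
    "agree": "agree",
    "neutral": "neutral",
    "disagree": "disagree",
    "strongly disagree": "disagree",
}

def collapsed_percentages(responses):
    """Return the percentage of answered responses in each collapsed category."""
    # Normalize text and drop blanks so missing answers are not counted.
    cleaned = [r.strip().lower() for r in responses if r and r.strip()]
    counts = Counter(COLLAPSE[r] for r in cleaned)
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

# Made-up responses to a single item, including one blank answer.
item_responses = ["Agree", "Strongly agree", "Disagree", "", "Neutral", "agree "]
print(collapsed_percentages(item_responses))
```

A response label that is missing from the mapping raises a `KeyError` here, which is one simple way to surface unexpected values during cleaning.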

Once the data is cleaned, a review of the data is needed to ensure no systematic survey nonresponse patterns have occurred (see Response Rate Issues). Even when random sampling is used, you may have a problem if response refusal patterns indicate that an important group of potential respondents failed to complete the survey. Key indicators may include age, gender, or race patterns that do not match the expected demographics. However, other response refusal problems may also cause a lack of generalizability; for example, location overrepresentation (e.g., urban vs. rural) or group underrepresentation (e.g., freshmen vs. seniors). The degree to which this will be a problem depends on the degree to which the sample provides a reasonable representation of the population. Without pertinent information about respondents, you will not be able to conduct this investigation. Therefore, it is important to identify key factors and plan to obtain these details prior to administering the survey. If an issue is identified, you may need to get additional responses from underrepresented individuals to solve the problem.
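One simple way to look for under- or overrepresentation is to compare each group's share of the respondents with its known share of the population. The sketch below assumes the population shares are available (for example, from enrollment records); the function name and all numbers are made up for illustration.

```python
def representation_gaps(sample_counts, population_shares):
    """Return sample share minus population share, in percentage points, per group."""
    total = sum(sample_counts.values())
    return {
        group: round(100 * sample_counts.get(group, 0) / total - 100 * share, 1)
        for group, share in population_shares.items()
    }

# Made-up numbers: class standing of respondents vs. known enrollment shares.
sample_counts = {"freshman": 20, "sophomore": 45, "junior": 55, "senior": 60}
population_shares = {"freshman": 0.30, "sophomore": 0.25, "junior": 0.25, "senior": 0.20}

gaps = representation_gaps(sample_counts, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.1f} percentage points")
```

A large negative gap suggests a group from which additional responses may be needed. How large a gap matters is a judgment call that depends on the purposes of the survey.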

One of the first things you should calculate once a survey has been administered and the results obtained is the response rate. The calculation is simple: the response rate is the proportion of invited participants who return a usable survey. This is reported as a percentage but should always be accompanied by the size of the sample. A 66% response rate obtained from a sample composed of three individuals wouldn’t provide very compelling evidence. However, a response rate of this magnitude obtained from a much larger sample would be much more impressive. You should also indicate how the sample was obtained when reporting the sample size (and the population size as well, if it is known). The sampling procedures will have been detailed in the methods section, but you should briefly report them (as a reminder) when presenting results (e.g., a random sample of three individuals).

$$Response\>Rate = \frac{Number\> of \>Usable\>Surveys\>Returned}{Number\>of\>Invites}$$

The only consideration (possible controversy) surrounding the response rate calculation is deciding what it means to complete the survey and whether information obtained from partially completed surveys might be usable. In order to make this determination, a careful examination of the survey results is needed. While there are no hard and fast rules, there are some principles that may help you make this determination. In situations where a returned survey is unusable, it should not be included in the response rate calculation. This will also affect the margin of error calculation.
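The response rate formula can be sketched in a few lines. The counts below are invented for illustration, and the function name is hypothetical; unusable surveys are subtracted before dividing, following the principle described above.

```python
def response_rate(usable_returned, invited):
    """Response rate as a percentage: usable surveys returned / invitations sent."""
    if invited <= 0:
        raise ValueError("invited must be a positive number")
    return 100 * usable_returned / invited

# Made-up counts: 600 invitations, 195 surveys returned, 15 judged unusable.
invited = 600
returned = 195
unusable = 15
usable = returned - unusable

rate = response_rate(usable, invited)
print(f"Response rate: {rate:.1f}% ({usable} usable of {invited} invited)")
```

Printing the counts alongside the percentage keeps the sample size attached to the rate when it is reported.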

**Requisite Data.** A guiding principle for making an inclusion decision is to determine what information is essential in order to answer the research question. Sometimes a partially completed survey can be used to answer some of the research questions. Other times, missing a single item on the survey will render the information unusable. For example, if one section of the survey was completed and not another, part of the information provided might be usable. However, the information provided in a partially completed survey would likely need to be excluded if, for example, the respondent only completed the demographics section but nothing else, or they failed to provide vital grouping information required to disaggregate the data and answer the research question.

**Accurate Data.** Another factor that should be considered regarding the completeness of the survey is that of accuracy. Unfortunately, there are times when participants are not completely honest in the way they answer questions on a survey. You may not know how accurate the information provided will be, but you can get indicators that the information is inaccurate. For example, suppose that while pilot testing the survey you determine that a survey typically takes 10 to 15 minutes to read, reflect, and answer all the questions. Then suppose a participant completed the entire survey in only two minutes. You might suspect the results to be inaccurate and quite possibly unusable.

Inaccurate (unusable) data might also be identified by examining a participant’s response pattern. For example, the results might be suspect if a respondent provided the same response for every question, even though it would be extremely unlikely that an honest respondent would actually answer in that way. Random response bias like this is more likely when incentives are provided to individuals for taking the survey. If trigger items were added to identify suspect response patterns, these would need to be examined (see random response bias).
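The two accuracy checks described above (implausibly fast completion and identical answers on every item) could be screened for automatically before a manual review. This is only a sketch: the record layout, the function name, and the 300-second cutoff are assumptions chosen for illustration, and a flagged survey still needs a human judgment about whether it is usable.

```python
def flag_suspect(response, min_seconds=300):
    """Return a list of reasons a response looks suspect (empty list if none)."""
    reasons = []
    # Speeding: finished far faster than pilot testing suggested is plausible.
    if response["seconds"] < min_seconds:
        reasons.append("completed too quickly")
    answers = response["answers"]
    # Straight-lining: every scale item received the identical answer.
    if len(answers) > 1 and len(set(answers)) == 1:
        reasons.append("same answer for every item")
    return reasons

# Made-up records: completion time in seconds and answers to six scale items.
responses = [
    {"id": "r1", "seconds": 720, "answers": [4, 2, 5, 3, 4, 1]},
    {"id": "r2", "seconds": 120, "answers": [3, 3, 3, 3, 3, 3]},
    {"id": "r3", "seconds": 650, "answers": [5, 5, 5, 5, 5, 5]},
]
for r in responses:
    print(r["id"], flag_suspect(r))
```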

Knowing the response rate is an important piece of information, but what constitutes an acceptable response rate is debatable. Ultimately, you need to feel confident that the information you obtained will adequately serve the purposes of the survey and answer the research questions. For some research purposes, only near-perfect response rates with no perceivable systematic response refusal patterns would be acceptable. In other situations, a much smaller response rate might suffice. Response rates for published social science research can range from 25% to 75%, and a 30% response rate is typical (Baruch & Holtom, 2008). It is up to the researcher to persuasively argue that the response rate obtained is sufficient.

For some surveys, you may need to calculate the margin of error. You can calculate a margin of error for each of the values obtained from the survey. The margin of error is an estimate of the amount of error we might expect for each outcome. In practice, the margin of error is a confidence interval. Any statistic we obtain from a survey is an estimate that includes some amount of error. We don’t actually know what the real (true) value is, but we can be somewhat confident that the true value will fall within a specific range based on an estimate of the standard error (SE) and a specified confidence level (z).

In order to calculate the margin of error for a result represented by a proportion, we need three values: the confidence level, the sample size (i.e., number of completed surveys with usable data), and the sample proportion. If the population size is known (i.e., you have a finite population), the formula can be adjusted to account for any error that might occur from using a sample instead of taking a census. The modified formula (using the Finite Population Correction, or FPC) assumes you know the population size.

$$ Standard\>Error =\sqrt\frac{p(1-p)}{n}$$

$$ Margin\>of\>Error = Z\sqrt\frac{p(1-p)}{{n}}$$

$$ Margin\>of\>Error_{(fpc)} = Z\sqrt\frac{p(1-p)}{n}\sqrt\frac{N-n}{N-1}$$

- p = the sample proportion
- n = sample size (number of usable surveys)
- N = population size
- Z = z-value representing the desired confidence level
    - 2.576 for 99% level of confidence
    - 1.96 for 95% level of confidence
    - 1.645 for 90% level of confidence
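The z-values listed above are two-sided critical values of the standard normal distribution. If a different confidence level is needed, they can be reproduced with Python's standard library, as in this small sketch (the function name `z_value` is just an illustrative choice).

```python
from statistics import NormalDist

def z_value(confidence_level):
    """Two-sided critical z-value for a confidence level such as 0.95."""
    # Split the leftover probability evenly between the two tails.
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.99, 0.95, 0.90):
    print(f"{level:.0%}: z = {z_value(level):.3f}")
```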

Continuing with the counseling services example from the beginning of this book, let's say you obtain a result where 84% of respondents selected a specific option on one item (p=.84). This means 16% selected a different option (1-p or .16). Suppose we decided to use a 95% confidence level, which would make Z=1.96. Given a sample size of n=180, the standard error is .0273, and multiplying it by Z gives a margin of error of .0536, or 5.36 percentage points. Given this margin of error, we can say the result is assumed accurate within plus or minus 5.36 percentage points with a 95% confidence level. However, adjusting for the fact that this is a finite population (N=5000), an adjusted estimate suggests the margin of error might actually be 5.26. These estimates are quite close, at roughly 5.3 percentage points. This means the true value might reasonably be anywhere between 78.7 and 89.3 percent (84 ± 5.3%).

$$ Margin\>of\>Error = 1.96\sqrt\frac{.84(1-.84)}{180} = 1.96(.0273) =\>.0536\>or\> ±5.36$$

$$ Margin\>of\>Error_{(fpc)} = .0536\sqrt\frac{5000-180}{5000-1} =\>.0526\>or\> ±5.26$$
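The proportion formulas can be checked with a short script. This sketch evaluates the margin of error formula (Z times the standard error) for p = .84, n = 180, Z = 1.96, and N = 5000; the function and parameter names are illustrative choices rather than part of any standard library.

```python
import math

def margin_of_error_proportion(p, n, z=1.96, population_size=None):
    """Margin of error for a proportion; applies the FPC when population_size is given."""
    moe = z * math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        # Finite population correction: sqrt((N - n) / (N - 1)).
        moe *= math.sqrt((population_size - n) / (population_size - 1))
    return moe

p, n, N = 0.84, 180, 5000
standard_error = math.sqrt(p * (1 - p) / n)
moe = margin_of_error_proportion(p, n)
moe_fpc = margin_of_error_proportion(p, n, population_size=N)

print(f"Standard error: {standard_error:.4f}")
print(f"Margin of error: {moe:.4f}")
print(f"Margin of error (FPC): {moe_fpc:.4f}")
print(f"Interval: {100 * (p - moe_fpc):.1f}% to {100 * (p + moe_fpc):.1f}%")
```

Printing the standard error separately makes it easier to see that the margin of error is the standard error multiplied by Z, not the standard error itself.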

In order to calculate the margin of error for a result represented by a mean, we need three values: the confidence level, the sample size (i.e., number of completed surveys with usable data), and the standard deviation of the responses. If the population size is known (i.e., you have a finite population), the formula can be modified using the FPC.

$$ Standard\>Error =\frac{σ}{\sqrt{n}}$$

$$ Margin\>of\>Error= Z\frac{σ}{\sqrt{n}}$$

$$ Margin\>of\>Error_{(fpc)} = Z\frac{σ}{\sqrt{n}}\sqrt\frac{N-n}{N-1}$$

- σ = standard deviation of the responses (estimated from the sample)
- n = sample size (number of usable surveys)
- N = population size
- Z = z-value representing the desired confidence level

When the survey result is a mean rather than a proportion, the standard error calculation uses the standard deviation of the responses. Suppose you asked people how many days each week they experience feelings of depression, and you determine that the average response was two days with a standard deviation of 2.8 (σ = 2.8). Using a 95% confidence level and given a sample size of n=180, the margin of error would be 0.409. Given this margin of error, we can say the result is assumed accurate within plus or minus 0.409 days at a 95% confidence level. However, adjusting for the fact that this is a finite population (N=5,000), an adjusted estimate suggests the margin of error might actually be 0.402. These estimates again are quite close, around 0.4 days. This means the true value might reasonably be anywhere between 1.6 and 2.4 days each week (2 ± 0.4).

$$ Margin\>of\>Error= 1.96\frac{2.8}{\sqrt{180}} = .409$$

$$ Margin\>of\>Error_{(fpc)} = .409\sqrt\frac{5000-180}{5000-1} = .402$$
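The same calculation for a mean can be sketched as follows, using the values from the example (a standard deviation of 2.8, n = 180, Z = 1.96, N = 5000). The function and parameter names are again illustrative assumptions.

```python
import math

def margin_of_error_mean(sd, n, z=1.96, population_size=None):
    """Margin of error for a mean; applies the FPC when population_size is given."""
    moe = z * sd / math.sqrt(n)
    if population_size is not None:
        # Finite population correction: sqrt((N - n) / (N - 1)).
        moe *= math.sqrt((population_size - n) / (population_size - 1))
    return moe

mean_days, sd, n, N = 2.0, 2.8, 180, 5000
moe = margin_of_error_mean(sd, n)
moe_fpc = margin_of_error_mean(sd, n, population_size=N)

print(f"Margin of error: {moe:.3f} days")
print(f"Margin of error (FPC): {moe_fpc:.3f} days")
print(f"Interval: {mean_days - moe_fpc:.1f} to {mean_days + moe_fpc:.1f} days")
```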

- Data cleaning is required prior to starting the data analysis process.
- If results are not processed, analyzed, and reported properly, the results may be misleading and possibly inaccurate.
- Data should be reviewed to identify unusable surveys and explore the possibility of any systematic response refusal pattern.
- Response rates, along with sample size and sampling methods, should be reported.
- An adequate response rate is needed to obtain a representative sample.
- Each statistic obtained from a survey is only an estimate of the true population parameter.
- A margin of error calculation can be used to provide a confidence interval for each sample statistic.

- What impact would you expect if you found a systematic response refusal pattern had occurred? What step might you consider taking to alleviate the problem?
- How does the response rate affect the sampling process?

Baruch, Y., & Holtom, B. C. (2008). Survey response rate levels and trends in organizational research. *Human Relations*, *61*(8), 1139–1160.

This content is provided to you freely by BYU Open Learning Network.

Access it online or download it at https://open.byu.edu/designing_surveys/data_analysis.