I chose to simulate data based on my ongoing research project.
If the biological hypothesis were found to be true. I am investigating whether a handful of insulin binding proteins have a role in C. elegans associative learning ability. The data would take the form of four groups (two strains x two treatments), both following a normal distribution of indices betweein -1 and 1. The strains are n2 (wild type) and zig-4. The conditions are 25mM NaCl (low salt) and 100mM NaCl (high salt).
I used a for loop to create my data set that way the four sets of numbers would be in order for displaying data. I combined it into a data frame outside the loop because I needed some variables to remain in their current forms for example “exp” which I needed to scale according to sample size.
I did not do a posthoc test yet only when playing with parameters of the data. Simple anova to make sure the data datas!
exp <- c("n2high","zig4high","n2low","zig4low") #experimental labels, just as a group of four to be used later
for (j in 1:5) {
for (i in 1:4){#four total sets of data, each unique
if(i %% 2==1){#two must be closer to zero and two closer to one to simulate significant results
varname <- paste0("x_",i) #creating a unique name for each vector
assign(varname,rnorm(50,mean=runif(1,min=0.4,max=1),sd=runif(1,min=0.05,max=0.5)))#the data has a normal distribution and each new vector has a random mean in a desired range and a random standard deviation within a reasonable range
} else {
varname <- paste0("x_",i)
assign(varname,rnorm(50,mean=runif(1,min=0,max=0.6),sd=runif(1,min=0.05,max=0.5)))
}
if (i==4){ #two of the vectors (low salt treatment) have to be negative so I waited till the last loop and multiplied the vectors by -1
x_3 <- x_3*-1
x_4 <- x_4*-1
experiment <- rep(exp, each = 50) #scaling experimental labels to fit the amount of data
worm_index <- c(x_1,x_2,x_3,x_4) # combining data into a single vector
fakewormdata <- data.frame(worm_index,experiment) #data frame of whole dataset (just preference index and experimental labels)
anova1 <- aov(fakewormdata$worm_index~ fakewormdata$experiment)
sum <- summary(anova1) #quick initial anova before adjusting data parameters
cat("anova number",j,"\n")
print(sum)
cat("\n\n\n")
}
}
}
## anova number 1
## Df Sum Sq Mean Sq F value Pr(>F)
## fakewormdata$experiment 3 46.58 15.528 285 <0.0000000000000002 ***
## Residuals 196 10.68 0.054
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
##
## anova number 2
## Df Sum Sq Mean Sq F value Pr(>F)
## fakewormdata$experiment 3 72.89 24.296 212.1 <0.0000000000000002 ***
## Residuals 196 22.46 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
##
## anova number 3
## Df Sum Sq Mean Sq F value Pr(>F)
## fakewormdata$experiment 3 57.65 19.215 260.1 <0.0000000000000002 ***
## Residuals 196 14.48 0.074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
##
## anova number 4
## Df Sum Sq Mean Sq F value Pr(>F)
## fakewormdata$experiment 3 38.24 12.746 208.8 <0.0000000000000002 ***
## Residuals 196 11.97 0.061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
##
## anova number 5
## Df Sum Sq Mean Sq F value Pr(>F)
## fakewormdata$experiment 3 71.01 23.669 161.8 <0.0000000000000002 ***
## Residuals 196 28.67 0.146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*note: without the post hoc test all come back significant because half the data is below 0 and half is above. In the post hoc I will compare the groups that are on the same side of 0 with each other
Below I just assembled the data created above into a data frame
exp <- c("n2high","zig4high","n2low","zig4low")
experiment <- rep(exp, each = 50)
worm_index <- c(x_1,x_2,x_3,x_4)
fakewormdata <- data.frame(worm_index,experiment)
5/6. I used a for loop to adjust the sample size to explore how significance would change with sample size. I did this by subsetting the data based on the counter value and reassembling the data frame with the subsetted data. Then I ran an anova and a Tukey post-hoc test to extract the important comparisons from the statistical analysis.
for (j in 2:5) {
y_1 <- x_1[1:(j*10)] #resampling the data
y_2 <- x_2[1:(j*10)]
y_3 <- x_3[1:(j*10)]
y_4 <- x_4[1:(j*10)]
index <- c(y_1,y_2,y_3,y_4) #recombining the data
exp <- rep(exp,each=(j*10))
data <- data.frame(index,exp) #turning it into a data frame
anova <- aov(data$index ~ data$exp)
turkey <- TukeyHSD(anova) #post hoc test
high <- turkey$data[2,4] #these two lines isolate the desired parts of the post hoc test, one comparing the high salt conditioned groups to each other and one comparing the low salt conditioned ones.
low <- turkey$data[5,4]
cat("When n =",j*40,"and each strain x treatment has a random standard deviation between 0.05 and 0.5: \n Analysis: \n zig-4 (mean =",mean(y_2),") x wild type (mean =",mean(y_1),"), with high salt conditioning: \n p =",high,"\n Analysis: \n zig-4 (mean =",mean(y_4),") x wild type (mean =",mean(y_3),"), with low salt conditioning: \n p =",low,"\n\n\n")
}
## When n = 80 and each strain x treatment has a random standard deviation between 0.05 and 0.5:
## Analysis:
## zig-4 (mean = 0.2079407 ) x wild type (mean = 0.8984498 ), with high salt conditioning:
## p = 0.000009808103
## Analysis:
## zig-4 (mean = -0.1756984 ) x wild type (mean = -0.7844878 ), with low salt conditioning:
## p = 0.0001036103
##
##
## When n = 120 and each strain x treatment has a random standard deviation between 0.05 and 0.5:
## Analysis:
## zig-4 (mean = 0.1608291 ) x wild type (mean = 0.9156905 ), with high salt conditioning:
## p = 1
## Analysis:
## zig-4 (mean = -0.1877919 ) x wild type (mean = -0.8486365 ), with low salt conditioning:
## p = 1
##
##
## When n = 160 and each strain x treatment has a random standard deviation between 0.05 and 0.5:
## Analysis:
## zig-4 (mean = 0.1751816 ) x wild type (mean = 0.957495 ), with high salt conditioning:
## p = 1
## Analysis:
## zig-4 (mean = -0.1735121 ) x wild type (mean = -0.7591877 ), with low salt conditioning:
## p = 1
##
##
## When n = 200 and each strain x treatment has a random standard deviation between 0.05 and 0.5:
## Analysis:
## zig-4 (mean = 0.1911334 ) x wild type (mean = 0.9550838 ), with high salt conditioning:
## p = 1
## Analysis:
## zig-4 (mean = -0.1723003 ) x wild type (mean = -0.6806365 ), with low salt conditioning:
## p = 1