Simulation Study 1: Significance of Regression
Simulation Study 2: Using RMSE for Selection?
Simulation Study 3: Power

Simulation Study 1: Significance of Regression

Introduction

The purpose of this first simulation study revolves around the significance of regression test for multiple linear regression models. We are given two different models of the same length to generate data from. Both models will contain the same value for β0, the sample size(n), use the same three values of sigma, and the same predictor values (x1, x2, x3) with the only parameter difference being the values for the β1, β2, and β3. The “significant” model will have the value of “1” for β1, β2, and β3 where the “non-significant” model will have the value of “0” for those parameters. As a result, we see that the “non-significant” model traditionally represents the null hypothesis of β1 = β2 = β3 = 0. Using a sample size of 25 and three different values for sigma (1, 5, 10) we will simulate 2000 times for each model (2) and sigma (3) combinations resulting in a total of 12,000 simulations. For each simulation we will then extract the statistics from each fitted model that are required to evaluate the significance of regression test. Considering we are using multiple linear regression models these statistics will be the F-test statistic, the p-value associated with the F-test, and the proportion of the response variable that is explained (R2). After these three values are extracted from each simulation, they will be properly stored to allow for examining differences between the two models as well as each model and sigma combination.

Methods

data <- read.csv("C:/Users/jadam/Box/Job Applications & Content/Data_Projects/study_1.csv")
birthday = 09201988
set.seed(birthday)
x1 = data$x1
x2 = data$x2
x3 = data$x3
s_b0 <- 3
s_b1 <- 1
s_b2 <- 1
s_b3 <- 1
ns_b0 <- 3
ns_b1 <- 0
ns_b2 <- 0
ns_b3 <- 0
n <- 25
sigma1 <- 1
sigma2 <- 5
sigma3 <- 10
num_sim <- 2000

# Create storage variables for data to be stored in
m1_s1_f = rep(0, num_sim)
m1_s1_p = rep(0, num_sim)
m1_s1_r2 = rep(0, num_sim)

m1_s2_f = rep(0, num_sim)
m1_s2_p = rep(0, num_sim)
m1_s2_r2 = rep(0, num_sim)

m1_s3_f = rep(0, num_sim)
m1_s3_p = rep(0, num_sim)
m1_s3_r2 = rep(0, num_sim)

m2_s1_f = rep(0, num_sim)
m2_s1_p = rep(0, num_sim)
m2_s1_r2 = rep(0, num_sim)

m2_s2_f = rep(0, num_sim)
m2_s2_p = rep(0, num_sim)
m2_s2_r2 = rep(0, num_sim)

m2_s3_f = rep(0, num_sim)
m2_s3_p = rep(0, num_sim)
m2_s3_r2 = rep(0, num_sim)


# Function for fitting the model's 
sim = function(b0, b1, b2, b3, sigma) {
  eps = rnorm(n, mean = 0 , sd = sigma)
  y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + eps
  data.frame(y, x1, x2, x3)
}

# Simulate data
for (i in 1:num_sim){
  # Significant model
  sigdat_m1_s1 = sim(s_b0, s_b1, s_b2, s_b3, sigma1)
  sigdat_m1_s2 = sim(s_b0, s_b1, s_b2, s_b3, sigma2)
  sigdat_m1_s3 = sim(s_b0, s_b1, s_b2, s_b3, sigma3)
  s1_model <- lm(y ~ x1 + x2 + x3, data = sigdat_m1_s1)
  s2_model <- lm(y ~ x1 + x2 + x3, data = sigdat_m1_s2)
  s3_model <- lm(y ~ x1 + x2 + x3, data = sigdat_m1_s3)
  m1_s1_f[i] <- summary(s1_model)$fstatistic[1]
  m1_s1_p[i] <- pf(summary(s1_model)$fstatistic[1], summary(s1_model)$fstatistic[2], summary(s1_model)$fstatistic[3], lower.tail = FALSE)
  m1_s1_r2[i] <- summary(s1_model)$r.squared
  m1_s2_f[i] <- summary(s2_model)$fstatistic[1]
  m1_s2_p[i] <- pf(summary(s2_model)$fstatistic[1], summary(s2_model)$fstatistic[2], summary(s2_model)$fstatistic[3], lower.tail = FALSE)
  m1_s2_r2[i] <- summary(s2_model)$r.squared
  m1_s3_f[i] <- summary(s3_model)$fstatistic[1]
  m1_s3_p[i] <- pf(summary(s3_model)$fstatistic[1], summary(s3_model)$fstatistic[2], summary(s3_model)$fstatistic[3], lower.tail = FALSE)
  m1_s3_r2[i] <- summary(s3_model)$r.squared
  
  # Non Significant model
  nsigdat_m1_s1 = sim(s_b0, s_b1, s_b2, s_b3, sigma1)
  nsigdat_m1_s2 = sim(s_b0, s_b1, s_b2, s_b3, sigma2)
  nsigdat_m1_s3 = sim(s_b0, s_b1, s_b2, s_b3, sigma3)
  ns1_model <- lm(y ~ x1 + x2 + x3, data = nsigdat_m1_s1)
  ns2_model <- lm(y ~ x1 + x2 + x3, data = nsigdat_m1_s2)
  ns3_model <- lm(y ~ x1 + x2 + x3, data = nsigdat_m1_s3)
  m2_s1_f[i] <- summary(ns1_model)$fstatistic[1]
  m2_s1_p[i] <- pf(summary(ns1_model)$fstatistic[1], summary(ns1_model)$fstatistic[2], summary(ns1_model)$fstatistic[3], lower.tail = FALSE)
  m2_s1_r2[i] <- summary(ns1_model)$r.squared
  m2_s2_f[i] <- summary(ns2_model)$fstatistic[1]
  m2_s2_p[i] <- pf(summary(ns2_model)$fstatistic[1], summary(ns2_model)$fstatistic[2], summary(ns2_model)$fstatistic[3], lower.tail = FALSE)
  m2_s2_r2[i] <- summary(ns2_model)$r.squared
  m2_s3_f[i] <- summary(ns3_model)$fstatistic[1]
  m2_s3_p[i] <- pf(summary(ns3_model)$fstatistic[1], summary(ns3_model)$fstatistic[2], summary(ns3_model)$fstatistic[3], lower.tail = FALSE)
  m2_s3_r2[i] <- summary(ns3_model)$r.squared
}

Results

Significant Model

Non-Significant Model

Discussion

Looking briefly at the two 3 x 3 graphs above, it is clear that there is a strong relationship between the significant and non-significant models for the F statistic, p-value, and R2 regardless of the value for sigma. Previously, we stated the non-significant model can be viewed as representing the null hypothesis because by making β1, β2, and β3 zero it is the same as saying H0: β1 = β2 = β3 = 0. Therefore, because we see a strong relationship between both models for each of the three empirical values any assumptions made for one model can be made for both. With that said, we also know that the F statistic follows a F distribution under the assumption that the null hypothesis is true. When looking at the graphs we see that as the variance (sigma) increases, the f-statistic becomes smaller while the p-value gets bigger. This is what we would expect to see under the null hypothesis and can therefore infer this as the true distribution of the f statistic. We don’t, however, know the true distribution for the p-value and R2. Therefore, a true distribution curve was added for only the F statistic in the graphs above. Here we see that the empirical distribution for the F statistic follows the true distribution as sigma increases. When sigma = 1 there is virtually no relationship between the simulated values and the true distribution. As sigma increases to 5 we begin to see the true distribution curve begin to shape closely to the empirical distribution however, the top of the curve scales out of the graph. When sigma is 10, we see this true distribution curve overlay near perfectly to the empirical distribution. Concurrently, we notice that as sigma increases there is an increase in our p-value and a decrease in our R2. We can ultimately conclude that after 2000 simulations with our model parameters and predictors, we observe a large f-statistic and small p-value that is largely explained (R2 > 0.6) when sigma is 1. When sigma is 5, we still observe some of these large f-statistics and small p-values however, there is an extreme shift in our R2 values (σ=1: R2 > 0.6, σ=5: R2 < 0.6) that reduces the proportion of the response variable that is explained. This pattern for all three of our empirical values continues when sigma is 10 and can be expected to continue this trend the larger sigma becomes.

Simulation Study 2: Using RMSE for Selection?

Introduction

Room mean square error (RMSE) is a statistic that can be used to select the best fitting model. This is generally done by randomly splitting the dataset into a “Test” and “Train” set that is then used to train and test the model. RMSE is then calculated for each the test and train data. A low train RMSE and high test RMSE is viewed as overfitting the model whereas a high train RMSE and high test RMSE is viewed as an underfitted model. We prefer a model that predicts the test data therefore our best model will be predicted based on low test RMSE. When this process is repeated a large number of times, on average it should sufficiently predict (or select) the correct model. In this second simulation study, we will carry out this process 1000 times and determine whether RMSE is able to select the correct model. Furthermore, we are given three different standard deviations and will evaluate how this effects RMSE and the model we would select based on this value. To do this we, are given the model (Yi = …), the beta parameters, the predictors, sample size, and the three value’s for SD. We will first simulate three different response values for the three standard deviations using the provided information. We will then randomly split the data in half providing an equal number of observations in both the Test and Train subsets. This will be used to fit nine different models, as well as calculating Train and Test RMSE for each, beginning with a single predictor in the first model (y ~ x1) and adding one additional predictor to each following model with the final ninth model consisting of all nine predictors (x1 – x9). We are previously given the answer that the sixth model is in fact the correct model. Therefore after simulating the above process 1000 times, we will examine the Test and Train RMSE for all three values of SD for each model (9models x 3SD x 1000simulations = 27000) and analyze the success of this method in choosing the correct model and how the different SD’s affect the results.

Methods

birthday = 09201988
set.seed(birthday)
study_2 <- read.csv("C:/Users/jadam/Box/Job Applications & Content/Data_Projects/study_2.csv")
x1 <- study_2$x1
x2 <- study_2$x2
x3 <- study_2$x3
x4 <- study_2$x4
x5 <- study_2$x5
x6 <- study_2$x6
x7 <- study_2$x7
x8 <- study_2$x8
x9 <- study_2$x9
b0 <- 0
b1 <- 3
b2 <- -4
b3 <- 1.6
b4 <- -1.1
b5 <- 0.7
b6 <- 0.5
n <- 500
sigma1 <- 1
sigma2 <- 2
sigma3 <- 4
num_sim <- 1000
S1_Train_RMSE <- matrix(0, num_sim, 9)
S2_Train_RMSE <- matrix(0, num_sim, 9)
S3_Train_RMSE <- matrix(0, num_sim, 9)
S1_Test_RMSE <- matrix(0, num_sim, 9)
S2_Test_RMSE <- matrix(0, num_sim, 9)
S3_Test_RMSE <- matrix(0, num_sim, 9)
colnames(S1_Train_RMSE) <- paste("Model", 1:9, sep=" ")
colnames(S2_Train_RMSE) <- paste("Model", 1:9, sep=" ")
colnames(S3_Train_RMSE) <- paste("Model", 1:9, sep=" ")
colnames(S1_Test_RMSE) <- paste("Model", 1:9, sep=" ")
colnames(S2_Test_RMSE) <- paste("Model", 1:9, sep=" ")
colnames(S3_Test_RMSE) <- paste("Model", 1:9, sep=" ")

# Simulating Data function
sim = function(sigma) {
  eps = rnorm(n, mean = 0 , sd = sigma)
  y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + eps
  data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9)
}

# RMSE Function
rmse <- function(actual, predicted){
  sqrt(mean((actual - predicted) ^ 2))
}

# Sims data obtaining Y's
for (i in 1:num_sim){
  # Simulates data
  simdata_s1 = sim(sigma1)
  simdata_s2 = sim(sigma2)
  simdata_s3 = sim(sigma3)
  
  #Splits the data into train and test
  idx = sample(1:nrow(study_2), 250)
  s1_trn = simdata_s1[idx, ]
  s1_tst = simdata_s1[-idx, ]
  s2_trn = simdata_s2[idx, ]
  s2_tst = simdata_s2[-idx, ]
  s3_trn = simdata_s3[idx, ]
  s3_tst = simdata_s3[-idx, ]
  
  # Fit the model for Sigma 1
  model_1_s1 <- lm(y ~ x1, data = s1_trn)
  model_2_s1 <- lm(y ~ x1 + x2, data = s1_trn)
  model_3_s1 <- lm(y ~ x1 + x2 + x3, data = s1_trn)
  model_4_s1 <- lm(y ~ x1 + x2 + x3 + x4, data = s1_trn)
  model_5_s1 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = s1_trn)
  model_6_s1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = s1_trn)
  model_7_s1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = s1_trn)
  model_8_s1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = s1_trn)
  model_9_s1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = s1_trn)
  
  # Fit the model for Sigma 2
  model_1_s2 <- lm(y ~ x1, data = s2_trn)
  model_2_s2 <- lm(y ~ x1 + x2, data = s2_trn)
  model_3_s2 <- lm(y ~ x1 + x2 + x3, data = s2_trn)
  model_4_s2 <- lm(y ~ x1 + x2 + x3 + x4, data = s2_trn)
  model_5_s2 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = s2_trn)
  model_6_s2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = s2_trn)
  model_7_s2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = s2_trn)
  model_8_s2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = s2_trn)
  model_9_s2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = s2_trn)
  
  # Fit the model for Sigma 3
  model_1_s3 <- lm(y ~ x1, data = s3_trn)
  model_2_s3 <- lm(y ~ x1 + x2, data = s3_trn)
  model_3_s3 <- lm(y ~ x1 + x2 + x3, data = s3_trn)
  model_4_s3 <- lm(y ~ x1 + x2 + x3 + x4, data = s3_trn)
  model_5_s3 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = s3_trn)
  model_6_s3 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = s3_trn)
  model_7_s3 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = s3_trn)
  model_8_s3 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = s3_trn)
  model_9_s3 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = s3_trn)
  
  # Calculate Test RMSE for each Sigma
  S1_Test_RMSE[i,] = c(
    rmse(s1_tst$y, predict(model_1_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_2_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_3_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_4_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_5_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_6_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_7_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_8_s1, s1_tst)),
    rmse(s1_tst$y, predict(model_9_s1, s1_tst))
  )
  S2_Test_RMSE[i,] = c(
    rmse(s2_tst$y, predict(model_1_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_2_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_3_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_4_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_5_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_6_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_7_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_8_s2, s2_tst)),
    rmse(s2_tst$y, predict(model_9_s2, s2_tst))
  )
  
  S3_Test_RMSE[i,] = c(
    rmse(s3_tst$y, predict(model_1_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_2_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_3_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_4_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_5_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_6_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_7_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_8_s3, s3_tst)),
    rmse(s3_tst$y, predict(model_9_s3, s3_tst))
  )
  
  # Calculate Train RMSE for each Sigma 
  S1_Train_RMSE[i,] = c(
    rmse(s1_trn$y, predict(model_1_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_2_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_3_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_4_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_5_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_6_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_7_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_8_s1, s1_trn)),
    rmse(s1_trn$y, predict(model_9_s1, s1_trn))
  )
  S2_Train_RMSE[i,] = c(
    rmse(s2_trn$y, predict(model_1_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_2_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_3_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_4_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_5_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_6_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_7_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_8_s2, s2_trn)),
    rmse(s2_trn$y, predict(model_9_s2, s2_trn))
  )
  
  S3_Train_RMSE[i,] = c(
    rmse(s3_trn$y, predict(model_1_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_2_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_3_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_4_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_5_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_6_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_7_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_8_s3, s3_trn)),
    rmse(s3_trn$y, predict(model_9_s3, s3_trn))
  )
}

Results

name	Freq
Model 5	1
Model 6	540
Model 7	208
Model 8	126
Model 9	125

name	Freq
Model 3	12
Model 4	15
Model 5	15
Model 6	489
Model 7	205
Model 8	123
Model 9	141

name	Freq
Model 2	15
Model 3	182
Model 4	110
Model 5	56
Model 6	308
Model 7	140
Model 8	90
Model 9	99

Discussion

The results for the average train and test RMSE across each model was plotted for each of the three different values for sigma (1, 2, 4). As stated in the introduction, low test RMSE was used as the method for choosing the correct model. It was shown that when this process was repeated 1000 times, on average this method effectively selected the correct model (model 6). However, this was not always the case for every simulation. There were times that other models resulted in the lowest test RMSE by chance which increased as sigma increased. For example when sigma was equal to 1, five models had been chosen at least once as the correct model however, when sigma was equal to 4 nearly all nine of the models (eight) had been chosen at least once as the correct model. Although on average Model 6 was consistently chosen as the correct model, this example illustrates the significance for large simulations especially as sigma increases. Additionally, train RMSE was plotted concurrently with test RMSE to examine differences. Consistent with our expectations, test RMSE failed to predict the correct model regardless of what sigma was and instead predicted the largest model (model 9). Based on the RMSE formula, the larger model will always have a RMSE that is lower then or equal to a smaller model. This is why we see that on average model 9 has the lowest Train RMSE, which is again an example of overfitting. We did see that the results were affected by the level of noise. Although test RMSE did predict the correct model on average across all three sigma values, this occurrence increased as sigma levels increased.

Simulation Study 3: Power

Introduction

In this stimulation study, we will be examining power estimates for a given simple linear regression hypothesis test and how this is influenced by sample size(n), signal strength(β), and noise level(σ). Power of a significance of regression test is the probability of rejecting the null hypothesis when an alternative is true. We are given the null β1 = 0 and alternative β 1 != 0 hypothesis, three different sample sizes (n = 10, 20, 30), three different noise levels (1, 2, 4), the significance level (α = 0.05) and 41 incremental β1 values where zero is the median. Predictor values (x) were then generated for the three different sample sizes which was then used to simulate data from the SLR model 1000 times for every sample size(3), signal strength(41), and noise level(3) combination. For each simulation and combination, a significance of the regression test was performed and then stored in a separate variable. Once completed, we added up the number of times the null hypothesis was correctly rejected (for alpha = 0.05) and divided this by the total number of simulations (1000) to calculate the power estimate for each β. Three plots for each value of σ was then created with each depicting three power curves representing the three sample sizes. Findings are then discussed to interpret meaning and examine if what we observed would be consistent with what we would expect.

Methods

birthday = 09201988
set.seed(birthday)
b0 <- 0
b1 <- seq(-2,2,0.1)
n <- c(10,20,30)
sigma <- c(1,2,4)
alpha <- 0.05
x_values10 <- seq(0,5, length = n[1])
x_values20 <- seq(0,5, length = n[2])
x_values30 <- seq(0,5, length = n[3])
num_sim <- 1000

# Storage pieces that will be filled via for loop 
N_10_s1 <- matrix(nrow = num_sim, ncol = 41)
N_10_s2 <- matrix(nrow = num_sim, ncol = 41)
N_10_s3 <- matrix(nrow = num_sim, ncol = 41)
N_20_s1 <- matrix(nrow = num_sim, ncol = 41)
N_20_s2 <- matrix(nrow = num_sim, ncol = 41)
N_20_s3 <- matrix(nrow = num_sim, ncol = 41)
N_30_s1 <- matrix(nrow = num_sim, ncol = 41)
N_30_s2 <- matrix(nrow = num_sim, ncol = 41)
N_30_s3 <- matrix(nrow = num_sim, ncol = 41)

# Function for simulating data
simdata = function(x, b1, sigma) {
  n = length(x)
  eps = rnorm(n, mean = 0 , sd = sigma)
  y = b0 + b1 *x + eps
  fit = lm(y ~ x)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
}

# Simulating and storing data 1000x's for each combination
for (i in 1:num_sim){
  # Combinations of beta and sigma for N = 10
  N_10_s1[i,] = c(simdata(x_values10, b1[1], sigma[1]), simdata(x_values10, b1[2], sigma[1]), simdata(x_values10, b1[3], sigma[1]), simdata(x_values10, b1[4], sigma[1]),
                  simdata(x_values10, b1[5], sigma[1]), simdata(x_values10, b1[6], sigma[1]), simdata(x_values10, b1[7], sigma[1]), simdata(x_values10, b1[8], sigma[1]), 
                  simdata(x_values10, b1[9], sigma[1]), simdata(x_values10, b1[10], sigma[1]), simdata(x_values10, b1[11], sigma[1]), simdata(x_values10, b1[12], sigma[1]), 
                  simdata(x_values10, b1[13], sigma[1]), simdata(x_values10, b1[14], sigma[1]), simdata(x_values10, b1[15], sigma[1]), simdata(x_values10, b1[16], sigma[1]), 
                  simdata(x_values10, b1[17], sigma[1]), simdata(x_values10, b1[18], sigma[1]), simdata(x_values10, b1[19], sigma[1]), simdata(x_values10, b1[20], sigma[1]), 
                  simdata(x_values10, b1[21], sigma[1]), simdata(x_values10, b1[22], sigma[1]), simdata(x_values10, b1[23], sigma[1]), simdata(x_values10, b1[24], sigma[1]), 
                  simdata(x_values10, b1[25], sigma[1]), simdata(x_values10, b1[26], sigma[1]), simdata(x_values10, b1[27], sigma[1]), simdata(x_values10, b1[28], sigma[1]), 
                  simdata(x_values10, b1[29], sigma[1]), simdata(x_values10, b1[30], sigma[1]), simdata(x_values10, b1[31], sigma[1]), simdata(x_values10, b1[32], sigma[1]),
                  simdata(x_values10, b1[33], sigma[1]), simdata(x_values10, b1[34], sigma[1]), simdata(x_values10, b1[35], sigma[1]), simdata(x_values10, b1[36], sigma[1]), 
                  simdata(x_values10, b1[37], sigma[1]), simdata(x_values10, b1[38], sigma[1]), simdata(x_values10, b1[39], sigma[1]), simdata(x_values10, b1[40], sigma[1]),
                  simdata(x_values10, b1[41], sigma[1]))
  
  N_10_s2[i,] = c(simdata(x_values10, b1[1], sigma[2]), simdata(x_values10, b1[2], sigma[2]), simdata(x_values10, b1[3], sigma[2]), simdata(x_values10, b1[4], sigma[2]),
                  simdata(x_values10, b1[5], sigma[2]), simdata(x_values10, b1[6], sigma[2]), simdata(x_values10, b1[7], sigma[2]), simdata(x_values10, b1[8], sigma[2]), 
                  simdata(x_values10, b1[9], sigma[2]), simdata(x_values10, b1[10], sigma[2]), simdata(x_values10, b1[11], sigma[2]), simdata(x_values10, b1[12], sigma[2]), 
                  simdata(x_values10, b1[13], sigma[2]), simdata(x_values10, b1[14], sigma[2]), simdata(x_values10, b1[15], sigma[2]), simdata(x_values10, b1[16], sigma[2]), 
                  simdata(x_values10, b1[17], sigma[2]), simdata(x_values10, b1[18], sigma[2]), simdata(x_values10, b1[19], sigma[2]), simdata(x_values10, b1[20], sigma[2]), 
                  simdata(x_values10, b1[21], sigma[2]), simdata(x_values10, b1[22], sigma[2]), simdata(x_values10, b1[23], sigma[2]), simdata(x_values10, b1[24], sigma[2]),
                  simdata(x_values10, b1[25], sigma[2]), simdata(x_values10, b1[26], sigma[2]), simdata(x_values10, b1[27], sigma[2]), simdata(x_values10, b1[28], sigma[2]), 
                  simdata(x_values10, b1[29], sigma[2]), simdata(x_values10, b1[30], sigma[2]), simdata(x_values10, b1[31], sigma[2]), simdata(x_values10, b1[32], sigma[2]), 
                  simdata(x_values10, b1[33], sigma[2]), simdata(x_values10, b1[34], sigma[2]), simdata(x_values10, b1[35], sigma[2]), simdata(x_values10, b1[36], sigma[2]), 
                  simdata(x_values10, b1[37], sigma[2]), simdata(x_values10, b1[38], sigma[2]), simdata(x_values10, b1[39], sigma[2]), simdata(x_values10, b1[40], sigma[2]), 
                  simdata(x_values10, b1[41], sigma[2]))
  
  N_10_s3[i,] = c(simdata(x_values10, b1[1], sigma[3]), simdata(x_values10, b1[2], sigma[3]), simdata(x_values10, b1[3], sigma[3]), simdata(x_values10, b1[4], sigma[3]),
                  simdata(x_values10, b1[5], sigma[3]), simdata(x_values10, b1[6], sigma[3]), simdata(x_values10, b1[7], sigma[3]), simdata(x_values10, b1[8], sigma[3]), 
                  simdata(x_values10, b1[9], sigma[3]), simdata(x_values10, b1[10], sigma[3]), simdata(x_values10, b1[11], sigma[3]), simdata(x_values10, b1[12], sigma[3]), 
                  simdata(x_values10, b1[13], sigma[3]), simdata(x_values10, b1[14], sigma[3]), simdata(x_values10, b1[15], sigma[3]), simdata(x_values10, b1[16], sigma[3]), 
                  simdata(x_values10, b1[17], sigma[3]), simdata(x_values10, b1[18], sigma[3]), simdata(x_values10, b1[19], sigma[3]), simdata(x_values10, b1[20], sigma[3]), 
                  simdata(x_values10, b1[21], sigma[3]), simdata(x_values10, b1[22], sigma[3]), simdata(x_values10, b1[23], sigma[3]), simdata(x_values10, b1[24], sigma[3]), 
                  simdata(x_values10, b1[25], sigma[3]), simdata(x_values10, b1[26], sigma[3]), simdata(x_values10, b1[27], sigma[3]), simdata(x_values10, b1[28], sigma[3]), 
                  simdata(x_values10, b1[29], sigma[3]), simdata(x_values10, b1[30], sigma[3]), simdata(x_values10, b1[31], sigma[3]), simdata(x_values10, b1[32], sigma[3]), 
                  simdata(x_values10, b1[33], sigma[3]), simdata(x_values10, b1[34], sigma[3]), simdata(x_values10, b1[35], sigma[3]), simdata(x_values10, b1[36], sigma[3]), 
                  simdata(x_values10, b1[37], sigma[3]), simdata(x_values10, b1[38], sigma[3]), simdata(x_values10, b1[39], sigma[3]), simdata(x_values10, b1[40], sigma[3]), 
                  simdata(x_values10, b1[41], sigma[3]))
  
  # Combinations of beta and sigma for N = 20
  N_20_s1[i,] = c(simdata(x_values20, b1[1], sigma[1]), simdata(x_values20, b1[2], sigma[1]), simdata(x_values20, b1[3], sigma[1]), simdata(x_values20, b1[4], sigma[1]),
                  simdata(x_values20, b1[5], sigma[1]), simdata(x_values20, b1[6], sigma[1]), simdata(x_values20, b1[7], sigma[1]), simdata(x_values20, b1[8], sigma[1]), 
                  simdata(x_values20, b1[9], sigma[1]), simdata(x_values20, b1[10], sigma[1]), simdata(x_values20, b1[11], sigma[1]), simdata(x_values20, b1[12], sigma[1]), 
                  simdata(x_values20, b1[13], sigma[1]), simdata(x_values20, b1[14], sigma[1]), simdata(x_values20, b1[15], sigma[1]), simdata(x_values20, b1[16], sigma[1]), 
                  simdata(x_values20, b1[17], sigma[1]), simdata(x_values20, b1[18], sigma[1]), simdata(x_values20, b1[19], sigma[1]), simdata(x_values20, b1[20], sigma[1]), 
                  simdata(x_values20, b1[21], sigma[1]), simdata(x_values20, b1[22], sigma[1]), simdata(x_values20, b1[23], sigma[1]), simdata(x_values20, b1[24], sigma[1]), 
                  simdata(x_values20, b1[25], sigma[1]), simdata(x_values20, b1[26], sigma[1]), simdata(x_values20, b1[27], sigma[1]), simdata(x_values20, b1[28], sigma[1]), 
                  simdata(x_values20, b1[29], sigma[1]), simdata(x_values20, b1[30], sigma[1]), simdata(x_values20, b1[31], sigma[1]), simdata(x_values20, b1[32], sigma[1]),
                  simdata(x_values20, b1[33], sigma[1]), simdata(x_values20, b1[34], sigma[1]), simdata(x_values20, b1[35], sigma[1]), simdata(x_values20, b1[36], sigma[1]), 
                  simdata(x_values20, b1[37], sigma[1]), simdata(x_values20, b1[38], sigma[1]), simdata(x_values20, b1[39], sigma[1]), simdata(x_values20, b1[40], sigma[1]),
                  simdata(x_values20, b1[41], sigma[1]))
  
  N_20_s2[i,] = c(simdata(x_values20, b1[1], sigma[2]), simdata(x_values20, b1[2], sigma[2]), simdata(x_values20, b1[3], sigma[2]), simdata(x_values20, b1[4], sigma[2]),
                  simdata(x_values20, b1[5], sigma[2]), simdata(x_values20, b1[6], sigma[2]), simdata(x_values20, b1[7], sigma[2]), simdata(x_values20, b1[8], sigma[2]), 
                  simdata(x_values20, b1[9], sigma[2]), simdata(x_values20, b1[10], sigma[2]), simdata(x_values20, b1[11], sigma[2]), simdata(x_values20, b1[12], sigma[2]), 
                  simdata(x_values20, b1[13], sigma[2]), simdata(x_values20, b1[14], sigma[2]), simdata(x_values20, b1[15], sigma[2]), simdata(x_values20, b1[16], sigma[2]), 
                  simdata(x_values20, b1[17], sigma[2]), simdata(x_values20, b1[18], sigma[2]), simdata(x_values20, b1[19], sigma[2]), simdata(x_values20, b1[20], sigma[2]), 
                  simdata(x_values20, b1[21], sigma[2]), simdata(x_values20, b1[22], sigma[2]), simdata(x_values20, b1[23], sigma[2]), simdata(x_values20, b1[24], sigma[2]),
                  simdata(x_values20, b1[25], sigma[2]), simdata(x_values20, b1[26], sigma[2]), simdata(x_values20, b1[27], sigma[2]), simdata(x_values20, b1[28], sigma[2]), 
                  simdata(x_values20, b1[29], sigma[2]), simdata(x_values20, b1[30], sigma[2]), simdata(x_values20, b1[31], sigma[2]), simdata(x_values20, b1[32], sigma[2]), 
                  simdata(x_values20, b1[33], sigma[2]), simdata(x_values20, b1[34], sigma[2]), simdata(x_values20, b1[35], sigma[2]), simdata(x_values20, b1[36], sigma[2]), 
                  simdata(x_values20, b1[37], sigma[2]), simdata(x_values20, b1[38], sigma[2]), simdata(x_values20, b1[39], sigma[2]), simdata(x_values20, b1[40], sigma[2]), 
                  simdata(x_values20, b1[41], sigma[2]))
  
  N_20_s3[i,] = c(simdata(x_values20, b1[1], sigma[3]), simdata(x_values20, b1[2], sigma[3]), simdata(x_values20, b1[3], sigma[3]), simdata(x_values20, b1[4], sigma[3]),
                  simdata(x_values20, b1[5], sigma[3]), simdata(x_values20, b1[6], sigma[3]), simdata(x_values20, b1[7], sigma[3]), simdata(x_values20, b1[8], sigma[3]), 
                  simdata(x_values20, b1[9], sigma[3]), simdata(x_values20, b1[10], sigma[3]), simdata(x_values20, b1[11], sigma[3]), simdata(x_values20, b1[12], sigma[3]), 
                  simdata(x_values20, b1[13], sigma[3]), simdata(x_values20, b1[14], sigma[3]), simdata(x_values20, b1[15], sigma[3]), simdata(x_values20, b1[16], sigma[3]), 
                  simdata(x_values20, b1[17], sigma[3]), simdata(x_values20, b1[18], sigma[3]), simdata(x_values20, b1[19], sigma[3]), simdata(x_values20, b1[20], sigma[3]), 
                  simdata(x_values20, b1[21], sigma[3]), simdata(x_values20, b1[22], sigma[3]), simdata(x_values20, b1[23], sigma[3]), simdata(x_values20, b1[24], sigma[3]), 
                  simdata(x_values20, b1[25], sigma[3]), simdata(x_values20, b1[26], sigma[3]), simdata(x_values20, b1[27], sigma[3]), simdata(x_values20, b1[28], sigma[3]), 
                  simdata(x_values20, b1[29], sigma[3]), simdata(x_values20, b1[30], sigma[3]), simdata(x_values20, b1[31], sigma[3]), simdata(x_values20, b1[32], sigma[3]), 
                  simdata(x_values20, b1[33], sigma[3]), simdata(x_values20, b1[34], sigma[3]), simdata(x_values20, b1[35], sigma[3]), simdata(x_values20, b1[36], sigma[3]), 
                  simdata(x_values20, b1[37], sigma[3]), simdata(x_values20, b1[38], sigma[3]), simdata(x_values20, b1[39], sigma[3]), simdata(x_values20, b1[40], sigma[3]), 
                  simdata(x_values20, b1[41], sigma[3]))
  
  # Combinations of beta and sigma for N = 30
  N_30_s1[i,] = c(simdata(x_values30, b1[1], sigma[1]),  simdata(x_values30, b1[2], sigma[1]), simdata(x_values30, b1[3], sigma[1]), simdata(x_values30, b1[4], sigma[1]),
                  simdata(x_values30, b1[5], sigma[1]),  simdata(x_values30, b1[6], sigma[1]), simdata(x_values30, b1[7], sigma[1]), simdata(x_values30, b1[8], sigma[1]), 
                  simdata(x_values30, b1[9], sigma[1]),  simdata(x_values30, b1[10], sigma[1]), simdata(x_values30, b1[11], sigma[1]), simdata(x_values30, b1[12], sigma[1]), 
                  simdata(x_values30, b1[13], sigma[1]), simdata(x_values30, b1[14], sigma[1]), simdata(x_values30, b1[15], sigma[1]), simdata(x_values30, b1[16], sigma[1]), 
                  simdata(x_values30, b1[17], sigma[1]), simdata(x_values30, b1[18], sigma[1]), simdata(x_values30, b1[19], sigma[1]), simdata(x_values30, b1[20], sigma[1]), 
                  simdata(x_values30, b1[21], sigma[1]), simdata(x_values30, b1[22], sigma[1]), simdata(x_values30, b1[23], sigma[1]), simdata(x_values30, b1[24], sigma[1]), 
                  simdata(x_values30, b1[25], sigma[1]), simdata(x_values30, b1[26], sigma[1]), simdata(x_values30, b1[27], sigma[1]), simdata(x_values30, b1[28], sigma[1]), 
                  simdata(x_values30, b1[29], sigma[1]), simdata(x_values30, b1[30], sigma[1]), simdata(x_values30, b1[31], sigma[1]), simdata(x_values30, b1[32], sigma[1]),
                  simdata(x_values30, b1[33], sigma[1]), simdata(x_values30, b1[34], sigma[1]), simdata(x_values30, b1[35], sigma[1]), simdata(x_values30, b1[36], sigma[1]), 
                  simdata(x_values30, b1[37], sigma[1]), simdata(x_values30, b1[38], sigma[1]), simdata(x_values30, b1[39], sigma[1]), simdata(x_values30, b1[40], sigma[1]),
                  simdata(x_values30, b1[41], sigma[1]))
  
  N_30_s2[i,] = c(simdata(x_values30, b1[1], sigma[2]),  simdata(x_values30, b1[2], sigma[2]), simdata(x_values30, b1[3], sigma[2]), simdata(x_values30, b1[4], sigma[2]),
                  simdata(x_values30, b1[5], sigma[2]),  simdata(x_values30, b1[6], sigma[2]), simdata(x_values30, b1[7], sigma[2]), simdata(x_values30, b1[8], sigma[2]), 
                  simdata(x_values30, b1[9], sigma[2]),  simdata(x_values30, b1[10], sigma[2]), simdata(x_values30, b1[11], sigma[2]), simdata(x_values30, b1[12], sigma[2]), 
                  simdata(x_values30, b1[13], sigma[2]), simdata(x_values30, b1[14], sigma[2]), simdata(x_values30, b1[15], sigma[2]), simdata(x_values30, b1[16], sigma[2]), 
                  simdata(x_values30, b1[17], sigma[2]), simdata(x_values30, b1[18], sigma[2]), simdata(x_values30, b1[19], sigma[2]), simdata(x_values30, b1[20], sigma[2]), 
                  simdata(x_values30, b1[21], sigma[2]), simdata(x_values30, b1[22], sigma[2]), simdata(x_values30, b1[23], sigma[2]), simdata(x_values30, b1[24], sigma[2]),
                  simdata(x_values30, b1[25], sigma[2]), simdata(x_values30, b1[26], sigma[2]), simdata(x_values30, b1[27], sigma[2]), simdata(x_values30, b1[28], sigma[2]), 
                  simdata(x_values30, b1[29], sigma[2]), simdata(x_values30, b1[30], sigma[2]), simdata(x_values30, b1[31], sigma[2]), simdata(x_values30, b1[32], sigma[2]), 
                  simdata(x_values30, b1[33], sigma[2]), simdata(x_values30, b1[34], sigma[2]), simdata(x_values30, b1[35], sigma[2]), simdata(x_values30, b1[36], sigma[2]), 
                  simdata(x_values30, b1[37], sigma[2]), simdata(x_values30, b1[38], sigma[2]), simdata(x_values30, b1[39], sigma[2]), simdata(x_values30, b1[40], sigma[2]), 
                  simdata(x_values30, b1[41], sigma[2]))
  
  N_30_s3[i,] = c(simdata(x_values30, b1[1], sigma[3]),  simdata(x_values30, b1[2], sigma[3]), simdata(x_values30, b1[3], sigma[3]), simdata(x_values30, b1[4], sigma[3]),
                  simdata(x_values30, b1[5], sigma[3]),  simdata(x_values30, b1[6], sigma[3]), simdata(x_values30, b1[7], sigma[3]), simdata(x_values30, b1[8], sigma[3]), 
                  simdata(x_values30, b1[9], sigma[3]),  simdata(x_values30, b1[10], sigma[3]), simdata(x_values30, b1[11], sigma[3]), simdata(x_values30, b1[12], sigma[3]), 
                  simdata(x_values30, b1[13], sigma[3]), simdata(x_values30, b1[14], sigma[3]), simdata(x_values30, b1[15], sigma[3]), simdata(x_values30, b1[16], sigma[3]), 
                  simdata(x_values30, b1[17], sigma[3]), simdata(x_values30, b1[18], sigma[3]), simdata(x_values30, b1[19], sigma[3]), simdata(x_values30, b1[20], sigma[3]), 
                  simdata(x_values30, b1[21], sigma[3]), simdata(x_values30, b1[22], sigma[3]), simdata(x_values30, b1[23], sigma[3]), simdata(x_values30, b1[24], sigma[3]), 
                  simdata(x_values30, b1[25], sigma[3]), simdata(x_values30, b1[26], sigma[3]), simdata(x_values30, b1[27], sigma[3]), simdata(x_values30, b1[28], sigma[3]), 
                  simdata(x_values30, b1[29], sigma[3]), simdata(x_values30, b1[30], sigma[3]), simdata(x_values30, b1[31], sigma[3]), simdata(x_values30, b1[32], sigma[3]), 
                  simdata(x_values30, b1[33], sigma[3]), simdata(x_values30, b1[34], sigma[3]), simdata(x_values30, b1[35], sigma[3]), simdata(x_values30, b1[36], sigma[3]), 
                  simdata(x_values30, b1[37], sigma[3]), simdata(x_values30, b1[38], sigma[3]), simdata(x_values30, b1[39], sigma[3]), simdata(x_values30, b1[40], sigma[3]), 
                  simdata(x_values30, b1[41], sigma[3]))
}

Results

Discussion

When we talk about power for a significant of regression test we are referring to the probability that a signal of a particular strength(β1) will be detected. Of the number of things affecting the power of a test, we conducted a stimulation study to examine how sample size(n), signal strength(β), and noise level(σ) influenced our power estimate with a fixed alpha. Based on the formula for p-value (2*P(tn-2 > |t|) and t-test (EST – HYP/SE) we know what to expect for a given parameter. For instance, the standard error (SE) of the mean depends inversely on the square root of the sample size. Meaning that the standard error is smaller when N is large compared to when N is small. We therefore know that increasing the sample size increases the power estimate. Additionally, sigma accounts for error variance in the sample size and when given a fixed sample size a smaller variance with lead to increased power compared to a larger variance when all else is considered to be fixed. Lastly, our null hypothesis is that β1 = 0 with the alternative being β1 != 0. Power is the probability that we reject the null when the alternative is true. In the case, we would expect power to increase the further away β is from 0 in magnitude. Understanding these concepts and what we would expect to observe on average, we performed a simulation study that was ran 1000 times. These results were plotted (above) in three separate graphs comparing power estimates for each possible β1 and sigma combination given a specific sample size. At a glance, the inverted-U shape we see across all three graphs tells us that the power estimate increases the farther away β is from 0. We also see that the shape of this inverted-U has a “pointer” bottom with a smaller sigma and gradually widens the larger sigma becomes. Telling us that the change in power between two incremental β values approaching zero (e.g., β0.7 – β 0.6) is greater for smaller sigma’s and more gradual as sigma increases. In the first graph with the lowest sigma (sigma = 1) we see that the power estimate is consistently near 1.0 with larger β values and then drastically decreases when β is less then 1 in magnitude. This is generally consistent regardless of the sample size though the trend shows that as we increase the sample sizes, we are able to maintain the increased power even as β approaches zero. This trend continues proportionally in the following two graphs. As sigma increases, the power curve decreases with larger values of β and a smaller between β difference. Additionally, we see that sample size greatly influences power estimates when sigma gets larger. For example, the power estimates instantly begin to decrease at the extreme β values in the smallest sample size (n=10) whereas power tends to remain higher at these extreme β values as we observe larger sample sizes. What we observed in these three graphs follows what would be expected based on the mathematical equations thus making 1000 simulations sufficient for this study. Based on the hypothesis test that was given, we expected to see the power of significance to be lowest when β was near zero and for the power to gradually increase the further away in magnitude β was from zero. We saw that the power estimates drastically increased at a smaller β value when sigma was smallest compared to a more gradual and slower increase in power the larger sigma became. Finally, we see that sample size followed a consistent trend for each combination. Larger sample sizes were able to maintain increased power estimates with extreme β values the larger sigma became.

Data Simulation Project

John F. Adamek

6/29/2020

Simulation Study 1: Significance of Regression

Introduction

Methods

Results

Discussion

Simulation Study 2: Using RMSE for Selection?

Introduction

Methods

Results

Discussion

Simulation Study 3: Power

Introduction

Methods

Results

Discussion