Lab homework using R; instructions below
LAB HOMEWORK INSTRUCTIONS
——————————————————————-
#save your plots as an image file and upload separately (export > save as image)
#save them as png with filename your_name_(question number)_(histogram/plot)
#You are told that in Providence, Rhode Island, the height of women averages 64 inches (5’4″) with a standard deviation of 1.5 inches.
#Of the 178 thousand people in Providence, Rhode Island, 52% are women.
#PART 1
#1.1) Simulate the population described above. Call the population variable popri
#1.2) Compute the mean and sd of this population.
#How do this mean and sd compare to the true mean (64) and sd (1.5)?
#1.3) Plot a density plot of this population. What does this distribution look like?
#PART 2
#2.1) Take a random sample of popri containing 10 people and call the variable sam10
#2.2) Compute the mean and sd of the sample sam10
#2.3) Take a random sample of popri containing 1000 people and call the variable sam1000.
#2.4) Compute the mean and sd of the sample sam1000
#2.5) Which is closer to the true population mean and sd: the sample with 10 people or the sample with 1000 people? Why?
#PART 3
#3.1) Create a matrix called samsri containing 200 random samples of 500 subjects in each sample from the population variable popri.
#HINT: Declare an empty matrix and then use a ‘for loop’ to fill it
#3.2) Create a vector called samsri200means that contains the means for each column (sample) from samsri.
#3.3) Plot a density plot of samsri200means (0.5 pt). What does this distribution look like?
——————————————————————————————
THE LECTURE NOTES FROM CLASS ARE INCLUDED BELOW AS A REFERENCE FOR THE LAB HOMEWORK
#Lab 4-Contents
#0. Review of Normal Probability Distribution Functions
#1. Simulating Populations using Random Variables
#2. Taking Samples from a Population: The Sampling Distribution
#3. Programming in R: Using Loops
#4. Programming in R: The apply function
#5. Sampling Distribution of the Uniform Distribution
#——————————————————–
# 0. Review of Normal Probability Distribution Functions
#——————————————————–
#Last week we learned how to calculate:
#1) Probabilities from a Normal Distribution using pnorm(q, mean, sd), where q is a value (quantile)
#Ex: What is the probability of a student getting a 75 or less on the exam
#2) Quantiles from a Normal Distribution using qnorm(p, mean, sd), where p is a probability
#Ex: What score would a student have to achieve to be in the top 10% on the exam
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#EXERCISE 0-1: What is the probability of a student
#getting a 75 or less on the exam given that
#the scores on the exam follow a normal distribution
#of mean 78, and standard deviation of 10?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
pnorm(75, 78, 10)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#EXERCISE 0-2: What score would a student have to achieve
#to be in the top 10% on the exam given
#the scores on the exam follow a normal distribution of mean 78,
#and standard deviation of 10?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
qnorm(.9, 78, 10)
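#A minimal sketch (not part of the original lab) showing how pnorm() and qnorm()
#relate: qnorm() undoes pnorm(). The names p_low, p_high, and cutoff are
#illustrative only.
p_low = pnorm(75, mean=78, sd=10)                     #P(score <= 75)
p_high = pnorm(75, mean=78, sd=10, lower.tail=FALSE)  #P(score > 75)
cutoff = qnorm(0.9, mean=78, sd=10)                   #score at the 90th percentile
p_low + p_high                                        #the two tails add up to 1
qnorm(p_low, mean=78, sd=10)                          #recovers 75
pnorm(cutoff, mean=78, sd=10)                         #recovers 0.9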
#——————————————————–
#1. Simulating Populations using Random Variables
#——————————————————–
#The difference between a population and a sample:
#Your sample is the group of individuals who participate
#in your study, and your population is the broader group
#of people to whom your results will apply.
#Therefore: “population” in statistics includes all members of a defined group.
#A part of the population is called a sample.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Last week (Lab 3), we used the command rnorm()
#to create a variable with a normal distribution
#Random Normal variable: rnorm(n, mean, sd)
#NOTE: We can specify how large our population is, the mean, and the SD.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#EXERCISE 1-1:
# Create a normally distributed population variable called pop
# that consists of 10,000 subjects with mean of 15 and sd of 2
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#set.seed(1)
pop = rnorm(n=10000, mean=15, sd=2)
#Let’s verify that this did what we wanted.
hist(pop); mean(pop); sd(pop)
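#A minimal sketch (not part of the original lab) of what the commented-out
#set.seed() line above is for: fixing the seed makes rnorm() return the same
#draws every time. The seed value 1 and the names draw_a, draw_b, and draw_c
#are arbitrary choices for this illustration.
set.seed(1); draw_a = rnorm(n=5, mean=15, sd=2)
set.seed(1); draw_b = rnorm(n=5, mean=15, sd=2)
draw_c = rnorm(n=5, mean=15, sd=2)   #no reset, so the generator keeps going
identical(draw_a, draw_b)            #TRUE: same seed, same draws
identical(draw_a, draw_c)            #FALSE: new draws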
#??????????????????????????????????????????????????????????????????????????????????#
#Thought Question 1: Why are your mean and sd for pop slightly different
#from mine? Or rather, why does no one get exactly a mean of 15 and an sd of 2?
#??????????????????????????????????????????????????????????????????????????????????#
#??????????????????????????????????????????????????????????????????????????????????#
# Now, let’s pretend this variable pop is a population of
# undergrads + graduate students at USC who have ever used marijuana
#Thought Question 2: Considering that USC has ~20k students,
#in reality, could I actually collect this information from every USC
#student to form this distribution? What should I do instead?
#??????????????????????????????????????????????????????????????????????????????????#
#—————————————————————
#2. Taking Samples from a Population: The Sampling Distribution
#—————————————————————
#Most research studies deal with samples
#We can take a sample from data we consider to be our population
#by using the sample() function in R
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#Random sample: sample(x, size, replace=TRUE)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#Example 1: Let’s pretend we are researchers.
#While ideally we would like to study the POPULATION of marijuana users
#at USC, we realize that we only have funding to ask 200 students.
#We can see what our data might look like if we take a sample from
#the population variable pop.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
sam = sample(x=pop, size=200, replace=TRUE)
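#A minimal sketch (not part of the original lab) of what replace=TRUE changes.
#demo_small is a tiny made-up population so the effect is easy to see.
demo_small = 1:10
with_repl = sample(x=demo_small, size=10, replace=TRUE)      #values may repeat
without_repl = sample(x=demo_small, size=10, replace=FALSE)  #each value used once
anyDuplicated(without_repl)   #0: no repeats without replacement
sort(without_repl)            #1 to 10, just reordered
#sample(x=demo_small, size=11, replace=FALSE) would stop with an error,
#because you cannot draw 11 distinct subjects from a population of 10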
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#Exercise 2-1: Calculate the mean and SD of the sample sam.
#How do these results differ from the means
# and SDs in the pop variable?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
mean(sam); sd(sam); hist(sam)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#Exercise 2-2:
# A) Create a random sample called sam20 from pop containing 20 subjects.
# B) Create a random sample called sam750 from pop containing 750 subjects.
# C) Compute the Means and SDs for sam20 and sam750. Create Histograms for both.
# D) How do the means from each sample compare to the true population mean of 15?
# E) Is there an association between the number of people in the sample and the magnitude of
# the difference from the population mean?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#A)
sam20=sample(pop, 20, replace=TRUE)
#B)
sam750=sample(pop, 750, replace=TRUE)
#C)
mean(sam20); mean(sam750)
sd(sam20); sd(sam750)
plot(density(sam20)); lines(density(sam750), col="red")
#D)
abs(15-mean(sam20)); abs(15-mean(sam750))
#E)
#As the number of people in the sample goes up,
#the sample mean tends to be closer to the true population mean
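#A minimal sketch (not part of the original lab): a single sample of 20 can land
#close to 15 by luck, so this repeats the comparison many times. replicate()
#simply re-runs an expression and collects the results. demo_pop, err20, err750,
#the seed, and the 500 repetitions are assumptions made for the illustration.
set.seed(2)
demo_pop = rnorm(n=10000, mean=15, sd=2)
err20 = replicate(500, abs(mean(demo_pop) - mean(sample(demo_pop, 20, replace=TRUE))))
err750 = replicate(500, abs(mean(demo_pop) - mean(sample(demo_pop, 750, replace=TRUE))))
mean(err20); mean(err750)                       #average miss for each sample size
sd(demo_pop)/sqrt(20); sd(demo_pop)/sqrt(750)   #standard error of the mean: sd/sqrt(n)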
#—————————————————————
#3. Programming in R: Using Loops
#—————————————————————
#In practice, as researchers we almost always have samples
#and NEVER really know the true population
#Simulating a population and taking samples from it can tell us something
#about how well a given estimator (mean, trimmed mean, median etc.)
#represents a distribution (e.g. normal vs. skewed)
#To begin to understand how taking samples can give us information about an estimator,
#we need to take MANY samples from our simulated population.
#Let’s say we wanted to have 100 different samples of pop with 200 subjects in each sample.
#We could do this two ways:
#1) Write the function sample() many times
sam1=sample(pop, 200, replace=TRUE)
sam2=sample(pop, 200, replace=TRUE)
# …
sam100=sample(pop, 200, replace=TRUE)
#2) Or use a loop
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#Loop over ii:
# for (ii in X:Y) {       # ii is the counter
#   COMMANDS WITH ii      # X is the first value of the counter
# }                       # Y is the last value of the counter
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
# x=23+24;x
#print(x)
#Here is a simple loop going from a value of 1 to 10
for (jj in 1:10) {
  print(jj)
}
#NOTE: When executing the loop, you MUST highlight and run the entire loop,
#from for through the closing }, including the braces.
#jj will take the values specified with the “in 1:10” argument
#At first jj = 1, but then it will increase by 1 (1,2,3,4,5…) until it reaches 10
#Also, we can name the counter whatever we want.
#Below I’ve used kk instead of jj to demonstrate this
for (kk in 1:15) {
  print(kk)
}
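#A minimal sketch (not part of the original lab) of a loop that does more than
#print: it stores one result per pass. The names squares and nn are illustrative.
squares = numeric(5)      #empty vector with room for 5 results
for (nn in 1:5) {
  squares[nn] = nn^2      #pass nn fills slot nn
}
squares                   #1 4 9 16 25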
#Back to our goal:
#We want to be able to take 100 different samples of pop with 200 people in each sample
#A good way to do this is to first create an EMPTY matrix to put our data
#into using the matrix() command
mysams = matrix(NA, nrow=200, ncol=100) #? Why ncol=100 and nrow=200?
#Then we can use a loop to place each of our samples into a column of this empty matrix called “mysams”
for (ii in 1:100) {
  mysams[ ,ii] = sample(pop, size=200, replace=TRUE)
}
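#A minimal sketch (not part of the original lab): replicate() can build the same
#kind of matrix without declaring it first. demo_pop2 and demo_sams are made-up
#names so that the lab's pop and mysams are left untouched.
demo_pop2 = rnorm(n=10000, mean=15, sd=2)
demo_sams = replicate(100, sample(demo_pop2, size=200, replace=TRUE))
dim(demo_sams)   #200 100: each column is one sample of 200 subjects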
#Look to your right and double-click ‘mysams’ to view the filled matrix
#—————————————————————
#4. Programming in R: The apply function
#—————————————————————
# We just learned how to use a loop to take MANY random samples
# from a population variable. While the purpose of this
# may not be clear just yet, it will be later on in the semester.
#Once we have our dataset containing 100 samples of 200 people, I’d like to find out the mean of each sample
#I could do this two ways:
#1) By manually doing it
mean(mysams[,1])
mean(mysams[,2])
#…
mean(mysams[,100])
#2) By using a loop
#I have to use print() here because things in loops don’t get output to the screen without it
for (jj in 1:100) {
  print(mean(mysams[,jj]))
}
#However, I don’t just want to KNOW the means of each sample, instead I’d like to have a variable
#where each observation is the mean of a given sample so that I can analyse the means of the samples
#We can do this by first creating an empty vector of length 100
sam100means = numeric(100)
#And then using a loop to populate the vector
for (jj in 1:100) {
  sam100means[jj] = mean(mysams[,jj])
}
#With this, I can examine the average (mean) of the means for each sample
mean(sam100means)
#And their distribution
hist(sam100means)
#There is an easier way to get this sam100means variable,
#We can use the apply() function!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#Collapse Data (apply): apply(X, MARGIN, FUN)
# X=dataset; MARGIN: 1=Rows, 2=Columns; FUN=Function
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#I will create a variable called sam100means2 containing the column means of mysams
sam100means2 = apply(X=mysams, MARGIN=2, FUN=mean)
#In the above: MARGIN=2 tells R to do the operation on the columns
#FUN=mean tells R to take the mean
#A Density plot of this:
plot(density(sam100means2))
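#A minimal sketch (not part of the original lab): for column means specifically,
#base R also has colMeans(), which should agree with apply(). demo_mat is a
#small made-up matrix standing in for mysams.
demo_mat = matrix(rnorm(20, mean=15, sd=2), nrow=5, ncol=4)
by_apply = apply(X=demo_mat, MARGIN=2, FUN=mean)
by_colmeans = colMeans(demo_mat)
all.equal(by_apply, by_colmeans)        #TRUE
apply(X=demo_mat, MARGIN=1, FUN=mean)   #MARGIN=1 gives the 5 row means instead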
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#Exercise 4:
# A) Create a variable called sam100sd that contains the standard deviations of each
# sample from mysams. Use whatever method you prefer to do this.
# B) Show a density plot of the SDs from mysams
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#A)
sam100sd = apply(X=mysams, MARGIN=2, FUN=sd)
#Alternatively
sam100sd1 = numeric(100)
for (i in 1:100) {
  sam100sd1[i] = sd(mysams[, i])
}
#B)
plot(density(sam100sd))
#—————————————————————
#5. Sampling Distribution of the Uniform Distribution
#—————————————————————
#While we’ve seen that the means
#of random samples taken from a NORMAL population
#are normally distributed, what if our population is not normally distributed?
#Let’s look, for example, at the uniform distribution,
# which can be created using the runif() function
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#Random Uniform variable: runif(n, min=0, max=1)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#I’ll create a uniform population distribution of 10,000 subjects
popunif = runif(10000) #Uniform distribution
#This distribution looks like:
hist(popunif)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#Exercise 5:
# A) Create a matrix called unifsams that contains 150 random samples of 250 subjects from popunif
# B) Create a variable called unif150means that contains the means for each of the 150 samples
# C) Create a density plot of means in unif150means. What does this plot look like?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#A)
unifsams = matrix(, ncol=150, nrow=250)
for (ii in 1:150) {
  unifsams[,ii] = sample(popunif, 250, replace=TRUE)
}
#B)
unif150means = apply(unifsams, 2, mean)
#C)
plot(density(unif150means))
#Approximately normally distributed: remember the Central Limit Theorem.
# Read section 5.3.2 of the book for a theoretical explanation of this last exercise
#Section 5.3.2: Approximating the Sampling Distribution of the Sample Mean: The General Case
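#A minimal sketch (not part of the original lab) comparing the simulation with
#what theory predicts. A uniform(0,1) variable has mean 0.5 and sd sqrt(1/12), so
#means of samples of 250 should center near 0.5 with sd near sqrt(1/12)/sqrt(250).
#demo_unif and demo_means are made-up names, and the seed is arbitrary.
set.seed(3)
demo_unif = runif(10000)
demo_means = replicate(150, mean(sample(demo_unif, 250, replace=TRUE)))
mean(demo_means); sd(demo_means)         #simulated center and spread
0.5; sqrt(1/12)/sqrt(250)                #theoretical values, about 0.5 and 0.018
qqnorm(demo_means); qqline(demo_means)   #points near the line suggest normality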