出于与基于调查的比例置信区间(CI)进行比较的目的,我在这里发布了代码,以使用中心极限定理(CLT)来计算不相同分布的变量(如原始要求海报(OP)在他的评论中):
1)辅助函数定义
#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design,normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df,design) {
design.mean = svymean(~Category,design,deff="replace")
CI = confint( design.mean )
# Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
CI = cbind( p=design.mean,Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2,CI )
return(as.data.frame(CI))
}
#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category","Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindenberg's version of the Central Limit Theorem (CLT) which
#' applies to independent but NOT identically distributed random variables
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df,ci_level) {
### The following calculations are based on the following setup for each category given in 'df':
### 1) Let X(i) be the measurement of variable X for sampled case i,i = 1 ... n (n=500 in this case)
### where X is a 0/1 variable indicating absence or presence of a selected category.
### From the X(i) samples we would like to estimate the
### true proportion p of the presence of the category in the population.
### Therefore X(i) are iid random variables with Binomial(1,p) distribution
###
### 2) Let Y(i) = w(i)*X(i)
### where w(i) is the sampling weight applied to variable X(i).
###
### We apply the CLT to the sum of the Y(i)'s,using:
### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
###
### Hence,by CLT:
### Sum{Y(i) - mu(i)} / sigma -> N(0,1)
### where:
### sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
### and note that:
### Sum{ mu(i) } = Sum{ w(i) } * p = n*p
### since the sampling weights are assumed to sum up to the sample size.
###
### Note: all the Sums are from i = 1,...,n
###
### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
### in the usual way,by first estimating sigma replacing p for the estimated p.
###
alpha = 1 - ci_level # area outside the confidence band
z = qnorm(1 - alpha/2) # critical z-quantile from Normal(0,1)
n = nrow(df) # Sample size (assuming no missing values)
ws = df$Weight / sum(df$Weight) * n # Weights scaled to sum the sample size (assumed for sampling weights)
S = aggregate(ws ~ Category,sum,data=df) # Weighted-base estimate of the total by category (Sum{ Y(i) })
sigma2 = sum( ws^2 ) # Sum of squared weights (note that we must NOT sum by category)
S[,"p"] = S[,"ws"] / n # Estimated proportion by category
S[,"Delta"] = z * sqrt( sigma2 ) *
sqrt( S$p * (1 - S$p) ) / n # Semi-size of the CI by category
LB_name = paste(formatC(alpha/2*100,format="g"),"%") # Name for the CI's Lower Bound column
UB_name = paste(formatC((1 - alpha/2)*100,"%") # Name for the CI's Upper Bound column
S[,LB_name] = S[,"p"] - S[,"Delta"] # CI's Lower Bound
S[,UB_name] = S[,"p"] + S[,"Delta"] # CI's Upper Bound
return(S)
}
#' Show the CI with the specified number of significant digits
show_values <- function(values,digits=3) {
op = options(digits=digits)
print(values)
options(op)
}
2)模拟数据
### Simulated data
set.seed(100)
df <- data.frame(
Category = rep(c("A","B","C","D"),times = seq(50,200,length.out = 4)),Weight = sample(c(1,1/2,1/3,1/4,1/5),500,prob = c(0.1,0.2,0.4,0.1),replace = TRUE)
)
3)基于调查的CI(仅供参考)
# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1,weights = ~Weight,data = df)
CI_survey = ci_survey(df,design)
show_values(CI_survey)
给出:
p Delta 2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628 0.116
CategoryB 0.2119 0.0417 0.1702 0.254
CategoryC 0.2824 0.0445 0.2379 0.327
CategoryD 0.4161 0.0497 0.3664 0.466
4)基于CLT的CI
上面定义的ci_clt()
函数顶部的注释中包含了所用方法的描述。
# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df,ci_level=0.95)
show_values(CI_clt)
给出:
Category ws p Delta 2.5 % 97.5 %
1 A 44.8 0.0896 0.0286 0.061 0.118
2 B 106.0 0.2119 0.0410 0.171 0.253
3 C 141.2 0.2824 0.0451 0.237 0.328
4 D 208.0 0.4161 0.0494 0.367 0.465
5)CI大小的比较
这里我们计算基于CLT的CI和基于调查的CI之间的比率。
# Comparison of CI size
show_values(
data.frame(Category=CI_clt[,"Category"],ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)
给出:
Category ratio_DeltaCI_clt2survey
1 A 1.067
2 B 0.982
3 C 1.013
4 D 0.994
由于所有比率均接近1,因此我们得出结论,两种方法的CI大小非常相似!
6)检查CI的基于CLT的实现似乎正确
对基于CLT的配置项计算执行的一项方便检查是在简单随机样本(SRS)情况下运行该计算,并验证结果与svymean()
计算所给出的结果相符。 SRS设计。
# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1 # SRS: weights equal to 1
show_values( ci_clt(df_noweights,ci_level=0.95) )
给出:
Category p Delta 2.5 % 97.5 %
1 A 0.1 0.0263 0.0737 0.126
2 B 0.2 0.0351 0.1649 0.235
3 C 0.3 0.0402 0.2598 0.340
4 D 0.4 0.0429 0.3571 0.443
我们将其与基于调查的计算进行比较:
# Survey-based calculation
design <- svydesign(ids=~1,probs=NULL,data=df)
show_values( ci_survey(df,design) )
给出:
p Delta 2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737 0.126
CategoryB 0.2 0.0351 0.1649 0.235
CategoryC 0.3 0.0402 0.2598 0.340
CategoryD 0.4 0.0430 0.3570 0.443
我们看到结果是一致的,这表明在不等权重的一般情况下,基于CLT的CI计算的实现是正确的。
,
我终于找到了解决问题的简单方法。包裹“调查”正是我想要的:
set.seed(100)
library(survey)
df <- data.frame(
Category = rep(c("A",replace = TRUE)
)
d <- svydesign(id = ~1,data = df)
svymean(~Category,d)
该代码将得出加权比例(与“ ws”中的结果相同)和相应的标准误差,从而使计算置信区间变得容易。