How to do this in R

Task: compute the weighted shares of a categorical variable and find confidence intervals around those weighted shares.

library(dplyr)
set.seed(100)

Create a data set with a categorical variable and a weight variable:

df <- data.frame(
  Category = rep(c("A","B","C","D"), times = seq(50, 200, length.out = 4)),
  # NOTE: prob needs one entry per sampled value; the source was truncated to
  # four entries, so the fifth entry (0.2) is an assumption
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500,
                  prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

A quick look at the data:

head(df, 10)
tail(df, 10)
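The category counts are fixed by construction through the times argument (50, 100, 150 and 200 cases for A through D), which a quick tabulation confirms:

table(df$Category)  # A: 50, B: 100, C: 150, D: 200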

Now, here is how I can complete the task without taking the weights into account:

Write a function that returns the unweighted share of one manifestation (level) of the categorical variable together with its 95% confidence interval, i.e. the Wald interval share ± 1.96 · sqrt(share·(1−share)/n). (For general information on determining the confidence interval for a population proportion, see: https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/)

ci.share <- function(category, manifestation){
  n = length(category)
  share = length(which(category == manifestation)) / n
  se = sqrt(share * (1 - share) / n)

  # Use the normal approximation only if n*p*(1-p) >= 9
  if(n * share * (1 - share) >= 9){
    U <- share - 1.96 * se
    O <- share + 1.96 * se
    KI <- c(U, share, O)
    names(KI) <- c("lower boundary", "share", "upper boundary")
    KI = KI * 100
    return(KI)
  } else {return("Error, because n*p*(1-p) < 9")}
}

Apply the function to all manifestations and store the results in a list:

cis <- list()
for(i in c("A","B","C","D")){
  cis[[i]] <- ci.share(category = df$Category, manifestation = i)
}

Make the results easy to read:

cis <- t(sapply(cis, "["))
cis <- round(cis, digits = 2)
cis  #TASK DONE
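As a quick sanity check on ci.share() (a minimal sketch, not part of the original solution), base R's prop.test() produces a Wilson-type interval for the same unweighted share, which should land close to the Wald interval above:

# Cross-check for category "A": Wilson-type interval from base R's prop.test()
n_A <- sum(df$Category == "A")
prop.test(n_A, nrow(df))$conf.int * 100  # compare with cis["A",]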

The question is: how do I get the equivalent of "cis" taking the weights into account?

What I can do is find the weighted shares:

ws <- summarise_at(group_by(df, Category), vars(Weight), sum)  # funs() is defunct in current dplyr; a bare function works
ws[,2] <- (ws[,2] / sum(df$Weight)) * 100
names(ws) <- c("Category", "Weighted_Share")
sum(ws$Weighted_Share) # Should be 100, is 100
ws
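Incidentally, the same weighted shares can be obtained in one line with base R (a sketch, equivalent to the dplyr pipeline above):

# Weighted shares via xtabs(): sum the weights per category, then normalize
prop.table(xtabs(Weight ~ Category, data = df)) * 100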

But how do I get the confidence intervals now?

A solution would be much appreciated. Thanks in advance! Andy

Answer from qingabc:

For the purpose of comparison with the survey-based confidence intervals (CIs) for proportions, I post here the code to compute the CIs using the Central Limit Theorem (CLT) for non-identically distributed variables, as requested by the original poster (OP) in the comments:

1) Definition of helper functions

#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing the estimated proportion for each "Category" and the respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
  design.mean = svymean(~Category, design, deff="replace")
  CI = confint( design.mean )
  # Add the estimated proportions and Delta = semi-size of the CI for comparison with the CLT-based results below
  CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )

  return(as.data.frame(CI))
}

#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindeberg's version of the Central Limit Theorem (CLT), which
#' applies to independent but NOT identically distributed random variables.
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing the estimated proportion for each "Category" and the respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
  ### The following calculations are based on the following setup for each category given in 'df':
  ### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n = 500 in this case),
  ### where X is a 0/1 variable indicating absence or presence of a selected category.
  ### From the X(i) samples we would like to estimate the
  ### true proportion p of the presence of the category in the population.
  ### Therefore the X(i) are iid random variables with Binomial(1,p) distribution.
  ###
  ### 2) Let Y(i) = w(i)*X(i),
  ### where w(i) is the sampling weight applied to variable X(i).
  ###
  ### We apply the CLT to the sum of the Y(i)'s, using:
  ### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
  ### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) are iid)
  ###
  ### Hence, by the CLT:
  ###   Sum{ Y(i) - mu(i) } / sigma -> N(0,1)
  ### where:
  ###   sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
  ### and note that:
  ###   Sum{ mu(i) } = Sum{ w(i) } * p = n*p
  ### since the sampling weights are assumed to sum up to the sample size.
  ###
  ### Note: all the Sums run over i = 1, ..., n.
  ###
  ### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
  ### in the usual way, by first estimating sigma with p replaced by its estimate.
  ###

  alpha = 1 - ci_level                                         # area outside the confidence band
  z = qnorm(1 - alpha/2)                                       # critical z-quantile from Normal(0,1)

  n = nrow(df)                                                 # Sample size (assuming no missing values)
  ws = df$Weight / sum(df$Weight) * n                          # Weights scaled to sum to the sample size (assumed for sampling weights)
  S = aggregate(ws ~ Category, sum, data=df)                   # Weight-based estimate of the total by category (Sum{ Y(i) })
  sigma2 = sum( ws^2 )                                         # Sum of squared weights (note that we must NOT sum by category)
  S[,"p"] = S[,"ws"] / n                                       # Estimated proportion by category
  S[,"Delta"] = z * sqrt( sigma2 ) *
                    sqrt( S$p * (1 - S$p) ) / n                # Semi-size of the CI by category
  LB_name = paste(formatC(alpha/2*100, format="g"), "%")       # Name for the CI's Lower Bound column
  UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
  S[,LB_name] = S[,"p"] - S[,"Delta"]                          # CI's Lower Bound
  S[,UB_name] = S[,"p"] + S[,"Delta"]                          # CI's Upper Bound

  return(S)
}

#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
  op = options(digits=digits)
  print(values)
  options(op)
}
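A quick way to see that the Delta formula in ci_clt() is sensible (a minimal sketch with hypothetical values, not part of the original answer): with equal weights the scaled weights ws are all 1, so sqrt(sum(ws^2))/n = 1/sqrt(n) and Delta collapses to the classical Wald half-width:

# With ws all equal to 1 (SRS), Delta reduces to z * sqrt(p*(1-p)/n)
n <- 500; p <- 0.3; z <- qnorm(0.975)
ws_equal <- rep(1, n)
all.equal(z * sqrt(sum(ws_equal^2)) * sqrt(p * (1 - p)) / n,
          z * sqrt(p * (1 - p) / n))  # TRUE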

2) Simulated data

### Simulated data
set.seed(100)
df <- data.frame(
  Category = rep(c("A","B","C","D"), times = seq(50, 200, length.out = 4)),
  # As above, the fifth prob entry (0.2) is an assumption; the source line was truncated
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500,
                  prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

3) Survey-based CIs (for reference)

# Computation of CIs using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)

which gives:

               p  Delta  2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628  0.116
CategoryB 0.2119 0.0417 0.1702  0.254
CategoryC 0.2824 0.0445 0.2379  0.327
CategoryD 0.4161 0.0497 0.3664  0.466

4) CLT-based CIs

A description of the method used can be found in the comments at the top of the ci_clt() function defined above.

# Computation of CIs using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)

which gives:

  Category    ws      p  Delta 2.5 % 97.5 %
1        A  44.8 0.0896 0.0286 0.061  0.118
2        B 106.0 0.2119 0.0410 0.171  0.253
3        C 141.2 0.2824 0.0451 0.237  0.328
4        D 208.0 0.4161 0.0494 0.367  0.465

5) Comparison of CI sizes

Here we compute the ratios between the CLT-based CI sizes and the survey-based CI sizes.

# Comparison of CI size
show_values(
  data.frame(Category = CI_clt[,"Category"],
             ratio_DeltaCI_clt2survey = CI_clt[,"Delta"] / CI_survey[,"Delta"])
)

which gives:

  Category ratio_DeltaCI_clt2survey
1        A                    1.067
2        B                    0.982
3        C                    1.013
4        D                    0.994

Since all the ratios are close to 1, we conclude that the CI sizes produced by the two methods are very similar!

6) Checking that the CLT-based implementation of the CIs is correct

A convenient check of the CLT-based CI calculation is to run it on a simple random sample (SRS) and verify that the results agree with those given by the svymean() calculation on an SRS design.

# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1                               # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )

which gives:

  Category   p  Delta  2.5 % 97.5 %
1        A 0.1 0.0263 0.0737  0.126
2        B 0.2 0.0351 0.1649  0.235
3        C 0.3 0.0402 0.2598  0.340
4        D 0.4 0.0429 0.3571  0.443

We compare this against the survey-based calculation:

# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df,design) )

which gives:

            p  Delta  2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737  0.126
CategoryB 0.2 0.0351 0.1649  0.235
CategoryC 0.3 0.0402 0.2598  0.340
CategoryD 0.4 0.0430 0.3570  0.443

We see that the results coincide, which suggests that the implementation of the CLT-based CI calculation is correct in the general case of unequal weights.

Answer from the original poster:

I finally found an easy solution to my problem. The "survey" package does exactly what I want:

set.seed(100)
library(survey)

# The data-frame definition was truncated in the source; it is assumed to match
# the question's data (fifth prob entry is an assumption, as above):
df <- data.frame(
  Category = rep(c("A","B","C","D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500,
                  prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

# weights = ~Weight is needed here (assumed dropped in the source); without it
# the design is unweighted and svymean() cannot reproduce the weighted shares
d <- svydesign(id = ~1, weights = ~Weight, data = df)
svymean(~Category, d)

This code yields the weighted proportions (the same as the results in "ws") and the corresponding standard errors, which makes computing the confidence intervals easy.
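For completeness (a minimal sketch, reusing the confint() call from the answer above), the intervals themselves are one call away:

# 95% confidence intervals around the weighted proportions
confint(svymean(~Category, d))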
