statsmodels OLS from a formula with pandas groupby

I have a dataframe of the following form:

       date         TICKER        x1       x2  ...       Z        Y  month    x3
0 1999-12-31    A UN Equity  52.1330  51.9645  ...  0.0052      NaN     12   NaN
1 1999-12-31   AA UN Equity  92.9415  92.8715  ...  0.0052      NaN     12   NaN
2 1999-12-31  ABC UN Equity   3.6843   3.6539  ...  0.0052      NaN     12   NaN
3 1999-12-31  ABF UN Equity  22.0625  21.9375  ...  0.0052      NaN     12   NaN
4 1999-12-31  ABM UN Equity  10.2188  10.1250  ...  0.0052      NaN     12   NaN

I want to run an OLS regression with statsmodels.formula.api (imported as smf) using the formula 'Y ~ x1 + x2:x3', separately for each group of ['TICKER','year','month'] (year is a column that is not shown above). So I use:

data.groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3',data=x))

However, I get the following error:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
    result = self._python_apply_general(f,self._selected_obj)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
    keys,values,mutated = self.grouper.apply(f,data,self.axis)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
    res = f(group)
  File "<input>",in <lambda>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 195, in from_formula
    mod = cls(endog,exog,*args,**kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 872, in __init__
    super(OLS,self).__init__(endog,missing=missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 703, in __init__
    super(WLS,line 190,in __init__
    super(RegressionModel,**kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 237, in __init__
    super(LikelihoodModel,line 77,in __init__
    self.data = self._handle_data(endog,missing,hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 101, in _handle_data
    data = handle_data(endog,**kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 672, in handle_data
    return klass(endog,exog=exog,hasconst=hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 71, in __init__
    arrays,nan_idx = self.handle_missing(endog,line 247,in handle_missing
    if combined_nans.shape[0] != nan_mask.shape[0]:
IndexError: tuple index out of range

Any idea why? The full traceback is shown above.
xinsuandeluoba answered: statsmodels OLS from a formula with pandas groupby

I see that your Y column has a lot of NaNs, so you need to make sure each subgroup has enough observations for the regression to work.

So, if I use some example data:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(123)
data = pd.concat([
    pd.DataFrame({'TICKER': np.random.choice(['A','B','C'],30),
                  'year': np.random.choice([2000,2001],30),
                  'month': np.random.choice([1,2],30)}),
    pd.DataFrame(np.random.normal(0,1,(30,4)),columns=['Y','x1','x2','x3'])
],axis=1)

# make the first seven values of Y missing
data.loc[:6,'Y'] = np.nan

If I run your code on the dataframe above, I get the same error.

So if we use only the rows that are complete in the columns relevant to your regression:

complete_ix = data[['Y','x1','x2','x3']].dropna().index
data.loc[complete_ix].groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3',data=x))

it works.
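
If you also want the fitted coefficients, and not only the model objects, something along these lines might help. It is only a sketch: the helper fit_group, the threshold MIN_OBS and the toy dataframe are made up for illustration and are not part of your data or of statsmodels. The idea is to drop incomplete rows inside each group, skip groups that are left with too few observations, and call .fit() on the rest.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

FORMULA = "Y ~ x1 + x2:x3"
COLS = ["Y", "x1", "x2", "x3"]
MIN_OBS = 4  # assumption: require more rows than the 3 estimated parameters


def fit_group(group):
    """Return the OLS coefficients for one group, or None if it is too small."""
    complete = group.dropna(subset=COLS)
    if len(complete) < MIN_OBS:
        return None
    return smf.ols(formula=FORMULA, data=complete).fit().params


# Toy data (hypothetical): 3 tickers x 2 years x 2 months, 5 rows per group.
rng = np.random.default_rng(123)
n = 60
data = pd.DataFrame({
    "TICKER": np.repeat(["A", "B", "C"], 20),
    "year": np.tile(np.repeat([2000, 2001], 10), 3),
    "month": np.tile(np.repeat([1, 2], 5), 6),
    "Y": rng.normal(size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
data.loc[:6, "Y"] = np.nan  # rows 0..6 lose their Y value

params = {}
for key, group in data.groupby(["TICKER", "year", "month"]):
    fitted = fit_group(group)
    if fitted is not None:
        params[key] = fitted

# One row per group that could be fitted, one column per coefficient.
coefs = pd.DataFrame(params).T
```

Groups that do not have enough complete rows are simply missing from coefs, so you can see which ones were skipped by comparing its index with the full list of groups.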