How to parse, filter, and aggregate data with Apache Spark and lxml?

I have built a generic XML parser with lxml using etree.fromstring(x). Now I have to parse XML like the following:

 <row acceptedAnswerId="88156" AnswerCount="6" Body="&lt;p&gt;I\'ve just played a game with my kids that basically boils down to: whoever rolls every number at least once on a 6-sided dice wins.&lt;/p&gt;&#10;&#10;&lt;p&gt;I won,eventually,and the others finished 1-2 turns later. Now I\'m wondering: what is the expectation of the length of the game?&lt;/p&gt;&#10;&#10;&lt;p&gt;I know that the expectation of the number of rolls till you hit a specific number is &#10;$\\sum_{n=1}^\\infty n\\frac{1}{6}(\\frac{5}{6})^{n-1}=6$.&lt;/p&gt;&#10;&#10;&lt;p&gt;However,I have two questions:&lt;/p&gt;&#10;&#10;&lt;ol&gt;&#10;&lt;li&gt;How many times to you have to roll a six-sided dice until you get every number at least once? &lt;/li&gt;&#10;&lt;li&gt;Among four independent trials (i.e. with four players),what is the expectation of the &lt;em&gt;maximum&lt;/em&gt; number of rolls needed? [note: it\'s maximum,not minimum,because at their age,it\'s more about finishing than about getting there first for my kids]&lt;/li&gt;&#10;&lt;/ol&gt;&#10;&#10;&lt;p&gt;I can simulate the result,but I wonder how I would go about calculating it analytically.&lt;/p&gt;&#10;&#10;&lt;hr&gt;&#10;&#10;&lt;p&gt;Here\'s a Monte Carlo simulation in Matlab&lt;/p&gt;&#10;&#10;&lt;pre&gt;&lt;code&gt;mx=zeros(1000000,1);&#10;for i=1:1000000,&#10;   %# assume it\'s never going to take us &amp;gt;100 rolls&#10;   r=randi(6,100,1);&#10;   %# since R2013a,unique returns the first occurrence&#10;   %# for earlier versions,take the minimum of x&#10;   %# and subtract it from the total array length&#10;   [~,x]=unique(r); &#10;   mx(i,1)=max(x);&#10;end&#10;&#10;%# make sure we haven\'t violated an assumption&#10;assert(~any(mx==100))&#10;&#10;%# find the expected value for the coupon collector problem&#10;expectationForOneRun = mean(mx)&#10;&#10;%# find the expected number of rolls as a maximum of four independent players&#10;maxExpectationForFourRuns = mean( max( reshape( mx,4,[]),[],1) )&#10;&#10;expectationForOneRun =&#10;   14.7014 (SEM 0.006)&#10;&#10;maxExpectationForFourRuns =&#10;   21.4815 (SEM 0.01)&#10;&lt;/code&gt;&lt;/pre&gt;&#10;" CommentCount="5" CreationDate="2013-01-24T02:04:12.570" FavoriteCount="9" Id="48396" LastactivityDate="2014-02-27T16:38:07.013" LastEditDate="2013-01-26T13:53:53.183" LastEditorUserId="198" OwnerUserId="198" PostTypeId="1" Score="23" Tags="&lt;probability&gt;&lt;dice&gt;" Title="How often do you have to roll a 6-sided dice to obtain every number at least once?" ViewCount="5585" />','  <row AnswerCount="1" Body="&lt;p&gt;Suppose there are $6$ people in a population. During $2$ weeks $3$ people get the flu. Cases of the flu last $2$ days. Also people will get the flu only once during this period. What is the incidence density of the flu?&lt;/p&gt;&#10;&#10;&lt;p&gt;Would it be $\\frac{3}{84 \\text{person days}}$ since each person is observed for $14$ days?&lt;/p&gt;&#10;" CommentCount="4" CreationDate="2013-01-24T02:23:13.497" Id="48397" LastactivityDate="2013-04-24T16:58:18.773" OwnerUserId="20010" PostTypeId="1" Score="1" Tags="&lt;epidemiology&gt;" Title="Incidence density" ViewCount="288" />',

Let's say my goal is to extract the CommentCount values and sum them. I'm doing this through PySpark, and of course the above is only a small sample of the data.

I've tried combining the parser with .filter() and reduceByKey, but without much success. Presumably the lxml parser mentioned above returns a dictionary, though I haven't been able to confirm that.
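For reference, a minimal sketch of what such an attempt might look like (hypothetical code, assuming each input line holds one valid, self-contained <row .../> element; '....' stands in for the real input path). Note that etree.fromstring in fact returns an lxml Element rather than a dict, though its .attrib mapping is dict-like:

from lxml import etree

def comment_count(line):
    # strip the surrounding quotes/commas, then parse one raw line
    elem = etree.fromstring(line.strip().strip("',").strip())
    # elem.attrib behaves like a dict of the tag's attributes
    return int(elem.attrib.get('CommentCount', 0))

total = (spark.sparkContext.textFile('....')      # one <row .../> per line
           .filter(lambda line: '<row' in line)   # drop any non-row lines
           .map(lambda line: ('CommentCount', comment_count(line)))
           .reduceByKey(lambda a, b: a + b)       # -> [('CommentCount', 9)]
           .collect())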

Can someone explain the best way to aggregate the CommentCount values in the XML above?

Note: Databricks cannot be installed on my system, so any solution must not require it.

Answer from bingning128:

One approach you can try is Spark SQL's built-in xpath-related functions, but only if the XML fragments are all valid XML (or can easily be converted into valid XML) and each one sits on its own line.

# read the file in line mode; we get a single column named 'value'
df = spark.read.text('....')

For example, with the current sample XML, we can trim the leading and trailing commas, single quotes, and whitespace, then use the XPath row//@CommentCount, which selects the value of the CommentCount attribute on the row tag. This yields an array column of the matching attribute values:

df.selectExpr('''xpath(trim(both ",' " from value), "row//@CommentCount") as CommentCount''').show()
+------------+
|CommentCount|
+------------+
|         [5]|
|         [4]|
+------------+

We can then take the first element of each array and sum them:

df.selectExpr('''
    sum(xpath(trim(both ",' " from value), "row//@CommentCount")[0]) as sum_CommentCount
''').show()
+----------------+
|sum_CommentCount|
+----------------+
|             9.0|
+----------------+

The problem with this approach is that it is very fragile: any invalid XML will fail the whole job, and so far I haven't found any workaround for that within the SQL functions themselves.
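One possible way to sidestep the fragility (a sketch only, assuming lxml is installed on the executors) is to do the parsing in a Python UDF, so that malformed rows become NULL instead of failing the job:

from lxml import etree
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def safe_comment_count(value):
    # return NULL for malformed XML instead of failing the whole job
    try:
        elem = etree.fromstring(value.strip().strip("',").strip())
        return int(elem.attrib.get('CommentCount', 0))
    except etree.XMLSyntaxError:
        return None

df.select(safe_comment_count('value').alias('CommentCount')).show()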

Another approach is to use the API function regexp_extract. Since the text you want to retrieve is simple (i.e., no embedded tags, quotes, etc.), this is practical:

from pyspark.sql.functions import regexp_extract

df.select(regexp_extract('value', r'\bCommentCount="(\d+)"', 1).astype('int').alias('CommentCount')).show()
+------------+
|CommentCount|
+------------+
|           5|
|           4|
+------------+

You can then sum over that integer column, as sketched below. Just my 2 cents.
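A minimal sketch of that final aggregation (the spark_sum alias just avoids shadowing Python's built-in sum):

from pyspark.sql.functions import regexp_extract, sum as spark_sum

df.select(
    regexp_extract('value', r'\bCommentCount="(\d+)"', 1)
        .astype('int').alias('CommentCount')
).agg(spark_sum('CommentCount').alias('sum_CommentCount')).show()
# expect 9 for the two sample rows (5 + 4)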
