How do I convert HTML text to plain text with PySpark? Replacing HTML tags in a string

2024-05-01 • Q&A

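I have a text file with a column "descn" that contains text, but all of it is HTML-formatted. I want to convert that HTML text to plain text using PySpark. Please help me do this.

File name:

mdcl_insigt.txt

Input: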

PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>

It should be converted like this. Output:

PROTEUS We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.

Answer by biefanwoxingme:

You can try regexp_replace():

from pyspark.sql.functions import regexp_replace

df = df.withColumn("parsed_descn",regexp_replace("descn","<[^>]+>",""))

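This regular expression is not perfect and may fail on some inputs. It only removes tags, so HTML entities such as &#39; are left untouched. Please do some further research to make it more robust.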
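To make the caveat concrete, here is a small sketch of where the pattern works and where it breaks. It uses Python's re module as a stand-in, on the assumption that Spark's regexp_replace (which uses Java regular expressions) treats this simple pattern the same way. The strip_tags helper and the sample strings are hypothetical, made up for illustration.

```python
import re

# Same pattern as in the answer: "<", one or more non-">" characters, then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove everything the pattern considers a tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", text)


# Well-formed markup without ">" inside attributes is handled as expected.
simple = strip_tags("<div><br></div><div>Hello</div>")

# A ">" inside an attribute value ends the match too early and leaves debris.
attr = strip_tags('<a title="a > b">link</a>')

# Plain prose with "<" and ">" is mistaken for a tag and partly deleted.
prose = strip_tags("if 1 < 2 and 3 > 2")

print(repr(simple))
print(repr(attr))
print(repr(prose))
```

If the descn column can contain text like the last two cases, a real HTML parser is probably a safer choice than a regular expression.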

When I tried it on regexr, it handled your example string.

PySpark output:

from pyspark.sql import functions as F

df.withColumn("parsed", F.regexp_replace("descn", "<[^>]+>", "")).select("parsed").collect()

[Row(parsed='PROTEUSÂ We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.')]
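The row above still contains the entity &#39; and the stray "Â", so it does not yet match the output asked for in the question. One possible way to close that gap is a Python UDF that strips tags, decodes entities with the standard library's html.unescape, and tidies whitespace. This is a sketch under assumptions, not the answerer's method: html_to_text and add_plain_text_column are hypothetical names, tags are replaced with a space rather than an empty string so adjacent words do not get glued together, and the "Â" is assumed to be encoding debris from a non-breaking space.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")
# Assumption: "Â" directly before whitespace is debris from a mis-decoded
# non-breaking space, not a real letter in the text.
STRAY_A_PATTERN = re.compile(r"\u00c2(?=\s)")
WHITESPACE_PATTERN = re.compile(r"\s+")


def html_to_text(raw):
    """Convert one HTML fragment to plain text (hypothetical helper)."""
    if raw is None:
        return None  # keep nulls as nulls when used as a UDF
    # Strip tags first, so that escaped markup such as "&lt;b&gt;" survives
    # as literal text after decoding instead of being removed as a tag.
    text = TAG_PATTERN.sub(" ", raw)
    text = html.unescape(text)            # "&#39;" becomes "'"
    text = STRAY_A_PATTERN.sub("", text)  # drop the stray "Â"
    # \s also matches the non-breaking space, so runs collapse to one space.
    return WHITESPACE_PATTERN.sub(" ", text).strip()


def add_plain_text_column(df, source="descn", target="parsed_descn"):
    """Add a plain-text column to a Spark DataFrame (hypothetical helper)."""
    # Imported here so html_to_text can be tried without PySpark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_text = udf(html_to_text, StringType())
    return df.withColumn(target, to_text(df[source]))


sample = "PROTEUS\u00c2\u00a0<div><br></div><div>We don&#39;t control specimens.</div>"
cleaned = html_to_text(sample)
print(cleaned)
```

With a DataFrame at hand, the call would be df = add_plain_text_column(df). A Python UDF is slower than the built-in regexp_replace, so on large data it may be worth checking whether chained built-in functions are sufficient.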

pyspark pyspark-sql

Link to this article: https://www.f2er.com/3125950.html
