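I have a text file with a column named "descn" that contains text, but every value is HTML-formatted. I would like to use PySpark to convert the HTML text to plain text. Please help me do this.
File name: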
mdcl_insigt.txt
Input:
PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>
It should be converted like this. Output:
PROTEUS We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.
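
One possible approach, offered as a minimal sketch rather than a definitive solution: strip the markup with Python's standard-library HTMLParser and wrap that function in a PySpark UDF. The names html_to_text, add_plain_text_column, and the output column descn_plain are hypothetical. The sketch also assumes that the stray "Â" in the sample is a mis-decoded non-breaking space and drops it only when it is directly followed by whitespace; if that assumption is wrong for your data, remove that step.

```python
import re
from html.parser import HTMLParser

# Assumption: a lone "Â" directly before whitespace is mojibake left over from
# a non-breaking space decoded with the wrong charset, so it can be dropped.
_MOJIBAKE_NBSP_RE = re.compile("\u00c2(?=\\s)")


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp;, &nbsp;, ...
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(value):
    """Return the plain text of an HTML fragment, or None for None input."""
    if value is None:
        return None
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    # Join text nodes with a space so adjacent <div> blocks do not run together.
    text = " ".join(parser.parts)
    text = _MOJIBAKE_NBSP_RE.sub("", text)
    # Collapse runs of whitespace (including non-breaking spaces) to one space.
    return " ".join(text.split())


def add_plain_text_column(df, src="descn", dst="descn_plain"):
    """Hypothetical helper: add a plain-text copy of an HTML column to a DataFrame."""
    # Imported lazily so html_to_text can be used and tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    strip_html = F.udf(html_to_text, StringType())
    return df.withColumn(dst, strip_html(F.col(src)))
```

How mdcl_insigt.txt is loaded into a DataFrame depends on its delimiter and encoding, which are not stated here; once it is loaded as df, add_plain_text_column(df) would add the cleaned column. Because text nodes are joined with a space, inline markup inside a word (for example a bold tag around half a word) would gain a space, so a dedicated HTML library may be a better fit for more complex markup.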