How to remove double quotes from retrieved JSON data

2024-04-30 • Q&A

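I am currently using BeautifulSoup to scrape listings from a job website and to output the data as JSON, working from the site's HTML source.

I fixed the errors that came up with regular expressions, but this particular problem has me stuck. When scraping a job listing, instead of pulling information out of each container of interest, I chose to extract the JSON data embedded in the HTML source (<script type="application/ld+json">). From there I convert the BeautifulSoup result to a string, clean out the HTML leftovers, and then convert the string to JSON. However, I have hit a snag caused by quotation marks used in the text of the job listing. Since the actual data is large, I will use a stand-in.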

example_string = '{"Category_A" : "Words typed describing stuff","Category_B" : "Other words speaking more irrelevant stuff","Category_X" : "Here is where the "PROBLEM" lies"}'

The string above is only a stand-in, but the string I extract from the job listing's HTML has much the same format. When it is passed to json.loads(), it raises the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035
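
The failure can be reproduced on a small scale. The following is a minimal sketch that uses a shortened version of the stand-in string (not the real listing data) and prints where the parser gives up:

```python
import json

# Shortened stand-in: the inner quotes around PROBLEM are not escaped.
example_string = '{"Category_A" : "Words typed describing stuff","Category_X" : "Here is where the "PROBLEM" lies"}'

error = None
try:
    json.loads(example_string)
except json.JSONDecodeError as exc:
    error = exc
    # exc.pos is the character offset at which the parser stopped
    print(exc.msg, "at position", exc.pos)
    print(example_string[max(0, exc.pos - 20):exc.pos + 10])
```

The parser treats the quote before PROBLEM as the end of the value, so it expects a comma or a closing brace next and finds a letter instead.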

I am not at all sure how to fix this.

EDIT: here is the actual code that leads to the error:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import json,re

uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()

listing_soup = BeautifulSoup(page_html,"lxml")

json_script = listing_soup.find("script", {"type": "application/ld+json"}).strings

extracted_json_str = ''.join(json_script)

## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern = r"\r+|\n+|\t+|\\l+|  |&nbsp;|amp;|\u2013|</?.{,6}>",# last is to get rid of </p> and </strong>
                                repl='',string = extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern = r"\\u2019",repl = r"'",string = extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',repl=r" -",string = extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',repl="",string = extracted_json_str_CLEAN3)

## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)

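I do know what causes the error: in the last bullet point of Objective 4 in the job description, the author put quotation marks around one of the job's required tasks (namely "quality control"). Given the way I have been extracting information from these listings, a single instance of someone using quotes brings my whole approach crashing down. Surely there is a better way to build this script without such a liability (and without having to patch every new failure with a regex).

Thanks!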
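
One possible restructuring is to parse first and clean afterwards. The last cleanup step in the code above strips every backslash, which would also strip the escaping from any \" inside a value. The sketch below assumes the page escapes inner quotes as \" in its JSON-LD, which has not been verified against the real listing. raw_json and clean_text are hypothetical names, and the sample text is invented for illustration:

```python
import html
import json
import re

# Hypothetical stand-in for the text of the <script type="application/ld+json"> tag.
# Assumption: inner quotes arrive escaped as \" (as valid JSON requires).
raw_json = r'{"title": "Program Manager", "description": "<p>Ensure \"quality control\" of&nbsp;reports \u2013 weekly</p>"}'

# Parse while the escapes are still intact. strict=False also allows raw
# control characters (tabs, newlines) inside string values.
listing = json.loads(raw_json, strict=False)

def clean_text(value):
    """Tidy one decoded string value; quotes are harmless at this point."""
    value = html.unescape(value)                  # &nbsp; and &amp; become real characters
    value = re.sub(r"</?[^>]{1,10}>", "", value)  # drop short tags such as <p> or </strong>
    value = value.replace("\u00a0", " ")          # non-breaking space to plain space
    return re.sub(r"\s+", " ", value).strip()     # collapse runs of whitespace

cleaned = {key: clean_text(val) if isinstance(val, str) else val
           for key, val in listing.items()}
print(cleaned["description"])
```

Because the cleanup runs on decoded values rather than on the JSON text, quotation marks in a description can no longer break the parse.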

# When you extract this, I think you should add a check for it.
# Example:
if "\"" in extraction:
    extraction = extraction.replace("\"","\'")
print(extraction)

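In this case you convert the " characters in the extracted text. Something has to be converted because of how quoting works: Python lets you use both kinds of quote, so if you want " inside a string you need to swap the delimiter or escape the quote:

Example: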

"this is a 'test'"
'this was a "test"'
"this is not a \"test\""

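# if the condition is met
if "\"" in item:
    # either swap the double quotes for single quotes...
    item = item.replace("\"", "\'")
    # ...or keep them and escape each one instead (use one option, not both)
    # item = item.replace("\"", "\\\"")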
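
As a rough illustration of what the two replacements produce, the sketch below applies each one to a single extracted value (the value and the variable names are made up) and then places the escaped version into a JSON document by hand:

```python
import json

item = 'Here is where the "PROBLEM" lies'

swapped = item.replace("\"", "\'")      # double quotes become single quotes
escaped = item.replace("\"", "\\\"")    # each double quote gains a leading backslash

print(swapped)
print(escaped)

# The escaped value can be embedded in JSON text and parsed back unchanged.
document = '{"Category_X": "' + escaped + '"}'
print(json.loads(document)["Category_X"])
```

Both replacements only make sense on an individual value. Applied to the whole JSON text they would also alter the quotes that delimit keys and values.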

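If you want to use a double quote (") inside a value, you need to escape it with a backslash (\). In a regular Python string literal the backslash itself has to be doubled, so the string you pass to json.loads() should look like this: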

example_string = '{"Category_A": "Words typed describing stuff","Category_B": "Other words speaking more irrelevant stuff","Category_X": "Here is where the \\"PROBLEM\\" lies"}'

json.loads can parse this.
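
When the data starts out as Python objects, the escaping does not have to be written by hand. A small sketch with an illustrative dictionary shows json.dumps producing the escaped form and json.loads reading it back:

```python
import json

data = {"Category_X": 'Here is where the "PROBLEM" lies'}

encoded = json.dumps(data)     # inner quotes are written as \" automatically
print(encoded)

decoded = json.loads(encoded)  # the round trip restores the original value
print(decoded["Category_X"])
```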
