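How to parse a tab-delimited text file with the 4th column as JSON and remove certain keys?

2024-05-07 • Q&A

I have a 26 GB text file whose lines have the following format: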

/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["bernan Press"],"physical_format": "Hardcover","subtitle": "9th November - 3rd December,1992","key": "/books/OL10000135M","title": "Parliamentary Debates,House of Lords,Bound Volumes,1992-93","identifiers": {"goodreads": ["6850240"]},"isbn_13": ["9780107805401"],"languages": [{"key": "/languages/eng"}],"number_of_pages": 64,"isbn_10": ["0107805405"],"publish_date": "December 1993","last_modified": {"type": "/type/datetime","value": "2010-04-24T17:54:01.503315"},"authors": [{"key": "/authors/OL2645777A"}],"latest_revision": 4,"works": [{"key": "/works/OL7925046W"}],"type": {"key": "/type/edition"},"subjects": ["Government - Comparative","Politics / Current Events"],"revision": 4}

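I am trying to get only the last column, which is JSON, and then save the "title", "isbn_13", and "isbn_10" values from that JSON.

So far I have only managed to save the last column, using this code: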

import csv
import sys

csv.field_size_limit(sys.maxsize)
# File names: to read in from and read out to
input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file,"w") as tmp_file:
        reader = csv.reader(to_read,delimiter = "\t")
        writer = csv.writer(tmp_file)

        desired_column = [4]        # text column

        for row in reader:     # read one row at a time
            myColumn = list(row[i] for i in desired_column)   # build the output row (process)
            writer.writerow(myColumn) # write it

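However, this does not write proper JSON objects: every object comes out with its double quotes doubled. Also, how would I extract certain values from the JSON into a new JSON object?

Edit: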

"{""publishers"": [""bernan Press""],""physical_format"": ""Hardcover"",""subtitle"": ""9th November - 3rd December,1992"",""key"": ""/books/OL10000135M"",""title"": ""Parliamentary Debates,1992-93"",""identifiers"": {""goodreads"": [""6850240""]},""isbn_13"": [""9780107805401""],""languages"": [{""key"": ""/languages/eng""}],""number_of_pages"": 64,""isbn_10"": [""0107805405""],""publish_date"": ""December 1993"",""last_modified"": {""type"": ""/type/datetime"",""value"": ""2010-04-24T17:54:01.503315""},""authors"": [{""key"": ""/authors/OL2645777A""}],""latest_revision"": 4,""works"": [{""key"": ""/works/OL7925046W""}],""type"": {""key"": ""/type/edition""},""subjects"": [""Government - Comparative"",""Politics / Current Events""],""revision"": 4}"

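Edit 2:

So I am trying to read this file, which is a tab-separated file with the following columns:

type - type of record (/type/edition, /type/work, etc.)
key - unique key of the record (/books/OL1M, etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format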

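I am trying to read the JSON column and, from that JSON, get only the "title", "isbn_13", and "isbn_10" values and save them to a file as one row.

So each row should look like the original row, but contain only those keys and values.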

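cx3232 answered: How to parse a tab-delimited text file with the 4th column as JSON and remove certain keys?

So, given that your current code returns the following: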

result = '{""publishers"": [""Bernan Press""],""physical_format"": ""Hardcover"",""subtitle"": ""9th November - 3rd December,1992"",""key"": ""/books/OL10000135M"",""title"": ""Parliamentary Debates,House of Lords,Bound Volumes,1992-93"",""identifiers"": {""goodreads"": [""6850240""]},""isbn_13"": [""9780107805401""],""languages"": [{""key"": ""/languages/eng""}],""number_of_pages"": 64,""isbn_10"": [""0107805405""],""publish_date"": ""December 1993"",""last_modified"": {""type"": ""/type/datetime"",""value"": ""2010-04-24T17:54:01.503315""},""authors"": [{""key"": ""/authors/OL2645777A""}],""latest_revision"": 4,""works"": [{""key"": ""/works/OL7925046W""}],""type"": {""key"": ""/type/edition""},""subjects"": [""Government - Comparative"",""Politics / Current Events""],""revision"": 4}'

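it looks like the first thing you need to do is replace those doubled double quotes with regular double quotes, otherwise the string cannot be parsed: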

res = result.replace('""','"')

Now res can be converted into a JSON object:

import json
my_json = json.loads(res)

my_json now looks like this:

{'authors': [{'key': '/authors/OL2645777A'}],'identifiers': {'goodreads': ['6850240']},'isbn_10': ['0107805405'],'isbn_13': ['9780107805401'],'key': '/books/OL10000135M','languages': [{'key': '/languages/eng'}],'last_modified': {'type': '/type/datetime','value': '2010-04-24T17:54:01.503315'},'latest_revision': 4,'number_of_pages': 64,'physical_format': 'Hardcover','publish_date': 'December 1993','publishers': ['Bernan Press'],'revision': 4,'subjects': ['Government - Comparative','Politics / Current Events'],'subtitle': '9th November - 3rd December,1992','title': 'Parliamentary Debates,1992-93','type': {'key': '/type/edition'},'works': [{'key': '/works/OL7925046W'}]}

You can conveniently get any field from this object:

my_json['title']
# 'Parliamentary Debates,1992-93'
my_json['isbn_10'][0]
# '0107805405'

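That is the simple way. You need to repeat this for every line, extracting the data you need as you read the file line by line (Python's default way of handling text files).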
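The per-line loop described above might look like the following. This is a minimal sketch rather than a tested solution for the full dump: the function name `filter_dump`, the `KEYS` tuple, and the sample record are made up for illustration, and it assumes every line has five tab-separated columns with valid JSON in the last one. It writes with a plain `write()` call instead of `csv.writer`, because `csv.writer` quotes any field that contains double quotes and doubles them, which is where the `""` in the output above comes from.

```python
import io
import json

KEYS = ("title", "isbn_13", "isbn_10")  # keys to keep (from the question)

def filter_dump(src, dst, keys=KEYS):
    """Read tab-separated dump lines from src; write one trimmed JSON object per line to dst."""
    written = 0
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split at most 4 times so the JSON column is never cut apart
        columns = line.split("\t", 4)
        if len(columns) < 5:
            continue  # not a complete record
        record = json.loads(columns[4])
        # Keep only the requested keys; keys missing from a record are skipped
        trimmed = {key: record[key] for key in keys if key in record}
        dst.write(json.dumps(trimmed) + "\n")
        written += 1
    return written

# Small in-memory demonstration with a made-up record
sample = io.StringIO(
    '/type/edition\t/books/OL1M\t4\t2010-04-24T17:54:01.503315\t'
    '{"title": "Example", "isbn_13": ["9780107805401"], "revision": 4}\n'
)
out = io.StringIO()
count = filter_dump(sample, out)
print(out.getvalue(), end="")
```

With real files, the two `StringIO` objects would be replaced by file objects from `open(input_file)` and `open(output_file, "w")`. Because the input is iterated line by line, only one record is held in memory at a time.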

import json

# The columns are separated by tabs, written here as \t
line = '/type/edition\t/books/OL10000135M\t4\t2010-04-24T17:54:01.503315\t{"publishers": ["Bernan Press"],"physical_format": "Hardcover","subtitle": "9th November - 3rd December,1992","key": "/books/OL10000135M","title": "Parliamentary Debates,1992-93","identifiers": {"goodreads": ["6850240"]},"isbn_13": ["9780107805401"],"languages": [{"key": "/languages/eng"}],"number_of_pages": 64,"isbn_10": ["0107805405"],"publish_date": "December 1993","last_modified": {"type": "/type/datetime","value": "2010-04-24T17:54:01.503315"},"authors": [{"key": "/authors/OL2645777A"}],"latest_revision": 4,"works": [{"key": "/works/OL7925046W"}],"type": {"key": "/type/edition"},"subjects": ["Government - Comparative","Politics / Current Events"],"revision": 4}'

csv_cols = line.split('\t')
json_data = json.loads(csv_cols[4])
#print(json.dumps(json_data,indent=4))

desired = {key: json_data[key] for key in ("title","isbn_13","isbn_10")}
result = json.dumps(desired,indent=4)
print(result)

Output for the sample line:

{
    "title": "Parliamentary Debates,1992-93",
    "isbn_13": [
        "9780107805401"
    ],
    "isbn_10": [
        "0107805405"
    ]
}
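The question also asks that each output row look like the original row, only with the JSON reduced to the chosen keys. One way to sketch that, assuming five tab-separated columns per line, is shown here. The helper name `trim_row` and the sample row are hypothetical. Unlike the dictionary comprehension above, this version skips keys that a record does not have instead of raising `KeyError`.

```python
import json

def trim_row(line, keys=("title", "isbn_13", "isbn_10")):
    """Return the row with its JSON column reduced to the given keys."""
    # The first four columns are kept as they are; the fifth is the JSON text
    *meta, raw_json = line.rstrip("\n").split("\t", 4)
    record = json.loads(raw_json)
    trimmed = {key: record[key] for key in keys if key in record}
    # separators=(",", ":") produces compact JSON without extra spaces
    return "\t".join(meta + [json.dumps(trimmed, separators=(",", ":"))])

# Made-up sample row for illustration
row = ('/type/edition\t/books/OL1M\t4\t2010-04-24T17:54:01.503315\t'
       '{"title": "Example", "isbn_10": ["0107805405"], "revision": 4}')
print(trim_row(row))
```

Whether missing keys should be dropped, filled with a default, or treated as an error depends on what the output file is used for; dropping them is only one possible choice.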

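Especially because your file is so large, I would recommend using a specialized library such as pandas, which has a read_csv method, or even dask, which supports out-of-memory operations.

Both of these will parse the quotes for you automatically, and dask will do it "in pieces" straight from disk, so you do not have to try to load 26 GB into RAM.

In both libraries you can then access the column you want like this: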

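import pandas as pd  # dask.dataframe offers a read_csv with the same call shape
# The dump is tab-separated with no header row; quoting=3 (csv.QUOTE_NONE) leaves the JSON quotes untouched
data = pd.read_csv(PATH, sep="\t", header=None, quoting=3)
data[4]  # with header=None the columns are numbered 0-4, so 4 is the JSON column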

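You can then parse those rows with json.loads() (import json), or use the pandas / dask JSON implementations. If you can provide more details about what you expect, I can help draft a more specific code example.

Good luck!

I saved your data to a file to see whether I could just read the rows. Let me know if this works: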

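import json

# zzread is assumed to hold the sample file's contents as a single string
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:
        continue  # skip blank lines, such as the one after the trailing newline
    # Everything from the first '{' onward is the JSON column
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))
for row in temp:
    print(row['isbn_13'])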

If that works, this should create a JSON for you:

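import json

# zzread is assumed to hold the sample file's contents as a single string
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:
        continue  # skip blank lines, such as the one after the trailing newline
    # Everything from the first '{' onward is the JSON column
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))
# Build a list of dicts holding only the wanted keys
new_json = []
for row in temp:
    new_json.append({'title': row['title'], 'isbn_13': row['isbn_13'], 'isbn_10': row['isbn_10']})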

csv json parsing text

Article link: https://www.f2er.com/3126933.html
