How do I parse a tab-delimited text file whose last column is JSON, and remove certain keys?

I have a 26 GB text file whose lines have the following format:

/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["bernan Press"],"physical_format": "Hardcover","subtitle": "9th November - 3rd December,1992","key": "/books/OL10000135M","title": "Parliamentary Debates,House of Lords,Bound Volumes,1992-93","identifiers": {"goodreads": ["6850240"]},"isbn_13": ["9780107805401"],"languages": [{"key": "/languages/eng"}],"number_of_pages": 64,"isbn_10": ["0107805405"],"publish_date": "December 1993","last_modified": {"type": "/type/datetime","value": "2010-04-24T17:54:01.503315"},"authors": [{"key": "/authors/OL2645777A"}],"latest_revision": 4,"works": [{"key": "/works/OL7925046W"}],"type": {"key": "/type/edition"},"subjects": ["Government - Comparative","Politics / Current Events"],"revision": 4}

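I am trying to take only the last column, which is JSON, and then keep just "title", "isbn_13" and "isbn_10" from that JSON.

With this code I was only able to save the last column: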

import csv
import sys

csv.field_size_limit(sys.maxsize)
# File names: to read in from and read out to
input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file,"w") as tmp_file:
        reader = csv.reader(to_read,delimiter = "\t")
        writer = csv.writer(tmp_file)

        desired_column = [4]        # text column

        for row in reader:     # read one row at a time
            myColumn = list(row[i] for i in desired_column)   # build the output row (process)
            writer.writerow(myColumn) # write it

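But this does not return a proper JSON object; instead, every double quote in the output is doubled. Also, how would I extract certain values from the JSON as a new JSON object?

Edit: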

"{""publishers"": [""bernan Press""],""physical_format"": ""Hardcover"",""subtitle"": ""9th November - 3rd December,1992"",""key"": ""/books/OL10000135M"",""title"": ""Parliamentary Debates,1992-93"",""identifiers"": {""goodreads"": [""6850240""]},""isbn_13"": [""9780107805401""],""languages"": [{""key"": ""/languages/eng""}],""number_of_pages"": 64,""isbn_10"": [""0107805405""],""publish_date"": ""December 1993"",""last_modified"": {""type"": ""/type/datetime"",""value"": ""2010-04-24T17:54:01.503315""},""authors"": [{""key"": ""/authors/OL2645777A""}],""latest_revision"": 4,""works"": [{""key"": ""/works/OL7925046W""}],""type"": {""key"": ""/type/edition""},""subjects"": [""Government - Comparative"",""Politics / Current Events""],""revision"": 4}"

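Edit 2:

So I am trying to read this file, which is tab-separated and has the following columns:

type - type of record (/type/edition, /type/work etc.)
key - unique key of the record (/books/OL1M etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format

I am trying to read the JSON column, and from that JSON I only want "title", "isbn_13" and "isbn_10", saved to a file as one row.

So each row should look like the original row, but contain only those keys and values.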

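cx3232 answered: How do I parse a tab-delimited text file whose last column is JSON, and remove certain keys?

So, given that your current code returns the following: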

result = '{""publishers"": [""Bernan Press""],""physical_format"": ""Hardcover"",""subtitle"": ""9th November - 3rd December,1992"",""key"": ""/books/OL10000135M"",""title"": ""Parliamentary Debates,House of Lords,Bound Volumes,1992-93"",""identifiers"": {""goodreads"": [""6850240""]},""isbn_13"": [""9780107805401""],""languages"": [{""key"": ""/languages/eng""}],""number_of_pages"": 64,""isbn_10"": [""0107805405""],""publish_date"": ""December 1993"",""last_modified"": {""type"": ""/type/datetime"",""value"": ""2010-04-24T17:54:01.503315""},""authors"": [{""key"": ""/authors/OL2645777A""}],""latest_revision"": 4,""works"": [{""key"": ""/works/OL7925046W""}],""type"": {""key"": ""/type/edition""},""subjects"": [""Government - Comparative"",""Politics / Current Events""],""revision"": 4}'

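It looks like the first thing you need to do is replace those doubled double quotes with regular double quotes; otherwise json.loads cannot parse the string: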

res = result.replace('""','"')

Now res can be converted to a JSON object:

import json
my_json = json.loads(res)

my_json now looks like this:

{'authors': [{'key': '/authors/OL2645777A'}],'identifiers': {'goodreads': ['6850240']},'isbn_10': ['0107805405'],'isbn_13': ['9780107805401'],'key': '/books/OL10000135M','languages': [{'key': '/languages/eng'}],'last_modified': {'type': '/type/datetime','value': '2010-04-24T17:54:01.503315'},'latest_revision': 4,'number_of_pages': 64,'physical_format': 'Hardcover','publish_date': 'December 1993','publishers': ['Bernan Press'],'revision': 4,'subjects': ['Government - Comparative','Politics / Current Events'],'subtitle': '9th November - 3rd December,1992','title': 'Parliamentary Debates,House of Lords,Bound Volumes,1992-93','type': {'key': '/type/edition'},'works': [{'key': '/works/OL7925046W'}]}

You can easily get any field from this object:

my_json['title']
# 'Parliamentary Debates,House of Lords,Bound Volumes,1992-93'
my_json['isbn_10'][0]
# '0107805405'
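The doubled quotes are not corruption: by default csv.writer wraps any field that contains a quote character in quotes and escapes each inner quote by doubling it. The following is a minimal sketch of that behaviour, using a shortened, made-up stand-in for the JSON column. It also shows that reading the field back with csv.reader reverses the escaping, which may be an alternative to the string replace:

```python
import csv
import io
import json

# Shortened stand-in for the JSON column of one row (illustrative data only)
json_text = '{"title": "Parliamentary Debates", "isbn_10": ["0107805405"]}'

# csv.writer quotes the field and doubles every inner quote character
buffer = io.StringIO()
csv.writer(buffer).writerow([json_text])
escaped_line = buffer.getvalue()
print(escaped_line)

# Reading the line back with csv.reader reverses the escaping
restored = next(csv.reader(io.StringIO(escaped_line)))[0]
record = json.loads(restored)
print(record["title"])
```

If the output file is only meant to hold JSON, writing the column with a plain file write instead of csv.writer would avoid the escaping in the first place.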

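Here is a simple way to do it. You need to repeat this for every line, extracting the data you want as you read the file line by line (the default way Python handles text files).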

import json

line = '/type/edition\t/books/OL10000135M\t4\t2010-04-24T17:54:01.503315\t{"publishers": ["Bernan Press"],"physical_format": "Hardcover","subtitle": "9th November - 3rd December,1992","key": "/books/OL10000135M","title": "Parliamentary Debates,1992-93","identifiers": {"goodreads": ["6850240"]},"isbn_13": ["9780107805401"],"languages": [{"key": "/languages/eng"}],"number_of_pages": 64,"isbn_10": ["0107805405"],"publish_date": "December 1993","last_modified": {"type": "/type/datetime","value": "2010-04-24T17:54:01.503315"},"authors": [{"key": "/authors/OL2645777A"}],"latest_revision": 4,"works": [{"key": "/works/OL7925046W"}],"type": {"key": "/type/edition"},"subjects": ["Government - Comparative","Politics / Current Events"],"revision": 4}'

csv_cols = line.split('\t')
json_data = json.loads(csv_cols[4])
#print(json.dumps(json_data,indent=4))

desired = {key: json_data[key] for key in ("title","isbn_13","isbn_10")}
result = json.dumps(desired,indent=4)
print(result)

Output for the sample line:

{
    "title": "Parliamentary Debates,1992-93",
    "isbn_13": [
        "9780107805401"
    ],
    "isbn_10": [
        "0107805405"
    ]
}
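To apply the same steps to the whole 26 GB file without loading it into memory, one option is to stream it line by line and write each reduced row as you go. The sketch below is an illustration under stated assumptions and has not been run against the real dump: the function name reduce_dump and its parameters are hypothetical, the file is assumed to be UTF-8, the first four columns are kept unchanged (as the question's second edit asks), and any wanted key that a record lacks is simply left out.

```python
import json

WANTED_KEYS = ("title", "isbn_13", "isbn_10")

def reduce_dump(input_path, output_path, wanted=WANTED_KEYS):
    """Stream a tab-separated dump and keep only the wanted JSON keys.

    Hypothetical helper: the name and parameters are illustrative.
    Returns the number of rows written.
    """
    written = 0
    # UTF-8 is an assumption about the dump's encoding
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:  # one line at a time, so memory use stays small
            line = line.rstrip("\n")
            if not line:
                continue
            # Split into at most 5 columns so any tab inside the JSON column is left alone
            cols = line.split("\t", 4)
            if len(cols) < 5:
                continue  # skip rows that do not have a JSON column
            record = json.loads(cols[4])
            # Keep only the wanted keys that are present in this record
            reduced = {key: record[key] for key in wanted if key in record}
            cols[4] = json.dumps(reduced)
            dst.write("\t".join(cols) + "\n")
            written += 1
    return written
```

With the paths from the question, a call might look like reduce_dump('../inputFile/ol_dump_editions_2019-10-31.txt', '../outputFile/output.txt'). Writing with a plain file write rather than csv.writer keeps the JSON quotes undoubled.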

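Especially since your sample is so large, I would suggest using a specialised library such as pandas, which has a read_csv method, or even dask, which supports out-of-core (larger-than-memory) operations.

Both of these will handle the quoting for you automatically, and dask will do so in "pieces" directly from disk, so you do not have to try to load 26 GB into RAM.

Then, in both libraries, you can access the column you want like this: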

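import pandas as pd
# the file is tab-separated with no header row, so the column names are supplied here
data = pd.read_csv(PATH, sep="\t", header=None, names=["type", "key", "revision", "last_modified", "json"])
data["json"]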

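You can then parse these rows with json.loads() (after import json), or use the pandas / dask JSON implementations. If you can provide more details about what you expect, I can help you draft a more specific code example.

Good luck!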


I saved your data to a file to see whether I could read just the rows. Let me know if this works:

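import json

# zzread is assumed to hold the file contents as a single string
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:  # stop at the first empty line
        break
    # keep everything from the first '{' onward, i.e. the JSON column
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))
for row in temp:
    print(row['isbn_13'])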

If that works, this should build the new JSON for you:

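import json

# zzread is assumed to hold the file contents as a single string
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:  # stop at the first empty line
        break
    # keep everything from the first '{' onward, i.e. the JSON column
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))
new_json = []
for row in temp:
    # build a smaller dict holding only the three wanted keys
    new_json.append({'title': row['title'], 'isbn_13': row['isbn_13'], 'isbn_10': row['isbn_10']})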
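In a real dump some records may lack one of these keys, or a line may not contain valid JSON, in which case the indexing or parsing would raise an exception. A possible variant of the same "split at the first brace" idea is sketched below. The helper name iter_reduced is hypothetical, the sample rows are made up for illustration, and filling missing keys with None is just one choice among several:

```python
import json

def iter_reduced(lines, wanted=("title", "isbn_13", "isbn_10")):
    """Yield one reduced dict per input line (hypothetical helper)."""
    for line in lines:
        # Everything from the first '{' onward is taken to be the JSON column
        _, brace, rest = line.partition("{")
        if not brace:
            continue  # no JSON on this line (e.g. a blank line)
        try:
            record = json.loads(brace + rest)
        except json.JSONDecodeError:
            continue  # skip lines whose JSON cannot be parsed
        # dict.get returns None for a missing key instead of raising KeyError
        yield {key: record.get(key) for key in wanted}

# Made-up sample rows: one valid record, one blank line, one broken line
sample = [
    '/type/edition\t/books/OL1M\t1\t2010-04-24T17:54:01.503315\t{"title": "A", "isbn_10": ["0107805405"]}',
    '',
    'not a record {broken',
]
reduced = list(iter_reduced(sample))
print(reduced)
```

Because this is a generator, passing an open file object as lines produces records one at a time rather than collecting everything in a list first.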

