I have a 26 GB text file whose lines have the following format:
/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["bernan Press"],"physical_format": "Hardcover","subtitle": "9th November - 3rd December,1992","key": "/books/OL10000135M","title": "Parliamentary Debates,House of Lords,Bound Volumes,1992-93","identifiers": {"goodreads": ["6850240"]},"isbn_13": ["9780107805401"],"languages": [{"key": "/languages/eng"}],"number_of_pages": 64,"isbn_10": ["0107805405"],"publish_date": "December 1993","last_modified": {"type": "/type/datetime","value": "2010-04-24T17:54:01.503315"},"authors": [{"key": "/authors/OL2645777A"}],"latest_revision": 4,"works": [{"key": "/works/OL7925046W"}],"type": {"key": "/type/edition"},"subjects": ["Government - Comparative","Politics / Current Events"],"revision": 4}
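I am trying to get only the last column, i.e. the JSON, and then save the "title", "isbn_13", and "isbn_10" fields from that JSON.
So far I have only managed to save the last column, using this code: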
import csv
import sys

csv.field_size_limit(sys.maxsize)

# File names: to read in from and read out to
input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'

## ==================== ##
## Using module 'csv' ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file, "w") as tmp_file:
        reader = csv.reader(to_read, delimiter="\t")
        writer = csv.writer(tmp_file)
        desired_column = [4]  # text column
        for row in reader:  # read one row at a time
            myColumn = list(row[i] for i in desired_column)  # build the output row (process)
            writer.writerow(myColumn)  # write it
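However, this does not return proper JSON objects; instead, every object comes back with extra double quotes. Also, how would I extract certain values from the JSON into a new JSON object?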
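One likely explanation for the extra quotes: `csv.writer` wraps any field that contains a quote character or a comma in double quotes and doubles the quotes inside it, which is standard CSV escaping rather than corruption. The sketch below uses a made-up, shortened record (not a real dump line) to show the effect and that `csv.reader` reverses it.

```python
import csv
import io
import json

# Shortened, made-up record for illustration only.
record = '{"title": "Parliamentary Debates", "isbn_13": ["9780107805401"]}'

# csv.writer quotes the field and doubles every embedded double quote.
buf = io.StringIO()
csv.writer(buf).writerow([record])
csv_line = buf.getvalue()

# Reading the line back with csv.reader undoes that escaping,
# so the text is valid JSON again.
restored = next(csv.reader(io.StringIO(csv_line)))[0]
parsed = json.loads(restored)
```

So the output file is valid CSV, but each line is no longer valid JSON on its own. Writing the JSON text with a plain `file.write` instead of `csv.writer` avoids the escaping.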
Edit: this is what a line of the output looks like:
"{""publishers"": [""bernan Press""],""physical_format"": ""Hardcover"",""subtitle"": ""9th November - 3rd December,1992"",""key"": ""/books/OL10000135M"",""title"": ""Parliamentary Debates,1992-93"",""identifiers"": {""goodreads"": [""6850240""]},""isbn_13"": [""9780107805401""],""languages"": [{""key"": ""/languages/eng""}],""number_of_pages"": 64,""isbn_10"": [""0107805405""],""publish_date"": ""December 1993"",""last_modified"": {""type"": ""/type/datetime"",""value"": ""2010-04-24T17:54:01.503315""},""authors"": [{""key"": ""/authors/OL2645777A""}],""latest_revision"": 4,""works"": [{""key"": ""/works/OL7925046W""}],""type"": {""key"": ""/type/edition""},""subjects"": [""Government - Comparative"",""Politics / Current Events""],""revision"": 4}"
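Edit 2:
The file I am trying to read is a tab-separated file with the following columns:
type - type of record (/type/edition, /type/work, etc.)
key - unique key of the record (/books/OL1M, etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format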
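Given that layout, one line can be taken apart with a plain `str.split` on tabs, without the `csv` module. This is only a sketch: `split_dump_line` is a hypothetical helper name, and the sample line is shortened for illustration.

```python
import json

def split_dump_line(line):
    """Split one dump line into its five tab-separated columns."""
    # maxsplit=4 keeps any tab inside the JSON text in the last column.
    record_type, key, revision, last_modified, raw_json = (
        line.rstrip("\n").split("\t", 4)
    )
    return record_type, key, int(revision), last_modified, json.loads(raw_json)

# Shortened sample line, built with explicit tabs.
sample = "\t".join([
    "/type/edition",
    "/books/OL10000135M",
    "4",
    "2010-04-24T17:54:01.503315",
    '{"title": "Parliamentary Debates", "revision": 4}',
])
record_type, key, revision, last_modified, record = split_dump_line(sample)
```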
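I am trying to read the JSON and get only its "title", "isbn_13", and "isbn_10" values, then save them as one line in a file.
So each output line should look like the JSON in the original line, but contain only those keys and values.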
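A minimal sketch of that goal, under a few assumptions: the dump is UTF-8 encoded, every data line has the five tab-separated columns described above, and records that lack one of the keys should simply omit it. The names `WANTED_KEYS`, `extract_fields`, and `convert` are hypothetical.

```python
import json

WANTED_KEYS = ("title", "isbn_13", "isbn_10")  # keys named in the question

def extract_fields(raw_json, wanted=WANTED_KEYS):
    """Parse one JSON record and keep only the wanted keys that are present."""
    record = json.loads(raw_json)
    return {key: record[key] for key in wanted if key in record}

def convert(input_path, output_path):
    """Stream the dump line by line and write one small JSON object per line."""
    # Assumption: the dump is UTF-8 encoded.
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            columns = line.rstrip("\n").split("\t", 4)
            if len(columns) < 5:
                continue  # skip blank or malformed lines
            subset = extract_fields(columns[4])
            # Plain write instead of csv.writer, so no CSV quoting is added.
            dst.write(json.dumps(subset, ensure_ascii=False) + "\n")
```

Calling `convert(input_file, output_file)` with the paths from the code above would process the file one line at a time, so memory use should not grow with the size of the 26 GB input.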