Converting hundreds of XML files into a single CSV file

I have 423 XML files for training my deep learning model. I have searched for Python code, XSLT, and so on, but I don't know how to do this. Here is an example of one file:

<?xml version="1.0"?>
<case>
<number>2</number>
<age>49</age>
<sex>F</sex>
<composition>solid</composition>
<echogenicity>hyperechogenicity</echogenicity>
<margins>well defined</margins>
<calcifications>non</calcifications>
<tirads>2</tirads>
<reportbacaf/>
<reporteco/>
<mark>
<image>1</image>
<svg>[{"points": [{"x": 250,"y": 72},{"x": 226,"y": 82},{"x": 216,"y": 90},{"x": 204,"y": 94},{"x": 190,"y": 98},{"x": 181,"y": 103},{"x": 172,"y": 109},{"x": 165,"y": 121},{"x": 161,"y": 131},{"x": 159,"y": 142},{"x": 162,"y": 170},{"x": 164,"y": 185},{"x": 171,"y": 203},{"x": 176,"y": 210},{"x": 185,"y": 214},{"x": 191,"y": 218},{"x": 211,"y": 228},{"x": 212,"y": 230},{"x": 235,"y": 239},{"x": 243,"y": 242},{"x": 255,"y": 244},{"x": 263,"y": 245},{"x": 285,{"x": 298,{"x": 330,"y": 233},{"x": 352,"y": 217},{"x": 367,"y": 201},{"x": 373,"y": 194},{"x": 379,"y": 173},{"x": 382,"y": 163},{"x": 383,"y": 143},"y": 136},"y": 127},"y": 122},{"x": 374,"y": 117},{"x": 365,{"x": 360,"y": 101},{"x": 358,"y": 95},"y": 88},{"x": 346,"y": 85},{"x": 333,"y": 81},{"x": 327,"y": 78},{"x": 319,"y": 73},{"x": 314,{"x": 304,"y": 70},{"x": 281,"y": 69},{"x": 258,"y": 71},{"x": 254,{"x": 248,"y": 72}],"annotation": {},"regionType": "freehand"}]</svg>
</mark>
</case>

I need to extract the information between number and tirads (inclusive). How can I convert all of these into a single file using Python?

hzthzjln answered: Converting hundreds of XML files into a single CSV file

Loop over each XML file. For each one, use an XML parser such as ElementTree or lxml to parse the contents and store them in a dict. Then save the dicts to a JSON file or a CSV.
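For instance, a minimal sketch using the standard-library xml.etree.ElementTree (the helper name case_to_dict is just an example, assuming each file looks like the sample above):

import xml.etree.ElementTree as ET

def case_to_dict(xmlfile, tags):
    root = ET.parse(xmlfile).getroot()      # the <case> element
    record = {}
    for tag in tags:
        node = root.find(tag)               # first matching child, or None if missing
        record[tag] = node.text if node is not None else None
    return record

The rest of this answer does the same thing with BeautifulSoup instead.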

The list of tags to parse:

tags = ['number','age','sex','composition','echogenicity','margins','calcifications','tirads']

Read the XML saved as "abc.xml":

with open('abc.xml','r') as fd:
    doc = fd.read()

Parse it with the lxml and BeautifulSoup modules:

from bs4 import BeautifulSoup as BX
soup = BX(doc,'lxml')
mydata = {}
for tag in tags:
    value = soup.find(tag)          # first element with this tag name, or None
    if value:
        mydata[tag] = value.text
    else:
        mydata[tag] = None

Inspect the data:

print(mydata)
#{'number': '2','age': '49','sex': 'F','composition': 'solid','echogenicity': 'hyperechogenicity','margins': 'well defined','calcifications': 'non','tirads': '2'}

The complete code, written as a function:

from bs4 import BeautifulSoup as BX
tags = ['number','age','sex','composition','echogenicity','margins','calcifications','tirads']

def parse_xml(xmlfile):
    with open(xmlfile,'r') as fd:
        doc = fd.read()
    soup = BX(doc,'lxml')
    mydata = {}
    for tag in tags:
        value = soup.find(tag)
        if value:
            mydata[tag] = value.text
        else:
            mydata[tag] = None
    return mydata
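A quick sanity check on the single file from above (the expected output is the dict already shown earlier):

print(parse_xml('abc.xml'))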

You can loop over all of the XML files and parse each one with this function.

# let's say all your XML files are in this folder
myfiles_path = r"C:/Users/RG/Desktop/test/"

import os,pandas
all_data = {}

xmlfiles = os.listdir(myfiles_path)

for file in xmlfiles:
    if not file.endswith('.xml'):    # skip anything that is not an XML file
        continue
    file_path = os.path.join(myfiles_path,file)
    all_data[file] = parse_xml(file_path)

# one row per XML file, one column per tag
df = pandas.DataFrame.from_dict(all_data, orient='index')
df.to_csv('output.csv')
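If you would rather not depend on pandas, here is a sketch using the standard-library csv module instead; it reuses the parse_xml function, the tags list, and the folder path from above and writes one row per file (the extra 'file' column is just an example, to keep track of which XML each row came from):

import csv

with open('output.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=['file'] + tags)
    writer.writeheader()
    for file in os.listdir(myfiles_path):
        if not file.endswith('.xml'):
            continue
        row = parse_xml(os.path.join(myfiles_path, file))
        row['file'] = file              # remember the source file name
        writer.writerow(row)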