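Python XML parser problem

I'm new to Python, so sorry for the silly question. I'm trying to read an XML file into a Python object (preferably pandas). For now I'm just printing the variables to check whether they are read correctly in tabular form.

I'm using xml.etree.ElementTree for this, but I'm probably not using it as intended.

Code: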

import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
ODM = tree.getroot()

ns = {'xmlns': 'http://www.cdisc.org/ns/odm/v1.3','mdsol': 'http://www.mdsol.com/ns/odm/metadata'}

for ClinicalData in ODM:
    LocationOID=None
    #print(ClinicalData.tag,ClinicalData.attrib)
    for SubjectData in ClinicalData:
        for SiteRef in SubjectData:
            LocationOID=SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(ClinicalData.attrib.get('MetaDataVersionOID'),ClinicalData.attrib.get('AuditSubCategoryName'),# null output due to namespace issue
                     SubjectData.attrib.get('SubjectKey'),SubjectData.attrib.get('SubjectName'),# null output due to namespace issue
                     LocationOID,#not sure what is the issue
                     StudyEventData.attrib.get('StudyEventRepeatKey'),AuditRecord.find('DateTimeStamp')                      #not sure what is the issue
                    )

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0acc SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="acc-SUBJ-3">
            <SiteRef LocationOID="0accSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0accSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

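I expect every printed variable to carry the corresponding value from the XML file. Please also let me know if there is a more appropriate way to do this than nesting several inner loops.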

zhng41 answered: Python XML parser problem

Namespaces are the awkward part of using ElementTree. See this discussion.
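To see why the unqualified lookups in the question return None, it helps to print what ElementTree actually stores. The sketch below uses a trimmed copy of the question's XML (the `sample` string is illustrative, not the full file): element tags and prefixed attributes are kept in `{namespace-uri}name` form, while unprefixed attributes keep their plain names.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample based on the question's file (illustrative only).
sample = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="1772" mdsol:AuditSubCategoryName="activated"/>
</ODM>"""

root = ET.fromstring(sample)
clinical = root[0]

# The default namespace is folded into every element tag.
print(clinical.tag)  # {http://www.cdisc.org/ns/odm/v1.3}ClinicalData

# Unprefixed attributes keep their plain name ...
print(clinical.attrib.get("MetaDataVersionOID"))  # 1772

# ... but prefixed attributes are stored under the expanded name,
# so a lookup by the plain name finds nothing.
print(clinical.attrib.get("AuditSubCategoryName"))  # None
print(clinical.attrib.get(
    "{http://www.mdsol.com/ns/odm/metadata}AuditSubCategoryName"))  # activated

# find() likewise needs the expanded tag name.
print(root.find("ClinicalData"))  # None
```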

Short answer:

for ClinicalData in ODM:
    for SubjectData in ClinicalData:
        # Element names must be qualified with the default namespace URI.
        SiteRef = SubjectData.find('{http://www.cdisc.org/ns/odm/v1.3}SiteRef')
        LocationOID = SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(
                    ClinicalData.attrib.get('MetaDataVersionOID'),
                    # Prefixed attributes need the mdsol namespace URI.
                    ClinicalData.attrib.get('{http://www.mdsol.com/ns/odm/metadata}AuditSubCategoryName'),
                    SubjectData.attrib.get('SubjectKey'),
                    SubjectData.attrib.get('{http://www.mdsol.com/ns/odm/metadata}SubjectName'),
                    LocationOID,
                    StudyEventData.attrib.get('StudyEventRepeatKey'),
                    # find() returns an element; .text gives its content.
                    AuditRecord.find('{http://www.cdisc.org/ns/odm/v1.3}DateTimeStamp').text,
                )
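If the goal is to avoid the nested loops, one option is to iterate over the AuditRecord elements directly and look up each record's ancestors. ElementTree elements do not keep a reference to their parent, so this sketch builds a parent map first. The helper name `ancestor`, the prefix `odm`, and the trimmed `sample` string are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Prefix map for path expressions; "odm" is an arbitrary local prefix.
ns = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"
# Attribute lookups with get() take no prefix map, so the URI is spelled out.
MDSOL_NS = "{http://www.mdsol.com/ns/odm/metadata}"

# Trimmed-down sample based on the question's file (illustrative only).
sample = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="1772" mdsol:AuditSubCategoryName="activated">
    <SubjectData SubjectKey="7735fd9c" mdsol:SubjectName="acc-SUBJ-3">
      <SiteRef LocationOID="0accSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]">
        <AuditRecord>
          <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
        </AuditRecord>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""

root = ET.fromstring(sample)

# Map every child element to its parent once, up front.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestor(node, local_name):
    """Walk upwards until an element with the given ODM tag is found."""
    wanted = ODM_NS + local_name
    while node is not None and node.tag != wanted:
        node = parent_of.get(node)
    return node


rows = []
for audit in root.iter(ODM_NS + "AuditRecord"):
    clinical = ancestor(audit, "ClinicalData")
    subject = ancestor(audit, "SubjectData")
    event = ancestor(audit, "StudyEventData")
    rows.append({
        "MetaDataVersionOID": clinical.get("MetaDataVersionOID"),
        "AuditSubCategoryName": clinical.get(MDSOL_NS + "AuditSubCategoryName"),
        "SubjectKey": subject.get("SubjectKey"),
        "SubjectName": subject.get(MDSOL_NS + "SubjectName"),
        # The prefix map lets find() use 'odm:SiteRef' instead of the full URI.
        "LocationOID": subject.find("odm:SiteRef", ns).get("LocationOID"),
        "StudyEventRepeatKey": event.get("StudyEventRepeatKey"),
        "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=ns),
    })

print(rows)
```

A list of dicts like `rows` can be passed to `pandas.DataFrame(rows)` to get the tabular form the question asks for.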

I think you can parse the XML with BeautifulSoup:

from bs4 import BeautifulSoup

temp = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>"""
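
# The "lxml" parser treats the document as HTML and lowercases tag and
# attribute names, hence the .lower() calls and the 'locationoid' key.
temp = BeautifulSoup(temp, "lxml")
ClinicalData = temp.find('ClinicalData'.lower())
SubjectData = ClinicalData.find_all('SubjectData'.lower())
LocationOID = None
for i in SubjectData:
    SiteRef = i.find('SiteRef'.lower())
    LocationOID = SiteRef.attrs['locationoid']

print('LocationOID', LocationOID)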

Output:

LocationOID 0ACCSP3MAPPING1SITE1
[Finished in 1.2s]

@Justin I applied your suggestion, and it worked until I broke it.

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928809">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="46">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

Code:

import xml.etree.ElementTree as ET
import pandas as pd

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

tree = ET.parse("data.xml")
ODM = tree.getroot()

xmlns = "{http://www.cdisc.org/ns/odm/v1.3}"
mdsol = "{http://www.mdsol.com/ns/odm/metadata}"

def data_reader():
    dfcols = ['CreationDateTime','StudyOID','MetaDataVersionOID','SubjectName','SUBJECTUUID','LocationOID','StudyEventOID','StudyEventRepeatKey','FormOID','FormRepeatKey','DataPageId','ItemgroupOID','RecordId','var_name','Value','DateTimeStamp','ASC_Name','Measurement_Unit','SourceID','UserOID','InstanceId']
    df_xml = pd.DataFrame(columns=dfcols)

    CreationDateTime = ODM.attrib.get('CreationDateTime')

    for ClinicalData in ODM:
        StudyOID = ClinicalData.attrib.get('StudyOID')
        MetaDataVersionOID = ClinicalData.attrib.get('MetaDataVersionOID')
        ASC_Name = ClinicalData.attrib.get('{0}AuditSubCategoryName'.format(mdsol))
        for SubjectData in ClinicalData:
            SubjectName = SubjectData.attrib.get('{0}SubjectName'.format(mdsol))
            SUBJECTUUID = SubjectData.attrib.get('SubjectKey')
            LocationOID = SubjectData.find('{0}SiteRef'.format(xmlns)).attrib.get('LocationOID')
            for StudyEventData in SubjectData:
                StudyEventOID = StudyEventData.attrib.get('StudyEventOID')
                StudyEventRepeatKey = StudyEventData.attrib.get('StudyEventRepeatKey')
                InstanceId = StudyEventData.attrib.get('{0}InstanceId'.format(mdsol))
                for FormData in StudyEventData:
                    FormOID = FormData.attrib.get('FormOID')
                    FormRepeatKey = FormData.attrib.get('FormRepeatKey')
                    DataPageId = FormData.attrib.get('{0}DataPageId'.format(mdsol))
                    for ItemGroupData in FormData:
                        ItemgroupOID = ItemGroupData.attrib.get('ItemgroupOID')
                        RecordId = ItemGroupData.attrib.get('{0}RecordId'.format(mdsol))
                        for ItemData in ItemGroupData:
                            var_name = ItemData.attrib.get('ItemOID')
                            Value = ItemData.attrib.get('Value')
                            Measurement_Unit = ItemData.find('MeasurementUnitRef'.format(xmlns)).attrib.get('MeasurementUnitOID')
                            for AuditRecord in ItemData:
                                DateTimeStamp = AuditRecord.find('{0}DateTimeStamp'.format(xmlns)).text;
                                SourceID = AuditRecord.find('{0}SourceID'.format(xmlns)).text; 
                                UserOID = ItemData.find('{0}UserRef'.format(xmlns)).attrib.get('UserOID')
                                df_xml = df_xml.append(
                                pd.Series([CreationDateTime,StudyOID,MetaDataVersionOID,SubjectName,SUBJECTUUID,LocationOID,StudyEventOID,StudyEventRepeatKey,FormOID,FormRepeatKey,DataPageId,ItemgroupOID,RecordId,var_name,Value,DateTimeStamp,ASC_Name,Measurement_Unit,SourceID,UserOID,InstanceId],index=dfcols),ignore_index=True)

    print(df_xml)
data_reader()

Problem: I am getting duplicate records, and assigning the variables DateTimeStamp, SourceID, UserOID and Measurement_Unit raises runtime errors.
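Reading the code against the input, a few likely causes stand out. `'MeasurementUnitRef'.format(xmlns)` has no `{0}` placeholder, so the namespace is never added and find() returns None. `for AuditRecord in ItemData` visits every child of ItemData, including MeasurementUnitRef, which has no DateTimeStamp or SourceID and also yields an extra pass per item. UserRef is a child of AuditRecord, not of ItemData. The attribute is spelled `ItemGroupOID`, not `ItemgroupOID`. Separately, `DataFrame.append` was removed in pandas 2.0, so collecting rows in a list is safer. The following is a sketch of one possible fix under those assumptions; `read_rows`, `attr`, the `odm` prefix, and the trimmed `SAMPLE` are hypothetical names introduced here, and pandas is left out so the block runs with the standard library alone.

```python
import xml.etree.ElementTree as ET

# Prefix map for path expressions; the prefix "odm" is an arbitrary local name.
NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
# Attribute lookups do not take a prefix map, so the URI is spelled out.
MDSOL = "{http://www.mdsol.com/ns/odm/metadata}"

# Trimmed sample: one ItemData holding an AuditRecord and a MeasurementUnitRef.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata"
     CreationDateTime="2019-08-23T12:59:09">
  <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)"
                mdsol:AuditSubCategoryName="Entered">
    <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1"
                 mdsol:SubjectName="ACC-SUBJ-1">
      <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]"
                      mdsol:InstanceId="2960564">
        <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
          <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
            <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
              <AuditRecord>
                <UserRef UserOID="alscrave2"/>
                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                <SourceID>122841525</SourceID>
              </AuditRecord>
              <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
            </ItemData>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def attr(node, name):
    """Return an attribute value, or None if find() matched no element."""
    return node.get(name) if node is not None else None


def read_rows(root):
    rows = []
    for clinical in root.findall("odm:ClinicalData", NS):
        for subject in clinical.findall("odm:SubjectData", NS):
            site = subject.find("odm:SiteRef", NS)
            for event in subject.findall("odm:StudyEventData", NS):
                for form in event.findall("odm:FormData", NS):
                    for group in form.findall("odm:ItemGroupData", NS):
                        for item in group.findall("odm:ItemData", NS):
                            unit = item.find("odm:MeasurementUnitRef", NS)
                            # findall() skips the MeasurementUnitRef sibling.
                            for audit in item.findall("odm:AuditRecord", NS):
                                # UserRef is a child of AuditRecord.
                                user = audit.find("odm:UserRef", NS)
                                rows.append({
                                    "CreationDateTime": root.get("CreationDateTime"),
                                    "StudyOID": clinical.get("StudyOID"),
                                    "MetaDataVersionOID": clinical.get("MetaDataVersionOID"),
                                    "SubjectName": subject.get(MDSOL + "SubjectName"),
                                    "SUBJECTUUID": subject.get("SubjectKey"),
                                    "LocationOID": attr(site, "LocationOID"),
                                    "StudyEventOID": event.get("StudyEventOID"),
                                    "StudyEventRepeatKey": event.get("StudyEventRepeatKey"),
                                    "FormOID": form.get("FormOID"),
                                    "FormRepeatKey": form.get("FormRepeatKey"),
                                    "DataPageId": form.get(MDSOL + "DataPageId"),
                                    "ItemgroupOID": group.get("ItemGroupOID"),
                                    "RecordId": group.get(MDSOL + "RecordId"),
                                    "var_name": item.get("ItemOID"),
                                    "Value": item.get("Value"),
                                    "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=NS),
                                    "ASC_Name": clinical.get(MDSOL + "AuditSubCategoryName"),
                                    "Measurement_Unit": attr(unit, "MeasurementUnitOID"),
                                    "SourceID": audit.findtext("odm:SourceID", namespaces=NS),
                                    "UserOID": attr(user, "UserOID"),
                                    "InstanceId": event.get(MDSOL + "InstanceId"),
                                })
    return rows


rows = read_rows(ET.fromstring(SAMPLE))
print(len(rows), rows[0]["Measurement_Unit"])
```

With pandas installed, `pd.DataFrame(rows)` turns the list into a DataFrame in one call. Note that the input posted above contains two ClinicalData blocks that differ only in RecordId and Value, so two nearly identical rows are expected for that file even after the fix.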
