C# .NET - Is there a simple way to query the same XML node across a set of XML files in a single ZIP file?

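I'm trying to convert a piece of Python code to C#. The code takes a ZIP file full of XML files, runs a specific XPath query against each XML file, and returns the results. In Python it is quite lightweight and looks like this (I realize the example below isn't strictly XPath, but I wrote it a while ago!):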

import zipfile
import xml.etree.ElementTree as et

with zipfile.ZipFile(fullFileName) as zf:
    zfxml = [f for f in zf.namelist() if f.endswith('.xml')]
    for zfxmli in zfxml:
        with zf.open(zfxmli) as zff:
            zfft = et.parse(zff).getroot()
            zffts = zfft.findall('Widget')
            print([wgt.find('Description').text for wgt in zffts])

The closest I've gotten in C# is:

foreach (ZipArchiveEntry entry in archive.Entries)
{
    FileInfo fi = new FileInfo(entry.FullName);

    if (fi.Extension.Equals(".xml",StringComparison.OrdinalIgnoreCase))
    {
        using (Stream zipEntryStream = entry.Open())
        {
            XmlDocument xmlDoc = new XmlDocument();

            xmlDoc.Load(zipEntryStream);
            XmlNodeList wgtNodes = xmlDoc.SelectNodes("//Root/Widget");

            foreach (XmlNode tmp in wgtNodes)
            {
                zipListBox.Items.Add(tmp.SelectSingleNode("//Description"));
            }
        }
    }
}

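Although this does work for smaller ZIP files, it uses far more memory than the Python implementation, and it crashes with an out-of-memory error if the ZIP file contains too many XML files. Is there another, more efficient way to accomplish this?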

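As explained in "What is the best way to parse (big) XML in C# Code?", you can use XmlReader to stream through huge XML files with bounded memory consumption. However, XmlReader is somewhat tricky to use, because it is easy to read too little or too much if the XML is not exactly as expected. (Even insignificant whitespace can throw off an XmlReader algorithm.)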

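To reduce the chance of such errors, first introduce the following extension methods, which iterate over the direct child elements (or the descendants) of the current element: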

using System;
using System.Collections.Generic;
using System.Xml;

public static partial class XmlReaderExtensions
{
    /// <summary>
    /// Read all immediate child elements of the current element, and yield return a reader for those matching the incoming name and namespace.
    /// Leave the reader positioned after the end of the current element.
    /// </summary>
    public static IEnumerable<XmlReader> ReadElements(this XmlReader inReader, string localName, string namespaceURI)
    {
        inReader.MoveToContent();
        if (inReader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("The reader is not positioned on an element.");
        var isEmpty = inReader.IsEmptyElement;
        inReader.Read();
        if (isEmpty)
            yield break;
        while (!inReader.EOF)
        {
            switch (inReader.NodeType)
            {
                case XmlNodeType.EndElement:
                    // Move the reader AFTER the end of the element
                    inReader.Read();
                    yield break;
                case XmlNodeType.Element:
                    {
                        if (inReader.LocalName == localName && inReader.NamespaceURI == namespaceURI)
                        {
                            using (var subReader = inReader.ReadSubtree())
                            {
                                subReader.MoveToContent();
                                yield return subReader;
                            }
                            // ReadSubtree() leaves the reader positioned ON the end of the element, so read that also.
                            inReader.Read();
                        }
                        else
                        {
                            // Skip() leaves the reader positioned AFTER the end of the element.
                            inReader.Skip();
                        }
                    }
                    break;
                default:
                    // Not an element: text value, whitespace, comment. Read it and move on.
                    inReader.Read();
                    break;
            }
        }
    }

    /// <summary>
    /// Read all descendant elements of the current element, and yield return a reader for those matching the incoming name and namespace.
    /// Leave the reader positioned after the end of the current element.
    /// </summary>
    public static IEnumerable<XmlReader> ReadDescendants(this XmlReader inReader, string localName, string namespaceURI)
    {
        inReader.MoveToContent();
        if (inReader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("The reader is not positioned on an element.");
        using (var reader = inReader.ReadSubtree())
        {
            while (reader.ReadToFollowing(localName,namespaceURI))
            {
                using (var subReader = inReader.ReadSubtree())
                {
                    subReader.MoveToContent();
                    yield return subReader;
                }
            }
        }
        // Move the reader AFTER the end of the element
        inReader.Read();
    }
}

With these in place, your Python algorithm can be reproduced as follows:

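// Namespaces used: System.Collections.Generic, System.IO, System.IO.Compression, System.Linq, System.Xml
var zipListBox = new List<string>();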

using (var archive = ZipFile.Open(fullFileName,ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        if (Path.GetExtension(entry.Name).Equals(".xml",StringComparison.OrdinalIgnoreCase))
        {
            using (var zipEntryStream = entry.Open())
            using (var reader = XmlReader.Create(zipEntryStream))
            {
                // Move to the root element
                reader.MoveToContent();

                var query = reader
                    // Read all child elements <Widget>
                    .ReadElements("Widget","")
                    // And extract the text content of their first child element <Description>
                    .SelectMany(r => r.ReadElements("Description","").Select(i => i.ReadElementContentAsString()).Take(1));

                zipListBox.AddRange(query);
            }
        }
    }
}
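
For comparison, the same bounded-memory idea can be sketched on the Python side of the port. This is only an illustration, not part of the C# solution: it assumes the standard library's `xml.etree.ElementTree.iterparse`, the function name `iter_descriptions` is hypothetical, and it assumes (as in the question) that `<Widget>` elements are direct children of the root. Unlike the original Python code, it yields `None` when a widget has no `<Description>` child.

```python
import zipfile
import xml.etree.ElementTree as et


def iter_descriptions(zip_source):
    """Yield the text of the first <Description> child of each top-level <Widget>.

    zip_source may be a path or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".xml"):
                continue
            with zf.open(name) as stream:
                depth = 0
                for event, elem in et.iterparse(stream, events=("start", "end")):
                    if event == "start":
                        depth += 1
                        continue
                    # "end" event: this element and its children are fully parsed.
                    # Depth 2 means a direct child of the root element.
                    if depth == 2 and elem.tag == "Widget":
                        desc = elem.find("Description")
                        yield desc.text if desc is not None else None
                        # Drop the widget's children and text once processed.
                        elem.clear()
                    depth -= 1
```

Calling `list(iter_descriptions(fullFileName))` would collect the descriptions from every `.xml` entry. Because each widget is cleared as soon as it has been read, only a small part of any document is held in memory at a time, although the emptied `<Widget>` shells remain attached to the root until the file is finished.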

Notes:

Your C# XPath queries do not match your original Python queries. Your original Python code does the following:
```
zfft = et.parse(zff).getroot()
```
This unconditionally gets the root element (docs).
```
zffts = zfft.findall('Widget')
```
This finds all direct child elements named "Widget"; the recursive descent operator // is not used (docs).
```
wgt.find('Description').text for wgt in zffts
```
This loops through the widgets and, for each one, finds the first child element named "Description" and gets its text (docs).

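By comparison, xmlDoc.SelectNodes("//Root/Widget") recursively descends the entire XML element hierarchy to find nodes named <Widget> nested inside any node named <Root>, which is probably not what you want. Similarly, tmp.SelectSingleNode("//Description") uses recursive descent to find a description node. Note that a path beginning with // is evaluated from the document root, not from tmp, so it selects the first <Description> in the whole document; the relative path "Description" is the equivalent of the Python find('Description'). Recursive descent can therefore return different results, especially when there are multiple or nested <Description> nodes.

Using XmlReader.ReadSubtree() ensures that the entire element is consumed, no more and no less.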

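ReadElements() works well with LINQ to XML. For example, if you want to stream through the XML and get the id, description, and name of each widget without loading them all into memory, you can do the following: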

var query = reader
    .ReadElements("Widget","")
    .Select(r => XElement.Load(r))
    .Select(e => new { Description = e.Element("Description")?.Value,Id = e.Attribute("id")?.Value,Name = e.Element("Name")?.Value });

foreach (var widget in query)
{
    Console.WriteLine("Id = {0},Name = {1},Description = {2}",widget.Id,widget.Name,widget.Description);
}

Memory use is again bounded here, because only one XElement, corresponding to a single <Widget>, is referenced at any one time.

Demo fiddle here.
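
The difference described in the first note, between matching direct children and using recursive descent, can be checked on the Python side with a small sketch. The sample XML is invented for illustration. Also note that ElementTree's `.//` is relative to the element it is called on, which is an assumption worth keeping in mind when comparing with .NET XPath, where a leading `//` starts from the document root.

```python
import xml.etree.ElementTree as et

# Hypothetical sample document, made up for this illustration.
SAMPLE = """
<Root>
  <Widget>
    <Part><Description>nested part</Description></Part>
    <Description>first widget</Description>
  </Widget>
  <Group>
    <Widget><Description>grouped widget</Description></Widget>
  </Group>
</Root>
"""

root = et.fromstring(SAMPLE)

# Direct children only, as in the original Python code.
direct = [w.find("Description").text for w in root.findall("Widget")]

# Recursive descent: every <Widget> at any depth, and the first
# <Description> found anywhere beneath each one, in document order.
recursive = [w.find(".//Description").text for w in root.findall(".//Widget")]

print(direct)
print(recursive)
```

Here `direct` contains only "first widget", while `recursive` contains "nested part" and "grouped widget": the recursive form picks up the widget inside `<Group>` and also prefers the `<Description>` nested under `<Part>` because it comes first in document order.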

Update

How would your code change if the collection of <Widget> tags were not directly under the XML root, but were instead contained in a single <Widgets> subtree of the root?

You have a couple of options here. First, you can make nested calls to ReadElements by chaining together LINQ statements that flatten the element hierarchy with SelectMany:

var query = reader
    // Read all child elements <Widgets>
    .ReadElements("Widgets","")
    // Read all child elements <Widget>
    .SelectMany(r => r.ReadElements("Widget",""))
    // And extract the text content of their first child element <Description>
    .SelectMany(r => r.ReadElements("Description","").Select(i => i.ReadElementContentAsString()).Take(1));

Use this option if you are only interested in reading <Widget> nodes at some specific XPath.

Alternatively, you can simply read all descendants named <Widget>, as follows:

var query = reader
    // Read all descendant elements <Widget>
    .ReadDescendants("Widget","")
    // And extract the text content of their first child element <Description>
    .SelectMany(r => r.ReadElements("Description","").Select(i => i.ReadElementContentAsString()).Take(1));

Use this option if you are interested in reading <Widget> nodes wherever they occur in the XML.

Demo fiddle #2 here.
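
The two options above have rough counterparts in the original Python code, sketched here for comparison. The sample XML and the helper `first_description` are hypothetical. One difference to be aware of: `iter()` also visits a `<Widget>` nested inside another `<Widget>`, so it is an approximation of ReadDescendants() rather than an exact equivalent.

```python
import xml.etree.ElementTree as et

# Hypothetical sample document, made up for this illustration.
SAMPLE = """
<Root>
  <Widgets>
    <Widget><Description>in Widgets</Description></Widget>
  </Widgets>
  <Archive>
    <Widget><Description>in Archive</Description></Widget>
  </Archive>
</Root>
"""

root = et.fromstring(SAMPLE)


def first_description(widget):
    """Return the text of the first direct <Description> child, or None."""
    desc = widget.find("Description")
    return desc.text if desc is not None else None


# Option 1: only widgets at the specific path Root/Widgets/Widget.
by_path = [first_description(w) for w in root.findall("Widgets/Widget")]

# Option 2: every <Widget> descendant, wherever it occurs.
anywhere = [first_description(w) for w in root.iter("Widget")]

print(by_path)
print(anywhere)
```

With this sample, `by_path` holds only "in Widgets", while `anywhere` also includes "in Archive".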
