How do I read a parquet file from an Azure Python function blob input binding?

2024-05-04 • Q&A

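I have a Python function with a blob input binding. The blob in question contains a parquet file. Ultimately, I want to read the bound blob into a pandas DataFrame, but I am not sure of the correct way to do this.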

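I have verified that the binding is set up correctly, and I have been able to read a plain-text file successfully. I am confident that the parquet file itself is intact, because I can read it using the example provided here: https://arrow.apache.org/docs/python/parquet.html#reading-a-parquet-file-from-azure-blob-storage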

The following code shows what I am trying to do:


import logging
import io
import azure.functions as func
import pyarrow.parquet as pq


def main(req: func.HttpRequest, inputblob: func.InputStream) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # Create a bytestream to hold blob content
    byte_stream = io.BytesIO()
    byte_stream.write(inputblob.read())
    df = pq.read_table(source=byte_stream).to_pandas()

I get the following error message:

pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
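One way to narrow down an error like this is to check whether the bytes delivered by the binding still look like a parquet file at all. The sketch below relies on the fact that an unencrypted parquet file begins and ends with the 4-byte marker PAR1. The helper name looks_like_parquet is hypothetical and not part of pyarrow or azure.functions, and passing the check does not prove the file is valid; it only rules out obviously mangled content.

```python
PARQUET_MAGIC = b"PAR1"  # marker at the start and end of an unencrypted parquet file


def looks_like_parquet(data: bytes) -> bool:
    """Return True if data starts and ends with the parquet magic bytes.

    Hypothetical diagnostic helper: a quick sanity check, not a validator.
    """
    # Loose lower bound: header magic + 4-byte footer length + footer magic.
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )
```

Inside the function, the result of looks_like_parquet(blob_bytes) could be logged, where blob_bytes is whatever inputblob.read() returned. If it is False, the content was probably altered before it reached pyarrow, which would point at the binding rather than at the parsing code.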

Here is my function.json file:

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "authLevel": "function",
      "type": "httpTrigger",
      "direction": "in",
      "name": "req",
      "methods": [
        "get",
        "post"
      ]
    },
    {
      "type": "http",
      "direction": "out",
      "name": "$return"
    },
    {
      "name": "inputblob",
      "type": "blob",
      "path": "<container>/file.parquet",
      "connection": "AzureWebJobsStorage",
      "direction": "in"
    }
  ]
}

My host.json file:

{
    "version": "2.0",
    "functionTimeout": "00:10:00",
    "extensionBundle": {
        "id": "microsoft.Azure.Functions.ExtensionBundle",
        "version": "[1.*,2.0.0)"
    }
}

from io import BytesIO

import azure.functions as func
import pandas as pd


def main(blobTrigger: func.InputStream):
    # Read the blob as bytes
    blob_bytes = blobTrigger.read()
    # Wrap the bytes in an in-memory file object that pandas can read
    blob_to_read = BytesIO(blob_bytes)
    df = pd.read_parquet(blob_to_read, engine='pyarrow')
    print("Length of the parquet file:" + str(len(df.index)))
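One difference between the answer and the code in the question is how the in-memory stream is built: the question writes into an empty BytesIO, while the answer passes the bytes to the constructor. The sketch below shows what that does to the stream position, using only the standard library. The data value is a stand-in, not a real parquet file, and the source does not confirm that the position is what caused the error in the question.

```python
import io

# Stand-in for the bytes returned by the blob binding (not a real parquet file).
data = b"PAR1 example payload PAR1"

# Pattern from the question: write into an empty stream.
written = io.BytesIO()
written.write(data)
pos_after_write = written.tell()        # position is at the end of the data
remaining_after_write = written.read()  # nothing left to read from here

# Rewinding makes the full content readable again.
written.seek(0)
after_seek = written.read()

# Pattern from the answer: pass the bytes to the constructor.
direct = io.BytesIO(data)
pos_direct = direct.tell()              # position starts at the beginning
```

A reader that starts consuming from the current position sees an empty stream in the first case unless seek(0) is called first. Building the stream with BytesIO(data) avoids having to think about it.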


why521jay answered: How do I read a parquet file from an Azure Python function blob input binding?
