python – Speeding up JSON-to-DataFrame conversion with a lot of data manipulation


I have a large set of JSON data in the following format:

    [
      [{
        "created_at": "2017-04-28T16:52:36Z",
        "as_of": "2017-04-28T17:00:05Z",
        "trends": [{
          "url": "http://twitter.com/search?q=%23ChavezSigueCandanga",
          "query": "%23ChavezSigueCandanga",
          "tweet_volume": 44587,
          "name": "#ChavezSigueCandanga",
          "promoted_content": null
        }, {
          "url": "http://twitter.com/search?q=%2327Abr",
          "query": "%2327Abr",
          "tweet_volume": 79781,
          "name": "#27Abr",
          "promoted_content": null
        }],
        "locations": [{
          "woeid": 395277,
          "name": "Turmero"
        }]
      }],
      [{
        "created_at": "2017-04-28T16:57:35Z",
        "as_of": "2017-04-28T17:00:03Z",
        "trends": [{
          "url": "http://twitter.com/search?q=%23fyrefestival",
          "query": "%23fyrefestival",
          "tweet_volume": 141385,
          "name": "#fyrefestival",
          "promoted_content": null
        }, {
          "url": "http://twitter.com/search?q=%23HotDocs17",
          "query": "%23HotDocs17",
          "tweet_volume": null,
          "name": "#HotDocs17",
          "promoted_content": null
        }],
        "locations": [{
          "woeid": 9807,
          "name": "Vancouver"
        }]
      }],
      ...
    ]

I wrote a function that formats it into a pandas DataFrame of the following form:

    +----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
    |    | name                           | promoted_content | query                            | tweet_volume | url                                                          | as_of                | created_at           | location_name | location_woeid |
    +----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
    | 47 | #BatesMotel                    |                  | %23BatesMotel                    | 59748        | http://twitter.com/search?q=%23BatesMotel                    | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
    | 48 | #AdviceForPeopleJoiningTwitter |                  | %23AdviceForPeopleJoiningTwitter | 51222        | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
    | 49 | #CADTHSymp                     |                  | %23CADTHSymp                     |              | http://twitter.com/search?q=%23CADTHSymp                     | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
    | 0  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
    | 1  | #TravelTuesday                 |                  | %23TravelTuesday                 |              | http://twitter.com/search?q=%23TravelTuesday                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
    | 2  | #DigitalLeap                   |                  | %23DigitalLeap                   |              | http://twitter.com/search?q=%23DigitalLeap                   | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
    |    |                                |                  |                                  |              |                                                              |                      |                      |               |                |
    | 0  | #nusnc17                       |                  | %23nusnc17                       |              | http://twitter.com/search?q=%23nusnc17                       | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
    | 1  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
    | 2  | #littleboyblue                 |                  | %23littleboyblue                 | 20772        | http://twitter.com/search?q=%23littleboyblue                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
    +----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+

Here is the function that builds the DataFrame from the JSON:

    import pandas as pd

    def trends_to_dataframe(data):
        df = pd.DataFrame()
        for location in data:
            # One small frame per location, one row per trend.
            temp_df = pd.DataFrame()
            for trend in location[0]['trends']:
                temp_df = temp_df.append(pd.Series(trend), ignore_index=True)
            # Attach the location-level metadata to every row.
            temp_df['as_of'] = location[0]['as_of']
            temp_df['created_at'] = location[0]['created_at']
            temp_df['location_name'] = location[0]['locations'][0]['name']
            temp_df['location_woeid'] = location[0]['locations'][0]['woeid']
            df = df.append(temp_df)
        return df

Unfortunately, given the amount of data I have (and some simple timers I've run), this will take around 4 hours to complete. Any ideas on how to speed it up?

Best answer

You can speed this up by flattening the data asynchronously with concurrent.futures and then loading everything into a DataFrame in one go with from_records.

    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd

    def get_trends(location):
        # Flatten one location entry: copy the location-level metadata
        # onto each of its trend records.
        trends = []
        for trend in location[0]['trends']:
            trend['as_of'] = location[0]['as_of']
            trend['created_at'] = location[0]['created_at']
            trend['location_name'] = location[0]['locations'][0]['name']
            trend['location_woeid'] = location[0]['locations'][0]['woeid']
            trends.append(trend)
        return trends

    flat_data = []
    with ThreadPoolExecutor() as executor:
        # map() farms each location out to the pool and yields the
        # flattened lists back in order.
        for trends in executor.map(get_trends, data):
            flat_data += trends

    df = pd.DataFrame.from_records(flat_data)
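
One design note: most of the speedup comes from building the DataFrame once with from_records over plain dicts instead of calling DataFrame.append row by row, which copies the growing frame on every call. Since the flattening itself is CPU-bound Python, the thread pool buys little under the GIL, so a single-threaded sketch of the same idea (assuming the same data structure) should perform comparably:

    import pandas as pd

    flat_data = []
    for location in data:
        entry = location[0]
        for trend in entry['trends']:
            # Merge the location-level metadata into each trend record.
            flat_data.append({**trend,
                              'as_of': entry['as_of'],
                              'created_at': entry['created_at'],
                              'location_name': entry['locations'][0]['name'],
                              'location_woeid': entry['locations'][0]['woeid']})

    df = pd.DataFrame.from_records(flat_data)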
