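I am pulling an RSS feed ('http://www.reddit.com/new/.rss?sort=new') and uploading it to a SQL database. Here is what I do:
From that URL I can build a pandas DataFrame, which I then upload to the SQL database. The DataFrame's columns are title, link, summary, author, and tags. What is the best way to clean the summary column and strip out all of the HTML tags? A sample value from the summary column: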
'<!-- SC_OFF --><div class="md"><p>The title says most of it, I’m running about a 12-13 min mile. I haven’t run in about 4.5 years and I need to get to my fastest 1.5 with more in the tank afterward, and I need it to be solid. </p> <p>I’ve read blogs and running guides, but I thought I’d get it from the source, people who just love to run, just like the way I used to love to lift. </p> <p>I guess my question is, where do I start? Some say football conditioning, others say just run… Some even say just walk. I’m trying to slim down fast and have a solid mile and a half to 2-mile sprint. </p> <p>The only other conditioning I’m doing right now is three days of fight sports (2 Krav/kickboxing, 1 combat fitness style). Looking at running 3ish days and taking Sunday off.</p> </div><!-- SC_ON -->   submitted by   <a href="https://www.reddit.com/user/Logical_penguin"> /u/Logical_penguin </a>   to   <a href="https://www.reddit.com/r/running/"> r/running </a> <br/> <span><a href="https://www.reddit.com/r/running/comments/drt0nf/im_65_335lbs_ex_amature_strong_man_and_i_need_help/">[link]</a></span>   <span><a href="https://www.reddit.com/r/running/comments/drt0nf/im_65_335lbs_ex_amature_strong_man_and_i_need_help/">[comments]</a></span>'
I can use the following to handle part of it:
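# str.lstrip() strips a set of characters rather than a prefix, so replace the literal marker instead
df['summary'] = df['summary'].str.replace('<!-- SC_OFF -->', '', regex=False)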
However, doing this for every tag in the summary column would take far too long.
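
One possible approach, sketched here with only the Python standard library, is to let an HTML parser collect the text nodes instead of removing each tag by hand. The names `_TextExtractor` and `strip_tags` are hypothetical helpers made up for this illustration, and the sketch assumes each summary is a string of reasonably well-formed HTML like the sample above.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are ignored by default."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html_text):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    # Leave missing values (None, NaN) untouched so the function is safe to map over a column
    if not isinstance(html_text, str):
        return html_text
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return re.sub(r"\s+", " ", parser.text()).strip()
```

Assuming `df` is the DataFrame from the question, the whole column could then be cleaned in one pass with `df['summary'] = df['summary'].map(strip_tags)`. Note that this keeps link text such as "[link]" and "[comments]", and that adjacent elements with no whitespace between them would have their text run together. If a third-party dependency is acceptable, BeautifulSoup's `get_text()` is a commonly used alternative.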