How do I convert wiki markup to plain text? R / Wikipedia

I am trying to run the following code:

library(WikipediR)

wp_content <- page_content("en","wikipedia",page_name = "Aaron Halfaker",as_wikitext = T,clean_response = T)

wp_content <- wp_content$wikitext$`*`

print(wp_content)

But the output is in wiki markup:

[1] "{{Infobox scientist\n| name        = Aaron Halfaker\n| native_name = \n| native_name_lang = \n| image       = File:Halfaker,_Aaron_Sept_2013.jpg\n| image_size  = \n| alt         = \n| caption     = \n| birth_date  = {{birth date and age|1983|12|27}}\n| birth_place = [[Virginia,Minnesota]]<ref>{{Cite web |url=https://twitter.com/halfak/status/826529576906059780 |title=Twitter status |last=Halfaker |first=Aaron |website=Twitter |date=31 January 2017}}</ref>\n| death_date  = \n| death_place = \n| resting_place = \n| resting_place_coordinates =  <!--{{coord|LAT|LONG|type:landmark|display=inline,title}}-->\n| other_names = \n| residence   = \n| citizenship = \n| nationality = \n| fields      = [[Human-Computer Interaction]] <br/> [[computer-supported cooperative work]]\n| workplaces  = [[Wikimedia Foundation]]\n| patrons     = \n| alma_mater  = [[The College of St. Scholastica]] (B.S.,2006)<br/> [[University of Minnesota]] (Ph.D.,2013)<ref name=\"tmn\">{{cite web|url=http://tech.mn/news/2013/12/11/aaron-halfaker-wikimedia-foundation/|title=Wicked Smart: 5 questions with U of M PhD and Wikipedian Aaron Halfaker|date=11 December 2013|publisher=TechMN|accessdate=5 January 2015}}</ref><ref>{{Cite web |url=https://www-users.cs.umn.edu/~halfak/docs/curriculum_vitae |title=Aaron Halfaker Curriculum Vitae}}</ref>\n..."

How can I convert this to plain text, or retrieve it as plain text in the first place? I also tried passing as_wikitext = F, but without success.

Language: R. Package: WikipediR v1.5.0

Answer:

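as_wikitext = T downloads the text as wiki markup. By default, page_content returns the page as HTML markup instead. Fortunately, many HTML parsers are available, and one of the best is rvest. The following code downloads the page as HTML, parses it into an HTML document with rvest::read_html, and then extracts the plain text with rvest::html_text:
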
library(WikipediR)
library(rvest)
#> Loading required package: xml2

# as_wikitext = FALSE (the default) returns rendered HTML instead of wiki markup
wp_content <- page_content(language = 'en',project = 'wikipedia',page_name = 'Aaron Halfaker',as_wikitext = FALSE)

html_text(read_html(wp_content$parse$text$`*`))
#> [1] "Aaron HalfakerBorn (1983-12-27) December 27,1983 (age 36)Virginia,Minnesota[1]Alma materThe College of St. Scholastica (B.S.,2006)University of Minnesota (Ph.D.,2013)..."

Created on 2020-09-02 by the reprex package (v0.3.0)
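
The output above runs the infobox, footnote markers such as [1], and the body text together into one string. If only the article prose is wanted, one possible refinement is to drop the footnote markers and keep just the paragraph elements. The sketch below is not part of the original answer. It runs on a small hypothetical HTML string that stands in for wp_content$parse$text$`*`, and it assumes that footnote markers are rendered as sup elements with class "reference" and that the prose lives in p elements, which may not hold for every page.

```r
library(rvest)  # read_html(), html_nodes(), html_text(); xml2 is installed with rvest

# Hypothetical stand-in for wp_content$parse$text$`*`
html <- paste0(
  '<div>',
  '<table class="infobox"><tr><td>Infobox cell</td></tr></table>',
  '<p>First paragraph of the article.<sup class="reference">[1]</sup></p>',
  '<p>Second paragraph.</p>',
  '</div>'
)

doc <- read_html(html)

# Remove footnote markers in place (assumed selector: sup.reference)
xml2::xml_remove(html_nodes(doc, "sup.reference"))

# Keep only paragraph text, which skips the infobox table
paragraphs <- html_text(html_nodes(doc, "p"))

# Join paragraphs with blank lines between them
plain_text <- paste(trimws(paragraphs), collapse = "\n\n")
cat(plain_text)
```

To apply this to the real page, replace the html string with wp_content$parse$text$`*` from the code above. html_nodes is used here because it is available in the rvest versions current at the time of the answer; newer rvest releases also offer html_elements for the same purpose.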
