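I need to categorize/classify the individual sentences in the job_experience section of n = 630 job descriptions. I am particularly interested in extracting the sentences related to work experience and competencies, but I need to keep them linked to the job_title they belong to.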
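Current state of these job descriptions: many near-duplicate statements (e.g., "Microsoft Office skills required." "Experience using Microsoft Word, PowerPoint." "At least 3 years of experience in a related field." "Minimum of three years of experience in a similar position.").
In the future, we will need to condense these job description statements so that, for example, the same statement can be applied to multiple positions and managers can select work experience statements from a drop-down list.
So I would like to categorize these individual sentences so that we can condense them and decide which statements to use going forward.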
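I have been researching how to approach this and would appreciate any suggestions on which method works best. I am familiar with R, but I mainly use it for data wrangling and visualization. LDA, k-means text clustering, feature identification... these are the things I came across in my research (scikit-learn.org), and most of them seem to be used from Python.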
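- Is Python the best fit for this kind of task? Can I use R?
- Which algorithm is most suitable for a beginner?
- I know this isn't magic; I'm just looking for the best way to accomplish this task.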
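For context, this is roughly what I understand k-means text clustering to look like in base R. It is only a minimal sketch under my own assumptions: the six sentences are a hand-picked sample from my data, the tokenizer and TF-IDF weighting are hand-rolled rather than taken from a text-mining package, and the choice of 2 clusters is arbitrary.

```r
# Hypothetical sample: a few sentences taken from the example data.
sentences <- c(
  "High school diploma required.",
  "High School diploma preferred.",
  "High school diploma or GED equivalent.",
  "Minimum 1 year experience in recruitment or related human resources function.",
  "Minimum 2 years experience applying L&OD principles and practices in an organizational setting.",
  "Previous work experience in Human Resources preferred."
)

# Lowercase, replace non-letters with spaces, split, drop tokens of 1-2 characters.
tokenize <- function(x) {
  words <- strsplit(gsub("[^a-z]+", " ", tolower(x)), " ", fixed = TRUE)[[1]]
  words[nchar(words) > 2]
}
tokens <- lapply(sentences, tokenize)

# Document-term count matrix: one row per sentence, one column per word.
vocab <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(counts) <- vocab

# TF-IDF weighting, then scale every row to unit length.
idf <- log(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(counts, 2, idf, "*")
tfidf <- tfidf / sqrt(rowSums(tfidf^2))

# k-means with an assumed number of clusters (2 is a guess, not a tuned value).
set.seed(1)
km <- kmeans(tfidf, centers = 2, nstart = 20)
result <- data.frame(sentence = sentences, cluster = km$cluster)
print(result)
```

The cluster numbers are unlabeled, so I would still have to inspect each cluster and name it myself, and I am not sure how well this scales to all 630 descriptions.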
My data looks like this:
# NOTE: the number of rows per job title (4, 2, 4, 3) is inferred from the
# sentence content and the desired output; the two columns must have equal length.
df <- data.frame(
  job_title = rep(
    c("Recruiter", "File Clerk", "Learning & Org. Development Specialist", "CNA"),
    times = c(4, 2, 4, 3)
  ),
  job_experience = c(
    "Minimum 1 year experience in recruitment or related human resources function.",
    "Proficient in microsoft Office Applications.",
    "High school diploma required.",
    "Bachelors Degree in Human Resources or related field preferred.",
    "High School diploma preferred.",
    "Ability to use relevant computer systems.",
    "Bachelors Degree in related field (e.g.,Human Resources,Education,Organizational Development).",
    "Minimum 2 years experience applying L&OD principles and practices in an organizational setting.",
    "Previous work experience in Human Resources preferred.",
    "Experience with a learning management system (LMS).",
    "High school diploma or GED equivalent.",
    "Certified Nursing Assistant,certified by the Virginia Board of Health Professions.",
    "CPR certification required at date of hire."
  ),
  stringsAsFactors = FALSE
)
My goal is to end up with a dataset like this (new column = job_exp_category):
job_title job_experience job_exp_category
"Recruiter" "Minimum 1 year experience in recruitment..." "Work experience"
"Recruiter" "Proficient in microsoft Office Applicati..." "Skill/Ability"
"Recruiter" "High school diploma required." "Degree"
... ... ...
"CNA" "Certified Nursing Assistant,certificati..." "Certification/License"
"CNA" "CPR certification required at date of hire." "Certification/License"
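To make the target concrete, here is a naive keyword-rule baseline in base R that would produce such a column. It is only a sketch: the function name categorize_sentence and the keyword patterns are my own guesses based on the handful of examples above, and the rule order (certification first, then degree, then work experience, with everything else falling into "Skill/Ability") is an assumption, not a validated scheme.

```r
# Hypothetical rule-based labeler; patterns are guesses from the example sentences.
categorize_sentence <- function(sentence) {
  s <- tolower(sentence)
  if (grepl("certif|licens|\\bcpr\\b", s, perl = TRUE)) {
    "Certification/License"
  } else if (grepl("diploma|degree|\\bged\\b", s, perl = TRUE)) {
    "Degree"
  } else if (grepl("years? (of )?experience|work experience", s, perl = TRUE)) {
    "Work experience"
  } else {
    "Skill/Ability"  # fallback for anything the rules above do not catch
  }
}

# Small sample of the data, so that the sketch runs on its own.
sample_df <- data.frame(
  job_title = c("Recruiter", "Recruiter", "Recruiter", "CNA"),
  job_experience = c(
    "Minimum 1 year experience in recruitment or related human resources function.",
    "Proficient in microsoft Office Applications.",
    "High school diploma required.",
    "CPR certification required at date of hire."
  ),
  stringsAsFactors = FALSE
)

# Apply the rules row by row; job_title stays attached to each sentence.
sample_df$job_exp_category <- vapply(
  sample_df$job_experience, categorize_sentence, character(1), USE.NAMES = FALSE
)
print(sample_df)
```

Rules like these obviously break down on sentences I have not seen, which is why I am asking about clustering or classification approaches instead.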
Thanks in advance to the SO community for any insight.