Beginner's disclaimer
For the past few months I've been learning ML with Python and have had some decent results. However, this is my first project, and I need guidance from someone with more experience (Google can only take you so far).
What I'm trying to achieve
I have a dummy dataset of clients and their transactions. I want to classify or segment the clients into smaller "tribes" based on their demographics, spending score, and shopping behaviour. For example, a tribe's description could be as fine-grained as: (35-year-old males who mainly buy music-related products on Saturday afternoons in the first half of the month, and who have a high spending score). I want to find the sweet spot between fine-grained segments and overly general ones, e.g. segmenting by income and spending score alone.
What I've tried
First, I assigned an integer value representing how often each category occurs in a client's transactions. For example:
Client | Home | Movies | Games
   1   |   3  |    1   |   0
This means client 1 has bought home-related items 3 times, movie-related items once, and has never bought anything in the Games category.
I did the same for days (i.e. Sunday–Saturday), week number (i.e. 1–5, for the week of the month), and hour (i.e. hour_one through hour_twenty_four).
This approach lets me build clean, purely numerical vectors for each client.
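Roughly, the counting step looks like this (a simplified sketch with made-up sample rows, not my actual code; I flatten each client's transactions into `(customer_id, product_cat)` records first):

```python
import pandas as pd

# A flat list of transaction records pulled from the raw JSON
# (simplified stand-in data for illustration).
transactions = [
    {"customer_id": 1, "product_cat": "Home"},
    {"customer_id": 1, "product_cat": "Home"},
    {"customer_id": 1, "product_cat": "Home"},
    {"customer_id": 1, "product_cat": "Movies"},
    {"customer_id": 2, "product_cat": "Games"},
]

df = pd.DataFrame(transactions)

# Count how often each category occurs per client; crosstab produces the
# Client x Category frequency table shown above, filling absent combos with 0.
counts = pd.crosstab(df["customer_id"], df["product_cat"])
print(counts)
```

The same pattern works for the day, week-number, and hour columns.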
Here is a sample of my raw JSON input data (before processing):
[
  {
    "id": 1,
    "customer_id": 1,
    "age": 47,
    "gender": "Female",
    "first_name": "Lea",
    "last_name": "Calafato",
    "email": "lcalafato0@cafepress.com",
    "phone_number": "612-170-5956",
    "income_k": 24,
    "location": "Nottingham",
    "sign_up_date": "2/16/2019",
    "transactions": [
      {
        "customer_id": "1",
        "product_id": 42,
        "product_cat": "Home",
        "price": 106.92,
        "time": "8:15 PM",
        "date": "04/15/2019",
        "day": "Monday",
        "week_num": 3
      },
      {
        "customer_id": "1",
        "product_id": 30,
        "product_cat": "Movies",
        "price": 26.63,
        "time": "10:12 AM",
        "date": "09/17/2019",
        "day": "Tuesday",
        "week_num": 4
      }
    ],
    "number_of_purchases": 2,
    "last_purchase": "09/17/2019",
    "total_spent": 133.55
  }
]
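To get from that nested JSON to per-transaction rows, I flatten each client's transactions list. A sketch of that step (assuming the JSON is already loaded into a Python list, here a minimal stand-in called `clients`):

```python
import pandas as pd

# Minimal stand-in for the real JSON (one client, two transactions).
clients = [
    {
        "customer_id": 1,
        "age": 47,
        "gender": "Female",
        "income_k": 24,
        "transactions": [
            {"customer_id": "1", "product_cat": "Home", "day": "Monday", "week_num": 3},
            {"customer_id": "1", "product_cat": "Movies", "day": "Tuesday", "week_num": 4},
        ],
    }
]

# One row per transaction, with client-level fields repeated on each row;
# record_prefix avoids name clashes between client and transaction fields.
tx = pd.json_normalize(
    clients,
    record_path="transactions",
    meta=["age", "gender", "income_k"],
    record_prefix="tx_",
)
print(tx[["tx_product_cat", "tx_day", "age", "gender"]])
```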
And here is the processed and normalised dataframe:
age 750 non-null int64
income_k 750 non-null int64
spending_score 750 non-null int64
gender__Female 750 non-null uint8
gender__Male 750 non-null uint8
Home 750 non-null float64
Movies 750 non-null float64
Games 750 non-null float64
Grocery 750 non-null float64
Music 750 non-null float64
Health 750 non-null float64
Beauty 750 non-null float64
Sports 750 non-null float64
Toys 750 non-null float64
Garden 750 non-null float64
Computers 750 non-null float64
Clothing 750 non-null float64
Books 750 non-null float64
Outdoors 750 non-null float64
Industrial 750 non-null float64
Kids 750 non-null float64
Tools 750 non-null float64
Automotive 750 non-null float64
electronics 750 non-null float64
Jewelery 750 non-null float64
Baby 750 non-null float64
Shoes 750 non-null float64
week_one 750 non-null float64
week_two 750 non-null float64
week_three 750 non-null float64
week_four 750 non-null float64
week_five 750 non-null float64
Sunday 750 non-null float64
Monday 750 non-null float64
Tuesday 750 non-null float64
Wednesday 750 non-null float64
Thursday 750 non-null float64
Friday 750 non-null float64
Saturday 750 non-null float64
hour_one 750 non-null float64
hour_two 750 non-null float64
hour_three 750 non-null float64
hour_four 750 non-null float64
hour_five 750 non-null float64
hour_six 750 non-null float64
hour_seven 750 non-null float64
hour_eight 750 non-null float64
hour_nine 750 non-null float64
hour_ten 750 non-null float64
hour_eleven 750 non-null float64
hour_twelve 750 non-null float64
hour_thirteen 750 non-null float64
hour_fourteen 750 non-null float64
hour_fithteen 750 non-null float64
hour_sixteen 750 non-null float64
hour_seventeen 750 non-null float64
hour_eighteen 750 non-null float64
hour_nineteen 750 non-null float64
hour_twenty 750 non-null float64
hour_twenty_one 750 non-null float64
hour_twenty_two 750 non-null float64
hour_twenty_three 750 non-null float64
hour_twenty_four 750 non-null float64
I have run this data through both the k-means and DBSCAN algorithms, to no avail. k-means gives me 4 clusters, which are far too general for my needs, while DBSCAN gives me 0 clusters, with every data point treated as noise.
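For reference, this is roughly how I'm running the two algorithms (a simplified sketch on synthetic data standing in for my 750-row feature matrix; my real `n_clusters` and `eps` values may differ). In high-dimensional scaled data like mine, an eps this small leaves DBSCAN marking everything as noise, which matches what I'm seeing:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix (age, income_k, spending score,
# category / day / hour frequency columns, ...): 750 rows x 60 features.
X = rng.normal(size=(750, 60))

# Standardise so no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)

print("k-means clusters:", len(set(km.labels_)))
# DBSCAN labels noise points as -1.
print("DBSCAN noise points:", int((db.labels_ == -1).sum()))
```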
Apologies if anything is unclear; please feel free to ask me for clarification. Thanks in advance.