Beginner's disclaimer
For the past few months I've been learning ML with Python and have had some decent results. However, this is my first project, and I need guidance from someone with more experience (Google can only take you so far).
What I'm trying to achieve
I have a dummy dataset of clients and their transactions. I want to classify or segment the clients into smaller "tribes" based on their demographics, spending score, and shopping behaviour. For example, a tribe's description could be as fine-grained as: (35-year-old males who mainly buy music-related products on Saturday afternoons in the first half of the month, and who have a high spending score). I want to find the sweet spot between fine-grained segments and overly general ones, e.g. segmenting by income and spending score alone.
What I've tried
First, I assigned an integer value representing how often each category occurs in a client's transactions. For example:
Client | Home | Movies | Games
   1   |   3  |    1   |   0
This means client 1 has bought home-related items 3 times, movie-related items once, and has never bought anything in the Games category.
I did the same for days (i.e. Sunday–Saturday), week number (i.e. 1–5, for the week of the month), and hour (i.e. hour_one through hour_twenty_four).
This approach lets me build clean, purely numerical vectors for each client.
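Roughly, the counting step looks like this (a simplified sketch with made-up sample rows, not my actual code; I flatten each client's transactions into `(customer_id, product_cat)` records first):

```python
import pandas as pd

# A flat list of transaction records pulled from the raw JSON
# (simplified stand-in data for illustration).
transactions = [
    {"customer_id": 1, "product_cat": "Home"},
    {"customer_id": 1, "product_cat": "Home"},
    {"customer_id": 1, "product_cat": "Home"},
    {"customer_id": 1, "product_cat": "Movies"},
    {"customer_id": 2, "product_cat": "Games"},
]

df = pd.DataFrame(transactions)

# Count how often each category occurs per client; crosstab produces the
# Client x Category frequency table shown above, filling absent combos with 0.
counts = pd.crosstab(df["customer_id"], df["product_cat"])
print(counts)
```

The same pattern works for the day, week-number, and hour columns.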
Here is a sample of my raw JSON input data (before processing):
[
  {
    "id": 1,
    "customer_id": 1,
    "age": 47,
    "gender": "Female",
    "first_name": "Lea",
    "last_name": "Calafato",
    "email": "lcalafato0@cafepress.com",
    "phone_number": "612-170-5956",
    "income_k": 24,
    "location": "Nottingham",
    "sign_up_date": "2/16/2019",
    "transactions": [
      {
        "customer_id": "1",
        "product_id": 42,
        "product_cat": "Home",
        "price": 106.92,
        "time": "8:15 PM",
        "date": "04/15/2019",
        "day": "Monday",
        "week_num": 3
      },
      {
        "customer_id": "1",
        "product_id": 30,
        "product_cat": "Movies",
        "price": 26.63,
        "time": "10:12 AM",
        "date": "09/17/2019",
        "day": "Tuesday",
        "week_num": 4
      }
    ],
    "number_of_purchases": 2,
    "last_purchase": "09/17/2019",
    "total_spent": 133.55
  }
]
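To get from that nested JSON to per-transaction rows, I flatten each client's transactions list. A sketch of that step (assuming the JSON is already loaded into a Python list, here a minimal stand-in called `clients`):

```python
import pandas as pd

# Minimal stand-in for the real JSON (one client, two transactions).
clients = [
    {
        "customer_id": 1,
        "age": 47,
        "gender": "Female",
        "income_k": 24,
        "transactions": [
            {"customer_id": "1", "product_cat": "Home", "day": "Monday", "week_num": 3},
            {"customer_id": "1", "product_cat": "Movies", "day": "Tuesday", "week_num": 4},
        ],
    }
]

# One row per transaction, with client-level fields repeated on each row;
# record_prefix avoids name clashes between client and transaction fields.
tx = pd.json_normalize(
    clients,
    record_path="transactions",
    meta=["age", "gender", "income_k"],
    record_prefix="tx_",
)
print(tx[["tx_product_cat", "tx_day", "age", "gender"]])
```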
And here is the processed and normalised dataframe:
age 750 non-null int64
income_k 750 non-null int64
spending_score 750 non-null int64
gender__Female 750 non-null uint8
gender__Male 750 non-null uint8
Home 750 non-null float64
Movies 750 non-null float64
Games 750 non-null float64
Grocery 750 non-null float64
Music 750 non-null float64
Health 750 non-null float64
Beauty 750 non-null float64
Sports 750 non-null float64
Toys 750 non-null float64
Garden 750 non-null float64
Computers 750 non-null float64
Clothing 750 non-null float64
Books 750 non-null float64
Outdoors 750 non-null float64
Industrial 750 non-null float64
Kids 750 non-null float64
Tools 750 non-null float64
Automotive 750 non-null float64
electronics 750 non-null float64
Jewelery 750 non-null float64
Baby 750 non-null float64
Shoes 750 non-null float64
week_one 750 non-null float64
week_two 750 non-null float64
week_three 750 non-null float64
week_four 750 non-null float64
week_five 750 non-null float64
Sunday 750 non-null float64
Monday 750 non-null float64
Tuesday 750 non-null float64
Wednesday 750 non-null float64
Thursday 750 non-null float64
Friday 750 non-null float64
Saturday 750 non-null float64
hour_one 750 non-null float64
hour_two 750 non-null float64
hour_three 750 non-null float64
hour_four 750 non-null float64
hour_five 750 non-null float64
hour_six 750 non-null float64
hour_seven 750 non-null float64
hour_eight 750 non-null float64
hour_nine 750 non-null float64
hour_ten 750 non-null float64
hour_eleven 750 non-null float64
hour_twelve 750 non-null float64
hour_thirteen 750 non-null float64
hour_fourteen 750 non-null float64
hour_fithteen 750 non-null float64
hour_sixteen 750 non-null float64
hour_seventeen 750 non-null float64
hour_eighteen 750 non-null float64
hour_nineteen 750 non-null float64
hour_twenty 750 non-null float64
hour_twenty_one 750 non-null float64
hour_twenty_two 750 non-null float64
hour_twenty_three 750 non-null float64
hour_twenty_four 750 non-null float64
I have run this data through both the k-means and DBSCAN algorithms, to no avail. k-means gives me 4 clusters, which are far too general for my needs, while DBSCAN gives me 0 clusters, with every data point treated as noise.
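For reference, this is roughly how I'm running the two algorithms (a simplified sketch on synthetic data standing in for my 750-row feature matrix; my real `n_clusters` and `eps` values may differ). In high-dimensional scaled data like mine, an eps this small leaves DBSCAN marking everything as noise, which matches what I'm seeing:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix (age, income_k, spending score,
# category / day / hour frequency columns, ...): 750 rows x 60 features.
X = rng.normal(size=(750, 60))

# Standardise so no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)

print("k-means clusters:", len(set(km.labels_)))
# DBSCAN labels noise points as -1.
print("DBSCAN noise points:", int((db.labels_ == -1).sum()))
```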
Apologies if anything is unclear; please feel free to ask me for clarification. Thanks in advance.