Clustering strings with DBSCAN

2024-05-04 • Q&A

Summary: Looking for a Python DBSCAN implementation that clusters a multi-column CSV file based on its 'Contents' column.

Input:

    Sample rows from the input CSV file:

    Rank,Domain,Contents      

    1,abc.com,hello random text out
    2,xyz.com,hello random somethingelse
    3,not.com,a b c d
    4,plus.com,a b asdsadsa asdsadasdsadsa
    5,minus.com,man win 

    Where:
    Column 1 => Rank = a number
    Column 2 => Domain = domain name, e.g. abc.com
    Column 3 => Contents = list of words (a string of cleaned-up words extracted from the HTML page)

Output :

    The clusters should be based on similar lists of contents:

    Cluster 1: abc.com,xyz.com
    Cluster 2: not.com,plus.com
    Cluster 3: minus.com
    ....

    Please note: in the output, I am not looking for the words that fall in the same cluster. Instead, I am looking for the domain names (column 2), clustered by the similarity of column 3, 'Contents'.

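I have looked at the following resources, but they are based on k-means and do not produce the DBSCAN clustering output I am looking for. Note that supplying the number of clusters does not apply here, because we do not want to fix the cluster count in advance.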

1) How can I cluster text data with multiple columns?

2) Clustering text documents using scikit-learn kmeans in Python

3) http://brandonrose.org/clustering

4) https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

5) https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52

So:

input <= csv file with 'Rank','Domain','Contents'
output <= cluster with domain name [NOT contents]

A Python implementation using DBSCAN clustering would be ideal.

Thanks!
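For reference, here is a minimal, standard-library-only sketch of the pipeline described above: read the CSV, turn each 'Contents' string into a bag of words, run DBSCAN with cosine distance, and group the 'Domain' values by cluster label. The helper names (`cosine_distance`, `dbscan`, `cluster_domains`), the plain word-count vectors, and the values `eps=0.6` and `min_samples=1` are assumptions chosen for the sample rows, not something given in the question; real data will need its own tuning. In practice, scikit-learn's `DBSCAN` with `metric="cosine"` on TF-IDF vectors is probably the more common route.

```python
import csv
import io
import math
from collections import Counter

# Sample rows copied from the question.
SAMPLE_CSV = """Rank,Domain,Contents
1,abc.com,hello random text out
2,xyz.com,hello random somethingelse
3,not.com,a b c d
4,plus.com,a b asdsadsa asdsadasdsadsa
5,minus.com,man win
"""


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Plain O(n^2) DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    # Every point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if j == i or cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


def cluster_domains(csv_file, eps=0.6, min_samples=1):
    """Group the Domain column by DBSCAN clusters of the Contents column."""
    reader = csv.DictReader(csv_file)
    reader.fieldnames = [name.strip() for name in reader.fieldnames]  # tolerate padded headers
    rows = list(reader)
    vectors = [Counter((row["Contents"] or "").lower().split()) for row in rows]
    labels = dbscan(vectors, eps, min_samples)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row["Domain"])
    return clusters


if __name__ == "__main__":
    for label, domains in sorted(cluster_domains(io.StringIO(SAMPLE_CSV)).items()):
        print(f"Cluster {label + 1}: {','.join(domains)}")
```

With `min_samples=1` every row ends up in some cluster, so a row with no similar neighbours (minus.com here) forms a cluster of its own. Raising `min_samples` to 2 would label such rows as noise (`-1`) instead. No cluster count is passed in anywhere; it follows from `eps` and the data.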


robinwu20090630 answered: Clustering strings with DBSCAN
