Summary: Looking for a Python DBSCAN implementation that clusters a multi-column CSV file based on its 'Contents' column
Input:
Sample rows from the input CSV file:
Rank,Domain,Contents
1,abc.com,hello random text out
2,xyz.com,hello random somethingelse
3,not.com,a b c d
4,plus.com,a b asdsadsa asdsadasdsadsa
5,minus.com,man win
Where:
Column 1 => Rank = an integer
Column 2 => Domain = a domain name, e.g. abc.com
Column 3 => Contents = a list of words (a string of cleaned-up words extracted from the HTML page)
Output:
The clusters should be based on how similar the contents are:
Cluster 1: abc.com,xyz.com
Cluster 2: not.com,plus.com
Cluster 3: minus.com
....
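Please note: in the output, I am not looking for the words that fall in the same cluster. Instead, I am looking for the 'Domain' column values, clustered by the similarity of column 3, 'Contents'.
I have looked at the following resources, but they are based on k-means and do not give the DBSCAN-style cluster output I am looking for. Note that supplying the number of clusters is not an option here, because we do not want to fix the number of clusters in advance.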
1) How can I cluster text data with multiple columns?
2) Clustering text documents using scikit-learn kmeans in Python
3) http://brandonrose.org/clustering
4) https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
So:
input <= csv file with 'Rank','Domain','Contents'
output <= cluster with domain name [NOT contents]
A Python implementation of DBSCAN clustering would be ideal.
Thanks!
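
For illustration, here is a minimal sketch of the input-to-output mapping described above. It is not a reference implementation: the DBSCAN routine is hand-written with the standard library only, and the Jaccard distance on word sets, `eps=0.7`, `min_samples=1`, and the helper names (`jaccard_distance`, `dbscan`, `cluster_domains`) are all assumptions chosen for the sample rows. A real pipeline would more likely use TF-IDF vectors with scikit-learn's `DBSCAN`.

```python
import csv
import io

# Sample rows copied from the question.
SAMPLE_CSV = """Rank,Domain,Contents
1,abc.com,hello random text out
2,xyz.com,hello random somethingelse
3,not.com,a b c d
4,plus.com,a b asdsadsa asdsadasdsadsa
5,minus.com,man win
"""


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of words (assumed metric)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(points, eps, min_samples, distance):
    """Plain O(n^2) DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


def cluster_domains(csv_text, eps=0.7, min_samples=1):
    """Cluster rows by 'Contents' and return {label: [domain, ...]}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    word_sets = [set(row["Contents"].split()) for row in rows]
    labels = dbscan(word_sets, eps, min_samples, jaccard_distance)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row["Domain"])
    return clusters


clusters = cluster_domains(SAMPLE_CSV)
for label, domains in sorted(clusters.items()):
    # Labels start at 0; a label of -1 (noise) would print as "Cluster 0".
    print(f"Cluster {label + 1}: {','.join(domains)}")
```

With `min_samples=1` every row is a core point, so a domain with no similar neighbour (minus.com here) forms its own cluster instead of being marked as noise. No cluster count is supplied anywhere; only `eps` controls how similar two word lists must be to be grouped, and it would need tuning on real data.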