How do I find the 15 most important words for spam emails?

I have trained a linear support vector machine (SVM) to classify emails as spam or non-spam based on the words they contain. I first convert an email into processed text with this code:

import re

def processEmail(email):
    email = email.lower()
    #replace HTML tags like <html> with a space
    email = re.sub(r"<[^<>]+>", " ", email)
    #replace numbers with the word "number"
    email = re.sub(r"[0-9]+", "number", email)
    #replace anything that starts with http:// or https:// with "httpaddr"
    email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)
    #replace strings with @ in the middle with "emailaddr" as they are email addresses
    email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)
    #replace $ signs with the word "dollar"
    email = re.sub(r"[$]+", "dollar", email)
    #strip the characters >, , and ?
    email = re.sub(r"[><,?]", "", email)
    print("--------------------------------Pre-processed Email------------------------")
    print(email)
    return email
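
As a quick sanity check, this is a hypothetical call to processEmail (the sample text below is my own illustration, not part of the exercise data):

processEmail("Buy now for $100! Visit http://example.com or mail me@example.com <br>")
#should print something close to: "buy now for dollarnumber! visit httpaddr or mail emailaddr"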

I get the vocabulary dictionary of common words (the bag of words) using:

def getVocabDict():
    vocab_txt = open("C:/Users/dynam/Desktop/Coursera AndrewNg/machine-learning-ex6/machine-learning-ex6/ex6/vocab.txt","r")
    vocab_dict = {}
    for line in vocab_txt:
        (key,val) = line.split() #default splitting is using space
        vocab_dict.update({key:val})
    return vocab_dict
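
For illustration, assuming vocab.txt stores one index and one word per line (as the Coursera ex6 file does), the dictionary maps index strings to words:

#if vocab.txt contains lines such as
#   1 aa
#   2 ab
#   3 abil
#then getVocabDict() returns {"1": "aa", "2": "ab", "3": "abil", ...}
vocab_dict = getVocabDict()
print(len(vocab_dict))  #1899 entries in the Coursera vocabulary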

After this, I convert the email into tokens using:
import nltk

def email2Token(Iemail):
    #initialize the Porter stemmer
    stemmer = nltk.stem.porter.PorterStemmer()
    email = processEmail(Iemail)
    #split the email into individual words on whitespace and punctuation
    tokens = re.split(r"[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%\n]", email)
    print("------------------------Email after splitting into individual words/tokens------------------")
    print(tokens)
    #apply the stemmer to each word
    stemmed_tokens = []
    for token in tokens:
        #use the Porter stemmer to stem the word
        stemmed_token = stemmer.stem(token)
        stemmed_tokens.append(stemmed_token)
        print("---------stemmed token-------------")
        print(stemmed_token)
    return stemmed_tokens
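
As a hypothetical illustration of what the stemming step does (the example words are my own, not taken from the data):

stemmer = nltk.stem.porter.PorterStemmer()
print(stemmer.stem("discounted"))  #should print "discount"
print(stemmer.stem("prices"))      #should print "price"
print(stemmer.stem("running"))     #should print "run"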

Then I convert the email into a feature vector, where each element indicates whether the corresponding word from the vocabulary dictionary I built appears in the email:

import numpy as np

def email2featureVec(Iemail,vocab_dict):
    n = len(vocab_dict)
    emailrec = email2Token(Iemail)
    print("---------The tokens received by the feature vector converter-----------")
    print(emailrec)
    email_feature = np.zeros((n,1))
    #vocab_dict maps index strings to words, so build the reverse lookup word -> index
    word2index = {word: int(idx) for idx, word in vocab_dict.items()}
    for token in emailrec:
        if token in word2index:
            #indices in vocab.txt are 1-based, the feature vector is 0-based
            email_feature[word2index[token] - 1, 0] = 1
    print("--------------------------Email feature vec----------------------------------")
    print(email_feature)
    return email_feature
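
A minimal usage sketch (the email text here is an illustrative assumption):

vocab_dict = getVocabDict()
features = email2featureVec("Get your free discount now at http://deals.example.com", vocab_dict)
print(features.shape)       #(1899, 1) for the Coursera vocabulary
print(int(features.sum()))  #number of distinct vocabulary words found in the email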

Finally, I create a linear SVM model and train it on the training set X and its labels y:

from sklearn import svm

#creating an instance of a linear SVM with C = 0.1
linear_svm = svm.SVC(C = 0.1, kernel = "linear")
#fitting the SVM to our X matrix given labels y
linear_svm.fit(X, y.flatten())
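
For context, X and y are not built by the code above. In the Coursera ex6 exercise they come from spamTrain.mat; a minimal sketch of loading them that way (assuming that file is available and stores the variables under the names X and y):

import scipy.io

data = scipy.io.loadmat("spamTrain.mat")  #assumption: training features and labels stored under keys "X" and "y"
X = data["X"]  #one row per training email, one column per vocabulary word
y = data["y"]  #1 = spam, 0 = not spam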

Now I would like to know how to get the 15 most important words for classifying an email as spam. I suspect I have to use the coefficients to find them, but my coefficients look like this:

for i in linear_svm.coef_:
    for j in i:
        print(j)

0.007932077307221794
0.015633235616866917
0.055464916277558125
-0.013416103446075411
-0.06619756700850743
0.03659516600411697
0.18337597875664702
-0.02488628335729145 and so on ........

I tried using:

sorted_arr = np.sort(linear_svm.coef_,axis = None)[::-1]
for i in sorted_arr:
    print(vocab_dict[(i)])

But this raises an error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-32-9027571acfa4> in <module>()
      1 sorted_arr = np.sort(linear_svm.coef_,axis = None)[::-1]
      2 for i in sorted_arr:
----> 3     print(vocab_dict[(i)])

KeyError: 0.5006137361746403