What ML steps should I follow to implement a genetic Bayesian SMS spam filter?

2024-05-04 • Q&A

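I was given a school assignment to research Bayesian genetic SMS spam filters, and I am trying to build one as an ML pipeline: a genetic algorithm selects the features, and a Bayesian algorithm is then trained on them. My question is which steps I need to take to get this right. PS: I would also like to know whether the steps I have followed so far are the best approach.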

Expected end result: a classification accuracy of 90% when Bayes is trained on the extracted features.
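To make the "train Bayes on the extracted features" step concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes restricted to a chosen vocabulary, followed by an accuracy check. The names `train_nb`, `predict_nb`, the toy messages, and the `selected` word set are all made up for illustration; in a real pipeline `sklearn.naive_bayes.MultinomialNB` fitted on a `CountVectorizer` matrix would likely play this role.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, vocab):
    """Fit a multinomial Naive Bayes model restricted to the words in `vocab`."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(w for w in text.split() if w in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Laplace (add-one) smoothing so an unseen word does not zero out a class.
        log_like = {w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
                    for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, text):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in text.split() if w in log_like)
    return max(model, key=score)

# Toy, made-up messages (already lowercased); not taken from the real dataset.
train_docs = ["free prize call now", "win free cash now",
              "see you at lunch", "call me after class"]
train_labels = ["spam", "spam", "ham", "ham"]
# Hypothetical output of the feature-selection step.
selected = {"free", "prize", "win", "cash", "lunch", "class", "call"}

model = train_nb(train_docs, train_labels, selected)
test_docs = ["free cash prize", "lunch after class"]
test_labels = ["spam", "ham"]
predictions = [predict_nb(model, d) for d in test_docs]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

The 90% target only means something when accuracy is measured on messages the model never saw during feature selection or training, so a held-out test split should be set aside before either step.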

I started by loading the dataset and importing all the modules I need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split

from tpot import TPOTClassifier
from tpot import TPOTRegressor

#to ignore warnings when running
import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import CountVectorizer,HashingVectorizer


dfData = pd.read_csv('datasets/disambiguate_spam_sms.csv',encoding="latin-1")
print(dfData.shape)
dfData.head()

After that first step, I grouped the messages by SMS category, in this case spam/ham.

#to show a chart of a particular category in the dataset

dfData.groupby('label').message.count().plot.bar(ylim=0)
plt.show()
print(4825 / (4825 + 747))  # baseline accuracy: share of the majority class (~0.87)
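The baseline printed above is meant to be the accuracy of a classifier that always predicts the majority class, which is the majority count divided by the total, not the ratio of the two counts. A small sketch that computes it from the labels instead of hard-coded numbers; `majority_baseline` is a hypothetical helper, and reading 4825 as the ham count and 747 as the spam count is an assumption based on the figures in the question.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())

# Class counts taken from the question; which class is which is assumed.
labels = ["ham"] * 4825 + ["spam"] * 747
print(round(majority_baseline(labels), 3))  # 0.866
```

Since the function accepts any iterable of labels, it should also work on the `label` column directly, for example `majority_baseline(dfData['label'])`. With a baseline of about 0.87, a 90% accuracy target is only a small improvement, so precision and recall on the spam class are worth reporting too.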

Then I removed the null values.

# remove null value from the dataset

dfDrop = dfData.dropna(subset=['label','message'])
dfDrop.head()

Then I shuffled my dataset.

# randomly shuffle the data before starting,to avoid any type of ordering in the data

dfShuffle = dfDrop.iloc[np.random.permutation(len(dfDrop))]
dfSort = dfShuffle.reset_index(drop=True)
dfSort.head()

Then I preprocessed the dataset.

#  Pre-processing the Raw Text and Getting It Ready for Machine Learning

stemmer = PorterStemmer()
words = stopwords.words("english")

# lowercase before the stopword check, otherwise capitalized stopwords such as "The" slip through
dfSort['processedtext'] = dfSort['message'].apply(
    lambda x: " ".join(stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).lower().split()
                       if i not in words))
print(dfSort.shape)

After all of this, I really do not know what the next stage should be.
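One possible next stage, as a sketch rather than a definitive answer: turn the processed text into word-count features, then let a genetic algorithm search over boolean masks that say which features to keep. The code below shows only the genetic search, with tournament selection, single-point crossover, bit-flip mutation, and elitism. The function name `genetic_feature_selection`, its parameters, and the toy fitness function are all assumptions made for illustration, and it expects at least two features.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Search for a boolean feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Pick the fitter of two randomly chosen individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask so far always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

# Toy fitness for demonstration only: reward hypothetical "useful" features,
# penalize the rest. It stands in for a real classifier score.
USEFUL = {0, 3, 4}

def toy_fitness(mask):
    hits = sum(1 for i, keep in enumerate(mask) if keep and i in USEFUL)
    noise = sum(1 for i, keep in enumerate(mask) if keep and i not in USEFUL)
    return hits - 0.5 * noise

best_mask = genetic_feature_selection(10, toy_fitness)
print([i for i, keep in enumerate(best_mask) if keep])
```

In the real pipeline, `fitness(mask)` would presumably be the mean cross-validated accuracy of a Naive Bayes model trained on only the selected columns of the count matrix, for example `cross_val_score(MultinomialNB(), X_train[:, mask], y_train).mean()` with scikit-learn. Computing it on the training split alone keeps the test set untouched for the final accuracy check, and caching scores per mask helps because each evaluation retrains the classifier.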


lovers_php answered: What ML steps should I follow to implement a genetic Bayesian SMS spam filter?
