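I'm implementing my first neural network: an LSTM for binary sentiment classification. I preprocessed the data by lowercasing, tokenizing, and removing most punctuation (keeping only . , '). I'm also using GloVe's 100d pre-trained embeddings.
The problem: no matter what I do, the accuracy is terrible and does not change from epoch to epoch (nor when I change the LSTM architecture).
I've tried changing the optimizer and its learning rate, adding more units to the LSTM, and changing the number of epochs and the batch size.
Nothing seems to have any effect.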
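For context, the preprocessing described above might look roughly like the sketch below. It is not the `pre_processing()` function from the post (which is not shown); `simple_preprocess` is a hypothetical helper written only to illustrate lowercasing, keeping `.`, `,` and `'`, and tokenizing on whitespace.

```python
import re

# Hypothetical helper for illustration only; the post's own pre_processing()
# is not shown and may behave differently.
_DROP = re.compile(r"[^a-z0-9\s.,']")  # everything except letters, digits, whitespace and . , '

def simple_preprocess(text):
    text = text.lower()                      # lowercase
    text = _DROP.sub(" ", text)              # drop all other punctuation
    text = re.sub(r"([.,])", r" \1 ", text)  # make . and , separate tokens
    return text.split()                      # whitespace tokenization

print(simple_preprocess("I LOVE this!!! Don't you, too?"))
# ['i', 'love', 'this', "don't", 'you', ',', 'too']
```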
# Imports assume standalone Keras 2.x (suggested by the format of the training log)
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

# pre_processing, datasplit, embeddings and tweets are defined elsewhere in the project.
# The default values are an assumption so that the three-argument call at the bottom runs.
def setLSTM(data, stopRem=False, stemm=False, lemma=False, negHand=False):
    # Pre-process the data
    data = pre_processing(data, negHand)
    print(data[1])
    # Split the data
    X_train, X_test, y_train, y_test = datasplit(data)
    # Map words to integer indexes (only the 5,000 most frequent words are kept)
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(X_train)
    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)
    # Get the vocabulary as (word, index) pairs
    vocab = tokenizer.word_index.items()
    print(vocab)
    vocab_size = len(tokenizer.word_index) + 1
    # maxlen is the length of the longest tweet (shorter ones are padded up to it)
    maxlen = max(len(seq) for seq in X_train + X_test)
    print("Maxlen is: ", maxlen)
    # Pad the sequences so that all tweets have the same length
    X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
    X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
    # Create the embedding matrix filled with zeros (some of the vocabulary
    # might not exist in the embeddings) and copy in the vectors we have
    embedding_matrix = zeros((vocab_size, 100))
    # word_index maps word -> index, so items() yields (word, index) pairs
    for word, idx in vocab:
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[idx] = embedding_vector
    # Create the model with its layers (embedding layer, LSTM layer, dense layer)
    model = Sequential()
    # The embedding layer has trainable=False because we're using pre-trained embeddings
    embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix],
                                input_length=maxlen, trainable=False)
    model.add(embedding_layer)
    model.add(Dropout(0.2))
    # Add an LSTM layer with 100 units
    model.add(LSTM(units=100))
    model.add(Dropout(0.2))
    # Add a dense layer with sigmoid activation
    model.add(Dense(1, activation='sigmoid'))
    # opt = Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999, amsgrad=False)
    # Compile the model (binary_crossentropy because this is a binary classification problem)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    print(model.summary())
    history = model.fit(X_train, y_train, batch_size=64, epochs=5, verbose=1,
                        validation_split=0.2)
    score = model.evaluate(X_test, y_test, verbose=1)
    print("Test Score:", score[0])
    print("Test accuracy:", score[1])

setLSTM(tweets, False, False)
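One sanity check that fits this code: Keras' `Tokenizer.word_index` is a dict that maps each word to its integer index, so `.items()` yields `(word, index)` pairs, and the embedding lookup has to be done with the word. The sketch below uses made-up toy data (a plain dict and lists instead of the real tokenizer and GloVe vectors, and a hypothetical `build_matrix` helper) to count how many rows of the embedding matrix actually get filled.

```python
# Toy stand-ins, assumptions for illustration only (not data from the post):
word_index = {"good": 1, "bad": 2, "meh": 3}            # same shape as tokenizer.word_index
embeddings = {"good": [0.1, 0.2], "bad": [-0.1, -0.2]}  # word -> vector, like a loaded GloVe dict

def build_matrix(word_index, embeddings, dim):
    """Return (matrix, hits): one row per index, all zeros where no vector exists."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    hits = 0
    for word, idx in word_index.items():  # items() yields (word, index)
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
            hits += 1
    return matrix, hits

matrix, hits = build_matrix(word_index, embeddings, dim=2)
print(f"{hits}/{len(word_index)} words found in the embeddings")
# 2/3 words found in the embeddings
```

If the same count on the real data came out as zero, every row of the matrix would stay zero and, with `trainable=False`, the embedding layer would output the same all-zero vectors for every tweet.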
Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_9 (Embedding)      (None, 13, 100)           1916600
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0
_________________________________________________________________
lstm_9 (LSTM)                (None, 100)               80400
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 101
=================================================================
Total params: 1,997,101
Trainable params: 80,501
Non-trainable params: 1,916,600
_________________________________________________________________
None
Train on 10852 samples, validate on 2713 samples
Epoch 1/5
10852/10852 [==============================] - 5s 448us/step - loss: 0.6920 - acc: 0.5275 - val_loss: 0.6916 - val_acc: 0.5404
Epoch 2/5
10852/10852 [==============================] - 4s 360us/step - loss: 0.6917 - acc: 0.5286 - val_loss: 0.6908 - val_acc: 0.5404
Epoch 3/5
10852/10852 [==============================] - 4s 365us/step - loss: 0.6920 - acc: 0.5286 - val_loss: 0.6907 - val_acc: 0.5404
Epoch 4/5
10852/10852 [==============================] - 4s 382us/step - loss: 0.6916 - acc: 0.5286 - val_loss: 0.6903 - val_acc: 0.5404
Epoch 5/5
10852/10852 [==============================] - 4s 383us/step - loss: 0.6916 - acc: 0.5264 - val_loss: 0.6906 - val_acc: 0.5404
4522/4522 [==============================] - 1s 150us/step
Test Score: 0.6925433831950933
Test accuracy: 0.5176913142204285
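For reading the log above: the loss stays around 0.692, which is very close to ln 2 ≈ 0.6931, the binary cross-entropy of a model that predicts a probability of 0.5 for every sample. A quick check of that arithmetic (the `binary_crossentropy` function here is a hand-written single-sample version for illustration, not the Keras implementation):

```python
import math

def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one sample with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A model that always outputs 0.5 gets ln(2), whatever the label is
print(round(binary_crossentropy(1, 0.5), 4))  # 0.6931
print(round(binary_crossentropy(0, 0.5), 4))  # 0.6931
```

A validation accuracy that is identical in every epoch (0.5404) is also consistent with the network giving essentially the same prediction for every input, although the log alone does not prove that.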