I am preparing some code to classify text (multi-label) using an LSTM with GloVe embeddings. Part of the preparation is building the embedding matrix. I am following a reference from Kaggle, but I get an error in the embedding step and I can't tell where the problem is.
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. num rows in embedding matrix)
maxlen = 200 # max number of words in a comment to use
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train,maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test,maxlen=maxlen)
def get_coefs(word,*arr): return word,np.asarray(arr,dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))
all_embs = np.stack(embeddings_index.values())
emb_mean,emb_std = all_embs.mean(),all_embs.std()
emb_mean,emb_std
word_index = tokenizer.word_index
nb_words = min(max_features,len(word_index))
embedding_matrix = np.random.normal(emb_mean,emb_std,(nb_words,embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
This results in the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-14-9dffbd6ff093> in <module>
26 embedding_vector = embeddings_index.get(word)
27
---> 28 if embedding_vector is not None: embedding_matrix[i] = embedding_vector
IndexError: index 3595 is out of bounds for axis 0 with size 3595
I have tried changing the embed_size, max_features and maxlen parameters, since these have a direct influence on the code, but I don't think I have found the real problem. I'm not expecting someone to solve this for me, but maybe you can point me to where I need to look, or explain what is going on.
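For what it's worth, here is a minimal sketch (toy data, pure NumPy, no Keras) that reproduces the same kind of IndexError. It stands in for my code above: Keras's `Tokenizer.word_index` assigns indices starting at 1, so the value `len(word_index)` itself appears as an index, while a matrix with `nb_words = min(max_features, len(word_index))` rows only has valid rows `0 .. nb_words - 1`:

```python
import numpy as np

# Toy stand-in for tokenizer.word_index: Keras assigns indices starting at 1,
# so with 3 words the indices are 1, 2, 3 -- len(word_index) itself shows up.
word_index = {"the": 1, "cat": 2, "sat": 3}

max_features = 20000
nb_words = min(max_features, len(word_index))   # -> 3, as in my code above
embedding_matrix = np.zeros((nb_words, 5))      # valid rows: 0, 1, 2

raised = False
try:
    for word, i in word_index.items():
        if i >= max_features: continue          # never triggers: 3 < 20000
        embedding_matrix[i] = 1.0               # i == 3 is out of bounds for size 3
except IndexError as err:
    raised = True
    print(err)  # index 3 is out of bounds for axis 0 with size 3

print(raised)  # True
```

The guard `i >= max_features` never fires here because `max_features` is larger than the vocabulary, which looks like the same situation as my error (`index 3595 ... size 3595`), but I'm not sure whether that is the whole story.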
Thanks!