I am implementing the Skipgram model in both PyTorch and TensorFlow 2, and I have a question about how subsampling of frequent words is implemented. Verbatim from the paper, the probability of subsampling word wi is computed as

`P(wi) = 1 - sqrt(t / f(wi))`

where t is a custom threshold (usually a small value such as 0.0001) and f is the word's frequency of occurrence in the document. Although the authors implemented it in a different, almost equivalent way, let's stick with this definition.

When computing P(wi), we can end up with negative values. For example, suppose we have 100 words and one of them occurs far more often than the others (as is the case in my dataset).
import numpy as np
import seaborn as sns

np.random.seed(12345)

# generate counts in [1, 20)
counts = np.random.randint(low=1, high=20, size=99)
# add one word with an extremely large count
counts = np.insert(counts, 0, 100_000)
# compute relative frequencies
f = counts / counts.sum()
# define the threshold as in the paper
t = 0.0001
# compute the subsampling "probabilities" as in the paper
probs = 1 - np.sqrt(t / f)
# plot the distribution of the resulting values
sns.histplot(probs);
Q: What is the correct way to implement subsampling using these "probabilities"?
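For reference, here is a minimal sketch of how I am currently applying it (this is my own reading, not the paper's code): I treat P(wi) as the probability of discarding a token and clamp negative values to zero, so that rare words are always kept.

import numpy as np

rng = np.random.default_rng(12345)

def subsample(tokens, f, t=1e-4):
    # Drop each token with probability max(0, 1 - sqrt(t / f[token])).
    # `tokens` is a list of word ids and `f` maps a word id to its
    # relative frequency; both are placeholders for illustration.
    kept = []
    for w in tokens:
        p_drop = max(0.0, 1.0 - np.sqrt(t / f[w]))
        if rng.random() >= p_drop:
            kept.append(w)
    return kept

With a very frequent word (e.g. f = 0.9) this discards almost every occurrence, while words with f <= t are never discarded.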
As additional information, I have seen that a different approach is taken in the Keras function keras.preprocessing.sequence.make_sampling_table:
def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    Used for generating the `sampling_table` argument for `skipgrams`.
    `sampling_table[i]` is the probability of sampling
    the i-th most common word in a dataset
    (more common words should be sampled less frequently, for balance).

    The sampling probabilities are generated according
    to the sampling distribution used in word2vec:

    ```
    p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
        (word_frequency / sampling_factor)))
    ```

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):

    `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`

    where `gamma` is the Euler-Mascheroni constant.

    # Arguments
        size: Int, number of possible words to sample.
        sampling_factor: The sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq
    return np.minimum(1., f / np.sqrt(f))
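For completeness, here is a minimal sketch of how I understand this table is meant to be used together with `skipgrams` (the toy sequence and vocabulary size below are mine, purely for illustration):

import numpy as np
from tensorflow.keras.preprocessing.sequence import make_sampling_table, skipgrams

vocab_size = 100
# sampling_table[i] is the probability of keeping a word of rank i
sampling_table = make_sampling_table(vocab_size)

# toy sequence of word indices; the indices are assumed to be assigned
# by frequency rank (1 = most frequent), since the table is indexed by rank
sequence = [1, 2, 1, 5, 1, 20, 3, 1]
pairs, labels = skipgrams(sequence,
                          vocabulary_size=vocab_size,
                          window_size=2,
                          sampling_table=sampling_table)

As far as I can tell, this amounts to keeping a word with probability min(1, sqrt(sampling_factor / frequency)), i.e. the complement of the paper's formula clipped into [0, 1], which never goes negative.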