How to build a PPMI matrix from a text corpus?

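I am trying to embed words from the Brown corpus using an SVD model. To do this, I want to first build a word-word co-occurrence matrix and then convert it into a PPMI matrix, which is the input to the SVD matrix factorization.

I tried creating the co-occurrence matrix with scikit-learn's CountVectorizer: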

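from sklearn.feature_extraction.text import CountVectorizer

count_model = CountVectorizer(ngram_range=(1,1))  # unigrams only

X = count_model.fit_transform(corpus)  # sparse document-term count matrix
X[X > 0] = 1  # binarize: did the term occur in the document at all?
Xc = (X.T * X)  # term-term matrix: number of documents containing both terms
Xc.setdiag(0)  # drop each term's co-occurrence with itself
print(Xc.todense())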

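But:

(1) I am not sure how to control the context window with this approach. I would like to try various context sizes and see how they affect the result.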

(2) Assuming PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), how do I then compute the PPMI properly?

Any help with the thought process and the implementation would be greatly appreciated!

Thanks (-:

a1300236594 answered: How to build a PPMI matrix from a text corpus?

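I tried the code you provided, but I could not apply a moving window to it, so I wrote my own function. It takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix: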

from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a proper tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the next window_size tokens to the right of the current one
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                # sort the pair so (a, b) and (b, a) share one key
                key = tuple(sorted([t, token]))
                d[key] += 1

    # turn the dictionary into a symmetric dataframe
    vocab = sorted(vocab)  # sort vocab
    # int64 rather than int16, so large corpus counts cannot overflow
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int64),
                      index=vocab, columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

Let's try it on the following two simple sentences:

>>> text = ["I go to school every day by bus .","i go to theatre every night by bus"]
>>> 
>>> df = co_occurrence(text,2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
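
To get a feel for how the window size changes what is counted, here is a minimal sketch using only the standard library. The helper name window_pairs is made up for this illustration; it repeats the same right-hand windowing as co_occurrence above, without the DataFrame step, and reports how many co-occurring pairs the first sentence yields for window sizes 1, 2 and 3.

```python
from collections import Counter

def window_pairs(tokens, window_size):
    """Count unordered token pairs at most window_size positions apart."""
    pairs = Counter()
    for i, token in enumerate(tokens):
        # Look only to the right; sorting the pair makes the count symmetric.
        for neighbour in tokens[i + 1 : i + 1 + window_size]:
            pairs[tuple(sorted((token, neighbour)))] += 1
    return pairs

tokens = "i go to school every day by bus .".split()

# Total number of counted pairs for each window size.
pair_totals = {w: sum(window_pairs(tokens, w).values()) for w in (1, 2, 3)}
print(pair_totals)  # {1: 8, 2: 15, 3: 21}
```

A larger window adds more distant pairs, so the matrix gets denser and the counts shift from mostly syntactic neighbours toward looser topical association.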

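Now we have the co-occurrence matrix. Let's compute the (positive) pointwise mutual information, or PPMI. I used the code provided by Professor Christopher Potts of Stanford in his slides, which can be summarized in the following image: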

[Image: pmi]
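
The image itself is not reproduced here. Judging only from the pmi function given in this answer, the computation appears to be the following, where X is the co-occurrence matrix. The symbols N, r_i and c_j are notation introduced just for this sketch, and setting the score to 0 when X_ij = 0 is the convention the code uses.

```latex
N = \sum_{i,j} X_{ij}, \qquad
r_i = \sum_{j} X_{ij}, \qquad
c_j = \sum_{i} X_{ij}

\mathrm{expected}_{ij} = \frac{r_i \, c_j}{N}

\mathrm{PMI}_{ij}
  = \log \frac{X_{ij}}{\mathrm{expected}_{ij}}
  = \log \frac{X_{ij}/N}{(r_i/N)\,(c_j/N)}
  = \log \frac{p(i,j)}{p(i)\,p(j)}

\mathrm{PPMI}_{ij} = \max\bigl(\mathrm{PMI}_{ij},\, 0\bigr),
\qquad \text{with } \mathrm{PMI}_{ij} := 0 \text{ when } X_{ij} = 0
```

With p(i, j) = X_ij / N, p(i) = r_i / N and p(j) = c_j / N, this is the same quantity as the PMI(a, b) in the question.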

PPMI is what the pmi function below returns when positive=True.

Let's try it:

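import numpy as np

def pmi(df, positive=True):
    """Return the PMI (or PPMI when positive=True) of a co-occurrence DataFrame."""
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    # Expected count of each cell if rows and columns were independent
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected  # observed / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # convention: treat log(0) = -inf as 0
    if positive:
        df[df < 0] = 0.0  # clip negative PMI values to get PPMI
    return df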
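
To connect this back to the SVD step in the question, here is a minimal NumPy-only sketch. The names ppmi_matrix, svd_embeddings and toy_counts are made up for illustration, and the toy counts are invented rather than taken from the sentences above. ppmi_matrix follows the same arithmetic as pmi with positive=True, and scaling the left singular vectors by the singular values is just one common way of turning the factorization into word vectors.

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI of a co-occurrence count matrix (log of observed / expected, clipped at 0)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)
    col_totals = counts.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / total
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = np.log(counts / expected)
    scores[~np.isfinite(scores)] = 0.0  # log(0) and 0/0 cells become 0
    return np.maximum(scores, 0.0)

def svd_embeddings(matrix, k):
    """Rank-k word vectors: first k left singular vectors scaled by their singular values."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Hypothetical toy counts for three words (symmetric, zero diagonal).
toy_counts = np.array([[0, 2, 1],
                       [2, 0, 0],
                       [1, 0, 0]])

toy_ppmi = ppmi_matrix(toy_counts)
toy_vectors = svd_embeddings(toy_ppmi, k=2)  # one 2-dimensional vector per word
print(toy_ppmi)
print(toy_vectors.shape)
```

With the DataFrame from co_occurrence, the same call would be svd_embeddings(pmi(df).to_numpy(), k). For a vocabulary the size of the Brown corpus, a dense matrix gets large, so a sparse matrix and a truncated SVD routine are probably a better fit.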