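I tried the provided code, but I could not apply a moving window to it, so I wrote my own function. It takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix: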
from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the next window_size tokens to the right of position i
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                # sort so that (a, b) and (b, a) share one key
                key = tuple(sorted([t, token]))
                d[key] += 1
    # formulate the dictionary into a dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        # fill both halves so the matrix is symmetric
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
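To make the sliding window concrete, here is a minimal sketch of just the pairing step on a single tokenized sentence. The helper name window_pairs is hypothetical and not part of the function above; it only mirrors the slice text[i+1 : i+1+window_size].

```python
def window_pairs(tokens, window_size):
    """Yield (token, neighbour) for each neighbour within window_size to the right."""
    for i, token in enumerate(tokens):
        # Same slice as in co_occurrence: the next window_size tokens after position i.
        for neighbour in tokens[i + 1 : i + 1 + window_size]:
            yield token, neighbour


tokens = "i go to school".split()
pairs = list(window_pairs(tokens, 2))
# pairs == [("i", "go"), ("i", "to"), ("go", "to"), ("go", "school"), ("to", "school")]
```

Because each token is paired only with the tokens to its right, every pair inside a window is counted once; the sorted key and the two df.at assignments then make the resulting matrix symmetric.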
Let's try it on the following two simple sentences:
>>> text = ["I go to school every day by bus .","i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text,2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
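Now we have the co-occurrence matrix. Let's compute the (positive) pointwise mutual information, or PPMI. I used the code provided in the slides by Professor Christopher Potts of Stanford, which summarize the computation in a figure.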
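For reference, the standard PMI definition for a count matrix can be written as follows. The symbols X (the co-occurrence matrix) and the indices i, j are my own notation, chosen to line up with the row totals, column totals, and grand total used in the pmi code in this answer; this is a sketch of the usual definition, not a reproduction of the slide.

```latex
\[
\mathrm{expected}_{ij} = \frac{\bigl(\sum_{k} X_{ik}\bigr)\,\bigl(\sum_{k} X_{kj}\bigr)}{\sum_{k,l} X_{kl}},
\qquad
\mathrm{PMI}_{ij} = \log \frac{X_{ij}}{\mathrm{expected}_{ij}}
\]
\[
\mathrm{PPMI}_{ij} = \max\bigl(0,\ \mathrm{PMI}_{ij}\bigr)
\]
```

When X_{ij} = 0 the logarithm is undefined, and the code treats that cell as 0.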
PPMI is the same as the following pmi function with positive=True:
import numpy as np

def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    # counts expected if rows and columns were independent
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # treat log(0) = -inf as 0
    if positive:
        df[df < 0] = 0.0  # PPMI: clip negative PMI values to 0
    return df
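Let's try it out. The sketch below applies the same computation to a small, made-up count matrix (the words a, b, c and their counts are hypothetical, not taken from the sentences above), so the effect of positive=True can be checked by hand. The function is restated only so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def pmi(df, positive=True):
    # Same computation as the pmi function above, restated for a standalone run.
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    ratio = df / expected
    with np.errstate(divide="ignore"):
        result = np.log(ratio)
    result[np.isinf(result)] = 0.0  # cells with zero count become 0
    if positive:
        result[result < 0] = 0.0
    return result


# Hypothetical symmetric toy counts; row totals are a=5, b=7, c=4, grand total 16.
words = ["a", "b", "c"]
counts = pd.DataFrame([[0, 4, 1], [4, 0, 3], [1, 3, 0]], index=words, columns=words)

raw = pmi(counts, positive=False)   # plain PMI, may be negative
ppmi = pmi(counts, positive=True)   # negatives clipped to 0

# (a, c): expected = 5 * 4 / 16 = 1.25, observed 1, so PMI = log(0.8) < 0 and PPMI = 0.
# (a, b): expected = 5 * 7 / 16 = 2.1875, observed 4, so PMI = log(64 / 35) > 0.
print(raw.round(3))
print(ppmi.round(3))
```

Pairs that co-occur less often than independence would predict, such as (a, c) here, get a negative PMI and are set to 0 in the PPMI matrix.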