Information retrieval: reading a document collection

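In my code I read a file that contains many documents, go through each document, pick out the important words, and print them. I can parse the file one document at a time, but I get a "KeyError" while reading a document's text. I wrote a function that reads the text and puts the words into a list to build the index, and I can't tell where my code goes wrong when it indexes those words.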

Here is the document collection it is supposed to read from (only the first snippet):

<DOC>
<DOCNO> AP890101-0001 </DOCNO>
<FILEID>AP-NR-01-01-89 2358EST</FILEID>
<FIRST>r a PM-APArts:60sMovies     01-01 1073</FIRST>
<SECOND>PM-AP Arts: 60s Movies,1100</SECOND>
<HEAD>You Don't Need a Weatherman To Know '60s Films Are Here</HEAD>
<HEAD>Eds: Also in Monday AMs report.</HEAD>
<BYLINE>By HILLEL ITALIE</BYLINE>
<BYLINE>Associated Press Writer</BYLINE>
<DATELINE>NEW YORK (AP) </DATELINE>
<TEXT>
   The celluloid torch has been passed to a new
generation: filmmakers who grew up in the 1960s.

       #Part 1
        # 1) Each doc begins w < DOC > and ends w < /DOC >
        # 2) First few lines contain metadata,you should read
        #    the < DOCNO > field and use it as the ID of the doc
        # 3) Doc contents are between the tags < TEXT > and < /TEXT >
        # 4) All other file contents can be ignored.
        # 5) Remove all stop words,punct.,and lowercase all words 
        def inv_idx(self,doc):
            print()
            print("Good job,we are in inv_idx() !")
            print()

            doc_num = 0
            doc_text = 0
            # count = 0
            ps = PorterStemmer()

            with open(doc,'r') as fs:
                for line in fs:
                    if "<DOC>" in line:
                        doc_num += 1 
                    elif doc_num: #1 enters condition
                        #number the doc
                        if "<DOCNO>" in line:
                            print("in DOCNO tag")
                            #split line and return first item,which is the docID 
                            doc_name = line.split()[1] 
                            print(doc_name)

                        elif "<TEXT>" in line:
                            print("in TEXT tag")
                            doc_text += 1

                        #parse doc's text
                        elif doc_text: 
                            if "</TEXT>" in line:
                                print("in /TEXT tag")
                                doc_text = 1
                            #tokenize all words + see if they are in the main list (aka if they exist in the doc and are not stop words)    
                            else: 
                                tokens = word_tokenize(line)
                                for token in tokens:
                                    token = ps.stem(token.lower())

                                    if token not in self.stopWords and token.isalpha():
                                        if token not in self.postings:#only add if word appears in doc
                                            self.postings[token][doc_name]= 1 # Why am I getting this error?
                                            #print("line82") OK SO WE KNOW WE HIT 82 
                                        else: 
                                            if doc_text not in self.postings[token]:
                                                self.postings[token][doc_name] = 1 #first appearance
                                            else: 
                                                self.postings[token][doc_name] += 1 #increment appearance

                                                #print(self.postings[token])

                                        if token not in self.termFreq:               #add_docfreq
                                            # only add if word appears in doc
                                            self.termFreq[token] = [doc_name]        #add_docfreq
                                        else:                                        #add_docfreq
                                            if doc_name not in self.termFreq[token]: #add_docfreq
                                                self.termFreq[token].append(doc_name)#add_docfreq

                                for word in line.split():                            #total_terms 
                                    if word not in self.stopWords and word.isalpha():#total_terms 
                                        count += 1                                   #total_terms 

                        #move on to next document
                        elif "</DOC>" in line:
                            # self.totalTerms[doc_name] = count                       #total_terms 
                            doc_num = 0
                            # count = 0                                               #total_terms 
            print("EOFFFF")

Here is the output/error:

  

    Good job,we are in inv_idx() !
    in DOCNO tag
    AP890101-0001
    in TEXT tag
    Traceback (most recent call last):
      File "ps2.py", line 151, in
        newPS2.inv_idx("data/ap89_collection")
      File "ps2.py", line 74, in inv_idx
        self.postings[token][doc_name] = 1
    KeyError: 'celluloid'
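
A likely cause, assuming self.postings is a plain dict (its definition is not shown): the statement self.postings[token][doc_name] = 1 first looks up self.postings[token], and that branch only runs when token is not in self.postings, so the lookup itself raises KeyError. Creating the inner dict first (self.postings[token] = {doc_name: 1}) or using collections.defaultdict(dict) avoids it. Three other things in the posted code look suspicious: the else branch tests doc_text where doc_name seems intended, doc_text is set to 1 rather than 0 on </TEXT>, and count is incremented although its initialisation is commented out. The sketch below is one possible way to build the postings, not the asker's code: build_postings and STOP_WORDS are hypothetical names, stemming is left out, and a simple split-and-strip stands in for word_tokenize so the example has no dependencies.

```python
import string
from collections import defaultdict

# Hypothetical stand-in for self.stopWords; the real stop list is not shown.
STOP_WORDS = {"the", "has", "been", "to", "a", "who", "up", "in"}


def build_postings(lines, stop_words=STOP_WORDS):
    """Return {term: {doc_id: count}} for records delimited by <DOC> ... </DOC>."""
    # defaultdict(dict) creates the inner dict on first access, so
    # postings[token][doc_id] can never raise KeyError on a new token.
    postings = defaultdict(dict)
    doc_id = None
    in_text = False

    for line in lines:
        if "<DOCNO>" in line:
            # "<DOCNO> AP890101-0001 </DOCNO>" -> second whitespace-separated field
            doc_id = line.split()[1]
        elif "</TEXT>" in line:
            in_text = False
        elif "<TEXT>" in line:
            in_text = True
        elif "</DOC>" in line:
            doc_id = None
        elif in_text and doc_id is not None:
            for raw in line.lower().split():
                token = raw.strip(string.punctuation)
                if token.isalpha() and token not in stop_words:
                    counts = postings[token]
                    # First appearance starts at 1, later ones increment.
                    counts[doc_id] = counts.get(doc_id, 0) + 1
    return postings


sample = """<DOC>
<DOCNO> AP890101-0001 </DOCNO>
<TEXT>
   The celluloid torch has been passed to a new
generation: filmmakers who grew up in the 1960s.
</TEXT>
</DOC>""".splitlines()

index = build_postings(sample)
print(index["celluloid"])
```

To read the real collection, the same function could be called with an open file object, since it only iterates over lines. The per-term document list kept in self.termFreq can be derived from this structure as list(index[term]), so a second dict may not be needed.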

