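Filtering strings efficiently in Java

2024-04-30 • Q&A

I'm currently trying to build something like a small search engine. My goal is to index a bunch of files in a hash map, but first I need to perform a few operations: lowercasing everything, removing all unnecessary words, and removing every character except a-z / A-Z. Right now my implementation looks like this: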

String article = "";

for (File file : dir.listFiles()) { // for each file (001.txt, 002.txt, ...)
    try (Scanner s = new Scanner(file)) {
        while (s.hasNext())
            article += s.next().toLowerCase(Locale.ROOT) + " "; // convert all characters to lower case
        article = article.replaceAll(delimiters.get(), " "); // remove punctuation (?, -, !, * etc.)

        String[] splittedWords = article.split(" "); // split the text into an array of words
        for (int i = 0; i < splittedWords.length; i++) {
            boolean flag = true;
            // compare each word with all the stop words (words like a, the, already, these etc.)
            // taken from another big txt file and drop them, because we don't need to fill our map
            // with unnecessary words; this should give faster search times later on
            try (Scanner stop = new Scanner(stopwords)) {
                while (stop.hasNextLine())
                    if (splittedWords[i].equals(stop.nextLine())) {
                        flag = false;
                        break;
                    }
            }
            // if the current word does not match any stop word, put it in the hash map
            if (flag) map.put(splittedWords[i], file.getName());
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    System.out.println(file);
}

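This is just one block of my code and parts may be missing; I've roughly explained my algorithm in the comments. I tried using the .contains method to check whether stopWords contains currentWord. It is faster, but it doesn't map words like "death", because "death" contains "at" from the stop word list. I've done my best to make this more efficient, but I haven't made much progress. Each file contains roughly 300 words and takes about 3 seconds to index, which is not ideal considering I have ten thousand files. Any ideas on how to improve my algorithm so it runs faster?

There are some improvements:

First, please don't use the new Scanner(File) constructor, because it does not wrap the file in a buffered stream, so the file is read in small chunks. Small disk reads (especially on an HDD) are very inefficient. Instead, use for example a BufferedInputStream with a 64 KiB (65536 byte) buffer: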

try (Scanner s = new Scanner(new BufferedInputStream(new FileInputStream(f),65536))) {
    // your code
}

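Second: your PC most likely has a multi-core CPU, so you can scan several files in parallel. To do that, you have to make sure you use a map that supports multi-threading. Change the definition of the map to: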

Map<String,String> map = new ConcurrentHashMap<>();

Then you can use the following code:

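// needs java.nio.file.Files, java.nio.file.Path and java.util.stream.Stream
try (Stream<Path> files = Files.list(dir.toPath())) { // close the directory stream when done
    files.parallel().forEach(f -> {
        try (Scanner s = new Scanner(new BufferedInputStream(Files.newInputStream(f), 65536))) {
            // your code
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
} catch (IOException e) { // Files.list itself can fail
    e.printStackTrace();
}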

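Depending on the number of CPU cores in your system, this will process several files at the same time. Especially if you process a large number of files, this will greatly reduce the running time of the program.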
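One thing to watch when you combine this with the original loop: put on a Map<String,String> replaces any earlier value for the same key, so each word ends up pointing to a single file. Here is a minimal sketch of the parallel approach, assuming you want every file name that contains a word. The class name ParallelIndexer, the Map<String, Set<String>> layout and the UTF-8 encoding are my assumptions, not part of the question.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

// Hypothetical helper for illustration only.
final class ParallelIndexer {

    /** Maps each lower-cased, whitespace-separated token to the names of the files containing it. */
    static Map<String, Set<String>> index(Path dir) throws IOException {
        Map<String, Set<String>> index = new ConcurrentHashMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.parallel().filter(Files::isRegularFile).forEach(f -> {
                String name = f.getFileName().toString();
                // Assumption: the files are UTF-8 text.
                try (Scanner s = new Scanner(
                        new BufferedInputStream(Files.newInputStream(f), 65536), "UTF-8")) {
                    while (s.hasNext()) {
                        String word = s.next().toLowerCase(Locale.ROOT);
                        // computeIfAbsent is atomic on ConcurrentHashMap, and newKeySet()
                        // gives a thread-safe set, so parallel workers can add safely.
                        index.computeIfAbsent(word, k -> ConcurrentHashMap.newKeySet()).add(name);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return index;
    }
}
```

You would call it as ParallelIndexer.index(dir.toPath()). Punctuation and stop word handling are left out here to keep the sketch focused on the concurrency part.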

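Finally, your implementation is rather complicated. You use the Scanner's output to build a new String and then split that String again. Instead, it is better to configure the Scanner to use the delimiters you need directly: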

try (Scanner s = new Scanner(....).useDelimiter("[,\\!\\-\\.\\?\\*]")) {

Then you can use the tokens produced by the Scanner directly, without having to build the article string and split it afterwards.
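As a rough sketch of that idea, and assuming the goal stated in the question (keep only a-z / A-Z), the delimiter can be written as "every run of characters that is not a letter", which also covers whitespace. The pattern [^A-Za-z]+ and the class name WordTokenizer are my own choices for illustration, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper for illustration only.
final class WordTokenizer {

    /** Returns the lower-cased letter-only tokens of the source, in reading order. */
    static List<String> tokens(Readable source) {
        List<String> words = new ArrayList<>();
        // Any run of non-letters (spaces, digits, punctuation) separates two tokens,
        // so no intermediate String has to be built and split again.
        try (Scanner s = new Scanner(source).useDelimiter("[^A-Za-z]+")) {
            while (s.hasNext()) {
                words.add(s.next().toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }
}
```

Because the method takes a Readable, it works the same way with a StringReader for quick experiments and with a BufferedReader over a file.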

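Why implement a search engine yourself?

For production I would recommend an existing solution, Apache Lucene, which fits your task exactly.

If you are just practicing, there are a few standard points that would improve your code.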

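Avoid string concatenation in a loop like this:

article +=

It is better to create a word regexp and pass it to the Scanner:

Pattern p = Pattern.compile("[A-Za-z]+");
try (Scanner s = new Scanner(file)) {
    while (s.hasNext(p)) {
        String word = s.next(p);
        word = word.toLowerCase(Locale.ROOT);
        ...
    }
}

Put all the stop words into a hash-based collection and check each newly encountered word with a single lookup (for example containsKey on a HashMap, or contains on a HashSet).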
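The stop word lookup could look roughly like this. It is only a sketch: the class name StopWordFilter is made up for illustration, and I use a HashSet rather than a map because only membership matters here. Unlike String.contains, Set.contains compares whole elements, so "death" is kept even though "at" is a stop word.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only.
final class StopWordFilter {
    private final Set<String> stopWords = new HashSet<>();

    /** Loads the stop words once; in the question they would come from the stop word file. */
    StopWordFilter(Iterable<String> words) {
        for (String w : words) {
            stopWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the lower-cased input words that are not stop words, in their original order. */
    List<String> filter(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            String lower = w.toLowerCase(Locale.ROOT);
            // Whole-word hash lookup instead of re-reading the stop word file for each word.
            if (!stopWords.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }
}
```

The set is built once before the file loop, so each word costs one hash lookup on average instead of a scan over the whole stop word file.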
