NodeJS RTF ANSI: find and replace words with special characters

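I have a find-and-replace script that works fine when the words contain no special characters. However, the words often do contain special characters, because the script looks for names. As it stands, this breaks the script.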

The script looks for {<some-text>} and tries to replace the contents (as well as remove the braces).

Example:

text.rtf

Here's a name with special char {Kotouč}

script.ts

import * as fs from "fs";

// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf","utf8");
console.log("content::\n",content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {

    // It correctly identifies the targeted text.
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    // Here I need a way to escape `plainText` string so that it matches the source.
    console.log("currMatch::",currMatch);
    console.log("currMatch === plainText::",currMatch === plainText);
    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch,"IT_WORKS!");
        console.log("newContent:",newContent);
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}

currMatch:: {Kotou\uc0\u269 \}

currMatch === plainText:: false

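It looks like ANSI escaping. I tried jsesc, but it generates a different string, {Kotou\u010D}, rather than what the document produces, {Kotou\uc0\u269 \}.

How can I dynamically escape the plainText string variable so that it matches what is in the document?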

lw9776535 answered: NodeJS RTF ANSI: find and replace words with special characters

What I needed was a deeper understanding of the RTF format and of text encoding in general.

The raw RTF text read from the file gives us some hints:

{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...

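This part of the RTF file's metadata tells us a few things.

It uses RTF file format version 1. The encoding is ANSI, specifically code page 1252, also known as Windows-1252 or CP-1252, which is:

...a single-byte character encoding of the Latin alphabet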

source

The valuable information here is that we know it uses the Latin alphabet, which comes into play later.
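
To make the single-byte point concrete: RTF writes non-ASCII code-page bytes as \'hh hex escapes, and the sample output further down contains one (Here\'92s). The sketch below decodes such escapes with a small hand-written table. The names CP1252_HIGH and decodeHexEscapes are made up for illustration, and the table is deliberately partial: it covers only a few of the 0x80-0x9F bytes, the range where Windows-1252 differs from Latin-1.

```typescript
// Partial Windows-1252 table for bytes 0x80-0x9F (illustrative, not complete).
const CP1252_HIGH: { [byte: number]: string } = {
    0x80: "\u20AC", // euro sign
    0x85: "\u2026", // horizontal ellipsis
    0x91: "\u2018", // left single quotation mark
    0x92: "\u2019", // right single quotation mark
    0x93: "\u201C", // left double quotation mark
    0x94: "\u201D", // right double quotation mark
};

// Replace every \'hh escape with the character that byte stands for.
const decodeHexEscapes = (rtf: string): string =>
    rtf.replace(/\\'([0-9a-fA-F]{2})/g, (_match: string, hex: string): string => {
        const byte: number = parseInt(hex, 16);
        // Outside 0x80-0x9F, Windows-1252 agrees with Latin-1, so the byte
        // value is also the Unicode code point. Bytes in 0x80-0x9F that are
        // missing from the partial table above also take this fallback,
        // which is NOT correct for them.
        return CP1252_HIGH[byte] || String.fromCharCode(byte);
    });

console.log(decodeHexEscapes("Here\\'92s a name")); // Here’s a name
```

A complete implementation would need the full 0x80-0x9F mapping, or a decoder for the code page named in \ansicpg.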

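Knowing the specific RTF version in use, I stumbled upon the RTF 1.5 Spec.

A quick search of that spec for one of the escape sequences I was seeing showed that \uc0 is an RTF-specific control sequence (it sets the number of fallback characters that follow each \u keyword, here zero). Knowing that, I could isolate the part I actually cared about, \u269. Now I knew it was Unicode, and I had a strong hunch that \u269 stood for Unicode character code 269. So I looked it up...

\u269 (character code 269) shows up on this page to confirm. Now I knew the character set and what I needed to do to get the equivalent plain (unescaped) text, and there was a basic SO post I used here to get the function started.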
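
As a quick check of that hunch: the number after \u in RTF is a decimal character code, whereas JavaScript string escapes are hexadecimal, so jsesc's \u010D and RTF's \u269 name the same character. A minimal sketch (the variable names are only for illustration):

```typescript
// RTF writes the character code in decimal: \u269
const rtfDecimalCode: number = 269;
const fromRtf: string = String.fromCharCode(rtfDecimalCode);

// JavaScript escapes use hexadecimal: "\u010D"
const fromJsEscape: string = "\u010D";

console.log(fromRtf);                     // č
console.log(fromRtf === fromJsEscape);    // true
console.log(rtfDecimalCode.toString(16)); // 10d
```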

With all of this knowledge, I was able to piece it together from there. Here is the complete corrected script, followed by its output:

script.ts

import * as fs from "fs";


// RTF unicode control sequences: http://www.biblioscape.com/rtf15_spec.htm
// Extracts the decimal character code from `\uc0\uN` or from a bare `\uN`.
const matchEscapedChars: RegExp = /\\uc0\\u(\d{2,6})|\\u(\d{2,6})/g;

/**
 * Util function to strip junk characters (whitespace and backslashes)
 * from a string for comparison.
 * @param {string} str
 * @returns {string}
 */
const cleanupRtfStr = (str: string): string => {
    return str
        .replace(/\s/g,"")
        .replace(/\\/g,"");
};

/**
 * Detects escaped unicode and looks up the character by that code.
 * @param {string} str
 * @returns {string}
 */
const unescapeString = (str: string): string => {
    return str.replace(
        matchEscapedChars,
        (_match: string, afterUc: string | undefined, bare: string | undefined): string => {
            // Exactly one capture group is set, depending on which alternative
            // matched. Reading the group (rather than stripping `\uc0\u` from
            // the whole match) also handles a bare `\uN` correctly.
            const charCode: number = Number(afterUc || bare);

            // See unicode character codes here:
            //  https://unicodelookup.com/#latin/11
            return String.fromCharCode(charCode);
        }
    );
};

// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf","binary");
console.log("content::\n",content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch,"IT_WORKS!");
        console.log("\n\nnewContent:",newContent);
        break;
    }

    const unescapedMatch: string = unescapeString(currMatch);
    const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
    if (cleanedMatch === plainText) {
        const newContent: string = content.replace(currMatch,"IT_WORKS_UNESCAPED!");
        console.log("\n\nnewContent:",newContent);
        break;
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}


newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}
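
Note that the output above still has a backslash in front of IT_WORKS_UNESCAPED!, because the regex match starts at the { and not at the \ that escapes it. One way around that is to go in the other direction and escape the plain text into the form the file uses, then search for that. The sketch below is only an illustration under stated assumptions: escapeForRtf, needle and line are hypothetical names, and the output format (\uc0 before every non-ASCII character and one space after the number) is copied from this one sample file; other RTF writers may emit something different.

```typescript
// Hypothetical helper: escape plain text the way this sample file was written.
const escapeForRtf = (plain: string): string => {
    let out: string = "";
    for (let i: number = 0; i < plain.length; i++) {
        const ch: string = plain[i];
        const code: number = plain.charCodeAt(i);
        if (ch === "\\" || ch === "{" || ch === "}") {
            // Characters with special meaning in RTF get a leading backslash.
            out += "\\" + ch;
        } else if (code < 128) {
            out += ch;
        } else {
            // \uN takes a signed 16-bit decimal, so values above 32767 wrap negative.
            const signed: number = code > 32767 ? code - 65536 : code;
            // Assumption: "\uc0" before and one space after, as in the sample file.
            out += "\\uc0\\u" + signed + " ";
        }
    }
    return out;
};

const needle: string = escapeForRtf("{Kotouč}");
console.log(needle); // \{Kotou\uc0\u269 \}

// Last line of the sample document body, shortened.
const line: string = "char \\{Kotou\\uc0\\u269 \\}}";
console.log(line.split(needle).join("IT_WORKS!")); // char IT_WORKS!}
```

Using split/join instead of a regex avoids having to escape the backslashes and braces in the needle a second time for the regex engine.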

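I hope this helps others who are unfamiliar with character encoding and escaping, and that it proves useful when working with RTF documents!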

Article link: https://www.f2er.com/3045074.html
