NodeJS RTF ANSI: find and replace words with special characters

2024-05-17 • Q&A

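I have a find-and-replace script that works fine when the words contain no special characters. Often, though, they do contain special characters, because the script looks for names. Right now that breaks the script.

The script looks for {<some-text>} and tries to replace the contents (as well as remove the braces).

Example: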

text.rtf

Here's a name with special char {Kotouč}

script.ts

import * as fs from "fs";

// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf","utf8");
console.log("content::\n",content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {

    // It correctly identifies the targeted text.
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    // Here I need a way to escape `plainText` string so that it matches the source.
    console.log("currMatch::",currMatch);
    console.log("currMatch === plainText::",currMatch === plainText);
    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch,"IT_WORKS!");
        console.log("newContent:",newContent);
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}

currMatch:: {Kotou\uc0\u269 \}

currMatch === plainText:: false

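It looks like an ANSI escape. I tried jsesc, but it produces a different string, {Kotou\u010D}, rather than the {Kotou\uc0\u269 \} that the document contains.

How can I dynamically escape the plainText string variable so that it matches what is in the document?

What I needed was a deeper understanding of the RTF format and of text encoding in general.

The raw RTF text read from the file gives us some hints: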

{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...

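This part of the RTF file's metadata tells us a few things.

It uses RTF file format version 1. The encoding is ANSI, specifically code page 1252 (\ansicpg1252), also known as Windows-1252 or CP-1252, which is:

...a single-byte character encoding of the Latin alphabet

(source)

The valuable piece of information is that the file uses the Latin alphabet, which becomes useful later.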
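As a small illustration of reading those header hints programmatically, here is a minimal sketch. The helper name readRtfEncoding and the RtfEncodingInfo shape are made up for this example and are not part of the original script; it only looks at the character-set keyword and \ansicpgN and assumes the header is well formed.

```typescript
// Hypothetical helper (not from the original script): pull the declared
// character set and ANSI code page out of an RTF header.
interface RtfEncodingInfo {
    charset: string | null;  // "ansi", "mac", "pc" or "pca" when present
    codePage: number | null; // e.g. 1252 when \ansicpg1252 is present
}

const readRtfEncoding = (rtf: string): RtfEncodingInfo => {
    // The lookahead stops `\ansi` from matching the start of `\ansicpg1252`.
    const charsetMatch = /\\(ansi|mac|pca|pc)(?![a-z])/.exec(rtf);
    const codePageMatch = /\\ansicpg(\d+)/.exec(rtf);
    return {
        charset: charsetMatch ? charsetMatch[1] : null,
        codePage: codePageMatch ? Number(codePageMatch[1]) : null,
    };
};

const rtfHeader = "{\\rtf1\\ansi\\ansicpg1252\\cocoartf1671\\cocoasubrtf600";
console.log(readRtfEncoding(rtfHeader)); // { charset: 'ansi', codePage: 1252 }
```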

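Knowing which RTF version was in use, I came across the RTF 1.5 Spec.

A quick search of that spec for one of the escape sequences I was looking at showed that \uc0 is an RTF-specific escape control sequence. Knowing that, I could work out what \u269 really meant. I now knew it was Unicode, and I had a strong hunch that \u269 stood for Unicode character code 269. So I looked it up...

\u269 (character code 269) shows up on this page to confirm. Now I knew the character set and what I needed to do to get the equivalent plain (unescaped) text, and there was a basic SO post I used here to get the function started.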
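To make the relationship concrete, here is a minimal sketch of decoding an RTF \uN escape. The function names are hypothetical. It relies on two points from the RTF spec: N is a signed 16-bit decimal number (values above 32767 are written as negatives), and \ucN states how many fallback characters follow each \uN. The sketch only covers the \uc0 case seen in this file, so it does not skip fallback characters. It also shows why jsesc's \u010D looked different: that is the same code point written in hexadecimal, JavaScript style.

```typescript
// Hypothetical helper: turn the decimal value of an RTF \uN escape into a character.
const decodeRtfUnicodeValue = (n: number): string => {
    // RTF writes values above 32767 as negative numbers, so shift those back up.
    const codeUnit = n < 0 ? n + 65536 : n;
    return String.fromCharCode(codeUnit);
};

// Hypothetical helper: replace each `\uN` escape, with an optional preceding `\ucN`
// and one optional trailing delimiter space, by the character it stands for.
// Assumes no fallback characters follow the escape (the \uc0 case).
const decodeRtfUnicodeEscapes = (rtfText: string): string =>
    rtfText.replace(/(?:\\uc\d+)?\\u(-?\d+) ?/g, (_match: string, digits: string) =>
        decodeRtfUnicodeValue(Number(digits)));

console.log(decodeRtfUnicodeValue(269));                   // č
console.log(decodeRtfUnicodeEscapes("Kotou\\uc0\\u269 ")); // Kotouč
console.log(0x010d === 269);                               // true
```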

With all of that knowledge I was able to piece it together from there. Here is the full corrected script and its output:

script.ts

import * as fs from "fs";


// Matches the RTF unicode escape `\uN`, with or without a preceding `\uc0`,
// and captures the decimal character code N.
// RTF 1.5 spec: http://www.biblioscape.com/rtf15_spec.htm
const matchEscapedChars: RegExp = /(?:\\uc0)?\\u(\d{2,6})/g;

/**
 * Util function to strip junk characters (whitespace and backslashes)
 * from a string for comparison.
 * @param {string} str
 * @returns {string}
 */
const cleanupRtfStr = (str: string): string => {
    return str
        .replace(/\s/g,"")
        .replace(/\\/g,"");
};

/**
 * Detects escaped unicode and looks up the character by that code.
 * @param {string} str
 * @returns {string}
 */
const unescapeString = (str: string): string => {
    // Use the captured digits directly, so a bare `\u269` (no `\uc0` prefix)
    // is decoded as well instead of producing NaN.
    return str.replace(matchEscapedChars,(_match: string, digits: string) => {
        // See unicode character codes here:
        //  https://unicodelookup.com/#latin/11
        return String.fromCharCode(Number(digits));
    });
};

// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf","binary");
console.log("content::\n",content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch,"IT_WORKS!");
        console.log("\n\nnewContent:",newContent);
        break;
    }

    const unescapedMatch: string = unescapeString(currMatch);
    const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
    if (cleanedMatch === plainText) {
        const newContent: string = content.replace(currMatch,"IT_WORKS_UNESCAPED!");
        console.log("\n\nnewContent:",newContent);
        break;
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}


newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}

Hope this helps others who are unfamiliar with character encoding/escaping and how it is used in RTF-formatted documents!
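The original question asked for the opposite direction: escaping plainText so that it matches the file. A minimal sketch of that idea follows, as an alternative rather than part of the script above. The function name escapeForRtf is hypothetical, and the output format (\uc0\uN followed by a space, braces and backslashes prefixed with a backslash) is an assumption copied from what this particular file contains. Other RTF writers may emit different forms, and this file itself stores the apostrophe as the code page hex escape \'92, which this sketch does not produce.

```typescript
// Hypothetical helper: escape plain text the way the sample file escapes it.
const escapeForRtf = (plain: string): string => {
    let out = "";
    for (let i = 0; i < plain.length; i++) {
        const ch = plain[i];
        const code = plain.charCodeAt(i); // UTF-16 code unit
        if (ch === "\\" || ch === "{" || ch === "}") {
            // Literal braces and backslashes are escaped with a backslash.
            out += "\\" + ch;
        } else if (code > 127) {
            // RTF stores \uN as a signed 16-bit decimal number.
            const signed = code > 32767 ? code - 65536 : code;
            out += `\\uc0\\u${signed} `;
        } else {
            out += ch;
        }
    }
    return out;
};

// Last line of the sample document, as read from disk.
const rtfLine = "\\f0\\fs24 \\cf0 Here\\'92s a name with special char \\{Kotou\\uc0\\u269 \\}}";
const needle = escapeForRtf("{Kotouč}");
console.log(needle); // \{Kotou\uc0\u269 \}
console.log(rtfLine.replace(needle, "IT_WORKS!"));
```

Because the search string includes the escaped braces, the replacement also removes the backslash in front of the opening brace, which the regex-based match in the script above leaves behind.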


lw9776535 answered: NodeJS RTF ANSI: find and replace words with special characters
