I am parsing a .pdf using the acrobat.tlb library.
Hyphenated strings are being split onto separate lines, with the hyphens removed.
For example,
ABC-123-XXX-987
is parsed as:
ABC
123
XXX
987
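If I parse the text with iTextSharp, it returns the whole string exactly as it appears in the file, which is the behavior I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not put the highlight in the correct place... hence acrobat.tlb.
I am using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf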
- ' filey = "*your full file name including directory here*"
- AcroExchApp = CreateObject("AcroExch.App")
- AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
- ' Open the [strfiley] pdf file
- AcroExchAVDoc.Open(filey,"")
- ' Get the PDDoc associated with the open AVDoc
- AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
- sustext = "accessorizes"
- suktext = "accessorises"
- ' get JavaScript Object
- ' note: jso is tied to the PDDoc of the PDF
- jso = AcroExchPDDoc.GetJSObject
- ' count
- nCount = 0
- nCount1 = 0
- gbStop = False
- bUSCnt = False
- bUKCnt = False
- ' search for the text
- If Not jso Is Nothing Then
- ' total number of pages
- nPages = jso.numpages
- ' Go through pages
- For i = 0 To nPages - 1
- ' check each word in a page
- nWords = jso.getPageNumWords(i)
- For j = 0 To nWords - 1
- ' get a word
- word = Trim(CStr(jso.getPageNthWord(i,j)))
- 'If VarType(word) = VariantType.String Then
- If word <> "" Then
- ' compare the word with what the user wants
- If Trim(sustext) <> "" Then
- result = StrComp(word,sustext,vbTextCompare)
- ' if same
- If result = 0 Then
- nCount = nCount + 1
- If bUSCnt = False Then
- iUSCnt = iUSCnt + 1
- bUSCnt = True
- End If
- End If
- End If
- If Trim(suktext) <> "" Then
- result1 = StrComp(word,suktext,vbTextCompare)
- ' if same
- If result1 = 0 Then
- nCount1 = nCount1 + 1
- If bUKCnt = False Then
- iUKCnt = iUKCnt + 1
- bUKCnt = True
- End If
- End If
- End If
- End If
- Next j
- Next i
- jso = Nothing
- End If
The code does the job of highlighting text, but the For loop that fills the 'word' variable splits a hyphenated string into its component parts.
- For i = 0 To nPages - 1
- ' check each word in a page
- nWords = jso.getPageNumWords(i)
- For j = 0 To nWords - 1
- ' get a word
- word = Trim(CStr(jso.getPageNthWord(i,j)))
Does anyone know how to keep the whole string intact using acrobat.tlb? My fairly extensive searching has come up empty.
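One possible workaround, offered only as a sketch: the Acrobat JavaScript method `getPageNthWord` takes an optional third argument, `bStrip`, which defaults to true and strips punctuation and whitespace from the returned word. Passing false, as in `jso.getPageNthWord(i, j, False)`, should leave the punctuation on each fragment. The code below assumes the hyphen then stays attached to the end of each fragment (for example "ABC-"), which should be checked against a real file. `JoinHyphenated` is a hypothetical helper, not part of acrobat.tlb; it works on plain strings so it can be tried without Acrobat installed.

```vbnet
Imports System
Imports System.Collections.Generic

Module HyphenJoinSketch
    ' Hypothetical helper: merges consecutive tokens while the token just
    ' appended ends with "-". Assumes the tokens came from
    ' getPageNthWord(page, index, False), i.e. with punctuation kept.
    Function JoinHyphenated(rawWords As IEnumerable(Of String)) As List(Of String)
        Dim joined As New List(Of String)
        Dim current As String = ""
        For Each raw As String In rawWords
            If raw Is Nothing Then Continue For
            Dim token As String = raw.Trim()
            If token = "" Then Continue For
            current &= token
            ' A token that does not end in "-" closes the current word.
            If Not token.EndsWith("-") Then
                joined.Add(current)
                current = ""
            End If
        Next
        ' Keep a dangling fragment (input ended on a hyphen), minus the hyphen.
        If current <> "" Then joined.Add(current.TrimEnd("-"c))
        Return joined
    End Function

    Sub Main()
        ' Sample tokens standing in for what the page loop would collect.
        Dim sample = New String() {"Serial ", "ABC-", "123-", "XXX-", "987 ", "shipped"}
        For Each w In JoinHyphenated(sample)
            Console.WriteLine(w)
        Next
    End Sub
End Module
```

With the sample tokens this prints Serial, ABC-123-XXX-987 and shipped on separate lines. Other trailing punctuation (a full stop after the last fragment, say) would also be kept when `bStrip` is false, so it may need trimming before the comparison. To highlight a rejoined string, the loop would also have to remember which word indexes it was built from.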
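I can understand that highlighting text with iTextSharp is a hassle, since you have to draw a rectangle yourself and it gets complicated, but the acrobat.tlb solution has its drawbacks too: it is not free, and few people will have it. A better solution for the rest of us is Spire.Pdf, which is free and easy to use. You can get it as a NuGet package. The code does the following: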
- Opens the .pdf
- Reads the text of each page
- Finds matches using a regular expression
- Saves them to a list of strings, eliminating duplicates
- For each string in this list, searches the page and highlights the word
Code:
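- ' Needs at the top of the file: Imports Spire.Pdf, Spire.Pdf.General.Find, System.Drawing, System.Linq, System.Text, System.Text.RegularExpressions
- Dim pdf As PdfDocument = New PdfDocument("Path")
- ' groups of three uppercase letters/digits joined by hyphens, e.g. ABC-123-XXX-987
- Dim pattern As String = "[A-Z0-9]{3}(?:-[A-Z0-9]{3})+"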
- Dim matches As MatchCollection
- Dim result As PdfTextFind() = Nothing
- Dim content As New StringBuilder()
- Dim matchList As New List(Of String)
- For Each page As PdfPageBase In pdf.Pages
- 'get text from current page only (clear what earlier pages left in the buffer)
- content.Clear()
- content.Append(page.ExtractText())
- 'find matches
- matches = Regex.Matches(content.ToString,pattern,RegexOptions.None)
- matchList.Clear()
- 'Assign each match to a string list.
- For Each match As Match In matches
- matchList.Add(match.Value)
- Next
- 'Eliminate duplicates.
- matchList = matchList.Distinct.ToList
- 'for each string in list
- For i = 0 To matchList.Count - 1
- 'find all occurrences of matchList(i) string in page and highlight it
- result = page.FindText(matchList(i)).Finds
- For Each find As PdfTextFind In result
- find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
- Next
- Next 'matchList
- Next 'page
- pdf.SaveToFile("New Path")
- pdf.Close()
- pdf.Dispose()
I'm not very good with regular expressions, so you may want to substitute your own pattern. Anyway, that's my approach.
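Since the pattern is the part most likely to need adjusting, here is a small standalone sketch for trying one out on plain text before running it against a PDF. The pattern assumes a serial number is two or more groups of three uppercase letters or digits joined by hyphens, which is only inferred from the ABC-123-XXX-987 example; the sample sentence and the name `SerialPattern` are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SerialPatternSketch
    ' Assumed serial format: 3-character groups of A-Z or 0-9, hyphen-joined,
    ' with at least two groups. \b keeps matches from starting or ending
    ' in the middle of a longer run of letters or digits.
    Const SerialPattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3})+\b"

    Sub Main()
        Dim sampleText As String = "Units ABC-123-XXX-987 and QRS-456 shipped; A,B-12C did not."
        For Each m As Match In Regex.Matches(sampleText, SerialPattern)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

This prints ABC-123-XXX-987 and QRS-456. Two details matter here: inside a character class a comma is a literal comma, not a separator, so `[A-Z,0-9]` would also accept "A,B-12C"; and a pattern with exactly two groups would report ABC-123 and XXX-987 as separate matches, leaving the middle hyphen unhighlighted.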