Extracting complete hyphenated words from a .pdf using acrobat.tlb in VB.NET or C#

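I am parsing a .pdf with the acrobat.tlb library.

Hyphenated words are being split at the hyphens: each part comes back on its own line, and the hyphens are dropped.

For example,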
ABC-123-XXX-987

is parsed as:
ABC
123
XXX
987

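If I parse the text with iTextSharp, it returns the whole string as it appears in the file, which is the behavior I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not put the highlight in the right place... hence acrobat.tlb.

I am using this code, taken from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf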

    ' filey = "*your full file name including directory here*"
    AcroExchApp = CreateObject("AcroExch.App")
    AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
    ' Open the [filey] pdf file
    AcroExchAVDoc.Open(filey, "")

    ' Get the PDDoc associated with the open AVDoc
    AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
    sustext = "accessorizes"
    suktext = "accessorises"
    ' Get the JavaScript object
    ' Note: jso is related to the PDDoc of a PDF
    jso = AcroExchPDDoc.GetJSObject
    ' Counters and flags
    nCount = 0
    nCount1 = 0
    gbStop = False
    bUSCnt = False
    bUKCnt = False
    ' Search for the text
    If Not jso Is Nothing Then
        ' Total number of pages
        nPages = jso.numPages

        ' Go through the pages
        For i = 0 To nPages - 1
            ' Check each word on the page
            nWords = jso.getPageNumWords(i)
            For j = 0 To nWords - 1
                ' Get a word
                word = Trim(CStr(jso.getPageNthWord(i, j)))
                'If VarType(word) = VariantType.String Then
                If word <> "" Then
                    ' Compare the word with what the user wants
                    If Trim(sustext) <> "" Then
                        result = StrComp(word, sustext, vbTextCompare)
                        ' If they are the same
                        If result = 0 Then
                            nCount = nCount + 1
                            If bUSCnt = False Then
                                iUSCnt = iUSCnt + 1
                                bUSCnt = True
                            End If
                        End If
                    End If
                    If suktext <> "" Then
                        result1 = StrComp(word, suktext, vbTextCompare)
                        ' If they are the same
                        If result1 = 0 Then
                            nCount1 = nCount1 + 1
                            If bUKCnt = False Then
                                iUKCnt = iUKCnt + 1
                                bUKCnt = True
                            End If
                        End If
                    End If
                End If
            Next j
        Next i
        jso = Nothing
    End If

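The code does the job of highlighting the text, but the For loop that fills the 'word' variable splits a hyphenated string into its component parts.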

    For i = 0 To nPages - 1
        ' Check each word on the page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' Get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to keep the whole string intact using acrobat.tlb? My fairly extensive searching has come up empty.
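One direction that may be worth trying while staying with acrobat.tlb: the Acrobat JavaScript API documents an optional third argument to getPageNthWord, bStrip, which defaults to true and removes punctuation and whitespace from the returned word. The sketch below assumes that calling jso.getPageNthWord(i, j, False) returns each piece with its trailing hyphen still attached (for example "ABC-", "123-", "XXX-", "987"); that behavior should be checked against your own files. The JoinHyphenated helper is hypothetical and only shows how such raw pieces could be glued back together.

```vb
Imports System
Imports System.Collections.Generic

Module HyphenJoin
    ' Hypothetical helper: joins consecutive raw words while the earlier
    ' piece ends with a hyphen. Assumes the words were read with
    ' bStrip = False so that trailing hyphens are still present.
    Public Function JoinHyphenated(rawWords As IEnumerable(Of String)) As List(Of String)
        Dim joined As New List(Of String)
        Dim pending As String = ""
        For Each raw As String In rawWords
            ' Unstripped words can carry surrounding whitespace
            Dim piece As String = If(raw, "").Trim()
            If piece = "" Then Continue For
            pending &= piece
            ' A trailing hyphen means the next word belongs to this one
            If Not piece.EndsWith("-", StringComparison.Ordinal) Then
                joined.Add(pending)
                pending = ""
            End If
        Next
        ' Flush a last piece that ended with a hyphen
        If pending <> "" Then joined.Add(pending)
        Return joined
    End Function
End Module
```

In the loop above, the raw pieces would be collected with jso.getPageNthWord(i, j, False) instead of being trimmed and compared one by one. Other trailing punctuation (a comma after the serial number, say) is not removed by this sketch and would need separate handling. Highlighting would still have to address the individual word indices, since Acrobat keeps treating the pieces as separate words.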

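I can see why iTextSharp is awkward for highlighting text, since you have to draw a rectangle yourself and it gets complicated, but the acrobat.tlb solution has its drawbacks too: it is not free, so few people will be able to use it. A better solution for the rest of us is Spire.Pdf, which is free and easy to use. You can get it as a NuGet package. The code does the following:
  • Opens the .pdf
  • Reads the text of each page
  • Finds matches using a regular expression
  • Saves them to a list of strings, eliminating duplicates
  • For each string in this list, searches the page and highlights the word

Code: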

    ' At the top of the file:
    Imports System.Collections.Generic
    Imports System.Drawing
    Imports System.Linq
    Imports System.Text.RegularExpressions
    Imports Spire.Pdf
    Imports Spire.Pdf.General.Find

    ' Inside a Sub:
    Dim pdf As PdfDocument = New PdfDocument("Path")
    ' Two groups of three uppercase letters or digits joined by a hyphen
    Dim pattern As String = "([A-Z0-9]{3}[-][A-Z0-9]{3})"
    Dim matches As MatchCollection

    Dim result As PdfTextFind() = Nothing
    Dim matchList As New List(Of String)

    For Each page As PdfPageBase In pdf.Pages
        ' Get the text of the current page only
        Dim content As String = page.ExtractText()

        ' Find matches
        matches = Regex.Matches(content, pattern, RegexOptions.None)

        matchList.Clear()

        ' Assign each match to a string list
        For Each match As Match In matches
            matchList.Add(match.Value)
        Next

        ' Eliminate duplicates
        matchList = matchList.Distinct().ToList()

        ' For each string in the list
        For i = 0 To matchList.Count - 1
            ' Find all occurrences of matchList(i) in the page and highlight them
            result = page.FindText(matchList(i)).Finds

            For Each find As PdfTextFind In result
                find.ApplyHighLight(Color.BlueViolet) ' you can set your color preference
            Next

        Next ' matchList

    Next ' page

    pdf.SaveToFile("New Path")

    pdf.Close()
    pdf.Dispose()

I'm not very good with regular expressions, so you may want to substitute your own pattern. Anyway, that's my approach.
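As one possible starting point for such a pattern, here is a minimal sketch that matches a whole serial number like the ABC-123-XXX-987 example from the question rather than a single pair of groups. It assumes serial numbers are made of groups of exactly three uppercase letters or digits joined by hyphens; the SerialPattern module and FindSerials function are hypothetical names used only for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module SerialPattern
    ' Three uppercase letters or digits, followed by one or more
    ' "-" plus three more. The \b anchors reject longer runs such as ABC-1234.
    Public Const Pattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3})+\b"

    ' Returns the distinct serial numbers found in the given text.
    Public Function FindSerials(text As String) As List(Of String)
        Return Regex.Matches(text, Pattern).
            Cast(Of Match)().
            Select(Function(m) m.Value).
            Distinct().
            ToList()
    End Function
End Module
```

The result of FindSerials(page.ExtractText()) could stand in for the matchList built in the loop above. Adjust the group length or the character class if your serial numbers follow a different scheme.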
