我有一个字节数组,希望找到一些字节的“出现”.
例如00 69 73 6F 6D,非常大的字节数组(> 50/100兆字节)
要么
解决方法
您可以使用Boyer-Moore算法有效地搜索字节数组中的字节序列.
这是我从the Wikipedia entry on Boyer-Moore的Java版本转换而来的C#版本.
- public sealed class BoyerMoore
- {
- readonly byte[] needle;
- readonly int[] charTable;
- readonly int[] offsetTable;
- public BoyerMoore(byte[] needle)
- {
- this.needle = needle;
- this.charTable = makeByteTable(needle);
- this.offsetTable = makeOffsetTable(needle);
- }
- public IEnumerable<int> Search(byte[] haystack)
- {
- if (needle.Length == 0)
- yield break;
- for (int i = needle.Length - 1; i < haystack.Length;)
- {
- int j;
- for (j = needle.Length - 1; needle[j] == haystack[i]; --i,--j)
- {
- if (j != 0)
- continue;
- yield return i;
- i += needle.Length - 1;
- break;
- }
- i += Math.Max(offsetTable[needle.Length - 1 - j],charTable[haystack[i]]);
- }
- }
- static int[] makeByteTable(byte[] needle)
- {
- const int ALPHABET_SIZE = 256;
- int[] table = new int[ALPHABET_SIZE];
- for (int i = 0; i < table.Length; ++i)
- table[i] = needle.Length;
- for (int i = 0; i < needle.Length - 1; ++i)
- table[needle[i]] = needle.Length - 1 - i;
- return table;
- }
- static int[] makeOffsetTable(byte[] needle)
- {
- int[] table = new int[needle.Length];
- int lastPrefixPosition = needle.Length;
- for (int i = needle.Length - 1; i >= 0; --i)
- {
- if (isPrefix(needle,i + 1))
- lastPrefixPosition = i + 1;
- table[needle.Length - 1 - i] = lastPrefixPosition - i + needle.Length - 1;
- }
- for (int i = 0; i < needle.Length - 1; ++i)
- {
- int slen = suffixLength(needle,i);
- table[slen] = needle.Length - 1 - i + slen;
- }
- return table;
- }
- static bool isPrefix(byte[] needle,int p)
- {
- for (int i = p,j = 0; i < needle.Length; ++i,++j)
- if (needle[i] != needle[j])
- return false;
- return true;
- }
- static int suffixLength(byte[] needle,int p)
- {
- int len = 0;
- for (int i = p,j = needle.Length - 1; i >= 0 && needle[i] == needle[j]; --i,--j)
- ++len;
- return len;
- }
- }
这是一些控制台应用测试代码:
- public static void Main()
- {
- byte[] haystack = new byte[10000];
- byte[] needle = { 0x00,0x69,0x73,0x6F,0x6D };
- // Put a few copies of the needle into the haystack.
- for (int i = 1000; i <= 9000; i += 1000)
- Array.Copy(needle,haystack,i,needle.Length);
- var searcher = new BoyerMoore(needle);
- foreach (int index in searcher.Search(haystack))
- Console.WriteLine(index);
- }
注意Search()方法如何返回haystack中针头起始点的所有位置的索引.
如果您只是想要计数,您可以这样做:
- int count = new BoyerMoore(needle).Search(haystack).Count();
对于你的第二个问题:我假设你问的是找到最长的重复字节序列?
这是一个更加复杂 – 而且非常不同 – 的问题.如果你想要一个答案,你应该问一个单独的问题,但你应该阅读the Wikipedia entry on the “longest repeated substring problem”.