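I'm working with a fairly large collection of tweets, and for each tweet I want to get its mentions (other users' names, prefixed with @), but only when the mentioned user also appears in the file: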
- users = new Dictionary()
- for each line in file:
-     username = get_username(line)
-     userid = get_userid(line)
-     # keyed by username, because mentions are looked up by name
-     users.add(key = username, value = userid)
- for each line in file:
-     mentioned_names = get_mentioned_names(line)
-     mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
-     print "$line | $mentioned_ids"
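I'm already processing the file with GAWK, so rather than processing it again in Python or C, I decided to try adding this to my AWK script. However, I can't find a way to make two passes over the same file and run different code on each pass. Most solutions involve calling AWK several times, but then I lose the associative array I built in the first pass.
I could do it in a very hacky way (for example, cat the file twice and pipe each copy through sed to add a different prefix to all of its lines), but I'd like to be able to understand this code in a couple of months without hating myself.
What is the AWK way to do this?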
PS:
The least horrible way I've found:
- function rewind(    i)
- {
-     # from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
-     # shift remaining arguments up
-     for (i = ARGC; i > ARGIND; i--)
-         ARGV[i] = ARGV[i-1]
-     # make sure gawk knows to keep going
-     ARGC++
-     # make current file next to get done
-     ARGV[ARGIND+1] = FILENAME
-     # do it
-     nextfile
- }
- BEGIN {
-     count = 1
- }
- count == 1 {
-     # first pass, fills an associative array
- }
- count == 2 {
-     # second pass, uses the array
- }
- FNR == 30 {
-     # hand-coded length, horrible
-     # could also be automated by calling wc -l and passing the result as a parameter
-     if (count == 1) {
-         count = 2
-         # i is a local variable of rewind(), so no argument is passed
-         rewind()
-     }
- }
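
For comparison, here is a minimal sketch of a variant that would not need the hand-coded length, assuming the input is a regular file that can simply be named twice on the command line. While the first copy is being read, `NR == FNR` is true, so that rule can fill the array; every later record belongs to the second pass. The tab-separated layout (userid, username, text), the sample tweets, and the names `ids` and `found` are assumptions made up for the illustration, not my real data.

```shell
#!/bin/sh
# Hypothetical sample data: userid <TAB> username <TAB> tweet text.
tweets=$(mktemp)
printf '1\talice\thello @bob and @carol\n' >  "$tweets"
printf '2\tbob\thi @alice\n'               >> "$tweets"

# The same file is named twice; NR == FNR holds only while the first copy is read.
out=$(awk -F '\t' '
NR == FNR {                 # first pass: map username -> userid
    ids[$2] = $1
    next
}
{                           # second pass: resolve each @mention in the text
    text = $3
    found = ""
    while (match(text, /@[A-Za-z0-9_]+/)) {
        name = substr(text, RSTART + 1, RLENGTH - 1)    # drop the leading @
        found = found (found == "" ? "" : ",") ((name in ids) ? ids[name] : "null")
        text = substr(text, RSTART + RLENGTH)           # continue after this mention
    }
    print $0 " | " found
}' "$tweets" "$tweets")

printf '%s\n' "$out"
rm -f "$tweets"
```

With the sample data, the first tweet ends in `| 2,null` (bob is in the file, carol is not) and the second in `| 1`. The sketch only uses POSIX awk features, so it does not depend on `ARGIND` or `nextfile`; it would not work if the input came from a pipe, and the `NR == FNR` test misfires if the file is empty.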