awk简介

awk是一种用于处理数据和生成报告（manipulating data and generating reports）的脚本语言，无需编译，可使用变量、数学函数、字符串函数和逻辑操作符。

awk通常用于文本的模式扫描和处理（pattern scanning and processing），可对输入文件的每一行进行搜索（patterns be searched for in each line of a document），筛选出符合条件的字段，起到过滤文本的作用。

名字来源于三个开发者的姓的首字母：Alfred Aho, Peter Weinberger, and Brian Kernighan.

AWK Operations:
- (a) Scans a file line by line
- (b) Splits each input line into fields
- (c) Compares input line/fields to pattern
- (d) Performs action(s) on matched lines
Useful For:
- (a) Transform data files
- (b) Produce formatted reports
Programming Constructs:
- (a) Format output lines
- (b) Arithmetic and string operations
- (c) Conditionals and loops

awk基本用法

awk [options] -f progfile input-file
awk [options] 'program' input-file
program内容通常是：
1. 'selection _criteria {action}'
2. 'BEGIN {initialization var}{main code for each line}END {finalization code}'
3. BEGIN和END语句只执行一次，非必需
4. main code或者action会对输入文件的每一行进行扫描，如果没有写，则默认为打印整行内容
5. 如果只写了'selection _criteria'但没写action，则将满足条件的整行打印出来

awk常用选项

-f program-file
--file program-file
              Read the AWK program source from the file program-file, 
              instead of from the first command line argument.  
              Multiple -f (or --file) options may be used.

-F fs
--field-separator fs
              Use fs for the input field separator (the value of the FS predefined variable).

-v var=val
--assign var=val
              Assign the value val to the variable var, before execution of the program begins.  
              Such variable values are available to the BEGIN block of an AWK program.

awk常用内建变量（Built-in Variables）

FS          The input field separator, a space by default. （输入内容字段之间的分隔符）
OFS         The output field separator, a space by default.（输出内容的字段分隔符）
NF          The number of fields in the current input record.（通常一列表示一个字段）
NR          The total number of input records seen so far.（通常一行表示一个记录）

awk的PATTERNS和ACTIONS

AWK is a line-oriented language.
The pattern comes first, and then the action（action写在{}里面）.
Either the pattern may be missing, or the action may be missing, but, of course, not both.
If the pattern is missing, the action is executed for every single record of input.
A missing action is equivalent to { print } which prints the entire record.

AWK patterns may be one of the following:

  BEGIN
  END
  BEGINFILE
  ENDFILE
  /regular expression/
  relational expression
  pattern && pattern
  pattern || pattern
  pattern ? pattern : pattern
  (pattern)
  ! pattern
  pattern1, pattern2

常用的Actions

Action statements consist of the usual assignment（赋值语句）, conditional（条件语句）, and looping（循环语句） statements found in most languages. The operators（运算符）, control statements（控制语句）, and input/output（输入输出） statements available are patterned after those in C（学过C语言就会发现全是熟悉的感觉）.

运算符Operators：

   (...)       Grouping
   $           Field reference.
   ++ --       Increment and decrement, both prefix and postfix.
   ^           Exponentiation (** may also be used, and **= for the assignment operator).
   + - !       Unary plus, unary minus, and logical negation.
   * / %       Multiplication, division, and modulus.
   + -         Addition and subtraction.
   space       String concatenation.
   |   |&      Piped I/O for getline, print, and printf.
   < > <= >= != ==
               The regular relational operators.
   ~ !~        Regular expression match, negated match.  NOTE: Do not use a constant regular expression (/foo/) on the left-hand side of a ~ or !~.  Only use one on the right-hand side.   The  expression /foo/ ~ exp has the same meaning as (($0 ~ /foo/) ~ exp).  This is usually not what was intended.
   in          Array membership.
   &&          Logical AND.
   ||          Logical OR.
   ?:          The  C  conditional expression.  This has the form expr1 ? expr2 : expr3.  If expr1 is true, the value of the expression is expr2, otherwise it is expr3.  Only one of expr2 and expr3 is
               evaluated.
   = += -= *= /= %= ^=
               Assignment.  Both absolute assignment (var = value) and operator-assignment (the other forms) are supported.

常见的输入输出语句：

print                 Print the current record（只写print则打印整行）. 
print expr-list       Print expressions（可罗列print哪些字段）. 
print expr-list >file Print expressions on file（可直接写到文件去）. 
printf fmt, expr-list Format and print（指定打印的格式，同C语言，详情可通过man awk查看）. 
printf fmt, expr-list >file Format and print on file（可直接写到文件去）.

常见的数学函数：

atan2(y, x)   Return the arctangent of y/x in radians.
cos(expr)     Return the cosine of expr, which is in radians.
exp(expr)     The exponential function.
int(expr)     Truncate to integer.
log(expr)     The natural logarithm function.
rand()        Return a random number N, between 0 and 1, such that 0 ≤ N < 1.
sin(expr)     Return the sine of expr, which is in radians.
sqrt(expr)    The square root function.
srand([expr]) Use expr as the new seed for the random number generator.

常见的字符串函数：

gsub(r, s [, t])        字符串替换
length([s])             字符串长度
split(s, a [, r [, seps] ]) 字符串分割放于数组

awk应用实践

已知文件hg38_refGene.csv如下：

#bin,name2,name,chrom,strand,txStart,txEnd,exonCount
1278,HDDC3,NM_001286451,chr15,-,90930917,90932569,4
1918,CHRNA1,NM_000079,chr2,-,174747594,174764472,9
85,MPDZ,NM_001261407,chr9,-,13105703,13279564,45
178,AMD1,NM_001287216,chr6,+,110874783,110895712,5
1161,POLK,NM_016218,chr5,+,75511831,75599821,15
965,CRELD2,NM_001135101,chr22,+,49918629,49927540,11
1017,ZNF835,NM_001005850,chr19,-,56662651,56671755,2
1150,LIMK1,NM_001204426,chr7,+,74093155,74122525,15
866,ZNF568,NM_001204836,chr19,+,36916328,36952743,6

# 以逗号为分隔符，提取第2和第8个字段
# 写法：BEGIN{变量初始化}{action}
awk 'BEGIN {FS = ","; OFS = "\t"} {print $2, $8}' hg38_refGene.csv | head
# 也可以写成：选项 '{action}'形式
awk -F "," -v OFS="\t" '{print $2, $8} ' hg38_refGene.csv | head
# 上述输出形式为print expr-list这种写法

# 自定义变量
awk 'BEGIN {version = "hg38"; FS = ","; OFS = "\t"} {print version, $2, $8}' hg38_refGene.csv | head
# 也可以写成：
awk -v version="hg38" 'BEGIN{FS=",";OFS="\t"}{print version,$2,$8} ' hg38_refGene.csv | head
awk -v version="hg38" -v FS="," -v OFS="\t" '{print version,$2,$8} ' hg38_refGene.csv | head

# 变量赋值和运算
awk 'BEGIN {FS = ","; OFS = "\t"; sum = 0; count = 0;} {count++; sum+=$8}END{print "average exoncount is: " sum/count}' hg38_refGene.csv
####### 输出结果如下：
####### average exoncount is: 9.17675

# 文本过滤
# 可以在{action}中用if语句来设置过滤条件
awk 'BEGIN{FS=",";OFS="\t"} {if($4=="chrY" && $6>6000000 && $6<7000000) print $2,$8} ' hg38_refGene.csv | head
awk 'BEGIN{FS=",";OFS="\t"} {if($2=="TP53") print $2,$8} ' hg38_refGene.csv | head
awk 'BEGIN{FS=",";OFS="\t"} {if($2~"^MIR") print $2,$8} ' hg38_refGene.csv | head
# 也可以写成：'selection _criteria {action}' 
# 数字支持科学计数法
awk -F "," -v OFS="\t" '$4=="chrY" && $6>6e6 && $6<7e6{print $2,$8} ' hg38_refGene.csv | head
awk -F "," -v OFS="\t" '$2~/TP53/{print $2,$8} ' hg38_refGene.csv | head
awk -F "," -v OFS="\t" '$2~/TP/ && $2!~/^TP/{print $2,$8} ' hg38_refGene.csv | head