golang 语言机制之内存剖析

《Golang 语言机制之逃逸分析》中介绍了编译器逃逸分析的基础知识，除了共享变量这种情况之外，还有其他情况也会导致变量逃逸。

示例代码

示例代码是读取byte数组，找到 elvis 字符串并替换成 Elvis。示例代码在 https://play.golang.org/p/n_SzF4Cer4，benchmark 在 https://play.golang.org/p/TnXrxJVfLV。示例代码中有两个函数实现这个功能，这篇文章只关注其中的 algOne 函数。

以下是入参和期望 algOne 函数输出的结果。

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis

程序 1 是 algOne 函数代码。

// 程序 1
 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
103             // Flush the reset of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

我们通过 benchmark 来了解 algOne 函数的性能表现和对堆的压力。

Benchmarking

程序2是 algOne 函数的 benchmark 代码。

// 程序 2
func BenchmarkAlgorithmOne(b *testing.B) {
    var output bytes.Buffer
    in := assembleInputStream()
    find := []byte("elvis")
    repl := []byte("Elvis")

    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        output.Reset()
        algOne(in, find, repl, &output)
    }
}

我们通过 go test -run none -bench AlgorithmOne -benchtime 3s -benchmem 运行 benchmark，输出如下：

$go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8   2000000     2522 ns/op    117 B/op     2 allocs/op

我们可以看到 algOne 函数有两次内存分配，每次分配了 117 个字节。我们通过 profiling 数据来看一下哪行代码造成了内存分配。

Profiling

在 go test 后加上 memprofile 参数，重新执行测试。

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    2000000    2570 ns/op    117 B/op         2 allocs/op

benchmark 执行结束后，会生成两个新文件。

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go

源码在 memcpu 目录中，algOne 函数在 stream.go 文件中，benchmark 测试在 stream_test.go 文件中。mem.out 和 memcpu.test 是新生成的文件，内容是profile数据，有了这俩文件，就可以使用 pprof 工具进行分析了

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _

这里我们使用 -alloc_space 来分析内存分配。我们使用 list algOne 命令查看 algOne 函数的内存分配。

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _

分析信息展示了 input 变量和 buf slice在堆上创建。

input 变量是指针，即 input 指向的 bytes.Buffer 值分配在堆上。flat列（pprof输出的第一列）展示由于 algOne 函数共享导致变量逃逸，而 algOne 函数在 benchmark 中调用，我们用 list Benchmark 函数看一下 Benchmark 的内存分配。

(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _

cum列（pprof信息的第二列）只有一个值，所以 Benchmark 函数没有直接进行内存分配，内存分配全部发生在for循环中。pprof 工具只能分析出有变量逃逸，我们需要通过 go build gcflags "-m -m" 命令来看一下 bytes.Buffer 的值为什么会发生逃逸。

编译器报告（Compiler Reporting）

$ go build -gcflags "-m -m"

go build 输出信息很多，由于是在第83行创建的 bytes.buffer 值，所以我们只关注 stream.go:83。

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

从第一行信息可以看出来，bytes.Buffer 值一开始并没逃逸，bytes.NewBuffer 函数返回 bytes.Buffer 值的地址，接下来的5行信息展示了第93行导致 bytes.Buffer 值从algOne 栈逃逸，input 变量被赋值给一个接口变量。

接口（interface）

algOne 函数的第 93 行代码调用 io.ReadFull 函数造成了接口赋值。

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}

通过 ReadFull 函数代码我们发现，bytes.Buffer 的指针变量赋值给了 Reader 接口。现在我们知道了使用接口变量的开销：分配和重定向。所以如果没有必须的使用接口变量的原因，可以不使用接口变量。

修改直algOne 函数，接用 io 包的 Read 函数来代替 ReadFull 函数。

import (
    "bytes"
    "fmt"
    _ "io"
)

func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {

    // Use a bytes Buffer to provide a stream to process.
    input := bytes.NewBuffer(data)

    // The number of bytes we are looking for.
    size := len(find)

    // Declare the buffers we need to process the stream.
    buf := make([]byte, size)
    end := size - 1

    // Read in an initial number of bytes we need to get started.
    if n, err := input.Read(buf[:end]); err != nil || n < end {
        output.Write(buf[:n])
        return
    }

    for {

        // Read in one byte from the input stream.
        if _, err := input.Read(buf[end:]); err != nil {

            // Flush the reset of the bytes we have.
            output.Write(buf[:end])
            return
        }

        // If we have a match, replace the bytes.
        if bytes.Compare(buf, find) == 0 {
            output.Write(repl)

            // Read a new initial number of bytes.
            if n, err := input.Read(buf[:end]); err != nil || n < end {
                output.Write(buf[:n])
                return
            }

            continue
        }

        // Write the front byte since it has been compared.
        output.WriteByte(buf[0])

        // Slice that front byte out.
        copy(buf, buf[1:])
    }
}

再次运行 benchmark，每次操作的内存分配次数变成了一次，bytes.Buffer 变量不再逃逸了。性能提升了29%（从 2570 ns/op 到 1814 ns/op），

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8     2000000      1814 ns/op      5 B/op           1 allocs/op

我们继续来看 buf 变量。

栈帧（Stack Frames）

再次运行 go build -gcflags "-m -m"，并关注 stream.go:89 。

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89

这里说 slice 底层的数组对于栈来说太大了（too large for stack）。这个信息有些误导性，其实不是因为数组太大，而是因为编译器在编译阶段不知道数组的大小。

只有编译器在编译期间知道变量的大小的情况下，变量才会被分配在栈上，这是因为函数栈帧的大小是在编译阶段确定的，如果编译器在编译过程中不知道变量的大小，那么变量就会分配到堆上。

修改 89 行代码，将 buf 变量的初始长度设置为 5，再次运行 benchmark。

buf := make([]byte, 5)

内存分配次数变成了 0 次。

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8   3000000   1720 ns/op     0 B/op       0 allocs/op

再次运行 go build -gcflags "-m -m" 会发现，所有变量都没逃逸。

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape

但是除非能确认 slice 的长度，否则我们不能通过硬编码来指定 slice 的初始大小，所以这段代码 algOne 函数可能需要一次内存分配。

内存分配和性能（Allocations and Performance）

比较一下每次优化过程中的新能提升：

Before any optimization
BenchmarkAlgorithmOne-8   2000000    2570 ns/op   117 B/op        2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8   2000000    1814 ns/op     5 B/op        1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8   3000000    1720 ns/op     0 B/op        0 allocs/op

优化掉 bytes.Buffer 变量的内存分配之后，提升了大约 29% 的性能，所有内存分配都又划掉之后，提升了大约 33% 的性能。由此可见，内存分配是影响应用程序性能的因素之一。

结论

golang有一些很方便的工具来分析内存，基于这些工具，可以重构代码使得变量只分配在栈空间上，而不需要重新分配到堆上。写代码时不要把性能作为第一优先级，因为你并不想在代码序时一直猜测代码的性能，写正确的代码才是第一优先级。我们首先要关注的是完整性、可读性和简单性。有了可以运行的程序，再来确定程序性能是否满足要求。假如程序还不够快，可以使用golang提供的工具来查找和解决性能问题。

参考资料

https://www.ardanlabs.com/blog/2017/06/language-mechanics-on-memory-profiling.html