应用full gc排查记录

问题背景

应用使用G1 GC,在某一时刻后,开始出现频繁的full gc

排查

通过gc 日志获取一些线索

存在大量大对象分配,大部分触发GC的原因为大对象分配,相关日志片段:[GC pause (G1 Humongous Allocation)

存在转移失败(to-space exhausted),转移失败一般发生在young gc后survivor区和老年代空间没有足够空间容纳存活对象时,转移失败会导致较长的young gc耗时。转移失败时,回收器会将未成功复制的分区全部置为老年代分区,并且一般之后会紧跟着一次full gc进行全堆的回收。但是分析日志后并不是每次转移失败后都会进行full gc,这里猜测是因为JDK8u60之后,在年轻代回收的最后(对象转移之后),会进行巨型对象的回收。因此释放了内存,内存占用回到可使用的状态

年轻代大小过大,G1中一般不需要设置年轻代大小,根据预设的停顿时间,G1的停顿预测模型会动态的调整年轻代的大小。阈值为全堆大小的15% ~ 60%。观察到发生转移失败时,年轻代大小被调整到了全堆60%的上限值。过大的年轻代,以及大量的大对象分区,挤占了survivor区和老年代的空间,导致没有足够的空间容纳存活对象(过大的年轻代也有可能导致存活对象变多),最终转移失败。

调优建议

  • 调整分区大小,让原本的大对象变为正常的年轻代对象
  • 可以限制年轻代大小,降低年轻代大小,提高年轻代回收次数,变相增加大对象的回收次数
  • 从应用层面避免过多的大对象分配

解决

通过代码review和内存dump,发现大对象均为用于记日志的序列化string,并且这些日志并不是必需的。最终通过废弃这些记日志的代码来解决问题。

转移失败的GC日志

2020-10-03T23:03:29.457+0800: 330.855: [GC pause (G1 Humongous Allocation) (young) (to-space exhausted), 4.0495836 secs]
   [Parallel Time: 1773.8 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 330857.0, Avg: 330857.1, Max: 330857.1, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 1.1, Avg: 1.8, Max: 5.4, Diff: 4.3, Sum: 14.2]
      [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 4.5, Diff: 4.5, Sum: 27.6]
         [Processed Buffers: Min: 0, Avg: 50.9, Max: 81, Diff: 81, Sum: 407]
      [Scan RS (ms): Min: 25.9, Avg: 28.7, Max: 29.7, Diff: 3.7, Sum: 229.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 1.2, Max: 3.5, Diff: 3.5, Sum: 9.2]
      [Object Copy (ms): Min: 1736.7, Avg: 1738.4, Max: 1738.9, Diff: 2.2, Sum: 13907.0] // 对象复制转移
      [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.5]
         [Termination Attempts: Min: 1, Avg: 1.9, Max: 4, Diff: 3, Sum: 15]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.6]
      [GC Worker Total (ms): Min: 1773.5, Avg: 1773.6, Max: 1773.7, Diff: 0.2, Sum: 14188.5]
      [GC Worker End (ms): Min: 332630.6, Avg: 332630.6, Max: 332630.7, Diff: 0.1]
   [Code Root Fixup: 0.3 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 2.1 ms]
   [Other: 2273.5 ms]
      [Evacuation Failure: 2257.7 ms] // 转移失败的额外耗时
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.2 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.7 ms]
      [Humongous Register: 1.1 ms]
      [Humongous Reclaim: 9.9 ms] // 大对象分区回收
      [Free CSet: 2.7 ms]
   [Eden: 14.4G(15.4G)->0.0B(13.9G) Survivors: 184.0M->8192.0K Heap: 22.9G(26.0G)->9683.8M(26.0G)] // 年轻代大小15.4G,全堆26G,年轻代占比接近60%