Analyzing Frequent Full GCs in Production

Background

A colleague's project, voice-analyzer-web, kept triggering alerts in production: the monitoring system repeatedly reported JVM Full GC pauses (all 4 servers were alerting). Monitoring aggregates every 10 minutes and raises an alert whenever more than 10 Full GCs occur within that window.

Production JVM parameters (abridged)

-XX:MaxMetaspaceSize=256M -Xms4g -Xmx4g -Xmn1g -Xss256k
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=8 -XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC
-XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled
-XX:CMSInitiatingOccupancyFraction=70 -XX:CMSFullGCsBeforeCompaction=5
-XX:+UseCMSCompactAtFullCollection -XX:+CMSScavengeBeforeRemark
-XX:+HeapDumpOnOutOfMemoryError
-Xloggc:/usr/local/webserver/voice-analyzer-web/logs/gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
-XX:HeapDumpPath=/usr/local/webserver/voice-analyzer-web/logs/voice-analyzer-web.hprof

Altogether: a 4 GB heap with a 1 GB young generation and a 3 GB old generation; Eden is about 800 MB and each Survivor space about 100 MB. A CMS GC cycle is triggered once old-generation occupancy reaches 70% (roughly 2.1 GB).
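These figures follow directly from the flags; a quick back-of-the-envelope breakdown (my own arithmetic, rounded):

young generation   = -Xmn1g                               = 1024 MB
Eden               = 1024 MB * 8/10  (SurvivorRatio=8)    ≈  819 MB  (~800 MB)
each Survivor      = 1024 MB * 1/10                       ≈  102 MB  (~100 MB)
old generation     = 4096 MB - 1024 MB                    = 3072 MB
CMS trigger point  = 3072 MB * 70%                        ≈ 2150 MB  (~2.1 GB)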

GC Log Analysis

First pass over gc.log

Pick one production machine and download its gc.log (about 5 MB) locally. For visual analysis, the online GC analysis tool GCeasy is recommended.

2019-03-21T11:19:41.609+0800: 1086345.976: [GC (Allocation Failure) 2019-03-21T11:19:41.609+0800: 1086345.976: [ParNew: 873168K->36622K(943744K), 0.0312849 secs] 3074543K->2238776K(4089472K), 0.0316707 secs] [Times: user=0.14 sys=0.00, real=0.04 secs]
2019-03-21T11:19:41.641+0800: 1086346.008: Total time for which application threads were stopped: 0.0339016 seconds, Stopping threads took: 0.0002451 seconds
2019-03-21T11:19:41.641+0800: 1086346.008: Application time: 0.0000710 seconds
2019-03-21T11:19:41.643+0800: 1086346.010: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2202154K(3145728K)] 2238847K(4089472K), 0.0057407 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2019-03-21T11:19:41.649+0800: 1086346.015: Total time for which application threads were stopped: 0.0076999 seconds, Stopping threads took: 0.0002899 seconds
2019-03-21T11:19:41.649+0800: 1086346.016: [CMS-concurrent-mark-start]
2019-03-21T11:19:41.915+0800: 1086346.282: [CMS-concurrent-mark: 0.266/0.266 secs] [Times: user=0.64 sys=0.01, real=0.27 secs]
2019-03-21T11:19:41.915+0800: 1086346.282: [CMS-concurrent-preclean-start]
2019-03-21T11:19:41.968+0800: 1086346.335: [CMS-concurrent-preclean: 0.049/0.053 secs] [Times: user=0.06 sys=0.00, real=0.05 secs]
2019-03-21T11:19:41.968+0800: 1086346.335: [CMS-concurrent-abortable-preclean-start]
2019-03-21T11:19:45.263+0800: 1086349.630: Application time: 3.6144054 seconds
2019-03-21T11:19:45.268+0800: 1086349.635: Total time for which application threads were stopped: 0.0049627 seconds, Stopping threads took: 0.0002311 seconds
2019-03-21T11:19:45.268+0800: 1086349.635: Application time: 0.0000508 seconds
2019-03-21T11:19:45.269+0800: 1086349.636: Total time for which application threads were stopped: 0.0010917 seconds, Stopping threads took: 0.0001300 seconds
CMS: abort preclean due to time 2019-03-21T11:19:47.067+0800: 1086351.434: [CMS-concurrent-abortable-preclean: 4.876/5.099 secs] [Times: user=6.58 sys=0.02, real=5.10 secs]
2019-03-21T11:19:47.067+0800: 1086351.434: Application time: 1.7978130 seconds
2019-03-21T11:19:47.069+0800: 1086351.436: [GC (CMS Final Remark) [YG occupancy: 51879 K (943744 K)]2019-03-21T11:19:47.069+0800: 1086351.436: [GC (CMS Final Remark) 2019-03-21T11:19:47.069+0800: 1086351.436: [ParNew: 51879K->23393K(943744K), 0.0156467 secs] 2254034K->2226423K(4089472K), 0.0159200 secs] [Times: user=0.12 sys=0.00, real=0.02 secs]
2019-03-21T11:19:47.085+0800: 1086351.452: [Rescan (parallel) , 0.0106023 secs]2019-03-21T11:19:47.096+0800: 1086351.462: [weak refs processing, 0.0000353 secs]2019-03-21T11:19:47.096+0800: 1086351.462: [class unloading, 0.0421021 secs]2019-03-21T11:19:47.138+0800: 1086351.505: [scrub symbol table, 0.0157111 secs]2019-03-21T11:19:47.153+0800: 1086351.520: [scrub string table, 0.0014866 secs][1 CMS-remark: 2203030K(3145728K)] 2226423K(4089472K), 0.0887818 secs] [Times: user=0.26 sys=0.01, real=0.09 secs]
2019-03-21T11:19:47.158+0800: 1086351.525: Total time for which application threads were stopped: 0.0908702 seconds, Stopping threads took: 0.0002256 seconds
2019-03-21T11:19:47.158+0800: 1086351.525: [CMS-concurrent-sweep-start]
2019-03-21T11:19:47.350+0800: 1086351.717: [CMS-concurrent-sweep: 0.192/0.192 secs] [Times: user=0.31 sys=0.00, real=0.20 secs]
2019-03-21T11:19:47.351+0800: 1086351.717: [CMS-concurrent-reset-start]
2019-03-21T11:19:47.356+0800: 1086351.723: [CMS-concurrent-reset: 0.005/0.005 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-03-21T11:19:48.158+0800: 1086352.525: Application time: 1.0000965 seconds
2019-03-21T11:19:48.160+0800: 1086352.527: Total time for which application threads were stopped: 0.0019235 seconds, Stopping threads took: 0.0002317 seconds
2019-03-21T11:19:49.356+0800: 1086353.723: Application time: 1.1961818 seconds
2019-03-21T11:19:49.358+0800: 1086353.725: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2202654K(3145728K)] 2234684K(4089472K), 0.0023390 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-03-21T11:19:49.360+0800: 1086353.727: Total time for which application threads were stopped: 0.0043086 seconds, Stopping threads took: 0.0002062 seconds
2019-03-21T11:19:49.360+0800: 1086353.727: [CMS-concurrent-mark-start]
2019-03-21T11:19:49.623+0800: 1086353.990: [CMS-concurrent-mark: 0.262/0.262 secs] [Times: user=0.55 sys=0.00, real=0.27 secs]
2019-03-21T11:19:49.623+0800: 1086353.990: [CMS-concurrent-preclean-start]
2019-03-21T11:19:49.689+0800: 1086354.056: [CMS-concurrent-preclean: 0.062/0.066 secs] [Times: user=0.10 sys=0.00, real=0.06 secs]

Adjacent CMS GC cycles are only about 8 seconds apart. Compare old-generation usage before and after a CMS collection:

2019-03-21T11:19:41.643+0800: 1086346.010: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2202154K(3145728K)] 2238847K(4089472K), 0.0057407 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2019-03-21T11:19:49.358+0800: 1086353.725: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2202654K(3145728K)] 2234684K(4089472K), 0.0023390 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

The old generation has a capacity of 3145728K. The first CMS initial mark fired with 2202154K in use; the second fired with 2202654K in use. Between the two CMS cycles the old generation actually grew by about 0.5 MB. Initial lines of suspicion:

  • The CMSInitiatingOccupancyFraction threshold causes CMS to be triggered too frequently
  • A large amount of memory was promoted into the old generation during this window
  • A heap memory leak

The chart makes it fairly obvious that old-generation usage stays at around 2.1 GB: GC does not bring the old generation down at all, and there was no surge of concurrent requests during this period, so a heap memory leak is suspected.

Analyzing the heap dump

Pick one server and ask ops to take a heap dump (2.4 GB), then restart all servers (memory drops straight down; the old generation sits at around 200 MB...).

Analyze the heap dump with VisualVM or Eclipse MAT.

Using VisualVM

  • Inspect the largest objects on the heap:

There are 377,094 double[] arrays occupying roughly 2 GB in total (retained size is the total memory that would be reclaimed if the object were garbage collected). Suspicion: the double[] objects are leaking and never being collected.

  • Look at the GC roots of these objects

The screenshot shows that the double[] is referenced directly by a DataBuffer and ends up cached in a Guava LocalCache. The GC-root chain also shows a scheduled task (created inside DataPublisher) referencing the object.

Code Analysis

  • Open the DataPublisher class and examine how the scheduled task is created:
public synchronized void start() {
    if (future == null) {
        Runnable task = new Runnable() {
            public void run() {
                try {
                    // mainly computes and publishes the statistics
                    accumulator.publish();
                } catch (Exception e) {
                    handleException(e);
                }
            }
        };
        // judging from the calling context: initial delay of 1 second, then runs once per second
        future = getExecutor().scheduleWithFixedDelay(task,
                delayMillis, delayMillis,
                TimeUnit.MILLISECONDS);
    }
}

protected synchronized ScheduledExecutorService getExecutor() {
    if (sharedExecutor == null) {
        sharedExecutor = Executors.newScheduledThreadPool(1, new PublishThreadFactory());
    }
    return sharedExecutor;
}
  • Trace the callers of start()

It is triggered from ServerStats.initialize():

public void initialize(Server server) {
    serverFailureCounts = new MeasuredRate(failureCountSlidingWindowInterval);
    requestCountInWindow = new MeasuredRate(300000L);
    if (publisher == null) {
        // the DataBuffer actually lives inside DataDistribution; bufferSize = 1000
        dataDist = new DataDistribution(getBufferSize(), PERCENTS);
        // DataPublisher indirectly references the DataBuffer
        publisher = new DataPublisher(dataDist, getPublishIntervalMillis());
        // this is where start() is triggered
        publisher.start();
    }
    this.server = server;
}
  • Trace the callers of initialize()

The call chain is as follows:

It is triggered from LoadBalancerStats.createServerStats():

private ServerStats createServerStats(Server server) {
    // create a ServerStats and initialize it
    ServerStats ss = new ServerStats(this);
    ss.setBufferSize(1000);
    ss.setPublishInterval(1000);
    ss.initialize(server);
    return ss;
}

private ServerStats getServerStats(Server server) {
    try {
        // fetch from the local cache; on a miss,
        // createServerStats is called and the result is cached
        return serverStatsCache.get(server);
    } catch (ExecutionException e) {
        ServerStats stats = createServerStats(server);
        serverStatsCache.asMap().putIfAbsent(server, stats);
        return serverStatsCache.asMap().get(server);
    }
}

// here is the local cache: the key is Server, the value is ServerStats
private final LoadingCache<Server, ServerStats> serverStatsCache =
        CacheBuilder.newBuilder()
                .expireAfterAccess(SERVERSTATS_EXPIRE_MINUTES.get(), TimeUnit.MINUTES)
                .removalListener(new RemovalListener<Server, ServerStats>() {
                    @Override
                    public void onRemoval(RemovalNotification<Server, ServerStats> notification) {
                        notification.getValue().close();
                    }
                })
                .build(
                        new CacheLoader<Server, ServerStats>() {
                            // create the ServerStats on a cache miss
                            public ServerStats load(Server server) {
                                return createServerStats(server);
                            }
                        });

Note that bufferSize is 1000, and each double[] in the heap dump also has 1000 elements; moreover, the scheduled task runs at a 1-second interval.

Looking at it from the local cache's perspective: if the number of entries keeps growing, new Server objects must keep being created and added to the cache. But according to my colleague we only have a dozen or so servers, so the cache should not grow indefinitely. Could some code problem be creating new Server instances over and over?

  • Trace the callers of getServerStats()

The entry-point code is as follows:

public String call(final String path, Map<String, Object> param, Integer uploadType) {
    return LoadBalancerCommand.<String>builder().withLoadBalancer(loadBalancer).build()
            .submit(new ServerOperation<String>() {
                @Override
                public Observable<String> call(Server server) {
                    URL url;
                    try {
                        String s = HttpClientUtil.post("http://" + server.getHost() + ":" + server.getPort() + path, param);
                        return Observable.just(s);
                    } catch (Exception e) {
                        logger.info("python identify voice failed! host {}", server.getHost());
                        return Observable.error(e);
                    }
                }
            }).toBlocking().first();
}

Focus on LoadBalancerCommand.submit():

public Observable<T> submit(final ServerOperation<T> operation) {
    // Use the load balancer
    Observable<T> o =
            (server == null ? selectServer() : Observable.just(server))
                    .concatMap(new Func1<Server, Observable<T>>() {
                        @Override
                        // Called for each server being selected
                        public Observable<T> call(Server server) {
                            context.setServer(server);
                            // here is the entry point into getServerStats
                            final ServerStats stats = loadBalancerContext.getServerStats(server);
                            // ... (remainder of the method omitted)
                        }
                    });
    // ...
}

Then look at LoadBalancerContext.getServerStats():

public final ServerStats getServerStats(Server server) {
    ServerStats serverStats = null;
    // LoadBalancerContext holds the LoadBalancer
    ILoadBalancer lb = this.getLoadBalancer();
    if (lb instanceof AbstractLoadBalancer) {
        // the LoadBalancer holds a LoadBalancerStats
        LoadBalancerStats lbStats = ((AbstractLoadBalancer) lb).getLoadBalancerStats();
        serverStats = lbStats.getSingleServerStat(server);
    }
    return serverStats;
}
  • Examine how the LoadBalancer is created
public void afterPropertiesSet() throws Exception {
    initLoadbalance();
    // rebuild the load balancer every 3 minutes
    service.scheduleAtFixedRate(new Runnable() {
        @Override
        public void run() {
            initLoadbalance();
        }
    }, 3, 3, TimeUnit.MINUTES);
}

/**
 * Initialize the load-balancing policy for regular machines
 */
private void initLoadbalance() {
    try {
        List<Server> serverList = new ArrayList<>();
        String machine = footballConfigFamilyService.getConfigMap().get("voice-analyze");
        List<Weight> list = JSON.parseObject(machine, new TypeReference<List<Weight>>() {
        });
        // serverList only contains a dozen or so servers
        for (Weight weight : list) {
            serverList.add(new WeightServer(weight.getHost(), weight.getPort(), weight.getWeight()));
        }
        // create the loadBalancer
        loadBalancer = LoadBalancerBuilder.newBuilder().withRule(new WeightedRule())
                .buildFixedServerListLoadBalancer(serverList);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Now look at LoadBalancerBuilder.buildFixedServerListLoadBalancer():

public BaseLoadBalancer buildFixedServerListLoadBalancer(List<T> servers) {
    if (rule == null) {
        rule = createRuleFromConfig(config);
    }
    // create a BaseLoadBalancer
    BaseLoadBalancer lb = new BaseLoadBalancer(config, rule, ping);
    // set the list of servers to balance across
    lb.setServersList(servers);
    return lb;
}

public BaseLoadBalancer(IClientConfig config, IRule rule, IPing ping) {
    initWithConfig(config, rule, ping);
}

void initWithConfig(IClientConfig clientConfig, IRule rule, IPing ping) {
    setLoadBalancerStats(new LoadBalancerStats(clientName));
    rule.setLoadBalancer(this);
    if (ping instanceof AbstractLoadBalancerPing) {
        ((AbstractLoadBalancerPing) ping).setLoadBalancer(this);
    }
}

From the code above, a brand-new LoadBalancer is built every 3 minutes, and every time a LoadBalancer is created, a new LoadBalancerStats is created as one of its fields.

Could it be that the Server objects created every 3 minutes count as different keys to the Guava LocalCache? Secretly pleased; it feels like the problem has been found 😄

Now we need to confirm: do two WeightServer objects with the same ip and port map to a single entry in the LocalCache?

  • Inspect WeightServer

The Guava cache source itself will not be walked through here; its overall design follows ConcurrentHashMap.

public Server(String host, int port) {
    this.host = host;
    this.port = port;
    this.id = host + ":" + port;
    isAliveFlag = false;
}

public boolean equals(Object obj) {
    if (this == obj)
        return true;
    if (!(obj instanceof Server))
        return false;
    Server svc = (Server) obj;
    return svc.getId().equals(this.getId());
}

public int hashCode() {
    int hash = 7;
    hash = 31 * hash + (null == this.getId() ? 0 : this.getId().hashCode());
    return hash;
}

From the above, if two WeightServer objects have the same ip and port, testing shows the cache holds only one entry: the Guava cache treats them as the same element. Awkward, guessed wrong 😅...
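To double-check that conclusion, here is a minimal, self-contained sketch of the test. The ServerKey class and the sample host/port are stand-ins of my own that copy the equals/hashCode logic shown above; they are not code from the project:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class ServerKeyEqualityTest {

    // stand-in for ribbon's Server/WeightServer: identity is "host:port"
    static class ServerKey {
        final String id;
        ServerKey(String host, int port) { this.id = host + ":" + port; }
        @Override public boolean equals(Object obj) {
            return obj instanceof ServerKey && ((ServerKey) obj).id.equals(this.id);
        }
        @Override public int hashCode() { return 31 * 7 + id.hashCode(); }
    }

    public static void main(String[] args) {
        LoadingCache<ServerKey, String> cache = CacheBuilder.newBuilder()
                .build(new CacheLoader<ServerKey, String>() {
                    @Override public String load(ServerKey key) { return "stats for " + key.id; }
                });

        // two distinct instances with the same host and port
        cache.getUnchecked(new ServerKey("10.0.0.1", 8080));
        cache.getUnchecked(new ServerKey("10.0.0.1", 8080));

        // prints 1: the cache keys on equals/hashCode, so both lookups hit one entry
        System.out.println(cache.size());
    }
}

The result matches the test mentioned above: as long as equals/hashCode are based on host:port, re-created server objects collapse into a single cache entry.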

Re-examining the code

The heap dump shows the number of Server entries in the cache growing steadily, yet in reality a single cache should only ever hold a dozen or so Servers. Then it suddenly clicks: every LoadBalancerStats object has its own cache.

public BaseLoadBalancer(IClientConfig config, IRule rule, IPing ping) {
    initWithConfig(config, rule, ping);
}

void initWithConfig(IClientConfig clientConfig, IRule rule, IPing ping) {
    setLoadBalancerStats(new LoadBalancerStats(clientName));
    rule.setLoadBalancer(this);
    if (ping instanceof AbstractLoadBalancerPing) {
        ((AbstractLoadBalancerPing) ping).setLoadBalancer(this);
    }
}

Root cause: the scheduled task keeps executing initLoadbalance(), so new LoadBalancer instances are created endlessly. The number of LoadBalancerStats instances therefore keeps growing, and so does the total number of Server entries cached across all of those caches.

One question remains: why are the old caches never reclaimed by GC? Go back to DataPublisher.start():

public synchronized void start() {
    if (future == null) {
        Runnable task = new Runnable() {
            public void run() {
                try {
                    accumulator.publish();
                } catch (Exception e) {
                    handleException(e);
                }
            }
        };
        future = getExecutor().scheduleWithFixedDelay(task,
                delayMillis, delayMillis,
                TimeUnit.MILLISECONDS);
    }
}

protected synchronized ScheduledExecutorService getExecutor() {
    if (sharedExecutor == null) {
        sharedExecutor = Executors.newScheduledThreadPool(1, new PublishThreadFactory());
    }
    return sharedExecutor;
}

From the code above, every Server ends up tied to a scheduled task running on a thread pool, so the Server objects in the caches stay referenced forever and GC never reclaims them.
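A minimal sketch of that reference chain, using illustrative names rather than Ribbon's own classes: once a periodic task that captures the buffer has been handed to a long-lived executor, the executor's work queue keeps the buffer reachable from a GC root, even after every other reference is gone.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledTaskRetentionSketch {

    // long-lived shared executor, analogous to DataPublisher's sharedExecutor
    private static final ScheduledExecutorService SHARED = Executors.newScheduledThreadPool(1);

    // stand-in for the 1000-slot DataBuffer seen in the heap dump
    static class Buffer {
        final double[] data = new double[1000];
    }

    public static void main(String[] args) {
        schedulePublisher(new Buffer());
        // The caller keeps no reference to the Buffer, yet it cannot be collected:
        // executor thread (GC root) -> work queue -> periodic Runnable -> Buffer.
        System.gc();
    }

    private static void schedulePublisher(Buffer buffer) {
        SHARED.scheduleWithFixedDelay(() -> {
            double ignored = buffer.data[0]; // "publish" the statistics
        }, 1, 1, TimeUnit.SECONDS);
    }
}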

An interesting detail:

// the expiry defaults to 30 minutes
private final LoadingCache<Server, ServerStats> serverStatsCache =
        CacheBuilder.newBuilder()
                .expireAfterAccess(SERVERSTATS_EXPIRE_MINUTES.get(), TimeUnit.MINUTES)
                .removalListener(new RemovalListener<Server, ServerStats>() {
                    @Override
                    public void onRemoval(RemovalNotification<Server, ServerStats> notification) {
                        notification.getValue().close();
                    }
                })
                .build(
                        new CacheLoader<Server, ServerStats>() {
                            // create the ServerStats on a cache miss
                            public ServerStats load(Server server) {
                                return createServerStats(server);
                            }
                        });

This Guava cache uses expireAfterAccess together with a removalListener. My guess is that the Ribbon authors intended the expiry plus the listener to shut down the publishing task once a Server is no longer in use, so the thread does not keep running forever. But Guava cache expiry is passive: an expired entry is only actually removed (and then reloaded) when it is touched again.

In this project's execution pattern, however, new caches just keep being added, and the old caches are never accessed again, so the objects they hold can never be released.
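A small sketch of that passive-expiry behavior (the key, value, and timings are made up for illustration): long after an entry has expired, the removal listener still has not fired; it only fires once some cache operation, such as cleanUp() or another read/write, touches the cache.

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;
import java.util.concurrent.TimeUnit;

public class PassiveExpirySketch {
    public static void main(String[] args) throws InterruptedException {
        Cache<String, String> cache = CacheBuilder.newBuilder()
                .expireAfterAccess(100, TimeUnit.MILLISECONDS)
                .removalListener(new RemovalListener<String, String>() {
                    @Override
                    public void onRemoval(RemovalNotification<String, String> n) {
                        System.out.println("removed: " + n.getKey()); // where Ribbon calls close()
                    }
                })
                .build();

        cache.put("server-1", "stats");
        Thread.sleep(300); // the entry is now well past its expiry time

        // Nothing has been removed yet: Guava has no background sweeper thread;
        // expired entries are only cleaned up as a side effect of cache activity.
        cache.cleanUp(); // now the entry is evicted and the listener fires
    }
}

This is exactly why the close() in the removal listener above never runs for the abandoned caches: nothing ever touches them again.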

Solution

  • Avoid recreating the LoadBalancer over and over; just overwrite its server list:
private void initLoadbalance() {
    try {
        List<Server> serverList = new ArrayList<>();
        String machine = footballConfigFamilyService.getConfigMap().get("voice-analyze");
        List<Weight> list = JSON.parseObject(machine, new TypeReference<List<Weight>>() {
        });
        // serverList only contains a dozen or so servers
        for (Weight weight : list) {
            serverList.add(new WeightServer(weight.getHost(), weight.getPort(), weight.getWeight()));
        }
        // ********** the fix **********
        if (loadBalancer == null) {
            loadBalancer = LoadBalancerBuilder.newBuilder().withRule(new WeightedRule())
                    .buildFixedServerListLoadBalancer(serverList);
        } else {
            // overwrite the old server list
            loadBalancer.setServersList(serverList);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Conclusion

When using a third-party framework, make sure you understand how it runs under the hood. 😄
