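Abstract: This article briefly describes how the Hadoop ResourceManager web UI handles HA. Building on that, it shows how to put the ResourceManager behind an Nginx reverse proxy using a dynamic OpenResty + Lua script. The code provided can be run as-is.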
Problem
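The Hadoop ResourceManager (RM) web UI uses an HA scheme that resembles client-side load balancing. The RMs are split into a Master (active) and a Standby. When an HTTP request reaches the Standby, the Standby redirects it to the Master, so the browser ends up on the RM that is actually serving. This makes Nginx a poor fit as a plain load balancer in front of the RMs, because Nginx expects every upstream instance to be able to serve requests. Can Nginx instead find the actual Master and send every request to that node?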
Solution
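The ResourceManager HA documentation (https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html) says this about load balancers: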
> If you are running a set of ResourceManagers behind a Load Balancer (e.g. Azure or AWS ) and would like the Load Balancer to point to the active RM, you can use the /isActive HTTP endpoint as a health probe. http://RM_HOSTNAME/isActive will return a 200 status code response if the RM is in Active HA State, 405 otherwise.
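On RM releases that do ship `/isActive`, the probe described in the quote could be scripted roughly as follows. This is a minimal bash sketch and was not used in this article. The function names `isactive_status_to_role` and `probe_isactive` are hypothetical, port 8088 is assumed as the default, and `curl` must be installed.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the article): probe the RM /isActive endpoint.

# Map the HTTP status of GET /isActive to a role, per the RM HA docs:
# 200 = active, 405 = not active. Anything else (for example when the
# endpoint does not exist, or 000 when the node is unreachable) is "unknown".
isactive_status_to_role() {
    case "$1" in
        200) echo "active" ;;
        405) echo "standby" ;;
        *)   echo "unknown" ;;
    esac
}

# Probe one RM and print its role. Port defaults to 8088 (assumption).
probe_isactive() {
    local host="$1" port="${2:-8088}" code
    # -s: quiet, -o /dev/null: discard the body, -m 2: give up after 2 s,
    # -w: print only the status code (curl prints 000 if nothing came back)
    code=$(curl -s -o /dev/null -m 2 -w '%{http_code}' "http://${host}:${port}/isActive") || true
    isactive_status_to_role "${code}"
}
```

For example, `probe_isactive 192.168.100.43` would print one of `active`, `standby` or `unknown`.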
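- The ResourceManager version we run does not have the `/isActive` route, so this option is not available to us.
- While researching, I found that the response of `http://ip:8088/jmx?qry=Hadoop:service=ResourceManager,name=ClusterMetrics` contains the Master's hostname. Only the Master actually serves this endpoint; a Standby redirects it. So a node can be identified as the Master by whether it returns this information.
- In addition, the body of the Standby's 307 response explicitly says that it is the standby. This can also be used to tell the nodes apart.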
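The third signal needs no special endpoint. A Standby answers with a 307 whose body starts with "This is standby RM", while the Master answers normally: 302 for `/` and 200 for `/jmx` in the captures that follow. A minimal bash sketch of that check is shown below. The names `classify_rm_response` and `probe_rm_root` are hypothetical, and treating 200/302 as "active" is an assumption drawn only from the responses shown in this article.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the article): tell Master from Standby by
# how a node answers a plain GET.

# $1 = HTTP status code, $2 = response body
classify_rm_response() {
    local status="$1" body="$2"
    if [[ "${status}" == "307" && "${body}" == *"This is standby RM"* ]]; then
        echo "standby"
    elif [[ "${status}" == "200" || "${status}" == "302" ]]; then
        # Assumption: the Master answers "/" with 302 and /jmx with 200
        echo "active"
    else
        echo "unknown"
    fi
}

# Fetch "/" from one node and classify it. curl only follows redirects with -L,
# so the Standby's 307 stays visible.
probe_rm_root() {
    local host="$1" port="${2:-8088}" out
    # Append the status code on its own line after the body
    out=$(curl -s -m 2 -w '\n%{http_code}' "http://${host}:${port}/") || true
    # Last line = status code, everything before it = body
    classify_rm_response "${out##*$'\n'}" "${out%$'\n'*}"
}
```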
Component versions in our environment:
- ResourceManager version: 3.1.1.3.1.0.0-78 from e4f82af51faec922b4804d0232a637422ec29e64 by jenkins source checksum 47cebb2682958b68f58d47415f5d2555 on 2018-12-06T12:28Z
- Hadoop version: 3.1.1.3.1.0.0-78 from e4f82af51faec922b4804d0232a637422ec29e64 by jenkins source checksum eab9fa2a6aa38c6362c66d8df75774 on 2018-12-06T12:26Z
The actual responses look like this. First, requests to the Standby:
```text
curl xx.xx.xx.xx:8088/ -k -v
* Trying xx.xx.xx.xx:8088...
* Connected to xx.xx.xx.xx (xx.xx.xx.xx) port 8088 (#0)
> GET / HTTP/1.1
> Host: xx.xx.xx.xx:8088
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 307 Temporary Redirect
< Date: Wed, 25 Oct 2023 02:05:48 GMT
< Cache-Control: no-cache
< Expires: Wed, 25 Oct 2023 02:05:48 GMT
< Date: Wed, 25 Oct 2023 02:05:48 GMT
< Pragma: no-cache
< Content-Type: text/plain;charset=utf-8
< X-Frame-Options: SAMEORIGIN
< Location: http://bigdata3:8088/
< Content-Length: 43
<
This is standby RM. The redirect url is: /

curl 'http://xx.xx.xx.xx:8088/jmx?qry=Hadoop:service=ResourceManager,name=ClusterMetrics' -k -v
* Trying xx.xx.xx.xx:8088...
* Connected to xx.xx.xx.xx (xx.xx.xx.xx) port 8088 (#0)
> GET /jmx?qry=Hadoop:service=ResourceManager,name=ClusterMetrics HTTP/1.1
> Host: xx.xx.xx.xx:8088
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 307 Temporary Redirect
< Date: Wed, 25 Oct 2023 02:10:32 GMT
< Cache-Control: no-cache
< Expires: Wed, 25 Oct 2023 02:10:32 GMT
< Date: Wed, 25 Oct 2023 02:10:32 GMT
< Pragma: no-cache
< Content-Type: text/plain;charset=utf-8
< X-Frame-Options: SAMEORIGIN
< Location: http://bigdata3:8088/jmx?qry=Hadoop%3Aservice%3DResourceManager%2Cname%3DClusterMetrics
< Content-Length: 109
<
This is standby RM. The redirect url is: /jmx?qry=Hadoop%3Aservice%3DResourceManager%2Cname%3DClusterMetrics
* Connection #0 to host xx.xx.xx.xx left intact
```
The same requests sent to the Master node:
```text
curl xx.xx.xx.xx:8088/ -k -v
* Trying xx.xx.xx.xx:8088...
* Connected to xx.xx.xx.xx (xx.xx.xx.xx) port 8088 (#0)
> GET / HTTP/1.1
> Host: xx.xx.xx.xx:8088
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 302 Found
< Date: Wed, 25 Oct 2023 02:06:01 GMT
< Cache-Control: no-cache
< Expires: Wed, 25 Oct 2023 02:06:01 GMT
< Date: Wed, 25 Oct 2023 02:06:01 GMT
< Pragma: no-cache
< Content-Type: text/plain;charset=utf-8
< X-Frame-Options: SAMEORIGIN
< Vary: Accept-Encoding
< Location: http://xx.xx.xx.xx:8088/cluster
< Content-Length: 0

curl 'http://xx.xx.xx.xx:8088/jmx?qry=Hadoop:service=ResourceManager,name=ClusterMetrics' -k -v
* Trying xx.xx.xx.xx:8088...
* Connected to xx.xx.xx.xx (xx.xx.xx.xx) port 8088 (#0)
> GET /jmx?qry=Hadoop:service=ResourceManager,name=ClusterMetrics HTTP/1.1
> Host: xx.xx.xx.xx:8088
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 25 Oct 2023 02:09:49 GMT
< Cache-Control: no-cache
< Expires: Wed, 25 Oct 2023 02:09:49 GMT
< Date: Wed, 25 Oct 2023 02:09:49 GMT
< Pragma: no-cache
< Content-Type: application/json; charset=utf8
< X-Frame-Options: SAMEORIGIN
< Vary: Accept-Encoding
< Access-Control-Allow-Methods: GET
< Access-Control-Allow-Origin: *
< Transfer-Encoding: chunked
<
{
  "beans" : [ {
    "name" : "Hadoop:service=ResourceManager,name=ClusterMetrics",
    "modelerType" : "ClusterMetrics",
    "tag.ClusterMetrics" : "ResourceManager",
    "tag.Context" : "yarn",
    "tag.Hostname" : "bigdata3",
    "NumActiveNMs" : 3,
    "NumDecommissioningNMs" : 0,
    "NumDecommissionedNMs" : 0,
    "NumLostNMs" : 0,
    "NumUnhealthyNMs" : 0,
    "NumRebootedNMs" : 0,
    "NumShutdownNMs" : 0,
    "AMLaunchDelayNumOps" : 1713,
    "AMLaunchDelayAvgTime" : 63.0,
    "AMRegisterDelayNumOps" : 1700,
    "AMRegisterDelayAvgTime" : 33170.0
  } ]
}
```
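Routing only needs `tag.Hostname` from this JSON. The article's script extracts it with a `sed`/`grep` chain. A single `sed` expression that does the same is sketched below. The function name `extract_rm_hostname` is hypothetical, and `jq` would be more robust if it is installed in the image.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the article): read a JMX ClusterMetrics
# response on stdin and print the value of "tag.Hostname", if present.
# A Standby's plain-text 307 body has no such key, so nothing is printed.
extract_rm_hostname() {
    # -n + p: print only lines where the substitution matched;
    # head keeps the first match in case the key ever appears twice.
    sed -n 's/.*"tag\.Hostname" *: *"\([^"]*\)".*/\1/p' | head -n 1
}
```

It works on both the pretty-printed JSON above and a single-line response, for example `wget -qO- "http://192.168.100.43:8088/jmx?qry=Hadoop:service=ResourceManager,name=ClusterMetrics" | extract_rm_hostname`.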
nginx
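With nginx's own features, any of the three signals above could in principle be used to decide which upstream is the Master.
The nginx upstream documentation says: "Dynamically configurable group with periodic health checks is available as part of our commercial subscription." I decided to try `ngx_http_upstream_hc_module` anyway, using the `nginx:latest` image directly with this configuration: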
```nginx
upstream backend {
    server bing.com:80;
    # server www.baidu.com:80;
}
server {
    listen 8080;
    server_name localhost;
    location / {
        proxy_set_header Host 'bing.com';
        proxy_pass http://backend;
        health_check;
    }
}
```
Starting nginx fails with the following error:
```text
2023/10/23 01:56:53 [emerg] 1#1: unknown directive "health_check" in /etc/nginx/conf.d/default.conf:13
```
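The error means the open-source nginx in this image does not include `ngx_http_upstream_hc_module`. That module is part of the commercial NGINX Plus subscription, so recompiling open-source nginx will not provide it. Active health checks would require NGINX Plus or a third-party module built into a custom nginx. I did not want to take either path, so this approach was dropped.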
openresty+lua
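A different idea: if the current Master can be identified at request time, the problem goes away. OpenResty, which bundles nginx with Lua scripting, makes this kind of dynamic, per-request logic easy to write, so we use OpenResty + Lua to find the current Master.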
First, prepare an OpenResty Docker image:
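```dockerfile
# Build with: docker build --platform linux/amd64 -t openresty:dev -f Dockerfile .
FROM openresty/openresty:latest
# Switch the Debian mirrors to TUNA. Older images keep sources in /etc/apt/sources.list,
# Debian 12 (bookworm) based images in /etc/apt/sources.list.d/debian.sources;
# "|| true" ignores whichever of the two files does not exist.
RUN sed -E -i 's/(deb|security)\.debian\.org/mirrors.tuna.tsinghua.edu.cn/g' \
        /etc/apt/sources.list /etc/apt/sources.list.d/debian.sources 2>/dev/null || true
# wget is used by hadoop-status.sh; curl is handy for debugging
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        wget \
        curl \
    && rm -rf /var/lib/apt/lists/*
```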
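The main nginx configuration is saved as nginx.conf_root. It is mounted into conf.d alongside the other files, and the `include` glob only matches `*.conf`, so this name also keeps the file from being included into itself: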
nginx.conf_root
```nginx
daemon off;
worker_processes 3;
user root root;

events {
    worker_connections 1024;
}

http {
    default_type application/octet-stream;
    include /etc/nginx/conf.d/*.conf;
}
```
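The core routing logic for the default location is shown below. For each incoming request, it runs a shell script that queries every node and prints the current Master's hostname. It then hands the request to the named location (upstream) for that hostname. Because the script runs synchronously inside the request, the nginx worker is blocked while the nodes are probed.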
default.conf
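```nginx
server {
    listen 80;
    server_name localhost;

    location / {
        access_by_lua_block {
            -- Read the probe script's output through a pipe instead of a shared
            -- temp file, so concurrent requests cannot overwrite each other's result.
            -- Note: io.popen blocks this worker while all RMs are probed.
            local handle = io.popen("/bin/bash /etc/nginx/conf.d/hadoop-status.sh")
            local result = handle:read("*a")
            handle:close()
            -- Drop stray whitespace/newlines so the result is a valid location name
            result = result:gsub("%s+", "")
            ngx.log(ngx.ERR, "active RM: ", result)
            if result == "" then
                -- No node answered as Master: fail with 502 instead of ngx.exec("@")
                return ngx.exit(ngx.HTTP_BAD_GATEWAY)
            end
            -- Jump to the named location whose name is the Master's hostname
            return ngx.exec("@" .. result)
        }
    }

    location @bigdata1 {
        proxy_pass http://192.168.100.41:8088;
    }
    location @bigdata2 {
        proxy_pass http://192.168.100.42:8088;
    }
    location @bigdata3 {
        proxy_pass http://192.168.100.43:8088;
    }
}
```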
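The `access_by_lua_block` runs hadoop-status.sh on every request, so each page load probes every RM. One possible mitigation, not used in this article, is to reuse the last answer for a few seconds. The bash sketch below caches a command's output in a file for a TTL. `cached_active_rm`, the cache path and the TTL are hypothetical. It relies on GNU `stat -c %Y`, which the Debian-based OpenResty image provides.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not from the article): run a command at most once per
# TTL seconds and serve its last non-empty output from a cache file.
# Usage: cached_active_rm CACHE_FILE TTL_SECONDS COMMAND [ARGS...]
cached_active_rm() {
    local cache_file="$1" ttl="$2"
    shift 2
    local now mtime tmp
    now=$(date +%s)
    # Serve the cached answer if it is non-empty and younger than the TTL
    if [[ -s "${cache_file}" ]]; then
        mtime=$(stat -c %Y "${cache_file}")
        if (( now - mtime < ttl )); then
            cat "${cache_file}"
            return 0
        fi
    fi
    # Refresh: write to a private temp file, then rename it into place so
    # concurrent readers never see a half-written cache.
    tmp="${cache_file}.$$.${RANDOM}"
    if "$@" > "${tmp}"; then
        mv -f "${tmp}" "${cache_file}"
    else
        rm -f "${tmp}"
    fi
    cat "${cache_file}" 2>/dev/null || true
}
```

If this were saved as a hypothetical /etc/nginx/conf.d/rm-cache.sh, the command that default.conf runs could become `bash -c '. /etc/nginx/conf.d/rm-cache.sh; cached_active_rm /tmp/active-rm.cache 5 /bin/bash /etc/nginx/conf.d/hadoop-status.sh'`. The trade-off is freshness. After a failover, requests keep going to the old Master for up to the TTL, and that node, now a Standby, answers with a 307 pointing at the new Master's hostname.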
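Finding the Master is straightforward: iterate over all nodes, and the one that answers the JMX query is the Master. That response includes the Master's own hostname, which the script prints so nginx can choose the route. The code: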
hadoop-status.sh
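```bash
#!/usr/bin/env bash
# Print the hostname of the active (Master) ResourceManager, or nothing if none answers.
set -o errtrace
set -o errexit
set -o nounset
# set -o pipefail
# set -o xtrace

cd "$(dirname "$0")"

# host_ip.txt holds one "IP HOSTNAME" pair per line; lines starting with # are comments.
# "|| [[ -n ... ]]" also handles a last line without a trailing newline.
while read -r RAW_IP RAW_HOSTNAME _ || [[ -n "${RAW_IP}" ]]; do
    # Skip blank lines and comment lines
    if [[ -z "${RAW_IP}" || "${RAW_IP}" == \#* ]]; then
        continue
    fi
    # Only the Master returns ClusterMetrics; a Standby answers 307.
    # -T 2 -t 1: keep a dead node from stalling the request (wget retries up to 20 times by default);
    # --max-redirect=0: do not follow the Standby's redirect to the Master.
    MASTER_NODE=$(wget -qO- -T 2 -t 1 --max-redirect=0 \
        "http://${RAW_IP}:8088/jmx?qry=Hadoop:service=ResourceManager,name=ClusterMetrics" \
        | sed 's/,/\n/g' | grep 'tag.Hostname' | sed 's/tag.Hostname" : "//g' | sed 's/[" ]//g')
    # This node is the Master if it reports its own hostname
    if [[ -n "${MASTER_NODE}" && "${RAW_HOSTNAME}" == "${MASTER_NODE}" ]]; then
        echo -n "${RAW_HOSTNAME}"
        break
    fi
done < /etc/nginx/conf.d/host_ip.txt
```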
host_ip.txt lists the IP and hostname of every node. Each hostname must match one of the `@` named locations in default.conf:
```text
192.168.100.41 bigdata1
192.168.100.42 bigdata2
192.168.100.43 bigdata3
```
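The Lua block jumps to `"@" .. hostname`, so a node added to host_ip.txt without a matching `location @<hostname>` in default.conf would make requests fail once that node becomes the Master. A small consistency check is sketched below. It was not part of the original setup, and `missing_named_locations` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical check (not from the article): print every hostname listed in
# host_ip.txt that has no matching "location @<hostname> {" in default.conf.
# Usage: missing_named_locations host_ip.txt default.conf
missing_named_locations() {
    local hosts_file="$1" conf_file="$2" ip host
    # "|| [[ -n ... ]]" also processes a last line without a trailing newline
    while read -r ip host _ || [[ -n "${ip}" ]]; do
        # Skip blank lines and comment lines
        if [[ -z "${ip}" || "${ip}" == \#* ]]; then
            continue
        fi
        # Require whitespace/"{" after the name so @bigdata1 does not match @bigdata10
        if ! grep -Eq "location[[:space:]]+@${host}[[:space:]]*[{]" "${conf_file}"; then
            echo "${host}"
        fi
    done < "${hosts_file}"
}
```

Running `missing_named_locations host_ip.txt default.conf` in the mounted directory before starting the container should print nothing when the two files agree.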
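After building the image, run the command below from the directory that contains these files, then open 127.0.0.1:12346 to see the result: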
```shell
docker run --rm -it -p 12346:80 -v "$(pwd)/:/etc/nginx/conf.d/" openresty:dev nginx -c /etc/nginx/conf.d/nginx.conf_root
```
Summary
In short, an OpenResty + Lua script checks which upstream node is currently the Master, and nginx routes each request to that node.
ref