Ceph Quincy Deployment
- Ceph version: Quincy 17.2.7. Note: the default version in the openEuler yum repository is Pacific 16.2.x; Quincy is one of the actively maintained Ceph releases.
- docker and containerd versions: docker-20.10.23 and containerd-1.7.2. Note: the docker version shipped with openEuler 22.03 is 18.09.
- sda and nvme0n2 are used as the OSD devices.
Deployment Steps
- Modify hostnames and /etc/hosts
Change the hostname on every node (k8s01-04); a command sketch follows this step.
Edit the /etc/hosts file on every node (k8s01-04).
Note: in the steps that follow, the hostnames k8s01 and ceph01 refer to the same host; likewise k8s02 and ceph02, and so on.
cat > /etc/hosts <<EOF
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
# HA
192.168.59.238 ha01
192.168.59.239 ha02
192.168.59.240 test-kubevirt.demo.com
# public network
192.168.61.241 ceph01
192.168.61.242 ceph02
192.168.61.243 ceph03
# cluster network
10.168.61.241 ceph01-cl
10.168.61.242 ceph02-cl
10.168.61.243 ceph03-cl
# manage network
# k8s eth0
192.168.59.241 k8s01
192.168.59.242 k8s02
192.168.59.243 k8s03
192.168.59.244 k8s04
# rgw
192.168.61.241 rgw.testgw images.demo.com dl.demo.com
# harbor
192.168.59.251 harbor registry.demo.com
EOF
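A minimal sketch of the hostname change, using the k8s01-04 names from the hosts file above; run the matching command on each node:
# hostnamectl set-hostname k8s01   ## on the first node; use k8s02/k8s03/k8s04 on the other nodes accordingly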
- Unify NIC names across nodes (optional); one possible approach is sketched below.
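A hedged sketch only: one way to pin a NIC to a fixed name is a systemd .link file. The interface name eth0 and the MAC address below are placeholders, not values from this deployment.
# cat > /etc/systemd/network/70-persistent-eth0.link <<EOF
[Match]
MACAddress=00:0c:29:aa:bb:cc   ## placeholder, replace with the NIC's real MAC
[Link]
Name=eth0
EOF
## a reboot (or re-triggering udev) is needed for the new name to take effect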
- Disable the firewall and SELinux (on all nodes); a sketch follows.
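A minimal sketch, assuming firewalld is the active firewall and SELinux is configured via /etc/selinux/config:
# systemctl disable --now firewalld
# setenforce 0
# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config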
- Disable swap (optional); disabling swap is recommended for production environments (on all nodes); a sketch follows.
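A minimal sketch for turning swap off now and keeping it off after reboots:
# swapoff -a
# sed -i '/\sswap\s/s/^/#/' /etc/fstab   ## comment out the swap entries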
- Adjust kernel parameters and resource limits (on all nodes)
## Forward IPv4 and let iptables see bridged traffic (optional)
# cat <<EOF | sudo tee /etc/modules-load.d/ceph.conf
overlay
br_netfilter
EOF
# modprobe overlay
# modprobe br_netfilter
# lsmod | grep br_netfilter   ## verify that the br_netfilter module is loaded
## Adjust kernel parameters
cat <<EOF | tee /etc/sysctl.d/ceph.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
# 1. Ephemeral port range for outbound connections. The default is 32768 60999.
# The start and end of the range should have different parity; with 1024 65530, dmesg logs "ip_local_port_range: prefer different parity for start/end values".
net.ipv4.ip_local_port_range = 1024 65335
# If dmesg shows messages like "nf_conntrack: table full, dropping packet", increase the conntrack limits. Do not set them too large, otherwise you may see "nf_conntrack: falling back to vmalloc".
net.netfilter.nf_conntrack_max = 2621440
net.nf_conntrack_max = 2621440
# Maximum number of memory map areas a process may have; important for applications that use many memory mappings.
vm.max_map_count = 1048576
# 2. If the error count in "netstat -s | grep 'buffer errors'" keeps growing, tune the following parameters.
# net.ipv4.tcp_wmem default: 4096 16384 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
# net.ipv4.tcp_rmem default: 4096 87380 6291456
net.ipv4.tcp_rmem = 4096 87380 6291456
# net.ipv4.tcp_mem default: 381462 508616 762924
net.ipv4.tcp_mem = 381462 508616 762924
# net.core.rmem_default default: 212992
net.core.rmem_default = 8388608
# net.core.rmem_max default: 212992
net.core.rmem_max = 26214400
# net.core.wmem_max default: 212992
net.core.wmem_max = 26214400
# Raise the file handle limits
fs.nr_open = 16777216
fs.file-max = 16777216
# 3. If dmesg shows messages like "arp_cache: neighbor table overflow", tune the following parameters.
# net.ipv4.neigh.default.gc_thresh1 default: 128
net.ipv4.neigh.default.gc_thresh1 = 40960
# net.ipv4.neigh.default.gc_thresh2 default: 512
net.ipv4.neigh.default.gc_thresh2 = 81920
# net.ipv4.neigh.default.gc_thresh3 default: 1024
net.ipv4.neigh.default.gc_thresh3 = 102400
# 4. Packet drops caused by full connection queues: enlarge the SYN (half-open) and accept (fully established) queues.
# TCP SYN backlog, default 1024; raising it to 8192 or more lets the queue hold more pending connections.
net.ipv4.tcp_max_syn_backlog = 65535
# Upper limit of the accept queue, i.e. how many connections the server can accept at the same time.
net.core.somaxconn = 65535
# Maximum receive queue length of a network device.
net.core.netdev_max_backlog = 250000
# 5. Older kernels (e.g. 3.10) support tcp_tw_recycle for fast TIME_WAIT recycling, but if clients also have timestamps enabled (usually on by default) it causes packet loss behind NAT; even without NAT, moderately high concurrency can fail PAWS checks and drop packets, so it is not recommended in production.
#### TIME_WAIT
# Default 0; enable the SYN cookie defence mechanism.
net.ipv4.tcp_syncookies = 1
# Reuse of sockets in TIME-WAIT state; 0 here, i.e. not enabled.
net.ipv4.tcp_tw_reuse = 0
# tcp_tw_recycle is not recommended (it corrupts traffic behind NAT) and was removed in kernel 4.12.
# net.ipv4.tcp_tw_recycle = 0
# Default 60.
net.ipv4.tcp_fin_timeout = 30
# 6. Enable TCP Fast Open to skip part of the 3-way handshake; bit 0 enables TFO as a client and bit 1 enables it as a server, so a value of 3 (binary 11) enables both.
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_orphan_retries = 3
# Default 0: if the accept queue is full at step 3 of the handshake, the server drops the client's ACK; 1 means the server sends an RST to the client, rejecting the handshake and the connection.
# Only enable this if you are sure the daemon really cannot keep up with connection requests; it affects clients.
net.ipv4.tcp_abort_on_overflow = 1
EOF
# sysctl -p /etc/sysctl.d/ceph.conf
## Adjust resource limits
cat > /etc/security/limits.d/ceph.conf <<EOF
# End of file
* hard nofile 655360
* soft nofile 655360
* soft core 655360
* hard core 655360
* soft nproc unlimited
root soft nproc unlimited
EOF
- Configure time synchronization with chrony, using k8s01 as the time server
## On k8s01
# yum -y install chrony
# vi /etc/chrony.conf
pool ntp.aliyun.com iburst
... ...
allow 192.168.59.0/24
allow 192.168.61.0/24
allow 10.168.61.0/24
local stratum 10
# systemctl restart chronyd && systemctl enable chronyd
## On the other nodes (k8s02-k8s04)
# yum -y install chrony
# vi /etc/chrony.conf
pool k8s01 iburst
...
# systemctl restart chronyd && systemctl enable chronyd;chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* k8s01                         3   6    17     8    -34us[  -28us] +/-   35ms
- Configure passwordless SSH login; a minimal sketch follows.
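A minimal sketch, assuming root logins and the hostnames defined in /etc/hosts above; the key pair is the same one later passed to cephadm bootstrap:
# ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
# for h in k8s01 k8s02 k8s03 k8s04 ceph01 ceph02 ceph03;do ssh-copy-id -i /root/.ssh/id_rsa.pub root@$h;done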
- Install containerd and docker
## Start on ceph01 (k8s01)
## Download cri-containerd-1.7.2-linux-amd64.tar.gz directly; the tarball contains the containerd, ctr, crictl and containerd-shim binaries plus the systemd unit and startup files, so it only needs to be extracted under /.
# tar xf cri-containerd-1.7.2-linux-amd64.tar.gz -C /
# cat > /etc/crictl.yaml << EOF
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 10
debug: false
EOF
## Generate the containerd configuration file
# mkdir /etc/containerd
# containerd config default > /etc/containerd/config.toml
## Edit /etc/containerd/config.toml
## Change SystemdCgroup = false to SystemdCgroup = true
# vi /etc/containerd/config.toml
## Raise the maximum container log line size
max_container_log_line_size = 163840
## Extract device ownership information from the security context; the kubevirt CDI depends on this parameter, and leaving it unset may lead to permission errors
device_ownership_from_security_context = true
SystemdCgroup = true
## Edit /etc/systemd/system/containerd.service and change LimitNOFILE=infinity to LimitNOFILE=655360
# systemctl enable containerd && systemctl restart containerd
## Use crictl info to check that the configuration has taken effect
## For the other nodes, simply copy the relevant files over and start the service
# On k8s01
# for i in {2..3};do scp -rp /usr/local/bin ceph0$i:/usr/local/;scp -rp /usr/local/sbin ceph0$i:/usr/local/;scp /etc/containerd ceph0$i:/etc/; scp /etc/systemd/system/containerd.service ceph0$i:/etc/systemd/system/;done
## On ceph02-03 (run from k8s01 over ssh)
# for i in {2..3};do ssh ceph0$i "systemctl daemon-reload && systemctl enable containerd && systemctl restart containerd";done
## First check whether podman is installed on the nodes; if it is, remove it before continuing!!! (a possible check is sketched below)
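A possible check, assuming podman (if present) was installed from the yum repositories; run it on every node:
# rpm -q podman && yum -y remove podman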
## Start on k8s01
1). Download the docker binary package: https://download.docker.com/linux/static/stable/x86_64/docker-20.10.23.tgz
2). Extract it
# tar xf docker-20.10.23.tgz
3). The binaries needed are docker and dockerd; copying them to /usr/bin/ is enough.
# mv docker/docker* /usr/bin/
4). Create the docker user: useradd -s /sbin/nologin docker  # if the docker group already exists, use: useradd -s /sbin/nologin docker -g docker
5). Copy the unit files docker.service and docker.socket to /usr/lib/systemd/system/, and put daemon.json in the /etc/docker directory
# cat > /usr/lib/systemd/system/docker.service <<EOF
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target docker.socket firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
# ExecStart=/usr/bin/dockerd --graph=/data/docker -H fd:// --containerd=/run/containerd/containerd.sock --cri-containerd --debug
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --cri-containerd --debug
ExecReload=/bin/kill -s HUP \$MAINPID
TimeoutSec=0
RestartSec=2
Restart=always
# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3
# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
# If the error "Failed at step LIMITS spawning /usr/bin/dockerd: Operation not permitted" appears, change LimitNOFILE=infinity to LimitNOFILE=65530
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
[Install]
WantedBy=multi-user.target
EOF
# cat > /usr/lib/systemd/system/docker.socket <<EOF
[Unit]
Description=Docker Socket for the API
PartOf=docker.service
[Socket]
ListenStream=/var/run/docker.sock
SocketMode=0660
SocketUser=root
SocketGroup=docker
[Install]
WantedBy=sockets.target
EOF
## Start docker
# systemctl daemon-reload && systemctl enable docker --now
## Configure the registry mirror and the private registries
# mkdir -p /etc/docker
# cat > /etc/docker/daemon.json <<EOF
{
"registry-mirrors": ["https://vty0b0ux.mirror.aliyuncs.com"],
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "500m"
},
"storage-driver": "overlay2",
"insecure-registries": ["registry.demo.com","192.168.59.249:5000"]
}
EOF
# systemctl restart docker
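An optional sanity check, not part of the original steps, to confirm that the daemon picked up the systemd cgroup driver, the overlay2 storage driver and the registry mirror:
# docker info 2>/dev/null | grep -A1 -E 'Cgroup Driver|Storage Driver|Registry Mirrors'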
## On k8s01, copy the files to the other nodes
# for i in {2..3};do scp /usr/bin/docker* ceph0$i:/usr/bin/;scp -rp /etc/docker ceph0$i:/etc/;scp /usr/lib/systemd/system/docker.service ceph0$i:/usr/lib/systemd/system/;scp /usr/lib/systemd/system/docker.socket ceph0$i:/usr/lib/systemd/system/;done
## On k8s02-03
# useradd -s /sbin/nologin docker
# systemctl daemon-reload && systemctl enable docker --now
- Install cephadm (starting with the Octopus release, the ceph-deploy tool is no longer supported)
Prerequisites for cephadm:
- Python 3
- Systemd
- Podman or Docker
- Chrony or NTP
- LVM2
How cephadm works: the cephadm command manages the complete lifecycle of a Ceph cluster, including bootstrapping the initial cluster and starting a container that provides a shell for managing it; cephadm communicates with the cluster nodes over ssh. cephadm bootstrap creates a small Ceph cluster on a single node, consisting of one ceph monitor and one ceph mgr plus monitoring components such as prometheus and node-exporter. A rough prerequisite check is sketched below.
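A quick way to verify the prerequisites on each node, using only the tools provided by the packages listed above:
# python3 --version
# systemctl --version | head -1
# docker --version        ## or: podman --version
# chronyc tracking | head -2
# lvm version | head -1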
## Run on all ceph nodes
# CEPH_RELEASE=17.2.7
# curl --silent --remote-name --location https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
# chmod +x cephadm
# mv cephadm /usr/sbin/
## Running cephadm install would install the cephadm-related packages from the distro repositories on the current node, but those versions are old, so it is not recommended
# cephadm install
ERROR: Distro openeuler version 22.03 not supported
## Edit /usr/sbin/cephadm and add openeuler to the DISTRO_NAMES dictionary
# vi /usr/sbin/cephadm
DISTRO_NAMES = {
    'centos': ('centos', 'el'),
    'rhel': ('centos', 'el'),
    'scientific': ('centos', 'el'),
    'rocky': ('centos', 'el'),
    'openeuler': ('centos', 'el'),
    'almalinux': ('centos', 'el'),
    'ol': ('centos', 'el'),
    'fedora': ('fedora', 'fc'),
    'mariner': ('mariner', 'cm'),
}
## Copy the patched cephadm file to the other nodes
# for i in {2..3};do scp -rp /usr/sbin/cephadm ceph0$i:/usr/sbin/;done
- Check whether each ceph node meets the requirements for installing a Ceph cluster. The check has to be run on the node being checked; for example, to verify whether ceph02 can host the cluster, run it on ceph02 (a sketch follows).
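A minimal sketch, assuming cephadm's built-in host check is used for this verification:
## Run on the node being checked, e.g. on ceph02
# cephadm check-host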
- Bootstrap the first mon
The cephadm bootstrap process creates a small Ceph cluster on a single node, consisting of one ceph monitor and one ceph mgr plus monitoring components such as prometheus and node-exporter.
## The bootstrap below specifies the mon IP, the cluster network, and the initial dashboard user and password
# cephadm --image registry.demo.com/ceph/ceph:v17.2.7 \
    bootstrap --fsid 96213b64-2921-11ee-b082-000c2921faf1 \
    --mon-ip 192.168.61.241 \
    --cluster-network 10.168.61.0/24 \
    --initial-dashboard-user admin \
    --initial-dashboard-password demo2024 \
    --ssh-private-key /root/.ssh/id_rsa \
    --ssh-public-key /root/.ssh/id_rsa.pub \
    --registry-url registry.demo.com \
    --registry-username admin \
    --registry-password Harbor12345
# ls /etc/ceph/
ceph.client.admin.keyring  ceph.conf  rbdmap
- ceph.client.admin.keyring is the keyring with Ceph administrator privileges
- ceph.conf is the minimal configuration file
## Specify the dashboard user and password
--initial-dashboard-user admin
--initial-dashboard-password demo2024
## Specify the private and public ssh keys
--ssh-private-key /root/.ssh/id_rsa
--ssh-public-key /root/.ssh/id_rsa.pub
## Do not pull the default images before starting
--skip-pull
## Specify the private image registry
--registry-url registry.demo.com \
--registry-username admin \
--registry-password Harbor12345
## The monitoring component images also need to be specified
ceph config set mgr mgr/cephadm/container_image_prometheus registry.demo.com/prometheus/prometheus:v2.43.0
ceph config set mgr mgr/cephadm/container_image_grafana registry.demo.com/ceph/ceph-grafana:9.4.7
ceph config set mgr mgr/cephadm/container_image_alertmanager registry.demo.com/prometheus/alertmanager:v0.23.0
ceph config set mgr mgr/cephadm/container_image_node_exporter registry.demo.com/prometheus/node-exporter:v1.5.0
With three or more ceph nodes, cephadm will by default schedule mon daemons onto several of them (the default placement is count:5), which can be seen in the ceph orch ls output:
# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094      1/1  7m ago     46m  count:1
crash                           1/1  7m ago     46m  *
grafana        ?:3000           1/1  7m ago     46m  count:1
mgr                             1/2  7m ago     46m  count:2
mon                             1/5  7m ago     46m  count:5
node-exporter  ?:9100           1/1  7m ago     46m  *
prometheus     ?:9095           1/1  7m ago     46m  count:1
After bootstrapping the mon, the cluster is still in the WARN state: there are no OSDs yet, and there is only one MON and one MGR, so the next step is to add the remaining ceph nodes.
11. Add ceph nodes
Use cephadm to add the hosts to the storage cluster. After the add command is executed, the ceph and node-exporter images are pulled on the target node, which takes some time, so the images can be imported on the nodes in advance (a pre-load sketch follows step 12 below). The cluster state before the nodes are added:
# ceph -s
  cluster:
    id:     96213b64-2921-11ee-b082-000c2921faf1
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum k8s01 (age 11m)
    mgr: k8s01.xwbcem(active, since 9m)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
## Add the nodes
# cephadm shell ceph orch host add k8s02 192.168.61.242 --labels=mon,mgr
# ceph orch host add k8s03 192.168.61.243 --labels=mon
## View the nodes that have joined the cluster
# ceph orch host ls
HOST   ADDR            LABELS    STATUS
k8s01  192.168.61.241  _admin
k8s02  192.168.61.242  mon mgr
k8s03  192.168.61.243  mon
3 hosts in cluster
12. Add and remove node labels
Once the nodes are labelled, later daemon placement can be orchestrated by label.
The _admin label: by default, the _admin label is applied to the bootstrapped host of the storage cluster, and the client.admin keyring is distributed to that host (managed via ceph orch client-keyring {ls|set|rm}). After this label is added to other hosts, those hosts will also get the client.admin keyring under /etc/ceph.
## Add the _admin label to k8s02 and k8s03
# ceph orch host label add k8s02 _admin
# ceph orch host label add k8s03 _admin
## Add the mon label to k8s02 and k8s03
# ceph orch host label add k8s02 mon
# ceph orch host label add k8s03 mon
## Add the mgr label to k8s01 and k8s02
# ceph orch host label add k8s01 mgr
# ceph orch host label add k8s02 mgr
## List the hosts and check their labels
# ceph orch host ls
HOST   ADDR            LABELS          STATUS
k8s01  192.168.61.241  _admin,mgr
k8s02  192.168.61.242  mon,mgr,_admin
k8s03  192.168.61.243  mon,_admin
3 hosts in cluster
Removing labels: note that removing the _admin label from a node does not delete the ceph.client.admin.keyring file that already exists on that node.
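As noted in step 11, adding a host pulls the ceph and node-exporter images onto it; a possible way to pre-load them, assuming the new nodes can reach the private registry registry.demo.com used above:
## Run on k8s02 and k8s03 before adding them
# docker pull registry.demo.com/ceph/ceph:v17.2.7
# docker pull registry.demo.com/prometheus/node-exporter:v1.5.0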
- Add OSDs
Note: before adding OSDs, it is recommended to wipe the disks back to raw, partition-free devices.
## https://rook.github.io/docs/rook/v1.10/Getting-Started/ceph-teardown/?h=sgdisk#zapping-devices
DISK="/dev/sdX"
## Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK
## Wipe a large portion of the beginning of the disk to remove more LVM metadata that may be present
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
## SSDs may be better cleaned with blkdiscard instead of dd
blkdiscard $DISK
## Inform the OS of partition table changes
partprobe $DISK
## Check which disks on each ceph node are usable; look at the AVAILABLE column
# ceph orch device ls
HOST   PATH          TYPE  DEVICE ID                                            SIZE  AVAILABLE  REFRESHED  REJECT REASONS
k8s01  /dev/nvme0n2  ssd   VMware_Virtual_NVMe_Disk_VMware_NVME_0000            200G  Yes        16m ago
k8s01  /dev/sda      hdd   VMware_Virtual_SATA_Hard_Drive_00000000000000000001  200G  Yes        16m ago
k8s02  /dev/nvme0n2  ssd   VMware_Virtual_NVMe_Disk_VMware_NVME_0000            200G  Yes        4m ago
k8s02  /dev/sda      hdd   VMware_Virtual_SATA_Hard_Drive_00000000000000000001  200G  Yes        4m ago
k8s03  /dev/nvme0n2  ssd   VMware_Virtual_NVMe_Disk_VMware_NVME_0000            200G  Yes        3m ago
k8s03  /dev/sda      hdd   VMware_Virtual_SATA_Hard_Drive_00000000000000000001  200G  Yes        3m ago
## Next, initialize the OSD devices
## Wipe the specified disks back to raw, partition-free devices
# blkdiscard /dev/nvme0n2
# cephadm shell ceph orch device zap k8s01 /dev/sda
# cephadm shell ceph orch device zap k8s01 /dev/nvme0n2
## Then initialize the disks on the other nodes in the same way
## Add the OSDs
# ceph orch daemon add osd k8s01:/dev/sda
# ceph orch daemon add osd k8s01:/dev/nvme0n2
# ceph orch daemon add osd k8s02:/dev/sda
# ceph orch daemon add osd k8s02:/dev/nvme0n2
# ceph orch daemon add osd k8s03:/dev/sda
# ceph orch daemon add osd k8s03:/dev/nvme0n2
- Add pools