HPC Series (11): Node States


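Earlier posts covered how to check job status; this one covers how to check node status. As with jobs, node status is available from the command line, in two ways: sinfo and scontrol show node. The tools themselves are very easy to use, but the node states they report take more effort to understand. Let's start with the easy part, the command-line tools.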

Command-line tools

sinfo

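The simplest way to view node information is to run sinfo directly. It prints every partition (queue) in the system together with the basic state of its nodes. If you only need a quick look at which nodes are available, the output of sinfo is enough.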

sinfo

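sinfo also accepts a few simple options for formatting the output or filtering specific nodes. They are straightforward and can be listed with sinfo --help, so they are not covered in detail here.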
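As a rough illustration of the formatting options, the sketch below asks sinfo for one line per node and turns the result into a dictionary. It assumes that the -N, -h, and -o "%N %T" options (node-oriented output, no header, node name plus extended state) behave on your installation as the sinfo man page describes. The helper names parse_sinfo and node_states and the sample node names are made up for this example.

```python
import subprocess


def parse_sinfo(text):
    """Parse lines of the form '<node> <state>' into a {node: state} dict."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        states[node] = state
    return states


def node_states():
    """Run sinfo and return {node: state}; needs a working Slurm client."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    )
    return parse_sinfo(result.stdout)


# Hypothetical output, so the parser can be tried without a cluster.
sample = "node01 idle\nnode02 mixed\nnode03 down*\n"
print(parse_sinfo(sample))
```

A node that belongs to several partitions appears once per partition in this output; the dictionary simply keeps the last line seen for each node.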


scontrol

For more detailed node status information, use the command

scontrol show node [nodename]

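This command prints detailed information about the node, including its resource usage and load as well as system, software, and boot information. If you want to learn the details of a specific node through Slurm, this is the command to use.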
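The output of scontrol show node is a series of Key=Value pairs, so a few lines of Python are enough to pick fields out of it. The sketch below is a simplification: it splits on whitespace, so any value that itself contains spaces is cut short. The function name parse_node_record and the sample record are hypothetical, and the exact set of fields may differ between Slurm versions.

```python
def parse_node_record(text):
    """Parse whitespace-separated Key=Value tokens into a dict.

    Simplification: values that contain spaces (for example a free-text
    reason) are cut off at the first space.
    """
    record = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key:
            record[key] = value
    return record


# Hypothetical, heavily shortened record for a node called node01.
sample = """NodeName=node01 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=4 CPUTot=16 CPULoad=3.98
   RealMemory=64000 AllocMem=16000
   State=MIXED ThreadsPerCore=1"""

info = parse_node_record(sample)
print(info["NodeName"], info["State"], info["CPUAlloc"], "/", info["CPUTot"])
```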


Slurm node states

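Slurm has a large number of node states, and a displayed state consists of two parts: the state itself and an optional special character. The special character is a suffix that marks a flag associated with the node. Combined, the resulting node states are even more varied, and inexperienced users may find them hard to interpret. Below I go through each state using the official documentation plus my own understanding.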

Special characters

The suffix characters have the following meanings:

“*”:

The node is presently not responding and will not be allocated any new work. If the node remains non-responsive, it will be placed in the DOWN state (except in the case of COMPLETING, DRAINED, DRAINING, FAIL, FAILING nodes).

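In other words, the node is not responding and receives no new jobs. If it stays unresponsive, the system marks it DOWN, unless it is in the COMPLETING, DRAINED, DRAINING, FAIL, or FAILING state.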

“~”:

The node is presently in a power saving mode (typically running at reduced frequency).

The node is in power-saving mode, usually running at a reduced clock frequency.

“#”:

The node is presently being powered up or configured.

The node is booting or being configured.

“$”:

The node is currently in a reservation with a flag value of “maintenance”.

The node belongs to a reservation flagged as maintenance (see the documentation on resource reservations).

“@”:

The node is pending reboot.

The node is waiting to be rebooted.
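Because the flag is always a suffix, a state string such as the ones sinfo prints can be separated into its two parts mechanically. Here is a minimal sketch, assuming the only suffix characters are the five listed in this section; split_state is a hypothetical helper name.

```python
FLAG_CHARS = "*~#$@"  # the suffix characters described in this section


def split_state(state):
    """Split a state string such as 'idle*' into ('IDLE', '*')."""
    base = state.rstrip(FLAG_CHARS)
    flags = state[len(base):]
    return base.upper(), flags


for s in ["idle", "down*", "idle~", "allocated+", "mixed#"]:
    print(s, "->", split_state(s))
```

Note that the trailing + of ALLOCATED+ is deliberately not treated as a flag here, since the documentation lists ALLOCATED+ as a state in its own right.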

Node states

  • ALLOCATED

The node has been allocated to one or more jobs.

One or more jobs currently hold this node.

  • ALLOCATED+

The node is allocated to one or more active jobs plus one or more jobs are in the process of COMPLETING.

The node is allocated to one or more jobs, and some of those jobs are finishing up, that is, they are in the COMPLETING state.

  • COMPLETING

All jobs associated with this node are in the process of COMPLETING. This node state will be removed when all of the job’s processes have terminated and the Slurm epilog program (if any) has terminated. See the Epilog parameter description in the slurm.conf man page for more information.

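Every job associated with this node is finishing up. Slurm can be configured to run a program after a job ends (see the Epilog parameter in slurm.conf); the state is cleared once all job processes and the epilog program have terminated.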

  • DOWN

The node is unavailable for use. Slurm can automatically place nodes in this state if some failure occurs. System administrators may also explicitly place nodes in this state. If a node resumes normal operation, Slurm can automatically return it to service. See the ReturnToService and SlurmdTimeout parameter descriptions in the slurm.conf(5) man page for more information.

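The node is unavailable. Slurm puts a node into this state automatically when certain failures occur, and administrators can also set it by hand. Once the node is working normally again, Slurm can return it to service automatically, depending on the ReturnToService and SlurmdTimeout settings in slurm.conf.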

  • DRAINED

The node is unavailable for use per system administrator request. See the update node command in the scontrol(1) man page or the slurm.conf(5) man page for more information.

The node has been taken out of service at a system administrator's request and accepts no jobs.

  • DRAINING

The node is currently executing a job, but will not be allocated to additional jobs. The node state will be changed to state DRAINED when the last job on it completes.

The node is still running jobs but will not receive any new ones. When the last job on it finishes, the state changes to DRAINED.

  • ERROR

The node is currently in an error state and not capable of running any jobs. Slurm can automatically place nodes in this state if some failure occurs. System administrators may also explicitly place nodes in this state. If a node resumes normal operation, Slurm can automatically return it to service. See the ReturnToService and SlurmdTimeout parameter descriptions in the slurm.conf(5) man page for more information.

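The node is in an error state and cannot run any jobs. Slurm can set this state automatically after a failure, and administrators can set it by hand. Once the node is working normally again, Slurm can return it to service automatically, depending on the ReturnToService and SlurmdTimeout settings in slurm.conf.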

  • FAIL

The node is expected to fail soon and is unavailable for use per system administrator request.

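The node is about to become unavailable. (This is similar to DRAIN; as I understand it, DRAIN is a deliberate refusal to place jobs on the node, whereas FAIL means jobs cannot be placed on it.)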

  • FAILING

The node is currently executing a job, but is expected to fail soon and is unavailable for use per system administrator request.

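The node is running jobs but is about to become unavailable. (This is very similar to DRAINING; the difference, as I understand it, is that the jobs running on the node will not be able to finish.)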

  • FUTURE

The node is currently not fully configured, but expected to be available at some point in the indefinite future for use.

The node is not fully configured yet, but is expected to become usable at some unspecified point in the future.

  • IDLE

The node is not allocated to any jobs and is available for use.

The node has no jobs allocated and is ready for use.

  • MAINT

The node is currently in a reservation with a flag value of “maintenance”.

That is, the node is held by a maintenance reservation (see the documentation on resource reservations).

  • REBOOT

The node is currently scheduled to be rebooted.

A reboot has been scheduled for the node.

  • MIXED

The node has some of its CPUs ALLOCATED while others are IDLE.

Only part of the node's resources is allocated.

  • PERFCTRS (NPC)

Network Performance Counters associated with this node are in use, rendering this node as not usable for any other jobs.

The network performance counters on this node are in use, so the node cannot take any other jobs.

  • POWER_DOWN

The node is currently powered down and not capable of running any jobs.

The node is powered off and cannot run any jobs. This state is set by the power-saving mechanism; see the SuspendProgram and ResumeProgram settings in slurm.conf.

  • POWER_UP

The node is currently in the process of being powered up.

The node is powering on. This state is set by the power-saving mechanism; see the SuspendProgram and ResumeProgram settings in slurm.conf.

  • RESERVED

The node is in an advanced reservation and not generally available.

The node is part of an advanced reservation and is generally not available (see the resource reservation configuration).

  • UNKNOWN

The Slurm controller has just started and the node’s state has not yet been determined.

The Slurm controller has only just started and has not determined the node's state yet.

State categories

Although there are many states, they fall into a few broad categories:

  • Working: ALLOCATED ALLOCATED+ COMPLETING DRAINING FAILING MIXED
  • Idle: IDLE
  • Unavailable: DOWN DRAINED ERROR FAIL FUTURE MAINT REBOOT PERFCTRS POWER_DOWN POWER_UP RESERVED, or any state carrying one of the suffix characters * ~ # $ @
  • Unknown: UNKNOWN
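This grouping is easy to turn into a small lookup. The sketch below is only one reading of the list: it treats every suffix character from the special-characters section as a sign that the node is unavailable, and it places POWER_DOWN in the unavailable group because its description says the node cannot run jobs. Both choices are assumptions of this example rather than anything Slurm defines, and classify is a hypothetical helper name.

```python
WORKING = {"ALLOCATED", "ALLOCATED+", "COMPLETING", "DRAINING", "FAILING", "MIXED"}
UNAVAILABLE = {
    "DOWN", "DRAINED", "ERROR", "FAIL", "FUTURE", "MAINT", "REBOOT",
    "PERFCTRS", "POWER_DOWN", "POWER_UP", "RESERVED",
}
FLAG_CHARS = "*~#$@"  # assumption: any of these suffixes means "unavailable"


def classify(state):
    """Map a state string such as 'idle*' to one of four categories."""
    base = state.rstrip(FLAG_CHARS)
    has_flag = len(base) < len(state)
    base = base.upper()
    if has_flag or base in UNAVAILABLE:
        return "unavailable"
    if base in WORKING:
        return "working"
    if base == "IDLE":
        return "idle"
    return "unknown"  # UNKNOWN and anything not listed above


for s in ["idle", "idle*", "mixed", "allocated+", "drained", "unknown"]:
    print(s, "->", classify(s))
```

The flag check comes first on purpose, so that a state like idle* is reported as unavailable rather than idle.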