/usr/bin/mpirun -np 16 \
  --map-by ppr:4:socket --bind-to socket \
  --hostfile ./hostfile \
  --allow-run-as-root \
  --tag-output \
  --report-bindings \
  --mca pml ob1 \
  --mca btl ^openib \
  --mca btl_tcp_if_exclude lo,docker0,bond0 \
  --wdir /home/deeprec \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_IB_GID_INDEX=3 \
  -x NCCL_IB_HCA=mlx5 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_TIMEOUT=25 \
  -x NCCL_IB_RETRY_CNT=7 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x TF_GPU_CUPTI_FORCE_CONCURRENT_KERNEL=1 \
  -x JAVA_HOME=/opt/jdk/jdk1.8 \
  -x START_STATISTIC_STEP=100 \
  -x LIBHDFS_OPTS=-Dhadoop.root.logger=WARN,console \
  -x STOP_STATISTIC_STEP=110 \
  -x MEM_USAGE_STRATEGY=251 \
  -x JEMALLOC_PATH=/home/deeprec \
  -x SEC_TOKEN_PATH=/home/deeprec/tokens_sectoken \
  -x TF_SCRIPT=train.py \
  -x YARN_APP_ID=application_1681844181995_4023507 \
  -x TF_WORKSPACE=/home/deeprec \
  -x HADOOP_HDFS_HOME=/opt/yarn/hadoop \
  -x HADOOP_TOKEN_FILE_LOCATION=/home/deeprec/container_tokens \
  -x PYTHONPATH=/usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg: \
  -x PATH=/opt/yarn/hadoop/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  -x APPLICATION_ID=application_1681844181995_4023507 \
  -x LD_LIBRARY_PATH=/opt/jdk/jdk1.8/jre/lib/amd64/server:/opt/yarn/hadoop/lib/native:/opt/yarn/hadoop/lib/native \
  python train.py \
    --output_dir=hdfs:///user/xxx/deeprec \
    --data_location=hdfs:///user/xxx/criteo_1tb \
    --protocol=grpc \
    --smartstaged=false \
    --batch_size=2048 \
    --steps=30000 \
    --ev=true \
    --ev_elimination=l2 \
    --ev_filter=counter \
    --op_fusion=true \
    --input_layer_partitioner=0 \
    --dense_layer_partitioner=16 \
    --group_embedding=collective \
    --workqueue=true \
    --parquet_dataset=false
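For reference, with -np 16 and --map-by ppr:4:socket (4 ranks per socket, i.e. 8 per dual-socket node), the ./hostfile referenced above would normally list the two container hosts with 8 slots each. A minimal sketch, assuming dual-socket machines and that the containers resolve each other as node1 and node2 (hypothetical hostnames):

# ./hostfile — one line per container host, 8 ranks (one per GPU) each
node1 slots=8
node2 slots=8

A quick dry run that replaces the training command with hostname can confirm that mpirun reaches both containers over SSH and places 8 ranks on each before any DeepRec code is involved:

/usr/bin/mpirun -np 16 --hostfile ./hostfile --allow-run-as-root \
  --map-by ppr:4:socket --bind-to socket --report-bindings hostname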
We are still using the DeepFM model on Machine Learning PAI. Last time I got single-machine multi-GPU training working; this time I want to try multi-machine multi-GPU, scheduled on YARN. SSH is already configured and MPI runs fine across the machines. Could you help me look into this problem? The setup is: YARN first launches two large containers, each essentially occupying one physical machine (8x A100 GPUs); SSH is then set up between the containers, and MPI is used inside the containers to launch the DeepRec processes.
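One way to narrow the problem down is to first verify that NCCL itself works across the two containers with the same MPI and NCCL settings, independent of DeepRec. A minimal sketch using the nccl-tests all_reduce_perf benchmark (the binary path /opt/nccl-tests/build/all_reduce_perf is an assumption; build nccl-tests inside the container image first):

/usr/bin/mpirun -np 16 --hostfile ./hostfile --allow-run-as-root \
  --map-by ppr:4:socket --bind-to socket \
  --mca pml ob1 --mca btl ^openib \
  --mca btl_tcp_if_exclude lo,docker0,bond0 \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5 \
  -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO \
  /opt/nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

If this run hangs or errors out, the issue is in the NCCL / InfiniBand / container network configuration rather than in train.py; if it passes, comparing the NCCL_DEBUG=INFO output of the training job against this baseline is a reasonable next step.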