[1,7]:Registered kernels:
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,7]:
[1,7]:  [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
[1,7]:
[1,7]:During handling of the above exception, another exception occurred:
[1,7]:
[1,7]:Traceback (most recent call last):
[1,7]:  File "train.py", line 887, in <module>
[1,7]:    main()
[1,7]:  File "train.py", line 642, in main
[1,7]:    train(sess_config, hooks, model, train_init_op, train_steps,
[1,7]:  File "train.py", line 505, in train
[1,7]:    with tf.train.MonitoredTrainingSession(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 655, in MonitoredTrainingSession
[1,7]:    return MonitoredSession(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1085, in __init__
[1,7]:    super(MonitoredSession, self).__init__(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 800, in __init__
[1,7]:    self._sess = _RecoverableSession(self._coordinated_creator)
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1282, in __init__
[1,7]:    _WrappedSession.__init__(self, self._create_session())
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in _create_session
[1,7]:    return self._sess_creator.create_session()
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 953, in create_session
[1,7]:    self.tf_sess = self._session_creator.create_session()
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in create_session
[1,7]:    return self._get_session_manager().prepare_session(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
[1,7]:    sess.run(init_op, feed_dict=init_feed_dict)
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
[1,7]:    result = self._run(None, fetches, feed_dict, options_ptr,
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
[1,7]:    results = self._do_run(handle, final_targets, final_fetches,
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
[1,7]:    return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
[1,7]:    raise type(e)(node_def, op, message)
[1,7]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1], rank=7, id_in_local_rank=0]
[1,7]:Registered devices: [CPU, XLA_CPU]
[1,7]:Registered kernels:
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,7]:
[1,7]:  [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]

Could you please take a look? Has this problem come up before with Machine Learning PAI? I am still using the DeepFM model. Last time I got single-machine multi-GPU training to run; this time I want to try multi-machine multi-GPU training, scheduled on YARN. SSH is fully configured, and MPI runs successfully across the machines.
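One detail in the log stands out: every registered kernel for `PreprocessingForward` targets `GPU`, while rank 7 reports only `CPU` and `XLA_CPU` as registered devices. The following is a minimal sketch, assuming the message format shown in the log above, that pulls both lists out of such an error text so the mismatch can be checked per rank. The function name `summarize_kernel_error` and the `sample` string are hypothetical and written only for illustration; they are not part of TensorFlow or PAI.

```python
import re


def summarize_kernel_error(message):
    """Parse a 'No OpKernel was registered' message.

    Returns (registered_devices, kernel_devices) as two sets of strings.
    Hypothetical helper: it only assumes the text layout seen in the log.
    """
    # Drop mpirun rank tags such as "[1,7]:" so they cannot split tokens.
    text = re.sub(r"\[\d+,\d+\]:", "", message)
    # Normalise typographic quotes that forum software may have introduced.
    text = text.replace("\u2019", "'").replace("\u2018", "'")

    # Devices the process actually registered, e.g. "[CPU, XLA_CPU]".
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", text)
    registered = set()
    if match:
        registered = {d.strip() for d in match.group(1).split(",") if d.strip()}

    # Devices for which the op has a kernel, e.g. "device='GPU'".
    kernel_devices = set(re.findall(r"device='([^']+)'", text))
    return registered, kernel_devices


# Shortened excerpt in the same layout as the log (illustrative only).
sample = (
    "[1,7]:Registered devices: [CPU, XLA_CPU] "
    "[1,7]:Registered kernels: "
    "[1,7]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32] "
    "[1,7]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]"
)

registered, kernel_devices = summarize_kernel_error(sample)
missing = kernel_devices - registered

print("registered devices:", sorted(registered))
print("devices with a kernel:", sorted(kernel_devices))
print("needed but not registered:", sorted(missing))
```

If `GPU` shows up as needed but not registered, it may be worth checking whether the process on that worker can see its GPUs at all (for example the driver, the CUDA libraries, and `CUDA_VISIBLE_DEVICES` inside the YARN container). This is one reading of the message, not a confirmed diagnosis.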