杜亚琼

[Q&A]

Out of memory with a batch size of 3

I am using an Intel AI node to train a deep network in PyTorch. However, I get an out-of-memory error when I run the program. My training data is 1 GB in size and is loaded in batches of 3. I have no options left to further reduce the memory requirements. Please help.

Replies (10)

汤宇

2018-11-14 11:53:23
Hi Jhilik,
 
Could you please attach a screenshot of the error?
Also, please confirm whether you are using Jupyter Hub or a PuTTY/SSH terminal.
 
Regards,
Anju

李玉鑫

2018-11-14 12:06:06
Quote: jerry1978 posted on 2018-11-14 10:06
Hi Jhilik,
Could you please attach a screenshot of the error? Also, please confirm whether you are using Jupyter Hub or a PuTTY/SSH terminal.
Regards, Anju

This is the error I get. I am using SSH to connect.
 
Last login: Tue Sep  4 03:54:13 2018 from 10.5.0.7
[u19304@c009 ~]$ source activate en
(en) [u19304@c009 ~]$ cd bum
(en) [u19304@c009 bum]$ python3 bumcpu.py
Traceback (most recent call last):
  File "bumcpu.py", line 211, in <module>
    training_set = DLibdata(train=True)
  File "/home/u19304/bum/loaddata.py", line 46, in __init__
    self.train_data = torch.load('trn.pt')
  File "/home/u19304/.conda/envs/en/lib/python3.6/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/u19304/.conda/envs/en/lib/python3.6/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
  File "/home/u19304/.conda/envs/en/lib/python3.6/site-packages/torch/serialization.py", line 508, in persistent_load
    data_type(size), location)
RuntimeError: $ Torch: not enough memory: you tried to allocate 2GB. Buy new RAM! at /opt/conda/conda-bld/pytorch-cpu_1532576596369/work/aten/src/TH/THGeneral.cpp:204

汤宇

2018-11-14 12:23:10
Quote: cd340823 posted on 2018-11-14 10:18
This is the error I get. I am using SSH to connect.
Last login: Tue Sep 4 03:54:13 2018 from 10.5.0.7
$ source activate en

Hi Jhilik,

You are trying to run the program on the login node. Login nodes are not designed to take heavy workloads. All compute-intensive jobs have to be run on compute nodes.

To do this, you can use either of the following options:
1. Type qsub -I. This will give you an interactive terminal on one of the compute nodes. You can execute your program there.
2. Wrap all your bash commands in a script file (say "job.sh") and run "qsub job.sh". This will submit your job to the scheduler, which will take the script and execute it on a compute node.
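
As a rough illustration of option 2, the sketch below writes a minimal job.sh and submits it if qsub is available. It assumes a PBS-style scheduler, that the job is submitted from the directory containing bumcpu.py, and that conda is on the PATH inside the job; the job name "bumcpu" and the -j oe setting are illustrative choices, not something the cluster requires.

```shell
#!/bin/bash
# Write a minimal PBS job script (job.sh) into the current directory.
# The quoted 'EOF' keeps $PBS_O_WORKDIR literal so it is expanded by the job, not here.
cat > job.sh <<'EOF'
#!/bin/bash
#PBS -N bumcpu
#PBS -j oe
# -N sets a job name (hypothetical); -j oe merges stdout and stderr into one file.

# Move to the directory the job was submitted from (assumed to hold bumcpu.py).
cd "$PBS_O_WORKDIR"

# Activate the conda environment from the session above and run the training script.
source activate en
python3 bumcpu.py
EOF

# Submit only where a scheduler client exists, so the sketch is harmless elsewhere.
if command -v qsub >/dev/null 2>&1; then
    qsub job.sh
else
    echo "qsub not found: run this on the cluster login node"
fi
```

With this approach the login node only queues the job; the torch.load call that failed above then runs on a compute node.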

For more details on this, please refer to the following documents:
https://communities.intel.com/docs/DOC-112425
https://communities.intel.com/docs/DOC-112294
https://communities.intel.com/docs/DOC-112293
https://communities.intel.com/docs/DOC-112422
https://communities.intel.com/thread/127653

Regards,
Anju

杜亚琼

2018-11-14 12:29:02
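Quote: jerry1978 posted on 2018-11-14 10:35
Hi Jhilik, you are trying to run the program on the login node.
Login nodes are not designed to take heavy workloads.
All compute-intensive jobs have to be run on compute nodes. To do this, you can use either of the following options: 1.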

Hi Jhilik,

Could you please confirm whether the solution provided helped?

Regards,
Anju