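Motivation: My research project often requires working in the high-performance computing environment of the National Center for High-performance Computing (NCHC), so I am writing the steps down to avoid forgetting them or making mistakes.
Environment:
1. Windows 10 laptop
2. WSL2
3. Chrome browser
4. Network connection
Steps: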
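1. Log in to NCHC iService (account / password: user999@mail.nuk.edu.tw / password)
(1). Create a [Development Container]:
Click [+Create], click [Custom Image], and pick the image from the "Image" drop-down: cuda-10.1-cudnn7-devel-ubuntu18.04:tadnn999999; under Basic settings select [cm.xsuper, GPUx2, CPUx8, memory: 120 GB, shared memory: 60 GB]; click [Next: Storage], click [Next: Review + Create], click [Create]
Wait while the container shows Initializing... until it is created, about 60 seconds
(2). Done
>>> Development container name: ctr9999999999999, ssh u9999999@203.145.216.149 -p 99999 (click the square icon to the right of SSH to copy the address)
>>> See https://www.twcc.ai/user/container/detail/9999999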
2. Run programs from WSL2 (using tmux):
Open Windows Terminal / Ubuntu-20.04
(1). Install tmux:
davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ sudo apt install tmux
(2). Use tmux:
Enter the tmux commands:
# Create a new session
davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ tmux new -s twcc
davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ ssh u9999999@203.145.216.149 -p 99999
Enter [yes]
Enter [password]
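Retyping the address and port on every login is easy to get wrong. As a rough sketch, the values copied from the container page could be turned into an OpenSSH client config entry. The function name twcc_ssh_config and the alias twcc are made up for illustration; the user, address, and port are the placeholder values from this post, and the 60-second keep-alive is my own assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TWCC): print an OpenSSH client config
# entry for the container, so that "ssh twcc" can replace the full command.
twcc_ssh_config() {
    # $1 = user, $2 = address, $3 = port, $4 = alias (defaults to "twcc")
    printf 'Host %s\n' "${4:-twcc}"
    printf '    HostName %s\n' "$2"
    printf '    User %s\n' "$1"
    printf '    Port %s\n' "$3"
    # Assumed value: send a keep-alive every 60 seconds on idle connections.
    printf '    ServerAliveInterval 60\n'
}

# Example with the placeholder values from this post. Review the output
# first, then append it to ~/.ssh/config yourself:
#   twcc_ssh_config u9999999 203.145.216.149 99999 >> ~/.ssh/config
```

The function only prints the entry and never writes to ~/.ssh/config on its own, so nothing changes until the output has been checked.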
3. Download the tadnn code written by 正齡:
u9999999@vd6dcjctr9999999999999-xnpd4:~$ cd /work/u9999999/
u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ git clone https://gitlab.ical.tw/jamesljlster/tadnn.git
Username for 'https://gitlab.ical.tw': [user999]
Password for 'https://user999@gitlab.ical.tw': [password]
xxx 4. System update
xxx u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ sudo apt update
xxx u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ sudo apt upgrade
5. Install Miniconda3 Linux 64-bit
Download: u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ chmod +x Miniconda3-latest-Linux-x86_64.sh
u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ ./Miniconda3-latest-Linux-x86_64.sh
Press [ENTER]
Enter [yes]
Press [ENTER]
Enter [yes]
u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ source ~/.bashrc
(base) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$
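The four prompts above can be skipped with the installer's batch mode, which is handy when a container has to be rebuilt. A minimal sketch, assuming the install location $HOME/miniconda3; the function name install_miniconda and the PREFIX variable are illustrative, not something from the Miniconda documentation.

```shell
#!/usr/bin/env bash
# Assumed install location; override by setting PREFIX before calling.
PREFIX="${PREFIX:-$HOME/miniconda3}"
INSTALLER="Miniconda3-latest-Linux-x86_64.sh"

# Hypothetical helper: download and install Miniconda without prompts.
install_miniconda() {
    wget "https://repo.anaconda.com/miniconda/$INSTALLER" || return 1
    # -b: batch mode (no prompts; running it implies accepting the license)
    # -p: install prefix
    bash "$INSTALLER" -b -p "$PREFIX" || return 1
    # Batch mode leaves ~/.bashrc alone, so register conda explicitly.
    "$PREFIX/bin/conda" init bash
}

# Run it, then reload the shell configuration to get the (base) prompt:
#   install_miniconda && source ~/.bashrc
```

Because -b implies agreement with the license terms, it is still worth reading them once in the interactive installer.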
6. Install and configure the packages tadnn needs
(base) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ conda create -n tadnn python=3 -y
(base) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ conda activate tadnn
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ cd tadnn
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn$ pip install gpustat
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn$ conda install pytorch torchvision cudatoolkit=10.2 tqdm pandas opencv matplotlib -c pytorch -y
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn$ cd build
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/build$ ./cmake_clean.sh
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/build$ ./conda_build.sh
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/build$ cd ../pytorch/
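Before moving on, it can help to confirm that the new environment actually sees the container's GPUs. A minimal sketch, assuming the tadnn environment is active; check_gpu is an illustrative name. torch.cuda.is_available() and torch.cuda.device_count() are standard PyTorch calls, and gpustat is the tool installed with pip earlier in this step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report what PyTorch and gpustat see.
check_gpu() {
    # Print the PyTorch version, whether CUDA is usable, and the GPU count.
    python -c 'import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())' || return 1
    # Show per-GPU utilization and memory usage.
    gpustat
}

# Usage, inside the activated environment:
#   check_gpu
```

With the GPUx2 configuration chosen in step 1, I would expect the last value on the first line to be 2; if CUDA is reported as unavailable, the cudatoolkit version is the first thing I would check.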
7. Download the STL-10 dataset
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ sudo apt install nano
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ nano stl10.py (on line 11, set download=True so that STL10 is downloaded; also set numWorkers=0, otherwise a multiprocessing error occurs)
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ python stl10.py
Class Names: ('airplane', 'bird', 'car', 'cat', 'deer', 'dog', 'horse', 'monkey', 'ship', 'truck')
airplane ship truck monkey
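For reference, here is roughly what those two settings mean in plain torchvision. This is not the content of stl10.py, which I am not reproducing here; it is a stand-alone sketch under my own assumptions: the ./data folder, the batch size of 4, and the name stl10_smoke_test are all illustrative. In the sketch, numWorkers corresponds to the num_workers argument of DataLoader.

```shell
#!/usr/bin/env bash
# Hypothetical helper: load one STL-10 batch with torchvision.
stl10_smoke_test() {
    python - <<'EOF'
import torch
from torchvision import datasets, transforms

# download=True fetches the archive into root on first use (a large download).
dataset = datasets.STL10(root="./data", split="train",
                         download=True, transform=transforms.ToTensor())

# num_workers=0 loads batches in the main process instead of worker subprocesses.
loader = torch.utils.data.DataLoader(dataset, batch_size=4,
                                     shuffle=True, num_workers=0)

images, labels = next(iter(loader))
print("Class Names:", tuple(dataset.classes))
print(images.shape, [dataset.classes[i] for i in labels.tolist()])
EOF
}

# Usage, inside the activated environment:
#   stl10_smoke_test
```

Loading in the main process is slower than using workers, but it sidesteps the multiprocessing problem entirely, which is good enough for a first check.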
8. Run the verification training program (using tmux)
(1). tmux commands:
# Detach
[Ctrl-b] d
# List
tmux ls
# Attach
tmux attach-session -t number
(2). Run repeatedly inside the tmux session
------ Run the proof of concept
(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ python experi_stl10_baseline_train.py
>>> Press [Ctrl-b] d to leave the session temporarily; tmux prints [Detached (from session twcc)] and the session keeps running in the background
>>> To reattach to the session you left, pass its name with -t, as follows:
davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ tmux attach-session -t twcc
>>> List the tmux sessions:
davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ tmux ls
twcc: 1 windows (created Tue Oct 20 10:28:56 2020)
------
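Since I keep forgetting whether the session already exists, the create and attach commands above can be folded into one. A small sketch, assuming the session name twcc used in this post; the function name twcc_session is made up. The -A flag of tmux new-session attaches when the named session exists and creates it otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical helper: attach to the named tmux session, creating it if needed.
twcc_session() {
    # $1 = session name (defaults to "twcc", the name used in this post)
    if ! command -v tmux >/dev/null 2>&1; then
        echo "tmux is not installed; run: sudo apt install tmux" >&2
        return 1
    fi
    # -A: attach if the session already exists, otherwise create it.
    tmux new-session -A -s "${1:-twcc}"
}

# Usage, from the WSL2 prompt:
#   twcc_session
```

This runs on the laptop side, so the ssh connection to the container still has to be started inside the session the first time.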