Notes on Using an NCHC (National Center for High-performance Computing) Development Container

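Motivation: my research projects often require working in the NCHC high-performance computing environment, so I am keeping these notes to avoid forgetting the steps or making mistakes.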


Environment

1. Windows 10 laptop

2. WSL2

3. Chrome browser

4. Network connection


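Steps

1. Log in to NCHC iService (account / password: user999@mail.nuk.edu.tw / password)

(1). Create a [Development Container]:

Click [+Create], click [Custom Image], and choose from the "Image" drop-down: cuda-10.1-cudnn7-devel-ubuntu18.04:tadnn999999. Under Basic Settings select [cm.xsuper, GPU x2, CPU x8, memory: 120 GB, shared memory: 60 GB]. Then click [Next: Storage], click [Next: Review + Create], and click [Create].

Wait while the container shows Initializing... creation takes about 60 seconds.

(2). Done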

>>> Development container name: ctr9999999999999, ssh u9999999@203.145.216.149 -p 99999  (click the square icon to the right of SSH to copy the connection command)

>>> See https://www.twcc.ai/user/container/detail/9999999

2. Run programs from WSL2 (using tmux):

Open Windows Terminal / Ubuntu-20.04

(1). Install tmux:

davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ sudo apt install tmux

(2). Use tmux:

Enter the tmux commands:

# Create a new session

davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ tmux new -s twcc

davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ ssh u9999999@203.145.216.149 -p 99999

Type [yes] (to accept the host key on the first connection)

Enter [password]

3. Download 正齡's tadnn code:
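The ssh command above has to be retyped with the user, IP, and port every time. One way to shorten it is a host alias in ~/.ssh/config. The following is only a sketch: the function name print_ssh_stanza and the alias twcc are my own choices, and the user, host, and port are the placeholder values from this post.

```shell
# Print an ~/.ssh/config stanza for the container.
# Arguments: alias, user, host, port (take the last three from the
# SSH field on the container's detail page).
print_ssh_stanza() {
    printf 'Host %s\n' "$1"
    printf '    HostName %s\n' "$3"
    printf '    User %s\n' "$2"
    printf '    Port %s\n' "$4"
}

# Example with the placeholder values used in this post.
# Append the output to ~/.ssh/config on the WSL2 side:
#   print_ssh_stanza twcc u9999999 203.145.216.149 99999 >> ~/.ssh/config
# After that, `ssh twcc` stands in for the full command.
```

The function only prints the stanza, so it can be reviewed before anything is written to the config file.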

u9999999@vd6dcjctr9999999999999-xnpd4:~$ cd /work/u9999999/

u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ git clone https://gitlab.ical.tw/jamesljlster/tadnn.git

Username for 'https://gitlab.ical.tw': [user999]

Password for 'https://user999@gitlab.ical.tw': [password]

xxx 4. System update

xxx u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ sudo apt update

xxx u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ sudo apt upgrade

5. Install Miniconda3 Linux 64-bit

Download: u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ chmod +x Miniconda3-latest-Linux-x86_64.sh 

u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ ./Miniconda3-latest-Linux-x86_64.sh

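Press [ENTER] (to view the license)

Type [yes] (to accept the license)

Press [ENTER] (to accept the default install location)

Type [yes] (to let the installer initialize conda)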

u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ source ~/.bashrc

(base) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$
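The interactive prompts above can also be skipped. The Miniconda installer has a batch mode (-b) and a prefix option (-p), and conda init bash then does what the last [yes] does in the interactive run. This is a sketch I have not run on the container itself; the function name install_miniconda and the default prefix are my own assumptions.

```shell
# Unattended Miniconda install (sketch).
# Optional argument: install prefix (defaults to $HOME/miniconda3).
# The prefix directory must not already exist, or the installer refuses to run.
install_miniconda() {
    local installer="Miniconda3-latest-Linux-x86_64.sh"
    local prefix="${1:-$HOME/miniconda3}"
    # Download the installer into the current directory.
    wget -q "https://repo.anaconda.com/miniconda/$installer" -O "$installer" &&
    # -b: batch mode (no prompts, license accepted), -p: install location.
    bash "$installer" -b -p "$prefix" &&
    # Batch mode does not touch ~/.bashrc, so initialize the shell explicitly.
    "$prefix/bin/conda" init bash
}

# Usage:
#   install_miniconda
#   source ~/.bashrc
```

Batch mode implies acceptance of the license terms, so it is worth reading them once in the interactive run first.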

6. Install and configure the packages tadnn needs

(base) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ conda create -n tadnn python=3 -y

(base) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ conda activate tadnn

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999$ cd tadnn

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn$ pip install gpustat

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn$ conda install pytorch torchvision cudatoolkit=10.2 tqdm pandas opencv matplotlib -c pytorch -y

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn$ cd build

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/build$ ./cmake_clean.sh

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/build$ ./conda_build.sh

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/build$ cd ../pytorch/
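Before moving on, it can be worth confirming that PyTorch in the tadnn environment actually sees the GPUs, since the conda cudatoolkit version (10.2) differs from the CUDA version in the image name (10.1). A minimal check, assuming the tadnn environment is active; the function name check_gpu is my own.

```shell
# Print whether PyTorch can use CUDA, followed by the number of visible GPUs.
check_gpu() {
    python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
}

# Usage (inside the activated tadnn environment):
#   check_gpu
```

On the GPU x2 container configured above, the hoped-for result is True followed by 2. If the first value is False, the training scripts would fall back to the CPU or fail, so it is better to find out here.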

7. Download the STL-10 dataset

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ sudo apt install nano

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ nano stl10.py   (on line 11, set download=True so that STL10 is downloaded; also set numWorkers=0, otherwise a multiprocessing error occurs)

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ python stl10.py

Class Names: ('airplane', 'bird', 'car', 'cat', 'deer', 'dog', 'horse', 'monkey', 'ship', 'truck')

airplane  ship truck monkey

8. Run the validation training program (using tmux)

(1). tmux commands:

# Detach

[Ctrl-b] d

# List

tmux ls

# Attach

tmux attach-session -t number

(2). Run repeatedly inside the tmux session

------ Run the proof of concept

(tadnn) u9999999@vd6dcjctr9999999999999-xnpd4:/work/u9999999/tadnn/pytorch$ python experi_stl10_baseline_train.py

>>> Press [Ctrl-b] d    # [Detached (from session twcc)]: leave the session temporarily; it keeps running in the background

>>> To reconnect to a session you left earlier, specify it by name, as follows:

davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ tmux attach-session -t twcc

>>> List the tmux sessions:

davis@LAPTOP-99999999:/mnt/c/Users/dvsse$ tmux ls

twcc: 1 windows (created Tue Oct 20 10:28:56 2020)

------
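The create, detach, and attach steps above can be folded into one command: the -A flag of tmux new-session attaches to the named session if it already exists and creates it otherwise. A small sketch; the function name attach_or_create is my own, and twcc is the session name used in this post.

```shell
# Attach to the named tmux session, creating it first if it does not exist.
attach_or_create() {
    tmux new-session -A -s "$1"
}

# Usage on the WSL2 side:
#   attach_or_create twcc
```

With this, the same command works both for the first login of the day and for coming back to a training run that is still going in the background.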


Takeaway: check the [Remaining Quota] regularly; once you overspend, the charges are hard to sort out!
