Nov. 16 - Nov. 20

back to Computer Art

1. Training

  • Could not directly change setup_snapshot_image_grid(training_set) as stated last week; doing so raises the error "could not broadcast input array from shape (3,256,256) into shape (3,0,256)".
  • gw and gh must be assigned before the data array is initialized inside setup_snapshot_image_grid(training_set). Passing gw = gh = 1 directly as the grid_size of create_image_grid(images, grid_size=None) instead causes a mismatch in the arrays' second dimension, because inside setup_snapshot_image_grid(training_set) the data arrays and layouts are built from np.zeros([gw*gh]+training_set.shape). To show more than a single sample, we set the grid_size to (1,4), so the first 4 portraits in the first row of the selected real samples are shown (a sketch of the change follows the image below):
Sample1-4.png
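
A minimal sketch of the modified grid setup, written from memory of the StyleGAN training code; the surrounding training_loop.py details are omitted, and attribute names such as training_set.label_size, training_set.label_dtype and get_minibatch_np() are assumed to match the StyleGAN dataset wrapper:

 import numpy as np

 def setup_snapshot_image_grid(training_set):
     # Fix the grid dimensions up front instead of deriving them from the
     # display resolution, so the real-sample grid holds just 4 portraits.
     gw, gh = 1, 4   # grid_size used on this page (4 real portraits in total);
                     # the (width, height) order follows the local create_image_grid()
     # Allocate the arrays exactly as the original code does, so the shapes stay
     # consistent downstream: np.zeros([gw*gh] + training_set.shape).
     reals  = np.zeros([gw * gh] + list(training_set.shape), dtype=training_set.dtype)
     labels = np.zeros([gw * gh, training_set.label_size], dtype=training_set.label_dtype)
     reals[:], labels[:] = training_set.get_minibatch_np(gw * gh)  # first gw*gh real samples
     return (gw, gh), reals, labels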
  • I had to switch to the paid service this week and could only train on a single GPU (1 × NVIDIA T4, 16 GB), so training is very slow. One tick took:
network-snapshot-000000 time 31m 36s fid50k 345.4734
tick 1 kimg 1.0 lod 0.00 minibatch 8 time 2h 18m 46s sec/tick 341.5 sec/kimg 344.30 maintenance 7943.1 gpumem 6.1
and then raised the error "MemoryError: Unable to allocate 0 bytes for an array with shape (1073741824, 0) and data type float32". Searching online, I found the problem can be solved by changing self._np_labels = np.zeros([1<<30, 0], dtype=np.float32) to np.zeros([1<<20, 0], dtype=np.float32) in ./training/dataset.py (see the snippet below). Note that 1<<30 == 1073741824, i.e. about 1G rows, so the original allocation is very memory-consuming.
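
For completeness, a sketch of that one-line change (as suggested online, not an official patch), plus a quick standalone check that the smaller label buffer keeps the same shape and dtype:

 import numpy as np

 # Offending line in ./training/dataset.py (TFRecordDataset.__init__):
 #     self._np_labels = np.zeros([1 << 30, 0], dtype=np.float32)
 # 1 << 30 == 1073741824 rows, which the NumPy allocator refused here with the
 # "Unable to allocate 0 bytes" MemoryError quoted above.
 # Patched version, shrinking the preallocated buffer by a factor of 1024:
 #     self._np_labels = np.zeros([1 << 20, 0], dtype=np.float32)

 # Standalone check: the patched buffer still holds zero-width float32 rows.
 labels = np.zeros([1 << 20, 0], dtype=np.float32)
 print(labels.shape, labels.dtype)   # (1048576, 0) float32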
  • The second round of training then produced:
network-snapshot-000000 time 31m 33s fid50k 345.4768
tick 1 kimg 1.0 lod 0.00 minibatch 8 time 38m 24s sec/tick 352.3 sec/kimg 355.13 maintenance 1910.8 gpumem 6.1
  • However, the training ended abnormally:
network-snapshot-000001 time 31m 35s fid50k 346.2851
dnnlib: Finished training.training_loop.training_loop() in 1h 11m 03s.
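
As a rough sanity check on the numbers above, the wall-clock time of tick 1 is approximately sec/tick plus the maintenance time (which covers the work between ticks, e.g. the network snapshot and the fid50k evaluation). A small back-of-the-envelope script with the values copied from the logs:

 def fmt(seconds):
     # Render a duration in seconds as "Hh MMm SSs" for comparison with the logs.
     h, rem = divmod(int(seconds), 3600)
     m, s = divmod(rem, 60)
     return f"{h}h {m:02d}m {s:02d}s"

 runs = {
     "first round":  {"sec_per_tick": 341.5, "maintenance": 7943.1},  # logged time 2h 18m 46s
     "second round": {"sec_per_tick": 352.3, "maintenance": 1910.8},  # logged time 38m 24s
 }
 for name, r in runs.items():
     print(name, "~", fmt(r["sec_per_tick"] + r["maintenance"]))
 # first round ~ 2h 18m 04s, second round ~ 0h 37m 43s: both within about a
 # minute of the logged tick-1 times, so the slow ticks are mostly maintenance.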