Improving Performance of a Tape System:Log

From CSclasswiki

( Back to Main Page )

Week of 1/30/17

  • Good first introduction meeting today (1/31/17)!
  • For next time, follow the steps in the section Understanding Deep Neural Nets below.
  • Install Eclipse on your laptop: go to this page, and download Eclipse Neon, 64 bits. Follow the directions from this installation page.
  • Download and install Tensorflow using information from this page
  • This should be plenty to do for this week. Installing software takes time and very often requires trying several times. Don't give up and try your best. If you get stuck, we'll attempt to work on it at our next meeting.
  • If you manage to install TensorFlow, you may want to go through the first Tutorial on this page: Machine Learning Recipe #1: Hello world!

Week of 2/6/17

  • Same as above, but install on Mac OS X! Please try to do so before Thursday. Update me with progress.

Week of 2/13/17

  • Another attempt to run TensorFlow: Docker
  • Follow directions on this tutorial.
  • Start reading the paper Analysis of the ECMWF Storage Landscape, by Grawinkel et al.
  • The video of the presentation can be seen here.
  • Traces of access logs are available at this URL. Register on the site; an email will be sent with directions for accessing the site with the traces. Then see about getting these traces onto the Mac in FH343 (near the window). You need to use ftp to open the site.
  • Learn how to use ftp (file-transfer protocol) to get the README file on the repository of traces.
  • Figure out by looking at the README file, and by looking at the various python programs, how we could get a list of times at which various tapes are requested, either for READ or for WRITE operations.

Week of 2/20/17

  • Progress report
    • Tensorflow currently working/installed on xgridmac.
    • Still some problems logging in to hadoop0
    • New iMacs in FH343 ready for Tensorflow install (to be done later this week)
  • TO DO for next few weeks:
  1. Work on writing Python simulation that will start looking at tape requests and see how LRU replacement policy of tapes can be implemented.
  2. Study the tape trace downloaded on xgridmac
  • Downloaded ECFS_Access_Traces.tar from ECFS center
  • Untarred the archive
  • Decompressed archive into file ecfs_access_2012.01-2014.05
  • To access the file, log in to your account and enter the following commands:
  cd tensorflow
  cd ECFS_Access_Traces
  ls -l
The trace is huge, and contains 127,000,000 lines!
  • Created several smaller versions of the archive file, with 1,000, 10,000, and 100,000 lines. This should make it easier to debug Python programs, by running them on the smaller trace files.
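These smaller files can be produced with a few lines of Python. This is just a sketch; the file names follow the listing nearby, and `make_small_trace` is an illustrative name, not part of the original programs.

```python
# Sketch: copy the first N lines of the huge trace into a smaller file,
# so Python programs can be debugged on a manageable input.

def make_small_trace(srcName, dstName, maxLines):
    """Copy the first maxLines lines of srcName into dstName."""
    with open(srcName) as src, open(dstName, "w") as dst:
        for lineNo, line in enumerate(src):
            if lineNo >= maxLines:
                break
            dst.write(line)

# e.g. make_small_trace("ecfs_access_2012.01-2014.05", "ecfs.05_1K.txt", 1000)
```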

 ls -l
total 34292640
-rw-r--r--  1 nancyxiao  staff     13707790 Feb 20 16:29 ecfs.05_100K.txt
-rw-r--r--  1 nancyxiao  staff      1355866 Feb 20 16:29 ecfs.05_10K.txt
-rw-r--r--  1 nancyxiao  staff       132263 Feb 20 16:29 ecfs.05_1K.txt
-rw-r--r--  1 nancyxiao  staff  17542625009 Jan  7  2015 ecfs_access_2012.01-2014.05
  • Here's an example program for reading a trace file (ecfs.05_1K.txt):
# D. Thiebaut
# A skeleton program for reading a trace file.
# Run the program as follows:
#      cd
#      cd tensorflow
#      python  <nameOfThisFile>.py
from __future__ import print_function

TRACEFILE = "ecfs.05_1K.txt"   # the 1,000-line trace created above


def main():
    lineNo = 0
    with open( TRACEFILE ) as traceFile:
        for line in traceFile:
            # get all the fields from one line (fields are '|'-separated)
            fields = line.strip().split( '|' )
            startTime = int(fields[0])
            userId    = int(fields[1])
            hostId    = int(fields[2])
            processId = int(fields[3])
            request   = fields[4]
            params    = fields[5]
            fileSize  = int(fields[6])
            execTime  = int(fields[7])
            # display startTime and request only
            print( "Line %3d: time: %10d  Request: %4s  Exec. Time: %10d"
                   % (lineNo, startTime, request, execTime ) )

            lineNo += 1

main()

Week of 2/27/17

  • To do for next week:
  • generalize the simulator to work with traces of tape mounts, which will be made available soon, and make the number of tape drives a parameter that we can easily change. For example, we should be able to compute the number of misses for 12 drives or for 24 drives simply by changing one line of the code.
  • Study the programs downloaded from the ECMWF (transferred with the USB drive) and see how the programs categorize the information found on the traces. Can we learn anything about the format of the traces from the programs reading the files?
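The parameterization asked for above can be sketched as follows. This is a minimal illustration, not the actual simulator: the trace is hard-coded as a list of tape Ids, and `NO_DRIVES` is the one line to change.

```python
# Sketch of an LRU tape-drive simulator where the number of drives is a
# single parameter.  The real simulator would read tape Ids from a trace file.

NO_DRIVES = 12      # change this one line to simulate 12, 24, ... drives

def simulateLRU(trace, noDrives):
    """Return (hits, misses) for an LRU replacement policy."""
    drives = []                       # mounted tapes, oldest (LRU) first
    hits = misses = 0
    for tapeId in trace:
        if tapeId in drives:
            hits += 1
            drives.remove(tapeId)     # move tape to most-recently-used end
            drives.append(tapeId)
        else:
            misses += 1
            if len(drives) >= noDrives:
                drives.pop(0)         # evict the least recently used tape
            drives.append(tapeId)
    return hits, misses

trace = ["a", "b", "c", "b", "a", "d", "e", "a"]
hits, misses = simulateLRU(trace, 3)
print("hits:", hits, "misses:", misses)     # → hits: 3 misses: 5
```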

Week of 3/6/17

  • We deciphered a bit more of the robot trace: enough to figure out that, at least for right now, we need to worry only about whether the operation is a MOUNT, and what the tape Id is. Right now we are not worried about DISMOUNT operations, although we may think of using this information in the future.
  • We created several simple programs to help in processing the traces. They are on xgridmac, but are also given here for completeness:

Example of Processing of Trace Fields

from __future__ import print_function

fileName = "robot100.log"
file = open( fileName, 'r' )
lines = file.readlines()
file.close()

#format of each line read:
#hdre01#VLAD#20140501:005936#  DISMOUNT RC1021 Home 1,14,5,6,1 Drive 1,14,1,15 Client Host Id 328.681.634.674

tapesD = {}
for line in lines:
    # collapse runs of blanks into single blanks
    while "  " in line:
        line = line.replace( "  ", " " )
    print( line )
    fields = line.split( " " )
    mountDismount = fields[1]         # MOUNT or DISMOUNT
    tapeId = fields[2]
    if tapeId not in tapesD:
        tapesD[ tapeId ] = [ line ]
        print( "creating list for ", tapeId )
    else:
        tapesD[ tapeId ].append( line )
        print( "adding line for ", tapeId )

# show the tapes that appear more than once in the trace
for tapeId in list( tapesD.keys() ):
    if len( tapesD[tapeId] ) >= 2:
        print( tapeId )
        for line in tapesD[tapeId]:
            print( line )

Getting the Command Line Arguments

from __future__ import print_function

import sys

def main():
    if len( sys.argv ) != 3:
        print( "Syntax: python <simulatorFileName> noDrives traceFileName" )
        sys.exit( 1 )

    noDrives = int( sys.argv[1] )
    fileName = sys.argv[2]

    print( "No drives:  ", noDrives )
    print( "trace file: ", fileName )

main()


Script File for Running the Simulation

#! /bin/bash

date > simulation.txt
# replace simulator.py below with the name of the simulation program
python simulator.py  4 robot10K.log >> simulation.txt
python simulator.py  8 robot10K.log >> simulation.txt
python simulator.py 16 robot10K.log >> simulation.txt
python simulator.py 32 robot10K.log >> simulation.txt
python simulator.py 64 robot10K.log >> simulation.txt

  • Remember to make the file executable before running it:
chmod +x <scriptFileName>

  • Then run the file. Its output is captured in a file called simulation.txt

cat simulation.txt
Mon Mar  6 15:43:57 EST 2017
No drives:   4
trace file:  robot10K.log 
No drives:   8
trace file:  robot10K.log
No drives:   16
trace file:  robot10K.log
No drives:   32
trace file:  robot10K.log
No drives:   64
trace file:  robot10K.log

To Do for Next Time

  • Compute the miss ratio for LRU for all the long traces in the account on xgridmac
 56325682 Jan  9  2015 robot_mounts_log_2014-5.log
 73903685 Jan  9  2015 robot_mounts_log_2014-2.log
 79360186 Jan  9  2015 robot_mounts_log_2014-4.log
 74666574 Jan  9  2015 robot_mounts_log_2014-1.log
 78855538 Jan  9  2015 robot_mounts_log_2014-3.log

  • Design an algorithm that implements the Belady policy for replacing tapes in the drives. This algorithm is also referred to in the literature as Min, or Optimal. One way to think about it is to play with a simple case, and time-stamp each Mount request:
time stamp         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
tape requested     a  b  c  b  a  d  e  f  a  c  b  d  f  g  h  j  k  l  a  b  f  a  d

  • Assume 4 drives. When a, b, c, and d are in the tape drives, and e is requested at Time 6, LRU would suggest removing c. However, we see that c will be needed again at Time 9, a at Time 8, and b at Time 10, while d will not be needed until Time 11; so d should be removed to make room for e.
  • Figure out how to code this in Python.
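One possible sketch of the Belady policy in Python, assuming the trace is already a list of tape Ids (a tape never requested again is the best possible victim):

```python
# Sketch of Belady (Min/Optimal) replacement: on a miss with all drives
# full, evict the mounted tape whose next request lies furthest in the
# future (or that is never requested again).

def simulateBelady(trace, noDrives):
    """Return (hits, misses) for the Belady replacement policy."""
    hits = misses = 0
    drives = set()
    for t, tape in enumerate(trace):
        if tape in drives:
            hits += 1
            continue
        misses += 1
        if len(drives) >= noDrives:
            def nextUse(mounted):
                # time of the next request for this mounted tape
                for u in range(t + 1, len(trace)):
                    if trace[u] == mounted:
                        return u
                return len(trace)        # never used again: best victim
            victim = max(drives, key=nextUse)
            drives.remove(victim)
        drives.add(tape)
    return hits, misses

# the example trace from the time-stamp table above
trace = list("abcbadefacbdfghjklabfad")
print(simulateBelady(trace, 4))          # → (10, 13)
```

At Time 6 of the example, the victim is d (next needed at Time 11), matching the hand-worked case above.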

Week of 3/20/17

--Thiebaut 17:21, 20 March 2017 (EDT)

  • Several items on the to-do list for next week:
  • Clean up the programs listed above. Improve documentation. No need to add a new section, just update the programs listed above.
  • Generate synthetic traces that will test various features of the program and convince us of its correctness. A synthetic trace of 100 lines is fine. 20 lines may even be sufficient and easier to check by hand.
  • Make the program output the number of hits, number of misses, and total number of accesses as well as the miss-ratio. This will make it easier to test the correctness of the program.
  • Make the program print the result as well as store it in a log file.
  • Generate one or more synthetic traces to check the correctness of the Belady simulation program.
  • Generate a graph showing the miss ratio as a function of the number of requests, for numbers of drives equal to 2, 10, 20, 50, 75, and 100. Use Excel or whatever package you are comfortable with.

Week of 3/27/17

--Thiebaut 15:45, 27 March 2017 (EDT)

  • Good graph of miss ratio curves for LRU, and good understanding of what is going on with the curves.
  • Please add "for LRU Replacement" at the top of the graph. Upload the new graph to this Wiki page (if you click on the image, you'll see an option for replacing the image with a new one).
  • Fix the bug currently in the Belady code above. Make sure the output of the program in DEBUG mode is clear enough to understand everything.
  • Generate 5 new sets of miss-ratios (cold-start is fine) for the Belady algorithm and the 5 traces.
  • Actually, since all 5 traces represent accesses for the same year (2014), we should concatenate all 5 traces into one long trace. Use this solution to copy the 5 files into one. This way you can simplify your miss-ratio graph, since you'll have just 1 series of miss ratios for LRU and 1 series of miss ratios for Belady.
  • Generate several new graphs that combine LRU and Belady miss ratios for the long trace, and for the 5 traces:
  • 1 graph showing the miss ratios for the Belady replacement algorithm for the original 5 traces
  • 1 graph showing the miss ratios for the LRU replacement algorithm for the original 5 traces
  • 1 graph showing the miss ratios for LRU & Belady for the 5 traces
  • 1 graph showing the miss ratios for LRU & Belady for the long trace combining all 5 together.

Week of 4/3/2017

Plan for the next 4 weeks:

  • Fix current bug in Belady/LRU comparison
  • Work out the simple examples presented in the video tutorials on TensorFlow at the bottom of this page. Get an understanding of:
  1. what TensorFlow placeholders are
  2. what TensorFlow variables are
  3. the idea of a graph of operations
  4. the importance of the sigmoid or tanh function in each neuron
  5. the training and the tuning taking place.
  6. where the gradient comes into play
  • Go through DT's tutorial. Implement it on your computer. Make it work.
  • Create the X and Y data for a TensorFlow Neural Net (NN) that will take windows of references from a trace, pass the window through a NN, and train the NN to output the Belady choice for a replacement. Figuring out how to code the data is tricky. It may take a while to figure out a good format.
  • Create a NN to process the data generated in the step before
  • Do the same thing as the last two steps, but for a Recurrent Neural Net.

Week of 4/10/17

  • Great job on the graphs!
  • For next week:
  • go through tutorials on sklearn and tensorflow. BTW, Andy of AdventuresInMachineLearning sent me this Tensorflow tutorial, which looks quite good: check it out.
  • do simple examples that require some training and generate good predictions
  • go through my tutorials (see link above)
  • (For later) write a Python program that takes 2 parameters on the command line (trace name and window size) and generates a new file whose lines each contain the 1-hot encoding of all the tapeIds found in the sliding window, plus the tapeId to replace as suggested by Belady.
  • Each line will contain two 1-hot vectors side by side: the first one represents the tapeIds found in the window; the second one contains a single 1 (or all 0s) at the position of the tapeId recommended by Belady.
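The two-vector encoding can be sketched as below. `NO_TAPES`, `oneHot`, and `encodeWindow` are illustrative names; for the real trace, the vector size will be the number of unique tape Ids, not 100.

```python
# Sketch: encode a window of tape requests as two 1-hot style vectors
# side by side: which Ids appear in the window, and Belady's victim.

NO_TAPES = 100     # assumed number of distinct tape Ids (vector size)

def oneHot(tapeId, size=NO_TAPES):
    """A vector of `size` 0s with a single 1 at position tapeId."""
    v = [0] * size
    v[tapeId] = 1
    return v

def encodeWindow(window, beladyChoice, size=NO_TAPES):
    """First vector marks the Ids seen in the window; second marks
    the tape Belady recommends replacing."""
    seen = [0] * size
    for tapeId in window:
        seen[tapeId] = 1
    return seen + oneHot(beladyChoice, size)

line = encodeWindow([5, 17, 5, 42], 17)
print(len(line))          # → 200: two vectors of 100 entries each
```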

Week of 5/15/17

--Thiebaut 17:57, 17 May 2017 (EDT)

Step 1

This section is only visible to computers located at Smith College

Step 2

  • I have adapted the sinewave example to our LRU problem. It's not completed yet, and your job will be to organize it better and complete it.

Synthetic Traces

  • Instead of working on the original tape requests, I decided to start with a simpler, smaller trace of requests. A good way to generate a synthetic collection of requests is to simulate the requests as the jumps of a random walker on the integer lattice. Imagine an array of size 100, where array[i] contains i. To generate requests with good locality properties, you start on Cell 50, for example, record 50 as the first request, then randomly jump up or down in the array, such that the probability of a jump of size 1 is twice the probability of a jump of size 2, which itself is twice the probability of a jump of size 4, etc. This jump distribution is hyperbolic, and the jumper is referred to as a fractal jumper.
  • The notebook called Generate Synthetic LRU Trace contains the code for generating a synthetic trace of 10,000 requests, which are just integers in [0,100) (100 not included). The function generateSyntheticTrace() is the one generating the trace. The two test sections at the end are for testing related functions, but are not used to generate the trace.
  • Study the Generate Synthetic LRU Trace and make sure you understand how it works!
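A minimal sketch of the fractal jumper described above (not the notebook's actual code; wrapping the position modulo the array size is an assumption about how to handle the array's edges):

```python
# Sketch of the "fractal jumper": jump sizes 1, 2, 4, ... are drawn with
# probabilities 1/2, 1/4, 1/8, ... (hyperbolic), so each jump size is
# twice as likely as the next larger one, giving requests with locality.
import random

def generateSyntheticTrace(noRequests=10000, size=100, start=50, seed=None):
    rng = random.Random(seed)
    trace = []
    pos = start
    for _ in range(noRequests):
        trace.append(pos)
        jump = 1
        while rng.random() < 0.5:    # double the jump with probability 1/2
            jump *= 2
        if rng.random() < 0.5:       # jump down half the time
            jump = -jump
        pos = (pos + jump) % size    # assumption: wrap around the edges
    return trace

print(generateSyntheticTrace(20, seed=42))
```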


  • I have created a Jupyter Notebook that is adapted from the SineWave example, except that now the input is not just a number representing the amplitude of a sine wave, but a one-hot vector representing a tape request whose Id is just an integer between 0 and 100 (not included). For example, if a request has Id 5, then we will represent it with a 1-hot vector of 99 0s and one 1, like this:
  0 0 0 0 0 1 0 0 0 0 0 0 ... 0

  • The Notebook is called RNN_LRU_Version0, and should be in your Jupyter Home window. Open it, and take a look at it. Observe the similarity with the SineWave notebook. Remember that the number of inputs will now be 100, the size of the 1-hot vectors. When we process the real trace of tape requests, the number of unique tape Ids will dictate the size of the 1-hot vectors, which will be large!
  • The notebook is not complete. It does not check the output of the RNN on the test Xs and ys.

Your Next Assignment

  1. Get familiar with Jupyter
  2. Get familiar with the various notebooks. Understand how they work.
  3. Work only with the synthetic trace
  4. See if you can display a graph of the variation of the MSE as a function of the number of iterations (epochs)
  5. Modify the RNN_LRU_Version0 notebook and see if you can add the following feature:
  • compute the mean squared error (MSE) on the test Xs and test ys. Make the notebook display this error at the end of the training.
  • make the training stop when the MSE is small enough. My last run generated the following MSE values:

0 	MSE: 33.2641
10 	MSE: 25.0851
20 	MSE: 24.6219
30 	MSE: 24.2719
40 	MSE: 23.8548
50 	MSE: 23.311
60 	MSE: 22.7926
70 	MSE: 21.9363
80 	MSE: 21.663
90 	MSE: 20.9894
410 	MSE: 15.602
420 	MSE: 15.3204
430 	MSE: 15.2616
440 	MSE: 15.2208
450 	MSE: 15.7338
460 	MSE: 14.8918
470 	MSE: 15.327
1790 	MSE: 9.26779
1800 	MSE: 9.52875
1810 	MSE: 10.874
1820 	MSE: 9.70107
1830 	MSE: 9.25777
1840 	MSE: 9.24003
1850 	MSE: 8.95483
1860 	MSE: 8.96428
1870 	MSE: 9.90411
1880 	MSE: 9.33163
1890 	MSE: 9.07125
1900 	MSE: 8.59247
1910 	MSE: 8.53587
1920 	MSE: 8.70391
1930 	MSE: 9.61677
1940 	MSE: 25.4168
1950 	MSE: 24.3741
1960 	MSE: 24.1473
1970 	MSE: 24.0163
1980 	MSE: 23.8677
1990 	MSE: 23.7032
2000 	MSE: 23.5327
2010 	MSE: 23.3777
2020 	MSE: 23.1864
2030 	MSE: 22.9951
2040 	MSE: 22.1811
2050 	MSE: 21.2071
2060 	MSE: 20.6537
2070 	MSE: 20.2385
2080 	MSE: 19.2021
2090 	MSE: 18.5007
2100 	MSE: 17.5835
2110 	MSE: 17.2268
2120 	MSE: 16.858
2130 	MSE: 16.4983
2140 	MSE: 15.9282
2150 	MSE: 15.6081
2160 	MSE: 15.4832
2170 	MSE: 15.25
2180 	MSE: 14.9947
2190 	MSE: 14.8679
2200 	MSE: 14.1347
2210 	MSE: 13.4009
2220 	MSE: 12.655
2230 	MSE: 12.1597
2240 	MSE: 11.8985
2250 	MSE: 11.9872
2260 	MSE: 11.7702
2270 	MSE: 11.5086
6310 	MSE: 3.84869
6320 	MSE: 3.84042
6330 	MSE: 3.83337
6340 	MSE: 3.92093
6350 	MSE: 3.86177
6360 	MSE: 6.24231
6370 	MSE: 4.82714
6380 	MSE: 4.78008
6390 	MSE: 4.76264
6400 	MSE: 5.0721
6410 	MSE: 4.95971
6420 	MSE: 3.93358
6430 	MSE: 3.79946
6440 	MSE: 3.77603
6450 	MSE: 3.77525
6460 	MSE: 3.7576
7550 	MSE: 12.9113
7560 	MSE: 12.9065
7570 	MSE: 11.7822
7580 	MSE: 10.4574
7590 	MSE: 12.3207
7600 	MSE: 58.0444
7610 	MSE: 63.6633
7620 	MSE: 83.635
7630 	MSE: nan
7640 	MSE: nan
7650 	MSE: nan
7660 	MSE: nan
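The early-stopping behavior asked for above can be sketched in a framework-agnostic way. The "prediction" line below is a fake stand-in for running the RNN on the test Xs (it just shrinks the error each epoch); in the real notebook it would be a `sess.run` of the network's outputs.

```python
# Sketch: compute the MSE on the test set every epoch and stop the
# training loop early once the MSE drops below a threshold.

def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def train(maxEpochs=10000, threshold=1.0):
    testTargets = [1.0, 2.0, 3.0]
    for epoch in range(maxEpochs):
        # fake "improving" predictions; the real code would run the RNN here
        predictions = [t + 100.0 / (epoch + 1) for t in testTargets]
        error = mse(predictions, testTargets)
        if epoch % 10 == 0:
            print(epoch, "\tMSE:", error)
        if error < threshold:        # early stopping
            print("stopping at epoch", epoch, "MSE:", error)
            break
    return error

train()
```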

Note: NaN is a special symbol meaning Not-A-Number, generated when a number is the result of an invalid operation, such as taking the log of a negative number. Once NaN is introduced somewhere in a computation, the result of the whole computation becomes NaN.
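This propagation is easy to see in plain Python:

```python
# NaN propagates through arithmetic: any operation involving NaN is NaN.
import math

nan = float("nan")
print(nan + 1)                  # → nan
print(nan * 0)                  # → nan
print(math.isnan(nan - nan))    # → True
print(nan == nan)               # → False: NaN is not even equal to itself
```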