Improving Performance of a Tape System with TensorFlow
--Thiebaut 13:20, 23 January 2017 (EST)
Main Research Question
In the context of supercomputing (SC), some large data sets are stored on tape. These tapes are kept in large shelving systems, with robotic arms moving a tape from its shelf to a tape reader, or from the reader back to the shelf. When a High-Performance Computing (HPC) application requests a tape, the computer that controls the tape repository must locate the tape and decide which of the tape readers to load it into. Typically, ten to twenty tape readers serve the whole collection. Deciding which tape reader to empty is a challenging problem, because one does not want to remove a tape that will be accessed again in the near future. Ideally, one should pick a tape that has either finished being accessed (read or written) or will be accessed again farthest in the future. Several algorithms have been proposed for choosing which tape to evict; Least Recently Used (LRU) is one of the most commonly implemented and is often used as a baseline when comparing efficiency.
An interesting question we could explore is: could Machine Learning (ML) be used to pick the tape whose removal will slow down access to the data the least? The sequence of requests made to the tape system over time forms a time series. Every new request is a new data point, and for each data point the tape replacement algorithm must pick one of the tape readers to serve the request. If the reader is empty, great! If it is not, the tape currently in the reader must first be removed and returned to its shelf.
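To make the replacement problem concrete, here is a minimal Python sketch that replays a request trace against a small pool of readers under two policies: LRU, and the "farthest next access" rule described above. It rests on several assumptions that are not part of this page: the function names and the toy trace are made up for illustration, each request is reduced to a tape ID, and the number of robot mounts stands in for response time.

```python
from collections import OrderedDict


def simulate_lru(requests, num_drives):
    """Replay a request trace against a pool of drives using LRU eviction.

    requests   -- iterable of tape IDs in arrival order (the time series)
    num_drives -- number of tape readers available
    Returns (hits, mounts): requests served by an already-loaded tape, and
    requests that required the robot to load a tape.
    """
    loaded = OrderedDict()  # tape ID -> None, ordered oldest use first
    hits = mounts = 0
    for tape in requests:
        if tape in loaded:
            hits += 1
            loaded.move_to_end(tape)      # mark as most recently used
            continue
        mounts += 1
        if len(loaded) >= num_drives:
            loaded.popitem(last=False)    # evict the least recently used tape
        loaded[tape] = None
    return hits, mounts


def simulate_farthest_next_use(requests, num_drives):
    """Same replay, but evict the tape whose next request is farthest away.

    This needs the whole trace in advance, so it serves as a reference bound
    rather than a policy a real controller could run.
    """
    requests = list(requests)
    # next_use[i] = index of the next request for requests[i], or infinity
    next_use = [float("inf")] * len(requests)
    last_seen = {}
    for i in range(len(requests) - 1, -1, -1):
        next_use[i] = last_seen.get(requests[i], float("inf"))
        last_seen[requests[i]] = i

    loaded = {}  # tape ID -> index of its next request
    hits = mounts = 0
    for i, tape in enumerate(requests):
        if tape in loaded:
            hits += 1
        else:
            mounts += 1
            if len(loaded) >= num_drives:
                victim = max(loaded, key=loaded.get)
                del loaded[victim]
        loaded[tape] = next_use[i]
    return hits, mounts


# Hypothetical toy trace: eight requests for four tapes, three readers.
trace = ["A", "B", "C", "A", "D", "B", "A", "C"]
print("LRU (hits, mounts):", simulate_lru(trace, 3))
print("Farthest next use (hits, mounts):", simulate_farthest_next_use(trace, 3))
```

On this toy trace, LRU needs six mounts and the look-ahead rule needs five. The gap between the two is the room a learned policy would have to improve on LRU.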
A robotic arm accessing a tape library can be seen in the video, below:
- Learn how to use TensorFlow, which is Google's deep learning platform released to the public for experimentation. Several good tutorials and references are listed below.
- Create a tutorial for Smith students interested in learning how to use TensorFlow.
- Read the paper Analysis of the ECMWF Storage Landscape, by Grawinkel, Masker, Padua, and Brinkmann of Johannes Gutenberg University in Mainz, Germany.
- Obtain a trace of tape accesses from one of the references listed in the paper. Then simulate the algorithm the tape controller uses to decide which tape to remove when bringing a new one in. The tape controller must speculate on which tape is the safest to remove so as to keep performance high. One of the most widely used algorithms is Least Recently Used (LRU); the first goal is to implement it and measure the average response time it gives the tape system.
- And now, the interesting research question: can Machine Learning (deep neural networks implemented with TensorFlow) help the tape controller "learn" from the time series of tape accesses, predict which tape to replace when a new one is brought in, and achieve better performance than LRU?
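One possible way to frame the last goal as a supervised learning problem is sketched below in plain Python. For every request in a recorded trace that forces an eviction, it emits a feature vector and a label that a classifier (for example, one built with TensorFlow) could be trained on. Everything here is an assumption made for illustration and not a method prescribed by this page: the function name, the choice of features (how many requests ago each reader's tape was last used), and the choice of label (the reader a look-ahead rule would empty, which is known only because the trace is already recorded).

```python
def build_training_examples(requests, num_drives):
    """Turn a request trace into (features, label) pairs for a classifier.

    For every request that forces an eviction:
      features -- for each drive, how many requests ago its tape was last used
      label    -- index of the drive holding the tape whose next request is
                  farthest away (available only when replaying a recorded trace)
    """
    requests = list(requests)
    # next_use[i] = index of the next request for requests[i], or infinity
    next_use = [float("inf")] * len(requests)
    last_seen = {}
    for i in range(len(requests) - 1, -1, -1):
        next_use[i] = last_seen.get(requests[i], float("inf"))
        last_seen[requests[i]] = i

    drives = [None] * num_drives      # tape currently in each drive
    last_used = [None] * num_drives   # index of that tape's latest request
    upcoming = [None] * num_drives    # index of that tape's next request
    examples = []
    for i, tape in enumerate(requests):
        if tape in drives:            # tape already loaded: no eviction
            d = drives.index(tape)
        elif None in drives:          # an empty drive is available
            d = drives.index(None)
        else:                         # eviction needed: record an example
            features = [i - t for t in last_used]
            d = max(range(num_drives), key=lambda k: upcoming[k])
            examples.append((features, d))
        drives[d] = tape
        last_used[d] = i
        upcoming[d] = next_use[i]
    return examples


# Hypothetical toy trace: eight requests for four tapes, three readers.
trace = ["A", "B", "C", "A", "D", "B", "A", "C"]
for features, label in build_training_examples(trace, 3):
    print(features, "->", label)
```

In the first example produced from this toy trace, the features are `[1, 3, 2]` and the label is drive 2, whereas LRU would have emptied drive 1 (the largest recency value). Cases like this are where a learned policy could differ from LRU. When several loaded tapes are never requested again, the label is an arbitrary tie-break, so a real experiment would likely need to handle such ties and use richer features than recency alone.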