Tuesday, February 19, 2019

Saving yourself from grief over loss of computation


You are running a long simulation of about 30 days and the power fails on the 29th day (or something equally tragic happens!). How would you react? Enter the classic case of 5 stages of grief: Denial (Nooooo!!!!), Anger (*&(#^&*(#@%&!!!!), Bargaining (It must’ve produced some output!), Depression (I’m never gonna graduate!) and Acceptance (Well, let’s restart the simulation!).
But before you restart your simulation, hear me out!
I have run into this problem many times and have been thinking about how to avoid it. Here are some of the solutions I came across:
  1. UPS: An Uninterruptible Power Supply can potentially save you from this… But how can I stop my simulation midway if the power is not back up before the UPS dies out?
  2. NVDIMM: This is another hardware solution. A non-volatile DIMM (NVDIMM) copies the contents of your RAM into non-volatile memory when the power fails and restores them when the power comes back online. It is a developing technology but can be very useful. Read more about this here. However, if you don’t have deep pockets, this is, at this time, a volatile option!
  3. Manual checkpointing: Checkpointing is the art of saving a snapshot of the state of your software or simulation to disk and then restoring it when you need to! You can do this easily if you are writing your own code: save all important variables to a file every so many minutes and then restore them from the file after a failure. However, manual checkpointing can be a pain, so I began looking for automatic checkpointing, aka Checkpoint-Restore, solutions! An ideal (albeit extreme) scenario would be to save the whole operating system state and restore it as if nothing had happened. Below are a few solutions along those lines.
  4. Hypervisors: If you can tolerate the performance loss and other issues associated with using a virtual machine (VirtualBox, VMware, etc.), you can use the hypervisor’s “snapshot” mechanism to store and then restore the state of your system. Read more about this here.
  5. Docker: Docker is not a hypervisor but a container engine: it avoids the overhead of a virtual machine because your processes do not run in a guest operating system but directly on the host kernel, isolated inside a container. Docker is excellent for installation, distribution and also for checkpointing. The idea here is to build a Docker container for your simulation software and then use the underlying checkpoint and restore technology to restore the software after a failure. Read more about this here. However, this entails significant additional work in setting up the Docker container.
  6. CRIU: CRIU (pron. Kree-oo) stands for Checkpoint/Restore In Userspace. With CRIU “you can freeze a running application (or part of it) and checkpoint it as a collection of files on disk. You can then use the files to restore the application and run it exactly as it was during the time of the freeze. With this feature, application live migration, snapshots, remote debugging, and many other things are possible.” Here is a complete list of usage scenarios. CRIU supports checkpointing with Docker as well. This is my second favorite tool for this purpose!
  7. DMTCP: This is my favorite solution: it allows you to create checkpoints and restore them in a very easy manner. It is packaged for some Linux distributions (for example, sudo apt-get install dmtcp). However, I would recommend the following process to install it: download the DMTCP source as a .zip file and extract it to a folder of your choice. Then go into the folder and execute “./configure”, “make” and “make install” (the last step may need sudo), and that’s it! After installation you are ready to use it. For a demo, go to the contrib/python folder within your dmtcp folder in your terminal and execute “dmtcp_launch python hookexample.py” to start an example process. This will create a dmtcp checkpoint as well as a script that allows you to restore the process as and when required using ./dmtcp_restart_script.sh. This video also explains the process in more detail. These checkpoints also help a great deal with debugging, because they save the state of your software.
  8. Ruffus: Ruffus is a library for Python that allows you to set up and execute a computational pipeline composed of multiple steps in which the output of one step becomes the input of the next. It also supports pretty easy checkpointing.
  9. EdiblePickle: EdiblePickle also allows checkpointing. Here is a simple example.
import string
import time
from ediblepickle import checkpoint

# A checkpointed expensive function
# (refresh=False, so an existing checkpoint file is reused rather than recomputed)
@checkpoint(key=string.Template('m{0}_n{1}_${iterations}_$stride.csv'), work_dir='/tmp/intermediate_results', refresh=False)
def expensive_computation(m, n, iterations=4, stride=1):
    for i in range(iterations):
        time.sleep(1)
    return range(m, n, stride)

# First call, evaluates the function and saves the results
begin = time.time()
expensive_computation(-100, 200, iterations=4, stride=2)
time_taken = time.time() - begin

print(time_taken)

# Second call, since the checkpoint exists, the result is loaded from that file and returned.
begin = time.time()
expensive_computation(-100, 200, iterations=4, stride=2)
time_taken = time.time() - begin

print(time_taken)
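
As a rough illustration of the manual checkpointing idea from item 3, here is a minimal sketch that uses only the Python standard library. The names save_checkpoint, load_checkpoint and run_simulation, as well as the checkpoint file name, are hypothetical and invented for this sketch. It checkpoints every N steps rather than every few minutes to keep the example short, and the stop_after argument exists only to simulate a crash.

```python
import os
import pickle
import tempfile

# Hypothetical file name, chosen only for this sketch.
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "simulation_checkpoint.pkl")


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the state to a temporary file, then atomically rename it.

    If the power fails in the middle of the write, the previous
    checkpoint is still intact instead of being half overwritten.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to disk
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_simulation(total_steps, checkpoint_every=100, path=CHECKPOINT_PATH,
                   stop_after=None):
    """Toy simulation: a running sum that resumes from the last checkpoint.

    stop_after simulates a crash by returning None once that step is reached.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulated power failure: in-memory state is lost
        state["total"] += state["step"]  # stand-in for one expensive step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]
```

Calling run_simulation again after a crash picks up from the last saved step instead of from zero, so at most checkpoint_every steps of work are lost. In a real simulation the state dictionary would hold whatever variables you need in order to continue, and the checkpoint interval is a trade-off between time spent writing to disk and the amount of work you are willing to redo.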

Follow it here: https://medium.com/@fayyazafsar/saving-yourself-from-grief-over-loss-of-computation-688819d85389