  1. Pegasus
  2. PM-689

monitord recovery mode for sqlite


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • master, 4.2.1, 4.3
    • Affects Version/s: master
    • Component/s: Monitord
    • None

      If monitord is killed while a workflow is still running, it restarts from the beginning when it is launched again.

      Monitord writes a monitord.recover file that contains the line number reached in the dagman.out file being parsed. However, that information alone is not sufficient to do any recovery.

      The current code logic is as follows:

      When monitord starts in recover mode, it initiates the expunge operation on the DB to remove the existing content for that root workflow.

      The previous_processed_line value from monitord.recover is parsed but not really used anywhere, other than in an if check that sets last_processed_line to zero. This ensures that all the events are populated again.
      The jobstate.log file is also rotated.

      However, it is important to keep in mind that when a rescue DAG is submitted, it starts from where it left off. In that case, a monitord.info file (written out at the end of a monitord run) is read. It holds the last line of the dagman.out file parsed and other state information, which allows monitord to resume from where it left off.
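
      As an illustration of resuming from a saved position, here is a minimal Python sketch. It is not monitord's actual code: the function name read_unprocessed_lines and the idea of passing the saved line number in directly are assumptions made for this example, and the real format of monitord.info is not shown here.

```python
def read_unprocessed_lines(log_path, last_processed_line):
    """Yield (line_number, text) for every line after last_processed_line.

    Line numbers are 1-based. Passing 0 replays the whole file, which
    mirrors the "start from the beginning" behaviour described above.
    This helper is hypothetical and only illustrates the idea.
    """
    with open(log_path, "r") as handle:
        for line_number, text in enumerate(handle, start=1):
            if line_number > last_processed_line:
                yield line_number, text.rstrip("\n")
```

      With a saved value of N, only lines N+1 onward are parsed again, so events that are already in the database would not be repopulated.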

      Proposed Path Forward
      For the SQLite DB backend, in recover mode we should just back up the existing database. There is one root workflow per SQLite DB, so there is no point in trying to expunge existing entries.
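
      The proposed backup step could look roughly like the Python sketch below. It is only a sketch under stated assumptions: the function name backup_sqlite_db, the numbered ".NNN" suffix scheme, and the max_backups limit are hypothetical and are not taken from the monitord code base.

```python
import os
import shutil


def backup_sqlite_db(db_path, max_backups=1000):
    """Move an existing SQLite file aside and return the backup path.

    Returns None when there is no database file to back up. The ".NNN"
    suffix scheme is an assumption made for this sketch.
    """
    if not os.path.isfile(db_path):
        return None
    for index in range(max_backups):
        candidate = "%s.%03d" % (db_path, index)
        # Use the first suffix that is not already taken by an older backup.
        if not os.path.exists(candidate):
            shutil.move(db_path, candidate)
            return candidate
    raise RuntimeError("no free backup slot for %s" % db_path)
```

      Because the old file is moved rather than copied, monitord would then start with a fresh database and repopulate it from the first line of dagman.out, with no expunge needed.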

            vahi Karan Vahi