-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major
-
Affects Version/s: master
-
Component/s: Planner: Cleanup Module
-
None
Hi Karan, I've been running a gigantic workflow with >250K jobs (I split it into 6 sub-workflows because it took Pegasus more than 12 hours to plan the full workflow).
Now the pegasus.file.cleanup.clusters.num=1 setup is finally catching up with me, because cleanup won't start until all the jobs on the next level have finished. As a result, my workflows aborted three times due to running out of storage.
I will increase it to pegasus.file.cleanup.clusters.num=25 in the future. However, I think a parameter that makes more sense is:
pegasus.file.cleanup.clusters.fraction
which takes a value in the range (0, 1]. It equals the number of cleanup jobs divided by the number of compute jobs on that level. So if it's 1, there is one cleanup job for each compute job; if it's 0.2, there is one cleanup job for every 5 compute jobs, and so on. You could set its default to 0.1 or something similar.
This way, a level with many compute jobs gets the same rate of cleanup as a level with very few compute jobs. In a sense, this parameter controls the rate of cleanup.
The pegasus.file.cleanup.clusters.num parameter doesn't really achieve this: a level with many compute jobs will probably be cleaned up much more slowly than a level with very few jobs.
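To make the difference concrete, here is a small sketch of the two strategies. This is not Pegasus code; the function names and the ceil/min details are my assumptions about how the proposed fraction parameter could be interpreted per level:

```python
import math

def cleanup_jobs_fixed(num_compute_jobs, clusters_num):
    # Current behavior (assumed): every level gets the same fixed
    # number of cleanup jobs, capped by the level's size.
    return min(clusters_num, num_compute_jobs)

def cleanup_jobs_fraction(num_compute_jobs, fraction):
    # Proposed behavior: cleanup jobs scale with the level's size,
    # with at least one cleanup job per non-empty level.
    return max(1, math.ceil(fraction * num_compute_jobs))

# Compare the two strategies across levels of very different sizes.
for n in (10, 1000, 50000):
    print(n, cleanup_jobs_fixed(n, 25), cleanup_jobs_fraction(n, 0.2))
```

With clusters.num=25, a 50000-job level gets the same 25 cleanup jobs as a 1000-job level, so its cleanup rate is 50x lower; with fraction=0.2, both levels keep one cleanup job per 5 compute jobs.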
What do you think? You can give the parameter whatever name you like.