- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: master, 4.4.0
- Component/s: Pegasus Dashboard
Hi Karan,
I’ve noticed that the details of a job are sometimes missing from the Pegasus dashboard. For example, there are two failed jobs listed in
<http://sugar-dev3.phy.syr.edu/pegasus/%7Esamantha.usman/root/48/workflow/7>
but if I click on their links,
<http://sugar-dev3.phy.syr.edu/pegasus/%7Esamantha.usman/root/48/workflow/7/job/3498>
<http://sugar-dev3.phy.syr.edu/pegasus/%7Esamantha.usman/root/48/workflow/7/job/3476>
they show “Bad Request” rather than the job summary. The error appears to come from views.py, in the route
@app.route(basepath + '/root/<root_wf_id>/workflow/<wf_id>/job/<job_id>', methods=['GET'])
with
job = dashboard.get_job_information(wf_id, job_id)
returning None.
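To make the suspected failure path concrete, here is a minimal sketch of how a lookup that returns None could end up as a "Bad Request" page. It is not the actual views.py code: `job_view`, `empty_lookup` and `found_lookup` are hypothetical names, and the mapping from None to a 400 status is an assumption based only on the symptom described above.

```python
from typing import Callable, Optional, Tuple


def job_view(lookup: Callable[[int, int], Optional[dict]],
             wf_id: int, job_id: int) -> Tuple[int, str]:
    """Hypothetical stand-in for the job route; not the real handler."""
    job = lookup(wf_id, job_id)
    if job is None:
        # Assumed behaviour: a lookup that finds no row is reported
        # to the browser as a client error.
        return 400, "Bad Request"
    return 200, "Job summary for %s" % job["exec_job_id"]


def empty_lookup(wf_id: int, job_id: int) -> Optional[dict]:
    """Stand-in for a query that matches no rows."""
    return None


def found_lookup(wf_id: int, job_id: int) -> Optional[dict]:
    """Stand-in for a query that finds the job."""
    return {"job_id": job_id, "exec_job_id": "lalapps_inspiral_ID000079"}


print(job_view(empty_lookup, 7, 3498))
print(job_view(found_lookup, 7, 3498))
```

If the real handler behaves like this, the question is why the query behind get_job_information matches no rows for a job that exists in the job table.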
If I query the databases manually, I get the following state:
[dbrown@sugar-dev3 ~]$ sqlite3 ~samantha.usman/.pegasus/workflow.db
sqlite> select * from workflow where wf_id = 48;
48|b7a46964-b111-48ab-a9a5-0ed4a88182ba|ihope_ssipe|3.4|/home/samantha.usman/bns/ahope-comparison-paper/single-stage-runs/s6/965779215-966384015/ihope_ssipe.dax|ihope_ssipe-0.dag|1416878286|sugar-dev3.phy.syr.edu|/usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/.||samantha.usman||4.4.0|sqlite:////usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/ihope_ssipe-0.stampede.db
[dbrown@sugar-dev3 ~]$ sqlite3 /usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/ihope_ssipe-0.stampede.db
sqlite> select * from workflow where wf_id = 7;
7|3a9bb230-5f1d-4318-81b6-8cd674e086be|inspiral_hipe_full_data.FULL_DATA-0.dag|1416884021|sugar-dev3.phy.syr.edu|/usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/./full_data/inspiral_hipe_full_data.FULL_DATA_ID000003.000|--conf /usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/./pegasus.4396090288984859026.properties --dir /usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc --relative-dir ././full_data --relative-submit-dir ./full_data/inspiral_hipe_full_data.FULL_DATA_ID000003 --sites local --cache /usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/./datafind/inspiral_hipe_datafind_ID000001/inspiral_hipe_datafind-0.cache,/home/samantha.usman/bns/ahope-comparison-paper/single-stage-runs/s6/965779215-966384015/pegasus-pfn-cache-965779215-966384015.cache,/usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/./ihope_ssipe-0.cache --inherited-rc-files /usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/./ihope_ssipe-0.replica.store --cluster horizontal --output-dir full_data --output-site local --force --cleanup none --verbose --verbose --verbose --verbose --verbose --verbose --deferred --group pegasus --dax /home/samantha.usman/bns/ahope-comparison-paper/single-stage-runs/s6/965779215-966384015/full_data/inspiral_hipe_full_data.FULL_DATA.dax|samantha.usman||4.4.0|inspiral_hipe_full_data.FULL_DATA|3.4|/home/samantha.usman/bns/ahope-comparison-paper/single-stage-runs/s6/965779215-966384015/full_data/inspiral_hipe_full_data.FULL_DATA.dax|1|1
sqlite> select * from job where job_id = 3498;
3498|7|lalapps_inspiral_ID000079|lalapps_inspiral_ID000079.sub|compute|0|1|/usr/bin/../bin/pegasus-kickstart| -n ligo-inspiraljob::lalapps_inspiral:1.0 -N ID000079 -R local -L inspiral_hipe_full_data.FULL_DATA -T 2014-11-24T20:11:57-05:00 /home/samantha.usman/bns/ahope-comparison-paper/single-stage-runs/s6/965779215-966384015/full_data/../executables/lalapps_inspiral --do-rsq-veto --trig-end-time 0 --cluster-method template --dynamic-range-exponent 69.0 --autochisq-stride 2 --bank-file H1-TMPLTBANK-965942699-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag FIRST --user-tag FULL_DATA --gps-end-time 965944747 --calibrated-data real_8 --channel-name H1:LDAS-STRAIN --snr-threshold 5.5 --enable-rsq-veto --number-of-segments 15 --trig-start-time 965944355 --enable-high-pass 30.0 --gps-start-time 965942699 --enable-filter-inj-only --autochisq-two-sided --high-pass-attenuation 0.1 --chisq-bins 16 --inverse-spec-length 16 --rsq-veto-threshold 15.0 --segment-length 1048576 --low-frequency-cutoff 40.0 --pad-data 8 --maximization-interval 30 --sample-rate 4096 --chisq-threshold 10.0 --rsq-veto-max-snr 12.0 --resample-filter ldas --strain-high-pass-atten 0.1 --strain-high-pass-freq 30 --bank-veto-time-freq --segment-overlap 524288 --frame-cache H-H1_LDAS_C02_L2_CACHE-965923163-21592.lcf --chisq-delta 0.2 --bank-veto-subbank-size 20 --approximant FindChirpSP --rsq-veto-time-thresh 0.0002 --write-compress --autochisq-length 100 --enable-output --rsq-veto-window 6.0 --order threePointFivePN --spectrum-type median|1
sqlite> select * from job_instance where job_id = 3498;
3978|3498||175|17833350.0|local|samantha.usman|/usr1/samantha.usman/log/H1L1-ihope_ssipe-965779215-604800.YmWJRc/full_data/inspiral_hipe_full_data.FULL_DATA_ID000003.000|||||lalapps_inspiral_ID000079.out||lalapps_inspiral_ID000079.err|||1|256
sqlite> select * from jobstate where job_instance_id = 256 order by timestamp desc;
256|POST_SCRIPT_SUCCESS|1416880087|7
256|POST_SCRIPT_TERMINATED|1416880087|6
256|JOB_SUCCESS|1416880082|4
256|JOB_TERMINATED|1416880082|3
256|POST_SCRIPT_STARTED|1416880082|5
256|EXECUTE|1416878446|2
256|SUBMIT|1416878413|1
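The pipe-separated rows above are hard to read without column names, so it is easy to misjudge which field is which. The sketch below shows one way to print each row with its column names using Python's sqlite3 module. The table here is an in-memory stand-in: its column names and values are placeholders, not the real stampede schema or data, and `rows_with_names` is a hypothetical helper.

```python
import sqlite3


def rows_with_names(conn, sql, params=()):
    """Run a query and return each row as a {column: value} dict."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(sql, params)]


# In-memory stand-in; column names and values are placeholders,
# not the real stampede schema or data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_instance "
    "(job_instance_id INTEGER, job_id INTEGER, exitcode INTEGER)"
)
conn.execute("INSERT INTO job_instance VALUES (1, 10, 0)")

for row in rows_with_names(
        conn, "SELECT * FROM job_instance WHERE job_id = ?", (10,)):
    print(row)
```

Pointing the same helper at the real stampede.db would make it easier to confirm which field of the job_instance row holds the job_instance_id before filtering jobstate on it. In the sqlite3 shell itself, `.headers on` gives the same information.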
Do you know what is going on here? According to the database, the job completed, so why is it showing up as failed? And why does the dashboard return “Bad Request”?
Cheers,
Duncan.