Bug #373

EC: doesn't terminate experiment when pressing CTRL-C

Added by Christoph Dwertmann over 1 year ago. Updated 12 months ago.

Status:Closed Start date:01/09/2010
Priority:Normal Due date:
Assignee:Thierry Rakotoarivelo % Done:

100%

Category:OMF Software
Target version:OMF 5.3.1

Description

root@norbit:/home/thierry/experiments/link-monitor# omf-5.3 load -i zzaidi_etx_hello5_omf5_3.ndz -t omf.nicta.node28,omf.nicta.node29
DEBUG: Executing '/usr/bin/ruby -I/usr/share/omf-expctl-5.3/ -I/usr/share/omf-common-5.3 /usr/share/omf-expctl-5.3/omf-expctl.rb -r -S pxe_slice -s system:exp:imageNode -- --nodes omf.nicta.node28,omf.nicta.node29 --image zzaidi_etx_hello5_omf5_3.ndz --timeout 800'

 INFO NodeHandler: OMF Experiment Controller 5.3.629
 INFO NodeHandler: Slice ID: pxe_slice 
 INFO NodeHandler: Experiment ID: pxe_slice-2010-09-01t13.19.04+10.00
 INFO NodeHandler: Message authentication is disabled
 INFO NodeHandler: Web interface available at: http://10.0.0.200:4000
 INFO Experiment: load system:exp:stdlib
 INFO property.resetDelay: value = 210 (Fixnum)
 INFO property.resetTries: value = 1 (Fixnum)
 INFO Experiment: load system:exp:eventlib
 INFO Experiment: load system:exp:imageNode
 INFO property.nodes: value = "omf.nicta.node28,omf.nicta.node29" (String)
 INFO property.image: value = "zzaidi_etx_hello5_omf5_3.ndz" (String)
 INFO property.domain: value = nil (NilClass)
 INFO property.outpath: value = "/tmp" (String)
 INFO property.timeout: value = 800 (Fixnum)
 INFO Topology: Loading topology 'omf.nicta.node28,omf.nicta.node29'.
 INFO Reset flag is set - Resetting the resources: 
^C INFO INTERRUPT: Event triggered. Starting the associated tasks.
 INFO INTERRUPT: 
User-requested Interruption... Exiting Now!

FATAL service_call: ------------
FATAL service_call:   A fatal error was encountered while making a request to an AM Service.
FATAL service_call:   Request: 'http://norbit.npc.nicta.com.au:5053/cmc/reset?name=omf.nicta.node29&domain=norbit'
FATAL service_call:   Exception: 'SystemExit'
FATAL service_call: ------------
^C INFO stdlib: Waiting for nodes (Up/Down/Total): 0/2/2 - (still down: omf.nicta.node29,omf.nicta.node28)
^C INFO stdlib: Waiting for nodes (Up/Down/Total): 0/2/2 - (still down: omf.nicta.node29,omf.nicta.node28)
^C INFO stdlib: Waiting for nodes (Up/Down/Total): 0/2/2 - (still down: omf.nicta.node29,omf.nicta.node28)
^C INFO stdlib: Waiting for nodes (Up/Down/Total): 0/2/2 - (still down: omf.nicta.node29,omf.nicta.node28)
^C^C^C^C^C^C^C^C^C^C^C^C^C
^C^C^C^C^C^C^C^C^C
 INFO stdlib: Waiting for nodes (Up/Down/Total): 0/2/2 - (still down: omf.nicta.node29,omf.nicta.node28)
 INFO stdlib: Waiting for nodes (Up/Down/Total): 0/2/2 - (still down: omf.nicta.node29,omf.nicta.node28)
 INFO ALL_UP: Event triggered. Starting the associated tasks.
 INFO exp: Progress(0/0/2): 0/0/0 min(omf.nicta.node29)/avg/max (60) - Timeout: 734 sec.
^C^C^C INFO exp: Progress(0/0/2): 20/20/20 min(omf.nicta.node29)/avg/max (60) - Timeout: 724 sec.
 INFO exp: Progress(0/0/2): 40/40/40 min(omf.nicta.node29)/avg/max (60) - Timeout: 714 sec.
 INFO exp: Progress(0/0/2): 60/60/60 min(omf.nicta.node29)/avg/max (60) - Timeout: 704 sec.
 INFO exp: Progress(0/0/2): 80/80/80 min(omf.nicta.node29)/avg/max (60) - Timeout: 694 sec.
 INFO exp: Progress(2/0/2): 100/100/100 min()/avg/max (60) - Timeout: 684 sec.
 INFO exp:  ----------------------------- 
 INFO exp:  Imaging Process Done 
 INFO exp:  - 2 node(s) successfully imaged - See the topology file: '/tmp/pxe_slice-2010-09-01t13.19.04+10.00_topo_active.rb'
 INFO exp:  ----------------------------- 
 INFO EXPERIMENT_DONE: Event triggered. Starting the associated tasks.
 INFO NodeHandler: 
 INFO NodeHandler: Shutting down experiment, please wait...
 INFO NodeHandler: 
 INFO NodeHandler: Shutdown flag is set - Turning Off the resources
 INFO run: Experiment pxe_slice-2010-09-01t13.19.04+10.00 finished after 2:16 exp: Progress(0/0/2): 40/40/40 min(omf.nicta.node29)/avg/max (60) - Timeout: 714 sec.

Even though I pressed CTRL-C several times, and my request was acknowledged (User-requested Interruption... Exiting Now!), the load proceeds until the imaging is complete.

History

Updated by Thierry Rakotoarivelo over 1 year ago

  • Assignee set to Thierry Rakotoarivelo
  • Target version set to OMF 5.3.1

Updated by Thierry Rakotoarivelo over 1 year ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Fixed in 79d34a9c
In stdlib.rb, stop waiting for nodes to come up if the experiment is interrupted (i.e. the user pressed CTRL-C before the nodes came up)

Updated by Christoph Dwertmann over 1 year ago

  • Status changed from Resolved to Assigned

A follow-up to this bug, actually a different error message, but also generated when pressing CTRL-C during load. Again, the EC doesn't terminate but stays in a loop where it prints the same error over and over again:

cdw@vm1:~$ omf load -t "[omf.nicta.node1,omf.nicta.node2]" 
[omf.nicta.node1,omf.nicta.node2]
DEBUG: Executing '/usr/bin/ruby -I/usr/share/omf-expctl-5.3/ -I/usr/share/omf-common-5.3 /usr/share/omf-expctl-5.3/omf-expctl.rb -r -S pxe_slice -s system:exp:imageNode -- --nodes '[omf.nicta.node1,omf.nicta.node2]' --image 'baseline.ndz' --timeout '800''

 INFO NodeHandler: OMF Experiment Controller 5.3.testing
 INFO NodeHandler: Slice ID: pxe_slice 
 INFO NodeHandler: Experiment ID: pxe_slice-2010-09-02t10.25.30+00.00
 INFO NodeHandler: Message authentication is disabled
ERROR web: Cannot find any of the resource directories '../../omf-common/share/htdocs::/usr/share/omf-common-5.3/share/htdocs'
ERROR web: Cannot find any of the resource directories '../../omf-common/share/htdocs::/usr/share/omf-common-5.3/share/htdocs'
 INFO NodeHandler: Web interface available at: http://10.0.0.225:4000
 INFO Experiment: load system:exp:stdlib
 INFO property.resetDelay: value = 210 (Fixnum)
 INFO property.resetTries: value = 1 (Fixnum)
 INFO Experiment: load system:exp:eventlib
 INFO Experiment: load system:exp:imageNode
 INFO property.nodes: value = "[omf.nicta.node1,omf.nicta.node2]" (String)
 INFO property.image: value = "baseline.ndz" (String)
 INFO property.domain: value = nil (NilClass)
 INFO property.outpath: value = "/tmp" (String)
 INFO property.timeout: value = 800 (Fixnum)
 INFO Topology: Loading topology 'omf.nicta.node1,omf.nicta.node2'.
 INFO Reset flag is set - Resetting the resources: 
 INFO stdlib: Waiting for nodes (Up/Down/Total): 0/2/2 - (still down: omf.nicta.node2,omf.nicta.node1)
^C INFO INTERRUPT: Event triggered. Starting the associated tasks.
 INFO INTERRUPT: 
User-requested Interruption... Exiting Now!

^CERROR everyNS: Exception: return can't jump across threads (ThreadError)
    system:exp:stdlib:54
    /usr/share/omf-expctl-5.3/omf-expctl/node/nodeSetPath.rb:153:in `call'
    /usr/share/omf-expctl-5.3/omf-expctl/node/nodeSetPath.rb:153:in `call'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:495:in `everyNS'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `initialize'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `new'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `everyNS'
    system:exp:stdlib:50:in `_binding'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:683:in `_binding'
    /usr/share/omf-expctl-5.3/omf-expctl/oconfig.rb:160:in `load'
    /usr/share/omf-expctl-5.3/omf-expctl/experiment.rb:102:in `load'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:357:in `run'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:356:in `each'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:356:in `run'
    /usr/share/omf-expctl-5.3/omf-expctl.rb:35

^CERROR everyNS: Exception: return can't jump across threads (ThreadError)
    system:exp:stdlib:54
    /usr/share/omf-expctl-5.3/omf-expctl/node/nodeSetPath.rb:153:in `call'
    /usr/share/omf-expctl-5.3/omf-expctl/node/nodeSetPath.rb:153:in `call'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:495:in `everyNS'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `initialize'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `new'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `everyNS'
    system:exp:stdlib:50:in `_binding'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:683:in `_binding'
    /usr/share/omf-expctl-5.3/omf-expctl/oconfig.rb:160:in `load'
    /usr/share/omf-expctl-5.3/omf-expctl/experiment.rb:102:in `load'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:357:in `run'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:356:in `each'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:356:in `run'
    /usr/share/omf-expctl-5.3/omf-expctl.rb:35
ERROR everyNS: Exception: return can't jump across threads (ThreadError)
    system:exp:stdlib:54
    /usr/share/omf-expctl-5.3/omf-expctl/node/nodeSetPath.rb:153:in `call'
    /usr/share/omf-expctl-5.3/omf-expctl/node/nodeSetPath.rb:153:in `call'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:495:in `everyNS'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `initialize'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `new'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:490:in `everyNS'
    system:exp:stdlib:50:in `_binding'
    /usr/share/omf-expctl-5.3/omf-expctl/handlerCommands.rb:683:in `_binding'
    /usr/share/omf-expctl-5.3/omf-expctl/oconfig.rb:160:in `load'
    /usr/share/omf-expctl-5.3/omf-expctl/experiment.rb:102:in `load'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:357:in `run'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:356:in `each'
    /usr/share/omf-expctl-5.3/omf-expctl/nodeHandler.rb:356:in `run'
    /usr/share/omf-expctl-5.3/omf-expctl.rb:35

Updated by Thierry Rakotoarivelo over 1 year ago

Fixed in 08b750dd
- In stdlib.rb, returning a value from the everyNS block sometime raised an
Exception, as this block was executed in another Thread.

Updated by Thierry Rakotoarivelo over 1 year ago

  • Status changed from Assigned to Resolved

Updated by Thierry Rakotoarivelo over 1 year ago

  • Status changed from Resolved to Closed

Updated by Christoph Dwertmann about 1 year ago

  • Status changed from Closed to Assigned

Problem still exists in 290080b8:

root@vm8:/home/ubuntu# omf exec test:exp:test01.rb -- --res1 omf.nicta.node1

 INFO NodeHandler: OMF Experiment Controller 5.3.705
 INFO NodeHandler: Slice ID: default_slice (default)
 INFO NodeHandler: Experiment ID: default_slice-2010-11-11t03.58.05+00.00
 INFO NodeHandler: Message authentication is disabled
 INFO NodeHandler: Web interface available at: http://10.0.0.200:4000
 INFO Experiment: load system:exp:stdlib
 INFO property.resetDelay: value = 210 (Fixnum)
 INFO property.resetTries: value = 1 (Fixnum)
 INFO Experiment: load system:exp:eventlib
 INFO Experiment: load test:exp:test01.rb
 INFO property.res1: value = "omf.nicta.node1" (String)
 INFO property.res2: value = "unconfigured-node-2" (String)
 INFO Topology: Loading topology 'omf.nicta.node1'.
^C INFO INTERRUPT: Event triggered. Starting the associated tasks.
 INFO INTERRUPT: 
User-requested Interruption... Exiting Now!

^[[A^[[B INFO EXPERIMENT_DONE: Event triggered. Starting the associated tasks.
^[[A INFO stdlib: Waiting for nodes (Up/Down/Total): 0/1/1 - (still down: omf.nicta.node1)

^C
 INFO stdlib: Waiting for nodes (Up/Down/Total): 0/1/1 - (still down: omf.nicta.node1)
^C

Waiting for 5 minutes now, still hasn't exited and doesn't respond to further CTRL-C presses. Here's the EC log:

2010-11-11 03:58:11 DEBUG nodeHandler::OMFPubSubTransport: Cannot extract message from PubSub event '<service-response><response-to>default_slice-2010-11-11t03.58.05+00.00</response-to><message-id>3</message-id><timestamp>1289447890</timestamp><status/></service-response>'
2010-11-11 03:58:11 DEBUG nodeHandler::OMFPubSubTransport: Error: 'Unrecognized message body <service-response><response-to>default_slice-2010-11-11t03.58.05+00.00</response-to><message-id>3</message-id><timestamp>1289447890</timestamp><status/></service-response>'
Event received on '/OMF/system')
2010-11-11 03:58:11 INFO nodeHandler::event::INTERRUPT: Event triggered. Starting the associated tasks.
2010-11-11 03:58:11 INFO nodeHandler::INTERRUPT: 
User-requested Interruption... Exiting Now!

2010-11-11 03:58:13 INFO nodeHandler::event::EXPERIMENT_DONE: Event triggered. Starting the associated tasks.
2010-11-11 03:58:14 DEBUG nodeHandler::set::_ALLGROUPS_: Running each for Actor
2010-11-11 03:58:16 DEBUG nodeHandler::CMC: Can't switch ON node 'omf.nicta.node1'
2010-11-11 03:58:16 DEBUG nodeHandler::node::omf.nicta.node1: Configure path 'expconfigure' with value '1' - status: 'default_slice-2010-11-11t03.58.05+00.00'
2010-11-11 03:58:16 DEBUG nodeHandler::OmfXMPPServices: Could not subscribe to unknown node '/OMF/default_slice/default_slice-2010-11-11t03.58.05+00.00/omf.nicta.node1' on domain 'norbit.npc.nicta.com.au'
2010-11-11 03:58:16 DEBUG nodeHandler::OMFPubSubTransport: Creating new node '/OMF/default_slice/default_slice-2010-11-11t03.58.05+00.00/omf.nicta.node1'
2010-11-11 03:58:17 DEBUG nodeHandler::OMFPubSubTransport: Received on '/OMF/default_slice/default_slice-2010-11-11t03.58.05+00.00/omf.nicta.node1' - msg: '<APP_EVENT id='omf-payload'><MESSAGE>status: 9</MESSAGE><TARGET>omf.nicta.node11</TARGET><EXPID>testing_slice-2010-09-16t08.46.33+10.00</EXPID><VALUE>DONE.ERROR</VALUE><SLICEID>testing_slice</SLICEID><APPID>freezedccp_qoe_recv_app_2</APPID></APP_EVENT>'
2010-11-11 03:58:17 DEBUG nodeHandler::OMFPubSubTransport: Listening on '/OMF/default_slice/default_slice-2010-11-11t03.58.05+00.00/omf.nicta.node1' at 'norbit.npc.nicta.com.au'
2010-11-11 03:58:17 DEBUG nodeHandler::ECCommunicator: Ignoring message with unknown slice and exp IDs: 'testing_slice' and 'testing_slice-2010-09-16t08.46.33+10.00'
2010-11-11 03:58:17 DEBUG nodeHandler::OMFPubSubTransport: Listening on '/OMF/default_slice/default_slice-2010-11-11t03.58.05+00.00/omf.nicta.node1' at 'norbit.npc.nicta.com.au'
2010-11-11 03:58:17 DEBUG nodeHandler::OMFPubSubTransport: Received on '/OMF/default_slice/resources/omf.nicta.node1' - msg: '<ENROLL id='omf-payload'><SLICEID>default_slice</SLICEID><IMAGE>*</IMAGE><TARGET>omf.nicta.node1</TARGET><INDEX>10</INDEX><EXPID>default_slice-2010-09-29t14.33.17+10.00</EXPID></ENROLL>'
2010-11-11 03:58:17 DEBUG nodeHandler::OMFPubSubTransport: Listening on '/OMF/default_slice/resources/omf.nicta.node1' at 'norbit.npc.nicta.com.au'
2010-11-11 03:58:17 DEBUG nodeHandler::OMFPubSubTransport: Send to '/OMF/default_slice/resources/omf.nicta.node1' - msg: '<omf-message><CONFIGURE id='omf-payload'><TARGET>omf.nicta.node1</TARGET><IMAGE>*</IMAGE><INDEX>1</INDEX><EXPID>default_slice-2010-11-11t03.58.05+00.00</EXPID><PATH>exp/configure</PATH><SLICEID>default_slice</SLICEID></CONFIGURE></omf-message>'
2010-11-11 03:58:17 DEBUG nodeHandler::OMFPubSubTransport: Received on '/OMF/default_slice/resources/omf.nicta.node1' - msg: '<CONFIGURE id='omf-payload'><TARGET>omf.nicta.node1</TARGET><IMAGE>*</IMAGE><INDEX>1</INDEX><EXPID>default_slice-2010-11-11t03.58.05+00.00</EXPID><PATH>exp/configure</PATH><SLICEID>default_slice</SLICEID></CONFIGURE>'
2010-11-11 03:58:18 DEBUG nodeHandler::every: *: fires
2010-11-11 03:58:18 DEBUG nodeHandler::set::_ALLGROUPS_: Running each for Actor
2010-11-11 03:58:18 INFO nodeHandler::stdlib: Waiting for nodes (Up/Down/Total): 0/1/1 - (still down: omf.nicta.node1)
2010-11-11 03:58:19 DEBUG nodeHandler::set::_ALLGROUPS_: Running each for Actor
2010-11-11 03:58:24 DEBUG nodeHandler::set::_ALLGROUPS_: Running each for Actor
2010-11-11 03:58:28 DEBUG nodeHandler::every: *: fires
2010-11-11 03:58:28 DEBUG nodeHandler::set::_ALLGROUPS_: Running each for Actor
2010-11-11 03:58:28 INFO nodeHandler::stdlib: Waiting for nodes (Up/Down/Total): 0/1/1 - (still down: omf.nicta.node1)
2010-11-11 03:58:28 DEBUG nodeHandler::every: *: finishes
2010-11-11 03:58:29 DEBUG nodeHandler::set::_ALLGROUPS_: Running each for Actor
2010-11-11 03:58:34 DEBUG nodeHandler::set::_ALLGROUPS_: Running each for Actor

Updated by Thierry Rakotoarivelo 12 months ago

  • % Done changed from 100 to 80

Updated by Thierry Rakotoarivelo 12 months ago

  • Status changed from Assigned to Resolved
  • % Done changed from 80 to 100

Fixed in e97d3000 on the release-5.3 branch.

  • Original code:
    After trapping the CTRL-C the EC was setting its state to 'Interrupted'. The separate Event thread (monitoring that state) start the shutting down process of the EC. However, if the user press CTRL-C again (and again, etc...) the EC still trapped it and still tried to set its state to 'Interrupted' on a cleaning EC... This caused the EC to be unresponsive.
  • In the fix:
    Now the EC only set its state to 'Interrupted' if it is still in the 'running' state, i.e. it won't do anything with the trapped signal if it is already in the 'not running' state.

Updated by Thierry Rakotoarivelo 12 months ago

  • Status changed from Resolved to Closed

Applied the fix to the Master branch (3251190b1319e4a2e9352742c0494a1276a6123d)

Also available in: Atom PDF