Low Overhead Fault Tolerant Networking in Myrinet*

Vijay Lakamraju, Israel Koren and C.M. Krishna
Department of Electrical and Computer ....
.... of Injections. Our work. Iyer et al. [15]. Local Interface Hung. 28.6. 23.4.
                                            

                                                                                                             


Table 1. Results of fault injection on a Myrinet system (1000 runs)

Figure 9. The timeline of the fault recovery process
[Figure not reproduced. Labeled events: fault injected; interrupt raised; interrupt handled; FTD woken up; MCP reloaded; FAULT event(s) posted; per-process recovery started; handling of send tokens. Labeled intervals: interrupt latency; context-switch overhead; fault detection time; FTD recovery time; per-process recovery time.]
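
The three intervals named in Figure 9 (fault detection time, FTD recovery time, per-process recovery time) can be treated as components of one end-to-end recovery time. The sketch below is illustrative only: it assumes the intervals are consecutive and do not overlap, the class and field names are hypothetical, and the numbers are arbitrary placeholders, not measurements from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RecoveryTimeline:
    """Durations of the three intervals labeled in Figure 9 (arbitrary time units)."""
    fault_detection: float
    ftd_recovery: float
    per_process_recovery: float

    def total(self) -> float:
        # Assumption: the intervals are consecutive and non-overlapping,
        # so the end-to-end recovery time is their plain sum.
        return self.fault_detection + self.ftd_recovery + self.per_process_recovery

    def shares(self) -> Dict[str, float]:
        # Fraction of the total recovery time spent in each interval.
        total = self.total()
        if total <= 0:
            raise ValueError("total recovery time must be positive")
        return {
            "fault_detection": self.fault_detection / total,
            "ftd_recovery": self.ftd_recovery / total,
            "per_process_recovery": self.per_process_recovery / total,
        }


# Placeholder values chosen only to exercise the code; NOT data from the paper.
example = RecoveryTimeline(fault_detection=1.0, ftd_recovery=2.0, per_process_recovery=1.0)
```

Calling `example.shares()` shows which interval dominates under the chosen placeholder values; real values would have to come from the measurements summarized in the paper's tables.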


Table 3. Components of the fault recovery time
