Supplementary Material: Structured Images for RGB-D Action Recognition Pichao Wang1∗, Shuang Wang2∗, Zhimin Gao1 , Yonghong Hou2†and Wanqing Li1 1 Advanced Multimedia Research Lab, University of Wollongong, Australia 2 School of Electronic Information Engineering, Tianjin University, China
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
1. Confusion Matrices From the following confusion matrices, we can see that the recognition accuracy improves as the granularity increases. The structured motion images aggregate spatio-temporal information contained in the three hierarchical spatial levels by using rank pooling, and keeps the structure information of human motion explicitly using the proposed structured dynamic images. 0.09 0.03 0.06 0.18 0.03 High arm wave 0.61 0.03 Horizontal arm wave 0.97 0.67 0.03 0.09 0.21 Hammer 0.09 0.48 0.03 0.27 0.03 0.09 Hand catch 0.07 0.90 0.03 Forwad patch 0.03 0.97 High throw 0.03 0.07 0.10 0.70 0.07 0.03 Draw X 0.83 0.07 0.03 0.03 0.03 Draw tick Draw circle 0.03 0.10 0.03 0.03 0.80 1.00 Hand clap 1.00 Two hand wave 0.03 0.03 0.94 Side−boxing 0.97 0.03 Bend 0.03 0.03 0.03 0.12 0.79 Forward kick 0.03 0.97 Side kick 0.15 0.76 0.09 Jogging 0.06 0.03 0.12 0.48 Tennis swing 0.060.09 0.030.090.03 0.50 0.03 0.47 Tennis serve 0.03 0.03 0.88 0.03 0.03 Golf swing 0.03 0.03 0.65 0.24 0.06 Pickup&throw
0.03 High arm wave 0.93 0.03 0.03 Horizontal arm wave 0.97 0.93 0.03 0.03 Hammer 0.70 0.13 0.13 Hand catch 0.03 0.19 0.77 0.04 Forwad patch 0.15 0.65 0.19 High throw 0.23 0.03 0.73 Draw X 0.27 0.73 Draw tick 0.03 0.97 Draw circle 1.00 Hand clap 1.00 Two hand wave 1.00 Side−boxing 0.97 0.03 Bend 0.03 0.97 Forward kick 0.79 0.21 Side kick 1.00 Jogging 1.00 Tennis swing 1.00 Tennis serve 1.00 Golf swing 1.00 Pickup&throw
Figure 1: Confusion matrix for structured body DDI on Figure 2: Confusion matrix for structured part DDI on MSRAction3D Dataset. MSRAction3D Dataset. 0.03 0.07 High arm wave 0.90 0.03 Horizontal arm wave 0.97 Hammer 0.03 0.97 0.90 0.03 0.07 Hand catch 0.96 0.04 Forwad patch 0.73 High throw 0.27 1.00 Draw X 1.00 Draw tick 0.90 0.10 Draw circle 1.00 Hand clap 1.00 Two hand wave 1.00 Side−boxing 1.00 Bend 0.03 0.97 Forward kick 0.97 0.03 Side kick 1.00 Jogging 1.00 Tennis swing 0.03 0.97 Tennis serve 1.00 Golf swing 0.03 0.03 0.94 Pickup&throw
Figure 3: Confusion matrix for structured joint DDI on MSRAction3D Dataset. ∗ Both
authors contributed equally to this work author
† Corresponding
1.1. Confusion Matrix for MSRAction3D Dataset From the final results we can see that the proposed method can well recognize the simple actions, as it aggregates spatiotemporal information from global to fine-grained by adopting the three hierarchical spatial levels and keeps the structure information of human body by using structured images. Moreover, the multiply-score fusion method works very well on this dataset. For example, action “Forward kick” is confused with “Bend” in part and joint DDIs, but not confused with “Bend” in body DDIs, and after multiply score fusion, “Forward kick” is well recognised from “Bend”. It implies that the three structured DDIs are likely to be statistically independent and complementary to each other.
1.2. Confusion Matrix for G3D Dataset Punch right 0.070.93 Punch right 1.00 0.07 Punch left 0.200.73 Punch left 1.00 0.730.27 0.930.07 Kick right Kick right 0.530.47 1.00 Kick left Kick left 0.07 0.93 1.00 Defend Defend 0.20 0.07 0.07 0.87 0.80 Golf swing Golf swing 0.200.470.13 0.130.800.07 0.07 Tennis swing forehand 0.070.07 Tennis swing forehand 0.27 0.400.33 0.330.400.130.07 Tennis swing backhand 0.07 Tennis swing backhand 0.070.600.20 1.00 Tennis serve 0.13 Tennis serve 0.130.07 0.80 1.00 Throw bowling ball Throw bowling ball 0.04 1.00 0.96 Aim and fire gun Aim and fire gun 0.930.07 1.00 Walk Walk 0.070.93 1.00 Run Run 0.13 0.070.670.13 1.00 Jump Jump 0.07 0.93 1.00 Climb Climb 0.07 0.07 0.80 0.93 0.07 0.07 Crouch Crouch 0.930.07 0.07 0.93 Steer a car Steer a car 0.20 0.80 0.07 0.93 Wave Wave 0.270.73 1.00 Flap Flap 1.00 1.00 Clap Clap
Figure 4: Confusion matrix for structured body DDI on G3D Figure 5: Confusion matrix for structured part DDI on G3D Dataset. Dataset. Punch right 1.00 Punch right 0.870.13 Punch left 1.00 Punch left 1.00 0.870.13 1.00 Kick right Kick right 1.00 1.00 Kick left Kick left 0.20 0.07 0.80 0.93 Defend Defend 1.00 1.00 Golf swing Golf swing 0.130.80 0.130.87 Tennis swing forehand 0.07 Tennis swing forehand 0.07 0.80 0.20 0.07 0.200.200.47 Tennis swing backhand Tennis swing backhand 0.07 0.93 0.87 Tennis serve Tennis serve 0.13 1.00 1.00 Throw bowling ball Throw bowling ball 1.00 1.00 Aim and fire gun Aim and fire gun 1.00 1.00 Walk Walk 1.00 1.00 Run Run 1.00 1.00 Jump Jump 1.00 1.00 Climb Climb 0.07 0.93 0.93 0.07 Crouch Crouch 1.00 1.00 Steer a car Steer a car 0.20 0.27 0.73 0.80 Wave Wave 1.00 1.00 Flap Flap 1.00 1.00 Clap Clap
Figure 6: Confusion matrix for structured joint DDI on G3D Dataset.
Figure 7: Confusion matrix for S2 DDI on G3D Dataset.
From the confusion matrix for S2 DDI we can see that the proposed method confuses “Tennis swing backhand” with “Tennis swing forehand” and “Golf swing” after score fusion, even though “Tennis swing backhand” can be recognized much better based on structured joint DDI.
1.3. Confusion Matrix for SYSU 3D HOI Dataset Drinking
0.65
0.10
Pouring
0.10
0.75
0.10
0.05
Calling phone
0.15
0.10
0.40
0.20
0.10
0.05
0.05
0.80
Playing phone Wearing backpacks
0.15
0.10
Packing backpacks
0.10
0.05
Drinking
0.10
0.05
0.10
0.05
0.60
Sitting chair
0.05
Moving chair
0.05
Wearing backpacks
0.05
Packing backpacks
0.05
0.10
0.15
0.05
0.60
0.05
0.50
0.05
0.25
0.15
Mopping Sweeping
0.05
0.10 1.00
0.05
0.05
0.85
0.05 1.00 0.80
Sitting chair 0.90
0.15
0.90
Playing phone 0.05
1.00
Taking out wallet Taking from wallet
Pouring
0.20
Calling phone
0.05
0.10 0.75
0.80
0.05
0.05 0.95
Moving chair
0.05
Taking from wallet 0.60
0.35
Mopping
0.10
0.25
0.55
Sweeping
0.20
0.15
0.10
0.95
0.05
0.95
Taking out wallet
0.05
0.05 0.05
0.15
0.10
0.05
0.35
0.05
0.05
0.90 0.05
0.10
0.85
Figure 8: Confusion matrix for structured body DDI on SYSU Figure 9: Confusion matrix for structured part DDI on SYSU 3D HOI Dataset. 3D HOI Dataset. Drinking
Calling phone Playing phone
Drinking
1.00
Pouring 0.10
0.90
Playing phone
0.05
0.05
0.05
0.85
Sitting chair
0.05
Mopping
1.00
Sweeping
1.00
Sweeping
0.05
1.00 1.00
Taking out wallet Taking from wallet
0.85
0.05
0.90
Sitting chair
1.00 0.05
1.00
Moving chair
1.00
Taking out wallet
0.05
0.90
Packing backpacks
0.05
1.00
Moving chair
0.10
1.00 0.05
Wearing backpacks
1.00
Packing backpacks
Mopping
1.00
Calling phone
0.90
Wearing backpacks
Taking from wallet
1.00
Pouring
1.00
1.00 0.20
0.05
0.10
0.65 1.00 1.00
Figure 10: Confusion matrix for structured joint DDI on SYSU Figure 11: Confusion matrix for S2 DDI on SYSU 3D HOI 3D HOI Dataset. Dataset. From the confusion matrices we can see that, the structured joint DDI achieved the best performance. The “Taking from wallet” action is greatly confused in structured body and part DDIs, that affects the final performance of S2 DDI.
1.4. Confusion Matrix for UTD-MHAD Dataset From the confusion matrices, we can see that as the granularity increases, the proposed method achieves higher accuracy. After multiply score fusion, S2 DDI achieves the best result. It implies that the three structured DDIs are likely to be statistically independent and provide complementary information.
Swipe left 0.500.50 Swipe left 0.920.08 Swipe right 1.00 Swipe right 0.080.92 0.50 0.42 0.08 0.08 Wave 0.250.67 Wave 0.75 0.58 0.08 0.42 0.17 Clap Clap 0.67 0.75 0.25 Throw 0.250.08 Throw 1.00 1.00 Arm cross Arm cross 0.08 0.42 0.92 0.58 Basketball shoot Basketball shoot 0.83 0.080.08 0.670.33 Draw X Draw X 0.080.58 1.00 Draw circle(cw) 0.170.17 Draw circle(cw) 0.17 0.920.08 Draw circle(ccw) 0.080.75 Draw circle(ccw) 0.83 0.08 0.42 0.17 0.50 Draw triangle Draw triangle 0.83 0.75 0.17 0.25 Bowling Bowling 1.00 1.00 Boxing Boxing 0.750.25 1.00 Baseball swing Baseball swing 0.170.42 1.00 0.17 0.08 0.08 0.08 Tennis swing Tennis swing 0.75 0.58 0.08 0.33 0.08 0.08 0.08 Arm curl Arm curl 1.00 0.250.08 0.25 0.33 Tennis serve 0.08 Tennis serve 0.50 0.67 0.080.25 0.080.17 0.25 Push Push 0.33 0.08 0.92 0.08 0.08 0.420.08 Knock Knock 0.250.330.08 0.33 0.25 0.08 0.08 0.08 0.25 0.17 Catch 0.08 Catch 1.00 0.08 0.25 0.08 0.500.08 Pickup and throw Pickup and throw 1.00 1.00 Jog Jog 0.92 1.00 0.08 Walk Walk 0.920.08 0.830.17 Sit to stand Sit to stand 1.00 1.00 Stand to sit Stand to sit 0.75 1.00 0.25 Lunge Lunge 0.58 0.33 0.08 0.08 0.42 0.50 Squat Squat
Figure 12: Confusion matrix for structured body DDI on UTD- Figure 13: Confusion matrix for structured part DDI on UTDMHAD Dataset. MHAD Dataset. Swipe left 1.00 Swipe left 0.830.17 Swipe right 1.00 Swipe right 1.00 1.00 1.00 Wave Wave 1.00 1.00 Clap Clap 0.67 0.67 0.08 0.08 0.25 0.25 Throw Throw 1.00 1.00 Arm cross Arm cross 1.00 1.00 Basketball shoot Basketball shoot 1.00 1.00 Draw X Draw X 1.00 0.92 Draw circle(cw) Draw circle(cw) 0.08 0.750.25 0.33 0.67 Draw circle(ccw) Draw circle(ccw) 1.00 1.00 Draw triangle Draw triangle 1.00 1.00 Bowling Bowling 1.00 1.00 Boxing Boxing 1.00 1.00 Baseball swing Baseball swing 1.00 0.92 0.08 Tennis swing Tennis swing 0.92 1.00 0.08 Arm curl Arm curl 1.00 1.00 Tennis serve Tennis serve 0.080.17 0.08 0.50 0.33 0.58 0.25 Push Push 0.83 0.08 0.670.25 0.08 0.08 Knock Knock 0.080.75 0.17 0.080.92 Catch Catch 0.25 1.00 0.08 0.08 0.50 0.08 Pickup and throw Pickup and throw 1.00 1.00 Jog Jog 1.00 1.00 Walk Walk 0.920.08 0.920.08 Sit to stand Sit to stand 1.00 1.00 Stand to sit Stand to sit 0.08 0.92 1.00 Lunge Lunge 0.67 0.25 0.75 0.33 Squat Squat
Figure 14: Confusion matrix for structured joint DDI on UTD- Figure 15: Confusion matrix for S2 DDI on UTD-MHAD MHAD Dataset. Dataset.