Correcting the Dynamic Call Graph Using Control Flow Constraints
Byeongcheol (BK) Lee, Kevin Resnick, Michael Bond, Kathryn McKinley
UT Austin

Motivation

- Complexity of large object-oriented programs
  - Programs are decomposed into small methods
  - Method boundaries become a performance bottleneck
- Dynamic interprocedural optimization
  - Solves the method-boundary problem
  - Inlining and specialization can change performance by a factor of 2
  - The dynamic call graph (DCG) is a critical input!

[Figure: DCG in which method a calls b (edge weight w1) and c (edge weight w2)]

Dynamic call graph

Inaccurate call graph

[Figure: method a contains call b and call c. The sampled call graph, DCGSample, records a→b: 1,000 and a→c: 500; these edge weights are in error.]

Timer-based sampling and timing bias

[Figure (animation): the call stack over time t while method a repeatedly calls b and c. A timer ticks periodically, and at each tick the current call edge is recorded in DCGSample. Because the ticks are correlated with the program's execution pattern, the samples accumulate to a→b: 1,000 and a→c: 500.]
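Timing bias can be reproduced with a toy simulation. This is a minimal sketch under hypothetical assumptions that do not come from the slides: method a loops, calling b once and c twice per iteration; every call lasts one time unit; and each timer tick is charged to the edge executing at that instant. The names `LOOP_BODY`, `sample_dcg`, and `exact_dcg` are illustrative only.

```python
from collections import Counter

# Hypothetical workload: each iteration of a's loop calls b once, then c twice.
LOOP_BODY = [("a->b", 1), ("a->c", 1), ("a->c", 1)]  # (edge, duration in time units)


def sample_dcg(iterations: int, period: int) -> Counter:
    """Charge each timer tick to the call edge executing when it fires."""
    samples: Counter = Counter()
    clock = 0            # start time of the current call
    next_tick = period   # time of the next timer interrupt
    for _ in range(iterations):
        for edge, duration in LOOP_BODY:
            end = clock + duration
            # Every tick in [clock, end) falls inside this call.
            while next_tick < end:
                samples[edge] += 1
                next_tick += period
            clock = end
    return samples


def exact_dcg(iterations: int) -> Counter:
    """Ground truth: one a->b call and two a->c calls per iteration."""
    return Counter({"a->b": iterations, "a->c": 2 * iterations})


if __name__ == "__main__":
    print("exact:", dict(exact_dcg(100)))
    print("period 3 (aligned with the loop):", dict(sample_dcg(100, 3)))
    print("period 4 (not aligned):", dict(sample_dcg(100, 4)))
```

With a period of 3 every tick lands in b (99 samples for a→b, none for a→c), reversing the true 1:2 ratio; with a period of 4 the samples split 24 to 50, close to the true ratio.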

Overhead and accuracy in call graph profiling

[Figure: overhead (%, 0 to 25) versus accuracy (%, 40 to 100) for full instrumentation, timer-based sampling [2000], Arnold-Grove sampling [2005], and correction [2007]]

Outline

- Motivation
- Call graph correction
- Evaluation

Timing bias in SPECjvm98 raytrace

[Figure, two builds: normalized frequency (%) of method calls grouped by source method, as measured by sampling]

Correction algorithms

- Detect and correct DCG errors using a DCG constraint
- New static and dynamic approaches
  - Static FDOM (frequency dominator) correction: uses a static FDOM constraint on the DCG
  - Dynamic basic block profile correction: uses a dynamic basic block profile constraint on the DCG

Static FDOM constraint

- FDOM constraint on the CFG
  - In method a, call c FDOM call b: call c is executed at least as many times as call b
- FDOM constraint on the DCG
  - $f(a \to c) \ge f(a \to b)$

[Figure: CFG of method a containing call b and call c]
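Checking a DCG against FDOM constraints can be sketched as follows. The function `fdom_violations`, the `(caller, callee)` edge representation, and the assumption that FDOM pairs come precomputed from a static analysis of each method's CFG are all illustrative; the slides do not specify them. The sketch also assumes monomorphic call sites, so each call-site count is the weight of one DCG edge.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def fdom_violations(dcg: Dict[Edge, int],
                    fdom_pairs: List[Tuple[Edge, Edge]]) -> List[Tuple[Edge, Edge]]:
    """Return each (dominator, dominated) pair whose DCG weights break
    f(dominator) >= f(dominated).

    Assumes every call site is monomorphic, so a call-site count equals the
    weight of exactly one DCG edge.
    """
    return [(dom, sub) for dom, sub in fdom_pairs
            if dcg.get(dom, 0) < dcg.get(sub, 0)]


dcg_sample = {("a", "b"): 1000, ("a", "c"): 500}
# call c FDOM call b in method a, so f(a->c) >= f(a->b) must hold.
pairs = [(("a", "c"), ("a", "b"))]
print(fdom_violations(dcg_sample, pairs))  # [(('a', 'c'), ('a', 'b'))]
```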

Static FDOM correction

- FDOM constraint: $f(a \to c) \ge f(a \to b)$
- DCGSample: a→b 1,000 and a→c 500, which violates the constraint
- Correction: detect the error and assign both edges the same average frequency, (1,000 + 500) / 2 = 750
- DCGFDOMCorrection: a→b 750 and a→c 750
- This is one possible solution to the FDOM constraint; it preserves the total frequency sum
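The averaging step above can be written in a few lines. This is a sketch of the single-pair case only; `fdom_correct` is a hypothetical name, and how overlapping or chained FDOM pairs would be resolved is not covered here.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def fdom_correct(dcg: Dict[Edge, float], dom: Edge, sub: Edge) -> Dict[Edge, float]:
    """If f(dom) < f(sub) violates dom FDOM sub, give both edges their mean.

    Averaging satisfies f(dom) >= f(sub) with equality and keeps
    f(dom) + f(sub), hence the total DCG weight, unchanged.
    """
    fixed = dict(dcg)
    if fixed[dom] < fixed[sub]:
        mean = (fixed[dom] + fixed[sub]) / 2
        fixed[dom] = fixed[sub] = mean
    return fixed


sample = {("a", "b"): 1000, ("a", "c"): 500}
print(fdom_correct(sample, dom=("a", "c"), sub=("a", "b")))
# {('a', 'b'): 750.0, ('a', 'c'): 750.0}
```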

Dynamic basic block profile constraint

- Some dynamic optimization systems perform edge profiling, e.g., the baseline compiler in Jikes RVM
- Dynamic basic block profile constraint on the CFG
  - $f(\text{call } c) = 2 \cdot f(\text{call } b)$
- Dynamic basic block profile constraint on the DCG
  - $f(a \to c) = 2 \cdot f(a \to b)$

[Figure: CFG of method a with 50%/50% branch probabilities, containing call b and call c]
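A basic block profile turns into a DCG constraint once each call site is mapped to the block that contains it. The sketch below assumes a hypothetical CFG for method a (a 50/50 branch with call b on one arm and call c after the join) and monomorphic call sites; `profile_ratio`, the block names, and the counts are illustrative, not taken from Jikes RVM.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def profile_ratio(block_count: Dict[str, int], site_block: Dict[Edge, str],
                  edge: Edge, ref: Edge) -> float:
    """Return k such that the profile implies f(edge) = k * f(ref).

    Assumes each call edge comes from a monomorphic call site, so its call
    count equals the execution count of the basic block holding that site.
    """
    return block_count[site_block[edge]] / block_count[site_block[ref]]


# Hypothetical CFG of method a: a 50/50 branch, call b on one arm, call c after the join.
counts = {"entry": 1000, "then": 500, "else": 500, "join": 1000}
sites = {("a", "b"): "then", ("a", "c"): "join"}
print(profile_ratio(counts, sites, edge=("a", "c"), ref=("a", "b")))  # 2.0
```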

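Dynamic basic block profile correction

- Constraint: $f(a \to c) = 2 \cdot f(a \to b)$
- DCGSample: a→b 1,000 and a→c 500
- Correction: redistribute the sampled total in the ratio given by the constraint
  - $f_{new}(a \to b) = \frac{1}{1+2} \cdot (1{,}000 + 500) = 500$
  - $f_{new}(a \to c) = \frac{2}{1+2} \cdot (1{,}000 + 500) = 1{,}000$
- DCGEdgeProfileCorrection: a→b 500 and a→c 1,000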
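The redistribution arithmetic generalizes to any number of edges from the same caller. A minimal sketch, assuming the profile ratios are already known (for example 1:2 for a→b and a→c); `profile_correct` is a hypothetical name.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def profile_correct(dcg: Dict[Edge, float],
                    ratio: Dict[Edge, float]) -> Dict[Edge, float]:
    """Redistribute the sampled total of the edges in `ratio` so their
    weights follow the basic-block-profile ratio (e.g. b:c = 1:2).
    The total weight of these edges is preserved.
    """
    total = sum(dcg[e] for e in ratio)
    ratio_sum = sum(ratio.values())
    fixed = dict(dcg)
    for e, r in ratio.items():
        fixed[e] = r * total / ratio_sum
    return fixed


sample = {("a", "b"): 1000, ("a", "c"): 500}
print(profile_correct(sample, {("a", "b"): 1, ("a", "c"): 2}))
# {('a', 'b'): 500.0, ('a', 'c'): 1000.0}
```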

Best result: raytrace

[Figure, three panels: normalized frequency (%) of method calls for sampling, static FDOM correction, and dynamic basic block profile correction]


Experimental methodology

- Jikes RVM 2.4.5 on a 3.2 GHz Pentium 4
- Replay methodology [Blackburn et al. '06]
  - Deterministic run
  - 1st iteration: compilation + application run
  - 2nd iteration: application run
- Measurement
  - Accuracy: overlap accuracy [Arnold & Grove '05]
  - Overhead: 1st iteration, which includes call graph correction
  - Performance: 2nd iteration, which is application-only
- SPECjvm98 and DaCapo benchmarks
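For reference, the overlap accuracy metric can be sketched as below. This assumes the definition commonly used in call graph profiling work: normalize each DCG's edge weights to sum to 1, then sum the smaller of the two normalized weights over all edges. Consult Arnold & Grove '05 for the exact formulation; the example graphs reuse the toy numbers from these slides.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def overlap(dcg1: Dict[Edge, float], dcg2: Dict[Edge, float]) -> float:
    """Overlap of two DCGs: normalize each graph's edge weights to sum to 1,
    then add up the smaller of the two normalized weights for every edge.
    Returns 1.0 for identical distributions and 0.0 for disjoint ones.
    """
    t1, t2 = sum(dcg1.values()), sum(dcg2.values())
    edges = set(dcg1) | set(dcg2)
    return sum(min(dcg1.get(e, 0) / t1, dcg2.get(e, 0) / t2) for e in edges)


perfect = {("a", "b"): 5000, ("a", "c"): 10000}
sampled = {("a", "b"): 1000, ("a", "c"): 500}
corrected = {("a", "b"): 500, ("a", "c"): 1000}
print(round(overlap(perfect, sampled), 3))    # 0.667
print(round(overlap(perfect, corrected), 3))  # 1.0
```

Under this metric the FDOM-averaged graph (750/750) scores 5/6, between the uncorrected sample and the profile-corrected graph.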

Accuracy

[Figure: accuracy (%) of no correction, static FDOM correction, and dynamic basic block profile correction for compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack, antlr, bloat, fop, hsqldb, jython, luindex, ipsixql, jbb, and the average]

Overhead

[Figure: normalized execution time (0.9 to 1.04) of static FDOM correction and dynamic basic block profile correction for the same benchmarks and the average]

Inlining performance

[Figure: normalized execution time (0.8 to 1.05) with static FDOM correction, dynamic basic block profile correction, and a perfect DCG for the same benchmarks and the average]

Baseline: profile-guided inlining with default call graph sampling

Summary

- CFG constraints improve the DCG
- Inlining has been tuned for an inaccurate call graph
- Advantages
  - Can easily be combined with other DCG profiling techniques
  - Minimal overhead, incurred only during compilation

Future work

- More interprocedural optimizations that exploit a high-accuracy DCG

Questions and comments

Thank you!

Timing bias misleads the optimizer

- DCGPerfect: a→b 5,000 times, a→c 10,000 times
- Sampling with timing bias gives DCGSample: a→b 1,000 samples, a→c 500 samples
- The edge frequencies were reversed!
- Inlining decision: the inliner may inline b instead of c
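The effect on the inlining decision can be shown with a deliberately simple, hypothetical policy that inlines only the callee on the heaviest outgoing edge of a caller. Real inliners weigh size, budget, and more; `hottest_callee` is illustrative only.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def hottest_callee(dcg: Dict[Edge, int], caller: str) -> str:
    """Toy inlining policy: pick the callee on the heaviest outgoing edge."""
    out = {callee: w for (src, callee), w in dcg.items() if src == caller}
    return max(out, key=out.get)


dcg_perfect = {("a", "b"): 5000, ("a", "c"): 10000}
dcg_sample = {("a", "b"): 1000, ("a", "c"): 500}
print(hottest_callee(dcg_perfect, "a"))  # c
print(hottest_callee(dcg_sample, "a"))   # b
```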

Call graph profiling in an online optimization system

[Figure: source program (e.g., Java bytecode) → compile & instrument → machine code → dynamic call graph → online optimization system]

- Profiling and the program run at the same time
- Minimize profiling overhead
- Corollary: sacrifice profiling accuracy