Using a CUDA-enabled Graphics Card to Accelerate Neural Network Design for Breast Cancer Computer-aided Diagnosis

By Fumbeya Luis Marungo

MSc Advanced Information Systems project report
Department of Computer Science, Birkbeck, University of London
2010

This report is substantially the result of my own work except where explicitly indicated in the text. I give my permission for it to be submitted to the JISC Plagiarism Detection Service.
The report may be freely copied and distributed provided the source is explicitly acknowledged.

Fumbeya Luis Marungo
MSc Advanced Information Systems project report

Page 2 of 119
September, 2010

Table of Contents

Table of Figures .... 5
Abstract .... 6
Chapter 1 – Introduction .... 7
1.1 The Need to Accelerate Performance in Computer-aided Diagnosis .... 7
1.2 Report Structure .... 8
Chapter 2 – Background .... 9
2.1 Overview .... 9
2.2 Computer-aided Diagnosis in Breast Cancer .... 9
2.2.1 CADx Overview .... 9
2.2.2 How Cases are Characterized in CADx .... 10
2.2.3 Az, the Common CADx Classification Accuracy Measure .... 11
2.2.4 Neural Networks in CADx .... 12
2.2.5 The Challenges Facing CADx Neural Networks .... 13
2.3 Current Parallel Computing Approaches .... 14
2.3.1 Why Parallelism Is Important .... 14
2.3.2 CUDA and the GPU .... 15
2.3.3 The Multi-Core CPU .... 15
2.3.4 Other Parallel Options .... 15
2.4 A Comparison of CPU and GPU Threading Models .... 16
2.4.1 A Comparison of CUDA to CPUs Using Flynn's Taxonomy .... 16
2.4.2 The Impact of the Threading Model on CPU and GPU Optimization .... 18
2.5 Relevant Research .... 19
2.5.1 CADx Literature .... 19
2.5.2 CUDA Literature .... 20
2.5.3 Multi-core CPU Literature .... 20
2.6 Libraries, Tools, and Technologies Employed .... 20
Chapter 3 – Analysis and Design .... 22
3.1 Overview .... 22
3.2 Analysis .... 22
3.2.1 Requirements .... 22
3.3 Design .... 23
3.3.1 The Algorithms .... 23


3.3.2 The Object-oriented Design .... 24
3.3.3 Data Structures for SIMD .... 25
3.3.4 The SIMD Sampling Data Structure .... 27
3.3.5 Full Design .... 29
3.3.6 Framework Flexibility .... 29
3.4 Summary .... 30
Chapter 4 – Implementation .... 31
4.1 Overview .... 31
4.2 General Implementation Details .... 31
4.2.1 CUDA Development .... 31
4.2.2 SSE Development .... 32
4.2.3 The Choice of Native C++ .... 32
4.2.4 Factory Classes and Smart Pointers .... 33
4.2.5 Random Numbers and Distributions .... 34
4.3 Neural Network Output Calculation .... 34
4.3.1 Base Design .... 34
4.4 CPU Implementation .... 35
4.5 GPU Implementation .... 37
4.5.1 The Project's Implementation .... 37
4.5.2 Preliminary GPU Optimization Efforts .... 40
4.6 Summary .... 40
Chapter 5 – Testing and Results .... 42
5.1 Overview .... 42
5.1.1 The Tests .... 42
5.1.2 The Datasets .... 43
5.2 The NeuralNetEvaluator Tests .... 43
5.2.1 Functional Test Description and Results .... 43
5.2.2 Performance Test Description and Results .... 43
5.3 The NeuralNetTrainer Tests .... 44
5.4 The GeneticSelector Tests .... 44
5.4.1 Test Description .... 44
5.4.2 Functional Test Results .... 45
5.4.3 Performance Test Results .... 46
5.5 CPU Allocation .... 46


Chapter 6 – Summary, Conclusion, Future Work, and Evaluation .... 48
6.1 Summary .... 48
6.1.1 Background .... 48
6.2 Conclusion .... 48
6.2.1 Overall Conclusion .... 48
6.2.2 Design .... 48
6.2.3 Implementation .... 49
6.2.4 Testing and Results .... 49
6.3 Evaluation .... 50
6.4 Future Work .... 50
Bibliography .... 52
Appendix A – Compute Capability 2.0 .... 58
Appendix B – Testing Details .... 59
Appendix C – Factors in CUDA Performance .... 61
C.1 Thread Hierarchy .... 61
C.2 Memory Layout .... 61
C.3 A Shared Memory Implementation .... 63
Appendix D – Systems Manual .... 64
Appendix E – Source Code .... 66

 


Table of Figures

Figure 2-1 – BI-RADS Categories .... 10
Figure 2-2 – Three ROC Curves with Increasing Area Under the Curve (Az) Values .... 11
Figure 2-3 – A Feed Forward Neural Network .... 12
Figure 2-4 – A Perceptron Network .... 12
Figure 2-5 – Logistic Sigmoid .... 13
Figure 2-6 – The SISD Architecture .... 17
Figure 2-7 – The SIMD, Processor Array Architecture .... 17
Figure 2-8 – The SIMD, Vector Pipeline Architecture .... 18
Figure 2-9 – The MIMD Architecture .... 18
Figure 2-10 – GeForce GTX 260 Device Information .... 21
Figure 3-1 – A Package Based Design Approach .... 25
Figure 3-2 – Improved Design .... 25
Figure 3-3 – The Array of Structures (AoS) Memory Layout .... 27
Figure 3-4 – The Structure of Arrays (SoA) Memory Layout Transposes the AoS Layout .... 27
Figure 3-5 – Structure of Arrays (SoA) with Required Padding .... 27
Figure 3-6 – SoA Memory Layout for Training/Validation .... 28
Figure 3-7 – Data Classes, SoA Implementation .... 28
Figure 3-8 – Overall Project Design .... 29
Figure 4-1 – SIMD Node Multiply Add Assign .... 35
Figure 4-2 – CPU, SSE Implementation (SseGlobal.cpp) .... 36
Figure 4-3 – CUDA Invocation Function (CudaBasic.cu) .... 38
Figure 4-4 – CUDA Kernel .... 39
Figure 4-5 – Device Load, Unload, and Copy Helper Functions (PROJ_MarungoF_Cuda.cu) .... 40
Figure 5-1 – Results .... 42
Figure 5-2 – Windows Task Manager .... 47
Figure C-1 – Memory System of the 8800GTX .... 62
Figure C-2 – A Shared Memory Design .... 63
Figure D-1 – Project's Visual Studio 2008 Solution Explorer Window .... 65


Abstract

The graphics processing unit (GPU) is a high-performance chip that controls the graphics card inside the computer. NVIDIA Corporation introduced the Compute Unified Device Architecture (CUDA) in late 2006. CUDA is designed to allow general-purpose programming targeting the GPU. This project examines techniques and issues in using CUDA to accelerate computational processing in breast cancer related computer-aided diagnosis (CADx). It presents the following:
1. Implementations of GPU-based and CPU-based neural network calculators.
2. A design framework for integrating CUDA into typical CADx neural-network-based algorithms.
3. A sample implementation of the framework elements using a genetic algorithm for feature selection, and an evolutionary computing algorithm for network training.
4. Functional and performance testing implementations for the framework elements.
Despite numerous optimizations in the CPU implementation, the GPU implementation provides roughly an 18x speedup in raw network output calculation. Using the GPU yields slightly less than a 4x speedup in total runtime. The GPU implementation also provides better scaling for future hardware upgrades.

 


Chapter 1 – Introduction

1.1 The Need to Accelerate Performance in Computer-aided Diagnosis

Computer-aided diagnosis (CADx) in breast cancer involves the classification of a previously identified region of interest in a medical image. The term "previously identified" is relevant as it distinguishes CADx from computer-aided detection, also known as CADe (Lo et al. 2006). The most common medical image is a mammogram. Other possible imaging modalities include MRI and ultrasound.
Neural networks are frequently used for region classification in CADx. Creating an optimal neural network for a CADx problem presents several challenges, including:
1. Selecting the optimal features to serve as network inputs (Zheng 2009).
2. Selecting an appropriate network architecture (Land et al. 2006).
3. Evaluating classification accuracy with relatively small data sets (Lo et al. 2006).
4. Training networks optimally (Land et al. 2006).

Methods employed to address the difficulties above include:
1. Genetic algorithms for feature selection and network architecture (Campanini & Lanconelli 2006).
2. Sampling techniques such as k-fold cross-validation and the bootstrap (Kohavi 1995).
3. Evolutionary computing for training a neural network (Porto, Fogel & Fogel 1995).
These methods are computationally expensive. Until recently, the regular release of newer and faster CPUs led to automatic increases in processing speed with each hardware upgrade. In 2005 the engineering challenges of creating processors faster than 3 GHz led manufacturers to shift focus from creating faster chips to fitting more processor units ("cores") into a single chip. The speed of each core is about 3 GHz (Geer 2005).
A direct consequence of the new trend is that computationally expensive techniques will not experience a reduction in runtime on new hardware unless they are implemented in a scalable parallel¹ manner. The following parallelism technologies are available on workstations:
1. Graphics processing units (GPUs) are the chips that control graphics cards in workstations. GPUs are many-core² processors designed for massive computational lockstep parallelism. The Compute Unified Device Architecture (CUDA) is a framework released by NVIDIA in 2006; CUDA provides a means to apply the GPU's processing power to general-purpose³ calculation.
2. Multi-core CPUs⁴ offer concurrency at two levels. Each core is an independent processing unit capable of hosting one or two hardware threads. Each thread, at the register level, is capable of executing a single math operation on a vector of four floating-point numbers simultaneously. Intel calls the CPU instructions that execute a single operation over four numbers Streaming SIMD Extensions (SSE).

¹ In this report, the terms concurrent and parallel are used interchangeably unless otherwise stated. Some describe concurrent systems as MIMD systems and parallel systems as SIMD systems. See Section 2.4.1 for further details on these two terms.
² Many-core, as opposed to multi-core, is a designation of scale. For example, a present-day GPU has more than 200 cores; a present-day CPU may have six cores.
³ Generally the term GPU refers to any graphics processing unit. In this report the term GPU refers to a GPU controlling a CUDA-enabled device unless otherwise stated.
⁴ In this report the term CPU refers to modern multi-core Intel x86 microprocessors unless otherwise stated.


3. Hosted languages, such as Java and C#, allow CPU thread creation. A recent development is Microsoft's Task Parallel Library, available in Visual Studio 2010. The library supports optimized parallel execution on multi-core workstations using C# version 4.0.
This project explores using CUDA for feature selection and network architecture design in CADx neural networks. The project presents a design that integrates CUDA into a genetic algorithm. The algorithm performs feature selection and designs the network architecture. The project's implementation includes an evolutionary neural network trainer and two neural network output calculators: one calculator uses the CPU and the other uses the GPU. The CPU implementation employs a great deal of optimization with low-level assembly calls; these optimizations create a fair comparison between the two technologies. The project includes functional and performance tests for the genetic algorithm, the evolutionary trainer, and the two neural network calculation implementations.

1.2 Report Structure

Chapter 2 through Chapter 5 each start with an overview that briefly describes the topics the chapter covers. Chapter 3 and Chapter 4, the chapters that cover the technical work, close with summaries of the salient points.
Chapter 2 is a substantive chapter; its subject matter spans the various topics this project touches upon. The chapter opens with an overview of breast cancer CADx in general and CADx neural networks in particular. An overview of parallelism follows, covering parallelism's importance in contemporary computing, its role in modern CPUs and GPUs, and the other parallelism options that exist. A more in-depth look at the CPU and GPU comes next, with an examination of both processors' threading models and optimization methods. The chapter closes with a look at previous work in CADx neural networks and relevant hardware-based acceleration.
Chapter 3 opens with the system requirements, which are driven by the domain needs of CADx. The design section starts with a presentation of the algorithms implemented. Next there is a comparison of two possible object-oriented approaches and a description of the approach adopted in this project. An explanation of the data layout needs of both GPU and CPU programming is the last topic.
Chapter 4 presents the implementation of the ideas presented in Chapter 3. It opens with general features of GPU and CPU development, an explanation of why the project required a native C/C++ implementation, and a description of some design decisions that are specific to a C/C++ implementation. The chapter continues with an explanation of the core calculation of the program, that is, the calculation of a neural network's output. Finally there is an explanation of the specifics of the CPU and GPU calculation.
Chapter 5 describes the project's functional and performance testing. There is a general description of the project's test methodology and the datasets used. Each project component's tests and results are then handled in turn: an explanation of the purpose of each test, a description of the functional and performance (if applicable) test procedures, and the test results.
Chapter 6 opens with a review of the purpose of the project. It continues with an overall discussion of CUDA's potential in CADx, as well as conclusions drawn from different phases of the project. Next there is an evaluation covering my personal views of the project. Finally, the report concludes with a look at future work.


Chapter 2 – Background

2.1 Overview

This chapter presents background on the major topics that impact the project. It begins with the role of breast cancer computer-aided diagnosis (CADx) in general. The real-world goal of diagnosis is to classify a lesion as malignant or benign. This chapter provides a description of CADx's role in this process, a summary of lesion categorization, and an explanation of the accuracy metric for CADx systems.
The chapter continues with a description of neural networks in CADx; the description covers both the role of neural networks and the challenges faced when trying to employ neural networks in CADx.
After the initial coverage of CADx the chapter shifts focus to parallelism. The Compute Unified Device Architecture is a design for using the parallel computing ability of the graphics card to solve problems unrelated to screen rendering. These sections of the chapter cover why parallelism is important, what the options for implementing parallelism are, and the nature of the CPU-based and GPU-based approaches.
Throughout the chapter there are citations of relevant literature. The penultimate section provides an overview of some of the relevant research; however, if the reader's particular interest is research in the field, then it is best to read Chapter 2 in its entirety.
The chapter concludes with a list of the libraries, tools, and technologies used for the project's implementation.

2.2 Computer-aided Diagnosis in Breast Cancer

2.2.1     CADx  Overview   There  are  many  different  opinions  on  what  constitutes  computer-­‐aided  diagnosis  (CADx)  as  opposed   to  computer-­‐aided  detection  (often  referred  to  as  CADe).    Many  no  longer  recognize  the  distinction,   and  refer  to  both  processes  as  computer-­‐aided  diagnosis  (Lo  et  al.  2006).    For  the  purpose  of  this   report,  computer-­‐aided  diagnosis  is  the  automated  classification  of  an  identified  region  of  interest,   using  previously  extracted  features.    Thus  CADx  does  not  entail  image  processing  techniques  such  as   segmentation  or  feature  calculation.    The  regions  of  interest  and  the  features  may  come  from  any   combination  of  automated  systems,  human  experts,  patient  medical  history,  etc.   CADx  systems  have  a  wide  range  of  applications.    One  use  of  CADx  systems  is  to  serve  as   components  in  CADe  systems.    After  a  CADe  system  detects  the  region  of  interest  its  CADx   subsystem  determines  the  region’s  likelihood  of  malignancy.    The  CADe  system  then  returns  the   regions  with  a  likelihood  that  crosses  a  preset  threshold.    A  recent  large-­‐scale  study  of  31,000   women  in  the  UK  demonstrated  that  the  combination  of  a  single  reader  and  a  detection  system  with   a  CADx  component  acting  as  a  “second  reader”  yielded  similar  mammogram  screening  accuracy  to   two  human  readers.    The  former  detected  198  out  of  227  possible  cases  and  the  latter  detected  199   of  the  227  possibilities  (Gilbert  et  al.  2008).     Another  application  of  CADx  systems  is  to  help  determine  if  a  biopsy  is  necessary.    Mammograms   detect  80%-­‐90%  of  all  symptom  free  breast  cancers  in  women.    
The downside is that mammograms also generate many false positives. 5%-10% of all women have their mammograms interpreted as "abnormal" or "inconclusive." Most of the abnormal and inconclusive interpretations resolve to benign cases after a biopsy and/or further imaging studies (American Cancer Society 2009).

75% of the 700,000 biopsies performed each year in the United States yield benign results. Unnecessary biopsies are an emotional and physical burden on patients and a large addition to the true cost of mammograms (Bilhanan 2004, p. 13). Jiang, et al. (1999) report that radiologists using CADx assistance can increase both sensitivity and selectivity in recommending biopsies.

2.2.2 How Cases are Characterized in CADx

Breast Imaging Reporting and Database System (BI-RADS®) Assessment Categories and Follow-up Recommendations

a. Assessment is Incomplete

Category 0 – Need Additional Imaging Evaluation and/or Prior Mammograms for Comparison. Follow-up: additional imaging and/or prior images are needed before a final assessment can be assigned.

b. Assessment is Complete – Final Categories

Category 1 – Negative. Follow-up: routine annual screening mammography (for women over age 40).

Category 2 – Benign Finding(s). Follow-up: routine annual screening mammography (for women over age 40).

Category 3 – Probably Benign Finding – Initial Short-Interval Follow-Up Suggested. Follow-up: initial short-term (usually 6-month) follow-up examination.

Category 4 – Suspicious Abnormality – Biopsy Should Be Considered. Follow-up: usually requires biopsy. Optional subdivisions:* 4A: finding needing intervention with a low suspicion for malignancy; 4B: lesions with an intermediate suspicion of malignancy; 4C: findings of moderate concern, but not classic for malignancy.

Category 5 – Highly Suggestive of Malignancy – Appropriate Action Should Be Taken. Follow-up: requires biopsy or surgical treatment.

Category 6 – Known Biopsy-Proven Malignancy – Appropriate Action Should Be Taken. Follow-up: category reserved for lesions identified on imaging study with biopsy proof of malignancy prior to definitive therapy.

* A subdivision may be used in addition to the Category 4 final assessment; MQSA does not allow a subdivision to replace a Category 4 final assessment. Use of a subdivision is at the discretion of the facility; it is not required by the FDA.

Copied from (American College of Radiology 2009).
http://www.acr.org/SecondaryMainMenuCategories/quality_safety/BIRADSAtlas/BIRADSFAQs.aspx

Figure 2-1 – BI-RADS Categories

 

The American College of Radiology's Breast Imaging Reporting and Database System (BI-RADS®) provides the standard categorization for lesions (D'Orsi, Bassett & Berg 2003). Figure 2-1 describes the BI-RADS categories.

A binary classifier that recommends whether or not to proceed with a biopsy needs to be able to accurately determine the nature of Category 4 lesions. Barring outside information, such as the enlargement of a lesion, Category 3 lesions are unlikely to be malignant and do not require a biopsy (Sickles 1991) (Sickles 1999). Category 5 lesions clearly require a biopsy.

2.2.3 Az, the Common CADx Classification Accuracy Measure

Az, the area under the Receiver Operating Characteristic (ROC) curve, is the standard measure of classification accuracy in CADx (Lo et al. 2006). The ROC curve captures the tradeoff between detecting positive cases and misdiagnosing negative cases.

The ROC curve assumes that the classifier's decision threshold can vary. Point (0, 0) represents the setting where the classifier rejects everything; point (1, 1) represents the setting where the classifier accepts everything.

Figure 2-2 below presents three ROC curves with increasing values of Az. ROC 1 has the smallest Az. Point (.6, .2) lies on ROC 1, meaning that ROC 1 depicts a classifier that misdiagnoses 60% of the negative cases when it properly diagnoses only 20% of the positive cases. ROC 1's classifier is clearly not desirable. ROC 2 reverses these numbers: when ROC 2's classifier properly detects 60% of the positive cases it misdiagnoses 20% of the negative cases. ROC 3 has the highest Az; when its classifier properly detects 80% of the positive cases it misdiagnoses 20% of the negative cases.

Figure  2-­‐2  –  Three  ROC  Curves  with  Increasing  Area  Under  the  Curve  (Az)    Values  
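The area under an ROC curve can be estimated numerically from sampled operating points. The following sketch is illustrative only (it is not the project's implementation, and the curve through point (.2, .6), like ROC 2 above, is made up); it applies the trapezoidal rule to (false-positive fraction, true-positive fraction) pairs:

```python
def trapezoidal_az(points):
    """Estimate Az, the area under an ROC curve, from (FPF, TPF) points.

    points: (false-positive fraction, true-positive fraction) pairs;
    (0, 0) and (1, 1) are the endpoints of every ROC curve.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
    return area

# A hypothetical classifier whose curve passes through (.2, .6), like ROC 2.
roc2 = [(0.0, 0.0), (0.2, 0.6), (1.0, 1.0)]
print(trapezoidal_az(roc2))  # ≈ 0.7
```

A chance-level classifier, whose ROC curve is the diagonal from (0, 0) to (1, 1), yields Az = 0.5 under this estimate, which is why Az values near 0.5 indicate no discriminating power.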

 

 


2.2.4 Neural Networks in CADx

Neural networks are commonly used as CADx classifiers. By far the most frequently used CADx neural network structure is the single feed-forward network with one hidden layer and a single output node, such as the network in Figure 2-3 below (Zheng 2009).

Neural networks model brain function. Each network node represents a neuron. The hallmark of a neural network is that each node accepts a single input value and emits a single output value. A node's input value is the weighted sum of the outputs of the nodes connected to it. The node's output value is the result of applying an activation function to the input value.

Figure 2-4 depicts the basic neural network calculation. The input value is x = x1w1 + x2w2 + ... + x7w7; f is the activation function, and the node's output is y = f(x). Typically networks use the simple logistic sigmoid function in Figure 2-5 on the following page as the activation function. The node feeds its output value forward to the nodes it is connected to on the next layer.

The network in Figure 2-4 is a perceptron: a network with no hidden layer. All of the inputs feed directly into one output node⁵. The node takes the weighted sum of the inputs and applies an activation function; the result is the perceptron's output. Figure 2-3 depicts a network with a hidden layer. The inputs do not feed directly into the output node; they feed into a layer of hidden nodes. The outputs from the hidden layer serve as the inputs to the output node.

The addition of a hidden layer has a significant influence on a network. A perceptron can only perform linear separation; a network with a hidden layer can model more complex nonlinear relationships. However, networks with hidden layers are opaque, because the hidden layer masks the relationship between the inputs fed into the network and the final output calculated.

Networks with one output node can serve as binary classifiers. Set a cutoff value k ∈ (0, 1) and let x be the output from the neural network. x < k yields a "benign" classification, and x ≥ k yields a "malignant" classification. If the output node has a logistic sigmoid activation function, normally k = 0.5.

In the context of ROC, point (0, 0) represents k = 1, point (1, 1) represents k = 0, and the ROC curve represents the tradeoff as k varies. There is no fixed k value in ROC analysis of neural networks.
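The calculation just described, a weighted sum passed through the logistic sigmoid followed by the cutoff rule, can be sketched as follows. The weights, inputs, and two-feature case are illustrative only, not values from the project:

```python
import math

def logistic_sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)); maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by the activation function."""
    x = sum(i * w for i, w in zip(inputs, weights)) + bias
    return logistic_sigmoid(x)

def classify(output, k=0.5):
    """Binary classification with cutoff k: output < k yields benign."""
    return "benign" if output < k else "malignant"

# Illustrative two-feature case with made-up weights.
y = perceptron_output([0.8, 0.3], [1.5, -2.0])
print(classify(y))
```

Varying k from 1 down to 0 and recording the resulting true-positive and false-positive fractions is exactly how an ROC curve would be traced for such a classifier.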

Figure 2-3 – A Feed Forward Neural Network (copied from Wikipedia, Neural Network)

Figure 2-4 – A Perceptron Network (copied from Wikipedia, Perceptron)

⁵ This project only concerns using neural networks as binary classifiers. Therefore, the report only examines networks with a single node in the output layer.

 


 

 

f(x) = 1 / (1 + e^(−x)),   x ∈ ℝ → f(x) ∈ (0, 1)

Figure 2-5 – Logistic Sigmoid

 

2.2.5 The Challenges Facing CADx Neural Networks

Failure to Outperform Linear Classifiers

In practice, medical neural networks with hidden layers frequently do not provide classification accuracy superior to linear regression (Sargent 2001). Depending on the particular biological process under investigation, it is also possible that the best type of classifier is a linear separator (Schwarzer, Vach & Schumacher 2000).

Even in nonlinear problems where a neural network may provide more accurate results than a linear separator, selecting an optimal combination of neural network topology and input features is not straightforward. By trying many network configurations, a genetic algorithm can be an effective method for neural network design and feature selection (Campanini & Lanconelli 2006). The genetic algorithm can compare the accuracy of linear separation to nonlinear separation by permitting perceptron networks. When perceptrons are allowed by a genetic algorithm they often dominate the top-performing networks (Land et al. 2006). This finding is in line with Sargent (2001) and Schwarzer, et al. (2000).

In the case of CADx, having a genetic algorithm for determining optimal neural network design is important. The availability of one feature can significantly alter the problem domain. For example, breast cancer history is a feature that frequently appears in superior networks (Land et al. 2006). If this feature is missing, it is quite possible that the classification problem changes from linear to nonlinear separation; that is, the ideal network topology may change from a perceptron to a neural network with a hidden layer. A genetic algorithm may detect this change.
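The kind of search described above can be sketched in miniature. In this sketch a chromosome encodes a feature mask together with a hidden-node count (zero hidden nodes denoting a perceptron), and the fitness function is a hypothetical stand-in; a real system would instead train the encoded network and use its measured Az:

```python
import random

N_FEATURES = 6

def fitness(chrom):
    """Hypothetical stand-in fitness: in a real CADx system this would
    be the Az of the network the chromosome encodes. Here we pretend
    features 0 and 3 matter and that fewer hidden nodes are better."""
    mask, hidden = chrom
    return 0.5 + 0.2 * mask[0] + 0.2 * mask[3] - 0.02 * sum(mask) - 0.03 * hidden

def mutate(chrom, rng):
    """Flip feature bits and occasionally change the hidden-node count."""
    mask, hidden = chrom
    mask = [b ^ 1 if rng.random() < 0.2 else b for b in mask]
    if rng.random() < 0.3:
        hidden = rng.randint(0, 5)
    return (mask, hidden)

def evolve(generations=30, pop_size=20, seed=1):
    rng = random.Random(seed)
    pop = [([rng.randint(0, 1) for _ in range(N_FEATURES)], rng.randint(0, 5))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]   # truncation selection, keeps the elite
        pop = parents + [mutate(rng.choice(parents), rng) for _ in parents]
    return max(pop, key=fitness)

best_mask, best_hidden = evolve()
```

Because perceptrons (hidden = 0) are admissible chromosomes, the search can settle on a linear separator whenever one scores as well as networks with hidden layers, mirroring the Land et al. (2006) observation.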
The Lack of a Large Mammogram Database

There are two distinct types of regions of interest in breast cancer CADx: masses and calcifications. Masses are the well-known "lumps" commonly associated with breast cancer. Calcifications are clusters of small calcium deposits. The two types have different feature sets and are typically evaluated by different CADx systems.

An ideal database would contain 100,000 BI-RADS Category 4 cases for each region type (Sutton 2009). For CADx classifiers, the problem domain frequently consists only of Category 4 cases (see Section 2.2.2). At present, the largest publicly available database is the University of South Florida's Digital Database for Screening Mammography (DDSM) (Heath et al. 2001) (Heath et al. 1998). The DDSM has 2,640 cases. More than a quarter of the cases are normal; the cases are further subdivided between calcifications and masses; few of the remaining cases are BI-RADS Category 4.

Sampling techniques such as cross-validation or the bootstrap are frequently employed in CADx to address the lack of a large database. Bootstrap sampling involves creating multiple datasets by sampling the original dataset with replacement. The bootstrap provides both an assessment of the accuracy of a classifier and an assessment of its variability under different training sets (Efron & Tibshirani 1998) (Marungo 2010). A regular bootstrap methodology is inappropriate with algorithms that have a memory component, such as neural networks (Kohavi 1995). The leave-one-out bootstrap uses the sampled data items for training, and uses the data items that do not appear in the sample for validation; for a sample of size n drawn with replacement from n items, the expected out-of-sample fraction is (1 − 1/n)ⁿ ≈ 1/e ≈ 36.8% (Jiang & Simon 2007).

Drawbacks of Back Propagation Training

Neural networks require training to determine the appropriate values for the network weights. The typical neural network training method is back propagation. Back propagation uses the method of steepest descent to adjust the network's weights and biases.
The method starts with an initial set of weights and continuously modifies the set to incrementally reduce the net classification error. The method only requires that all of the activation functions are differentiable.

Back propagation often creates locally optimal, but globally suboptimal, neural network weights and biases. This can lead to an artificially low evaluation of a neural network's accuracy. Evolutionary computing can reduce the likelihood of converging to a local minimum of the error function by evaluating a broad distribution of weight combinations.
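The mechanics of steepest descent can be illustrated on a single sigmoid node. This is a sketch only: the squared-error loss, the learning rate, and the toy logical-OR training set are illustrative assumptions, not the project's networks or data:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_node(samples, epochs=2000, lr=0.5):
    """Steepest descent on squared error for one sigmoid node.

    samples: list of (inputs, target) pairs, target in {0, 1}.
    Returns the learned weights; the last entry is the bias.
    """
    n = len(samples[0][0])
    w = [0.0] * (n + 1)  # initial weights plus bias
    for _ in range(epochs):
        for inputs, target in samples:
            x = sum(i * wi for i, wi in zip(inputs, w)) + w[-1]
            y = sigmoid(x)
            # dE/dx for E = (y - t)^2 / 2 with a sigmoid activation
            delta = (y - target) * y * (1.0 - y)
            for j, i in enumerate(inputs):
                w[j] -= lr * delta * i   # step against the gradient
            w[-1] -= lr * delta          # bias update
    return w

# Learn logical OR, a linearly separable toy problem.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = train_node(data)
```

The update rule only uses the derivative of the error with respect to each weight, which is why differentiable activation functions are the method's sole requirement; the loop also shows how the trajectory depends entirely on the initial weights, the root of the local-minimum problem noted above.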

2.3 Current Parallel Computing Approaches

2.3.1 Why Parallelism Is Important

During the 1990s, chip clock speeds increased 60% per year; increases dipped to 40% from 2000 until 2004. By 2004, doubling a single-core CPU's die area led to only a 20% speed increase. That year, problems with power consumption and heat generation led Intel to cancel three planned single-core processors. The top end of the planned processors was to be a 4 GHz dual-threaded single-core processor. During the same year Apple also had to delay the release of the iMac G5 due to CPU manufacturing problems at IBM (Geer 2005) (Sutter 2005).

Since 2005, clock speeds have remained relatively constant as chip makers used the gains from Moore's Law to build more cores onto a single microprocessor. Using lower clock speeds and adding multiple processing cores significantly reduces power consumption. For example, an Intel Pentium 4 Extreme Edition with a 3.8 GHz clock speed uses up to 130 W; an Intel Core 2 Duo running at 2.93 GHz uses 40-80 W (Giles 2009).

Six years after the 2004 cancellations, clock speeds are relatively unchanged. Intel's top-of-the-line i7 chip's top clock speed is 3.33 GHz; however, the chip has six dual-threaded cores (Wikipedia 2010b).

The multi-core trend has significant implications for software design. With constant clock speeds, sequential programs do not obtain performance gains with each hardware upgrade. Performance gains must come from scalable concurrent design (Kirk & Hwu 2008).


2.3.2 CUDA and the GPU

A Graphics Processing Unit (GPU) is a microprocessor that controls a graphics card. GPUs are designed to render the real-time, high-definition 3D graphics required by gamers. The modern GPU is a massively parallel many-core processor optimized for floating-point operations (NVIDIA Corporation 2010b, p. 1). Graphics rendering requires performing the same operation over the individual pixels of a screen region in parallel. Because each pixel can be rendered in lockstep with the others, a single parallel execution needs no synchronization. GPUs are therefore designed to handle large-scale, highly parallel floating-point operations.

Historically, the drawback with attempting to apply the GPU's performance advantage to general-purpose programming is that it requires translating the problem into a graphics metaphor. For example, the Steinkraus, et al. (2005) GPU neural network implementation translated the problem into a matrix multiplication operation using texture mapping.

In November 2006, NVIDIA Corporation introduced the Compute Unified Device Architecture (CUDA). CUDA removes the translation requirement: it exposes an extended C API that allows for direct execution of general-purpose programs on the GPU (NVIDIA Corporation 2010b, pp. 4-5). CUDA is not simply a software abstraction; it specifies hardware requirements. CUDA-enabled devices can operate in a normal graphics-processing mode or in a separate general-purpose compute mode (NVIDIA Corporation 2008).

2.3.3 The Multi-Core CPU

A modern CPU is a multi-core processor. In essence, each core is an independent single-core processor. The CPU's primary responsibility is not to perform calculations; the CPU manages the entire computer. For example, one core may be managing network I/O, another core may be reading a file, and another core may be calculating the values in a spreadsheet. In that case each activity is entirely independent of the others, and only one activity involves calculation.

2.3.4 Other Parallel Options

FPGAs

Reprogrammable integrated circuits called Field Programmable Gate Arrays (FPGAs) offer another hardware solution for implementing high-performance parallelism. FPGAs have the advantage of having the program physically burned into the chip, which provides an initial performance advantage over GPUs. However, as the amount of data grows beyond the FPGA's internal capacity, memory latency can reduce the FPGA's advantage. Che, et al. (2008) report that when performing calculations using small matrix sizes an FPGA outperforms a GPU by ~6x on Gaussian Elimination and ~50x on Needleman-Wunsch. As the input matrix size increases, the performance advantage falls to 3x in the former case, and virtually nothing in the latter case.

There are implementations of nature-inspired algorithms on FPGAs (Graham & Nelson 1996) (Chai et al. 2009); however, FPGAs offer far less programmer productivity than CPU programming (Benkrid 2008). Che, et al. (2008) compare FPGA programming productivity to GPU programming productivity using lines of code as a metric. They report that the study's FPGA implementations of Gaussian Elimination, DES, and Needleman-Wunsch contain 450, 1400, and 650 lines of code respectively, while the respective CUDA implementations contain 160, 1100, and 320 lines of code.


With typical FPGA development cycles lasting eight months (Feist 2009), GPU and CPU programming offer substantial benefits in implementation speed and flexibility of functionality over custom hardware solutions, even where hardware reprogramming is possible.

Distributed Computing

A cluster of workstations can also be used for parallel processing. A distributed implementation of a genetic algorithm can have a central process that assigns members of the cluster the task of calculating the fitness values of a set of chromosomes. Using load balancing, such a system can operate on the cluster workstations in the background, leaving the users undisturbed (Bevilacqua, Campanini & Lanconelli 2001).

Distributed computing uses coarse-grained parallelism; therefore it can work in conjunction with the fine-grained parallelism of CPUs and GPUs.

New Parallelism Support Introduced in C# 4.0

Visual C# 2010 (C# v. 4.0) contains a new library specifically designed to support parallelism, the Task Parallel Library (TPL). This library provides a method to manage parallel calls from the managed .NET environment.
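The master/worker pattern behind such a distributed genetic algorithm can be sketched conceptually. In this sketch local threads stand in for cluster workstations, and the fitness function is a hypothetical placeholder for training and scoring the network a chromosome encodes:

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(chromosome):
    """Placeholder: a real CADx system would train and evaluate the
    neural network the chromosome encodes and return its Az."""
    return sum(chromosome) / len(chromosome)

population = [[0, 1, 1, 0], [1, 1, 1, 0], [0, 0, 1, 0]]

# The central process farms out one fitness evaluation per worker;
# in a real cluster each task would go to a remote workstation.
with ThreadPoolExecutor(max_workers=3) as pool:
    scores = list(pool.map(fitness, population))

best = population[scores.index(max(scores))]
```

Because each chromosome's fitness is independent of the others, the evaluations need no coordination beyond collecting the results, which is what makes this coarse-grained parallelism complementary to the fine-grained parallelism inside each worker.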

2.4 A Comparison of CPU and GPU Threading Models⁶

2.4.1 A Comparison of CUDA to CPUs Using Flynn's Taxonomy

Flynn's taxonomy (Flynn 1972) is a widely used parallelism classification scheme. There are four classifications:

1. Single instruction stream, single data stream (SISD).
2. Single instruction stream, multiple data streams (SIMD).
3. Multiple instruction streams, single data stream (MISD).
4. Multiple instruction streams, multiple data streams (MIMD).

SISD systems are serial computers with no concurrency support. There is one processor that performs one instruction at a time over a single unit of data (see Figure 2-6 below). MISD systems are rare and not relevant to this discussion.

In SIMD systems a single instruction executes across multiple data units. The operation is intrinsically synchronous because the single instruction executes simultaneously over multiple independent data streams. SIMD can be subdivided into two types: processor arrays and vector pipelines (Duncan 1990).

Figure 2-7 below depicts a SIMD processor array. Each processor simultaneously executes the same instruction. The value of A(1) is broadcast over all the processors. The processor index number specifies the non-broadcast values.

⁶ A CUDA-enabled device's Compute Capability defines the features and technical specifications it offers. This project employs a GTX 260, which has Compute Capability 1.3 (NVIDIA Corporation 2010b, p. 95). This report is based on Compute Capability 1.3. The latest devices, released by NVIDIA on 29 March, 2010, have Compute Capability 2.0 (Rizzo 2010). There is a brief overview of the improvements Compute Capability 2.0 offers in Appendix A.


Figure 2-8 below depicts a SIMD vector pipeline. One instruction executes over a fixed data size (in this case four values).

Figure 2-9 below depicts a MIMD system. In MIMD there is no relationship between the activities of the different processors. The MIMD system is essentially a collection of independent SISD processors.

To apply Flynn's taxonomy to CPU and GPU threading models, replace the term "processor" with the term "thread." GPUs use a SIMD processor-array threading model. NVIDIA refers to this model as Single Instruction Multiple Thread (SIMT). The term "kernel" refers to the common function that all of the SIMT threads execute.

CPUs use an MIMD threading model. In addition, each CPU thread can execute a SIMD vector pipeline instruction over four values. That is, in Figure 2-9 the following operations can occur simultaneously:

1. P1 – load the four numbers at B(1).
2. P2 – multiply two sets of four numbers (in this case x, y, and z are each vectors with four components).
3. Pn – multiply one set of four numbers by the four numbers stored at memory address 3.

Streaming SIMD Extensions (SSE) is the name of Intel's SIMD vector pipeline instruction set. Intel introduced SSE in 1999 (Wikipedia 2010d).

Processor-array-based SIMT varies from vector-pipeline-based SSE in two critical ways:

1. SSE can only operate over a fixed data length. SIMT parallelism is scalable over n threads. A thread index determines which data to read, write, or operate on (NVIDIA Corporation 2010b, pp. 77-78).
2. SSE is a single instruction executed on an independent MIMD thread. SIMT threads are constrained to executing the same set of instructions over a group of threads in lockstep.
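The first difference can be illustrated with a conceptual simulation (plain sequential code standing in for GPU threads, an assumption for illustration, not CUDA itself): every SIMT thread runs the same kernel and uses its thread index to select its data, so the same kernel scales to any n.

```python
def saxpy_kernel(tid, a, x, y, out):
    """One simulated SIMT 'thread': the thread index tid selects the data."""
    out[tid] = a * x[tid] + y[tid]

def launch(kernel, n, *args):
    # Simulation only: a real GPU runs all n threads in parallel, in
    # lockstep groups; here they simply run one after another.
    for tid in range(n):
        kernel(tid, *args)

n = 8  # scalable to any n, unlike a fixed four-wide SSE vector operation
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
launch(saxpy_kernel, n, 2.0, x, y, out)
```

Changing n re-dimensions the whole computation without touching the kernel; an SSE version would instead process the data four values at a time inside one MIMD thread.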

Figure 2-6 – The SISD Architecture.

Figure 2-7 – The SIMD, Processor Array Architecture.

Figure 2-8 – The SIMD, Vector Pipeline Architecture.

Figure 2-9 – The MIMD Architecture.

 

All four previous diagrams copied from (Barney 2010).

2.4.2 The Impact of the Threading Model on CPU and GPU Optimization

A GPU dedicates 80% of its chip area to processing. The key to high-performance GPU execution is to keep the GPU's raw computational capacity constantly busy.

CUDA's execution model packages SIMT threads into groups of 32; NVIDIA refers to these groups as warps. All threads in a warp execute simultaneously in lockstep SIMD fashion⁷. Keeping the processor busy is the job of the thread scheduler. For example, the thread scheduler can switch processing from a warp that is stalled waiting for a read from memory to a warp that is ready to perform calculations. Because there is no caching or branch prediction, context switching is a zero-cost operation. The principle is to create as many threads as possible; the scheduler then has the flexibility to keep swapping warps for execution as needed. Therefore, the more SIMT threads there are, the better CUDA performs.

⁷ For the sake of readability this report makes no clear distinction between the logical and physical models in technical descriptions. For example, the actual physical CUDA lockstep execution uses alternating half-warps. From a logical point of view this fact is not relevant, as the second execution immediately follows the first and performs the same instruction. In general this report will not drill into such details; please see the cited material for further information on the nuts and bolts of CUDA and CPU execution.


Only about 20% of a CPU's available chip area is devoted to computation, with over half devoted to cache memory. The key to high-performance CPU execution is efficient serial execution. A high cache hit rate and an efficient reordering of work using branch prediction underlie CPU optimization (NVIDIA Corporation 2008). MIMD threads are heavyweight threads; each one requires a large cache and instruction pipeline to optimize for serial execution. When the number of MIMD threads is greater than the number of available hardware threads⁸ the CPU must perform context switching, and performance therefore declines. The Marowka (2009) study bears this out.

In summary, GPUs consist of raw calculators. The more data that are streaming into the GPU, the busier the calculators are, and the higher the GPU's performance. CPUs consist of task processors. A CPU thread can remember what it did (caching) and plan for what it must do next (branch prediction). The more a CPU thread's current action is related to the past, or the future, the faster the execution. In addition, CPUs have an SSE instruction set that allows each MIMD thread to perform a simultaneous SIMD instruction over a vector of four numbers. SSE instructions are the major tool for accelerating CPU calculation.

2.5 Relevant Research

2.5.1 CADx Literature

Fogel, Wasson III & Boughton (1995) and Land et al. (2006) both report that breast cancer CADx neural network classifiers with few hidden nodes tend to be more accurate than neural networks with many hidden nodes. The Fogel, et al. study compares networks with nine input nodes, nine hidden nodes, and one output node (9-9-1 networks) to networks with only two hidden nodes (9-2-1 networks). The networks' average mean squared errors were .13 and .11, respectively.

Land et al. (2006) present two examples of evolutionary computing in neural networks. One example uses evolutionary computing for network training exclusively; the other uses the technique for both training and architecture. In the former case they manually vary the number of hidden nodes between two and five. They find that networks with two hidden nodes are generally more accurate than networks with more hidden nodes. However, they note that on a particularly difficult sample a five-hidden-node network outperforms a two-node network; in this case accuracy declines with six- or seven-node networks.

The evolutionary computing example that includes both training and node architecture consistently has perceptrons as the top-performing networks. Both Fogel et al. (1995) and Land et al. (2006) demonstrate that basic feed-forward neural networks with few or no hidden nodes are appropriate in CADx.

Porto, Fogel & Fogel (1995) report that neural network training using evolutionary computing with a population size of 50 and an iteration count of 100 created significantly more accurate networks than back propagation training. The best performing network in the Land et al. (2006) training-only example has a population size of 200 and an iteration count of 600.

Campanini & Lanconelli (2006) provide an overview of many CADx-related genetic algorithms. They state that Az is a frequent fitness measure. They also provide examples of modified ROC analysis depending on targeted specificity and sensitivity characteristics.

⁸ In a CPU there are one or two hardware threads per core.

  Using  a  CUDA-­‐enabled  Graphics  Card  to  Accelerate  Neural  Network  Design  for  Breast  Cancer  Computer-­‐aided  Diagnosis  

Fumbeya  Luis  Marungo   MSc  Advanced  Information  Systems  project  report      

   

Page  20  of  119   September,  2010  

2.5.2  CUDA Literature
Graham & Nelson (1996) performed pre-GPU research in applying FPGAs to genetic algorithms. They created an FPGA implementation of the selection operator. They then compared the performance of the implementation with the FPGA component to an implementation that only used a contemporary high-end CPU. The FPGA implementation led to a 38x speedup in executing the selection portion of the algorithm. The total FPGA speedup for processing the entire algorithm was 4x. The early study demonstrated the potential for using parallel hardware to accelerate genetic algorithm processing.
Pre-CUDA work in using GPUs for back propagation training of neural networks revealed a 3x speedup (Steinkraus, Buck & Simard 2005). This implementation employs a graphics metaphor, using texture mapping to perform matrix inner products. One of the key advantages of CUDA is that it removes the requirement to employ a graphics metaphor. CUDA specifies a separate compute mode that is part of the software and hardware of the architecture (Lindholm et al. 2008). The performance gap between GPUs and CPUs has widened substantially since Steinkraus et al.'s work (NVIDIA Corporation 2010b, p. 2).
Implementing a neural network with a combination of CUDA and OpenMP, a C/C++ and Fortran parallelism API for the CPU, creates another level of performance gains (Jang, Park & Jung 2008). This implementation uses a multi-core CPU for feature extraction from an image, and CUDA on the GPU to concurrently process the neural network. The CPU/GPU blend creates a 15x performance gain over a CPU-only implementation, and a 4x gain over a GPU-only implementation.
The study reports that using the CPU for feature extraction removes the overhead of transferring large amounts of raw image data from the host to the device.
The literature on CUDA, and on earlier hardware-driven implementations of genetic algorithms and neural networks, demonstrates speedups of between 2x and 15x.

2.5.3  Multi-core CPU Literature
Most high performance C++ programming techniques are now well known. Synchronization and scheduling can be very costly when using CPU multithreading. Because CPU threads and cores operate in an MIMD fashion, it is critical to avoid communication between threads; the overhead can consume 70% of total runtime on multi-core Intel CPU systems. Additionally, there is a decline in performance if the application software thread count exceeds the number of threads available in hardware. This is the opposite of the CUDA case: thread scheduling is an overhead on the CPU, whereas on the GPU it is an optimization method (Marowka 2009).
For high performance SIMD execution there is an SSE implementation of the logistic sigmoid activation function. The SSE implementation displays up to a 38x speedup over conventional approaches (Milner & Grandison 2008). This project uses this approach for calculating neural network output on the CPU.
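The four-wide shape of such an SSE activation kernel can be sketched as follows. For simplicity this sketch substitutes the rational approximation f(x) = 0.5·x/(1 + |x|) + 0.5 for the true logistic sigmoid so that every step maps directly to an SSE instruction; it is illustrative only and is not the Milner & Grandison implementation.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Evaluate a sigmoid-shaped activation over four floats at once.
// f(x) = 0.5 * x / (1 + |x|) + 0.5 stands in for 1 / (1 + e^(-x));
// a production kernel would vectorize the exponential as well.
void fast_sigmoid4(const float* in, float* out) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  // only the sign bit set
    const __m128 ones      = _mm_set1_ps(1.0f);
    const __m128 halves    = _mm_set1_ps(0.5f);

    __m128 x     = _mm_loadu_ps(in);              // load four inputs
    __m128 absx  = _mm_andnot_ps(sign_mask, x);   // |x| by clearing sign bits
    __m128 ratio = _mm_div_ps(x, _mm_add_ps(ones, absx));
    __m128 y     = _mm_add_ps(_mm_mul_ps(halves, ratio), halves);
    _mm_storeu_ps(out, y);                        // write four outputs
}
```

Like the logistic sigmoid, this maps all inputs into (0, 1) with f(0) = 0.5, but it computes four activations per call.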

2.6  Libraries, Tools, and Technologies Employed
The project used the following libraries, tools, and development environment:
1. The NVIDIA CUDA Toolkit 3.0 (NVIDIA 2010).
2. The XFX GeForce GTX 260 216 Core Graphics Card. Figure 2-10 on the next page displays the card's performance capabilities.
3. A Dell XPS 400 with a Pentium D 2.79 GHz CPU and 3GB of RAM.
4. Visual C++ 2008 Professional Edition.


5. The Boost C++ Libraries v1.43.0 (Boost Project 2010). Boost is an open source project that provides a broad range of useful C++ libraries. The project utilizes the following libraries:
   a. Random.
   b. Smart_Ptr.
The selection above represents the versions and hardware available at the time project development began in early April 2010. The latest Windows-based C++ compiler NVIDIA supported for CUDA at that time was Visual C++ 2008. The GeForce GTX 260 is the least expensive card in NVIDIA's high-end GTX 200 GPU line. The XPS 400 is a personal workstation; its clock speed is consistent with current top CPU clock speeds.

  Figure  2-­‐10  –  GeForce  GTX  260  Device  Information  

   


Chapter 3 – Analysis and Design

3.1  Overview

The project's analysis and design bridges the divide between the project's goals and the subsequent implementation. The requirements drive the design. The design forms the program's framework; items in the design figures map directly to components in the implementation.
The analysis section contains the implementation's requirements. The requirements derive from a combination of the domain requirements of CADx and the project's overall goals.
The design section begins with descriptions of the genetic selection and evolutionary training algorithms. There is a review of two object-oriented approaches: one is a package-based approach and the other is a decoupled approach. There is an explanation of why the project uses the decoupled approach.
A discussion of the specific data layout requirements for SIMD, in the context of both SSE and SIMT, follows (see Section 2.4). There are descriptions of both the general nature of the layout and the specific data structures the project's implementation uses to encapsulate this need.
The chapter then reviews the full design, consisting of a combination of the decoupled approach and the project's SIMD data layout classes. A demonstration of the design's flexibility, featuring a substitution of one evolutionary training algorithm for another, concludes the chapter.

3.2  Analysis

3.2.1  Requirements
The goal of this project was to explore using CUDA to accelerate neural network design algorithms for breast cancer CADx. Achieving this goal required a baseline measure of CPU performance; it would be incorrect to assume that the GPU will outperform the CPU.
Accomplishing the project's goal requires creating a corresponding CPU component for each GPU component. Optimizations for the CPU are different from the optimizations for the GPU. The implementation's design must be flexible enough to support the disparate needs of both approaches.
Accomplishing these goals in the reference implementation has the following project implications:
1. To provide a meaningful comparison between the CPU and GPU, the CPU implementation should use the SSE instructions for calculation (see Section 2.4).
2. The neural network topologies should be basic, with very few hidden nodes (see Section 2.5.1).
3. The domain-specific Az should be a metric in the analysis (see Section 2.2.3).
4. The neural network training algorithm should employ evolutionary computing (see Section 2.2.5).
5. The training and validation should use datasets generated via a sampling technique, such as leave-one-out bootstrap (see Section 2.2.5).


3.3  Design

3.3.1  The Algorithms
The genetic and evolutionary algorithms implemented are typical general-purpose versions of the respective methods. The goal of the project is to measure runtime performance, not effectiveness. The exact nature of the algorithms is not important so long as they are characterized by large amounts of parallel computation.

Genetic Algorithm
A genetic algorithm performs feature selection and determines the number of nodes in the hidden layer.
The genetic algorithm's steps are as follows (Negnevitsky 2005, pp. 222-225):
1. Randomly generate a set of N chromosomes. Each chromosome represents a neural network. The chromosomes have two components: a binary encoded component representing the available features and a single integer gene representing the number of hidden nodes.
2. Calculate the fitness of each neural network (see below).
3. Using roulette wheel selection, generate N - T partner pairs. Each pair will generate an offspring chromosome. Each offspring gene has an equal probability of coming from either parent.
4. The generated chromosomes and the top T chromosomes based on fitness form the next generation.
5. Calculate the fitness of each neural network (see below).
6. If the generation count is not met then go to step 3.
7. Output the population list with fitness.
To calculate fitness, the genetic algorithm performs the following:
1. For each sample set⁹ perform the following:
   a. Train the neural network using the training data.
   b. Generate a fitness Az value using the validation data.
2. Sort the fitness values in descending order.
3. Use the bottom Cth performance value as the chromosome's performance.
Using the worst case performance result provides neural networks that perform well under variation.

Evolutionary Trainer
The neural network training algorithm is as follows (Negnevitsky 2005, pp. 288-289):
1. Create a population of 2m weight vectors wᵢ with random uniformly distributed numbers between -1.0 and +1.0.
2. Calculate the population fitness using the error squared.
3. Sort the population based on fitness.
4. Set the top m weight vectors as "parent" vectors, and drop the bottom m vectors.
5. For each parent, create a "child" weight vector:
   w′ᵢ = wᵢ + δⁿ·σᵢ, where δ ∈ (0, 1), n = the current generation number, and the elements of σᵢ are uniformly distributed numbers in the interval [-1, 1].
6. Calculate fitness on the combined population of parents and children based on squared error.
7. Sort the population based on fitness.
8. If the required number of generations has been evaluated, then end; otherwise go to step 4.

⁹ The implementation uses leave-one-out bootstrap to generate the data samples. The training set contains the records selected; the corresponding validation set contains the unselected records. The genetic algorithm implementation supports any sampling technique, however.
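The child-creation mutation in step 5 can be sketched in C++ as follows. The function name, the RNG choice, and the parameter handling are illustrative assumptions, not the project's actual code.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sketch of step 5: child = parent + delta^n * sigma, where delta in
// (0, 1) shrinks the mutation as the generation number n grows and
// sigma's elements are uniform in [-1, 1].
std::vector<float> MakeChild(const std::vector<float>& parent,
                             float delta, int generation,
                             std::mt19937& rng) {
    std::uniform_real_distribution<float> sigma(-1.0f, 1.0f);
    const float step = std::pow(delta, static_cast<float>(generation));
    std::vector<float> child(parent.size());
    for (std::size_t i = 0; i < parent.size(); ++i)
        child[i] = parent[i] + step * sigma(rng);  // delta^n * sigma_i
    return child;
}
```

Because δ < 1, the perturbation bound δⁿ decays with each generation, so early generations explore widely while later ones fine-tune the surviving weight vectors.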

3.3.2  The Object-oriented Design

A "Package Based" Object-oriented Design Approach
A key design requirement is to support both CPU and GPU implementations. Both implementations share most functionality; only a small portion requires dual implementation. One possible approach to reusing the shared components is the "package based" approach in Figure 3-1 on the following page. In this approach the abstract GeneticSelector and EvolutionaryTrainer classes contain the shared functionality. Implementation specific subclasses then reside in individual packages; in this case there are CPU and GPU packages. A CPU based GeneticSelector directly depends on using a CPU based EvolutionaryTrainer.
The package approach is frequently used in object oriented systems. One classic example is databases. Often a database API will have a set of defined interfaces and abstract classes. These base classes contain general database functionality such as connecting, querying, result reading, etc. This functionality is common to all database systems. Each individual database platform (Oracle, MySQL, SQL Server, DB2, etc.) will have a package containing the platform specific implementation. Each class in the implementation package will derive from and correspond to a predefined abstract base class. The classes inside an implementation package are interdependent; you cannot use an Oracle connection object with a SQL Server query object, for example.
While the package approach is very common, in this case it presents problems with coupling (Larman 2002, pp. 229-236). Coupling is the measure of how strongly one element is dependent on another.
Problems occur when there is high coupling along volatile dimensions. Database libraries couple across the relatively stable dimensions of a database platform's functionality. In the context of this project, the corresponding EvolutionaryTrainer and GeneticSelector implementation classes couple across two volatile dimensions: the combination of the appropriate algorithm and the training method can frequently change. A new genetic algorithm implementation requires three new classes: one base class implementing the algorithm, and two corresponding implementation child classes.

A Decoupled Object-oriented Design
The design in Figure 3-2 on the next page decouples the implementation of the two algorithms and the neural network calculation. It separates each class's responsibilities and collaborations (Beck & Cunningham 1989). In this case, a GeneticSelector's responsibility is to execute the genetic algorithm to find optimal features. To fulfill its responsibility, the GeneticSelector uses a NeuralNetTrainer to obtain trained neural networks. In this design, the GeneticSelector is not only independent of whether the trainer uses the CPU or GPU, but also is independent of the training


technique. The GeneticSelector could use a back propagation trainer without any modification. In this case, the NeuralNetTrainer is responsible for training the neural networks, and collaborates with a NeuralNetEvaluator to calculate the output for a set of neural networks given data and weights. The CudaEvaluator contains the GPU implementation and the SseEvaluator contains the CPU implementation. The NeuralNetTrainer uses the NeuralNetEvaluator base class, so there is no need to modify a NeuralNetTrainer to switch between GPU and CPU implementations.
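A minimal sketch of this decoupling follows, using the class names from the design figures. The method signatures and the placeholder arithmetic are assumptions for illustration, not the project's actual interfaces.

```cpp
#include <cstddef>
#include <vector>

// Abstract evaluator: computes network outputs for a batch of records.
struct NeuralNetEvaluator {
    virtual ~NeuralNetEvaluator() {}
    virtual std::vector<float> Evaluate(const std::vector<float>& weights,
                                        const std::vector<float>& records) = 0;
};

// CPU implementation (the real project would use SSE intrinsics here).
struct SseEvaluator : NeuralNetEvaluator {
    std::vector<float> Evaluate(const std::vector<float>& weights,
                                const std::vector<float>& records) override {
        std::vector<float> out(records.size());
        for (std::size_t i = 0; i < records.size(); ++i)
            out[i] = weights[0] * records[i];  // placeholder for the real net
        return out;
    }
};

// The trainer depends only on the abstract base class, so swapping in a
// CudaEvaluator requires no changes to the trainer itself.
struct NeuralNetTrainer {
    explicit NeuralNetTrainer(NeuralNetEvaluator& e) : eval(e) {}
    std::vector<float> Run(const std::vector<float>& w,
                           const std::vector<float>& data) {
        return eval.Evaluate(w, data);  // training loop elided
    }
    NeuralNetEvaluator& eval;
};
```

The trainer's constructor accepts any NeuralNetEvaluator, which is exactly the property the decoupled design relies on.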

  Figure  3-­‐1  –  A  Package  Based  Design  Approach  

 

     

  Figure  3-­‐2  –  Improved  Design  

 

3.3.3  Data Structures for SIMD
The typical data layout in programming uses an array of structures (AoS) approach. Both SIMD paradigms (processor array and vector pipeline) require a structure of arrays (SoA) approach (Wald 2004, p. 79). SoA is a transpose of the traditional in-memory data layout. The task of calculating the average grade for four students, each with eight marks, can demonstrate the difference between the


two. The normal AoS approach is to define a Student data structure and create an array of four Student structures. In this case, Student contains an eight element array of grade values. Figure 3-3, Figure 3-4, and Figure 3-5 on the next page demonstrate this layout.
Figure 3-3 shows the AoS layout of the grades in memory. A given student's grades are adjacent to each other. Averaging each student's grade is straightforward:

• For each student:
  o Set the running total to zero.
  o For each course:
    § Add the current student's current course grade to the running total.
  o Divide the running total by eight.

This layout is not compatible with SIMD. SIMD systems calculate the average for all four students simultaneously. A SIMD calculation is as follows:

• Simultaneously set four running total values to zero.
• For each course:
  o Simultaneously add all four course grades to the corresponding running totals.
• Simultaneously divide all four running totals by eight.

In AoS the corresponding grades for each course are not adjacent to each other in memory; they are separated by eight memory locations. The solution is to transpose the data's memory layout. The SoA approach places the corresponding values adjacent to each other. The memory layout depicted in Figure 3-4 accommodates a simultaneous SIMD execution over all four values.
Another constraint in SIMD is that operations occur in fixed multiples. SSE is a vector pipeline and must execute an instruction over exactly four adjacent memory locations at a time. If the number of students is not a multiple of four then SSE will still perform the operation over the padded values. Figure 3-5 depicts an example where there are five students. In this case two SSE calculations will occur per course. Care must be taken to ignore calculation results from the remaining three padded columns.
Because CUDA's SIMT model is a processor array, not a vector pipeline, it does not operate over a fixed amount of data. However, CUDA does group execution into units of 32 (the number of threads in a warp) internally. For maximum performance it is best to execute a kernel with a thread count that is a multiple of 32 and use padding.
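The two layouts and loops above can be sketched in C++ as follows. Both functions produce the same averages, but the SoA inner loop touches four adjacent floats per course, which is exactly the access pattern SSE (and, in wider units, SIMT) needs.

```cpp
// AoS vs SoA for the four-students, eight-marks example.
const int kStudents = 4, kCourses = 8;

struct Student { float grades[kCourses]; };  // AoS element

// AoS: one student's grades are contiguous; averaging is serial per student.
void AverageAoS(const Student* s, float* avg) {
    for (int i = 0; i < kStudents; ++i) {
        float total = 0.0f;
        for (int c = 0; c < kCourses; ++c) total += s[i].grades[c];
        avg[i] = total / kCourses;
    }
}

// SoA: grades[c][i] holds course c's grade for student i, so the four
// per-course grades sit adjacent in memory and can be added in lockstep.
void AverageSoA(const float grades[kCourses][kStudents], float* avg) {
    float total[kStudents] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int c = 0; c < kCourses; ++c)
        for (int i = 0; i < kStudents; ++i)  // four adjacent additions
            total[i] += grades[c][i];
    for (int i = 0; i < kStudents; ++i) avg[i] = total[i] / kCourses;
}
```

The SoA inner loop over `i` is the part that a vector pipeline replaces with a single packed instruction.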


 

     

Figure 3-3 – The Array of Structures (AoS) Memory Layout
(memory runs left to right within each student; students are stored one after another)

  Student 0:  g00 g01 g02 g03 g04 g05 g06 g07
  Student 1:  g10 g11 g12 g13 g14 g15 g16 g17
  Student 2:  g20 g21 g22 g23 g24 g25 g26 g27
  Student 3:  g30 g31 g32 g33 g34 g35 g36 g37

Figure 3-4 – The Structure of Arrays (SoA) Memory Layout Transposes the AoS Layout
(memory runs left to right within each course; the four students' grades for a course are adjacent)

  Course 0:  g00 g10 g20 g30
  Course 1:  g01 g11 g21 g31
  Course 2:  g02 g12 g22 g32
  Course 3:  g03 g13 g23 g33
  Course 4:  g04 g14 g24 g34
  Course 5:  g05 g15 g25 g35
  Course 6:  g06 g16 g26 g36
  Course 7:  g07 g17 g27 g37

Figure 3-5 – Structure of Arrays (SoA) with Required Padding
(five students padded to eight columns; n/a marks padding)

  Grade 0:  g00 g10 g20 g30 g40 n/a n/a n/a
  Grade 1:  g01 g11 g21 g31 g41 n/a n/a n/a
  Grade 2:  g02 g12 g22 g32 g42 n/a n/a n/a
  Grade 3:  g03 g13 g23 g33 g43 n/a n/a n/a
  Grade 4:  g04 g14 g24 g34 g44 n/a n/a n/a
  Grade 5:  g05 g15 g25 g35 g45 n/a n/a n/a
  Grade 6:  g06 g16 g26 g36 g46 n/a n/a n/a
  Grade 7:  g07 g17 g27 g37 g47 n/a n/a n/a

 

3.3.4  The SIMD Sampling Data Structure
The training and validation datasets use a SoA layout with the values for a feature adjacent to each other. On the next page, Figure 3-6 displays the layout. The width is the number of records with appropriate padding. The height is the number of features. The SamplingData class in Figure 3-7 contains a group of training datasets (a set of bootstrap samples, for example) with the matching validation datasets¹⁰.
The only difference between the TrainingSet and TestingSet classes is that the number of records in each TestingSet dataset can vary. In leave-one-out bootstrap, the number of records in the validation set is random; it is the number of records that were not selected in the original bootstrap.

¹⁰ Lo et al. (2006) make a valid distinction between the terms testing and validation. They correctly assert that validation occurs during the learning process; testing data should never be any part of the learning process. Therefore the TestingSet class should be renamed ValidationSet during a future refactoring. Similar name changes should occur in the appropriate variable and function names.


In the TrainingSet and TestingSet classes (see Figure 3-7), the fields are as follows:
1. Alignment: SSE requires that the floating point array starts on 16 byte aligned memory addresses, meaning the starting memory address is a multiple of 16.
2. FieldDim: The number of features a record contains.
3. RecordCnt/RecordCnts: The number of records in the dataset. TrainingSet uses the field RecordCnt, and TestingSet uses the field RecordCnts, because with a leave-one-out bootstrap the number of records in the validation set varies randomly.
4. RecordDim/RecordDims: If the number of records is not an appropriate multiple, there is padding. The RecordDim/RecordDims is a multiple of the RecordDimMultiple. If RecordCnt is five and RecordDimMultiple is four then the RecordDim is eight.
5. RecordDimMultiple: Four for SSE, 32 for SIMT/CUDA.
6. SampleDim/TestsetDim: The number of samples the structure contains. In the case of bootstrapping, this is the number of bootstraps executed; in k-fold cross-validation, this is the value of k.
In keeping with the project's generic design approach, SamplingData is not bound to the bootstrap method. It represents any set of sampled data. While the project uses the Bootstrap class to generate a SamplingData instance, there is nothing that prevents GeneticSelector from using a k-fold data sample.

Figure 3-6 – SoA Memory Layout for Training/Validation (records →, features ↓)

Figure 3-7 – Data Classes, SoA Implementation
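The RecordDim rounding rule described in the field list can be sketched as a small helper; the function name is an assumption for illustration.

```cpp
#include <cstddef>

// Round the record count up to the next multiple of RecordDimMultiple
// (four for SSE, 32 for SIMT/CUDA); the padded tail must be ignored in
// any results computed over it.
std::size_t PaddedRecordDim(std::size_t recordCnt, std::size_t multiple) {
    return ((recordCnt + multiple - 1) / multiple) * multiple;
}
```

With recordCnt = 5 this yields 8 for SSE (multiple 4) and 32 for CUDA (multiple 32), matching the example given for the RecordDim field.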


3.3.5  Full Design
Figure 3-8 below depicts the project's full design; the approach combines the decoupling described in Section 3.3.2 with the SoA-based data classes described in Section 3.3.4. The GeneticSelector uses a SamplingData class. In the implementation a Bootstrap class has the responsibility of creating a SamplingData instance.
There are two NeuralNetTrainer classes in the implementation. The OrigEvolutionaryTrainer is the initial training algorithm implementation. This implementation failed to converge during testing (see Section 3.3.6 for more details).

      Figure  3-­‐8  –  Overall  Project  Design  

3.3.6  Framework Flexibility
A demonstration of the framework's flexibility occurred during testing. Below is a description of the original evolutionary training algorithm implementation (Land et al. 2006):
1. Create m "parent" weight vectors wᵢ.
2. Create "child" weight vectors: w′ᵢ = wᵢ + C·σ′ᵢ, where C is a standard Cauchy variable and
   σ′ᵢ = σᵢ · exp( N(0,1)/√(2n) + Nᵢ(0,1)/√(2√n) ), with n = the total number of weights.
3. Calculate fitness on the combined population of parents and children based on classification error (see below).
4. For each population member, create a tournament by selecting x random competitors. Increase the win count of the tournament member with the highest fitness.
5. Sort the population by win count, and remove the bottom 50%.
6. Reset the win counts, and set the surviving members as parents.
7. If the required number of generations has been evaluated, then end; otherwise go to step 2.
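Step 2's Cauchy mutation with self-adapting σ can be sketched as follows, assuming the standard library's `<random>` distributions in place of the project's Boost-based generators; the names and setup are illustrative.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sketch of step 2: child_w = parent_w + C * sigma', where C is a
// standard Cauchy variable and sigma' self-adapts via
// sigma'_i = sigma_i * exp(N(0,1)/sqrt(2n) + N_i(0,1)/sqrt(2*sqrt(n))).
void CauchyMutate(const std::vector<float>& parentW,
                  const std::vector<float>& parentSigma,
                  std::vector<float>& childW,
                  std::vector<float>& childSigma,
                  std::mt19937& rng) {
    const std::size_t n = parentW.size();  // total number of weights
    std::normal_distribution<float> normal(0.0f, 1.0f);
    std::cauchy_distribution<float> cauchy(0.0f, 1.0f);
    const float tauPrime = 1.0f / std::sqrt(2.0f * static_cast<float>(n));
    const float tau = 1.0f / std::sqrt(2.0f * std::sqrt(static_cast<float>(n)));
    const float shared = normal(rng);      // one N(0,1) draw per vector
    childW.resize(n);
    childSigma.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        // exp(...) keeps every sigma strictly positive.
        childSigma[i] = parentSigma[i]
                      * std::exp(tauPrime * shared + tau * normal(rng));
        childW[i] = parentW[i] + cauchy(rng) * childSigma[i];
    }
}
```

The heavy-tailed Cauchy draws occasionally produce large jumps, which is the behavior that motivated this variant; the exponential update keeps each σᵢ positive across generations.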

The algorithm implementation could not pass testing, due in large part to a failure to converge. Because it was not within the scope of this project to determine the exact cause of an implementation's failure to converge – the failure could be due to subtle floating point issues, for instance – a more typical evolutionary training algorithm was substituted into the project.
This algorithm remains in the project as OrigEvolutionaryTrainer, and the project will still compile with this legacy implementation. The ability to quickly change from the original implementation to the final implementation, with no changes to the surrounding implementation or framework, demonstrates the design's resiliency.

3.4  Summary
The goal of this project was to explore using CUDA to accelerate CADx neural network design. Accomplishing this goal required creating a matching CPU implementation for the GPU component of the system in order to verify whether the GPU can actually outperform the CPU.
The project's implementation employs a genetic algorithm for feature selection and network architecture. The algorithm uses Az to measure fitness (see Section 2.2.3). An evolutionary computing algorithm trains each neural network. There are separate CPU and GPU implementations to calculate the neural network output.
The object-oriented design for the project employs loose coupling rather than packaging components. Packaging requires creating an abstract base class as well as matching CPU and GPU implementation classes for each algorithm component. Decoupling the different components prevents changes in one portion of the implementation from having a cascading impact on unrelated sections.
SIMD requires data structures with a structure of arrays (SoA) memory layout (Wald 2004, p. 79). The SoA layout is a transpose of the traditional array of structures (AoS) memory layout. In AoS, data from the same record are adjacent to each other; in SoA, data from the same feature are adjacent to each other. The project's data classes TrainingSet and TestingSet contain data in this layout.
The overall design for the project combines the decoupled approach for the processing components with the SIMD-compatible SoA approach to the data classes.


Chapter 4 – Implementation

4.1  Overview

This chapter presupposes an understanding of the SIMT and SSE threading models (see Section 2.4), the basics of neural network calculation (see Section 2.2.4), and the SoA data layout (see Section 3.3.3 and Section 3.3.4).
The project's implementation applies the preceding design and requirements to the specific needs of CUDA and SSE. The description of CUDA and SSE development drills down from the generic to the implementation specific.
The chapter starts with general notes on CUDA and SSE development. This overview describes the overall process for development in the respective environments; it is not specific to this project's implementation. Next there is an explanation of why the two environments require native C++ and a review of some implementation details specific to C++.
The chapter continues with a description of how the implementation calculates the neural network output in a SIMD fashion using data arranged in a SoA layout. It concludes with separate explanations of how the CPU and the GPU perform the calculation. The explanations include a review of the relevant source code from each implementation involved in the calculation.

4.2 General Implementation Details

4.2.1 CUDA Development

NVIDIA uses the term "device" instead of the term "graphics card" when describing CUDA hardware. This nomenclature emphasizes that CUDA is an architecture for general-purpose computation. The device typically is a graphics card; however, NVIDIA also sells its Tesla product line. Tesla cards do not have video outputs because they are solely for high performance computing.

A CUDA device is a separate, computation-focused computer. The device resides inside a "host" (typically a workstation). A GPU controls the device; a CPU controls the host. The device has substantial memory that is separate from the host's main memory11. Program execution on the device is completely separate from the host process.

A single C function called a kernel executes on all of the SIMT threads. Each thread is a member of a block and has a thread id. Each thread is uniquely identified by the combination of its thread id and its block id. The kernel invocation call passes the number of threads per block and the number of blocks. It is best to have a multiple of 32 threads per block because the 32-thread warp is the unit of CUDA execution. A block can contain a maximum of 512 threads. The general steps in CUDA program execution are as follows:

On the host:
1. Allocate memory on the device for both input and output.
2. Copy input data from host memory to device memory.
3. Define the number of device threads per block and the number of blocks that will execute, and determine how the threads are grouped.
4. Invoke the kernel function on the device.

On the device:
1. Each thread executes the kernel function in SIMT fashion. The threads use the combination of block id and thread id for indexing.
2. Notify the host when all threads complete execution.

On the host:
1. Copy output data from device memory to host memory.
2. Free device memory.

11 The GTX 260 used in this project has 896MB.

Creating a CUDA program requires two compilers because execution occurs on both the device and the host. The CUDA Toolkit (NVIDIA 2010) provides the nvcc compiler for the device. Nvcc is a C compiler with a small set of extensions. The host compiler is platform specific: either Visual C++ on Windows or gcc on Linux and MacOS. The general program executing on the host must link to the CUDA C program's host component. The host component performs the operations above.

4.2.2 SSE Development

It is necessary to use SSE instructions for optimal CPU computational performance. There are three options for employing SSE (Wald 2004, p. 80):
1. Relying on automatic compiler optimization.
2. Using SSE compiler intrinsics (Microsoft Corporation 2010c).
3. Writing Assembly manually.

Wald reports that compilers frequently do not recognize program sections eligible for SSE optimization. This project uses compiler intrinsics for SSE calls. The Milner & Grandison (2008) logistic function uses manual Assembly provided by the authors.

SSE also requires a specific memory alignment for calculations. An SSE vector pipeline instruction operates over four four-byte floating point numbers; that is, 16 bytes of data (4x4). The memory address for the beginning of the four-number vector must be a multiple of 16. This alignment requirement precludes the use of the C++ new and delete operators. SSE requires the C functions _aligned_malloc() and _aligned_free() to allocate and release the memory used for SSE calculations.

4.2.3 The Choice of Native C++

CUDA and SSE need a native, unmanaged execution environment because of their linking and memory alignment requirements.
Microsoft offers the C++/CLI language as a bridge between native C++ and .NET managed-environment languages such as C# (Wikipedia 2010a). C# also natively supports pointers. Despite the options C++/CLI and C# offer for mixing a managed environment with an unmanaged native environment, the approach is impractical.

Managed environments, whether Java or .NET, have a specific goal of maintaining control over the management of, and access to, the environment's memory space. This is antithetical to CUDA and SSE's linking and memory layout requirements.

C++/CLI and C# will not provide direct pointers into the managed memory space. The languages enforce a strict separation between managed and unmanaged memory for the heap, the call stack, and the instruction stack. All data in the managed environment must be copied to the unmanaged environment before native code sections can use them.

Moving between managed and unmanaged execution is also deleterious to performance. When a managed function calls an unmanaged function, the following occurs (Microsoft Corporation 2010d):
1. The function call arguments are marshaled from CLR to native instances.
2. A managed-to-unmanaged thunk (a switch from a managed to an unmanaged stack, heap, and instruction context) occurs.
3. The unmanaged function is called (using the marshaled native instances of the arguments).
4. An unmanaged-to-managed thunk occurs.
5. The return value and any output arguments are marshaled from native to CLR instances.

Thus, not only does the boundary between managed and unmanaged memory require constant copying between the two regions, but also constant context switching between managed and unmanaged stack frames. An additional problem is double thunking. Functions in programs created for a mixed managed/unmanaged environment by default have both managed and unmanaged entry points. Double thunking occurs when a managed function calls another managed function's native entry point.
The native call then routes the call to the managed entry point. Two thunks occur when none were necessary, as this was a managed-to-managed call. If an unmanaged entry point exists for a virtual method in a managed class, double thunking will always occur (Microsoft Corporation 2010b).

Linking managed and unmanaged programs is also intricate. The .NET environment uses a different naming and argument passing protocol (Microsoft Corporation 2010a).

During implementation there were a number of attempts to integrate .NET into the project. Each time one problem was solved, another cropped up. Ultimately, .NET was excluded from the current implementation. A feasible approach to integrating .NET in the future is to use C# for high-level application tasks such as the user interface and I/O. C++/CLI can then simply serve as a bridge to marshal data between the managed and unmanaged environments.

4.2.4 Factory Classes and Smart Pointers

In anticipation of the future need for multithreaded NeuralNetEvaluator and NeuralNetTrainer implementations, the project employs the Factory pattern. Multithreading and concurrency frequently require complex creation logic. The Factory pattern decouples object use from object creation (Larman 2002, pp. 346-348). Clients use the factory classes; they do not create instances directly.

Since the factory classes are responsible for creating NeuralNetEvaluator and NeuralNetTrainer instances, it is not appropriate for the calling classes to destroy them. In a managed environment,

this is not an issue: garbage collection automatically destroys objects when there are no more references.

Unmanaged C++ does not offer automatic garbage collection. To address this challenge the project uses Boost's smart_ptr C++ library (Boost Project 2010). The factory classes return smart pointers to the NeuralNetEvaluator and NeuralNetTrainer instances. Boost smart pointers are not a panacea; for example, the library does not automatically handle circular references. See the Boost documentation for more information.

4.2.5 Random Numbers and Distributions

The Bootstrap, GeneticSelector, and EvolutionaryTrainer use Boost's random library to generate random values. The library supports many different probability distributions, including the uniform, normal, and Cauchy distributions.

4.3 Neural Network Output Calculation

4.3.1 Base Design

The leaf operation for the system is the calculation of neural network outputs during training. Therefore, this is where a GPU implementation can provide the most value. The calculation is intrinsically parallel.

The calculation of a neural network's output (see Section 2.2.4) is as follows:
1. Set the output node's input value to zero (there is only one output node).
2. For each node in the hidden layer:
2.1. Set the hidden layer node's input value to zero.
2.2. For each feature value:
2.2.1. Multiply the feature value by the appropriate weight value.
2.2.2. Add the result to the current hidden layer node's input value.
2.3. Add the current hidden layer node's bias to the node's input value12.
2.4. Calculate the current hidden layer node's activation function.
2.5. Multiply the result by the appropriate weight.
2.6. Add the result to the network output node's input value.
3. Add the bias to the network output node's input value.
4. Calculate the network output node's activation function.
5. The result is the neural network's output; end.

Figure 4-1 on the following page shows the multiply, add, and assign operations from steps 2.2.1 and 2.2.2 occurring on a single feature/node connection in SIMD fashion (see Section 2.4.1) over four records. The calculation uses the same network with the same weight vector. The feature values vary because they are from four different records. The weight value is broadcast and does not vary because the same network and weight vector is calculating all four records. The product is added and assigned to the node's input.

12 The bias is simply a threshold; it shifts the activation function's curve to the left or right without changing the curve's shape.

Fumbeya  Luis  Marungo   MSc  Advanced  Information  Systems  project  report      

   

Page  35  of  119   September,  2010  

records →         f0     f1     f2     f3     (feature values)
multiplication    *      *      *      *
current weight    w0     w0     w0     w0     (broadcast)
add and assign    +=     +=     +=     +=
node value        n0     n1     n2     n3

Figure 4-1 – SIMD Node Multiply Add Assign

 

4.4 CPU Implementation

SSE requires a 16-byte memory alignment: the starting address must be a multiple of 16 (see Section 4.2.2). If the memory is not correctly aligned, a CPU instruction-level error occurs. Only the C functions _aligned_malloc() and _aligned_free() can manage aligned memory; the C++ new and delete operators will not work. This eliminates the option of using Boost's smart pointer library for direct memory management. To protect against memory leaks, the project places the aligned memory in the TestingData and TrainingData classes. The memory is allocated during construction and released during destruction. These classes can then be managed by the smart pointer library.

Figure 4-2 on the next page contains the source code for the CPU implementation. The code contains special data structures and SSE Intrinsics calls that make it differ from typical C/C++ in appearance. __m128 is a special SSE data type that represents four aligned floating point numbers. _mm_set1_ps(0.0f) initializes all four vector node input values to zero; _mm_set1_ps(*w) sets the same weight value in all four vector elements. _mm_add_ps(ipt4, _mm_mul_ps(w4, *d)) executes the multiply-add. SquashingFunctionP4(&ipt4) transforms the input value to an output value using the Milner & Grandison (2008) logistic sigmoid implementation. The function is written purely in Assembly; it is also fast because it executes only nine SSE instructions (see SseGlobal.cpp on page 99).

  Using  a  CUDA-­‐enabled  Graphics  Card  to  Accelerate  Neural  Network  Design  for  Breast  Cancer  Computer-­‐aided  Diagnosis  

void SseGlobal::EvaluateNN(float *dStart, int rowDim, int fieldDim,
                           int wgtVectDim, int hidNodCnt,
                           float *weight, float *output, int recCnt)
{
    __m128 w4;   // weights on the hidden layer node
    __m128 wo4;  // weights to the output
    __m128 ipt4; // input node values

    // iterate over 4 records at a time
    int inc = rowDim / 4;
    int wOutOff = (fieldDim + 1)*hidNodCnt;
    int wOff = (fieldDim + 2)*hidNodCnt + 1;
    float *weightEnd = &weight[wOff*wgtVectDim];
    __m128 *dEnd = (__m128 *)&dStart[rowDim*fieldDim];
    while(weight < weightEnd)
    {
        __m128 *opt4 = (__m128 *)output;
        float *data = dStart;
        for(int i = 0; i < recCnt; i += 4, data += 4)
        {
            float *w = weight;
            float *wo = &w[wOutOff];
            // set four network output node input values to zero
            *opt4 = _mm_set1_ps(0.0f);
            // iterate over hidden nodes
            for(int j = 0; j < hidNodCnt; ++j)
            {
                // set four hidden layer input values to zero
                ipt4 = _mm_set1_ps(0.0f);
                // iterate over inputs
                for(__m128 *d = (__m128 *)data; d < dEnd; d += inc)
                {
                    // store the same weight in four consecutive memory
                    // locations for SSE operation
                    w4 = _mm_set1_ps(*w);
                    // execute multiply/add/assign
                    ipt4 = _mm_add_ps(ipt4, _mm_mul_ps(w4, *d));
                    ++w;
                }
                // add bias
                w4 = _mm_set1_ps(*w);
                ipt4 = _mm_add_ps(ipt4, w4);
                ++w;
                // calculate output using logistic sigmoid function
                SquashingFunctionP4(&ipt4);
                wo4 = _mm_set1_ps(*wo);
                // execute multiply/add/assign
                *opt4 = _mm_add_ps(*opt4, _mm_mul_ps(wo4, ipt4));
                ++wo;
            }
            // add bias
            wo4 = _mm_set1_ps(*wo);
            *opt4 = _mm_add_ps(*opt4, wo4);
            ++wo;
            SquashingFunctionP4(opt4); // this is the 4 NNs' output
            // move to the next four records
            ++opt4;
        }
        output += rowDim;
        weight += wOff;
    }
}

Figure 4-2 – CPU, SSE Implementation (SseGlobal.cpp)

 

4.5 GPU Implementation

4.5.1 The Project's Implementation

There are two components to the GPU implementation: the function on the CPU that manages the host execution (Figure 4-3 on page 38), and the kernel function that executes on the device (Figure 4-4 on page 39).

Before invoking the kernel function, the thread dimensionality must be set up. SIMT device threads are grouped together into thread blocks. A thread block is a set of threads that can share on-chip memory and synchronization calls. Each block can contain up to 512 threads. The combination of the thread's block id and thread id uniquely identifies a thread. CUDA's data structures allow thread ids to be up to three-dimensional and block ids to be up to two-dimensional.

In Figure 4-3 each block contains the maximum 512 threads in a 32x1x16 layout. Each block will calculate 32 records over 16 weight vectors. The number of records and the number of different weight vectors determine the number of blocks. If there are 96 records and 32 weight vectors to evaluate, then there will be six blocks arranged in a 3x2 layout. 3,072 threads will execute; each thread will calculate a neural network's output for a unique record and weight vector combination. NVIDIA describes the setup of the thread ids and block ids as the thread hierarchy. In this case the threads are in a block with block dimensionality of 32x1x16 and the blocks are in a grid with grid dimensionality of 3x2.

The invocation function uses the helper functions in Figure 4-5 (page 40) to transfer data between the device and the host.
The CUDA API functions are similar to the well-known C functions for memory allocation, release, and copy operations. The cutilSafeCall and cutilCheckMsg functions are included with the CUDA Toolkit; they provide error checking on the device.

Each SIMT device thread executes the kernel function in Figure 4-4. The first task is to determine which record and weight vector to calculate using the thread id and the block id. The rest of the program is a straightforward C program that performs the calculations. The kernel reads more like a typical C function than the SSE-based function with its mixture of Assembly, SSE Intrinsics, and special data structures.

__host__ void BasicEvaluateNN(NNEvaluationData data)
{
    // number of threads per block
    blkDim.x = 32;
    blkDim.y = 1;
    blkDim.z = 16;

    int recCnt = hostDs.RecordCnt;
    int wgtCnt = data.WeightVectorDim;
    // number of blocks
    grdDim.x = (recCnt & (blkDim.x*blkDim.y - 1)) ?
        (recCnt / (blkDim.x*blkDim.y)) + 1 : recCnt / (blkDim.x*blkDim.y);
    grdDim.y = (wgtCnt & (blkDim.z - 1)) ? wgtCnt/blkDim.z + 1 : wgtCnt/blkDim.z;
    grdDim.z = 1;
    // load the data & workspace
    NNEvaluationData copy = data;
    data.Output[1] = data.Dataset[1] = data.WeightVectors[1] = 0;
    data.WeightSetDim = 1;
    for(int i = 0; i < copy.WeightSetDim; ++i)
    {
        data.Dataset[0] = copy.Dataset[i];
        data.WeightVectors[0] = copy.WeightVectors[i];
        data.Output[0] = copy.Output[i];
        // load the data & workspace
        NNEvaluationData nn = LoadEvalData(data);
        // call kernel
        ExecEvaluateNN<<<grdDim, blkDim>>>();
        cutilCheckMsg("Kernel ExecGlobalMemoryEvaluateNN execution failed");
        cudaThreadSynchronize();
        // copy output
        GetOutputEvalData(nn, (float **)&data.Output);
        // free device memory
        UnloadEvalData(nn);
    }
    return;
}

Figure 4-3 – CUDA Invocation Function (CudaBasic.cu)

static __global__ void ExecEvaluateNN()
{
    // determine which record and weight vector to process
    int recIdx = blockDim.x*(blockDim.y*blockIdx.x + threadIdx.y) + threadIdx.x;
    int wgtIdx = blockDim.z*blockIdx.y + threadIdx.z;
    // iterate through all of the datasets
    float opt = 0.0f;
    if(wgtIdx < Nn.WeightVectorDim)
    {
        float *wgt = &((float *)Nn.WeightVectors[0])[wgtIdx*Nn.WeightEleDim];
        float *oWgt = &wgt[Nn.WeightOutputOffset];
        float *datStrt = &((float *)Ds.Datasets[Nn.Dataset[0]])[recIdx];
        float *datEnd = &datStrt[Ds.RecordDim*Ds.FieldDim];
        for(int i = 0; i < Nn.HiddenNodeCnt; ++i)
        {
            float ipt = 0.0f;
            for(float *curDat = datStrt; curDat < datEnd; curDat += Ds.RecordDim)
            {
                // multiply/add/assign
                ipt += *curDat * *wgt;
                ++wgt;
            }
            // add bias
            ipt += *wgt;
            ++wgt;
            // multiply/add/assign
            opt += *oWgt/(1.0f + expf(-ipt));
            ++oWgt;
        }
        // add bias
        opt += *oWgt;
        ++oWgt;
        // E squared fitness
        // save output to device memory
        ((float *)Nn.Output[0])[Ds.RecordDim*wgtIdx + recIdx] =
            1.0f/(1.0f + expf(-opt));
    }
}

Figure 4-4 – CUDA Kernel13

13 There are slight differences between Figure 4-4 and the actual source code. The source code calculates recIdx and wgtIdx using function calls. Nvcc inlines all device functions; therefore these variances are for descriptive purposes only and have no bearing on functionality.

static NNEvaluationData LoadEvalData(NNEvaluationData host)
{
    NNEvaluationData dev = host;

    // zero terminate arrays
    dev.WeightVectors[dev.WeightSetDim] = 0;
    dev.Output[dev.WeightSetDim] = 0;
    dev.Dataset[dev.WeightSetDim] = 0;
    int len0 = sizeof(float) * dev.WeightVectorDim * dev.WeightEleDim;
    int len1 = sizeof(float) * dev.WeightVectorDim * hostDs.RecordDim;
    for(int i = 0; i < host.WeightSetDim; ++i)
    {
        cutilSafeCall( cudaMalloc((void **) &dev.WeightVectors[i], len0) );
        cutilSafeCall( cudaMemcpy((void *)dev.WeightVectors[i],
                                  (void *)host.WeightVectors[i], len0,
                                  cudaMemcpyHostToDevice) );
        cutilSafeCall( cudaMalloc((void **) &dev.Output[i], len1) );
    }
    cutilSafeCall( cudaMemcpyToSymbol("Nn", &dev, sizeof(dev)) );
    return dev;
}

static void UnloadEvalData(NNEvaluationData data)
{
    for(int i = 0; i < data.WeightSetDim; ++i)
    {
        cutilSafeCall( cudaFree((void *)data.WeightVectors[i]) );
        cutilSafeCall( cudaFree((void *)data.Output[i]) );
    }
}

void GetOutputEvalData(NNEvaluationData nn, float **output)
{
    for(int i = 0; i < nn.WeightSetDim; ++i)
    {
        cutilSafeCall( cudaMemcpy(output[i], ((void *)nn.Output[i]),
                                  sizeof(float)*nn.WeightVectorDim*hostDs.RecordDim,
                                  cudaMemcpyDeviceToHost) );
    }
}

Figure 4-5 – Device Load, Unload, and Copy Helper Functions (PROJ_MarungoF_Cuda.cu)

4.5.2 Preliminary GPU Optimization Efforts

There were preliminary attempts to optimize the kernel in Figure 4-4. The work centered on pre-fetching data from the device's global memory (located on separate memory chips on the device) and storing them in shared memory (located on the GPU chip itself). The interesting result was that the simple implementation performed better than the more complicated attempts at optimization. For more details see Appendix C.

4.6 Summary

Every CUDA program has two parts. One part runs on the host (normally a workstation). The other part runs on the device (normally a graphics card). The host component allocates and frees memory on the device, copies data between the device and the host, and invokes the kernel on the device. The host program groups threads into blocks before invoking the kernel. There is a maximum of 512 threads per block. The kernel invocation passes the number of threads per block and the number of blocks. A device thread is uniquely identified by the combination of its block id and its thread id.

The same kernel function runs in each thread on the device in SIMT fashion. Because a CUDA program contains both host and device components, two compilers are necessary. The nvcc compiler generates the device program. The host compiler is platform specific (Visual C++ on Windows and gcc on Linux and MacOS). The rest of the general program must link with the C-based host program.

Employing SSE often requires either using SSE Intrinsics (Microsoft Corporation 2010c) or manually writing Assembly (Wald 2004, p. 80). In either case, SSE has very specific memory alignment requirements. The traditional C++ new and delete operators are unavailable because of the alignment requirements; memory management requires the special C functions _aligned_malloc() and _aligned_free().

CUDA's linking needs and SSE's memory alignment requirements necessitate an unmanaged native environment. While C++/CLI and C# have mechanisms to allow interoperability between managed and unmanaged environments, mingling the two environments is not practical. Problems include:
1. Thunking and double thunking issues (Microsoft Corporation 2010d).
2. Linking problems due to different naming and argument passing protocols (Microsoft Corporation 2010a).
3. The need to marshal data between managed and unmanaged environments.

Given the intricate issues involved in interoperability, it is best to use C++/CLI to create a bridge layer between an unmanaged component that handles processing and a managed component that handles file I/O and user interaction.

The CPU implementation uses SSE Intrinsics and the Milner & Grandison (2008) logistic sigmoid Assembly function. The SSE Intrinsics and the Assembly make the source code difficult to read and understand.

The GPU implementation is more readable.
The  host  program  uses  CUDA  functions  that  are  similar   to  traditional  C  functions  to  copy  data  and  manage  memory.    The  device  kernel  function  is  a   straight-­‐forward  C  function.    There  are  no  special  calls.    Each  SIMT  thread  executing  the  kernel  uses   the  block  id  and  the  thread  id  to  determine  which  record  and  weight  vector  combination  to   calculate.  


Chapter 5 – Testing and Results

Figure 5-1 – Results

 

 

5.1 Overview

5.1.1 The Tests

The project uses two forms of testing: functional and performance. Figure 5-1 above displays the output of both the performance and functional tests. The functional tests follow the decoupled collaborations described in Section 3.3.2. The NeuralNetEvaluator tests run first, then the NeuralNetTrainer test uses the confirmed evaluators, and finally the GeneticSelector test uses all of the previously verified components. Performing testing in this order allows bugs to be isolated and fixed, and allows the performance results of earlier tests to inform the settings for subsequent tests.


Performance testing measures execution time. There are two measurement levels. The first level looks at raw performance: it measures how quickly a NeuralNetEvaluator can calculate the output of a large group of neural networks. The GeneticSelector performance test is a total runtime test; it measures the total time spent in each part of the program.

The test results in Figure 5-1 read from the bottom up: NeuralNetEvaluator test results are at the bottom, NeuralNetTrainer test results are in the middle, and GeneticSelector test results are at the top.

5.1.2 The Datasets

Initially the project goal was to use domain-specific data for testing. However, that is not ideal because neural networks are opaque (see Section 2.2.4). When stepping through a program during debugging it is difficult to tell whether a node's input or output value is correct.

Rather than use a domain-specific dataset, which can have unpredictable values, the project uses automatically generated XOR datasets. When weight vectors are necessary the project uses two weight vectors; both are from Negnevitsky (2005, pp. 183-184) and will solve an XOR operation. Using an XOR dataset with known weight vectors simplifies debugging: when the network output is wrong it is possible to compare the calculated value to the correct value at each step. Appendix B contains the correct node input and output values for the two weight vectors.

5.2 The NeuralNetEvaluator Tests

5.2.1 Functional Test Description and Results

The NeuralNetEvaluator functional test uses a network with two hidden nodes and the two proven weight vectors for network output calculation. The functional test compares the evaluator-calculated output with the correct value. The test performs this comparison over a full NeuralNetEvaluator run and returns the maximum difference between the correct and calculated values over all of the network outputs calculated.

It is important to note that the correct output is the result of a floating point calculation; it is never exactly zero or one. Therefore the test uses the maximum difference instead of a hard equality. Different methods of calculation lead to slightly different results. If the maximum difference is very small, then the calculated value is always "close" to the correct value.

The results of the tests confirming that the CPU and GPU NeuralNetEvaluators properly calculate neural network output are on the lines starting with "Accuracy Test" in Figure 5-1. The SSE implementation is slightly less accurate than the CUDA calculation; the maximum differences are 0.00127 and 0.00049 respectively. This result is not surprising, as the CPU implementation trades some accuracy for speed (Milner & Grandison 2008).

5.2.2 Performance Test Description and Results

The NeuralNetEvaluator performance test uses an entirely random dataset because the goal is to measure computing speed, not accuracy. The settings for the performance test are:

Number  of  Records:  

 

1024  

  Using  a  CUDA-­‐enabled  Graphics  Card  to  Accelerate  Neural  Network  Design  for  Breast  Cancer  Computer-­‐aided  Diagnosis  

Fumbeya  Luis  Marungo   MSc  Advanced  Information  Systems  project  report      

• • • •

Number  of  Features:     Number  of  Hidden  Nodes:   Number  of  Samples:     Number  of  Weight  Vector:    

   

Page  44  of  119   September,  2010  

64   2   100   750  

The number of records is based on the maximum expected number of data points in a DDSM training set (see Section 2.2.5). The number of features varies significantly between studies; however, 60 features tends to be towards the upper range, so the system can manage a maximum of 64 (the number of bits in a long integer data type). Frequently, the optimal CADx neural networks have only one or two hidden nodes (see Section 2.5.1). The number of weight vectors and the number of samples were chosen to provide a test with a large number of network calculations. In total this test calculates the output for 76,800,000 networks (1024 × 100 × 750), so it measures raw performance.

Despite all of the optimizations in the CPU implementation, the GPU implementation consistently outperforms it in calculating neural network output values. Figure 5-1 on page 42 shows almost a 17x speedup on raw performance (the time tests). The time measurements are in milliseconds: the GPU implementation takes 2.172 seconds to calculate the output, while the CPU implementation takes 36.578 seconds to perform the same calculation (36.578 / 2.172 ≈ 16.8).

5.3   The NeuralNetTrainer Tests

There is only a functional test for the NeuralNetTrainer; performance testing is performed as part of the GeneticSelector test in Section 5.4. The functional test uses the same dataset as the evaluator test but does not include weight vectors, since the trainer's job is to generate the weight vector. The functional test returns the classification accuracy of the top-performing weight vector using k = 0.5 as the cutoff.

The evolutionary training test used the following settings:

• Population Size: 50
• Generation Count: 100
• Number of Records: 128
• Number of Weight Sets: 32

The population size and generation count are from Porto, et al. (1995); see Section 2.5.1. The number of records and weight sets are based on the threading dimensionality (see Section 4.5.1): with the 32x1x16 block layout that the GPU implementation uses, 128 records by 32 weight sets will create eight blocks in a 4x2 block grid. The goal of this test is to confirm that the trainer converges, not to measure speed. Both trainers perform well; the GPU trainer is only slightly more accurate than the CPU trainer (96.875% versus 96.0938%).

5.4   The GeneticSelector Tests

5.4.1   Test Description

The GeneticSelector test combines both functional and performance testing. The test uses an XOR dataset with additional random "noise" features; the GeneticSelector must identify the two genuine features. The fitness function does not have a penalty for including noise fields. The performance test measures the total time spent in the different levels of the program.

The test settings are:

• Number of Records: 1024
• Number of Features: 64
• Evolutionary Population Size: 50
• Evolutionary Generation Count: 100
• Number of Bootstraps: 10
• Genetic Population: 50
• Genetic Population Generations: 20

All but the last three settings are previously explained. The settings for the genetic algorithm are from Campanini & Lanconelli (2006), who state that genetic algorithms that review approximately 450 networks can provide good results.

Setting the number of bootstrap samples is a balancing act between the desire for useful sampling results and total runtime. The NeuralNetEvaluator performance test provides metrics for the time required to perform neural network calculations, and setting the bootstrap value to 10 is reasonable: many studies use five or ten samples for k-fold cross-validation, and not only is leave-one-out bootstrapping similar to cross-validation, but the project's design can also support a decision to switch to cross-validation (see Section 3.3.4). The program will calculate the output for 10,240,000,000 neural networks using these settings. That is 133x the number of networks in the NeuralNetEvaluator performance test. If the ratios hold, the total estimated calculation time is 280 seconds (slightly less than five minutes) for the GPU and 4,877 seconds (about 80 minutes) for the CPU.

5.4.2   Functional Test Results

The GeneticSelector test does find the true features, one and three: every one of the top-performing networks contains the two true input features. However, the selector did not eliminate noise features well. In part, this is a result of the network training: a well-trained network may have zero weight values for all of the noise features. In other words, the network accepts noise as input but ignores it during processing.

Modifying the selector's algorithm to add a cost for each additional feature creates problems as well. A solution that contains many features, including the true features, may have a lower fitness value than a solution that contains only a few features, even if all of those features are noise. An experiment using a cost component in the algorithm found that the cost necessary to cause the selector to select only two input features varies with the number of input features available. Filtering out noise features is not only outside the scope of the project but may also suggest using a different algorithm for feature selection in true experiments. There is a similar problem with the number of hidden nodes: only two nodes are necessary for XOR. Again this may require re-examination of the algorithm.


Another odd quirk is that the selector results vary between the GPU and CPU implementations. However, this appears to be related to the random numbers generated: when the order of execution was switched, the GPU had consistent fitness values of 100 and none of the CPU values was exactly 100.

The GeneticSelector passes the test because every top-performing neural network it returns contains the two true features; the fact that the selector does not filter the noise features is not a factor, and the selector never returned a network that used all, or even a majority, of the features. The algorithm's idiosyncrasies serve to emphasize that algorithm development is a trial-and-error process. The ability to flag unexpected outcomes in the algorithms is another advantage of using predictable data for testing.

5.4.3   Performance Test Results

The actual result for the GPU is 340 seconds, about 20% more than predicted by the NeuralNetEvaluator performance test. This is not surprising given that the individual chunks of data moving to the device are smaller, which creates more overhead in data transfer.

The CPU results are much more interesting. The total CPU time is 2,165 seconds (about 36 minutes). This is less than half the estimate based on the NeuralNetEvaluator performance test. The GPU only has a speedup of about 6x, versus nearly 17x in the evaluator test, and the total runtime advantage is less than 4x. This is still a significant difference in absolute terms: the total runtime for the GPU is about 11.5 minutes, while the total runtime for the CPU is about 42 minutes.
While it is difficult to prove, it is reasonable to believe that the increased performance comes from the CPU being able to use its cache during the GeneticSelector test: this test repeatedly operates over the same set of data, whereas the NeuralNetEvaluator test operates over the data only once. A high cache hit ratio is one of the most important factors in CPU performance (see Section 2.4.2). The GPU implementation does not employ caching (CUDA does offer some caching of constant and texture memory, see Appendix C, and in the latest version caching of general memory, see Appendix A). Preliminary tests with very low settings actually had the CPU outperforming the GPU; in that case the CPU was probably able to cache the entire dataset.

Another area where the cache may have an impact is the difference in performance in the other parts of the algorithm. The CPU implementation has lower (better) times than the GPU version in all of the other sections of the program, even though these sections are common to both implementations. This is probably because parts of the cache are disturbed during the constant copying of data between the host and the device.

5.5   CPU Allocation

The screenshots in Figure 5-2 on the following page show the CPU usage while the application is running. Due to the computational intensity of the application, the CPU allocates execution to a single core; CPU usage on that core remains at 100% throughout the execution. The 50% shown in the Processes screen represents 100% usage on one core of a dual-core machine.

Figure 5-2 -- Windows Task Manager

Chapter 6 – Summary, Conclusion, Future Work, and Evaluation

6.1   Summary

6.1.1   Background

The goal of this project was to explore using CUDA to accelerate breast cancer CADx neural network classifier design. To achieve this goal the project presents an implementation of an algorithm that performs feature selection, network architecture selection, and network training using genetic and evolutionary computing techniques as well as leave-one-out bootstrap sampling. To provide a basis of comparison, there are two implementations that calculate neural network output: one uses the GPU and the other uses the CPU.

6.2   Conclusion

6.2.1   Overall Conclusion

This project demonstrates a role for CUDA in computationally intensive CADx. The results show a significant speedup over a CPU-only implementation. The CUDA implementation is intrinsically parallel and will therefore also gain performance automatically from future hardware upgrades (see Section 2.3.1). The CPU implementation requires an additional multithreading layer in order to have a chance of matching the GPU's performance and of benefiting from future processors; the multithreading component is not a trivial addition to the program, and there is no guarantee that adding it will yield the anticipated gains (see Section 2.5.3).

The sections below provide conclusions from the various phases of the project.

6.2.2   Design

One area this project tackles that is not yet commonly explored in research is integrating CUDA into an overall domain application such as CADx. Design becomes very important in this context. CUDA development requires that all memory management for the device occurs on the host; in addition, the host is responsible for orchestrating the movement of input and output data between the device and the host.

Programs normally have an Array of Structures (AoS) memory layout. Many of the GPU-based neural network programs in the literature, such as Steinkraus, et al. (2005) and Jang, et al. (2008), maintain this layout by using the GPU to perform matrix multiplication as a parallel operation on only one network at a time. CADx neural networks are relatively small; they do not have the large numbers of hidden nodes needed to make this approach feasible.
There is no ability to scale when performing small matrix multiplication calculations; scale in CADx neural networks comes from calculating multiple network outputs simultaneously, which requires a Structure of Arrays (SoA) layout (see Section 3.3.3). When CUDA is part of a higher-level application, transformations between the AoS data layout in the rest of the program and the SoA data layout in the part of the program running on the GPU must occur at some point. Without a decoupled design that clearly delimits responsibilities, the


demarcation points where the transformations occur become muddled. Changes in the surrounding program will cascade, requiring changes to the GPU program to adjust the data layout. Employing a decoupled design removes the need to change the GPU program when changes are made in a modular area outside of where the transformation takes place.

This project manages decoupling in two ways: the interaction with CUDA (or SSE) only occurs in the appropriate NeuralNetEvaluator class, and the data layout requirements are managed by the SamplingData, TrainingSet, and TestingSet classes. This decoupling allows changes to occur throughout the program without a need to modify the CUDA program.

6.2.3   Implementation

Using either SSE or CUDA presents difficulties. SSE has specific memory alignment requirements that prevent the use of traditional C++ approaches to memory management, and its requirement for low-level instruction calls creates less readable programs. CUDA is a C API that must link with native C++. Therefore a decision to use CUDA or SSE is a decision to use a fair amount of C++.

The implementation did demonstrate the general-purpose applicability of CUDA. The kernel function looks like a typical C function, and the CUDA functions to allocate, free, and copy memory look like their well-known C counterparts. A drawback to CUDA is that the GPU program runs on the graphics card, which makes debugging very arduous: it is not possible to simply print out values or step through the execution. CUDA does offer an emulation mode that allows debugging; however, this mode is very limited. Patience and program simplicity are often the only programming tools available.14
14 NVIDIA has recently released a new development tool called NSight which claims to alleviate many of the programming difficulties. However, NSight requires Windows Vista or Windows 7; the workstation used for developing this project runs Windows XP, so this project does not review it.

In this CADx application, the CUDA kernel is not a particularly large part of the code base. This is probably quite typical: most of the effort in CUDA development involves orchestrating the interaction between the host and the device (see Section 4.2.1 and Section 4.5).

6.2.4   Testing and Results

Based on when the technologies were released and their placement in the product line, a CPU roughly comparable to the GPU in this test may have up to four hardware threads. If all the threads are in use, the CPU may match the GPU's 4x speedup. However, there are other factors to consider. Obtaining comparable CPU performance will require the addition of multithreading, which adds another level of complexity on top of SSE, and the additional threads may not provide a linear speedup. It is likely that much of the CPU's performance came from caching; CUDA's scaling appeared to be more predictable.

Another potential reduction in performance comes from other programs running on the workstation. The CADx program creates near-full utilization in a hardware thread because the program is continuously performing calculations. There are either one or two hardware threads per CPU core. In a single-threaded program the CPU allocates all of the work to a single hardware thread; other programs running on the workstation use the other available thread(s). If the application is multithreaded and uses all of the available hardware threads then either the other applications will freeze or the performance of the CADx program will decline as the CPU must perform task switching. The only way to avoid this problem is to continuously monitor CPU usage by other programs on the machine.

CUDA intrinsically scales: the more threads, the better the performance. This is the opposite of the CPU. With CUDA there is no need to monitor the available usage. Performance gains from hardware upgrades are also immediate; there is no need to modify the program or purchase a new workstation, as more powerful cards will automatically schedule more warps to run simultaneously.

6.3   Evaluation

My overall assessment of the project is that it was a success. The project provides a window into the performance considerations in the context of CADx. However, the path from beginning to end was very different from my initial expectations. I thought that .NET would be integral to my project; that turned out to be impractical.

I also did not anticipate the importance of design; it turned out to be crucial. As the code base grew during implementation the project required constant redesign, without which the project would have stalled due to complexity. Decoupling along the volatile dimensions of possible algorithm implementation and execution location (host or device) was critical in maintaining stability as the project progressed.

Another characteristic of the project is that the general phases overlapped and mutually influenced each other. The last third of the design phase occurred during the first two thirds of the implementation phase, and practical problems in implementation would lead to modifications of the design. For example, the package-based approach described in Section 3.3.2 was abandoned about half way through the implementation; the change allowed me to continue developing the program without constantly editing multiple sections.

Testing led to changes in implementation. The two largest modifications during testing were the replacement of the original evolutionary training algorithm, which did not converge, and the decision to use a binary XOR dataset. Because of the flexible design, substituting a new trainer did not have a cascading effect on the rest of the program.
The decision to use the binary XOR dataset occurred during the overlap of the late stage of implementation and the early stage of testing; at that point the need for a predictable and well-understood test case was clear.

While overall I do not have any regrets about the project, there were tradeoffs. The decision to invest significant effort in the CPU implementation crowded out the exploration of some areas of CUDA optimization (see Appendix C). I believe this investment was necessary to provide a true benchmark of CUDA performance, but there were quite a few other avenues I would have liked to explore. There is not much that I would do differently, but I would have liked to have done more.

6.4   Future Work

Based on the results of this project, the decision to use CUDA in CADx implementations is largely a decision about whether or not to use C++. If the decision is made to use C++ then the benefit of using CUDA over SSE is clear. However, using a managed language such as C# or Java has considerable benefits over C++. Future work on the performance of a managed CADx implementation as compared to CUDA would be highly beneficial. This is especially relevant with Microsoft's release of the Task Parallel Library (Microsoft Corporation 2010a): if this library can manage CPU threads efficiently and scale as the number of cores increases, then the benefits of using a higher-level language may outweigh the better performance available from CUDA.

This project did not cover coordinating GPU and CPU activity. The total runtime performance test revealed that the program spent roughly the same amount of time calculating the neural network output on the GPU as performing the other work on the CPU. The current implementation does not utilize this time; the program blocks while the GPU is busy. Modifying the program to allow the CPU to continue executing while the GPU performs processing may almost halve the total runtime.

Appendix C contains a description of an attempt to optimize the CUDA kernel. Based on the results, the kernel presented in the project appears to perform close to the best achievable. However, there are still a few optimizations that may be worthwhile; for example, using the texture memory cache may create some performance improvements.

Ultimately, the goal of this project is to accelerate processing of a domain-specific problem. The design and architecture of the system are built around the needs of CADx. To see this project used to accelerate a true CADx application, even as it is, would be very rewarding.


Bibliography

American Cancer Society 2009, Breast Cancer Facts & Figures 2009-2010, American Cancer Society, Atlanta, GA.

American College of Radiology 2009, The American College of Radiology BI-RADS ATLAS and MQSA: Frequently Asked Questions, viewed 29 August 2010.

Barney, B 2010, Introduction to Parallel Computing, viewed 17 June 2010.

Beck, K & Cunningham, W 1989, 'A Laboratory For Teaching Object Oriented Thinking', OOPSLA '89: Conference Proceedings on Object-Oriented Programming Systems, Languages and Applications, ACM.

Benkrid, K 2008, 'High Performance Reconfigurable Computing: From Applications to Hardware', IAENG International Journal of Computer Science, vol 35:1, IJCS_35_1_04.

Bevilacqua, A, Campanini, R & Lanconelli, N 2001, 'Optimization of a Distributed Genetic Algorithm for the Detection of Microcalcifications', International Journal of Modern Physics, vol 12, no. 1, pp. 55-70.

Bilhanan, A 2004, 'High Level Synthesis Of An Image Processing Algorithm For Cancer Detection', MSc Thesis, Department of Computer Science and , University of South Florida, Florida, USA.

Boost Project 2010, Boost C++ Libraries.

Boujelben, A, Chaabani, AC, Tmar, H & Abid, M 2009, 'Feature Extraction from Contours Shape for Tumor Analyzing in Mammographic Images', Digital Image Computing: Techniques and Applications, Conference Publishing Services, Melbourne, Australia.

Campanini, R & Lanconelli, N 2006, 'Chapter 4: Genetic Algorithms in Mammography', in Recent Advances In Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.
Chai, Z, Sun, J, Cai, R & Xu, W 2009, 'Implementing Quantum-behaved Particle Swarm Optimization Algorithm in FPGA for Embedded Real-time Applications', 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology, pp. 886-890.

Che, S, Li, J, Sheaffer, JW, Skadron, K & Lach, J 2008, 'Accelerating Compute-Intensive Applications with GPUs and FPGAs', Proceedings of the 2008 Symposium on Application Specific Processors, pp. 101-107.

D'Orsi, CJ, Bassett, LW & Berg, WA 2003, Breast Imaging Reporting and Data System: ACR BI-RADS-Mammography (ed 4), American College of Radiology, Reston, VA.

Fumbeya  Luis  Marungo   MSc  Advanced  Information  Systems  project  report      

   

Page  53  of  119   September,  2010  

Duncan,  R  1990,  'A  Survey  of  Parallel  Computer  Architectures',  Computer,  vol  23,  no.  2,  pp.  5-­‐16.   Efron,  B  &  Tibshirani,  RJ  1998,  An  Introduction  to  the  Bootstrap,  CRC  Press  LLC,  Boca  Raton,  Florida.   Feist,  T  2009,  Following  the  road  from  ASIC  to  FPGA,  viewed  13  December  2009,   .   Flynn,  MJ  1972,  'Some  Computer  Organizations  and  Their  Effectiveness',  IEEE  Transactions  on   Computers,  pp.  948-­‐960.   Fogel,  DB,  Wasson  III,  EC  &  Boughton,  EM  1995,  'Evolving  Neural  Networks  for  Detecting  Breast   Cancer',  Cancer  Letters,  pp.  49-­‐53.   Frank,  A  &  Asuncion,  A  2010,  UCI  Machine  Learning  Repository.  Irvine,  CA:  University  of  California,   School  of  Information  and  Computer  Science,  viewed  13  June  2010,  .   Geer,  D  2005,  'IEEE:  Chip  Makers  Turn  to  Multicore  Processors',  Computer,  May  2005,  pp.  11-­‐13.   Gilbert,  F,  Astley,  S,  Gillan,  M,  Agbaje,  O,  Wallis,  M,  James,  J,  Boggis,  C  &  Duffy,  S  2008,  'Single   reading  with  computer-­‐aided  detection  for  screening  mammography',  New  England  Journal  of   Medicine,  no.  359,  pp.  1675  -­‐  84.   Giles,  M  2009,  Numerically  Intensive  Computing  In  Finance  -­‐-­‐  Lecture  Notes,  viewed  1  May  2010,   .   Graham,  P  &  Nelson,  B  1996,  'Genetic  algorithms  in  software  and  in  hardware-­‐-­‐-­‐A  performance   analysis  of  workstations  and  custom  computing  machine  implementations',  IEEE  Symposium  on   FPGAs  for  Custom  Computing  Machines,  pp.  216-­‐225.   Heath,  M,  Bowyer,  K,  Kopans,  D,  Kegelmeyer,  WP,  Moore,  R,  Chang,  K  &  MunishKumaran,  S  1998,   'Current  status  of  the  Digital  Database  for  Screening  Mammography',  Digital  Mammography,  pp.   457-­‐460.   
Heath,  M,  Bowyer,  K,  Kopans,  D,  Moore,  R  &  Kegelmeyer,  WP  2001,  'The  Digital  Database  for   Screening  Mammography',  Proceedings  of  the  Fifth  International  Workshop  on  Digital  Mammograpy,   pp.  212-­‐218.   Intel  Corporation  2000,  Approximate  Math  Library  for  Intel  Streaming  SIMD  Extensions  Release  2.0,   viewed  2010  June  17,  .   Intel  Corporation  2009,  Vector  Math  Library  (VML)  Performance  and  Accuracy  Data,  viewed  30  April   2010,  .   Intel  Corporation  2010,  Intel  AVX,  viewed  23  August  2010,  .   Jang,  H,  Park,  A  &  Jung,  K  2008,  'Neural  Network  Implementation  using  CUDA  and  OpenMP',  Digital   Image  Computing:  Techniques  and  Applications,  pp.  155-­‐161.  


Jiang,  Y,  Nishikawa,  R,  Schmidt,  R,  Metz,  CE,  Giger,  ML  &  Doi,  K  1999,  'Improving  breast  cancer   diagnosis  with  computer-­‐aided  diagnosis',  Academic  Radiology,  vol  6,  no.  1,  pp.  22-­‐33.   Jiang,  W  &  Simon,  R  2007,  'A  comparison  of  bootstrap  methods  and  an  adjusted  bootstrap  approach   for  estimating  the  prediction  error  in  microarray  classification',  Statistics  in  Medicine,  no.  26(29),  pp.   5320-­‐5334.   Kirk,  D  &  Hwu,  W  2008,  'Chapter  1:  Introduction',  in  Programming  Massively  Parallel  Processors,   Draft,  viewed  14  December  2009,  .   Kohavi,  R  1995,  'A  Study  of  Cross-­‐Validation  and  Bootstrap  for  Accuracy  Estimation  and  Model   Selection',  Proceedings  of  the  14th  international  conference  on  artificial  intelligence  (IJCAI)  ,  pp.   1137-­‐1143.   Land,  W,  McKee,  DW,  Anderson,  FR,  Masters,  T,  Lo,  JY,  Embrechts,  M  &  Heine,  J  2006,  'Chapter  10:   Using  Computational  Intelligence  For  Computer-­‐Aided  Diagnosis  Of  Screen-­‐Film  Mammograms',  in   Recent  Advances  In  Breast  Imaging,  Mammography,  and  Computer-­‐Aided  Diagnosis  of  Breast   Cancer,  The  Society  of  Photo-­‐Optical  Instrumentation  Engineers,  Bellingham,  Washington.   Larman,  C  2002,  Applying  UML  and  Pattern:  An  Introduction  to  Object-­‐Oriented  Analysis  and  Designt   and  the  Unified  Process,  2nd  Ed.,  Prentice-­‐Hall,  Inc.,  Upper  Saddle  River,  NJ.   Lewis,  TE  &  Magoulas,  GD  2009,  'Strategies  to  Minimise  the  Total  Run  Time  of  Cyclic  Graph  Based   Genetic  Programming  with  GPUs',  Proceedings  of  the  11th  Annual  Conference  on  Genetic  and   Evolutionary  Computation,  Association  for  Computing  Machinery,  Montreal,  Québec,  Canada.   Lindholm,  E,  Nickolls,  J,  Oberman,  S  &  Montrym,  J  2008,  'NVIDIA  Tesla:  A  Unified  Graphics  and   Computing  Architecture',  March/April  2008,  pp.  39-­‐55.   
Lo, JY, Bilska-Wolak, AO, Baker, JA, Tourassi, GD, Floyd, CE & Markey, MK 2006, 'Chapter 27: Computer-Aided Diagnosis in Breast Imaging: Where Do We Go after Detection?', in Recent Advances in Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.

Mangasarian, OL, Street, WN & Wolberg, WH 1995, 'Breast Cancer Diagnosis and Prognosis via Linear Programming', Operations Research, vol 43, no. 4, pp. 570-577.

Marowka, A 2007, 'Parallel Computing On Any Desktop', Communications of the ACM, vol 50, no. 9, pp. 75-78.

Marowka, A 2009, 'Performance Study of the First Three Intel Multicore Processors', Scalable Computing: Practice and Experience, vol 10, no. 4, pp. 429-41.

  Using  a  CUDA-­‐enabled  Graphics  Card  to  Accelerate  Neural  Network  Design  for  Breast  Cancer  Computer-­‐aided  Diagnosis  

Fumbeya  Luis  Marungo   MSc  Advanced  Information  Systems  project  report      

   


Marungo, F 2010, 'A Bootstrap Linear Regression of Temperatures in the United States', Coursework in Computational Intelligence and Visualisation, Department of Computer Science, Birkbeck, University of London, London, UK.

Microsoft Corporation 2010a, MSDN -- Argument Passing and Naming Conventions, viewed 6 September 2010.

Microsoft Corporation 2010b, MSDN -- Double Thunking, viewed 15 August 2010.

Microsoft Corporation 2010c, MSDN -- MMX, SSE, and SSE2 Intrinsics, viewed 28 April 2010.

Microsoft Corporation 2010d, MSDN -- Performance Considerations for Interop, viewed 15 August 2010.

Microsoft Corporation 2010e, MSDN -- Task Parallel Library, viewed 6 September 2010.

Milner, JJ & Grandison, AJ 2008, 'A Fast, Streaming SIMD Extensions 2, Logistic Squashing Function', Neural Computation, pp. 2967-72.

Negnevitsky, M 2005, Artificial Intelligence: A Guide to Intelligent Systems, 2nd edn, Pearson Education Limited, Essex, England.

NVIDIA 2010, CUDA Toolkit 3.0, viewed 13 August 2010.

NVIDIA Corporation 2008, 'Technical Brief: NVIDIA GeForce GTX 200 GPU Architectural Overview', Technical Report TB-04044-001_v01.

NVIDIA Corporation 2010a, NVIDIA CUDA C Programming Best Practices Guide Version 3.0, viewed 17 June 2010.

NVIDIA Corporation 2010b, NVIDIA CUDA Programming Guide Version 3.0, viewed 17 June 2010.

NVIDIA Corporation n.d., CUDA and Tesla for Breast Cancer detection and treatment, viewed 13 December 2009.

Oliveira, J, Gueld, M, Araujo, A, Ott, B & Deserno, TM n.d., Towards a Standard Reference Database for Computer-aided Mammography, viewed 28 April 2010.

Pande, V, Stanford University 2010, FAQ-NVIDIA-GPU3, viewed 31 July 2010.
Porto, VW, Fogel, DB & Fogel, LJ 1995, 'Alternative Neural Network Training Methods', IEEE Expert: Intelligent Systems and Their Applications, pp. 16-22.


   


Rangayyan, RM, Paranjape, RB, Desautels, JEL & Bryant, H 2006, 'Chapter 3: An Indexed Atlas of Digital Mammograms for Computer-Aided Diagnosis of Breast Cancer', in Recent Advances in Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.

Richter, J 2003, .NET Column: The CLR's Thread Pool, viewed 27 April 2010.

Rizzo, BD 2010, New NVIDIA GeForce GTX 480 GPU Cranks Up PC Gaming to New Heights, viewed 17 June 2010.

Sargent, D 2001, 'Comparison of artificial neural networks with other statistical approaches -- results from medical data sets', Cancer, vol 91, no. 8, pp. 1636-1642.

Schwarzer, G, Vach, W & Schumacher, M 2000, 'On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology', Statistics in Medicine, vol 19, pp. 541-561.

Sickles, E 1991, 'Periodic Mammographic Follow-up of Probably Benign Lesions: Results in 3,184 Consecutive Cases', Radiology, pp. 463-468.

Sickles, E 1999, 'Probably Benign Breast Lesions: When Should Follow-up Be Recommended and What Is the Optimal Follow-up Protocol?', Radiology, October 1999, pp. 11-14.

Sonka, M & Fitzpatrick, JM (eds.) 2009, Handbook of Medical Imaging: Medical Image Processing and Analysis, SPIE-International Society for Optical Engineering, Bellingham, WA.

Steinkraus, D, Buck, I & Simard, PY 2005, 'Using GPUs for Machine Learning Algorithms', Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition.

Stoner, M 2009, Integrating Fast Math Libraries for the Intel Pentium 4 Processor, viewed 21 June 2010.
Street, WN, Wolberg, WH & Mangasarian, OL 1993, 'Nuclear feature extraction for breast tumor diagnosis', International Symposium on Electronic Imaging: Science and Technology, IS&T/SPIE, San Jose, CA.

Suckling, J et al. 1994, 'The Mammographic Image Analysis Society Digital Mammogram Database', Excerpta Medica International Congress Series 1069, pp. 375-378.

Suri, JS, Reiser, I, Chandrasekhar, R, Wu, DH, Lanconelli, N, Campanini, R, Roffilli, M, Wong, K, Chang, R, Kshirsagar, A, Guo, Y, Sun, Y, Sivaramakrishna, R, Wirth, M, Tot, T, Cao, A, Acha, B, Serrano, C, Desautels, JEL & Rangayyan, RM 2006, 'Chapter 28: The Current Status and Likely Future of Breast Imaging CAD', in Recent Advances in Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.

Sutter, H 2005, A Fundamental Turn Toward Concurrency in Software, viewed 29 June 2010.


Sutton, MA 2009, 'Chapter 6: Image Segmentation by Fuzzy Clustering: Methods and Issues', in IN Bankman (ed.), Handbook of Medical Image Processing and Analysis, 2nd edn, Elsevier Inc., London, UK.

VanderSpek, J 2008, 'The CUDA Compiler Driver', NVIDIA Corporation.

Volkov, V & Demmel, JW 2008, 'Benchmarking GPUs to Tune Dense Linear Algebra', Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, Austin, Texas, Article No. 31.

Wald, I 2004, 'Realtime Ray Tracing and Interactive Global Illumination', PhD Thesis, Computer Graphics Group, Saarland University, Saarbrücken, Germany.

Wikipedia 2010a, C++/CLI, viewed 6 September 2010.

Wikipedia 2010b, Flynn's taxonomy, viewed 19 June 2010.

static bool ChromosomeGreater(const GeneticSelector::Chromosome &x, const GeneticSelector::Chromosome &y)
{
    return x.Fitness > y.Fitness;
}

static SampDatPtr createSmallSet(GeneticSelector::Chromosome &, SamplingData &, int trainMult);

// public members
// construction and destruction
GeneticSelector::GeneticSelector(int popSize, int gen, int elite
    , SamplingData &data, NeuralNetTrainer::Factory &fact
    , float azCutOff)
    : PopulationSize(popSize), Generations(gen), EliteTopN(elite)
    , Data(data), factory(fact), AzCutOff(azCutOff)
    , Population(new Chromosome[PopulationSize])
    , pairs(new Chromosome[PopulationSize - EliteTopN][2])
{
}

GeneticSelector::~GeneticSelector(void)
{
    delete[] Population;
    delete[] pairs;
}

void GeneticSelector::Execute()
{
    TotalTrainerEvalTime = TotalVerificationEvalTime = TotalTrainerTime = 0;
    TotalTrainerCalcFitness = TotalTrainerSortPopulatoin = TotalTrainerEvaluateNNsTime = 0;
    TotalTrainerCreateChildrenTime = TotalTrainerCreateParentsTime = 0;
    initPop();
    calcFitness(true);
    // sort in declining order of fitness
    sort(Population, &Population[PopulationSize], ChromosomeGreater);
    for(int i = 1; i < Generations; ++i)
    {
        generatePairs();
        executeCrossover();
        calcFitness(false);
        sort(Population, &Population[PopulationSize], ChromosomeGreater);
    }
}
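The generational loop in Execute() breeds non-elite chromosomes from pairs chosen by generatePairs, which performs fitness-proportional (roulette-wheel) selection over a cumulative-fitness array. A minimal standalone sketch of that selection step, assuming non-negative fitness values; the function name and signature are illustrative and not taken from the project:

```cpp
#include <cassert>
#include <vector>

// Roulette-wheel (fitness-proportional) selection: build a cumulative-fitness
// array, scale a uniform draw u in [0,1) by the total fitness, and return the
// first index whose cumulative sum covers the scaled draw.
int rouletteSelect(const std::vector<float> &fitness, float u01)
{
    std::vector<float> cumulative(fitness.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < fitness.size(); ++i)
        cumulative[i] = (sum += fitness[i]);   // running total

    const float target = u01 * sum;
    for (std::size_t i = 0; i < fitness.size(); ++i)
        if (target < cumulative[i])            // first slot covering the draw
            return (int)i;
    return (int)fitness.size() - 1;            // guard against rounding at u01 ~ 1
}
```

With fitnesses {1, 3}, index 1 is returned for roughly three quarters of the uniform draws, which is the proportional behaviour the selector relies on.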


// private members
void GeneticSelector::initPop()
{
    unsigned long mask = ~0ul;
    uniform_int<> hiddenGeneDist(1, MaxHiddenNodes);
    variate_generator<mt19937 &, uniform_int<> > hiddenGeneGen(Random, hiddenGeneDist);
    mask >>= 64 - Data.Trainingset.FieldDim;
    for(int i = 0; i < PopulationSize; ++i)
    {
        Chromosome &c = Population[i];
        c.FeatureGenes = Random(); // this will generate a 32 bit number
        c.FeatureGenes rand1) idx1 += iAdd1--;
    }
    --idx0; // ... and remove later.
    --idx1;
    // crossover population member mating with itself
    // go back
    if(idx0 == idx1) { --i; continue; }
    // set up the pairs.
    pairs[i][0] = Population[idx0];
    pairs[i][1] = Population[idx1];
    }
    delete[] cumulative;
}

static int dbg;

void GeneticSelector::executeCrossover()
{
    const int pairsLen = PopulationSize - EliteTopN;

    Chromosome *p = &Population[EliteTopN];
    for(int i = 0; i < pairsLen; ++i, ++p)
    {
        unsigned long mask = 1ul;
        unsigned long &u = p->FeatureGenes;
        u = 0ul;
        // randomly cross over each bit
        for(int j = 0; j < Data.Trainingset.FieldDim; ++j)
        {
            u |= mask & pairs[i][Random() & 1].FeatureGenes;
            mask Trainingset, samp->Trainingset.FieldDim, Population[i].HiddenLayerGene);
        TotalTrainerTime -= GetTickCount();
        tnr->TrainNNs();
        TotalTrainerTime += GetTickCount();
        EvolutionaryTrainer *et = dynamic_cast<EvolutionaryTrainer *>(tnr.get());
        TotalTrainerEvalTime += et->NNEvalTime;
        TotalTrainerCalcFitness += et->CalcFitnessTime;
        TotalTrainerSortPopulatoin += et->SortPopulationTime;
        TotalTrainerEvaluateNNsTime += et->EvaluateNNsTime;
        TotalTrainerCreateChildrenTime += et->CreateChildrenTime;
        TotalTrainerCreateParentsTime += et->CreateParentsTime;
        NeuralNetTrainer::WgtsPtr wgts = tnr->GetWeights();
        // then use the top weight vector from each sample to evaluate Az over the test
        // set of the sample
        // this can be done async with the addition of callback functionality to the
        // library.
        for(int j = 0; j < sampleSize; ++j)
        {
            sampleAz[j] = calcAz(wgts[j].get(), samp->Testingset, i, j);
        }
        sort(sampleAz, &sampleAz[sampleSize]);
        chrom[i].Fitness = 100.0f*sampleAz[cutOffIdx];
    }
    delete[] sampleAz;
}

float GeneticSelector::calcAz(float *wgts, const TestingSet &testSamp, int chromIdx, int sampIdx)
{
    Chromosome &chrom = Population[chromIdx];
    const int testCnt = testSamp.RecordCnts[sampIdx];
    int totalTrue = 0, totalNeg = 0;
    OptGnd *optGnd = new OptGnd[testCnt];
    processTestNN(testSamp, wgts, chromIdx, sampIdx, optGnd, totalTrue, totalNeg);
    struct {bool operator()(OptGnd &x, OptGnd &y){return x.opt > y.opt;}} Comp;
    sort(&optGnd[0], &optGnd[testCnt], Comp);
    int tt = 0;
    float tDelta = 1.0f/float(totalTrue);
    float fDelta = 1.0f/float(totalNeg);
    float retVal = 0.0f;
    // calculate the area under the curve.
    for(int i = 0; i < testCnt; ++i)
    {
        if(optGnd[i].gnd) ++tt;
        else retVal += fDelta*tDelta*tt;
    }
    delete[] optGnd;
    return retVal;
}
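calcAz estimates the area under the ROC curve (Az) by sorting records on descending classifier output and, for each negative record, accumulating the fraction of positives already ranked above it, scaled by 1/(#positives * #negatives). The same computation in a self-contained form; `rocArea` and its pair-based signature are illustrative, not the project's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Area under the ROC curve from (score, label) pairs, label 1 = positive.
// Equivalent to the loop in calcAz: sort by descending score, then every
// negative adds (positives seen so far) / (totalTrue * totalNeg).
float rocArea(std::vector<std::pair<float, int> > samples)
{
    std::sort(samples.begin(), samples.end(),
              [](const std::pair<float, int> &a, const std::pair<float, int> &b)
              { return a.first > b.first; });

    int totalTrue = 0, totalNeg = 0;
    for (std::size_t i = 0; i < samples.size(); ++i)
        (samples[i].second ? totalTrue : totalNeg)++;

    float az = 0.0f;
    int tt = 0; // positives seen so far while walking down the ranking
    for (std::size_t i = 0; i < samples.size(); ++i)
    {
        if (samples[i].second)
            ++tt;
        else
            az += (float)tt / ((float)totalTrue * (float)totalNeg);
    }
    return az;
}
```

For the four records (0.9, pos), (0.8, neg), (0.7, pos), (0.3, neg), three of the four positive/negative pairs are correctly ordered, so the area is 0.75.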

//// this is designed to be a very fast calculation of the
//// test data (unselected items of the bootstrap)
//// the test data is only evaluated on the top network
//// thus it can occur on the CPU while the evolutionary
//// training is occurring over multiple generations and multiple
//// populations on the GPU.
void GeneticSelector::processTestNN(const TestingSet &testSamp, float * const wgts, int chromIdx, int sampIdx, OptGnd * const optGnd, int &totalTrue, int &totalNeg)
{
    Chromosome &c = Population[chromIdx];
    const int &hiddenNodeCnt = c.HiddenLayerGene;
    int recCnt = testSamp.RecordCnts[sampIdx];
    float *output = (float *)_aligned_malloc(sizeof(float)*testSamp.RecordDims[sampIdx], 64);
    TotalVerificationEvalTime -= GetTickCount();
    SseGlobal::EvaluateNN(testSamp.TestSets[sampIdx]
        , testSamp.RecordDims[sampIdx]
        , testSamp.FieldDim
        , 1
        , hiddenNodeCnt
        , wgts
        , output
        , recCnt);
    TotalVerificationEvalTime += GetTickCount();
    // set up optToGnd
    // calculate total trues and total false
    float *gnd = testSamp.GroundTruth[sampIdx];
    totalTrue = 0;
    totalNeg = 0;
    for(int i = 0; i < recCnt; ++i)
    {
        if(gnd[i] > 0.5f) { ++totalTrue; optGnd[i].gnd = 1; }
        else { ++totalNeg; optGnd[i].gnd = 0; }
        optGnd[i].opt = output[i];
    }
    _aligned_free(output);
}

vector<int> GeneticSelector::Chromosome::GetInputFieldIndexes()
{
    vector<int> retVal;
    unsigned long gene = FeatureGenes;
    for(int i = 0; i < 64; ++i)
    {
        if(gene & 1ul) retVal.push_back(i);
        gene >>= 1;
    }
    return retVal;
}
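GetInputFieldIndexes walks the 64-bit feature-gene bitmask and emits the index of every set bit; each set bit selects one input field for the network. A self-contained equivalent (`maskToIndexes` is an illustrative name, not the project's):

```cpp
#include <cassert>
#include <vector>

// Decode a feature-gene bitmask into the list of selected field indexes:
// bit i set means input field i is fed to the network.
std::vector<int> maskToIndexes(unsigned long long gene)
{
    std::vector<int> idxs;
    for (int i = 0; i < 64; ++i)
    {
        if (gene & 1ull)
            idxs.push_back(i);  // bit i is set -> field i selected
        gene >>= 1;             // shift the next bit into position
    }
    return idxs;
}
```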

   


SampDatPtr createSmallSet(GeneticSelector::Chromosome &c, SamplingData &s, int trainMult)
{
    std::vector<int> idxs = c.GetInputFieldIndexes();
    int fldCnt = idxs.size();
    int setCnt = s.Testingset.TestsetDim;
    // will use the CPU for test ANN evaluation.
    TestingSet *test = new TestingSet(setCnt, fldCnt, s.Testingset.RecordCnts, 4);
    // will use whichever evaluator provided.
    TrainingSet *train = new TrainingSet(setCnt, fldCnt, s.Trainingset.RecordCnt, trainMult);
    // training data is rectangular, testing data is jagged.
    int trainRecCnt = s.Trainingset.RecordCnt;
    int trainRecDim = s.Trainingset.RecordDim;
    for(int i = 0; i < setCnt; ++i)
    {
        int testRecCnt = s.Testingset.RecordCnts[i];
        int testRecDim = s.Testingset.RecordDims[i];
        float* trainSet = s.Trainingset.Samples[i];
        float* trainGnd = s.Trainingset.GroundTruth[i];
        float* testSet = s.Testingset.TestSets[i];
        float* testGnd = s.Testingset.GroundTruth[i];
        float *beg, *end;
        float *dest;
        for(int j = 0; j < fldCnt; ++j)
        {
            beg = &trainSet[idxs[j]*trainRecDim];
            end = &beg[trainRecCnt];
            dest = &train->Samples[i][j*train->RecordDim];
            copy(beg, end, dest);
            beg = &testSet[idxs[j]*testRecDim];
            end = &beg[testRecCnt];
            dest = &test->TestSets[i][j*test->RecordDims[i]];
            copy(beg, end, dest);
        }
        beg = &trainGnd[0];
        end = &beg[trainRecCnt];
        dest = train->GroundTruth[i];
        copy(beg, end, dest);
        beg = &testGnd[0];
        end = &beg[testRecCnt];
        dest = test->GroundTruth[i];
        copy(beg, end, dest);
    }
    return SampDatPtr(new SamplingData(*train, *test));
}


   


Global.h

#pragma once
#ifndef _GLOBAL_H
#define _GLOBAL_H
#include <boost/random.hpp>
namespace PROJ_MarungoF { namespace Lib {
boost::mt19937 Random;
} }

// this is the data for the evaluation call
typedef struct {
    // All fixed length arrays are 0-terminating, thus can only contain a max of 1023 elements
    size_t WeightVectors[1024]; // [WeightSetDim][WeightVectorDim][WeightEleDim]
    size_t Output[1024];        // [WeightSetDim][WeightVectorDim][RecordDim]
    size_t Dataset[1024];       // [WeightSetDim] points to the matching Dataset
    int WeightSetDim;           // # of Weightsets, max val 1023
    int WeightVectorDim;        // Population size of evolutionary algo
    int WeightEleDim;           // == (EvalFldDim + 2) * HiddenNodeCnt + 1
    int WeightOutputOffset;     // == (# of EvalFlds + 1) * # of hidden nodes
    int HiddenNodeCnt;
} NNEvaluationData;
#endif


   

Global.cpp

#include "Global.h"

using namespace PROJ_MarungoF::Lib;
using namespace boost;

static int init();
static int dummy = init();

static int init()
{
    Random = mt19937(0);
    return 0;
}
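Global.cpp uses the `static int dummy = init();` idiom to seed the shared random generator during static initialization, before main() runs: evaluating dummy's initializer forces init() to execute at startup. A minimal illustration of the idiom, with a plain counter standing in for the Boost RNG:

```cpp
#include <cassert>

// Zero-initialized before any dynamic initialization runs.
static int initialised = 0;

// The function whose side effect we want at startup,
// e.g. seeding a global RNG deterministically.
static int init()
{
    initialised = 1;
    return 0;
}

// Evaluating this initializer forces init() to run during
// static (dynamic) initialization, before main() is entered.
static int dummy = init();
```

One caveat with this pattern: initialization order across translation units is unspecified, so code in other files must not rely on the side effect during their own static initialization.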


   

NeuralNetEvaluator.h

#pragma once
#ifndef _NEURAL_NET_EVALUATOR_H
#define _NEURAL_NET_EVALUATOR_H
#include <boost/shared_ptr.hpp>
namespace PROJ_MarungoF { namespace Lib {
class TrainingSet;
class NeuralNetEvaluator
{
public:
    struct WeightData
    {
        float **WeightVectors;
        float **Output;
        int *DatasetMapping;
        int WeightSetDim;
        int WeightVectorDim;
        int WeightEleDim;
        int HiddenNodeCnt;
    };
    typedef boost::shared_ptr<NeuralNetEvaluator> Ptr;
    // factory class
    class Factory
    {
    public:
        Factory() {}
        virtual Ptr GetEvaluator() = 0;
    private:
        Factory &operator=(const Factory &);
        Factory(Factory &);
    };
    virtual int GetRecordDimMultiple() = 0;
    virtual void Evaluate(WeightData &) = 0;
    virtual void SetDataset(TrainingSet &) = 0;
    virtual void ReleaseDataset() = 0;
    NeuralNetEvaluator(void);
    virtual ~NeuralNetEvaluator(void);
};
} }
#endif


   

NeuralNetEvaluator.cpp

#include "NeuralNetEvaluator.h"

using namespace PROJ_MarungoF::Lib;

NeuralNetEvaluator::NeuralNetEvaluator(void)
{
}

NeuralNetEvaluator::~NeuralNetEvaluator(void)
{
}


NeuralNetTrainer.h

#pragma once
#ifndef _NEURAL_NET_TRAINER_H
#define _NEURAL_NET_TRAINER_H
#include <boost/shared_ptr.hpp>
#include <boost/shared_array.hpp>
namespace PROJ_MarungoF { namespace Lib {
class TrainingSet;
class NeuralNetTrainer
{
public:
    // member types
    typedef boost::shared_ptr<NeuralNetTrainer> Ptr;
    typedef boost::shared_array<boost::shared_array<float> > WgtsPtr; // [SampleDim][WeightElementDim]
    // factory class
    class Factory
    {
    public:
        Factory() {}
        virtual Ptr GetTrainer(TrainingSet &, int fldCnt, int hidNodeCnt) = 0;
        virtual int GetRecordDimMultiple() = 0;
    private:
        Factory &operator=(const Factory &);
        Factory(Factory &);
    };
    virtual int GetFieldCnt() = 0;
    virtual int GetHiddenNodeCnt() = 0;
    virtual TrainingSet &GetData() = 0;
    virtual WgtsPtr GetWeights() = 0;
    virtual void TrainNNs() = 0;

    virtual ~NeuralNetTrainer(void);

protected:
    NeuralNetTrainer(void);
private:
    NeuralNetTrainer(const NeuralNetTrainer &);
    const NeuralNetTrainer &operator=(const NeuralNetTrainer &);
};
} }
#endif


   

NeuralNetTrainer.cpp

#include "NeuralNetTrainer.h"

using namespace PROJ_MarungoF::Lib;

NeuralNetTrainer::NeuralNetTrainer(void)
{
}

NeuralNetTrainer::~NeuralNetTrainer(void)
{
}

 

 


OrigEvolutionaryTrainer.h

#pragma once
#ifndef _ORIG_EVOLUTIONARY_TRAINER_H
#define _ORIG_EVOLUTIONARY_TRAINER_H
#include "NeuralNetTrainer.h"
#include "NeuralNetEvaluator.h"
#include <boost/shared_array.hpp>
namespace PROJ_MarungoF { namespace Lib {
class TrainingSet;
class OrigEvolutionaryTrainer : public NeuralNetTrainer
{
public:
    // member types
    typedef boost::shared_array<boost::shared_array<float> > FloatArrPtr;
    class Factory : public NeuralNetTrainer::Factory
    {
    public:
        Factory(int popSize, int genCnt, NeuralNetEvaluator &eval)
            : PopSize(popSize), GenCnt(genCnt), Eval(eval) {}
        const int PopSize;
        const int GenCnt;
        NeuralNetEvaluator &Eval;
        virtual Ptr GetTrainer() = 0;
        // {return Ptr(new OrigEvolutionaryTrainer(PopSize, GenCnt, Eval));}
    private:
        Factory(const Factory &);
        Factory &operator=(const Factory &);
    };
    // construction and destruction
    OrigEvolutionaryTrainer(int popSize, int generations, NeuralNetEvaluator &eval);
    ~OrigEvolutionaryTrainer(void);
    // member fields
    const int PopulationSize;
    const int Generations;
    FloatArrPtr Fitness; // [SampleSetDim][PopulationSize]
    FloatArrPtr Sigma;   // [SampleSetDim][PopulationSize][WeightEleDim]
    WgtsPtr Weights;     // [SampleSetDim][PopulationSize][WeightEleDim]
    // member methods
    virtual void TrainNNs(TrainingSet &data, int hidNodeCnt);
    virtual int GetRecordDimMultiple();
    virtual WgtsPtr GetWeights();
//protected:
    // member methods
    virtual void evaluateNNs(bool calcParents);
    // member fields
    float **gndTruths;
    int wgtsLen;
    int sampleDim;
    int fldDim;
    int wgtEleDim;
    int recDim;
    int recCnt;
    int hidNodeCnt;
    int childWgtsLen;
    float **output; // [SampleSetDim][PopulationSize][RecordDim]
    int popRecDimDim;
    NeuralNetEvaluator &evaluator;
//private:
    // member fields
    float c0, c1; // coefficients for mutation
    // member methods
    void initPopulations();
    void createChildren();
    void sortByFitness();
    void calcFitness(bool calcParents = true);
    // unused copy constructor and assignment operator
    OrigEvolutionaryTrainer(const OrigEvolutionaryTrainer &);
    OrigEvolutionaryTrainer &operator=(const OrigEvolutionaryTrainer &);
};
} }
#endif


OrigEvolutionaryTrainer.cpp

#include "OrigEvolutionaryTrainer.h"
#include "Global.h"
#include "TrainingSet.h"
#include <boost/random.hpp>
#include <algorithm>
#include <cmath>
#include <malloc.h>

using namespace PROJ_MarungoF::Lib;
using namespace boost;
using namespace std;

mt19937 &Random(Random);
static uniform_real<float> wgtDist(-1.0f, +1.0f);
static cauchy_distribution<float> cDist;
static normal_distribution<float> nDist;

OrigEvolutionaryTrainer::OrigEvolutionaryTrainer(int popSize, int gen, NeuralNetEvaluator &eval)
    : PopulationSize(popSize), Generations(gen), evaluator(eval)
{}

OrigEvolutionaryTrainer::~OrigEvolutionaryTrainer(void)
{}

void OrigEvolutionaryTrainer::TrainNNs(TrainingSet &data, int hidNodeCnt)
{
    evaluator.SetDataset(data);
    gndTruths = (float **)data.GroundTruth;
    sampleDim = data.SampleDim;
    Weights = WgtsPtr(new shared_array<float>[sampleDim]);
    Fitness = FloatArrPtr(new shared_array<float>[sampleDim]);
    Sigma = FloatArrPtr(new shared_array<float>[sampleDim]);
    output = new float *[sampleDim];
    fldDim = data.FieldDim;
    wgtEleDim = (fldDim + 2)*hidNodeCnt + 1;
    wgtsLen = wgtEleDim*PopulationSize;
    childWgtsLen = wgtsLen >> 1;
    recDim = data.RecordDim;
    recCnt = data.RecordCnt;
    this->hidNodeCnt = hidNodeCnt;
    popRecDimDim = PopulationSize*recDim;
    int optSze = sizeof(float)*popRecDimDim;
    c0 = 1.0f/(sqrtf(2.0f*wgtEleDim));
    c1 = 1.0f/(sqrtf(2.0f*sqrtf((float)wgtEleDim)));
    // initialize values
    for(int i = 0; i < sampleDim; ++i)
    {
        Weights[i] = shared_array<float>(new float[wgtsLen]);
        Fitness[i] = shared_array<float>(new float[PopulationSize]);
        Sigma[i] = shared_array<float>(new float[wgtsLen]);
        output[i] = (float *)_aligned_malloc(optSze, 64);
    }
    initPopulations();
    calcFitness(true);
    sortByFitness();
    for(int i = 1; i < Generations; ++i)
    {
        createChildren();
        calcFitness(false);
        sortByFitness();
    }
    for(int i = 0; i < sampleDim; ++i)
    {

        _aligned_free(output[i]);
    }
    delete[] output;
    evaluator.ReleaseDataset();
}

   


void OrigEvolutionaryTrainer::initPopulations()
{
    // initialize weights & sigma
    // initial weights uniformly distributed between -1.0 and +1.0, sigma initially 1.0
    static variate_generator<mt19937 &, uniform_real<float> > wgtGen(Random, wgtDist);
    for(int i = 0; i < sampleDim; ++i)
    {
        for(float *wgt = Weights[i].get(), *sig = Sigma[i].get(), * const wgtsEnd = &wgt[wgtsLen]; wgt < wgtsEnd; ++wgt, ++sig)
        {
            *wgt = wgtGen();
            *sig = 1.0f;
        }
    }
}

// This function mutates the top 50% of previous generations population to create the
// children
// See Project Report
// wi' = wi + C*sigi'
// sigi' = sigi * exp(c0*N(0,1) + c1*Ni(0,1))
// c0 = 1.0/sqrt(2*WeightVectorDim), c1 = 1.0/sqrt(2*sqrt(WeightVectorDim))
void OrigEvolutionaryTrainer::createChildren()
{
    static variate_generator<mt19937 &, normal_distribution<float> > nGen(Random, nDist);
    static variate_generator<mt19937 &, cauchy_distribution<float> > cGen(Random, cDist);
    for(int i = 0; i < sampleDim; ++i)
    {
        const float *pw = Weights[i].get();
        float *cw = (float *)&pw[childWgtsLen];
        const float *ps = Sigma[i].get();
        float *cs = (float *)&ps[childWgtsLen];
        const float *le = &pw[wgtEleDim];
        const float * const childWgts = cw;
        for(; pw < childWgts; le += wgtEleDim)
        {
            float N0 = nGen(), C = cGen();
            for(; pw < le; ++pw, ++cw, ++ps, ++cs)
            {
                *cs = *ps * exp(c0*N0 + c1*nGen());
                *cw = *pw + C*(*cs);
            }
        }
    }
}

void OrigEvolutionaryTrainer::calcFitness(bool calcParents)
{
    //int t0 = GetTickCount();
    evaluateNNs(calcParents);
    //int t1 = GetTickCount();
    //int deltT0 = t1 - t0;
    for(int i = 0; i < sampleDim; ++i)
    {
        float *fit = Fitness[i].get();
        const float * const fitEnd = &fit[calcParents ? PopulationSize : PopulationSize >> 1];

        float *opt = output[i];
        const float *gdTh = gndTruths[i];
        const float * const gndEnd = &gdTh[recCnt];
        const int padding = recDim - recCnt;
        while(fit < fitEnd)
        {
            *fit = (float)recCnt;
            for(float *gnd = (float *)gdTh; gnd < gndEnd; ++gnd, ++opt)
            {
                *fit -= (*gnd - *opt)*(*gnd - *opt);
            }
            *fit *= 100.0f/recCnt;
            ++fit;
            opt += padding;
        }
    }
    //t1 = GetTickCount();
    //int deltT1 = t1 - t0;
}

void OrigEvolutionaryTrainer::sortByFitness()
{
    struct IdxToFit {int idx; float fit;};
    static struct Greater
    {
        bool operator()(const IdxToFit &x, const IdxToFit &y) {return x.fit > y.fit;}
    } compare;
    IdxToFit *idxToFit = new IdxToFit[PopulationSize];
    for(int i = 0; i < sampleDim; ++i)
    {
        float * const wgts = Weights[i].get(); // old weights
        float * const fit = Fitness[i].get();
        float * const sig = Sigma[i].get();
        for(int j = 0; j < PopulationSize; ++j)
        {
            idxToFit[j].idx = j;
            idxToFit[j].fit = fit[j];
        }
        sort(idxToFit, &idxToFit[PopulationSize], compare);
        float * const newWgts = new float[wgtsLen];
        float * const newSig = new float[wgtsLen];
        float *curWgt = newWgts;
        float *curSig = newSig;
        for(int j = 0; j < PopulationSize; ++j, curWgt += wgtEleDim, curSig += wgtEleDim)
        {
            copy(&wgts[idxToFit[j].idx*wgtEleDim], &wgts[(1 + idxToFit[j].idx)*wgtEleDim], curWgt);
            copy(&sig[idxToFit[j].idx*wgtEleDim], &sig[(1 + idxToFit[j].idx)*wgtEleDim], curSig);
            fit[j] = idxToFit[j].fit;
        }
        // keep the base pointers, not the advanced cursors, and pair each
        // array with its own buffer (the flattened listing assigned curWgt to both)
        Weights[i] = shared_array<float>(newWgts);
        Sigma[i] = shared_array<float>(newSig);
    }
    delete[] idxToFit;
}

int OrigEvolutionaryTrainer::GetRecordDimMultiple()
{
    return evaluator.GetRecordDimMultiple();
}

void OrigEvolutionaryTrainer::evaluateNNs(bool calcParents)
{
    NeuralNetEvaluator::WeightData wgt;
    int nNCnt = calcParents ? PopulationSize : PopulationSize >> 1;
    int offset = (calcParents ? 0 : PopulationSize >> 1) * wgtEleDim;
    wgt.WeightSetDim = sampleDim;
    wgt.HiddenNodeCnt = hidNodeCnt;
    wgt.WeightVectorDim = nNCnt;
    wgt.WeightEleDim = (fldDim + 2) * hidNodeCnt + 1;


   

    wgt.WeightVectors = new float *[sampleDim];
    wgt.Output = new float *[sampleDim];
    wgt.DatasetMapping = new int[sampleDim];
    copy(output, &output[sampleDim], wgt.Output);
    for(int i = 0; i < sampleDim; ++i)
    {
        wgt.WeightVectors[i] = Weights[i].get() + offset;
        wgt.DatasetMapping[i] = i;
    }
    evaluator.Evaluate(wgt);
    delete[] wgt.WeightVectors;
    delete[] wgt.Output;
    delete[] wgt.DatasetMapping;
}

NeuralNetTrainer::WgtsPtr OrigEvolutionaryTrainer::GetWeights()
{
    return Weights;
}
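createChildren above implements the self-adaptive mutation described in the report: each strategy parameter is updated as sigi' = sigi * exp(c0*N(0,1) + c1*Ni(0,1)) and each weight as wi' = wi + C*sigi', with one Gaussian draw N and one Cauchy draw C shared per weight vector. A standalone sketch of that update using std::random in place of the Boost generators; `mutate` and its signature are illustrative, not the project's API:

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Self-adaptive mutation over one weight vector:
//   sigma_i' = sigma_i * exp(c0*N + c1*N_i)   (N shared, N_i per element)
//   w_i'     = w_i + C * sigma_i'             (C: one Cauchy draw per vector)
// with c0 = 1/sqrt(2*dim) and c1 = 1/sqrt(2*sqrt(dim)).
void mutate(const std::vector<float> &w, const std::vector<float> &sigma,
            std::vector<float> &wOut, std::vector<float> &sigOut,
            std::mt19937 &rng)
{
    const float dim = (float)w.size();
    const float c0 = 1.0f / std::sqrt(2.0f * dim);
    const float c1 = 1.0f / std::sqrt(2.0f * std::sqrt(dim));

    std::normal_distribution<float> n(0.0f, 1.0f);
    std::cauchy_distribution<float> cauchy(0.0f, 1.0f);
    const float N0 = n(rng);     // shared Gaussian draw for this vector
    const float C = cauchy(rng); // shared Cauchy step for this vector

    wOut.resize(w.size());
    sigOut.resize(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
    {
        sigOut[i] = sigma[i] * std::exp(c0 * N0 + c1 * n(rng)); // per-element noise
        wOut[i] = w[i] + C * sigOut[i];
    }
}
```

The lognormal update keeps every sigma strictly positive, and the heavy-tailed Cauchy step occasionally produces large jumps, which is the usual rationale for this operator in fast evolutionary programming.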

 

 


   

SamplingData.h

#pragma once
#ifndef SAMPLING_DATA_H
#define SAMPLING_DATA_H
namespace PROJ_MarungoF { namespace Lib {
class TrainingSet;
class TestingSet;
class SamplingData
{
public:
    TrainingSet &Trainingset;
    TestingSet &Testingset;

    SamplingData(TrainingSet &, TestingSet &);
    ~SamplingData();
private:
    SamplingData(const TrainingSet &);
    const SamplingData &operator=(const TrainingSet &);
};
} }
#endif


   


SamplingData.cpp

#include "SamplingData.h"
#include "TrainingSet.h"
#include "TestingSet.h"

using namespace PROJ_MarungoF::Lib;

SamplingData::SamplingData(TrainingSet &trSet, TestingSet &teSet)
    : Trainingset(trSet), Testingset(teSet)
{
}

SamplingData::~SamplingData()
{
    delete &Trainingset;
    delete &Testingset;
}


   

SseEvaluator.h

#pragma once
#ifndef _SSE_EVALUATOR_H
#define _SSE_EVALUATOR_H
#include "NeuralNetEvaluator.h"
namespace PROJ_MarungoF { namespace Lib {
class SseEvaluator : public NeuralNetEvaluator
{
public:
    class Factory : public NeuralNetEvaluator::Factory
    {
    public:
        virtual Ptr GetEvaluator();
        Factory() {}
    private:
        Factory &operator=(const Factory &);
        Factory(Factory &);
    };
    SseEvaluator(void);
    virtual ~SseEvaluator(void);
    virtual int GetRecordDimMultiple();
    virtual void SetDataset(TrainingSet &);
    virtual void ReleaseDataset();
    virtual void Evaluate(WeightData &);
protected:
    TrainingSet *data;
};
} }
#endif


SseEvaluator.cpp

#include "SseEvaluator.h"
#include "SseGlobal.h"
#include "TrainingSet.h"

using namespace PROJ_MarungoF::Lib;

SseEvaluator::SseEvaluator(void)
{
}

SseEvaluator::~SseEvaluator(void)
{
}

int SseEvaluator::GetRecordDimMultiple() { return 4; }

void SseEvaluator::SetDataset(PROJ_MarungoF::Lib::TrainingSet &data)
{
    this->data = &data;
}

void SseEvaluator::ReleaseDataset() {}

void SseEvaluator::Evaluate(WeightData &wgtDat)
{
    int rowDim = data->RecordDim;
    int fldDim = data->FieldDim;
    int hidNodCnt = wgtDat.HiddenNodeCnt;
    int recCnt = data->RecordCnt;
    int wgtVectDim = wgtDat.WeightVectorDim;

    for(int i = 0; i < wgtDat.WeightSetDim; ++i)
    {
        float *dat = data->Samples[wgtDat.DatasetMapping[i]];
        float *wgt = wgtDat.WeightVectors[i];
        float *out = wgtDat.Output[i];

        SseGlobal::EvaluateNN(dat
            , rowDim
            , fldDim
            , wgtVectDim
            , hidNodCnt
            , wgt
            , out
            , recCnt);
    }
}

NeuralNetEvaluator::Ptr SseEvaluator::Factory::GetEvaluator()
{
    return Ptr(new SseEvaluator());
}


SseGlobal.h

#pragma once
#ifndef _SSE_GLOBAL_H
#define _SSE_GLOBAL_H

union __m128;

namespace PROJ_MarungoF {
namespace Lib {

struct SseGlobal
{
    // data and output must be aligned on 16 byte boundaries
    // this function evaluates the same weight on all of the networks
    void static EvaluateNN(float *data   // [fieldDim][rowDim]
        , int rowDim                     // must be a multiple of 4
        , int fieldDim
        , int wgtVectDim
        , int hidNodCnt
        , float *weight                  // [(fieldDim + 2)*hidNodCnt + 1]
        , float *output                  // [rowDim]
        , int recCnt
        );

    void static __fastcall SquashingFunctionP4(__m128 *fin);

private:
    SseGlobal(void);
    ~SseGlobal(void);
};

}
}
#endif


SseGlobal.cpp

#include "SseGlobal.h"
#include <xmmintrin.h>   // SSE intrinsics
#include <emmintrin.h>   // SSE2 intrinsics

using namespace PROJ_MarungoF::Lib;

// SquashingFunctionP4 and constant declarations from
// "A Fast, Streaming SIMD Extensions 2, Logistic Squashing Function"
// (published in Neural Computation)
// J. J. Milner, [email protected]
// A. J. Grandison, [email protected]
// School of Computing and Mathematical Sciences, University of Greenwich,
// 30 Park Row, Greenwich, London SE10 9SL, UK
// doi:10.1162/neco.2008.10-06-366
__declspec(align(64)) static const float MAX[4] = {87.0f, 87.0f, 87.0f, 87.0f};
__declspec(align(64)) static const float MIN[4] = {-87.0f, -87.0f, -87.0f, -87.0f};
__declspec(align(64)) static const float p4shiftexp[4] =
    {-(8388608.0f/0.6931471806f), -(8388608.0f/0.6931471806f),
     -(8388608.0f/0.6931471806f), -(8388608.0f/0.6931471806f)};
__declspec(align(64)) static const float p4shiftbias[4] =
    {1065353216.0f, 1065353216.0f, 1065353216.0f, 1065353216.0f};
__declspec(align(64)) const float p4ones[4] = {1.0f, 1.0f, 1.0f, 1.0f};
__declspec(align(64)) const float p4zeros[4] = {0.0f, 0.0f, 0.0f, 0.0f};

void SseGlobal::EvaluateNN(float *dStart
    , int rowDim
    , int fieldDim
    , int wgtVectDim
    , int hidNodCnt
    , float *weight
    , float *output
    , int recCnt)
{
    __m128 w4;    // weights on the hidden layer
    __m128 wo4;   // weights to the output node
    __m128 ipt4;  // input node values

    // iterate over 4 records at a time
    int inc = rowDim / 4;
    int wOutOff = (fieldDim + 1)*hidNodCnt;
    int wOff = (fieldDim + 2)*hidNodCnt + 1;
    float *weightEnd = &weight[wOff*wgtVectDim];
    __m128 *dEnd = (__m128 *)&dStart[rowDim*fieldDim];

    while(weight < weightEnd)
    {
        __m128 *opt4 = (__m128 *)output;
        float *data = dStart;
        for(int i = 0; i < recCnt; i += 4, data += 4)
        {
            float *w = weight;
            float *wo = &w[wOutOff];
            *opt4 = _mm_set1_ps(0.0f);
            // iterate over hidden nodes
            for(int j = 0; j < hidNodCnt; ++j)
            {
                ipt4 = _mm_set1_ps(0.0f);
                // iterate over inputs
                for(__m128 *d = (__m128 *)data; d < dEnd; d += inc)
                {
                    w4 = _mm_set1_ps(*w);
                    ipt4 = _mm_add_ps(ipt4, _mm_mul_ps(w4, *d));
                    ++w;
                }
                // add bias
                w4 = _mm_set1_ps(*w);
                ipt4 = _mm_add_ps(ipt4, w4);
                ++w;
                SquashingFunctionP4(&ipt4);
                wo4 = _mm_set1_ps(*wo);
                *opt4 = _mm_add_ps(*opt4, _mm_mul_ps(wo4, ipt4));
                ++wo;
            }
            // add bias
            wo4 = _mm_set1_ps(*wo);
            *opt4 = _mm_add_ps(*opt4, wo4);
            ++wo;
            SquashingFunctionP4(opt4); // this is the 4 NNs' output
            ++opt4;
        }
        output += rowDim;
        weight += wOff;
    }
}

__declspec(naked) void __fastcall SseGlobal::SquashingFunctionP4(__m128 *fin)
{
    __asm
    {
        movaps   xmm1, [ecx]          ; load 4 single precision values
        mulps    xmm1, [p4shiftexp]   ; shift y into high order bits
        addps    xmm1, [p4shiftbias]  ; add the (pre-shifted) bias
        cvtps2dq xmm0, xmm1           ; convert 4 floats to integers
        movdqa   [ecx], xmm0          ; store 4 integers
        movaps   xmm1, [ecx]          ; reload as floats, this is e^-y
        addps    xmm1, [p4ones]       ; add one
        rcpps    xmm0, xmm1           ; reciprocal
        movaps   [ecx], xmm0          ; store 4 results
        ret
    }
}


TestingSet.h

#pragma once
#ifndef TEST_SET_H
#define TEST_SET_H

namespace PROJ_MarungoF {
namespace Lib {

class TestingSet
{
public:
    // construction, destruction
    TestingSet(int testsetDim, int fieldDim, int recCnts[1024], int recDimMult = 32);
    virtual ~TestingSet(void);

    // elements
    // All fixed-length arrays are 0-terminated and can therefore hold at
    // most 1023 entries.
    float *TestSets[1024];       // [TestsetDim][FieldDim][RecordDim]
    float *GroundTruth[1024];    // [TestsetDim][RecordDim]
    const int TestsetDim;        // # of bootstraps, max val 1023
    const int FieldDim;
    int RecordDims[1024];        // will be a multiple of RecordDimMultiple
    int RecordCnts[1024];        // this will be the true record count
    const int RecordDimMultiple; // must be a power of two
    const int Alignment;

private:
    TestingSet(const TestingSet &);
    TestingSet &operator=(const TestingSet &);
};

}
}
#endif


TestingSet.cpp

#include "TestingSet.h"
#include <algorithm>
#include <malloc.h>

using namespace PROJ_MarungoF::Lib;
using namespace std;

TestingSet::TestingSet(int testsetDim, int fieldDim, int recCnts[1024], int recDimMult)
    : TestsetDim(testsetDim)
    , FieldDim(fieldDim)
    , RecordDimMultiple(recDimMult)
    , Alignment(64)
{
    copy(&recCnts[0], &recCnts[1024], &RecordCnts[0]);
    TestSets[TestsetDim] = 0;
    GroundTruth[TestsetDim] = 0;
    RecordDims[0] = 0;
    RecordCnts[TestsetDim] = 0;
    for(int i = 0; i < TestsetDim; ++i)
    {
        RecordDims[i] = recCnts[i] & (recDimMult - 1)
            ? (recCnts[i] | (recDimMult - 1)) + 1
            : recCnts[i];
        size_t tsLen = sizeof(float)*FieldDim*RecordDims[i];
        size_t gndLen = sizeof(int)*RecordDims[i];
        TestSets[i] = (float *)_aligned_malloc(tsLen, Alignment);
        GroundTruth[i] = (float *)_aligned_malloc(gndLen, Alignment);
        // zero padded values
        if(RecordDims[i] != RecordCnts[i])
        {
            float *f = TestSets[i];
            for(int j = 0; j < FieldDim; ++j, f += RecordDims[i])
                for(int k = RecordCnts[i]; k < RecordDims[i]; ++k)
                    f[k] = 0.0f;
        }
    }
}

TestClass.cpp

    eval->SetDataset(ts);
    eval->Evaluate(wgtDat);
    eval->ReleaseDataset();

   


    for(int i = 0; i < sampDim; ++i)
    {
        for(int j = 0; j < wgtVectDim; ++j)
        {
            float *out = &output[i][j*ts.RecordDim];
            for(int k = 0; k < ts.RecordCnt; ++k)
            {
                retVal = max(retVal
                    , fabsf(out[k] - XOR_OUT[(i + j) & 1][3 & (i + k)]));
            }
        }
    }
    for(int i = 0; i < sampDim; ++i)
    {
        _aligned_free(weightVects[i]);
        _aligned_free(output[i]);
    }
    delete[] weightVects;
    delete[] output;
    delete[] dsMap;
    return retVal;
}

void initTrainingSet(TrainingSet &ts)
{
    for(int i = 0; i < ts.SampleDim; ++i)
    {
        for(int j = 0; j < ts.RecordCnt; ++j)
        {
            int val = i + j;
            ts.Samples[i][j] = (float)(val & 1);
            ts.Samples[i][ts.RecordCnt + j] = (float)((val & 2) >> 1);
            ts.GroundTruth[i][j] = (float)(((val & 1) ^ ((val & 2) >> 1)) & 1);
        }
    }
}

float TestClass::TestTrainer(NeuralNetTrainer::Factory &fact)
{
    TrainingSet train(32, 2, 128, 32);
    initTrainingSet(train);
    return TestTrainer(fact, train, 2);
}

float TestClass::TestTrainer(NeuralNetTrainer::Factory &fact, TrainingSet &train, int hidNodeCnt)
{
    NeuralNetTrainer::Ptr pTnr = fact.GetTrainer(train, train.FieldDim, hidNodeCnt);
    NeuralNetTrainer &tnr = *pTnr.get();
    tnr.TrainNNs();
    NeuralNetTrainer::WgtsPtr pWgts = tnr.GetWeights();

    NeuralNetEvaluator::WeightData wgtData;
    wgtData.WeightSetDim = train.SampleDim;
    wgtData.WeightVectorDim = 1;
    wgtData.WeightEleDim = (train.FieldDim + 2)*hidNodeCnt + 1;
    wgtData.HiddenNodeCnt = hidNodeCnt;
    wgtData.DatasetMapping = new int[train.SampleDim];
    wgtData.Output = new float *[train.SampleDim];
    wgtData.WeightVectors = new float *[train.SampleDim];
    for(int i = 0; i < train.SampleDim; ++i)
    {
        wgtData.Output[i] = new float[wgtData.WeightVectorDim*train.RecordDim];
        wgtData.WeightVectors[i] = new float[wgtData.WeightEleDim*wgtData.WeightVectorDim];
        copy(pWgts[i].get(),
            &pWgts.get()[i][wgtData.WeightEleDim*wgtData.WeightVectorDim],
            wgtData.WeightVectors[i]);
        wgtData.DatasetMapping[i] = i;
    }

    NeuralNetEvaluator::Ptr pEval = CudaEvaluator::Factory(&BasicEvaluateNN).GetEvaluator();
    NeuralNetEvaluator &eval = *pEval.get();
    eval.SetDataset(train);
    eval.Evaluate(wgtData);
    eval.ReleaseDataset();

    int right = 0, total = 0;
    for(int i = 0; i < train.SampleDim; ++i)
    {
        for(int j = 0; j < train.RecordCnt; ++j)
        {
            if((wgtData.Output[i][j] < 0.5f && !train.GroundTruth[i][j])
                || (wgtData.Output[i][j] > 0.5f && train.GroundTruth[i][j]))
                ++right;
            ++total;
        }
    }
    for(int i = 0; i < train.SampleDim; ++i)
    {
        delete[] wgtData.Output[i];
        delete[] wgtData.WeightVectors[i];
    }
    delete[] wgtData.DatasetMapping;
    delete[] wgtData.Output;
    delete[] wgtData.WeightVectors;
    return (float)right/(float)total;
}

PtrGS TestClass::TestGpuGeneticSelector()
{
    // create a data array with 64 fields and 1024 records;
    // fields 1 and 3 are XORed to create the ground truth, the rest are dummies
    float data[64][1024];
    float gnd[1024];
    for(int i = 0; i < 64; ++i)
        for(int j = 0; j < 1024; ++j)
        {
            if(i == 1)
                data[i][j] = (float)(j & 1);
            else if(i == 3)
                data[i][j] = (float)((j & 2) >> 1);
            else
                data[i][j] = rand() & 1;
            gnd[j] = (j & 1) ^ ((j & 2) >> 1);
        }
    Bootstrap bs(1024, 10);
    Bootstrap::DataPtr pData = bs.CreateSamplingData((float *)data, (float *)gnd, 1024, 4, 64);
    CudaEvaluator::Factory evalFact(&BasicEvaluateNN);
    NeuralNetEvaluator::Ptr eval = evalFact.GetEvaluator();
    EvolutionaryTrainer::Factory trainFact(50, 100, *eval);
    GeneticSelector &gs = *(new GeneticSelector(50, 20, 5, *pData, trainFact, 0.1f));
    GsExecuteTime = -GetTickCount();
    gs.Execute();
    GsExecuteTime += GetTickCount();
    return PtrGS(&gs);
}

PtrGS TestClass::TestCpuGeneticSelector()
{
    // same 64-field, 1024-record XOR dataset, bootstrapped into 10 samples,
    // but evaluated on the CPU via SSE instead of the GPU
    float data[64][1024];
    float gnd[1024];
    for(int i = 0; i < 64; ++i)
        for(int j = 0; j < 1024; ++j)
        {
            if(i == 1)
                data[i][j] = (float)(j & 1);
            else if(i == 3)
                data[i][j] = (float)((j & 2) >> 1);
            else
                data[i][j] = rand() & 1;
            gnd[j] = (j & 1) ^ ((j & 2) >> 1);
        }
    Bootstrap bs(1024, 10);
    Bootstrap::DataPtr pData = bs.CreateSamplingData((float *)data, (float *)gnd, 1024, 4, 64);
    SseEvaluator::Factory evalFact;
    NeuralNetEvaluator::Ptr eval = evalFact.GetEvaluator();
    EvolutionaryTrainer::Factory trainFact(50, 100, *eval);
    GeneticSelector &gs = *(new GeneticSelector(50, 20, 5, *pData, trainFact, 0.1f));
    GsExecuteTime = -GetTickCount();
    gs.Execute();
    GsExecuteTime += GetTickCount();
    return PtrGS(&gs);
}


Testing.cpp

// Testing.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "TestClass.h"
#include "SseEvaluator.h"
#include "CudaEvaluator.h"
#include "EvolutionaryTrainer.h"
#include "CudaBasic.h"
#include "TrainingSet.h"
#include "TestingSet.h"
#include <iostream>

static int init() { srand(0); return 0; }
static int dummy = init();

using namespace PROJ_MarungoF::Lib;
using namespace PROJ_MarungoF::Testing;
using namespace std;

const int FUNC_CNT = 1;
const CudaEvaluator::Function FUNCS[FUNC_CNT] = {&BasicEvaluateNN};
const char FUNC_NAMES[FUNC_CNT][50] = {"BasicEvaluateNN"};

void setupWgtData(NeuralNetEvaluator::WeightData &, int);
void destroyWgtData(NeuralNetEvaluator::WeightData &);
void testGpuGs();
void testCpuGs();

NeuralNetEvaluator::Ptr trainEval;

int _tmain(int argc, _TCHAR* argv[])
{
    testGpuGs();
    cout