A Distributed Kernel Summation Framework for Machine Learning and Scientific Applications
Dongryeol Lee
Ph.D. Candidate in Computer Science
School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology
May 4, 2012, 10 AM - 12 PM, KACB 1212
Outline

1 Kernel Methods
    Solution Preview
2 Thesis Outline
    Problem Definition
    Thesis Statement and Contributions
3 Scaling Kernel Summations
    Approximation Methods
    Distributed and Shared Memory Parallelism
4 Distributed Averaging/Random Feature-based GPR/KPCA
    Distributed Averaging
5 Conclusion
    Omitted Research Work
Kernel Methods

A kernel function k : X × X → R defines a similarity measure between a pair of objects.

[Figure: an example point set in two dimensions and its kernel-induced similarity structure.]

Example: the Gaussian kernel, $\{K_{i,j}\}_{1 \le i,j \le N} = e^{-\|x_i - x_j\|^2/(2h^2)}$.
Kernel Summations

"Kernel summation is the computational bottleneck ubiquitous in kernel methods and scientific algorithms." A kernel function outputs a real number given a tuple of points; a kernel summation computes, approximately, the average similarity.
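A minimal sketch of the idea in plain NumPy, not the thesis code; the bandwidth h and the data sizes here are illustrative:

    import numpy as np

    def gaussian_kernel(x, y, h):
        """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
        d = x - y
        return np.exp(-np.dot(d, d) / (2.0 * h * h))

    def kernel_summation(Q, R, h):
        """Naive kernel summation Phi(q; R) = sum_{r in R} k(q, r):
        one pass over all |Q| * |R| pairs -- the quadratic bottleneck."""
        return np.array([sum(gaussian_kernel(q, r, h) for r in R) for q in Q])

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 3))    # 5 query points in 3 dimensions
    R = rng.normal(size=(100, 3))  # 100 reference points
    print(kernel_summation(Q, R, h=0.5))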
Distributed Data

If disk space is cheap, why can't we store everything on one machine?

[Figure: data partitioned across processes P0 through P7.]

• It is more cost-effective to distribute data over a network of less powerful nodes than to store everything on one powerful node.
• Distribution allows distributed query processing for high scalability.
• In some cases, all of the data cannot be stored on one node due to privacy concerns.
Why are Kernel Methods Hard to Scale?

[Figure: the N × N kernel matrix K with entries $k(x_i, x_j)$.]

The problem is inherently super-quadratic in the number of data points. How do we break up the pairwise/higher-order interactions?
Solution Preview 1: Distributed Tree

[Figure: a global tree whose top levels are partitioned across processes P0 through P3.]
Solution Preview 2: Approximation Methods

Different approximation methods.

[Figure: query and reference tree nodes with centroids $c_Q$, $c_R$ and children $c_{Q^L}$, $c_{Q^R}$, $c_{R^L}$, $c_{R^R}$; interactions between distant node pairs are approximated through the centroids.]
Solution Preview 3: Distributed Averaging
Kernel Methods for Density Estimation

If the kernel is a pdf, then we can do density estimation. Kernel density estimation [Parzen 1962] is a weighted column average of the kernel matrix:
$$\operatorname*{map}_{q \in Q} \sum_{r_i \in R} w_i k(q, r_i) = K \cdot w$$
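A tiny check of this identity, with a stand-in kernel matrix rather than the thesis code:

    import numpy as np

    rng = np.random.default_rng(1)
    K = np.exp(-rng.random((5, 100)))  # stand-in |Q| x |R| kernel matrix
    w = np.full(100, 1.0 / 100)        # uniform KDE weights

    # The weighted column average, map over q of sum_i w_i k(q, r_i),
    # is exactly the matrix-vector product K w.
    loop = np.array([sum(w[i] * K[q, i] for i in range(100)) for q in range(5)])
    assert np.allclose(loop, K @ w)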
Kernel Methods

A kernel k is a similarity function. If, in addition, it satisfies Mercer's condition ($K \succeq 0$), then it corresponds to a dot product: $k(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.

[Figure: data in the input space and its image under the feature map φ.]

The high-dimensional mapping φ can be provided implicitly by k, i.e. the kernel trick.
Kernel Methods for Nonlinear Feature Extraction

Kernel PCA [Scholkopf 1996]: eigendecompose the kernel matrix, $K = U \Sigma U^T$.

[Figure: KPCA on MNIST60k, 28 by 28 images.]
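A minimal sketch of the computation, assuming the usual double-centering step (plain NumPy, not the thesis code):

    import numpy as np

    def kpca(K, num_components):
        """Kernel PCA sketch: eigendecompose the centered kernel matrix."""
        n = K.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        Kc = J @ K @ J                      # double-center K in feature space
        vals, vecs = np.linalg.eigh(Kc)     # ascending eigenvalues; O(N^3)
        idx = np.argsort(vals)[::-1][:num_components]
        # Embed the training points onto the leading nonlinear components.
        return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))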
Kernel Methods for Regression

Gaussian process regression: solve a large linear system, $K^{-1} y$.

[Figure: a one-dimensional Gaussian process regression fit.]
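A sketch of the predictive mean under these definitions; the noise jitter is an assumption added for numerical stability, not something on the slide:

    import numpy as np

    def gpr_predict_mean(K, k_star, y, noise=1e-3):
        """GPR predictive mean: K is the N x N training kernel matrix,
        k_star the M x N query/training cross-kernel matrix, y the targets.
        The O(N^3) cost is the linear solve behind K^{-1} y."""
        alpha = np.linalg.solve(K + noise * np.eye(K.shape[0]), y)
        return k_star @ alpha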
Scientific Motivation for Kernel Methods

Sloan Digital Sky Survey:
• Collects around 200 GB of data per day.
• Has collected photometric data for around 500 million objects and spectra for more than 1 million objects.
Scientific Motivation for Kernel Methods: Redshift Prediction

Kernel density estimator: $\operatorname*{map}_{q \in Q} \sum_{r_j \in R} w_j k(q, r_j)$

Nadaraya-Watson: $\operatorname*{map}_{q \in Q} \dfrac{\sum_{(r_j, y_j) \in R} w_j y_j k(q, r_j)}{\sum_{r_j \in R} w_j k(q, r_j)}$

Gaussian process regression: $K^{-1} y$

[Figure: redshift prediction scatter plots.]

Large-scale redshift prediction of galaxies and quasars.
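A sketch of the Nadaraya-Watson estimator above (plain NumPy; K would be the query-by-reference kernel matrix):

    import numpy as np

    def nadaraya_watson(K, y, w=None):
        """Nadaraya-Watson: per-query kernel-weighted average of responses.
        K: |Q| x |R| kernel matrix; y: |R| responses (e.g. redshifts)."""
        w = np.ones(K.shape[1]) if w is None else w
        num = K @ (w * y)  # sum_j w_j y_j k(q, r_j)
        den = K @ w        # sum_j w_j k(q, r_j)
        return num / den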
Problem: Scaling Kernel Methods

[Figure: the N × N kernel matrix K with entries $k(x_i, x_j)$.]

Parameter optimization is computationally intensive:
• Weighted column average (KDE): $O(N^2)$ per parameter.
• Eigendecomposition (KPCA): $O(N^3)$ per parameter.
• Solving linear systems (GPR): $O(N^3)$ per parameter.
Problem: Scaling Kernel Methods

[Figure: data distributed across processes P0 through P7.]

Depending on the kernel (long-range), communication can be the bottleneck:
• Weighted column average (KDE): $O(N^2)$ per parameter.
• Eigendecomposition (KPCA): $O(N^3)$ per parameter.
• Solving linear systems (GPR): $O(N^3)$ per parameter.
Problem: Scaling Kernel Methods

MNIST: handwritten digit recognition data of 28 × 28 images. The 60K-point training set alone requires 28 GB to store the kernel matrix.
Problem: Scaling Kernel Methods

For the moment, assume the kernel hyperparameters are fixed. We are given a set of query points Q = {q} and a set of data points R.

For KDE: given $\epsilon > 0$, approximate $\Phi(q; R) = \sum_{r \in R} k(q, r)$ with $\widetilde{\Phi}(q; R)$ such that $|\widetilde{\Phi}(q; R) - \Phi(q; R)| \le \epsilon\, \Phi(q; R)$, as fast as possible. One can also impose probabilistic error bounds (Chapter 5 of the thesis).
Thesis Statement

"Utilizing the best general-dimension algorithms, approximation methods with error bounds, and distributed and shared memory parallelism can help scale kernel methods."
Thesis Contributions

A parallel kernel summation framework that utilizes:
• A recursive general-dimension divide-and-conquer algorithm using various types of deterministic and probabilistic approximations (Chapters 3, 4, 5).
• Indexing of the data using any multidimensional binary tree, with both distributed memory (MPI) and shared memory (OpenMP/Intel TBB) parallelism (Chapter 8).
• A dynamic load balancing scheme to adjust work imbalances during the computation (Chapter 8).
• A combination of distributed averaging and random feature extraction for scaling GPR and KPCA (Chapter 9).
Methods That Can Be Easily Parallelized Within the Framework

• Kernel density estimation.
• Kernel regression / local polynomial regression.
• Gaussian process regression.
• Kernel PCA.
• Kernel SVM (test phase).
• Kernel k-means.
• Kernel conditional density estimation.
• Kernel orthogonal centroid.
• Kernel (...).
Earlier Related Work

• "A framework for parallel tree-based scientific simulations," in Proceedings of the 26th International Conference on Parallel Processing, pp. 137-144, 1997. [Liu and Wu 1997]
• THOR: A Parallel N-body Data Mining Framework. [Boyer, Riegel, Gray 2007]
• The PEGASUS peta-scale graph mining framework. [Kang, Tsourakakis, Faloutsos 2009]

Our work: a more comprehensive arsenal of approximation methods and data structures; uses MPI/OpenMP/Intel TBB from the ground up. Caveat: still a work in progress.
The First Part of the Talk

• Density estimation (KDE): $O(N^2)$ per parameter.
• Clustering (mean shift): $O(N^2)$ per iteration per parameter.
• Regression (GPR): $O(N^3)$ per parameter.
• Nonlinear feature extraction (KPCA): $O(N^3)$ per parameter.
The Other Two

Claim: GPR/KPCA computations ≈ kernel summations.
• Regression (GPR): $O(N^3)$ per parameter.
• Nonlinear feature extraction (KPCA): $O(N^3)$ per parameter.
Contribution 1: Unified Framework for Kernel Summation

• Use multidimensional trees for full generality.
• Divide and conquer using trees, via approximations.
Multidimensional Trees
kd-tree [Bentley 1975]: recursively split the data points using an axis-aligned split.
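A minimal sequential sketch of the idea; median splits on cycling dimensions are one common choice, and the framework supports other split rules:

    import numpy as np

    class KdNode:
        """Minimal kd-tree node: recursive axis-aligned median splits."""
        def __init__(self, points, indices, leaf_size=16, depth=0):
            lo = points[indices].min(axis=0)    # hyper-rectangle bound
            hi = points[indices].max(axis=0)
            self.bounds, self.indices = (lo, hi), indices
            self.left = self.right = None
            if len(indices) > leaf_size:
                d = depth % points.shape[1]     # cycle the split dimension
                order = indices[np.argsort(points[indices, d])]
                mid = len(order) // 2           # median split on axis d
                self.left = KdNode(points, order[:mid], leaf_size, depth + 1)
                self.right = KdNode(points, order[mid:], leaf_size, depth + 1)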
Generalized N-body Framework [Gray/Moore 2000]

For problems of the form $\bigoplus_{q \in Q} \bigotimes_{r \in R} k(q, r)$, where ⊕ and ⊗ are associative, commutative binary operators:

DualTree(Q, R)
  if CanSummarize(Q, R) then
    Summarize(Q, R)
  else
    if Q and R are leaf nodes then
      DualTreeBase(Q, R)
    else
      DualTree(Q^L, R^L), DualTree(Q^L, R^R)
      DualTree(Q^R, R^L), DualTree(Q^R, R^R)
    end if
  end if
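A sketch of this recursion for the Gaussian kernel sum, reusing the KdNode sketch above; the absolute-error prune (tau per reference point) is a simplification of the relative-error and probabilistic criteria used in the thesis:

    import numpy as np

    def dist_bounds(Qn, Rn):
        """Min/max pairwise distance between two hyper-rectangles."""
        (qlo, qhi), (rlo, rhi) = Qn.bounds, Rn.bounds
        gap = np.maximum(0.0, np.maximum(qlo - rhi, rlo - qhi))
        far = np.maximum(qhi - rlo, rhi - qlo)
        return np.linalg.norm(gap), np.linalg.norm(far)

    def dual_tree(Qn, Rn, points, phi, h, tau):
        dmin, dmax = dist_bounds(Qn, Rn)
        kmax = np.exp(-dmin * dmin / (2 * h * h))
        kmin = np.exp(-dmax * dmax / (2 * h * h))
        if kmax - kmin <= 2 * tau:                 # CanSummarize(Q, R)
            phi[Qn.indices] += len(Rn.indices) * (kmax + kmin) / 2
        elif Qn.left is None and Rn.left is None:  # DualTreeBase(Q, R)
            for q in Qn.indices:
                d = points[Rn.indices] - points[q]
                phi[q] += np.exp(-(d * d).sum(axis=1) / (2 * h * h)).sum()
        else:                                      # recurse on child pairs
            for qc in ([Qn] if Qn.left is None else [Qn.left, Qn.right]):
                for rc in ([Rn] if Rn.left is None else [Rn.left, Rn.right]):
                    dual_tree(qc, rc, points, phi, h, tau)

    # Usage: root = KdNode(points, np.arange(len(points)))
    #        phi = np.zeros(len(points))
    #        dual_tree(root, root, points, phi, h=0.5, tau=1e-3)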
Contribution 2: Deterministic/Probabilistic Approximations

[Figure: query and reference tree nodes with centroids $c_Q$, $c_R$ and their children; approximations are formed between pairs of nodes.]

My research prior to 2008 focused on this aspect (Chapters 3, 4, 5 of the thesis); 16K lines of open-sourced C++ series expansion code.
Parallelizing Kernel Summations

[Figure: data distributed across processes P0 through P7.]

Based on Chapter 8 of the thesis: D. Lee, R. Vuduc, and A. G. Gray. A Distributed Kernel Summation Framework for General-Dimension Machine Learning. In Proceedings of the SIAM International Conference on Data Mining, 2012. Best Paper Award. An extended version was invited for a SIAM journal.
Recursive Doubling

[Figure: recursive doubling among eight processes; merged ranges grow from 0-1, 2-3, ... to 0-3, 4-7, and finally 0-7 over log p rounds.]

• Used for constructing trees in parallel.
• Used for passing data/messages among MPI processes.
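A toy simulation of the pattern; real code would use pairwise MPI exchanges, while here every "rank" is just a list entry:

    def recursive_doubling_allreduce(values):
        """Each round, rank r combines with rank r XOR step; after log2(p)
        rounds every rank holds the global sum."""
        p = len(values)              # assumed to be a power of two
        acc = list(values)
        step = 1
        while step < p:
            acc = [acc[r] + acc[r ^ step] for r in range(p)]
            step *= 2
        return acc

    print(recursive_doubling_allreduce([0, 1, 2, 3, 4, 5, 6, 7]))  # eight 28s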
Contribution 3: Distributed Tree

[Figure: distributed tree over processes P0 through P3. Black line: the global tree shared by all processes; red line: the local tree owned by process P0.]

Each process stores the global tree and its own local tree.
Distributed Memory Parallelism in Tree Building

[Figure: the first log p levels of the tree are built collectively across processes P0 through P7.]

Build the first log p levels in parallel. Overall complexity on the hypercube topology:
$$O\left(\frac{DN}{p}\log\frac{N}{p}\right) + O\left(D t_w m_{\mathrm{bound}}\,(2p - \log 4p)\right) + O\left(\frac{2N(D + t_w)}{p}\log p\right) + O\left(\frac{t_s}{2}\log p\,(\log p + 3)\right)$$
Shared Memory Parallelism in Tree Building

[Figure: the fraction of threads assigned halves (1, 1/2, 1/4, 1/8) down successive levels of the recursion.]

Assign the available threads to the reductions needed to compute the bounding primitive.
What Can Be Indexed in Parallel?

Any multidimensional binary tree.

[Figure: points indexed into the intervals [0,2), [2,5), [5,8), [8,9).]

Tree type    | Bound type                                              | Rule(x)
kd-trees     | hyper-rectangle $\{b_{d,\min}, b_{d,\max}\}_{d=1}^{D}$  | $x_i \le s_i$ for $1 \le i \le D$, with $b_{d,\min} \le s_i \le b_{d,\max}$
metric trees | hyper-sphere $B(c, r)$, $c \in R^D$, $r > 0$            | $\|x - p_{\mathrm{left}}\| < \|x - p_{\mathrm{right}}\|$ for $p_{\mathrm{left}}, p_{\mathrm{right}} \in R^D$
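The two Rule(x) columns written out as predicates; the split value s and the pivots are hypothetical parameters chosen at tree-build time:

    import numpy as np

    def kd_rule(x, d, s):
        """kd-tree Rule(x): route x left iff its d-th coordinate is at most
        the split value s, where b_{d,min} <= s <= b_{d,max}."""
        return x[d] <= s

    def metric_rule(x, p_left, p_right):
        """Metric-tree Rule(x): route x toward the closer of the two pivots."""
        return np.linalg.norm(x - p_left) < np.linalg.norm(x - p_right)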
Contribution 4: Parallelization

[Figure: the task Q × R pre-divided into Q^L × R and Q^R × R, then Q^{L,L} × R, Q^{L,R} × R, Q^{R,L} × R, Q^{R,R} × R.]

Pre-divide and spawn off independent computations.
Queueing up Independent Tasks

[Figure: independent tasks, tagged by owning process 0-3, queued per process.]
Overall Strong Scaling

10-million-point subset of SDSS Data Release 6.

[Figure: parallel efficiency (0 to 1.0) versus core count (24, 96, 384, 1536) for tree building, tree walk, and computation.]

Raw numbers per core count (tree building, tree walk, computation): (13.52, 339.36, 2371), (7.41, 24.38, 244), (2.93, 2.78, 98.78), (1.10, 0.27, 39.51).
Analysis of Strong Scaling

Building a distributed multidimensional tree is different from building a geometrically constrained data structure such as an octree (communication costs in green):
$$O\left(\frac{DN}{p}\log\frac{N}{p}\right) + O\left(D t_w m_{\mathrm{bound}}\,(2p - \log 4p)\right) + O\left(\frac{2N(D + t_w)}{p}\log p\right) + O\left(\frac{t_s}{2}\log p\,(\log p + 3)\right)$$

• Requires multiple all-reduce operations, which hurt scalability.
• It is possible to trade the quality of the constructed tree for scalability.
GPR and KPCA

• Density estimation (KDE): $O(N^2)$ per parameter.
• Clustering (mean shift): $O(N^2)$ per iteration per parameter.
• Regression (GPR): $O(N^3)$ per parameter.
• Nonlinear feature extraction (KPCA): $O(N^3)$ per parameter.
Distributed Averaging

Each process maintains an estimate of the global average of the numbers in the network and iterates the difference equation
$$\mu_i(t+1) = \mu_i(t) + \sum_{j \in N_i} \left(\mu_j(t) - \mu_i(t)\right).$$
The state on each process converges to the average of the initial values on all nodes.
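A toy simulation of the iteration; the explicit step size is an assumption added so that the toy iteration is stable on this graph, and it is not part of the slide's equation:

    import numpy as np

    def distributed_averaging(mu0, neighbors, step, iters):
        """mu_i(t+1) = mu_i(t) + step * sum_{j in N_i} (mu_j(t) - mu_i(t))."""
        mu = np.asarray(mu0, dtype=float)
        for _ in range(iters):
            mu = mu + step * np.array(
                [sum(mu[j] - mu[i] for j in neighbors[i])
                 for i in range(len(mu))])
        return mu

    # Ring of four processes; every state converges to the average, 6.0.
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    print(distributed_averaging([0.0, 4.0, 8.0, 12.0], ring, step=0.3, iters=50))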
Distributed Averaging

A popular technique in the literature on algorithms for wireless sensor networks; a type of gossip-based algorithm.
Contribution 5: Distributed Averaging + Random Features

Use random features to linearize the problem, then apply distributed averaging to the localized averages (Chapter 9 of the thesis).
• Regression (GPR): $O(N^3)$ per parameter.
• Nonlinear feature extraction (KPCA): $O(N^3)$ per parameter.
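A sketch of the linearization step using random Fourier features [Rahimi and Recht 2007]; the feature count and bandwidth are illustrative, and the per-node second-moment matrix at the end is the quantity distributed averaging would combine:

    import numpy as np

    def random_fourier_features(X, num_features, h, rng):
        """z(x) with E[z(x)^T z(y)] = exp(-||x - y||^2 / (2 h^2))."""
        W = rng.normal(scale=1.0 / h, size=(X.shape[1], num_features))
        b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
        return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

    rng = np.random.default_rng(2)
    X_local = rng.normal(size=(1000, 10))   # one process's share of the data
    Z = random_fourier_features(X_local, num_features=64, h=1.0, rng=rng)
    # Local second-moment matrix of the features: a small D x D quantity
    # that distributed averaging turns into the global one; its leading
    # eigenvectors then give a decentralized, linearized KPCA.
    local_cov = Z.T @ Z / Z.shape[0]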
Distributed Averaging GPR

Plot of the relative error distribution between the centralized estimates and the decentralized estimates.

[Figure: min, lower quartile, mean, median, upper quartile, and max of the relative error versus iteration number (5 to 30), on a log scale from $10^{2}$ down to $10^{-10}$.]
Distributed KPCA: Strong Scaling

[Figure: strong scalability on 102,400 points of 784 dimensions; elapsed time versus number of cores for flat MPI and 2, 4, and 8 threads per MPI process.]
Distributed KPCA: Weak Scaling

[Figure: weak scalability at 100K points per core; total computation time versus number of cores for flat MPI and 2, 4, and 8 threads per MPI process.]
Conclusion

Summary of contributions:
• Parallel multidimensional trees.
• Algorithmic strategies adopted from a wide range of fields: distributed averaging (networking), random features (machine learning), series expansion (computational physics).
• Open-source contribution: MLPACK, more than 45K lines of contributed code, paving the way for the next generation.

"Utilizing the best general-dimension algorithms, approximation methods with error bounds, and distributed and shared memory parallelism can help scale kernel methods."
List of Other Research Work

Included in the thesis, but omitted in this presentation: fast nonparametric clustering (Chapter 6), with Ping Wang, James M. Rehg, Alexander G. Gray. AISTATS 2007.

$$\operatorname*{map}_{q \in Q} \frac{\sum_{r_j \in R} w_j r_j k(q, r_j)}{\sum_{r_j \in R} w_j k(q, r_j)}$$

[Figure: original and converged points of the mean shift iteration.]
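A sketch of one such update for a Gaussian kernel; the kernel choice is an assumption, since the formula above is kernel-generic:

    import numpy as np

    def mean_shift_step(Q, R, h, w=None):
        """Move each query to the kernel-weighted average of the references,
        per the map formula above."""
        w = np.ones(len(R)) if w is None else w
        d2 = ((Q[:, None, :] - R[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / (2.0 * h * h)) * w   # |Q| x |R| weighted kernel
        return (K @ R) / K.sum(axis=1, keepdims=True)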
List of Other Research Work

Included in the thesis, but omitted in this presentation: higher-order extension of the kernel summation (Chapter 7), with Arkadas Ozakin, Alexander G. Gray. Submitted to the Journal of Computational Physics.

[Figure: multibody interactions between a query node Q and reference nodes R1, R2, approximated through the centroids $c_Q$, $c_{R_1}$, $c_{R_2}$.]
Collaborators

• Professor Edmond Chow (Georgia Tech)
• Professor Alexander G. Gray (Georgia Tech)
• Kihwan Kim (NVIDIA Research)
• Professor Richard Vuduc (Georgia Tech)
• William March (Georgia Tech)
• Nishant Mehta (Georgia Tech)
• Hua Ouyang (Georgia Tech)
• Parikshit Ram (Georgia Tech)
• Ryan Riegel (Georgia Tech)
• Nikolaos Vasiloglou (Georgia Tech)
List of Other Collaborations

Published but not included in the thesis chapters:
• Run-time analysis of N-body problems: with Parikshit Ram, William B. March, Alexander G. Gray. NIPS 2009.
• Rank-approximate NN: with Parikshit Ram, Hua Ouyang, Alexander G. Gray. NIPS 2009.
• GPR for motion trajectory analysis: with Kihwan Kim, Irfan Essa. ICCV 2011 (post-proposal).
• Time-constrained NN: with Parikshit Ram, Alexander G. Gray. SDM 2012 (post-proposal).
• GPR for camera motion automation: with Kihwan Kim, Irfan Essa. CVPR 2012 (post-proposal).

Ongoing collaborations involving significant code transfer:
• Parallel n-point correlation: with William B. March et al.

Other collaborations omitted...
MLPACK: Open-source ML

NIPS 2008 Demonstration, NIPS 2011 Big Learning Workshop.
Pre-Proposal Publication List

• Dongryeol Lee, Alexander G. Gray, and Andrew W. Moore. Dual-Tree Fast Gauss Transforms. In: Advances in Neural Information Processing Systems, 2005.
• Dongryeol Lee and Alexander G. Gray. Faster Gaussian Summation: Theory and Experiment. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, 2006.
• Ping Wang, Dongryeol Lee, Alexander G. Gray, and James M. Rehg. Fast Mean Shift with Accurate and Stable Convergence. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
• Dongryeol Lee and Alexander G. Gray. Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method. In: Advances in Neural Information Processing Systems, 2008.
• Dongryeol Lee, Alexander G. Gray, and Andrew W. Moore. Dual-Tree Fast Gauss Transforms (arXiv).
• Parikshit Ram, Dongryeol Lee, Hua Ouyang, and Alexander G. Gray. Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions. In: Advances in Neural Information Processing Systems, 2009.
• Parikshit Ram, Dongryeol Lee, William B. March, and Alexander G. Gray. Linear-time Algorithms for Pairwise Statistical Problems. In: Advances in Neural Information Processing Systems, 2009. Spotlight presentation.
Post-Proposal Publication List

• Dongryeol Lee, Arkadas Ozakin, and Alexander G. Gray. Multibody Multipole Methods. In: Journal of Computational Physics, 2011.
• Kihwan Kim, Dongryeol Lee, and Irfan Essa. Gaussian Process Regression Flow for Analysis of Motion Trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, 2011.
• William B. March, Arkadas Ozakin, Dongryeol Lee, Ryan Riegel, and Alexander G. Gray. Multi-Tree Algorithms for Large-Scale Astrostatistics. In: Advances in Machine Learning and Data Mining for Astronomy, Chapman and Hall/CRC Press, 2012.
• Dongryeol Lee, Richard Vuduc, and Alexander G. Gray. A Distributed Kernel Summation Framework for General-Dimension Machine Learning. To appear in: SIAM International Conference on Data Mining, 2012. Best Paper Award.
• Parikshit Ram, Dongryeol Lee, and Alexander G. Gray. Nearest-Neighbor Search on a Time Budget via Max-Margin Trees. To appear in: SIAM International Conference on Data Mining, 2012.
• Kihwan Kim, Dongryeol Lee, and Irfan Essa. Detecting Regions of Interest in Dynamic Scenes for Camera Motion. To appear in: IEEE Conference on Computer Vision and Pattern Recognition, 2012.
• William B. March, Kent Czechowski, Marat Dukhan, Thomas Benson, Dongryeol Lee, Andy J. Connolly, Richard Vuduc, Edmond Chow, and Alexander G. Gray. Optimizing the Computation of N-Point Correlations on Large-Scale Astronomical Data. To appear in: Proceedings of the ACM/IEEE Conference on Supercomputing (SC), 2012.
• Nishant A. Mehta, Dongryeol Lee, and Alexander G. Gray. Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL. To appear in: Advances in Neural Information Processing Systems, 2012.