Widura Schwittek, Stefan Eicker
Paper presentation at the CBSE 2013 conference, Vancouver, CA, 06/18/2013
© paluno
A Study on Third Party Component Reuse in Java Enterprise Open Source Software
Agenda
• Background
• Study implementation
• Study results
• Threats to validity
• Wrap up & Future Work
Background: Why study third party component reuse?
• Known from practice: third party components (TPCs) are heavily reused in industry projects
• Because of [2]
  • reduced costs,
  • faster time-to-market, and
  • better software quality
• Some say: TPC reuse is a key success factor in software development [1]
[1] Gartner. 2008. The Evolving Open-source Software Model. Predicts from December 2008.
[2] Li, J., Conradi, R., Bunse, C., Torchiano, M., Slyngstad, O. P. N., and Morisio, M. 2009. Development with Off-the-Shelf Components: 10 Facts. IEEE Softw. 26, 2, 80–87.
Background: A few studies exist on third party component reuse (excerpt)
• Study 1 [3]
  • Sample: 20 heterogeneous Open Source Java projects
• Study 2 [4]
  • Sample: 106 curated heterogeneous Open Source Java projects + 178 proprietary Java-based systems
• Study 3 [10]
  • Sample: apps in the Android Market
Many applications heavily rely on third party components
[3] Heinemann, L. et al. On the Extent and Nature of Software Reuse in Open Source Java Projects. In Top productivity through software reuse (International Conference on Software Reuse) 6727. Springer, Berlin, 207–222.
[4] Raemaekers, S. et al. 2012. An Analysis of Dependence on Third-party Libraries in Open Source and Proprietary Systems. In Proceedings of the Sixth International Workshop on Software Quality and Maintainability (SQM 2012).
[10] Ruiz, I. J. M., et al. 2012. Understanding reuse in the Android Market. In 2012 20th IEEE International Conference on Program Comprehension (ICPC). Proceedings; June 11-13, 2012, Passau, Germany, IEEE, 113–122.
Background: Our study goals
• Contribute further empirical data to software reuse research
• Be specific to one application type: Enterprise web applications
• Build a basis for further research (explorative study)
Study implementation: Sample Selection
• Enterprise Open Source web applications
  • "Open Source" because it is readily available
  • "Enterprise" because we are interested in high quality software
  • "web applications" because of the third party component "jungle"
• 36 projects selected based on an internet survey
  • Enterprise domain such as Business Intelligence, Knowledge Management, Content Management
  • Popularity: used by big companies (testimonials, references etc.), large download rate
  • Still actively developed
Study implementation: Retrieval of reuse data
Step 1
• Download and extract the WAR file
• Locate third party component artifacts
Step 2
• Identify third party components from artifacts
• Use available meta-data (e.g. from a Maven repository) to support identification
Step 3
• Remove artifacts that do not belong to a third party component
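The first retrieval step can be sketched as follows. This is a minimal illustration, not the tool used in the study: it assumes that third party artifacts are the JAR files under `WEB-INF/lib/` of the WAR archive (the standard location for bundled libraries), and the function name `list_war_artifacts` and the demo archive contents are hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def list_war_artifacts(war):
    """Return the file names of all JARs under WEB-INF/lib/ in a WAR archive.

    `war` may be a file system path or a binary file-like object.
    """
    with zipfile.ZipFile(war) as archive:
        entries = archive.namelist()
    return sorted(
        PurePosixPath(entry).name
        for entry in entries
        if entry.startswith("WEB-INF/lib/") and entry.endswith(".jar")
    )


# Hypothetical in-memory WAR, used only for demonstration.
demo_war = io.BytesIO()
with zipfile.ZipFile(demo_war, "w") as war:
    war.writestr("WEB-INF/web.xml", "<web-app/>")
    war.writestr("WEB-INF/lib/slf4j-api-1.7.2.jar", b"")
    war.writestr("WEB-INF/lib/commons-io-2.4.jar", b"")
    war.writestr("WEB-INF/classes/app.properties", "")

print(list_war_artifacts(demo_war))
```

The resulting list of artifact file names is the raw input for the identification and filtering steps; application-specific JARs would still have to be removed afterwards.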
Study implementation: Retrieval of reuse data
• (Semi-)automatic approach has advantages
  • Retrieval is repeatable
  • Large number of projects can be analyzed
• Major challenge: mapping artifacts to third party components
  • Mapping table

Artifact name   | Third party component name
----------------|---------------------------
icu4j           | com.ibm.icu
icu4j-3_8_1     | com.ibm.icu
icu4j-charsets  | com.ibm.icu
icu4j-localespi | com.ibm.icu
slf4j-api       | slf4j
slf4j-jcl       | slf4j
slf4j-log4j12   | slf4j

The icu4j artifacts together form the ICU third party component; the slf4j artifacts together form the SLF4J third party component.
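How such a mapping table might be applied can be sketched as below. The mapping entries are taken from the slide; the exact-match lookup, the function name `group_by_component`, and the handling of unmapped artifacts are assumptions for illustration and not necessarily how the study's tool works.

```python
from collections import defaultdict

# Mapping table from the slide: artifact name -> third party component name.
MAPPING = {
    "icu4j": "com.ibm.icu",
    "icu4j-3_8_1": "com.ibm.icu",
    "icu4j-charsets": "com.ibm.icu",
    "icu4j-localespi": "com.ibm.icu",
    "slf4j-api": "slf4j",
    "slf4j-jcl": "slf4j",
    "slf4j-log4j12": "slf4j",
}


def group_by_component(artifact_names, mapping=MAPPING):
    """Group artifacts by component; collect names without a mapping."""
    components = defaultdict(set)
    unmapped = []
    for name in artifact_names:
        component = mapping.get(name)
        if component is None:
            # No mapping yet: leave for manual review.
            unmapped.append(name)
        else:
            components[component].add(name)
    return dict(components), unmapped


components, unmapped = group_by_component(
    ["icu4j", "slf4j-api", "slf4j-log4j12", "some-unknown-lib"]
)
print(components)
print(unmapped)
```

Keeping unmapped artifacts in a separate list makes the manual part of the process visible: each entry either gets a new mapping or is discarded as not belonging to a third party component.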
Study results: What have we found out?
• 36 web applications reuse 3311 unique artifacts
• The 3311 unique artifacts could be mapped to 651 third party components
  • Using 863 mappings
• On average, 70 components were reused per web application
  • Ranging from 16 to 161
Study results: What have we found out?
• Which of the analyzed applications relied on the highest number of TPCs? Alfresco

Web Application         | Version  | Domain | # of Comp.
------------------------|----------|--------|-----------
Alfresco                | 4.2c     | CM     | 161
Liferay Portal CE       | 6.1.1GA2 | CM     | 144
XWiki Enterprise        | 4.5RC1   | SM     | 130
dotCMS                  | 2.2      | CM     | 127
Pentaho CE              | 4.8.0    | BI     | 118
openKM                  | 6.2.2    | KM     | 97
jallInOne SOA           | 2.8.2    | ERP    | 91
Kuali People Management | 1.2.2    | HR     | 89
Kuali Coeus             | 5.0.1    | Other  | 89
Kuali Student           | 2.0.0M5  | Other  | 87
openCMS                 | 8.5.1    | CM     | 83
openOLAT                | 8.3.3    | LM     | 79
logicalDOC              | 6.6.1    | CM     | 75
Magnolia                | 4.5.7    | CM     | 71
Jenkins                 | 1.501    | SD     | 70
Ametys                  | 3.4.0    | CM     | 70
DSpace XMLUI            | 3.1      | CM     | 69
Hipergate               | 6.0.0RC1 | CRM    | 68
Kuali Mobility          | 2.0      | Other  | 66
hippoCMS                | 7.7.0    | CM     | 61
openWGA CE              | 6.0.7    | CM     | 60
DSpace JSPUI            | 3.1      | CM     | 60
Tntnconcept             | 0.21.16  | Other  | 55
Nexus                   | 2.3.1_01 | SD     | 52
Bonita Open Solutions   | 5.9.1    | BPM    | 49
JRoller                 | 5.0.1    | SM     | 44
hippoCMS site           | 7.7.0    | CM     | 42
Daisy                   | 2.4.2    | CM     | 41
Walrus CMS              | 1.5      | CM     | 41
Ametys site             | 3.4.0    | CM     | 37
jallInOne               | 2.8.2    | ERP    | 29
vosaoCMS                | 0.9.14   | CM     | 28
Jamwiki                 | 1.2.4    | SM     | 27
Agorum                  | 7.0.4    | CM     | 16
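The per-application counts can be summarized with a few lines of code. The sketch below uses only the component counts listed on this slide; the variable names are illustrative. Note that the slide lists 34 rows while the study covers 36 applications, so the mean computed here need not match the reported average exactly.

```python
from statistics import mean

# "# of Comp." values as listed on the slide (34 rows).
component_counts = [
    161, 144, 130, 127, 118, 97, 91, 89, 89, 87, 83, 79, 75, 71, 70, 70, 69,
    68, 66, 61, 60, 60, 55, 52, 49, 44, 42, 41, 41, 37, 29, 28, 27, 16,
]

summary = {
    "applications": len(component_counts),
    "min": min(component_counts),
    "max": max(component_counts),
    "mean": round(mean(component_counts), 1),
}
print(summary)
```

The minimum and maximum agree with the reported range of 16 to 161, and the mean of the listed rows is roughly 71, close to the reported average of 70.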
Study results: What have we found out?
• Which of the analyzed applications relied on the highest number of TPCs?

Web Application         | Version  | Domain | # of Comp.
------------------------|----------|--------|-----------
Alfresco                | 4.2c     | CM     | 161
Liferay Portal CE       | 6.1.1GA2 | CM     | 144
XWiki Enterprise        | 4.5RC1   | SM     | 130
dotCMS                  | 2.2      | CM     | 127
Pentaho CE              | 4.8.0    | BI     | 118
openKM                  | 6.2.2    | KM     | 97
jallInOne SOA           | 2.8.2    | ERP    | 91
Kuali People Management | 1.2.2    | HR     | 89
Kuali Coeus             | 5.0.1    | Other  | 89
Kuali Student           | 2.0.0M5  | Other  | 87
openCMS                 | 8.5.1    | CM     | 83
Study results: What have we found out?
• Why does Alfresco CMS use 161 TPCs?
  • 25% document processing such as PDF, Office
  • 21% XML processing and WS
  • 16% accessing external services such as Google Docs, Facebook, Twitter and SlideShare
Study results: What have we found out?
• Which TPC was reused most (top 40)?

Library                  | # reused
-------------------------|---------
commons-collections      | 34
commons-codec            | 34
httpcomponents           | 32
commons-lang             | 32
commons-beanutils        | 32
commons-io               | 31
commons-fileupload       | 29
commons-logging          | 28
dom4j                    | 26
commons-digester         | 25
org.apache.log4j         | 25
slf4j                    | 25
org.apache.xerces        | 25
org.jdom                 | 25
org.apache.lucene        | 24
commons-pool             | 23
net.sf.ehcache           | 23
javax.mail               | 22
org.apache.oro           | 22
commons-dbcp             | 21
org.springframework      | 21
org.apache.poi           | 21
aopalliance              | 20
org.antlr                | 19
org.objectweb.asm        | 19
org.codehaus.woodstox    | 19
org.bouncycastle         | 19
javax.xml.xml-apis       | 18
commons-compress         | 18
net.java.dev.rome        | 18
xpp                      | 17
geronimo.specs           | 17
hibernate                | 16
net.sf.cglib             | 15
com.thoughtworks.xstream | 15
org.apache.xalan         | 15
jaxen                    | 15
stax                     | 15
net.sourceforge.nekohtml | 14
org.apache.pdfbox        | 14

• 19 Apache Foundation TPCs
• 9 XML processing libs
• Building blocks: caching, web, ORM, crypto, RSS, search
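A ranking like the one above can be derived by counting, for each component, the number of applications that include it. The sketch below assumes a mapping from application name to its set of components; the function name `reuse_counts` and the three sample applications are hypothetical and not data from the study.

```python
from collections import Counter


def reuse_counts(components_per_app):
    """Count in how many applications each component occurs."""
    counts = Counter()
    for components in components_per_app.values():
        # Use a set so a component counts at most once per application.
        counts.update(set(components))
    return counts


# Hypothetical sample data for demonstration only.
sample = {
    "app-a": {"commons-io", "slf4j", "dom4j"},
    "app-b": {"commons-io", "slf4j"},
    "app-c": {"commons-io"},
}

for component, count in reuse_counts(sample).most_common():
    print(component, count)
```

Counting per application rather than per artifact matters here: a component that ships as several artifacts in one application should still contribute only one reuse.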
Study results: What are the resulting follow-up questions?
• 70 components per web application on average
  • How can this be managed effectively with regard to updates, especially security fixes?
  • How risky is it not to keep track?
• Some TPCs are used often, some less; some web applications use more TPCs than others
  • Why is this the case? Can patterns be identified?
  • Can this knowledge contribute to the component identification process in other projects?
Threats to validity: Internal validity
• Manual input
  • Grouping artifacts to TPCs
  • Removing non-TPC artifacts
→ Making manual input explicit and traceable
Threats to validity: External validity
• Small sample size
  • Drawing general conclusions on software reuse is not possible
  • But we get a good impression of TPC reuse in Java-based Open Source web applications
• Preselection is biased by the authors
→ Extend the study (see next chapter)
Wrap up
• Conducted a study on TPC reuse in Java-based Enterprise Open Source web applications
• Developed a tool to generate the data, which supported the survey
• The study results show that reuse happens, supporting other studies and practitioners' impressions
• Further questions emerged
  • How to cope with huge numbers of TPCs?
  • Can patterns in the data support the identification of TPCs?
Future Work
• Address the new questions
  • Web platform for reuse documentation/architectural documentation
  • Recommender system to support TPC identification
• Extend/replicate the study to other platforms, other application types, and commercial software
Thanks for your attention! Feel free to ask questions!
Widura Schwittek
[email protected] http://www.paluno.de