Open Source Certification

Simon Pickin¹ and Peter T. Breuer²

¹ Depto. Telemática, Universidad Carlos III, Leganés, Madrid 28911, SPAIN
  [email protected]
² School of Computing, University of Birmingham, Birmingham B15 2TT, UK
  [email protected]
Summary. In this short paper we suggest an external service operating independently as the likely central medium for the certification of open source code.
1 Discussion

As authors of an analysis suite [6, 7, 8, 9, 10, 11, 12, 13] which has been used to scan the Linux kernel source for vulnerabilities, and as authors for many years of open source [23, 21] software projects with provenance both within [4, 5] the Linux kernel [24] and without [3], we have experience of the potential for, and the difficulties inherent in, the certification of open source software. The most noticeable, and possibly the distinguishing, characteristic of a successful open source software project from a purely statistical point of view is:

• The rapid turnover and diverse origins of its code.
Code is typically offered to a vibrant project by many authors, then criticised, revised or assembled by one of a few trusted maintainers or developers, before being “officially” incorporated at that stage. It is then subjected over its lifetime to test, criticism, and review by the many contributors and helpers who associate with the project. The code is “open”, as the name “open source” suggests, so it can be scanned by all who care to do so. That openness is inherently unidirectional, however. All who may care to contribute patches which solve a problem of their own need or finding can attempt to do so, but their contribution waits on the pleasure of the project’s approved maintainers and developers as to whether it be rejected, accepted, or modified. The turnover of code is only uncontrolled, therefore, in the sense that there is, and can be, no particular development environment imposed on the many authors and contributors. Control always exists at least insofar as changes in the source code are recorded and documented via change logs in the source code control system used by the maintainers and developers, usually CVS
[14] or Subversion, and the reasons for those changes are documented in the extensive online – and open – discussions on mailing lists, wikis, and discussion boards which accompany the project’s development and maintenance. Even in a project with only a single main author plus a client class of users, the development will be documented by the history of changes recorded in the frequent public releases that must accompany a successful open source project. However, the rapid turnover implies that code-based certification is inherently a difficult prospect, because:

• Only a single version of the code, from a single date and time, can realistically be certified within a substantial time period, say a year.
The project code base moves on inexorably in the meantime, so there is at once a major question over even the usefulness of certification, given such a constraint. An active project may have a release cycle of about a month at most, so at most one in twelve releases would ever be certified, and the latest certified version would on average be six months and six releases (and six known vulnerabilities, perhaps) out of date. There may be a use in certifying all the code at a specific version issued with a specific operating system distribution of Linux, for example, if it were possible (there are hundreds of millions of lines of code in a single distribution, in many different languages, with many different compilers used), but there are also hundreds of different distributions. Concentrating on a single “hardened” distribution and certifying its code as being free from the imperfections detected by the certification process may have value to a particular, and possibly rarefied, market, but normal users in general will move on from that distribution as soon as it becomes outdated, which will be more or less immediately. Moreover, new vulnerabilities, unknown at the time of certification, will be corrected in later versions of the software, which renders the meaningfulness and utility of continuing to use the old but certified software moot – it will be certified only as having been free of the vulnerabilities known at the time of certification, and the fact of the matter will be that it is vulnerable to newly developed attacks. Certification must not only guard against:

• The inadvertent introduction of errors that are not easily found in code reviews or testing,

but also against:

• The deliberate, surreptitious insertion of back doors into public domain software.
It should be admitted that there is indeed some opportunity for the latter to occur, at least in principle. Since the vigilance and skill of the approved developers and maintainers is the first and major defense against such attempts, it is conceivable that code from a habitual contributor containing a disguised malware section that is difficult to understand might be taken in
on trust by a maintainer, in view of a previous exemplary record. The good faith of the maintainer also has to be taken on trust. In practice, it is unlikely that any well-intentioned maintainer or developer would permit such a thing to occur, because obscurity or bulk in a code contribution is in itself a reason for rejecting it. Code contributions to the Linux kernel are frequently rejected on those grounds alone by Linus Torvalds, for example. Nevertheless, our analysis suite has in the past detected problems which had lain undetected in the kernel code for years at a time (in the sound driver layer in particular), in plain view of all. On the face of it, those problems went undetected because nobody reviewed the code apart from the original maintainer, who missed the problem, having inadvertently created it in the first instance. The contributions in question were so massive (the sound system as a whole, in this case) that nobody apparently had the stamina or desire to review all of it, perhaps not even the author and maintainer. Thus it appears clear that a badly intentioned but trusted developer or maintainer may succeed in injecting malware into an open source project despite the public scrutiny, and that it may remain undetected there for some considerable time. As a matter of interest, some of the particular problems detected by our analysis in the Linux kernel were averred to have been caused by global “find and replace” editing operations, when a change in the lower level kernel interfaces mandated a change at a higher level. It should be noted that maintenance based on syntax alone is not generally successful when subtle changes of semantics are involved, although syntactic changes may form an important part of the necessary maintenance, and, for that reason, experienced maintainers generally try to accompany a semantic change with a syntactic one:
changing the name of a record field, for example, so that attempted uses of the old field show up as syntactic errors. Less experienced maintainers may not know to follow that kind of procedure. At any rate, if a maintainer or habitual contributor is subverted, perhaps by bribery, or else is deliberately badly intentioned, it would be possible to embed a back door into open source software. Against that, the only apparent defense is:

• An automated scan of the source code, repeated at each issue of the code, or even more frequently, by a trusted outside agency.
That is essentially what we propose as a certification mechanism. An alternative methodology would be to certify the process for the production of the code instead of the code produced by the process. This is a mechanism adopted in the proprietary software industry, where ISO and other standards regulate the certification of the production process. We do not believe it to be applicable to the open source production process, however, because the open source raison d’être is fundamentally liberty, and regulation is a curtailment of liberty. It may be possible to present a certification process to authors as a thing to be desired, in terms of giving the software more cachet, but it will be impossible to impose. It is difficult to conceive of a mode in which process-oriented certification could operate successfully in the context of open source software. What we propose is an:

• independent service
The service would be voluntary at first – just another open source project. It would guarantee to scan, say, the Linux kernel source for known classes of code defects which may be utilised as vulnerabilities by malware. These scans should be:

• semantically based

but in practical terms their internal mechanisms may tend towards the syntactic, since syntactic analysis is much faster and scales better. When a malware attack becomes known and its modus operandi analysed, that usually leads to a pattern of use in the code that can be looked for by an automated tool. Such a pattern may also include semantic elements: for example, that a file handle is possibly accessed again by the code after the file itself has been closed. In such cases a semantic analysis is the only kind of analysis that may in principle uncover all occurrences of the problem. The Coverity checker [17] is to some extent already being used in this capacity by its creators. Coverity is a commercial tool based on an extensible version of the open source standard C compiler, gcc. Its operators add specialist domain-based knowledge to their “compiler” in order to allow it to detect coding problems which have their origins in higher level constructs than lines of code. A typical piece of domain knowledge might be, for example, that the standard C library call “close” has the effect of closing a file handle, or that the C library call “malloc” provides a pointer to a chunk of memory on the heap. Reference to the file handle after “close” and before “open” or “reopen”, etc., is illegal (and offers an attack vector against the software). Coverity itself is proprietary, and its innards are not accessible to review; for that reason we have to guess at how it works in detail, which may be unacceptable as part of an open source certification process. However, the model of interaction that they are following with the Linux project is otherwise what we suggest as part of a viable certification process, because they have built up a level of trust despite the closed nature of their own code.
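To illustrate the gap between the two approaches, a purely syntactic check for the use-after-close pattern just mentioned can be sketched in a few lines. The scanner, the C fragment, and the pattern below are toy assumptions of our own, not the mechanism of Coverity or of our own suite:

```python
import re

# Toy C fragment containing the defect pattern: a file descriptor is
# read again after the close() call.
CODE = """\
int fd = open("/etc/passwd", O_RDONLY);
read(fd, buf, sizeof buf);
close(fd);
read(fd, buf, sizeof buf);   /* defect: use after close */
"""

def scan_use_after_close(source):
    """Report (line, handle, closed-at-line) for handles used after close().

    Purely textual: it does not track reopening, aliasing, or control
    flow, so e.g. a later `fd = open(...)` would still be flagged.
    Only a semantic analysis can rule such false positives out.
    """
    closed = {}      # handle name -> line number of its close() call
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = re.search(r"\bclose\(\s*(\w+)\s*\)", line)
        if m:
            closed[m.group(1)] = lineno
            continue
        for handle, closed_at in sorted(closed.items()):
            if re.search(r"\b%s\b" % re.escape(handle), line):
                findings.append((lineno, handle, closed_at))
    return findings

for lineno, handle, closed_at in scan_use_after_close(CODE):
    print("line %d: '%s' used after close on line %d"
          % (lineno, handle, closed_at))
# -> line 4: 'fd' used after close on line 3
```

The speed of such a scan is exactly why practical tools lean syntactic; the docstring notes the false positives that only semantics can remove.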
“Who watches the watchers” is a maxim that ought to be taken into account, and the certification procedure should both be open and be conducted, insofar as it is automated, using open source software tools. Indeed, the certification process should be bundled into a:

• software download

so that potential certificatees can run the toolset over their code to try to ensure that it will pass. That should not compromise the certification process itself, because its product must in the end merely consist of a certain version of the certification process having been applied to a certain version of the
certified software, and our vision is that source code can be checked against a given version of the certification suite by any interested party, to see that it does indeed pass, as the author of the software may have claimed. Semantically based analysis should render the certification process proof against being fooled by changes of variable name and code structure that might successfully gammon a syntactic analysis. But can semantic analysis also be fooled by some kinds of code changes? In principle, splitting functions up into, for example, different parallel threads which each do a portion of the intended work ought to make the intention of the code impossible to divine. Parallel threads working on a common area of memory may in principle do anything at all. Therefore the certification process should enforce:

• coding standards

which forbid that kind of software design, at least on the grounds that it renders analysis effectively impossible. A digital signature for the tested software and for the testing process software may be kept centrally at the certification site, and any interested party may choose to check merely that the certified software matches the signature stored (as does the certifying software). A single signature may be used jointly in the software itself – that is, the certificate for the target software may have been affixed to the target source code already signed with the signature of the testing suite. Alternatively again, the author of the code may sign his own code with a signature obtained from the central authority that says that the software has been tested and certified, thus making the source code carry its own certification. It is possible that issued certificates could be revoked when insufficiencies in the testing process itself are discovered, requiring authors to recertify, but that hardly seems a good idea, as all certifications that are not absolutely up to date should be recognized as being possibly misleading. Nevertheless, the idea of a certificate that has a fixed time to live, say one month, is a possibility. It would require old software to be continually recertified via an:

• ongoing recertification process
in which a certification manager on the user’s machine connected to the net and obtained new certificates for the (digital signatures of) its existing software. But could any central certification process handle the huge volume of recertification requests for old and different versions of the multitudinous projects in existence? Not without help. But:

• trusted remote computing
may allow the certification process to be run on a remote computer, in complete confidence. That would permit certification to use the computing resources of the interested party and the public at large in order to perform the certification itself, which is naturally computationally intensive.
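The signature-matching check described above might be sketched as follows, using bare SHA-256 digests as a stand-in for real public-key signatures. The registry, project names, and function names are illustrative assumptions, not a specification of the proposed service:

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 hex digest; a stand-in for a real digital signature."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical central registry kept at the certification site:
# (project, version) -> digests of the certified source tree and of
# the version of the test suite that certified it.
CENTRAL_REGISTRY = {
    ("example-project", "1.0"): {
        "source": digest(b"example source tree, version 1.0"),
        "suite": digest(b"certification suite, version 0.3"),
    },
}

def check_certified(project, version, source_bytes, suite_bytes):
    """Return True iff the local copies match the stored digests,
    i.e. this exact source was certified by this exact suite."""
    entry = CENTRAL_REGISTRY.get((project, version))
    if entry is None:
        return False
    return (digest(source_bytes) == entry["source"]
            and digest(suite_bytes) == entry["suite"])

print(check_certified("example-project", "1.0",
                      b"example source tree, version 1.0",
                      b"certification suite, version 0.3"))  # pristine copy
print(check_certified("example-project", "1.0",
                      b"example source tree, version 1.0 + patch",
                      b"certification suite, version 0.3"))  # modified copy
```

Any interested party can run this check locally; a time-to-live certificate would simply add an expiry date to each registry entry.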
In trusted remote computing (unlike what is meant by the industry term “TC”, which nominally stands for “trusted computing” but really means the hardware-mediated refusal of a computer to run code unsigned by the operating system manufacturer), the memory and processor of a remote computer are used to carry out a computation in complete security, safe from observation and tampering by the owner of the machine. This is possible either using special memory hardware (which encrypts its inputs and outputs) or using purely software transformational techniques (essentially emulation of an encrypted state machine, which is slower). This technology, when and if it arrives, will make the certification process envisaged practical on a large scale. In the meantime, voluntary scans by a central authority of important open source software systems such as the Linux kernel, the Apache web server, the Firefox browser, the OpenOffice office suite, etc. would serve as a path to engender trust in, and familiarity with, a certification process. Importantly also, it might help generate contributions from the user community at large towards the tests incorporated in the certification process, something which has not occurred so far, at least in the work Coverity is conducting in connection with Linux kernel development. There it has been noticeable that Coverity’s tests, though numerous and ingenious, seem to have been generated by the Coverity organisation, not by the kernel community. That may be because the Coverity code is inaccessible (both in the sense of being physically unavailable, and presumably in being of the type that ordinary users and authors would be unlikely to want to become adept in, because of its limited applicability), something which may be remedied by an open source testing procedure at the heart of an open source certification process.
However, whatever the merits and drawbacks of the Coverity tool, the success of Coverity’s Scan project, to which 265 open source projects have so far signed up [18], does serve to illustrate that open source developers are indeed interested in improving, and then advertising, the quality of their products, so that if a solid certification process can be developed, motivation to participate in it should not be lacking.
2 Conclusion

New technological advances, such as trusted remote computing, may be necessary to make open source certification practical in the face of the rapid turnover inherent in the open source development process. Given that codicil, an independent agency, itself operating an open source, semantically based certification process, appears to be a possible answer to the question of what an open source certification process should consist of at its heart. Quite what a certificate itself should look like is a technical question that can be left till later, but it should certify a given version of the software against a given version of the testing process, and it may have a short time to live built in.
References

1. T. Ball and S. K. Rajamani, “The SLAM project: Debugging system software via static analysis,” in Proc. POPL ’02: Proceedings of the ACM SIGPLAN-SIGACT Conference on Principles of Programming Languages, 2002.
2. Valeria Bertacco, “Scalable Hardware Verification with Symbolic Simulation”, Springer USA, 2006.
3. P.T. Breuer and J.P. Bowen, “A PREttier Compiler-Compiler: Generating higher order parsers in C,” Software — Practice & Experience, 25(11):1263–1297, November 1995.
4. P.T. Breuer, A. Marín Lopez, A. García, “The Network Block Device,” The Linux Journal, #73, May 2000. http://www2.linuxjournal.com/lj-issues/issue73/3778.html
5. Peter T. Breuer and Arne Wiebalck, “Intelligent Networked Software RAID,” in Proc. IASTED Intl. Conference on Parallel and Distributed Computing and Networks (PDCN 2005), 23rd IASTED Intl. Multi-Conference on Applied Informatics, Eds. T. Fahringer, M.H. Hamza (ISBN 0-88986-468-3), pp. 517–522, Innsbruck, Austria, ACTA Press, February 2005.
6. P. T. Breuer and M. García Valls, “Static deadlock detection in the Linux kernel,” in Proc. Reliable Software Technologies – Ada-Europe 2004, 9th Ada-Europe International Conference on Reliable Software Technologies, ser. LNCS, A. Llamosí and A. Strohmeier, Eds., no. 3063, Springer, June 2004, pp. 52–64.
7. P. T. Breuer and S. Pickin, “One million (LOC) and counting: Static analysis for errors and vulnerabilities in the Linux kernel source code,” in Proc. Reliable Software Technologies – Ada-Europe 2006, 11th Ada-Europe International Conference on Reliable Software Technologies, ser. LNCS, L. M. Pinho and M. G. Harbour, Eds., no. 4006, Springer, June 2006, pp. 56–70.
8. P. T. Breuer and S. Pickin, “Symbolic Approximation: An Approach to Verification in the Large,” Innovations in Systems and Software Engineering – A NASA Journal, Oct. 2006, Springer, London. Online version: http://www.springerlink.com/content/b36u3k3822u4t614
9. P. T. Breuer and S. Pickin, “Abstract Interpretation meets Model Checking near the 10^6 LOC mark,” in Proc. 5th International Workshop on Automated Verification of Infinite-State Systems (AVIS’06), Vienna, Austria, 1 April 2006 (final version to appear in Elsevier Electronic Notes in TCS), Eds. Ramesh Bharadwa and Supratik Mukhopadhyay.
10. P. T. Breuer and S. Pickin, “Detecting Deadlock, Double-Free and Other Abuses in a Million Lines of Linux Kernel Source,” in Proc. 30th Annual IEEE/NASA Software Engineering Workshop (SEW-30), Loyola College Graduate Center, Columbia, MD, USA, 25–27 April 2006.
11. P. T. Breuer and S. Pickin, “Checking for Deadlock, Double-Free and Other Abuses in the Linux Kernel Source Code,” pp. 765–772 in Part IV, Proc. Workshop on Computational Science in Software Engineering (CSSE’06), Reading, UK, 28–31 May 2006, Eds. V. Alexandrov, G. Dick van Albada, P. M. A. Sloot, J. Dongarra, Springer LNCS 3994, Oxford, ISBN 3-540-34385-7.
12. S. Pickin and P. T. Breuer, “Symbolic Approximation: A Technique and a Tool for Verification in the Large,” pp. 97–104 in Vol. 2, Proc. 2nd International Conference on Intelligent Computer Communication and Processing Engineering (ICCP 2006), 1–2 September 2006, TU of Cluj-Napoca, Romania, Ed. Ioan Alfred Letia, U. T. Pres, Cluj-Napoca, 2006.
13. P. T. Breuer and S. Pickin, “Verification in the Large via Symbolic Approximation,” to appear in Proc. 2nd International Symposium on Leveraging Applications of Formal Methods, Verification and Validation (ISOLA 2006), 15–19 November 2006, Paphos, Cyprus.
14. Per Cederqvist et al., “The CVS manual — Version Management with CVS”, Network Theory Ltd.
15. S. Chaki, E. Clarke, A. Groce, S. Jha and H. Veith, “Modular verification of software components in C,” in Proc. International Conference on Software Engineering, May 2003, pp. 385–389.
16. P. Cousot and R. Cousot, “Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints,” in Proc. 4th ACM Symposium on the Principles of Programming Languages, 1977, pp. 238–252.
17. D. Engler, B. Chelf, A. Chou and S. Hallem, “Checking System Rules Using System-Specific, Programmer-Written Compiler Extensions,” in Proc. 4th Symposium on Operating System Design and Implementation (OSDI 2000), Oct. 2000, pp. 1–16.
18. http://www.scan.coverity.com/rungAll.html, accessed 15/07/08.
19. J. S. Foster, M. Fähndrich, and A. Aiken, “A theory of type qualifiers,” in Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’99), May 1999.
20. J. S. Foster, T. Terauchi, and A. Aiken, “Flow-sensitive type qualifiers,” in Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’02), June 2002, pp. 1–12.
21. S. McConnell, “Open-source methodology: ready for prime time?”, IEEE Software, 16(4), pp. 6–8, 1999.
22. A. Mockus, R.T. Fielding and J. Herbsleb, “A case study of open source software development: the Apache server,” in Proc. 22nd International Conference on Software Engineering (ICSE 2000), pp. 263–272, Limerick, Ireland, ACM Press, 2000.
23. E.S. Raymond, The Cathedral and the Bazaar, Cambridge: O’Reilly & Associates, 1999.
24. A. Rubini, Linux Device Drivers, O’Reilly, Sebastopol CA, Feb. 1998.
25. Hao Chen, D. Dean and D. Wagner, “Model checking one million lines of C code,” in Proc. 11th Annual Network and Distributed System Security Symposium, Feb. 2004.
26. R. Johnson and D. Wagner, “Finding user/kernel pointer bugs with type inference,” in Proc. 13th USENIX Security Symposium, Aug. 2004.
27. D. Wagner, J. S. Foster, E. A. Brewer, and A. Aiken, “A first step towards automated detection of buffer overrun vulnerabilities,” in Proc. Network and Distributed System Security (NDSS) Symposium 2000, Feb. 2000.