Hiding Signatures in Variable Names
Yinjie Su, Jiahui Liu, and Dong Li
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
{suyinjie2011,hitljh}@pact518.hit.edu.cn, [email protected]
Abstract. As software technology develops, software copyright is increasingly important, and one aspect of it is the copyright of the source code. This paper proposes a new algorithm named HSVN (hiding signatures in variable names) to hide a copyright signature, or watermark, in source code. It is a static watermarking scheme. The basic idea of HSVN is to add signature bytes to variable names that are located by a random sequence; the difference between two adjacent characters of the chosen variable name is the hidden signature byte. HSVN hides the signature easily and inconspicuously, and it can hide a large amount of information while adding little redundancy to the program.
Keywords: watermark, signature, random sequence, variable name.
1 Introduction
In recent years, many products have spread over the Internet in electronic form, such as sound, images, documents and software. Copyright protection of these digital products has become a research focus, and several digital watermarking techniques have been studied over the past years to address it [1]. When applied to software (source code, intermediate code, executable files, etc.), a digital watermark is called a software watermark [2]. Source code is an important asset of software companies, so protecting it and proving its copyright are essential. Software watermarking provides such a guarantee [3,4,5,6].
According to the embedding position and method, software watermarks are divided into static and dynamic ones. A static watermark embeds the authentication information into the code; most source code watermarking techniques are static. A dynamic watermark embeds the authentication information in the execution state of the program, and it is extracted by tracking the program as it runs [7].
Provided that the behavior of the software is the same before and after embedding the watermark, we use the following criteria to evaluate watermarking techniques: (1) Stealth. This criterion expresses the invisibility of the embedded watermark to an observer. (2) Robustness. A robust software watermark can still be extracted correctly after a strong attack. (3) Redundancy. This refers to the amount of modification made to the source code.
Y. Yuan, X. Wu, and Y. Lu (Eds.): ISCTCS 2012, CCIS 320, pp. 333–340, 2013. © Springer-Verlag Berlin Heidelberg 2013
Based on the above criteria, we propose the HSVN algorithm to protect the copyright of source code. It is a static software watermark. The algorithm embeds the authentication information into the names of local variables of functions that are selected by a random sequence. It performs well on stealth, robustness and redundancy. To the best of our knowledge, the HSVN algorithm and its extensions mentioned later are first proposed in this paper.
The paper is organized as follows. Section 2 describes previous related work in software watermarking. Section 3 describes our method and gives an overview of the implementation process. Section 4 analyzes the performance of the method and discusses its extensions. Section 5 concludes the paper.
2 Related Works
Davidson and Myhrvold proposed the first software watermarking algorithm [8], which embeds a watermark into a program by reordering its basic blocks. Around the same time, some patented software watermarking algorithms [9, 10] were based on the idea of code replacement, in which a predetermined portion of code or data in a program is replaced with the watermark value.
Later, watermarking algorithms based on register allocation were proposed. The algorithm proposed by Qu and Potkonjak in [11] (the QP algorithm) is one of them and can be applied to a graph coloring register allocator. The basic idea of QP is to add edges between chosen vertices in a register allocation graph based on the watermark sequence. In [12], Myles and Collberg showed that, under this algorithm, embedding different bit sequences in a graph may produce the same result, which can lead to a wrong extraction of the watermark. They also proposed an improved algorithm named QPS (QP for SandMark) to correct this error. Their analysis attributes the inaccurate message recovery to the unpredictability of the coloring of the vertices, so QPS places additional constraints on which vertices can be selected for a triple. With QPS, the selected triples are isolated units that do not affect other vertices in the graph.
Watermarking based on graph theory is also an important research direction. Collberg and Thomborson proposed a software watermarking technique in [13] (the CT algorithm) in which a dynamic graph watermark is stored in the execution state of a program. The CT algorithm uses the topology of a graph to represent a large integer N that is the product of two large primes. When embedding, N is encoded into a graph G according to a special encoding algorithm, and G is then split into several sub-graphs that are embedded into the program.
However, detecting the graph requires running the entire program, so the algorithm is fragile against the module removal attack [14]. The first static graph watermarking scheme, Graph Theoretic Watermarking (GTW), was proposed by Venkatesan et al. [15]. The basic idea is to encode a watermark value in a reducible permutation graph and convert it into a control flow graph, which is then merged with the program control flow graph by adding control flow edges between the two. But Collberg et al. [16] found that watermarks of
up to 150 bits increased program size by between 40% and 75%, while performance decreased by between 0% and 36%.
Based on equation reordering, Mohammad and Sajad proposed an algorithm [17] that makes only a very small modification to the source code and has no negative effect on the running of the program. The main idea is to reorder instructions that can be swapped with each other while preserving the original functionality of the program; the authentication sequence is embedded into these swapped instructions. The weakness of this method is that it can hide only a small amount of information.
3 HSVN Algorithm
The main idea of our method is as follows. Relying on the difficulty of factoring large integers, we choose a large natural number N and embed it into the program. Because N is the product of two sufficiently large primes P and Q, only the legal owner can both extract N from the program and provide its two large prime factors.
3.1 Asmuth-Bloom Lemma
The idea of key sharing is to divide a key K into n parts such that any t ($t \le n$) of them can recover K. C. Asmuth and J. Bloom proposed a method based on the Chinese remainder theorem in 1983 to implement this idea [18]. Assume that the integers $p, d_1, d_2, \dots, d_n$, with $p > K$, satisfy the following conditions:

$$d_1 < d_2 < d_3 < \dots < d_n; \quad \forall i,\ \gcd(p, d_i) = 1; \quad \forall i \ne j,\ \gcd(d_i, d_j) = 1,\ i, j \in \{1, 2, \dots, n\}; \quad d_1 \times d_2 \times \dots \times d_t > p \times d_{n-t+2} \times d_{n-t+3} \times \dots \times d_n \tag{1}$$

Let

$$N = d_1 \times d_2 \times d_3 \times \dots \times d_t \tag{2}$$

So $N/p$ is greater than the product of any $t-1$ of the $d_i$. Randomly select an integer r satisfying $0 \le r \le N/p - 1$, and calculate

$$k' = K + r \times p \tag{3}$$

So the i-th share $k_i$ is:

$$k_i = k' \bmod d_i, \quad i = 1, 2, \dots, n \tag{4}$$
$k_1, k_2, \dots, k_n$ are the n parts of K. To recover K from t of the shares, $k_{i_1}, k_{i_2}, \dots, k_{i_t}$, first use the Chinese remainder theorem to solve the t congruences:
$$k_{i_j} \equiv k' \pmod{d_{i_j}}, \quad 1 \le i_j \le n, \quad j = 1, 2, \dots, t \tag{5}$$

This system of congruences has a unique solution x in the range $[0,\ d_{i_1} \times d_{i_2} \times \dots \times d_{i_t})$. Since $d_{i_1} \times d_{i_2} \times \dots \times d_{i_t} \ge N$, we can uniquely identify $k'$, where $k' = x \bmod N$. Then we can recover K from Eq. (6).

$$K = k' - r \times p \tag{6}$$
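As an illustration of Eqs. (1) to (6), the following is a minimal Python sketch of the splitting and recovery steps, not the authors' implementation. The function names `split`, `crt` and `recover` and the toy parameters (p = 7, moduli 11, 13, 17, 19, t = 3, K = 5) are assumptions chosen for readability; a real watermark would use far larger numbers. Because $0 \le K < p$, the sketch recovers K as $k' \bmod p$, which equals $k' - r \times p$ from Eq. (6) without requiring r to be stored.

```python
from itertools import combinations
from math import gcd, prod
import random


def split(K, p, d, t, rng):
    """Split K into len(d) shares so that any t of them recover K (Eqs. 1-4)."""
    n = len(d)
    # Conditions (1): increasing, pairwise coprime moduli that are coprime to p.
    assert 0 <= K < p and list(d) == sorted(d)
    assert all(gcd(p, di) == 1 for di in d)
    assert all(gcd(a, b) == 1 for a, b in combinations(d, 2))
    N = prod(d[:t])                       # Eq. (2): product of the t smallest moduli
    assert N > p * prod(d[n - t + 1:])    # last inequality of (1)
    r = rng.randrange(N // p)             # 0 <= r <= N/p - 1
    k_prime = K + r * p                   # Eq. (3)
    return [k_prime % di for di in d], r  # Eq. (4)


def crt(residues, moduli):
    """Solve x = residues[j] (mod moduli[j]); returns the unique x in [0, prod(moduli))."""
    x, m = 0, 1
    for r_j, d_j in zip(residues, moduli):
        # Choose s so that x + m*s is congruent to r_j modulo d_j.
        s = ((r_j - x) * pow(m, -1, d_j)) % d_j
        x += m * s
        m *= d_j
    return x


def recover(shares, moduli, p):
    """Recover K from t shares: k' via Eq. (5), then K = k' mod p (same as Eq. 6)."""
    return crt(shares, moduli) % p


# Toy demonstration: every 3-subset of the 4 shares yields the same K.
d = [11, 13, 17, 19]
shares, r = split(5, 7, d, 3, random.Random(0))
for idx in combinations(range(len(d)), 3):
    print(idx, recover([shares[i] for i in idx], [d[i] for i in idx], 7))
```

The sketch needs Python 3.8 or later for `math.prod` and the modular inverse form of `pow`.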
This method is called the Asmuth-Bloom lemma.
3.2 Watermark Embedding Algorithm
To make the watermark more robust against the watermark removal attack [19], HSVN adopts the Asmuth-Bloom lemma to divide the watermark, and keeps $p$ and the $d_i$ as copyright evidence. The watermark embedding process is as follows:
(1) Select an appropriate natural number N.
(2) Convert N into n parts using the Asmuth-Bloom method. Record the number of digits of each part in a vector $V = (v_1, v_2, \dots, v_n)$; for example, 100 has 3 digits and 1000 has 4 digits. Concatenate the n parts of N into a string, and transform this string into a binary sequence M with 4 bits per digit; for example, the 4-bit binary sequence 0010 represents the digit 2. If M has d bytes, we divide M into 2d sub-watermarks $m_0, m_1, m_2, \dots, m_{2d-1}$, with 4 bits per sub-watermark. Fig. 1 presents an example in which the n parts of N are 24, 25, 100, 4, 89, 2 and 2000.
(3) Assign a serial number to every function according to its definition order in the source file. If the total number of functions is C, the serial numbers are 0, 1, …, C-1. Using a random number generator and a seed k, produce the random sequence $p_0, p_1, \dots, p_{2d-1}$ with $p_i \in \{0, 1, \dots, C-1\}$. If $p_i = s$, we embed $m_i$ in the function that has the serial number s. The embedding method is as follows. Following the definition order, find the first local variable in this function. If the function has no variable, add one and use it as the chosen variable. If this is the first appearance of $p_i$ in the random sequence (the random sequence may contain repeated numbers), we add one character behind the first character of the chosen variable name. Because $m_i$ is a 4-bit value, it ranges from 0 to 15. The added character is determined by looking up Table 1 and Table 2. For example, if we add a character behind 'Y' and the sub-watermark $m_i$ is 0010, the added character is 'a'; if we add a character behind 'x' and the sub-watermark $m_i$ is 0011, the added character is 'A'. If this is the second appearance of $p_i$ in the random sequence, we add one character behind the second character of the chosen variable name; in general, if this is the j-th appearance of $p_i$ in the random sequence, we add one character behind the j-th character of the chosen variable name. Then rename every occurrence of this variable in the function accordingly.
(4) After all the embedding steps, keep k, C, V and d as the key of this source code.
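The selection logic of step (3) can be sketched as below. This is only an illustration under stated assumptions: the paper does not name a particular random number generator, so Python's `random.Random` stands in for it, and the function name `embedding_plan` is hypothetical. For each sub-watermark the sketch returns the serial number of the chosen function together with j, the number of times that function has been selected so far, which is the position behind which the new character is added.

```python
import random
from collections import Counter


def embedding_plan(seed, C, num_sub):
    """Return one (function_serial, j) pair per sub-watermark m_i.

    function_serial is p_i, drawn from {0, ..., C-1} with the seeded generator;
    j is the number of appearances of p_i so far, i.e. the new character is
    added behind the j-th character of the chosen variable name.
    """
    rng = random.Random(seed)          # assumed generator; the same seed gives the same sequence
    appearances = Counter()
    plan = []
    for _ in range(num_sub):
        p_i = rng.randrange(C)
        appearances[p_i] += 1
        plan.append((p_i, appearances[p_i]))
    return plan


# The extractor regenerates the identical plan from the key values k (seed), C and d.
plan = embedding_plan(seed=2012, C=5, num_sub=8)
for i, (serial, j) in enumerate(plan):
    print(f"m{i}: function {serial}, behind character {j}")
```

Because the plan depends only on the seed, C and the number of sub-watermarks, keeping k, C and d as the key is enough to locate every embedding position again.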
[Figure: the n parts of N are 24, 25, 100, 4, 89, 2, 2000. Strung together they give "242510048922000", and V = (2,2,3,1,2,1,4) is kept on record. Each digit is changed into a 4-bit binary sequence: 2 → 0010, 4 → 0100, 2 → 0010, 5 → 0101, 1 → 0001, 0 → 0000, … This gives M = 001001000010010100010000… and m0 = 0010, m1 = 0100, m2 = 0010, m3 = 0101, …]
Fig. 1. The process of n parts of N
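The conversion shown in Fig. 1 can be reproduced with a few lines of Python. This is a sketch under the assumption that the parts are non-negative integers written in decimal; the function name `encode_parts` is hypothetical.

```python
def encode_parts(parts):
    """Turn the n parts of N into (V, sub_watermarks, M) as in Fig. 1."""
    V = [len(str(x)) for x in parts]                 # number of decimal digits of each part
    digits = "".join(str(x) for x in parts)          # the parts strung together
    subs = [int(ch) for ch in digits]                # one 4-bit sub-watermark per digit
    M = "".join(format(s, "04b") for s in subs)      # binary sequence, 4 bits per digit
    return V, subs, M


V, subs, M = encode_parts([24, 25, 100, 4, 89, 2, 2000])
print(V)        # digit counts kept as part of the key
print(M[:24])   # the first six sub-watermarks, as drawn in Fig. 1
```

Since every sub-watermark is a decimal digit, only the values 0 to 9 of the 16 possible 4-bit values occur with this encoding.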
Table 1. Serial number and the character

Serial number   0   1   2   …   25   26   27   28   …   51
Character       A   B   C   …   Z    a    b    c    …   z
Table 2. Adding a character behind R

mi      The serial number of the added character
0000    R
0001    (R+1) mod 52
0010    (R+2) mod 52
0011    (R+3) mod 52
…       …
1100    (R+12) mod 52
1101    (R+13) mod 52
1110    (R+14) mod 52
1111    (R+15) mod 52
The added character is chosen in this way: the character α is added behind the letter β, which has serial number R, and the embedded sub-watermark is $m_i$. First, get the serial number of α from Table 2, and then translate it into a character by looking up Table 1.
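The lookups in Table 1 and Table 2 amount to modular arithmetic over a 52-letter alphabet, as the sketch below shows. The function names are hypothetical, and two points are assumptions rather than statements from the paper: the character in front of the insertion point must be a letter (Table 1 has no entry for digits or underscores), and for a repeated function the position j is counted in the already modified name.

```python
import string

# Table 1: 'A'..'Z' have serial numbers 0..25 and 'a'..'z' have 26..51.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase


def added_char(beta, m):
    """Table 2: the character added behind beta for a 4-bit sub-watermark m."""
    assert 0 <= m < 16
    R = ALPHABET.index(beta)              # raises ValueError if beta is not a letter
    return ALPHABET[(R + m) % 52]


def embed_nibble(name, j, m):
    """Add the character encoding m behind the j-th character (1-based) of name."""
    return name[:j] + added_char(name[j - 1], m) + name[j:]


def extract_nibble(name, j):
    """Read back the sub-watermark stored behind the j-th character of name."""
    R = ALPHABET.index(name[j - 1])
    return (ALPHABET.index(name[j]) - R) % 52


print(added_char("Y", 0b0010))            # the paper's first example
print(added_char("x", 0b0011))            # the paper's second example
renamed = embed_nibble("count", 1, 0b0101)
print(renamed, extract_nibble(renamed, 1))
```

A difference of 16 or more returned by `extract_nibble` cannot come from Table 2, so an extractor can treat it as a sign that the name was altered.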
3.3 Watermark Extracting Algorithm
The extracting process is similar to the embedding process. First, compare the watermarked source code with the original code. If the watermarked source code lacks some functions, add empty functions at the missing positions; if it has extra functions, delete them. With the known number C, generate the same random sequence $p_0, p_1, \dots, p_{2d-1}$ again using the seed k and the same random number generator. Find the function located by $p_i$, and find its first variable. Following the rules described in the embedding process, look up Table 1 and Table 2 to find the sub-watermark $m_i$ hidden in the name of this variable. If the function is an empty function, the extracted sub-watermark $m_i$ is none. With the help of V we can recover the divided parts of N. According to the Asmuth-Bloom lemma described above, if we can extract at least t parts, we can restore N, and only the owner can give the two large prime factors of N, which proves the ownership of this source code.
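The last steps of extraction, regrouping the sub-watermarks into parts with V and checking whether enough parts survived, can be sketched as follows. The function name `regroup` and the convention of marking a lost sub-watermark with `None` are assumptions made for this illustration; the paper only says that the sub-watermark of an empty function is none.

```python
def regroup(V, subs):
    """Rebuild the parts of N from extracted sub-watermarks.

    V holds the digit count of each part; subs holds one value per
    sub-watermark, with None for a sub-watermark that could not be read.
    A part with a missing or non-decimal digit is returned as None.
    """
    parts, pos = [], 0
    for v in V:
        chunk = subs[pos:pos + v]
        pos += v
        if len(chunk) < v or any(s is None or not 0 <= s <= 9 for s in chunk):
            parts.append(None)                       # this share is unusable
        else:
            parts.append(int("".join(str(s) for s in chunk)))
    return parts


V = [2, 2, 3, 1, 2, 1, 4]
subs = [int(ch) for ch in "242510048922000"]
subs[2] = None                                       # simulate a removed function
parts = regroup(V, subs)
usable = [(i, part) for i, part in enumerate(parts) if part is not None]
print(parts)
print(len(usable), "usable parts")
```

The indices of the usable parts select the matching moduli $d_i$, and N can be restored through the Chinese remainder theorem as long as at least t parts remain.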
4 Evaluation and Extension
We evaluate the performance of the HSVN algorithm using the following criteria.
(1) Redundancy. Our method embeds the watermark by modifying the names of some variables. The modification to the source code is very small, and it adds no extra overhead to the running of the program.
(2) Stealth. We embed the watermark in the names of variables, which are a very common and important part of source code. It is very hard for an attacker to find the embedding positions through observation or control flow analysis.
(3) Robustness. Because we adopt the Asmuth-Bloom lemma, HSVN can withstand the function removal attack. Because the original source code is used when extracting the watermark, the function adding attack does not affect the extraction result. A semantics-preserving attack adjusts the order of instructions without changing the functionality; since our watermark is embedded in the names of variables, this attack is useless too. The first letter of the chosen variable name is random, so it is also very hard to crack our watermark through a collusion attack, which tries to find the embedded watermark by comparing two or more watermarked source codes.
Table 3. Performance comparison of several software watermarking algorithms
Algorithm             Invisibility   Robustness   Modification   Datarate   Impact on performance
HSVN                  +              +            +              +          +
GTW                   +              +            -              -          -
Equation reordering   +              +            +              -          +
Code replacement      -              -            -              +          +
In Table 3 we compare the performance of several software watermarking algorithms that can be used for source code copyright protection. '+' means good performance on that criterion and '-' means bad performance.
Our method scales well. After a slight change, it can be used with other forms of software products and made more effective. The algorithm described above embeds the watermark in the names of variables. We can also embed the watermark in the names of functions or in the names of the formal parameters of a function using the same method. By combining the three methods, we can increase both the difficulty of cracking the watermark and the amount of data that can be embedded in the source code.
If we choose the variables in the same way, embed the sub-watermarks in their initial values rather than in their names, and assign the real values to them before they are first used, we can extract the watermark from an execution of the program by tracking the initial value after each variable is defined. This extension of the algorithm can be used for the copyright protection of executable files.
5 Conclusion
In this paper, we presented HSVN, a software watermarking algorithm based on the names of variables, to protect the copyright of source code. The algorithm adds little redundancy to the watermarked source code while offering good stealth and strong robustness. With a slight change it can be turned into other watermarking algorithms, which makes the watermark more confusing, so it is very difficult for an attacker to find the real watermark and the actual watermarking algorithm. From this algorithm and its extensions we draw the software watermarking idea that wherever there is code, there is a place to embed a watermark.
Acknowledgment. This work is partially supported by the High-Tech Research and Development Plan of China (Grant No. 2010AA012504, 2011AA010705); the National Natural Science Foundation of China (Grant No. 61173145); and the National Grand Fundamental Research 973 Program of China (Grant No. 2011CB302605).
References
1. Swanson, M.D., Kobayashi, M., Tewfik, A.H.: Multimedia Data-embedding and Watermarking Technologies. Proc. of the IEEE 86(6), 1054–1087 (1998)
2. Collberg, C.S., Thomborson, C.: Watermarking, Tamper-proofing, and Obfuscation-tools for Software Protection. IEEE Transactions on Software Engineering 28, 735–746 (2002)
3. Hamilton, J., Danicic, S.: A Survey of Static Software Watermarking. In: Internet Security, pp. 100–107 (2011)
4. Dai, P., Wang, C., Yu, Z., Yue, Y., Wang, J.: A Software Watermark Based Architecture for Cloud Security. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds.) APWeb 2012. LNCS, vol. 7235, pp. 270–281. Springer, Heidelberg (2012)
5. Chroni, M., Nikolopoulos, S.D.: Encoding Watermark Numbers as Cographs using Self-inverting Permutations. In: 12th International Conference on Computer Systems and Technologies, pp. 142–148 (2011)
6. Zhang, S., Zhu, G., Wang, Y.: A Strategy of Software Protection based on Multi-watermarking Embedding. In: 2nd International Conference on Control, Instrumentation and Automation, pp. 444–447 (2011)
7. Collberg, C.S., Thomborson, C.: On the Limits of Software Watermarking. Technical Report. 164 (August 1998)
8. Davidson, R.I., Myhrvold, N.: Method and System for Generating and Auditing a Signature for a Computer Program (September 1996)
9. Holmes, K.: Computer Software Protection. International Business Machines Corporation (February 1994)
10. Samson, P.R.: Apparatus and Method for Serializing and Validating Copies of Computer Software (February 1994)
11. Qu, G., Potkonjak, M.: Analysis of Watermarking Techniques for Graph Coloring Problem. In: Proceedings of the 1998 IEEE/ACM International Conference on Computer-aided Design, pp. 190–193 (1998)
12. Myles, G., Collberg, C.S.: Software Watermarking Through Register Allocation: Implementation, Analysis, and Attacks. In: Lim, J.-I., Lee, D.-H. (eds.) ICISC 2003. LNCS, vol. 2971, pp. 274–293. Springer, Heidelberg (2004)
13. Collberg, C.S., Thomborson, C.: Software Watermarking: Models and Dynamic Embeddings. In: Conference Record of the Annual ACM Symposium on Principles of Programming Languages, pp. 311–324 (1999)
14. Collberg, C.S., Huntwork, A., Carter, E., Townsend, G.: Graph Theoretic Software Watermarks: Implementation, Analysis, and Attacks. In: Fridrich, J. (ed.) IH 2004. LNCS, vol. 3200, pp. 192–207. Springer, Heidelberg (2004)
15. Venkatesan, R., Vazirani, V.V., Sinha, S.: A Graph Theoretic Approach to Software Watermarking. In: Moskowitz, I.S. (ed.) IH 2001. LNCS, vol. 2137, pp. 157–168. Springer, Heidelberg (2001)
16. Collberg, C.S., Huntwork, A., Carter, E., Townsend, G.: Graph Theoretic Software Watermarks: Implementation, Analysis, and Attacks. In: Fridrich, J. (ed.) IH 2004. LNCS, vol. 3200, pp. 192–207. Springer, Heidelberg (2004)
17. Shirali-Shahreza, M., Shirali-Shahreza, S.: Software Watermarking by Equation Reordering. In: 3rd International Conference on Information and Communication Technologies: From Theory to Applications, ICTTA (2008)
18. Asmuth, C., Bloom, J.: A Modular Approach to Key Safeguarding. IEEE Transactions on Information Theory IT-29, 208–210 (1983)
19. Myers, A.C., Liskov, B.: Protecting Privacy Using the Decentralized Label Model. ACM Transactions on Software Engineering and Methodology 9(4), 410–442 (2000)