ELYSIUM JOURNAL OF ENGINEERING RESEARCH AND MANAGEMENT
SEPTEMBER 2014 | VOLUME 01, NO. 01 | SPECIAL ISSUE 01

TABLE OF CONTENTS

1. Integrated Security in Cloud Computing Environment - S. Srinivasan, Dr. K. Raja ... 1
2. A Supervised Web-Scale Forum Crawler Using URL Type Recognition - A. Anitha, Mrs. R. Angeline ... 6
3. A Robust Data Obfuscation Approach for Privacy Preserving Data Mining - S. Deebika, A. Sathyapriya ... 16
4. E-Waste Management - A Global Scenario - R. Devika ... 22
5. An Adequacy Based Multipath Routing in 802.16 WiMAX Networks - K. Saranya, Dr. M.A. Dorai Rangasamy ... 24
6. Calculation of Asymmetry Parameters for Lattice Based Facial Models - M. Ramasubramanian, Dr. M.A. Dorai Rangaswamy ... 29
7. Multi-Scale and Hierarchical Description Using Energy Controlled Active Balloon Model - T. Gandhimathi, M. Ramasubramanian, M.A. Dorai Rangaswamy ... 34
8. Current Literature Review - Web Mining - K. Dharmarajan, Dr. M.A. Dorairangaswamy ... 38
9. The Latchkey of the Research Proposal for Funded - Mrs. B. Mohana Priya ... 43
10. A Combined PCA Model for Denoising of CT Images - Mredhula L., Dorairangaswamy M.A. ... 46
11. RFID Based Personal Medical Data Card for Toll Automation - Ramalatha M., Ramkumar A.K., Selvaraj S., Suriyakanth S. ... 51
12. Adept Identification of Similar Videos for Web Based Video Search - Packialatha A., Dr. Chandra Sekar A. ... 56
13. Predicting Breast Cancer Survivability Using Naïve Baysein Classifier and C4.5 Algorithm - R.K. Kavitha, Dr. D. Dorairangasamy ... 61
14. Video Summarization Using Color Features and Global Thresholding - Nishant Kumar, Amit Phadikar ... 64
15. Role of Big Data Analytic in Healthcare Using Data Mining - K. Sharmila, R. Bhuvana ... 68
16. The Effect of Cross-Layered Cooperative Communication in Mobile Ad Hoc Networks - N. Noor Alleema, D. Siva Kumar, Ph.D. ... 71
17. Secure Cybernetics Protector in Secret Intelligence Agency - G. Bathuriya, D.E. Dekson ... 76
18. Revitalization of Bloom's Taxonomy for the Efficacy of Highers - Mrs. B. Mohana Priya ... 80
19. Security and Privacy-Enhancing Multi Cloud Architectures - R. Shobana, Dr. Dekson ... 85
20. Stratagem of Using Web 2.0 Tools in TL Process - Mrs. B. Mohana Priya ... 89
21. The Collision of Techno-Pedagogical Collaboration - Mrs. B. Mohana Priya ... 94
22. No Mime When Bio-Mimicries Bio-Wave - J. Stephy Angelin, Sivasankari P. ... 98
23. A Novel Client Side Intrusion Detection and Response Framework - Padhmavathi B., Jyotheeswar Arvind M., Ritikesh G. ... 100
24. History Generalized Pattern Taxonomy Model for Frequent Itemset Mining - Jibin Philip, K. Moorthy ... 106
25. IDC Based Protocol in Ad Hoc Networks for Security Transactions - K. Priyanka, M. Saravanakumar ... 109
26. Virtual Image Rendering and Stationary RGB Colour Correction for Mirror Images - S. Malathy, R. Sureshkumar, V. Rajasekar ... 115
27. Secure Cloud Architecture for Hospital Information System - Menaka C., R.S. Ponmagal ... 124
28. Improving System Performance Through Green Computing - A. Maria Jesintha, G. Hemavathi ... 129
29. Finding Probabilistic Prevalent Colocations in Spatially Uncertain Data Mining in Agriculture Using Fuzzy Logics - Ms. Latha R., Gunasekaran E. ... 133
30. Qualitative Behavior of a Second Order Delay Dynamic Equations - Dr. P. Mohankumar, A.K. Bhuvaneswari ... 140
31. Hall Effects on Magneto Hydrodynamic Flow Past an Exponentially Accelerated Vertical Plate in a Rotating Fluid with Mass Transfer Effects - Thamizhsudar M., Prof. Dr. Pandurangan J. ... 143
32. Detection of Car-License Plate Using Modified Vertical Edge Detection Algorithm - S. Meha Soman, Dr. N. Jaisankar ... 150
33. Modified Context Dependent Similarity Algorithm for Logo Matching and Recognition - S. Shamini, Dr. N. Jaisankar ... 156
34. A Journey Towards: To Become the Best Varsity - Mrs. B. Mohana Priya ... 164
35. Extraction of 3D Object from 2D Object - Diya Sharon Christy, M. Ramasubramanian ... 167
36. Cloud Based Mobile Social TV - Chandan Kumar Srivastawa, Mr. P.T. Sivashankar ... 170
37. Blackbox Testing of Orangehrm Organization Configuration - Subburaj V. ... 174
INTEGRATED SECURITY IN CLOUD COMPUTING ENVIRONMENT

1 S. Srinivasan, 2 Dr. K. Raja
1 Research Scholar & Associate Professor, Research & Development Centre, Bharathiar University, and Department of M.C.A., K.C.G College of Technology, Chennai, Tamil Nadu, India
2 Dean Academics, Alpha College of Engineering, Chennai, Tamil Nadu, India
1 [email protected]

Abstract - Cloud computing is a standard futuristic computing model that allows society to implement information technology and associated functions with low-cost computing capabilities. Cloud computing provides multiple, unrestricted, distributed services, from elastic computing to on-demand provisioning, with the ability to grow storage and computing requirements dynamically. However, despite the probable gains attained from cloud computing, the security of its open-ended and generously available resources is still uncertain, and this hampers cloud adoption. The security problem becomes enlarged under the cloud model as new dimensions enter the problem scope related to the model, multi-tenancy, layer trust, and extensibility. This paper introduces an in-depth examination of the cloud computing security problem. It appraises the problem of security from the cloud architecture perspective, the cloud delivery model viewpoint, and the cloud characteristics viewpoint. The paper examines several key research challenges of delivering cloud-aware security solutions that can reasonably secure the changing and dynamic cloud model. Based on this investigation, it presents a comprehensive specification of the cloud security problem and the main features that must be covered by any proposed security solution for cloud computing.

Keywords - Cloud computing security; cloud security model.

I. INTRODUCTION
Cloud computing [1] is a resource delivery and usage model: shared software, hardware, and other information are provided to computers and other devices as a metered service via the network. Cloud computing is the next development of the distributed computing paradigm [2], providing extremely resilient resource pooling, storage, and computing resources. Cloud computing [2] has motivated industry, academia, and businesses to host applications ranging from heavy, computationally intensive applications down to lightweight applications and services in the cloud. Cloud providers should treat privacy and security issues as matters of high and urgent priority. Cloud providers offer Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), and many other services. A cloud service has distinct characteristics such as on-demand self-service, ubiquitous network access, resource pooling, rapid elasticity, and measured service. A cloud can be private or public. A public cloud sells services to anyone on the Internet. A private cloud is a proprietary network that supplies hosted services to a limited number of people. When a service provider uses public cloud resources to create its private cloud, the result is called a virtual private cloud. Cloud computing services give users fast access to their applications and diminish their infrastructure costs. As per a Gartner survey [3], the cloud market was worth USD 138 billion in 2013 and will reach USD 150 billion by 2015. These revenues imply that cloud computing is a promising platform. Despite the potential payback and revenues that could be realized from the cloud computing model, the model still has a set of open questions that affect its credibility and reputation. Cloud security [3] is a large set of policies, technologies, controls, and methods organized to protect data, applications, and the related infrastructure of cloud computing. The major issues [4] in cloud computing are: multi-tenancy, cloud secure federation, secure information management, service level agreements, vendor lock-in, loss of control, confidentiality, data integrity and privacy, service availability, data intrusion, virtualization vulnerability, and elasticity. In this paper we analyze a few of the security issues involved in the cloud computing models. This paper is organized as follows. Section II discusses several security risks in the cloud environment. Section III gives a short description of a few related specific issues of cloud security.
Section IV describes the integrated security based architecture for cloud computing. Section V presents current solutions for the issues of the cloud environment. Finally, Section VI concludes the paper and describes future work on secure cloud computing.

II. SECURITY RISKS IN CLOUD ENVIRONMENT
Although cloud service providers can provide benefits to users, security risks play a vital role in the cloud environment [5]. According to a recent International Data Corporation (IDC) survey [6], the top concern regarding cloud computing for 75% of CIOs is security. Protecting information such as shared resources used by users or credit card details from malicious insiders is of critical importance. A huge datacenter involves security challenges [7] such as vulnerability, privacy and control issues related to information accessed by third parties, integrity, data loss, and confidentiality. According to Takabi et al. [8], in SaaS, cloud providers are responsible for the security and privacy of application services more than the users are. This responsibility is more relevant to the public than to the private cloud environment because users impose more rigorous security requirements in a public cloud. In PaaS, clients are responsible for the applications that run on the platform, while cloud providers are liable for protecting one client's application from others. In IaaS, users are responsible for defending operating systems and applications, whereas cloud providers must provide protection for clients' information and shared resources [9]. Ristenpart et al. [9] point out that the levels of security issues in the cloud environment differ. Encryption techniques and secure protocols alone are not adequate to secure data transmission in the cloud. Data intrusion into the cloud environment through the Internet by hackers and cybercriminals needs to be addressed, and the cloud computing environment needs to be secure and private for clients [10]. We will deal with a few security factors that mainly affect clouds, such as data intrusion and data integrity. Cachin et al. [11] observe that when multiple resources such as devices are synchronized by a single user, it is difficult to address the data corruption issue. One of the solutions they [11] propose is to use a Byzantine fault-tolerant replication protocol within the cloud. Hendricks et al. [12] state that this solution can avoid data corruption caused by some elements in the cloud. In order to reduce risks in the cloud environment, users can use cryptographic methods to protect the stored data and the sharing of resources in cloud computing [12]. Using a hash function [13] is a solution for data integrity: a short hash is kept in local memory.
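The short-hash idea cited from [13] can be made concrete with a few lines of code. The sketch below is illustrative only: cloud_put and cloud_get are hypothetical stand-ins for a cloud object store, and SHA-256 is an assumed choice of hash. The client keeps only the short digest locally and recomputes it over whatever the cloud returns.

```python
import hashlib

# Hypothetical in-memory stand-in for a cloud object store.
_cloud = {}

def cloud_put(name, data):
    _cloud[name] = data

def cloud_get(name):
    return _cloud[name]

def store_with_digest(name, data):
    """Upload data and keep only a short local digest (not the data itself)."""
    cloud_put(name, data)
    return hashlib.sha256(data).hexdigest()

def verify_integrity(name, local_digest):
    """Re-download the object and check it against the locally kept digest."""
    data = cloud_get(name)
    return hashlib.sha256(data).hexdigest() == local_digest

if __name__ == "__main__":
    digest = store_with_digest("report.txt", b"quarterly figures")
    print(verify_integrity("report.txt", digest))   # True
    _cloud["report.txt"] = b"tampered figures"      # simulate corruption in the cloud
    print(verify_integrity("report.txt", digest))   # False
```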
According to Garfinkel [14], another security risk that may occur with a cloud provider such as the Amazon cloud service is data intrusion. Since Amazon allows a lost password to be reset by short message service (SMS), a hacker may be able to log in to the e-mail account after receiving the new reset password. Service hijacking allows attackers to compromise services such as sessions and email transactions, thereby launching malicious attacks such as phishing and the exploitation of vulnerabilities.

III. ISSUES OF CLOUD SECURITY
There are many security issues associated with the various dimensions of the cloud environment. Gartner [15] identifies specific security issues: multi-tenancy, service availability, long-term viability, privileged user access, and regulatory compliance. Multi-tenancy refers to the sharing of resources, services, storage, and applications with other users residing on the same physical or logical platform at the cloud provider's premises. The defense-in-depth approach [16] is the solution for multi-tenancy; it involves defending the cloud virtual infrastructure at different layers with different protection mechanisms. Another concern in cloud services is service availability. Amazon [17] points out in its licensing agreement that the service may be unavailable from time to time. A user's service request may terminate for any reason that breaks the cloud policy, or the service may fail; in such cases there is no charge to the cloud provider for the failure. To protect services from failure, cloud providers need measures such as backups; replication techniques [18] and encryption methods such as HMAC are combined to address the service availability issue. Another cloud security issue is long-term viability. Ideally, a cloud computing provider will never go broke or get acquired and swallowed up by a larger company, but users must ensure that their data will remain available even if such an event occurs. The data can be secured and protected in a reliable manner by combining service level agreements or law enforcement [17] and the establishment of legacy data centers. Privileged user access and regulatory compliance are major concerns in cloud security. According to Arjun Kumar et al. [19], authentication and audit control mechanisms, service level agreements, cloud secure federation with single sign-on [20], session key management, and Identity, Authentication, Authorization, and Auditing (IAAA) mechanisms [21] will protect information and restrict unauthorized user access in cloud computing.
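As one hedged illustration of combining replication with keyed hashing as mentioned above, the sketch below tags each replica with an HMAC under a client-held key and returns the first replica whose tag still verifies. The key handling and the two-replica setup are assumptions made for the example, not a scheme from the paper.

```python
import hmac
import hashlib

def tag(data: bytes, key: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag the client stores alongside each replica."""
    return hmac.new(key, data, hashlib.sha256).digest()

def first_valid_replica(replicas, key):
    """Return the first replica whose tag verifies, or None if all fail."""
    for data, stored_tag in replicas:
        if hmac.compare_digest(tag(data, key), stored_tag):
            return data
    return None

key = b"client-held secret"
good = b"patient record"
replicas = [
    (b"corrupted record", tag(good, key)),  # replica damaged after upload
    (good, tag(good, key)),                 # intact replica on another cloud
]
print(first_valid_replica(replicas, key))   # b'patient record'
```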
IV. INTEGRATED SECURITY BASED CLOUD COMPUTING MODEL
The integrated security based model for the cloud environment ensures security in the sharing of resources so as to avoid threats and vulnerabilities in cloud computing. Security in the distribution of resources, the sharing of services, and service availability is ensured by assimilating cryptographic methods and a protective sharing algorithm, and by combining JAR (Java ARchive) files and RAID (redundant array of inexpensive or independent disks) technology with a trusted cloud computing hardware and software platform. The integrated security based cloud computing model is shown in Figure 1.

Figure 1. Integrated security based cloud computing model

The model uses a hierarchical protecting architecture with two layers. Each layer has its own tasks and is integrated with the other to ensure data security and to avoid cloud vulnerabilities in the integrated security based cloud environment. The authentication boot and access control mechanism layer provides digital signatures, a password protection method, and a one-time password method to users, and manages the user access permission matrix mechanism. An authenticated boot service monitors which software is booted on the computer and keeps an audit log of the boot process. The integration of the protective sharing algorithm and cryptographic methods with the redundant array of inexpensive disks layer improves service availability. The model improves the efficiency of multi-tenancy and protects the information provided by the users. The protective cloud environment provides an integrated, wide-ranging security solution and ensures data confidentiality, integrity, and availability in the integrated security based cloud architecture. Autonomous protection of the secure cloud is constructed by associating it with security services such as authentication and confidentiality, reducing the risk of data intrusion, and verifying integrity in the cloud environment. The cloud platform hardware and software module comprises software security, platform security, and infrastructure security. The software security provides identity management, access control mechanisms, and anti-spam and anti-virus protection. The platform security holds framework security and component security, which help to control and monitor the cloud environment. The infrastructure security provides virtual environment security in the integrated security based cloud architecture. The cloud service provider controls and monitors privileged user access and regulatory compliance through service level agreements and an auditing mechanism. The protective sharing algorithm and cryptographic methods used to describe security and the sharing of resources and services in cloud computing are:

Bs = A(user-node);
Ds = F*Bs + Ki

A(.) : access to user nodes; an application server of the system is denoted by user-node in the formula;
Bs : byte matrix of the file F;
Ds : bytes of data files in the global center of the system;
Ki : user key;
F : file; a file F in a user node is represented as F = {F(1), F(2), F(3), ..., F(n)}, i.e., file F is a group of n bytes.

Based on the values of information security of the cloud environment, the protective sharing algorithm is designed with cryptographic methods such as encryption, maintaining a protective secret key for each machine in the integrated security based cloud computing model, as follows:

Bs = A(user-node);
Bs = P.Bs + Ki
Ds = E(F)Bs

of which:
A(.) : authorized application server;
Bs : byte matrix in protected mode;
P : users' protective matrix;
E(F) : encryption of the bytes of file F.

The model adopts a multi-dimensional architecture of two-layer defense in the cloud environment. The RAID (redundant array of independent disks) assures data integrity by data placement in terms of node striping. The cloud service provider audits events, logs, and monitors what happens in the cloud environment.

V. CURRENT SOLUTIONS FOR THE ISSUES IN CLOUD ENVIRONMENT
In order to reduce threats, vulnerabilities, and risks in the cloud environment, consumers can use cryptographic methods to protect data, information, and the sharing of resources in the cloud [22]. Using a hash function [13] is a solution for data integrity, maintaining a small hash locally.
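The Section IV formulas (Bs = A(user-node), Ds = F*Bs + Ki, Ds = E(F)Bs) are stated only symbolically, so the sketch below is one loose reading of them rather than the authors' implementation: A derives a per-node byte mask, Ki is a per-user key added byte-wise, and E is a simple keyed byte transform. The SHA-256-derived masks and modulo-256 arithmetic are assumptions introduced purely for illustration.

```python
import hashlib

def A(user_node: str, length: int) -> list[int]:
    """Assumed access function: derive a per-node byte mask Bs from the node id."""
    seed = hashlib.sha256(user_node.encode()).digest()
    return [seed[i % len(seed)] for i in range(length)]

def E(file_bytes: bytes, key: bytes) -> list[int]:
    """Assumed encryption E(F): a simple keyed byte transform (illustrative only)."""
    ks = hashlib.sha256(key).digest()
    return [b ^ ks[i % len(ks)] for i, b in enumerate(file_bytes)]

def protected_store(file_bytes: bytes, user_node: str, Ki: int) -> list[int]:
    """Ds built from E(F), the mask Bs = A(user-node), and the user key Ki (mod 256)."""
    Bs = A(user_node, len(file_bytes))
    EF = E(file_bytes, Ki.to_bytes(4, "big"))
    return [(ef + bs + Ki) % 256 for ef, bs in zip(EF, Bs)]

Ds = protected_store(b"cloud record F", user_node="app-server-1", Ki=23)
print(Ds[:8])
```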
Bessani et al. [18] use a Byzantine fault-tolerant method to replicate and store data on different clouds, so that if one of the cloud providers fails, the data can still be stored and retrieved correctly. Bessani et al. [18] use the DepSky system to deal with availability and confidentiality in a cloud computing architecture. Using cryptographic methods, the keys are stored in the cloud via a secret sharing algorithm to hide the values of the keys from attackers. Encryption is the solution considered by Bessani et al. to address the issue of loss of data.

Muntes-Mulero discussed the issues that existing privacy protection technologies, such as k-anonymity, face when applied to large data sets, and analyzed the current solutions [23]. The sharing of account credentials between customers should be strictly denied [24]; the cloud service provider should deploy strong authentication, authorization, and auditing mechanisms for consumer sessions. The consumer can enable a HIPS (Host Intrusion Prevention System) at customer end points in order to achieve confidentiality and secure information management. The integrated security based model provides RAID technology with a sharing algorithm and cryptographic methods to assure data integrity and service availability in the cloud computing architecture. The authentication boot and access control mechanism ensures security across the cloud deployment models.

VI. CONCLUSION AND FUTURE WORK
Although the use of cloud computing has rapidly increased, cloud security is still considered the major issue in the cloud computing environment. To achieve a secure paradigm, this paper focused on vital issues: at a minimum, from the cloud computing deployment models viewpoint, the cloud security mechanisms should be self-defending, with the ability to monitor and control user authentication and access control through the booting mechanism of the integrated security model. This paper proposes a strong security based framework for the cloud computing environment with several security features, such as protective sharing of resources with cryptographic methods, combined with redundant array of independent disks storage technology and Java archive files between the users and the cloud service provider. The analysis shows that the proposed model is more secure and efficient in an integrated security based cloud computing environment. Future research on this work will include the development of interfaces, standards, and specific protocols that can support confidentiality and integrity in the cloud computing environment. We will make the actual design more practical and operational in the future. To welcome the coming cloud computing era, solving the cloud security issues is an urgent task, and doing so will give cloud computing a bright future.

REFERENCES
[1] Guoman Lin, "Research on Electronic Data Security Strategy Based on Cloud Computing", 2012 IEEE Second International Conference on Consumer Electronics, ISBN: 978-1-4577-1415-3, 2012, pp. 1228-1231.
[2] Akhil Behl, Kanika Behl, "An Analysis of Cloud Computing Security Issues", 2012 IEEE World Congress on Information and Communication Technologies, ISBN: 978-1-4673-4805-8, 2012, pp. 109-114.
[3] Deyan Chen, Hong Zhao, "Data Security and Privacy Protection Issues in Cloud Computing", 2012 International Conference on Computer Science and Electronics Engineering, ISBN: 978-0-7695-4647-6, 2012, pp. 647-651.
[4] Mohammed A. AlZain, Eric Pardede, Ben Soh, James A. Thom, "Cloud Computing Security: From Single to Multi-Clouds", IEEE Transactions on Cloud Computing, 9(4), 2012, pp. 5490-5499.
[5] J. Viega, "Cloud computing and the common man", Computer, 42, 2009, pp. 106-108.
[6] Clavister, "Security in the cloud", Clavister White Paper, 2008.
[7] C. Wang, Q. Wang, K. Ren and W. Lou, "Ensuring data storage security in cloud computing", ARTCOM'10: Proc. Int'l Conf. on Advances in Recent Technologies in Communication and Computing, 2010, pp. 1-9.
[8] H. Takabi, J.B.D. Joshi and G.J. Ahn, "Security and Privacy Challenges in Cloud Computing Environments", IEEE Security & Privacy, 8(6), 2010, pp. 24-31.
[9] T. Ristenpart, E. Tromer, H. Shacham and S. Savage, "Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds", CCS'09: Proc. 16th ACM Conf. on Computer and Communications Security, 2009, pp. 199-212.
[10] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing", Journal of Network and Computer Applications, 34(1), 2011, pp. 1-11.
[11] C. Cachin, I. Keidar and A. Shraer, "Trusting the cloud", ACM SIGACT News, 40, 2009, pp. 81-86.
[12] J. Hendricks, G.R. Ganger and M.K. Reiter, "Low-overhead Byzantine fault-tolerant storage", SOSP'07: Proc. 21st ACM SIGOPS Symposium on Operating Systems Principles, 2007, pp. 73-86.
[13] R.C. Merkle, "Protocols for public key cryptosystems", IEEE Symposium on Security and Privacy, 1980, pp. 122-134.
[14] S.L. Garfinkel, "Email-based identification and authentication: An alternative to PKI?", IEEE Security and Privacy, 1(6), 2003, pp. 20-26.
[15] Gartner: Seven cloud computing security risks, InfoWorld, 2008-07-02, http://www.infoworld.com/d/security-central/gartner-seven-cloud-computing-security-risks-853.
[16] Microsoft Research, "Securing Microsoft's Cloud Infrastructure", White Paper, 2009.
[17] Amazon, Amazon Web Services, Web Services Licensing Agreement, October 3, 2006.
[18] A. Bessani, M. Correia, B. Quaresma, F. Andre and P. Sousa, "DepSky: Dependable and secure storage in a cloud-of-clouds", EuroSys'11: Proc. 6th Conf. on Computer Systems, 2011, pp. 31-46.
[19] Arjun Kumar, Byung Gook Lee, Hoon Jae Lee, Anu Kumari, "Secure Storage and Access of Data in Cloud Computing", 2012 IEEE ICT Convergence, ISBN: 978-1-4673-4828-7, 2012, pp. 336-339.
[20] M.S. Blumental, "Hide and Seek in the Cloud", IEEE Security and Privacy, 11(2), 2010, pp. 57-58.
[21] Akhil Behl, Kanika Behl, "Security Paradigms for Cloud Computing", 2012 Fourth International Conference on Computational Intelligence, Communication Systems and Networks, ISBN: 978-0-7695-4821-0, 2012, pp. 200-205.
[22] R.C. Merkle, "Protocols for public key cryptosystems", IEEE Symposium on Security and Privacy, 1980.
[23] V. Muntes-Mulero and J. Nin, "Privacy and anonymization for very large datasets", in: P. Chen (ed.), Proc. of the 18th ACM Int'l Conf. on Information and Knowledge Management (CIKM 2009), New York: Association for Computing Machinery, 2009, pp. 2117-2118, doi:10.1145/1645953.1646333.
[24] Wikipedia, Cloud Computing Security.
A SUPERVISED WEB-SCALE FORUM CRAWLER USING URL TYPE RECOGNITION

1 A. Anitha, 2 Mrs. R. Angeline, M.Tech., Assistant Professor
1,2 Department of Computer Science & Engineering, SRM University, Chennai, India.

ABSTRACT - The main goal of the supervised web-scale forum crawler using URL type recognition is to discover relevant content from web forums with minimal overhead. The result of forum crawling is the information content of forum threads. The recent post information of each user is used to refresh the crawled threads in a timely manner: for each user, a regression model predicts the time when the next post will arrive on the thread page, and this information is used for timely refresh of forum data. Although forums are powered by different forum software packages and have different layouts or styles, they always have similar implicit navigation paths. Implicit navigation paths are connected by specific URL types which lead users from entry pages to thread pages. Based on this observation, the web forum crawling problem is reduced to a URL-type recognition problem. We show how to learn regular expression patterns of implicit navigation paths from automatically generated training sets using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as three annotated forums. The forum crawler achieved over 98 percent effectiveness and 98 percent coverage on a large set of test forums powered by over 100 different forum software packages.

Index Terms - EIT path, forum crawling, ITF regex, page classification, page type, URL pattern learning, URL type.

1 INTRODUCTION
Internet forums [1] (also called web forums) are becoming important online services. Discussions about distinct topics take place between users in web forums. For example, the Opera forum board is a place where people can ask for and share information related to the Opera software. Due to the abundance of information in forums, knowledge mining on forums is becoming an interesting research topic. Zhai and Liu [20], Yang et al. [19], and Song et al. [15] mined structured data from forums. Gao et al. [9] recognized question and answer pairs in forum threads. Glance et al. [10] tried to extract business intelligence from forum data. To mine knowledge from forums, their content must be downloaded first. Generic crawlers [7] adopt a breadth-first traversal strategy. The two main non-crawler-friendly features of forums are [8], [18]: 1) duplicate links and uninformative pages and 2) page-flipping links. A forum contains many duplicate links which point to a common
page, but each link has a different URL [4], e.g., shortcut links pointing to the most recent posts or URLs for user-experience tasks such as "view by date" or "view by title." A generic crawler blindly follows these links and crawls many duplicate pages, making it inefficient. A forum also contains many uninformative pages such as forum-software-specific FAQs; following these links, a crawler will fetch many uninformative pages. A long forum board or thread is usually divided into multiple pages, and these pages are linked by page-flipping links; for example, see Figs. 2b and 2c. Generic crawlers process each page reached via page-flipping links separately and lose the relationships between such pages. To facilitate downstream tasks such as page wrapping and content indexing [19], these relationships between pages should be preserved during crawling. For example, in order to mine all the posts in a thread as well as the reply relationships between posts, the multiple pages of the thread should be concatenated together. Jiang et al. [7] proposed techniques to learn and search web forums using URL patterns but did not discuss the timely refresh of thread pages. A supervised web-scale forum crawler based on URL type recognition is introduced to address these challenges. The objective of this crawler is to retrieve relevant content, i.e., user posts, from forums with minimal overhead. Each forum has a different layout or style and is powered by a different forum software package, but all forums contain implicit navigation paths that lead users from entry pages to thread pages.
Figure 1 Example link relations in a forum

Fig. 1 illustrates the link structure of each page in a forum. For example, a user can traverse from the entry page to a thread page through the following paths:
1. entry -> board -> thread
2. entry -> list-of-board -> board -> thread
3. entry -> list-of-board & thread -> thread
4. entry -> list-of-board & thread -> board -> thread
5. entry -> list-of-board -> list-of-board & thread -> thread
6. entry -> list-of-board -> list-of-board & thread -> board -> thread
Pages between the entry page and the thread page are called index pages. The implicit navigation path in a forum can be presented as an entry-index-thread (EIT) path: entry page -> index page -> thread page. The task of forum crawling is thus reduced to a URL type recognition problem. URLs are classified into three types: index URLs, thread URLs, and page-flipping URLs. It is shown how to learn URL patterns, i.e., Index-Thread-page-Flipping (ITF) regexes, and how to identify these three types of URLs from as few as three annotated forum packages. "Forum package" here refers to "forum site." The timestamp on each thread page is collected; posts of the same thread distributed over multiple pages can be concatenated using the timestamp details of each thread page. For each thread, a regression model is used to predict the time when the next post arrives on the page. The most important contributions of this paper are as follows:
1. The forum crawling problem is reduced to a URL type recognition problem.
2. It is shown how to automatically learn regular expression patterns (ITF regexes) that identify the index URL, thread URL, and page-flipping URL, using pre-built page classifiers, from as few as three annotated forums.
3. To refresh the crawled thread pages, incremental crawling of each page using timestamps is used.
4. The evaluation of the URL type recognition crawler on a large set of 100 unseen forum packages showed that the learned patterns (ITF regexes) are effective during crawling. The results also showed that the URL type recognition crawler performs better than the structure-driven crawler and iRobot.
The rest of this paper is organized as follows. Section 2 provides a brief review of related work. Section 3 defines the terms used in this paper. Section 4 describes the overview and the algorithms of the proposed approach. Experiment evaluations are reported in Section
5. Section 6 contains the conclusion and future work of the research.

2 RELATED WORKS
Vidal et al. [17] proposed a crawler which crawls from the entry page to thread pages using learned regular expression patterns of URLs. The target pages are found by comparing the DOM trees of pages with a preselected sample target page. This method is effective only when the sample page is drawn from the specific site; for each new site the same process must be repeated. Therefore, this method is not suitable for large-scale crawling. In contrast, the URL type recognition crawler automatically learns URL patterns across multiple sites using the training sets and finds a forum's entry page given a page from the forum. Guo et al. [11] did not mention how to discover and traverse URLs. Li et al. [22] developed some heuristic rules to discover URLs, but the rules apply only to the specific forum software packages for which the heuristics were designed, while on the Internet there are hundreds of different forum software packages (see ForumMatrix [2] for more information about forum software packages); many forums also have their own customized software. A more extensive work on forum crawling is iRobot by Cai et al. [8]. iRobot is an intelligent forum crawler based on site-level structure analysis. It crawls by sampling pages, clustering them, selecting informative clusters using an informativeness evaluation, and finding a traversal path using a spanning tree algorithm. However, the traversal path selection procedure requires human inspection. From the entry page to the thread page there are six paths, but iRobot takes only the first path (entry -> board -> thread). iRobot discovers new URL links using both URL pattern and location information, but when the page structure changes the URL location might become invalid. Later, Wang et al. [18] followed this work and proposed an algorithm for the traversal path selection problem. They introduced the concepts of skeleton links and page-flipping links. Skeleton links are "the valuable links of a forum site"; these links are identified by informativeness and coverage metrics. Page-flipping links are identified by a connectivity metric. By following these links, they showed that iRobot can achieve better effectiveness and coverage. The URL type recognition crawler learns URL patterns instead of URL locations to discover new URLs; URL patterns are not affected by page structure modifications. The next related work in forum crawling is near-duplicate detection. A main problem in forum crawling is to identify duplicates and remove them. Content-based duplicate detection [12], [14] first downloads the pages and then applies the detection algorithm, which makes it bandwidth-inefficient. URL-based duplicate detection [13] attempts to mine rules of different URLs with similar text; this requires analyzing logs from sites or the results of a previous crawl, which are often unavailable. In forums, all three types of URLs have specific URL patterns. The URL type recognition crawler adopts a simple URL string de-duplication technique (e.g., a string hash set); this avoids duplicates without explicit duplicate detection. To reduce unnecessary crawling, industry standard protocols such as "nofollow" [3], the Robots Exclusion Standard (robots.txt) [6], and the Sitemap Protocol [5] have been introduced. Page authors can inform the crawler that a destination page is not informative by specifying the "rel" attribute with the "nofollow" value (i.e., rel="nofollow"); this method is limited since the author must specify the "rel" attribute each time. The Robots Exclusion Standard (robots.txt) specifies which pages a crawler is allowed to visit. The Sitemap [5] method lists all URLs along with their metadata, including update time, change frequency, etc., in XML files. The purpose of robots.txt and Sitemaps is to allow the site to be crawled intelligently. Although these files are useful, their maintenance is very difficult since they change continually.

3 TERMINOLOGY
In this section, some terms used in this paper are defined to make the presentation clear and to support further discussion.
Page Type: Forum pages are categorized into four page types.
Entry Page: The homepage of a forum, which is the lowest common ancestor of all threads. It contains a list of boards. See Fig. 2a for an example.
Index Page: A board page in a forum which contains a table-like structure; each row of the table contains information about a board or a thread. See Fig. 2b for examples. List-of-board pages, list-of-board-and-thread pages, and board pages are all treated as index pages.
Thread Page: A page in a forum that contains a list of post contents, generated by users, belonging to the same discussion topic. See Fig. 2c for examples.
Other Page: A page which does not belong to any of the three page types above (entry page, index page, or thread page).
Figure 2 An example of EIT paths: entry -> board -> thread

URL Type: URLs can be categorized into four different types.
Index URL: A URL that links an entry page to an index page or one index page to another. Its anchor text shows the title of its destination board. Figs. 2a and 2b show an example.
Thread URL: A URL that links an index page to a thread page. Its anchor text is the title of its destination thread. Figs. 2b and 2c show an example.
Page-flipping URL: A URL that connects multiple pages of a board or a thread. Page-flipping URLs allow a crawler to download all threads in a large board or all posts in a long thread. See Figs. 2b and 2c for examples.
Other URL: A URL which does not belong to any of the three URL types above (index URL, thread URL, or page-flipping URL).
EIT Path: An entry-index-thread path is a navigation path from an entry page to thread pages through a sequence of index pages. See Fig. 2.
ITF Regex: An index-thread-page-flipping regular expression is used to recognize index, thread, or page-flipping URLs. The ITF regexes of a site are learned and then applied directly in online crawling. Four ITF regexes are learned for each specific site: one for identifying index URLs, one for thread URLs, one for index page-flipping URLs, and one for thread page-flipping URLs. See Table 2 for an example.

4 A SUPERVISED WEB-SCALE FORUM CRAWLER - URL TYPE RECOGNITION
In this section, some observations related to crawling, the system overview, and the modules are discussed.

4.1 Observations
The following characteristics of forums were observed by investigating 20 forums, and they are used to make crawling effective:
1. Navigation path: Each forum has a different layout and style, but all forums have implicit navigation paths in common which lead the user from the entry page to thread pages. In this crawler, the implicit navigation path is specified as the EIT path, which captures the types of links and pages that a crawler should follow to reach thread pages.
2. URL layout: URL layout information, such as the location of a URL on a page and its anchor text length, is used for the identification of index URLs and thread URLs. Index URLs have short anchor text and appear many times on the same page; thread URLs have long anchor text and appear with few or no other URLs on the page.
3. Page layout: Index pages and thread pages of different forums have similar layouts. An index page has narrow records, like a board; a thread page has large records of user posts. A page type classifier is learned from a small set of annotated pages based on these page characteristics; this is the only step in crawling where manual annotation is required. Using the URL layout characteristics, the index URLs, thread URLs, and page-flipping URLs can be detected.
Figure 3 System Overview

4.2 System Overview
Fig. 3 shows the overall architecture of the crawler. It consists of two major parts: the learning part and the online crawling part. The learning part first learns the ITF regexes of a given forum from constructed URL training sets and then implements incremental crawling using the timestamp whenever there is a new user post on a thread page. The learned ITF regexes are used to crawl all thread pages in the online crawling part. The crawler finds the index URLs and thread URLs on the entry page using the Index/Thread URL Detection module. The identified index URLs and thread URLs are stored in the index/thread URL training sets. The destination pages of the identified index URLs are fed again into the Index/Thread URL Detection module to find more index and thread URLs, until no more index URLs are detected. After that, the page-flipping URLs are found on both index pages and thread pages using the Page-Flipping URL Detection module, and these URLs are stored in the page-flipping URL training sets. From the training sets, the ITF Regexes Learning module learns a set of ITF regexes for each URL type. Once learning is completed, the online crawling part is executed: starting from the entry URL, the crawler follows all URLs matching any learned ITF regex and crawls until no page can be retrieved or another stopping condition is satisfied. It also checks for any change in index/thread pages during the user login time. The next user login time is estimated by a regression method. Any identified change in index and thread pages is fed again to the detection module to identify changes in the page URLs. The online crawling part outputs the resultant thread pages, together with the modified thread pages, with the help of the learned ITF regexes.
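The regression step mentioned above (predicting when the next post, and hence the next required refresh, will arrive) is not pinned to a specific model in the paper. A minimal sketch, assuming a simple least-squares line over post index versus arrival time, could look like this; the sample timestamps are invented for illustration.

```python
import numpy as np

# Arrival times (in minutes since the thread started) of posts seen so far.
post_times = np.array([0.0, 12.0, 25.0, 41.0, 55.0])
post_index = np.arange(1, len(post_times) + 1)

# Fit arrival time as a linear function of the post number.
slope, intercept = np.polyfit(post_index, post_times, deg=1)

# Predicted arrival time of the next post -> schedule the next re-crawl then.
next_post_eta = slope * (len(post_times) + 1) + intercept
print(f"re-crawl the thread in about {next_post_eta - post_times[-1]:.1f} minutes")
```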
4.3 ITF Regexes Learning
To learn ITF regexes, the crawler uses a two-step training procedure. The first step is to construct the training sets; the second step is regex learning.

4.3.1 Constructing URL Training Sets
The aim of this step is to create sets of highly precise index URL, thread URL, and page-flipping URL strings for ITF regex learning. Two separate training sets are created: an index/thread training set and a page-flipping training set.

4.3.1.1 Index URLs and thread URLs training set
Index URLs are the links between an entry page and an index page or between two index pages; their anchor text displays the title of the destination board. Thread URLs are the links between an index page and a thread page; their anchor text is the title of the destination thread. Index and thread pages each have their own layout. An index page contains many narrow records with long anchor text and short plain text, whereas a thread page contains a few large records (user posts), each with a very long text block and very short anchor text. Each record of an index page or a thread page is usually associated with a timestamp field, but the timestamp order in these two types of pages is reversed: in an index page the timestamps are in descending order, while in a thread page they are in ascending order. The distinction between index and thread pages is made by pre-built page classifiers. The page classifiers are built with a Support Vector Machine (SVM) [16] to identify the page type. Based on the page layout, the outgoing links, metadata, and DOM tree structures of the records are used as the main features, instead of the page content used in generic crawling. The main features and their descriptions are listed in Table 1.

Feature | Value | Description
Record Count | Float | Number of records
Max/Avg/Var of Width | Float | The maximum/average/variance of record width among all records
Max/Avg/Var of Height | Float | The maximum/average/variance of record height among all records
Max/Avg/Var of Anchor Length | Float | The maximum/average/variance of anchor text length in characters among all records
Max/Avg/Var of Text Length | Float | The maximum/average/variance of plain text length in characters among all records
Max/Avg/Var of Leaf Nodes | Float | The maximum/average/variance of leaf nodes in the HTML DOM tree among all records
Max/Avg/Var of Links | Float | The maximum/average/variance of links among all records
Has Link | Boolean | Whether each record has a link
Has User Link | Boolean | Whether each record has a link pointing to a user profile page
Has Timestamp | Boolean | Whether each record has a timestamp
Time Order | Float | The order of timestamps in the records, if timestamps exist
Record Tree Similarity | Float | The similarity of HTML DOM trees among all the records
Ratio of Anchor Length to Text Length | Float | The ratio of anchor text length in characters to plain text length in characters
Number of Groups | Float | The number of element groups after HTML DOM tree alignment

Table 1 Main Features for Index/Thread Page Classification
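To make the SVM-based page type classification concrete, the sketch below trains scikit-learn's SVC on tiny hand-made vectors of Table 1-style features. The feature values, the feature subset, and the training labels are all fabricated for illustration; a real classifier would use vectors extracted from annotated index, thread, and other pages.

```python
from sklearn.svm import SVC

# Each vector holds a few Table 1-style features:
# [record count, avg record width, avg anchor length, avg text length, has timestamp]
X_train = [
    [42, 900, 35, 60, 1],   # index page: many narrow records, short plain text
    [38, 880, 30, 55, 1],   # index page
    [8, 950, 5, 800, 1],    # thread page: few large records, long text blocks
    [6, 940, 4, 750, 1],    # thread page
    [3, 400, 10, 120, 0],   # other page
]
y_train = ["index", "index", "thread", "thread", "other"]

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

# Classify the destination page of a newly discovered URL.
print(clf.predict([[40, 910, 28, 70, 1]]))   # expected: ['index']
```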
Algorithm IndexUrlAndThreadUrlDetection
Input: p: an entry page or index page
Output: it_groups: a group of index/thread URLs
1: let it_groups be φ;
2: url_groups = Collect URL groups by aligning HTML DOM tree of p;
3: foreach urg in url_groups do
4:     urg.anchor_len = Total anchor text length in urg;
5: end foreach
6: it_groups = max(urg.anchor_len) in url_groups;
7: it_groups.DstPageType = select the most common page type of the destination pages of URLs in urg;
8: if it_groups.DstPageType is INDEX_PAGE
9:     it_groups.UrlType = INDEX_URL;
10: else if it_groups.DstPageType is THREAD_PAGE
11:     it_groups.UrlType = THREAD_URL;
12: else
13:     it_groups = φ;
14: end if
15: return it_groups;
Figure 4 Index and Thread URL Detection Algorithm

Using the same feature set, both index and thread page classifiers can be built. The URL type recognition crawler does not require strong page type classifiers. According to [15], [20], URLs that are displayed in an HTML table-like structure can be mined by aligning DOM trees, and these can be stored in a link table. The partial tree alignment method in [15] is adopted for crawling. The index and thread URL training sets are created using the algorithm shown in Fig. 4. Lines 2-5 collect all the URL groups and calculate their total anchor text length; line 6 chooses the URL group with the longest anchor text length as the index/thread URL group; and lines 7-14 decide its URL type. The URL group is discarded if it belongs to neither index nor thread pages.

4.3.1.2 Page-flipping URL training set
Page-flipping URLs are very different from both index and thread URLs. Page-flipping URLs connect multiple pages of an index or a thread. There are two types of page-flipping URLs: grouped page-flipping URLs and single page-flipping URLs. Grouped page-flipping URLs have more than one page-flipping URL on a single page, whereas a single page-flipping URL is the only page-flipping link on its page. Wang et al. [18] used a "connectivity" metric to distinguish page-flipping URLs from other loop-back URLs. However, the metric works well only for grouped page-flipping URLs and is unable to detect single page-flipping URLs. To address both types of page-flipping URLs, their special characteristics are observed, and an algorithm is proposed to detect page-flipping URLs based on these properties.

The observation is that grouped page-flipping URLs have the following properties:
1. Their anchor text is either a series of digits such as 1, 2, 3, or special text such as "Last" or "Next."
2. They are seen at the same location on the DOM tree of the source page and on the DOM trees of their destination pages.
3. The layouts of their source page and destination pages are similar. To determine the similarity between two page layouts, a tree similarity method is used.
Single page-flipping URLs do not have property 1, but they have another special property:
4. For single page-flipping URLs, the source pages and the destination pages have similar anchor text but different URL strings.

Algorithm PageFlippingUrlDetection
Input: pg: an index page or thread page
Output: pf_groups: a group of page-flipping URLs
1: let pf_groups be φ;
2: url_groups = Collect URL groups by aligning HTML DOM tree of pg;
3: foreach urg in url_groups do
4:     if the anchor texts of urg are digit strings
5:         pages = Download (URLs in urg);
6:         if pages have a similar layout to pg and urg is located at the same location in pg
7:             pf_groups = urg;
8:             break;
9:         end if
10:     end if
11: end foreach
12: if pf_groups is φ
13:     foreach url in outgoing URLs in pg
14:         sp = Download (url);
15:         pf_urls = Extract URL in sp at the same location as url in pg;
16:         if pf_urls exists and pf_urls.anchor == url.anchor and pf_urls.UrlString != url.UrlString
17:             add url and cond_url into pf_groups;
18:             break;
19:         end if
20:     end foreach
21: end if
22: pf_groups.UrlType = PAGE_FLIPPING_URL;
23: return pf_groups;
Figure 5 Page Flipping URL Detection Algorithm
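A much-simplified rendition of the grouped page-flipping check of Fig. 5, working on already-extracted (anchor text, URL) pairs instead of live pages: it applies only property 1 (digit or paging-word anchors) and groups candidates whose URLs differ only in their digits. The downloading, layout-similarity, and DOM-location checks of the full algorithm are omitted, so this is an illustrative fragment, not the complete detector.

```python
import re
from collections import defaultdict

PAGE_WORDS = {"next", "last", "prev", "previous"}

def is_flip_anchor(text: str) -> bool:
    """Property 1: anchor text is a run of digits or a paging keyword."""
    t = text.strip().lower()
    return t.isdigit() or t in PAGE_WORDS

def grouped_page_flipping(links):
    """Group candidate page-flipping URLs by the non-numeric part of the URL."""
    groups = defaultdict(list)
    for anchor, url in links:
        if is_flip_anchor(anchor):
            groups[re.sub(r"\d+", "#", url)].append(url)
    # Keep only groups with more than one member (grouped page-flipping URLs).
    return {k: v for k, v in groups.items() if len(v) > 1}

links = [
    ("1", "http://forum.example.com/board/10/page-1"),
    ("2", "http://forum.example.com/board/10/page-2"),
    ("Next", "http://forum.example.com/board/10/page-2"),
    ("General chat", "http://forum.example.com/board/10"),
]
print(grouped_page_flipping(links))
```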
The page-flipping URL detection algorithm is based on the above properties. The detail is shown in Fig. 5. Lines 1-11 try to identify the grouped page-flipping URLs; if that fails, lines 13-20 examine all the outgoing URLs to detect single page-flipping URLs; and the final lines set the URL type to page-flipping URL and return the group.

4.3.2 Learning ITF Regexes
The algorithms for the creation of the index URL, thread URL, and page-flipping URL string training sets have been explained above. This section explains how to learn ITF regexes from these training sets. Vidal et al. [17] proposed URL string generalization for learning, but this method is affected by negative URLs and requires very clean, precise URL examples. The URL type recognition crawler cannot guarantee that the training sets it creates are clean and precise, since they are generated automatically, so the method of Vidal et al. [17] cannot be used for learning.

Page Type | URL Type | URL Pattern
Index | Index | http://www.aspforums.net/Forums+/\w+/\d+/Threads
Index | Page-Flipping | http://www.aspforums.net/Forums+/\w+/\d+/\w+\d
Thread | Thread | http://www.aspforums.net/Threads+/\d+/\w/
Thread | Page-Flipping | http://www.aspforums.net/Threads+/\d+/\w+/\d-\w
Table 2 The learned ITF regexes from http://www.aspforums.net

Take these URLs for example:
http://www.aspforums.net/Forums/NetBasics/233/Threads
http://www.aspforums.net/Threads/679152/TableC-Net/
http://www.aspforums.net/Forums/ASPAJAX/212/Threads
http://www.aspforums.net/Threads/446862/AJAXCalen/
http://www.aspforums.net/Threads/113227/AJAXForms/

The regular expression pattern for the above URLs is http://www.aspforums.net/\w+/\d+/\w/, and the target pattern is http://www.aspforums.net/\w+\d+\w/. Koppula et al. [13] proposed a method to deal with negative examples. Starting with the generic pattern "*", the algorithm discovers the more specific patterns matching a set of URLs. Each specific pattern is then further refined to obtain more specific patterns, and patterns are collected continuously until no more patterns can be refined. When this method is applied to the previous example, "*" is refined to a specific pattern, http://www.aspforums.net/\w+\d+\w/, which matches all URLs, both positive and negative. This pattern is then refined into two more specific patterns:
1. http://www.aspforums.net/Forums+/\w+/\d+/Threads
2. http://www.aspforums.net/Threads+/\d+/\w
All the URL subsets are matched against each specific pattern. These two patterns cannot be refined further, so the final output is these three patterns. A small modification is made to this technique to reduce the number of patterns and cover as many URLs as possible with the correct pattern: a pattern is retained only if the number of its matching URLs is greater than an empirically calculated threshold, set to 0.2 times the total count of URLs. For the given example, only the first pattern is retained because its count exceeds the threshold.
The crawler learns a set of ITF regexes for a given forum, and each ITF regex has three elements: the page type of the destination pages, the URL type, and the URL pattern. Table 2 shows the ITF regexes learned from the forum aspforums.net. When a user posts a new index or thread page in the forum, it is identified from the timestamp of the user login information. A regression method is used to detect any change in the user login information, and crawling is performed again to find new index/thread pages.

4.4 Online Crawling
The crawler performs online crawling using a breadth-first strategy. The crawler first pushes the entry URL into a URL queue, then fetches a URL from the queue and downloads its page. It then pushes the outgoing URLs of the fetched page that match any learned regex into the URL queue. The crawler continues this process until the URL queue is empty or other stopping conditions are satisfied.
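The online crawling loop just described is essentially a breadth-first traversal that only enqueues outgoing URLs matching a learned ITF regex. The sketch below shows that control flow; the two regexes are simplified versions of the Table 2 patterns, and the in-memory link graph is a hypothetical stand-in for the page fetcher and link extractor.

```python
import re
from collections import deque

# Simplified stand-ins for learned ITF regexes (cf. Table 2).
ITF_REGEXES = [
    re.compile(r"/Forums/\w+/\d+/Threads$"),   # index URLs
    re.compile(r"/Threads/\d+/[\w-]+/?$"),     # thread URLs
]

# Hypothetical in-memory link graph standing in for downloaded pages.
LINK_GRAPH = {
    "http://www.aspforums.net/": [
        "http://www.aspforums.net/Forums/NetBasics/233/Threads",
        "http://www.aspforums.net/faq",                            # other URL, skipped
    ],
    "http://www.aspforums.net/Forums/NetBasics/233/Threads": [
        "http://www.aspforums.net/Threads/679152/TableC-Net/",
    ],
    "http://www.aspforums.net/Threads/679152/TableC-Net/": [],
}

def crawl(entry_url, max_pages=1000):
    """Breadth-first crawl that only follows URLs matching a learned ITF regex."""
    queue, seen, fetched = deque([entry_url]), {entry_url}, []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        for out in LINK_GRAPH.get(url, []):
            if out not in seen and any(rx.search(out) for rx in ITF_REGEXES):
                seen.add(out)
                queue.append(out)
    return fetched

print(crawl("http://www.aspforums.net/"))
```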
The online crawling of this crawler is very efficient, since it only needs to apply the ITF regexes learned in the learning phase to the new outgoing URLs on newly downloaded pages. This reduces the time needed for crawling.

#Training Forums | Index Page % Precision (Avg. - SD) | Index Page % Recall (Avg. - SD) | Thread Page % Precision (Avg. - SD) | Thread Page % Recall (Avg. - SD) | Index/Thread URL Detection % Precision (Avg. - SD) | Index/Thread URL Detection % Recall (Avg. - SD)
3 | 97.51 - 0.83 | 96.98 - 1.33 | 98.24 - 0.55 | 98.12 - 1.15 | 99.02 - 0.17 | 98.14 - 0.21
5 | 97.05 - 0.69 | 97.47 - 1.56 | 98.28 - 0.27 | 98.04 - 1.23 | 99.01 - 0.15 | 98.13 - 0.18
10 | 97.23 - 0.20 | 96.91 - 1.38 | 98.43 - 0.44 | 97.96 - 1.49 | 99.01 - 0.15 | 98.08 - 0.17
20 | 97.34 - 0.18 | 96.18 - 0.56 | 98.66 - 0.26 | 98.00 - 1.18 | 99.00 - 0.10 | 98.10 - 0.12
30 | 97.44 - N/A | 96.38 - N/A | 99.04 - N/A | 97.49 - N/A | 99.03 - N/A | 98.12 - N/A
Table 3 Results of Page Type Classification and URL Detection

5 EXPERIMENTS AND RESULTS
This section presents the experimental results of the proposed system, including a performance analysis of each module and a comparison of the URL type recognition crawler with other generic crawlers in terms of both effectiveness and coverage.

5.1 Experiment Setup
To carry out the experiments, 200 different forum software packages are selected from ForumMatrix [2], and a forum powered by each software package is found. In total, there are 120 forums powered by 120 different software packages. Among them, 20 forums are selected as the training set and the remaining 100 forums are used for testing. The 20 training packages are installed by 23,672 forums and the 100 test packages are installed by 127,345 forums. A script is created to find the number of threads and users in these packages. It is estimated that these packages cover about 1.2 million threads generated by over 98,349 users.
5.2 Evaluations of Modules

5.2.1 Evaluation of Index/Thread URL Detection
To build page classifiers, three index pages, three thread pages, and three other pages from each of the 20 training forums are selected manually and the features of these pages are extracted. For testing, 10 index pages, 10 thread pages, and 10 other pages from each of the 100 test forums are selected manually; this is known as the 10-Page/100 test set. The Index/Thread URL Detection module described in Section 4.3.1 is executed on the test set, and the detected URLs are checked manually. The result is computed at page level, not at individual URL level, since a majority voting procedure is applied. To check how many annotated pages the crawler needs to achieve good performance, the same experiments are conducted with different numbers of training forums (3, 5, 10, 20, and 30), with cross validation. The results are shown in Table 3 and indicate that as few as three annotated forums can achieve over 98 percent precision and recall.

5.2.2 Evaluation of Page-Flipping URL Detection
To evaluate the Page-Flipping URL Detection module explained in Section 4.3.1, the module is applied to the 10-Page/100 test set and the results are checked manually. The method achieved over 99 percent precision and 95 percent recall in identifying page-flipping URLs. The failures in this module are mainly due to JavaScript-based page-flipping URLs or HTML DOM tree alignment errors.

ID | Forum Name | Forum | Software | #Threads
1 | AfterDawn Forums | forums.afterdawn.com | Customized | 535,383
2 | ASP.NET Forums | forums.asp.net | Community Server | 1,446,264
3 | Android Forums | forum.xda-developers.com | vBulletin | 299,073
4 | BlackBerry Forums | forums.crackberry.com | vBulletin | 525,381
5 | Tech Report | techreport.com/forums | phpBB | 65,083
Table 4 Forums used in Online Crawling Evaluation

5.3 Evaluation of Online Crawling
Among the 100 test forums, five forums (Table 4) are
selected for comparison study. In which four forums are more popular software packages used by many forum sites. These packages have more than 88,245 forums. 5.3.1Online Crawling Comparison Based on these metrics URL type recognition crawler is compared with other generic crawler like structure-driven crawler, iRobot. Even though the structure-driven crawler [25] is not a forum crawler, it can also be applied to forums. Each forum is given as an input to each crawler and the number of thread pages and other pages retrieved during crawling are counted. Learning efficiency comparison The learning efficiency comparisons between the crawlers are evaluated by the number of pages crawled. The results are estimated under the metric of average coverage over the five forums. The sample for each method is limited to almost N pages, where N varies from 10 to 1,000 pages. 100% 80% 60% 40% 20%
0%
10
20
50
URL Type Recognition iRobo t Structure driven crawler 100 200 500 1000
Figure 6 Coverage comparison based on different numbers of sampled pages in learning phase Structure driven crawler Recognition
iRobot
URL Type
100% 50% 0% 1 2 3 4 5 Figure 7 Effectiveness comparisons between the structure-driven, iRobot, and URL Type recognition crawler Using the learned knowledge the forums are crawled for each method and results are evaluated. Fig. 6 shows the average coverage of each method based on different numbers of sampled pages. The result showed that URL type recognition crawler needs only 100 pages
to achieve stable performance, whereas iRobot and the structure-driven crawler need more than 750 pages. This indicates that the URL type recognition crawler can learn better knowledge about forum crawling with less effort.

Crawling effectiveness comparison
Fig. 7 shows the result of the effectiveness comparison. The URL type recognition crawler achieved almost 100% effectiveness on all forums. The average effectiveness of the structure-driven crawler is about 73%; this low effectiveness is mainly due to the absence of specific URL similarity functions for each URL pattern. The average effectiveness of iRobot is about 90%, but it is still considered an ineffective crawler since it uses a random sampling strategy that samples many useless and noisy pages during crawling. Compared to iRobot, the URL type recognition crawler learns the EIT path and ITF regexes for crawling, so it is not affected by noisy pages and performs better. This shows that, for a given fixed bandwidth and storage, the URL type recognition crawler can fetch much more valuable content than iRobot.

Crawling coverage comparison
Fig. 8 shows that the URL type recognition crawler had better coverage than the structure-driven crawler and iRobot. The average coverage of the URL type recognition crawler was 99%, compared to 93% for the structure-driven crawler and 86% for iRobot. The low coverage of the structure-driven crawler is due to its limited domain adaptation.
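The two comparison metrics used above are only named in the text; the following sketch spells out their usual definitions for forum crawlers (an assumption, not the authors' code): effectiveness is the fraction of fetched pages that are thread pages, and coverage is the fraction of the forum's thread pages that were fetched.

def effectiveness(thread_pages_fetched, total_pages_fetched):
    # Share of crawled pages that are valuable (thread) pages.
    return thread_pages_fetched / total_pages_fetched if total_pages_fetched else 0.0

def coverage(thread_pages_fetched, thread_pages_in_forum):
    # Share of the forum's thread pages that the crawler managed to fetch.
    return thread_pages_fetched / thread_pages_in_forum if thread_pages_in_forum else 0.0

# Example with made-up numbers for one forum:
print(effectiveness(9800, 10000))  # 0.98
print(coverage(9800, 10100))       # ~0.97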
Figure 8 Coverage comparison between the structure-driven crawler, iRobot, and the URL Type recognition crawler

The coverage of iRobot is very low because it learns only one path from the sampled pages, which leads to the loss of many thread pages. In contrast, the URL type recognition crawler learns the EIT path and ITF regexes directly and crawls all the thread pages in the forums. This result also shows that the index/thread URL and page-flipping URL detection algorithms are very effective.

6 CONCLUSION
The forum crawling problem is reduced to a URL type recognition problem, showing how to leverage the implicit navigation paths of forums, i.e., the EIT path, and how to learn ITF regexes explicitly. Experimental results confirm that the URL type recognition crawler can effectively learn knowledge of
the EIT path from as few as three annotated forums. The test results on five unseen forums showed that the URL type recognition crawler has better coverage and effectiveness than other generic crawlers. In future work, more comprehensive experiments shall be conducted to further verify that the URL type recognition method can be applied to other social media and that it can be enhanced to handle forums that use JavaScript.

REFERENCES
[1] Internet Forum, http://en.wikipedia.org/wiki/Internetforum, 2012.
[2] ForumMatrix, http://www.forummatrix.org/index.php, 2012.
[3] nofollow, http://en.wikipedia.org/wiki/Nofollow, 2012.
[4] RFC 1738—Uniform Resource Locators (URL), http://www.ietf.org/rfc/rfc1738.txt, 2012.
[5] The Sitemap Protocol, http://sitemaps.org/protocol.php, 2012.
[6] The Web Robots Pages, http://www.robotstxt.org/, 2012.
[7] J. Jiang, X. Song, N. Yu, and C.-Y. Lin, "FoCUS: Learning to Crawl Web Forums," IEEE Trans. Knowledge and Data Eng., vol. 25, no. 6, June 2013.
[8] R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang, "iRobot: An Intelligent Crawler for Web Forums," Proc. 17th Int'l Conf. World Wide Web, pp. 447-456, 2008.
[9] C. Gao, L. Wang, C.-Y. Lin, and Y.-I. Song, "Finding Question-Answer Pairs from Online Forums," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 467-474, 2008.
[10] N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, and T. Tomokiyo, "Deriving Marketing Intelligence from Online Discussion," Proc. 11th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 419-428, 2005.
[11] Y. Guo, K. Li, K. Zhang, and G. Zhang, "Board Forum Crawling: A Web Crawling Method for Web Forum," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence, pp. 475-478, 2006.
[12] M. Henzinger, "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms," Proc. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 284-291, 2006.
[13] H.S. Koppula, K.P. Leela, A. Agarwal, K.P. Chitrapura, S. Garg, and A. Sasturkar, "Learning URL Patterns for Webpage De-Duplication," Proc. ACM Conf. Web Search and Data Mining, pp. 381-390, 2010.
[14] G.S. Manku, A. Jain, and A.D. Sarma, "Detecting Near-Duplicates for Web Crawling," Proc. 16th Int'l Conf. World Wide Web, pp. 141-150, 2007.
[15] X.Y. Song, J. Liu, Y.B. Cao, and C.-Y. Lin, "Automatic Extraction of Web Data Records Containing User-Generated Content," Proc. 19th Int'l Conf. Information and Knowledge Management, pp. 39-48, 2010.
[16] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.
[17] M.L.A. Vidal, A.S. Silva, E.S. Moura, and J.M.B. Cavalcanti, "Structure-Driven Crawler Generation by Example," Proc. 29th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 292-299, 2006.
[18] Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma, "Exploring Traversal Strategy for Web Forum Crawling," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 459-466, 2008.
[19] J.-M. Yang, R. Cai, Y. Wang, J. Zhu, L. Zhang, and W.-Y. Ma, "Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums," Proc. 18th Int'l Conf. World Wide Web, pp. 181-190, 2009.
[20] Y. Zhai and B. Liu, "Structured Data Extraction from the Web Based on Partial Tree Alignment," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 12, pp. 1614-1628, Dec. 2006.
[21] K. Li, X.Q. Cheng, Y. Guo, and K. Zhang, "Crawling Dynamic Web Pages in WWW Forums," Computer Eng., vol. 33, no. 6, pp. 80-82, 2007.
A ROBUST DATA OBFUSCATION APPROACH FOR PRIVACY PRESERVING DATA MINING

S. Deebika 1, A. Sathyapriya 2
1 PG Student, 2 Assistant Professor
Department of Computer Science and Engineering, Vivekananda College of Engineering for Women, Namakkal, India.
1 Email: [email protected]
2 Email: [email protected]
Abstract: Data mining plays an important role in storing and retrieving huge amounts of data from databases. Every user wants to efficiently retrieve encrypted files containing specific keywords while keeping the keywords themselves secret and not jeopardizing the security of the remotely stored files. Well-defined security requirements and the global distribution of attributes call for privacy preserving data mining (PPDM). Privacy-preserving data mining is used to protect sensitive information from unauthorized disclosure; its goal is to develop methods that do not increase the risk of misuse of the data. Anonymization techniques such as k-anonymity, l-diversity, t-closeness, p-sensitivity, and m-invariance offer more privacy options than other privacy preservation techniques (randomization, encryption, and sanitization). However, these anonymization techniques only offer resistance against prominent attacks such as the homogeneity and background knowledge attacks; none of them provides protection against all known attacks or calculates the overall proportion of the data by comparing the sensitive data. We evaluate a new technique called (n,t)-closeness, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table.

Index Terms— Anonymization, L-Diversity, PPDM, P-Sensitive, T-Closeness, (n,t)-Closeness.

I. INTRODUCTION
The rapid growth of internet technology has made it possible to use remote communication in every aspect of life. Along with this growth in technology, privacy and security in electronic communications have become pressing issues. Securing sensitive data against unauthorized access has been a long-term goal of the database security community. Data mining consists of a number of techniques for automatically and efficiently retrieving information from large databases, which often contain sensitive information as well. Privacy is a vital issue when transferring sensitive information from one place to another through the internet. Most notably, in hospitals, government offices, and industries, there is a need to
establish privacy for sensitive information before it is analyzed and processed further by other departments. Various organizations (e.g., hospital authorities, industries, and government organizations) release person-specific data, called microdata, which can reveal private information about individuals. The main aim is to protect this information while simultaneously producing useful external knowledge. A table consisting of microdata is called a microdata table [6]. Its attributes fall into three types [3]: i) Identifiers: attributes that uniquely identify an individual, e.g., Social Security number; ii) Quasi-identifiers: attributes that an adversary may already know and that, taken together, can potentially identify an individual, e.g., birth date, sex, and zip code; iii) Sensitive attributes: attributes that are unknown to the adversary and are sensitive, e.g., disease and salary. Sensitive information is slightly different from secret and confidential information: secret information means passwords, PIN codes, credit card details, etc., whereas sensitive information is mostly linked to diseases such as HIV, cancer, and heart problems.

II. RELATED WORKS
The main aim of privacy preservation is to create methods and techniques that prevent the misuse of sensitive data. The proposed techniques alter the original data to achieve privacy; the alteration should not distort the original data while improving the privacy it enjoys. Various privacy methods can prevent unauthorized use of sensitive attributes. Some of the privacy methods [11][4] are anonymization, randomization, encryption, and data sanitization. Extending these, many advanced techniques have been proposed, such as p-sensitive k-anonymity, (α, k)-anonymity, l-diversity, t-closeness, m-invariance, personalized anonymity, and so on. For multiple sensitive attributes [7], there are three kinds of information disclosure: i)
Identity Disclosure: an individual is linked to a particular record in the published data; ii) Attribute Disclosure: sensitive information regarding an individual is disclosed;
iii) Membership Disclosure: it is revealed whether or not an individual's record is present in the published data set. When microdata is published, various attacks can occur, such as the record linkage attack and the attribute linkage attack. To avoid these attacks, different anonymization techniques were introduced. We surveyed several anonymization techniques [8]; they are explained below.
A. K-Anonymity
K-anonymity is a property possessed by certain anonymized data. The concept of k-anonymity was first formulated by L. Sweeney [12] in a paper published in 2002 as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful." [9][10] A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 other individuals whose information also appears in the release.

Methods for k-anonymization
In the framework of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population, and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population. Table 1 below is a non-anonymized database consisting of patient records.
S.No | Zip code | Age | Disease
1 | 4369 | 29 | TB
2 | 4389 | 24 | Viral infection
3 | 4598 | 28 | No illness
4 | 4599 | 27 | Viral infection
5 | 4478 | 23 | Heart-related

Table 1 Non-anonymized database

The above table has 4 attributes and 5 records. There are two common methods for achieving k-anonymity [13] for some value of k.

Suppression: In this method, certain values of the attributes are replaced by an asterisk '*'; for example, all the values of a 'Name' attribute or a 'Religion' attribute may be replaced by '*'.

Generalisation: In this method, individual values of attributes are replaced with a broader category; for example, the value '23' of the attribute 'Age' may be replaced by '20 < Age ≤ 30'.

The k-anonymity model was developed to protect released data from linking attacks, but it can still cause information disclosure. The protection k-anonymity provides is simple and easy to understand, but k-anonymity does not provide a shield against attribute disclosure. Table 2 below is the anonymized version of the database.

S.No | Zip code | Age | Disease
1 | 43** | 20 < Age ≤ 30 | TB
2 | 43** | 20 < Age ≤ 30 | Viral infection
3 | 45** | 20 < Age ≤ 30 | No illness
4 | 45** | 20 < Age ≤ 30 | Viral infection
5 | 44** | 20 < Age ≤ 30 | Heart-related

Table 2 Anonymized database
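As a rough illustration of the two methods, the following Python sketch (an assumption-laden example, not the authors' code) applies suppression to the zip code and generalization to the age of Table 1 and then measures the k that the resulting release satisfies.

from collections import Counter

records = [
    {"zip": "4369", "age": 29, "disease": "TB"},
    {"zip": "4389", "age": 24, "disease": "Viral infection"},
    {"zip": "4598", "age": 28, "disease": "No illness"},
    {"zip": "4599", "age": 27, "disease": "Viral infection"},
    {"zip": "4478", "age": 23, "disease": "Heart-related"},
]

def generalize(rec):
    # Suppress the last two zip digits and generalize age to a decade band.
    zip_gen = rec["zip"][:2] + "**"
    lo = (rec["age"] - 1) // 10 * 10
    return (zip_gen, f"{lo} < Age <= {lo + 10}")

def k_of(records):
    """Smallest equivalence-class size, i.e. the k the release satisfies."""
    return min(Counter(generalize(r) for r in records).values())

print(k_of(records))  # 1 here: the 44** class of Table 2 contains a single record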
Attacks on k-anonymity
In this section, we study the attacks on k-anonymity. There are two types of attacks: the homogeneity attack and the background knowledge attack. Table 3 illustrates both.

Zip code | Age | Disease
476** | 2* | Heart Disease
476** | 2* | Heart Disease
476** | 2* | Heart Disease
4790** | >=40 | Flu
4790** | >=40 | Heart Disease
476** | 3* | Heart Disease
476** | 3* | Cancer
476** | 3* | Cancer

Homogeneity attack: Bob (Zip 47678, Age 27). Background knowledge attack: John (Zip 47673, Age 36).

Table 3 Homogeneity and background knowledge attacks

Homogeneity Attack
The sensitive attribute lacks diversity in its values. From the above table, we can easily conclude that Bob's zip code falls in the range 476** and his age is between 20 and 29, and therefore that he suffers from heart disease. This is a homogeneity attack.

Background Knowledge Attack
The attacker has additional background knowledge about other sensitive data.
Restrictions of K-anonymity
K-anonymity may expose individuals' sensitive attributes. The background knowledge attack is not protected against by k-anonymity, and plain knowledge of the k-anonymization algorithm can be exploited to violate privacy. It cannot be applied to high-dimensional data, and it cannot protect against attribute disclosure.

Variants of K-anonymity
A microdata set satisfies the p-sensitive k-anonymity [15] property if it satisfies k-anonymity and the number of distinct values for each sensitive attribute is at least p within the same QI group. It reduces information loss through an anatomy approach.

(α, k)-Anonymity
A view of a table is said to be an (α, k)-anonymization [16] of the table if the view modifies the table such that it satisfies both k-anonymity and α-deassociation with respect to the quasi-identifier.

B. L-diversity
L-diversity was proposed to overcome the shortcomings of k-anonymity and is an extension of it. L-diversity [1] was proposed by Ashwin Machanavajjhala in 2005. An equivalence class has l-diversity if there are l or more well-represented values for the sensitive attribute, and a table is said to be l-diverse if each of its equivalence classes is l-diverse. This guards against the above attacks by requiring that "many" sensitive values be "well-represented" in each q*-block (generalization block).

Attacks on l-diversity
In this section, we study two attacks on l-diversity [2]: the skewness attack and the similarity attack.

Skewness Attack
Suppose there are two sensitive values, HIV positive (1%) and HIV negative (99%), and consider an equivalence class that contains an equal number of positive and negative records; this poses a serious privacy risk, yet l-diversity does not differentiate between, for example, Equivalence class 1: 49 positive + 1 negative, and Equivalence class 2: 1 positive + 49 negative. L-diversity does not consider the overall distribution of sensitive values.

Similarity Attack
When the sensitive attribute values are distinct but semantically similar, an adversary can learn important information. Table 4 illustrates the similarity attack.
Zip code | Age | Salary | Disease
476** | 2* | 20k | Gastric ulcer
476** | 2* | 30k | Gastric
476** | 2* | 40k | Stomach cancer
479** | >=40 | 100k | Gastric
476** | >=40 | 60k | Flu
476** | 3* | 70k | Bronchitis

Bob: Zip 47678, Age 27.

Table 4 Similarity attack
From the table we can conclude that Bob's salary is in [20k, 40k], which is relatively low, and that Bob has some stomach-related disease.

Variants of L-diversity
Distinct l-diversity: each equivalence class has at least l well-represented sensitive values. It does not prevent probabilistic inference attacks. For example, suppose one equivalence class has ten tuples; in the "Disease" attribute, one of them is "Cancer", one is "Lung Disease", and the remaining eight are "Kidney failure". This satisfies 3-diversity, but the attacker can still assert that the target person's disease is "Kidney failure" with 80% confidence.

Entropy l-diversity: each equivalence class must not only have enough different sensitive values, but the different sensitive values must also be distributed evenly enough. However, the entropy of the entire table may be very low; this motivates the less conservative notion of recursive (c,l)-diversity.

Recursive (c,l)-diversity: the most frequent value does not appear too frequently.

Restrictions of L-diversity
L-diversity prevents the homogeneity attack, but it is insufficient to prevent attribute disclosure.
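The two notions just described can be checked with a short sketch like the one below (a hedged illustration, not from the paper), applied to one equivalence class given as a list of sensitive values.

import math
from collections import Counter

def distinct_l(sensitive_values):
    """Number of distinct sensitive values in the class."""
    return len(set(sensitive_values))

def entropy_l(sensitive_values):
    """Largest l such that the class entropy is at least log(l)."""
    counts = Counter(sensitive_values)
    total = len(sensitive_values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

eq_class = ["Cancer"] + ["Lung Disease"] + ["Kidney failure"] * 8
print(distinct_l(eq_class))           # 3    -> formally 3-diverse
print(round(entropy_l(eq_class), 2))  # 1.89 -> far from entropy 3-diversity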
L-diversity is also unnecessary and difficult to achieve in some cases: for a single sensitive attribute with two values, HIV positive (1%) and HIV negative (99%), the two values have very different degrees of sensitivity.

C. T-closeness
The t-closeness [14] model was introduced to overcome attacks that were still possible on l-diversity (such as the similarity attack). The l-diversity model treats all values of a given attribute in a similar way (as distinct) even if they are semantically related, and not all values of an attribute are equally sensitive. An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. It requires that the earth mover's distance between the distribution of the sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table be no more than a predefined parameter t.

Restrictions of t-closeness
T-closeness is effective when it is combined with generalization and suppression or with slicing [5]. It can lose the correlation between different attributes, because each attribute is generalized separately and dependencies between attributes are lost. There is no computational procedure to enforce t-closeness, and if t is chosen very small the utility of the data is damaged.

III. PROPOSED WORK: (n,t)-CLOSENESS
The (n, t)-closeness principle: an equivalence class E1 is said to have (n, t)-closeness if there exists a set E2 of records that is a natural superset of E1, such that E2 contains at least n records and the distance between the distributions of the sensitive attribute in E1 and E2 is no more than a threshold t. A table is said to have (n, t)-closeness if all of its equivalence classes have (n, t)-closeness. Thus (n,t)-closeness requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in a large-enough natural superset of that class.
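The following sketch illustrates the (n, t)-closeness test just defined. For a purely categorical sensitive attribute with unit ground distances, the earth mover's distance reduces to the total variation distance used here; that simplification, the data layout, and the example counts (modeled loosely on the example discussed later) are assumptions, not the paper's implementation.

from collections import Counter

def distribution(values):
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def variation_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def satisfies_nt_closeness(eq_class, superset, n, t):
    """eq_class and superset are lists of sensitive values; superset must be a
    natural superset of eq_class in the sense of the definition above."""
    if len(superset) < n:
        return False
    return variation_distance(distribution(eq_class), distribution(superset)) <= t

e1 = ["Pneumonia"] * 300 + ["Flu"] * 300            # an equivalence class
e2 = e1 + ["Pneumonia"] * 200 + ["Flu"] * 200       # a larger natural superset
print(satisfies_nt_closeness(e1, e2, n=1000, t=0.1))  # True: 1,000 records, same (0.5, 0.5) distribution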
S.No | Zip Code | Age | Disease | Count
1 | 47696 | 29 | Pneumonia | 100
2 | 47647 | 21 | Flu | 100
3 | 47602 | 28 | Pneumonia | 200
4 | 47606 | 23 | Flu | 200
5 | 47952 | 49 | Pneumonia | 100
6 | 47909 | 48 | Flu | 900
7 | 47906 | 47 | Pneumonia | 100
8 | 47907 | 45 | Flu | 900
9 | 47603 | 33 | Pneumonia | 100
10 | 47601 | 30 | Flu | 100
11 | 47608 | 35 | Pneumonia | 100
12 | 47606 | 36 | Flu | 100

Table 5 Original patient data
In the above definition of the (n, t)-closeness principle, the parameter n defines the breadth of the observer's background knowledge: a smaller n means that the observer knows the sensitive information of a smaller group of records. The parameter t bounds the amount of sensitive information that the observer can obtain from the released table: a smaller t implies a stronger privacy requirement.
S.No | ZIP Code | Age | Disease | Count
1 | 476** | 2* | Pneumonia | 300
2 | 476** | 2* | Flu | 300
3 | 479** | 4* | Pneumonia | 100
4 | 479** | 4* | Flu | 900
5 | 476** | 3* | Pneumonia | 100
6 | 476** | 3* | Flu | 100

Table 6 An anonymized version of Table 5
The intuition is that the observer should only be able to learn information about a population of a large-enough size (at least n). One key term in the above definition is "natural superset". Assume that we want to achieve (1000, 0.1)-closeness for the above example. The first equivalence class E1 is defined by (zip code = "476**", 20 ≤ Age ≤ 29) and contains 600 tuples. One equivalence class that naturally
contains it would be the one defined by (zip code = "476**", 20 ≤ Age ≤ 39); another would be the one defined by (zip code = "47***", 20 ≤ Age ≤ 29). If both of these large equivalence classes contain at least 1,000 records and E1's distribution is close to (i.e., within distance 0.1 of) either of them, then E1 satisfies (1,000, 0.1)-closeness. In fact, Table 6 satisfies (1,000, 0.1)-closeness. The second equivalence class satisfies (1,000, 0.1)-closeness because it contains 2,000 > 1,000 individuals and thus meets the privacy requirement (by setting the large group to be itself). The first and the third equivalence classes also satisfy (1,000, 0.1)-closeness because both have the same distribution (0.5, 0.5) as the large group formed by their union, and that large group contains 1,000 individuals. Choosing the parameters n and t affects the trade-off between privacy and utility: the larger n is and the smaller t is, the more privacy and the less utility one achieves.

IV. EXPERIMENTAL SETUP
We performed a sample experiment to check the efficiency of the new privacy measure; a sample graph is shown in Fig. 1. We compared the different techniques with the proposed model using the number of datasets and the privacy degree as parameters: the datasets are given as input, and the privacy achieved is measured as the output.
Fig 1 Comparison of different anonymization techniques (k-anonymity, l-diversity, t-closeness, (n,t)-closeness) by number of datasets and privacy efficiency

V. CONCLUSION
This paper presents a new approach called (n,t)-closeness to privacy-preserving microdata publishing. We discussed the related works and the drawbacks of existing anonymization techniques in detail. The new privacy technique overcomes the drawbacks of those anonymization techniques, as well as of generalization and suppression, and it provides security and a proportional calculation of the data. We illustrated how to calculate the overall proportion of the data and how to prevent attribute disclosure and membership disclosure, and we compared the different types of anonymization. Our experiments show that (n,t)-closeness preserves data utility better than the other anonymization techniques.

VI. REFERENCES
[1] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "ℓ-Diversity: Privacy Beyond k-Anonymity," available at http://www.cs.cornell.edu/_mvnak, 2005.
[2] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "ℓ-Diversity: Privacy Beyond k-Anonymity," 2006.
[3] D. Sacharidis, K. Mouratidis, and D. Papadias, "K-Anonymity in the Presence of External Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 3, March 2010.
[4] G. Nayak and S. Devi, "A Survey on Privacy Preserving Data Mining: Approaches and Techniques," India, 2011.
[5] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," ICDE 2007, pp. 106-115.
[6] In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2006), pp. 754-759.
[7] A. Inan, M. Kantarcioglu, and E. Bertino, "Using Anonymized Data for Classification," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), pp. 429-440, 2009.
[8] T. Li and N. Li, "Towards Optimal k-Anonymization," Elsevier, CERIAS and Department of Computer Science, Purdue University, West Lafayette, IN 47907-2107, USA, 2007.
[9] L. Sweeney, "Database Security: k-anonymity," retrieved 19 January 2014.
[10] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), pp. 557-570, 2002.
[11] R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," ACM SIGMOD Record, New York, vol. 29, no. 2, pp. 439-450, 2000.
[12] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), pp. 557-570, 2002.
[13] L. Sweeney, "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), pp. 571-588, 2002.
[14] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," ICDE Conference, 2007.
[15] T.M. Truta and V. Bindu, "Privacy Protection: p-Sensitive k-Anonymity Property," Proceedings of the Workshop on Privacy Data Management, with ICDE 2006, p. 94.
[16] R.C.W. Wong, J. Li, A.W.C. Fu, and K. Wang, "(α, k)-Anonymity: An Enhanced k-Anonymity Model for Privacy-Preserving Data Publishing," 2006.
E-WASTE MANAGEMENT – A GLOBAL SCENARIO R. Devika
Department of Biotechnology, Aarupadai Veedu Institute of Technology, Paiyanoor

INTRODUCTION:
Advances in the field of science and technology in the 18th century brought about the industrial revolution, which marked a new era in human civilization. Later, in the 20th century, Information and Communication Technology brought enormous changes to the Indian economy and industries, which have undoubtedly enhanced the quality of human life. At the same time, they have led to manifold problems, including an enormous amount of hazardous waste that poses a great threat to human health and the environment. Rapid changes in technology, urbanization, changes in media, planned obsolescence, etc. have resulted in a fast-growing surplus of electronic waste (e-waste) around the globe. About 50 million tonnes of e-waste are produced every year: the USA discards 3 million tonnes each year, amounting to 30 million computers per year; Europe disposes of 100 million phones every year; and China is second with 2.3 million tonnes of e-waste. Electronic waste, or e-waste, e-scrap, or electronic disposal, refers to all discarded electrical or electronic devices such as mobile phones, television sets, computers, refrigerators, etc. [1]. Other categories are re-usable items (working and repairable electronics), secondary scrap (copper, steel, plastic, etc.), and the rest, which is dumped or incinerated. Cathode Ray Tubes (CRTs) are considered one of the hardest types to recycle, and the United States Environmental Protection Agency (EPA) classifies CRT monitors as "hazardous household waste" since they contain lead, cadmium, beryllium, and brominated flame retardants as contaminants [2]. Guiyu, in the Shantou region of China, is referred to as the "e-waste capital of the world" [3]; it employs about 1,50,000 workers who work 16-hour days disassembling old computers and recapturing metals and other reusable parts for resale or reuse. Their work includes snipping cables, prying chips from circuit boards, grinding plastic computer cases, and dipping circuit boards in acid baths to dissolve the lead, cadmium, and other toxic metals [4]. Uncontrolled burning, disassembly, and disposal cause a variety of environmental problems such as groundwater contamination, atmospheric pollution, pollution through immediate discharge or surface runoff, and occupational health hazards (direct or indirect). Professor Huo Xia of Shantou University Medical College found that, of 165 children examined in Guiyu, 82% had lead in their blood above 100 µg, with an average of 149 µg, which is considered unsafe by international health experts [5]. Tossing equipment onto an open fire in order to melt plastics and burn away non-valuable metals releases
carcinogens and neurotoxins into the air, contributing to an acrid, lingering smog that includes dioxins and furans [6].

ENVIRONMENTAL IMPACTS OF E-WASTE [2]

E-Waste component | Process | Environmental Impact
Cathode Ray Tubes | Breaking and removal of yoke, then dumping | Leaching of lead, barium and other heavy metals into the water table; release of toxic phosphor
Printed Circuit Board | Desoldering and removal; open burning; acid bath to remove fine metals | Emission of glass dust, tin, lead, brominated dioxins, beryllium, cadmium, mercury, etc.
Chips and other gold-plated components | Chemical stripping using nitric and hydrochloric acid | Release of hydrocarbons, tin, lead, brominated dioxins, etc.
Plastics from printers, keyboards, monitors, etc. | Shredding and low-temperature melting | Emission of brominated dioxins, hydrocarbons, etc.
Computer wires | Open burning and stripping to remove copper | Ashes of hydrocarbons
Other Hazardous Components of E-Waste

E-Waste | Hazardous components | Environmental Impacts
Smoke alarms | Americium | Carcinogenic
Fluorescent tubes | Mercury | Health effects include sensory impairment, dermatitis, memory loss, muscle weakness, etc.
Lead acid batteries | Sulphur | Liver, kidney and heart damage, eye and throat irritation; acid rain formation
Resistors, nickel-cadmium batteries | Cadmium (6-18%) | Hazardous wastes causing severe damage to lungs, kidneys, etc.
Cathode Ray Tubes (CRT) | Lead (1.5 pounds of lead in a 15-inch CRT) | Impaired cognitive functions, hyperactivity, behavioural disturbances, lower IQ, etc.
Thermal grease used as heat sink for CPUs and power transistors, magnetrons, vacuum tubes and gas lasers | Beryllium oxide | Health impairments
Non-stick cookware (PTFE) | Perfluorooctanoic acid (PFOA) | Risk of spontaneous abortion, preterm birth, stillbirth, etc.
INDIAN SCENARIO
India has the label of being the second largest e-waste generator in Asia. According to a MAIT-GTZ estimate, India generated 3,30,000 tonnes of e-waste, equivalent to 110 million laptops ["Imported e-waste seized by customs officials", The Times of India, 20 August 2010]. Guidelines have been formulated with the objective of providing broad guidance on e-waste handling methodologies and disposal.

Extended Producer Responsibility (EPR)
This is an environment protection strategy that makes the producer responsible for the entire life cycle of the product, including take-back, recycling, and final disposal.

E-Waste Treatment and Disposal Methods

I. INCINERATION
Complete combustion of waste material at high temperature (900-1000°C).
Advantages: reduction of e-waste volume; maximum utilization of the energy content of combustible material; hazardous organic substances are converted into less hazardous compounds.
Disadvantages: release of large amounts of residues from gas cleaning and combustion; significant emission of cadmium and mercury (heavy metal removal has to be adopted).

II. RECYCLING
Monitors, CRTs, keyboards, modems, telephone boards, mobiles, fax machines, printers, memory chips, etc. can be dismantled into different parts, with removal of hazardous substances such as PCB, Hg and plastic, and segregation of ferrous and non-ferrous metals. Strong acids are used to remove heavy metals such as copper, lead, gold, etc.

III. RE-USE
This constitutes direct second-hand use, or use after slight modification, of the original functioning equipment. This method considerably reduces the volume of e-waste generation.

IV. LANDFILLING
This is the most widely used method of disposal of e-waste. Landfilling trenches are made in the earth and
waste materials are buried and covered by a thick layer of soil. Modern techniques include an impervious liner made up of plastic or clay; the leachates are collected and transferred to a wastewater treatment plant. Care should be taken in the collection of leachates, since they contain toxic metals such as mercury, cadmium and lead which can contaminate the soil and ground water.
Disadvantages: landfills are prone to uncontrolled fires and can release toxic fumes; polychlorinated biphenyls persist because they are non-biodegradable.

E-Waste Management
- 50-80% of the e-waste collected in the U.S. is exported for recycling.
- Five e-waste recyclers have been identified by the Tamil Nadu Pollution Control Board: Thrishyiraya Recycling India Pvt. Ltd.; INAA Enterprises; AER World Wide (India) Pvt. Ltd.; TESAMM Recycler India Pvt. Ltd.; Ultrust Solution (I) Pvt. Ltd.
- The Maharashtra Pollution Control Board has authorized the Eco Reco company, Mumbai, for e-waste management across India. TCS, the Oberoi group of hotels, Castrol, Pfizer, Aventis Pharma, Tata Ficosa, etc. recycle their e-waste with Eco Reco.

REFERENCES
1. Prashant and Nitya, 2008. Cash for laptops offers green solution for broken or outdated computers. Green Technology, National Center for Electronics Recycling News Summary, 08-28.
2. Wath SB, Dutt PS and Chakrabarti T, 2011. E-waste scenario in India, its management and implications. Environmental Monitoring and Assessment, 172, 249-252.
3. Frazzoli C, 2010. Diagnostic health risk assessment of electronic waste on the population in developing countries scenarios. Environmental Impact Assessment Review, 388-399.
4. Doctorow C, 2009. Illegal e-waste dumped in Ghana includes unencrypted hard drives full of US security secrets. Boing Boing.
5. Fela, 2010. Developing countries face e-waste crisis. Frontiers in Ecology and the Environment, 8(3), 117.
6. Sthiannopkao S and Wong MH, 2012. Handling e-waste in developed and developing countries: initiatives, practices and consequences. Science of the Total Environment.
AN ADEQUACY BASED MULTIPATH ROUTING IN 802.16 WIMAX NETWORKS

1 K. Saranya, 2 Dr. M.A. Dorai Rangasamy
1 Research Scholar, Bharathiar University, Coimbatore
2 Senior Professor & HOD, CSE & IT, AVIT, Chennai
1 Email: [email protected]
2 Email: [email protected]
Abstract — The multipath routing approach for 802.16 WiMAX networks presented here consists of a multipath routing protocol and congestion control. End-to-End Packet Scatter (EPS) alleviates long-term congestion by splitting the flow at the source and performing rate control. EPS selects the paths dynamically and uses a less aggressive congestion control mechanism on non-greedy paths to improve energy efficiency and fairness and to increase throughput in wireless networks with location information.

I. INTRODUCTION
WiMAX (Worldwide Interoperability for Microwave Access), or IEEE 802.16, is regarded as a standard for metropolitan area networks (MANs) and is one of the most reliable wireless access technologies for next-generation all-IP networks. IEEE 802.16 [1] (WiMAX) is the de facto standard for broadband wireless communication and is considered the missing link for the "last mile" connection in Wireless Metropolitan Area Networks (WMANs). It represents a serious alternative to wired networks such as DSL and cable modem. Besides Quality of Service (QoS) support, the IEEE 802.16 standard currently offers a nominal data rate of up to 100 Megabits per second (Mbps) and a coverage area of around 50 kilometers. Thus, the deployment of multimedia services such as Voice over IP (VoIP), Video on Demand (VoD) and video conferencing is now possible over WiMAX networks [2]. WiMAX is regarded as a disruptive wireless technology with many potential applications; it is expected to support business applications, for which QoS support will be a necessity [3]. In WiMAX, nodes can communicate without having a direct connection to the base station, which improves coverage and data rates even on uneven terrain [4].

II. ROUTING IN MESH NETWORKS
Unlike the PMP mode, which only allows communication between the BS and SSs, in mesh mode each station is able to create direct communication links to a number of other stations in the network instead of communicating only with a BS. However, in typical network deployments, there will still be certain nodes that provide the BS function of connecting the mesh network to the backbone networks.
When using mesh centralized scheduling, described below, these BS nodes perform much the same basic functions as the BS does in PMP mode. Communication on all links in the network is controlled by a centralized algorithm (run either by the BS or, decentralized, by all nodes periodically), scheduled in a distributed manner within each node's extended neighborhood, or scheduled using a combination of these. Stations that have direct links are called neighbors and form a neighborhood; a node's neighbor is considered to be one hop away from the node. A two-hop extended neighborhood contains, additionally, all the neighbors of the neighborhood (a small sketch of this notion is given at the end of this section). Our solution reduces the variance of throughput across all flows by 35%, a reduction which is mainly achieved by increasing the throughput of long-range flows by around 70%; furthermore, overall network throughput increases by approximately 10%. There are two basic mechanisms for routing in the IEEE 802.16 mesh network.

A. Centralized Routing
In mesh mode, the term BS (Base Station) refers to the station that has a direct connection to the backhaul services outside the mesh network; all the other stations are termed SSs (Subscriber Stations). Within a mesh network there are no downlink or uplink concepts. Nevertheless, a mesh network can operate similarly to PMP, with the difference that not all the SSs must be directly connected to the BS. The resources are granted by the mesh BS. This option is termed centralized routing.

B. Distributed Routing
In distributed routing, each node receives some information about the network from its adjacent nodes. This information is used to determine the way each router forwards its traffic. When using distributed routing, there is no clearly defined BS in the network [5]. In this paper, we present a solution that seeks to utilize idle or under-loaded nodes to reduce the effects of congestion on throughput.
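The neighborhood notions used above can be made concrete with the following small sketch (an illustration with a made-up topology, not taken from the paper).

# One-hop neighborhood and two-hop extended neighborhood from an adjacency list.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "E"},
    "D": {"B"},
    "E": {"C"},
}

def neighborhood(node):
    return set(adjacency[node])

def extended_neighborhood(node):
    """One-hop neighbors plus all neighbors of those neighbors (two hops)."""
    ext = set(adjacency[node])
    for nbr in adjacency[node]:
        ext |= adjacency[nbr]
    ext.discard(node)
    return ext

print(neighborhood("A"))           # {'B', 'C'}
print(extended_neighborhood("A"))  # {'B', 'C', 'D', 'E'}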
III. PROBLEM MODELING
In this section we first discuss EPS (End-to-End Packet Scatter). If splitting alone cannot make the network support the aggregate traffic (i.e., avoid congestion), it will only scatter packets over a wider area, potentially amplifying the effects of congestion collapse because of its longer paths (a larger number of contending nodes leads to a larger probability of loss). In such cases a closed-loop mechanism is required to regulate the source rates. EPS is applied at the endpoints of the flows and regulates the number of paths the flow is scattered on and the rate corresponding to each path. The source requires constant feedback from the destination regarding network conditions, which makes this mechanism more expensive than its local counterpart. The idea behind EPS is to dynamically search for and use free resources available in the network in order to avoid congestion. When the greedy path becomes congested, EPS starts sending packets on two additional side paths obtained with BGR, searching for free resources. To avoid disrupting other flows, the side paths perform a more aggressive multiplicative rate decrease when congested. EPS dynamically adjusts to changing conditions and selects the best paths on which to send the packets without causing oscillations; we achieve this by doing independent congestion control on each path. If the total available throughput on the three paths is larger than the sender's packet rate, the shortest path is preferred (this means that the edge paths will send at a rate smaller than their capacity). On the other hand, if the shortest path and one of the side paths are congested but another side path has unused capacity, our algorithm will naturally send almost all the traffic on the latter path to increase throughput.

IV. SYSTEM MODELING
A. Congestion Signaling
Choosing an appropriate closed-loop feedback mechanism impacts the performance of EPS. Unlike WTCP [6], which monitors packet inter-arrival times, or CODA [7], which performs local congestion measurements at the destination, we use a more accurate yet lightweight mechanism, similar to Explicit Congestion Notification [8]. Nodes set a congestion bit in each packet they forward when congestion is detected. In our implementation, the receiver sends state messages to the sender to indicate the state of the flow. State messages are triggered by the receipt of a predefined number of messages, as in CODA. The number of packets acknowledged by one feedback message is a parameter of the algorithm, which creates a tradeoff between high overhead with accurate congestion signaling (e.g., each packet is acknowledged) and less expensive but also less accurate signaling. The destination maintains two counters for each path of each incoming flow: packets
counts the number of packets received on the path, while congested counts the number of packets that have been lost or received with the congestion bit set to 1. When packets reaches a threshold value (given by a parameter called messages_per_ack), the destination creates a feedback message and sends it to the source. As suggested in the ECN paper [8], the feedback is negative if at least half of the packets received by the destination have the congestion bit set, and positive otherwise. This effectively implements a low-pass filter that avoids signaling transient congestion, and has the positive effect that congestion will not be signaled if it clears quickly.

B. RTT Estimation
When the sender starts the flow, it starts a timer equal to messages_per_ack / packet_rate + 2 · hop_count · hop_time. We estimate the hop count using the expected inter-node distance; hop_time is chosen as an upper bound on the time taken by a packet to travel one hop. Timer expiration is treated as negative feedback. A more accurate timer might be implemented by embedding timestamps in the packets (as in WTCP or TCP), but we avoid that for energy efficiency reasons. However, most of the time the ECN mechanism should trigger the end-to-end mechanism, limiting the use of timeouts to the cases when acknowledgements are lost.
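The timeout formula above can be evaluated as follows; the numeric values are illustrative assumptions, not taken from the paper.

def feedback_timeout(messages_per_ack, packet_rate, hop_count, hop_time):
    """Seconds the sender waits for a state message before assuming negative
    feedback (lost acknowledgement or heavy congestion), following
    messages_per_ack / packet_rate + 2 * hop_count * hop_time."""
    return messages_per_ack / packet_rate + 2 * hop_count * hop_time

# Example: 10 packets per ack at 80 packets/s over an estimated 6 hops,
# with a 20 ms per-hop upper bound.
print(feedback_timeout(messages_per_ack=10, packet_rate=80.0,
                       hop_count=6, hop_time=0.020))  # 0.365 s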
C. Rate Control
When congestion persists even after the flow has been split at the source, we use congestion control (AIMD) on each individual path to alleviate congestion. When negative feedback is received, a multiplicative decrease is performed on the corresponding path's rate. We use a differentiated multiplicative decrease that is more aggressive on the exterior paths than on the greedy path, to increase energy efficiency; effectively, this prioritizes greedy traffic when it competes with split traffic. Additive increase is uniform for all paths; when the aggregate rate of the paths exceeds the maximum rate, we favor the greedy path to increase energy efficiency. More specifically, if the additive increase is on the shortest (central) path, the exterior paths are penalized proportionally to their sending rates; otherwise, the rate of the side path is increased only up to the overall desired rate.

D. Discussion
EPS is suited for long-lived flows and adapts to a wide range of traffic characteristics, relieving persistent or widespread congestion when it appears. The paths created by this technique are more symmetric and thus farther away from each other, resulting in less interference. The mechanism requires each end-node to maintain state information for its incoming and outgoing flows, including the number of paths as well as the spread angle and send rate for each path. The price of source splitting is the periodic signaling
messages. If reliable message transfer is required, this cost is amortized, as congestion information can be piggybacked in the acknowledgement messages.

Pseudocode for a simplified version of EPS
// For simplicity, we assume a single destination and three paths
MaxPaths = 3;
bias = {0, 45, -45};              // degrees
reduce_rate = {0.85, 0.7, 0.7};

// sender side pseudocode
receiveFeedback(int path, bool flowCongested) {
    if (!EPS_Split) {                            // not already split
        if (flowCongested) splitSinglePath();
        else sendingRates[0] += increase_rate;   // additive increase
    } else {                                     // flow already split into multiple paths
        if (flowCongested) {
            sendingRates[path] *= reduce_rate[path];
        } else {                                 // no congestion: increase the path sending rate
            if (path == 0) {                     // main path
                sendingRates[0] += increase_rate;          // additive increase
                totalAvailableRate = sum(sendingRates);
                if (totalAvailableRate > 1) {              // we can transmit more than we want
                    diff = totalAvailableRate - 1;
                    for (int i = 1; i < MaxPaths; i++)
                        sendingRates[i] -= diff * sendingRates[i] /
                                           (totalAvailableRate - sendingRates[0]);
                }
            } else {
                sendingRates[path] += min(increase_rate, 1 - sum(sendingRates));
            }
        }
    }
}

splitSinglePath() {
    for (int i = 0; i < MaxPaths; i++) sendingRates[i] = 1 / MaxPaths;
    EPS_Split = true;
}

sendPacketTimerFired() {
    path_choice = LotteryScheduling(sendingRates);
    Packet p = Buffer.getNext();            // orthogonal buffer policy
    p.split = EPS_Split;                    // whether we split or not
    p.bias = bias[path_choice];
    next = chooseBGRNextHop(p);             // ... other variables
    sendLinkLayerPacket(next, p);
}

// receiver side pseudocode
receivePacket(Packet p) {
    receivedPackets[p.source][p.path]++;
    if (p.congested) congestedPackets[p.source][p.path]++;
    if (receivedPackets[p.source][p.path] > messagesPerAck) {
        boolean isCongested =
            congestedPackets[p.source][p.path] > receivedPackets[p.source][p.path] / 2;
        sendFeedback(p.source, isCongested);
        // ... reinitialize state variables
    }
}

When congestion is widespread and long-lived, splitting might make things worse, since paths are longer and the entire network is already congested. However, as we show in the evaluation section, this only happens when the individual flow throughput becomes dramatically small (10% of the normal value), and in that regime the cost of path splitting, in terms of loss in throughput, is insignificant. Also, if paths interfere severely, splitting traffic might make things worse due to media access collisions, as more nodes are transmitting. This is not to say that we can only use completely non-interfering paths; in fact, our approach exploits the tradeoff between contention (when nodes hear each other and contend for the medium) and interference (when nodes do not hear each other but their packets collide): throughput is more affected by high contention than by interference.

V. IMPLEMENTATION
In this section we present simulation results obtained through ns-2 simulations [9]. We use two main metrics for our measurements: throughput increase and fairness among flows. We ran tests on a network of 400 nodes distributed uniformly on a grid in a square area of 6000 m x 6000 m. We assume events occur uniformly at random in the geographical area; the node closest to an event triggers a communication burst to a uniformly selected destination. To emulate this model, we select a set of random source-destination pairs and run 20-second synchronous communications among all pairs. The data we present is averaged over hundreds of such iterations; the parameters are summarized in Table 1. An important parameter of our solution is the number of paths a flow should be split into and their corresponding biases. Simulation measurements show that the number of non-interfering paths between a source and a destination is usually quite small (more paths would only make sense in very large networks). Therefore we choose to split a flow exactly once, into 3 sub-flows, if congestion is detected; we prefer this to splitting into two flows for energy efficiency reasons (the cheaper, greedy path is also used). We have experimentally chosen the biases to be +/-45 degrees for EPS.
TABLE 1. SUMMARY OF PARAMETERS

Parameter | Value
Number of Nodes | 400
Area size | 6000m x 6000m
MAC | 802.11
Radio Range | 250m
Contention Range | 550m
Average Node Degree | 8

Link Layer Parameter | Value
Transmission Rate | 2 Mbps
RTS/CTS | No
Retransmission Count (ARQ) | 4
Interface queue | 4
Packet size | 100 B
Packet frequency | 80/s
Figure 4 Received vs Transmission

VI. RESULTS
Figure 1 Throughput vs Transmission
As expected, our solution works well for flows where the distance between the source and the destination is large enough to allow the use of non-interfering multiple paths. The EPS combination increases long-range flow throughput by around 70% compared to single-path transmission (both with and without AIMD). For short-range flows, where multiple paths cannot be used, the throughput obtained by our solution is smaller by at most 14%, as the short-range flows interfere with the split flows of long-range communications. However, by increasing long-range flows' throughput we improve fairness among the different flows, achieving a 35% lower throughput variance across flows of different lengths compared to a single path with AIMD. Moreover, the overall throughput is increased by around 10% for a moderate level of load (e.g., 3-6 concurrent transmissions). Finally, we show that our algorithm EPS does not increase the number of losses compared to AIMD.
A. Throughput and Transmissions
Fig. 1 presents how the number of transmissions in the network affects the average flow throughput. Throughput drastically decreases as the network becomes congested, regardless of the mechanism used. For a moderate number of transmissions (3-5), the EPS combination increases the overall throughput by around 10%. Without rate control, however, many of the sent packets are lost, leading to inefficiency.

B. Impact of transmission rate
Fig. 2a shows that the EPS combination has a packet loss rate similar to that of AIMD. Fig. 2b displays the overall throughput for different transmission rates; the throughput flattens as congestion builds in the network, but the (small) overall increase remains approximately steady.

C. Received vs. Transmission
Fig. 3 shows this is also true when the transmission rate varies. This is important on two counts: first, for energy efficiency reasons, and second, to implement reliable transmission.
VII. CONCLUSION
In this paper, we have presented a solution that increases fairness and throughput in dense wireless networks. Our solution achieves its goals by using multipath geographic routing to find available resources in the network. EPS (End-to-End Packet Scatter) splits a flow into multiple paths when it is experiencing congestion, and performs rate control to minimize losses while maintaining high throughput. It uses a less aggressive congestion response for the non-greedy paths to gracefully capture resources available in the network.

REFERENCES
[1] Murali Prasad and P. Satish Kumar, "An Adaptive Power Efficient Packet Scheduling Algorithm for WiMAX Networks," International Journal of Computer Science and Information Security (IJCSIS), vol. 8, no. 1, April 2010.
[2] Adlen Ksentini, "IPv6 over IEEE 802.16 (WiMAX) Networks: Facts and Challenges," Journal of Communications, vol. 3, no. 3, July 2008.
[3] Jianhua He, Xiaoming Fu, Jie Xiang, Yan Zhang, and Zuoyin Tang, "Routing and Scheduling for WiMAX Mesh Networks," in WiMAX Network Planning and Optimization, edited by Y. Zhang, CRC Press, USA, 2009.
[4] Vinod Sharma, A. Anil Kumar, S. R. Sandeep, and M. Siddhartha Sankaran, "Providing QoS to Real and Data Applications in WiMAX Mesh Networks," in Proc. WCNC, 2008.
[5] Yaaqob A.A. Qassem, A. Al-Hemyari, Chee Kyun Ng, N.K. Noordin, and M.F.A. Rasid, "Review of Network Routing in IEEE 802.16 WiMAX Mesh Networks," Australian Journal of Basic and Applied Sciences, 3(4): 3980-3996, 2009.
[6] P. Sinha, T. Nandagopal, N. Venkitaraman, R. Sivakumar, and V. Bhargavan, "A Reliable Transport Protocol for Wireless Wide-Area Networks," in Proc. of Mobihoc, 2003.
[7] C.Y. Wan, S.B. Eisenman, and A.T. Campbell, "CODA: Congestion Detection and Avoidance in Sensor Networks," in Proc. of SenSys, 2003.
[8] K.K. Ramakrishnan and R. Jain, "A Binary Feedback Scheme for Congestion Avoidance in Computer Networks," ACM Transactions on Computer Systems, vol. 8, 1990.
[9] NS-2 simulator, http://www.isi.edu/nsnam/ns/.
CALCULATION OF ASYMMETRY PARAMETERS FOR LATTICE BASED FACIAL MODELS

M. Ramasubramanian 1, Dr. M.A. Dorai Rangaswamy 2
1 Research Scholar & Associate Professor, 2 Research Supervisor & Sr. Professor
1,2 Department of Computer Science and Engineering, Aarupadai Veedu Institute of Technology, Vinayaka Missions University, Rajiv Gandhi Salai (OMR), Paiyanoor-603104, Kancheepuram District, Tamil Nadu, India
1 Email: [email protected]
2 Email: [email protected]

Abstract— Construction of human-like avatars is key to producing realistic animation in virtual reality environments and has become commonplace in present-day applications. However, most of the models proposed to date intuitively assume the human face to be a symmetric entity. Such assumptions produce unfavorable drawbacks in applications where the analysis and estimation of facial deformation patterns play a major role. Thus, in this work we propose an approach to define asymmetry parameters of facial expressions and a method to evaluate them. The proposed method is based on capturing facial expressions in three dimensions using a rangefinder system. The three-dimensional range data acquired by the system are analyzed by adapting a generic LATTICE with facial topology. The asymmetry parameters are defined based on the elements of the generic LATTICE and evaluated for facial expressions of normal subjects and of patients with facial nerve paralysis disorders. The proposed system can be used to store asymmetric details of expressions and is well suited to remote doctor-patient environments.

Keywords— Generic 3D models, morphing, animation, texture, etc.

I. INTRODUCTION
The construction of facial models that interpret human-like behaviors dates back to the 1970s, when Parke [1] introduced the first known "realistic" CG animation model that moved facial parts to mimic human-like expressions. Since then, a noticeable interest in producing virtual realistic facial models with different levels of sophistication has been seen in the animation industry, telecommunication, identification, and medical related areas. However, most of these models have inherently assumed that the human face is a symmetric entity. The relevance and importance of defining asymmetric properties of facial models can be illustrated in many application areas; in this study, its relevance to the field of Otorhinolaryngology in Medicine is illustrated. A major requirement in such an application is to construct robust facial parameters that determine the asymmetric deformation patterns in the expressions of patients with facial nerve paralysis disorders. These parameters can be used to estimate the level of deformation in different facial parts, as well as to transmit and receive them at the ends of remote doctor-patient environments.

Acknowledgement: the authors would like to thank Dr. Toshiyuki Amano
(Nagoya Institute of Technology) and Dr. Seiichi Nakata (School of Medicine, Nagoya University) for their support rendered towards the success of this work, and Yukio Sato, Dept. of Electrical and Computer Engineering, Nagoya Institute of Technology.

Many attempts have been made in the past by researchers to develop systems to analyze and represent the levels of facial motion dysfunction in expressions. The pioneering work of Neely et al. [2] reported a method to analyze movement dysfunction on the paretic side of a face by capturing expressions in 2D video frames; selected frames of expressions are subtracted, using image subtraction techniques, from similar frames captured at the rest condition. Similarly, most of the other attempts proposed to date are based on 2D intensity images and inherently possess the drawbacks associated with inconstant lighting in the environment, changes of skin color, etc. To eliminate these drawbacks, the use of 3D models has become commonplace. Although there are many techniques available today for the construction of 3D models, a laser-scanning method that accurately produces high-density range data is used here to acquire 3D facial data of expressions. Construction of 3D models from scanned data can be done by approximating the measured surface with continuous or discrete techniques. The continuous forms, such as spline curve approximations, can be found in some previous animation works [3], [4]. A great disadvantage of these approaches is the inevitable loss of subtle information of the facial surface during the approximation; making them more realistic while preserving the subtle information requires intensive computation, in the form of introducing more control points, which makes them difficult to use in analysis stages. In contrast, LATTICE-based methods are less complicated to implement and are widely used in modeling tasks. Thus the approach proposed here adheres to LATTICE-based 3D facial models in deriving asymmetry parameters.

2. CONSTRUCTION OF 3D MODEL
Predesigned facial actions are measured by a rangefinder system [5], which produces 8-bit 512 x 242 resolution frontal range images and color texture images. A symmetric generic face LATTICE with triangular patches is adapted to each of these range images to produce 3D models to be used in asymmetry estimations.
The LATTICE adaptation is a tedious and time-consuming process if it involves segmentation of the range images to extract feature points for mapping. Instead, here we resort to a much simpler approach of extracting feature points from the texture images, since both the range and texture images captured by this system have one-to-one correspondence. Forty-two evenly distributed feature points are selected as mapping points, for which the corresponding LATTICE locations are predetermined. We then calculate the displacements between these feature points and the corresponding LATTICE locations. The least squares approximation method is used to fit the face LATTICE to the feature points. The range data
from the corresponding range images are mapped to the vertices of the face LATTICE to produce a 3D model of the measured expression [6]. Constructed 3D models of a patient with Bell's palsy are depicted in Fig. 1 for the eye-closure and grin facial expressions.

3. ESTIMATION OF DEFORMATION ASYMMETRY
The facial deformations during expressions are calculated based on the 3D models generated for each expression as described in the previous section. Since facial expressions modify the facial surface at rest, the 3D models we generate also reflect these deformations in their constituent triangular patches. To estimate these deformations, we implement the 3D LATTICE model as a LATTICE of connected linear springs. Suppose a particular patch on the left side consists of three springs, with their gained lengths at the rest condition from the equilibrium as cL1, cL2 and