A New Approach to Blog Information Searching and ... - IEEE Xplore

A New Approach to Blog Information Searching and Curating

Harsh Khatter1

Brij Mohan Kalra2

Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College, Ghaziabad, India [email protected], [email protected]

Abstract- Blogs are one of the main components of Web 2.0, i.e. the Read-Write Web. Blogs are online diaries created by individuals, and they provide information on almost any topic from all over the world. With the increasing use of blogging sites, people share their opinions, experiences, and views with others. However, it is not easy to fetch valuable information from the many available blogs in the time a reader has, which is normally very short. In this paper, a new approach to fetching blog posts automatically from various blogging sites, based on the user’s interest, is introduced. This method of content curation is intended to improve the knowledge experience of a user. The paper presents a blog model that includes a content curation method together with searching and rating algorithms. The paper also discusses the major characteristics of blogs and the gaps in currently available systems.

Keywords: Blogs, Web tools, Blog mining, Curation, Content Aggregation, Searching Algorithm, Rating Algorithm, Internet, Web technology, Web services, Web 2.0.

I. INTRODUCTION

Blogs are online diaries maintained by an individual, who may be a person or an organization. Blogs share knowledge and information across a wide audience. They are websites that allow one or more individuals to write about things they want to share with others. The whole universe of blog sites, i.e. the blog world, is referred to as the Blogosphere [1]. Microblogs are also a kind of blog. Microblogs, as a new social medium, differ considerably from other social media in information updating frequency, organization structure, user connection, etc., and have an astonishing power of convergence and penetration [2].

In the present scenario, blogs are, to some extent, a type of website. People usually create a blog as a hobby to share their information and experience on a particular subject. What is published or posted depends entirely on the user. Entries are displayed in reverse chronological order [3]. Blogs allow users to publish their opinions, views and ideas on any topic. Other users may analyze the posts and can comment on, rate, add and share them. The linkage between blogs indicates that the formation of communities in the blogosphere is not a random process but a result of shared interests, which bind bloggers together across the globe. Blogging is the act of posting to a blog, and the blogosphere is the distributed, collective and interlinked world of blogging [4].

Malik Muhammad Saad Missen et al. state that it would be quite reasonable to say that the blogging phenomenon is the generator of opinion information for the Web and can be utilized to satisfy many opinion-based information needs of users [5]. Blogs are social media that enable users to publish information easily and rapidly on their personal activities and interests [6]. There are many potential benefits of blogs. Blogs can promote analytical and critical thinking. Blogs can promote creative, intuitive and associational thinking, and can be seen as a resource for interlinking and for commenting on interlinked ideas. Blogs are a combination of solitary and social interaction [4]. Hsiu-Hau Cheng states that bloggers share their creative ideas and articles on blogs [7]. A blogger with high structural embeddedness can connect directly with many bloggers who are his or her friends. Lourdes Canos Daros et al. [8] state that blogs help achieve the main objective of the teaching-learning process: the acquisition of skills by students.

Characteristics of blogs, based on various parameters, are presented in Table I.

TABLE I
CHARACTERISTICS OF BLOGS

Parameters                          Blog Characteristics
Locus of control                    Centralized and personal
Post                                Owned by a poster (could also be owned by a small group)
Aim                                 Interact with others; analyze, comment on, rate, add and share the posts (information)
Content (Static/Dynamic)            Static, once posted (comments by other users may add information to a posting)
Way of display                      Displayed in reverse chronological order
Comments required                   Optional, but encouraged to continue the conversation thread
Intent                              Personal
Login required to see the content   Optional
Content posts                       Ideas
Personal profile maintained         Yes

A blog has the potential to change the way we perceive information and make friends [9]. Blogs are an important source of user-generated information, but retrieving the relevant information from the blogosphere within a reasonable time is a difficult task. The goal is to provide the user with reliable and accurate information from blogs conveniently. For bloggers and frequent blog readers, it is impracticable to keep track of the growing content on the blogosphere. Hence, a service that recommends blogs matching the user’s interests will be of high value to them.

II. RELATED WORK

Beyond serving as online diaries, weblogs have evolved into complex social structures. Blogging software allows users to publish opinions on any topic without constraints. For the literature survey, the work was divided into four parts: i) designing of blogs, ii) personalization of blogs, iii) mining of posts from various blogs, and iv) searching and rating algorithms.

V. K. Singh et al. mention that creating blog posts is easy and simple, which has attracted people and companies across disciplines to exploit blogs for varied purposes. The valuable data contained in posts from a large number of users across the world provide a rich data source [1]. Cheng Tao et al. discuss the virtual enterprise (VE) as an effective way of collaborating, and design a wiki- and blog-based knowledge-sharing mechanism and a prototype system that support enterprises in intercommunicating, sharing knowledge and managing knowledge within a VE environment [10]. Zhou Ping proposed an algorithm for personalized blog information retrieval based on the user’s interest model. That paper discusses the system architecture of personalized blog information retrieval and studies the identification module for blog webpages [3]. Obradovic et al. state that when blog articles are monitored to track a certain personality or product, the automatic identification of topic clusters is of high interest. Clustering by textual content is a popular method to accomplish this. They focused on how links between individual blog articles can be used to support the clustering with another dimension of information [11]. According to Chau, M. et al., blogs are very dynamic, so it is not straightforward to apply traditional Web mining techniques to them. They suggest that a general blog framework created for different tasks must consist of a blog spider, a blog parser, a blog content analyzer, a blog network analyzer, and a blog visualizer [12]. In addition, a framework for blog mining and search, BlogHarvest, is demonstrated by Mukul Joshi and Nikhil Belsare [13]. This framework extracts the interests of the blogger, finds and recommends blogs with similar topics, and provides blog-oriented search functionality. Kuzar, T. and Navrat, P. presented a method for enhancing blog clustering using the information hidden in web comments. They conclude that blog clusters based on clustering of the commentators differ significantly from the content clusters [14]. Guoliang Li et al. introduced an efficient 3-in-1 keyword search method, which works well for all types of data, i.e. unstructured, semi-structured, and structured data. It is an efficient and adaptive keyword search over all kinds of data.
The authors use this algorithm for indexing and querying large collections of heterogeneous data [15].

A. Gaps Observed

After the survey, several gaps were observed. No blog search engine available at present also behaves as a blog: the user can search blog posts but cannot start blogging from there. There is a large list of available blog search engines, but each is related to a specific field only, i.e. they do not cover the entire variety of blog topics. To find a relevant post, the user has to search manually on each individual blog site, which consumes a lot of time and a large number of clicks. Moreover, the clicks introduce unnecessary traffic due to advertisements, which consumes resources needlessly. At present, there is no method available that automatically traverses other websites using their URLs and fetches the required information. Curation is a method that retrieves relevant information, aggregates it, and provides it to the user. Another major observation is that it is not possible to keep track of all blogs. In addition, the important factor is “what exactly the user wants”, and that is the need of the hour. There is a need to add more functionality and services to blogs.

III. PROPOSED DESIGN – BLOG MINER FRAMEWORK

Based on the gaps in currently available systems, a new approach is proposed to improve the knowledge searching experience of the user. The proposed model is shown in Figure 1 and consists of four major modules: Search Manager, Curator, Personalized Module, and Rating Engine. The working of the proposed model is similar to a mining activity: it mines blog posts as per the need of the user, hence the name “Blog Miner”. The model extracts blog posts from various blog websites.

A. Blog Miner Interface

The first part of the proposed model is its interface, the Blog Miner Interface, through which all its modules are connected. The user interacts via this interface to track blog posts and to enter search queries. To interact with the user, the Blog Miner Interface consists of four sub-interfaces: Curator Interface, Login Interface, Keyword Query Interface, and Post Visualizer. The Curator Interface shows the blog posts produced by the Curator module. The user can read, rate, comment on and share these posts. The user has to log in to the blogging site to access the services. The Login Interface has two fields, user name and password, and an option to sign up. The Keyword Query Interface consists of search bars where the user enters the query in the form of keywords.

Fig.1. Model of Blog Miner

Two search bars are given on the home page: one for local blog post search results, and the second for global blog post search results. The search results are shown using the Post Visualizer. All blog posts are visible to the user in reverse chronological order, i.e. latest post first.

B. Login and Personalization

As the name suggests, the user logs in using this module. After login, the user can share, add, comment on, rate and modify blog posts. When the user performs any operation on a blog post, his/her name is displayed with that post. There are options for Forgot Password and New User. The user’s mail id is taken for verification purposes. In Blog Miner, a user can create a profile by specifying his/her fields of interest. Whenever a user logs in, the user interacts only with blog posts matching the interests stored in the database for his/her profile. In this way, a more relevant set of posts is shown to the user. As shown in the model, the login module is connected to the Analyzer, where the user’s profile is analyzed and the specified fields of interest are extracted. The analyzed keywords are given to the Search Manager, where the local search process starts. Local search takes the keywords and searches for blog posts in the blog posts database. This is an additional facility for the users of Blog Miner, and it reduces the user’s searching time.

C. Search Manager

This module comprises all the processes for searching blog posts. Search can be done internally, i.e. a local blog site search, and globally, i.e. across the blogosphere. When the user enters a keyword in the search bar, the Search Manager takes the keyword as input and starts the requisite search process. If the user wants to fetch results from the local blog site, the local search module is used.
Moreover, the global search starts if the user wants to fetch blog posts from various other blogs on the WWW. After searching, the “Result Aggregator” aggregates the resulting blog posts. Searching can be done on any type of data available on the World Wide Web, whether structured, unstructured or semi-structured. Guoliang Li et al. (2008) gave an algorithm that works well for all three types of data [15]. A similar algorithm has been designed and is used for searching blog posts in Blog Miner.

Keyword based searching

    keyword = local_search
    for each post in post_list {
        if (keyword is in post_title) or (keyword is in post_text) {
            POP(post)
            enqueue(post)
        }
    }
    if (queue == empty)
        return “No search result found”;
    for each post in post_queue {
        print(post);
        dequeue(post);
    }
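As an illustration only, the keyword-based search above can be sketched in Python. The `Post` class, the function names `keyword_search` and `show_results`, and the case-insensitive substring match are assumptions made for this sketch and are not taken from the Blog Miner implementation. The POP step of the pseudocode is left out here, since the sketch does not need to remove posts from the source list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical blog post record: only the fields the search inspects."""
    title: str
    text: str


def keyword_search(keyword, posts):
    """Return a FIFO queue of posts whose title or text contains the keyword."""
    needle = keyword.lower()  # assumption: matching is case-insensitive
    queue = deque()
    for post in posts:
        if needle in post.title.lower() or needle in post.text.lower():
            queue.append(post)  # enqueue(post)
    return queue


def show_results(queue):
    """Drain the queue in arrival order and return the lines to display."""
    if not queue:
        return ["No search result found"]
    lines = []
    while queue:
        lines.append(queue.popleft().title)  # print(post); dequeue(post)
    return lines


posts = [
    Post("Learning Python", "notes from a beginner"),
    Post("Gardening", "tomatoes and python snakes"),
    Post("Travel", "a trip to Delhi"),
]
print(show_results(keyword_search("python", posts)))
print(show_results(keyword_search("rust", posts)))
```

The same matching function could be applied to posts fetched from other blogs in the global case; only the source of the post list would differ.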

Searching on global search

    keyword = global_search
    db_temp = blog(keyword)
    for each url in db_temp {
        if (keyword is in url_post_title) or (keyword is in url_post_text) {
            enqueue(post)
        }
    }
    if (queue == empty)
        return “No search result found”;
    for each post in post_queue {
        print(post);
        dequeue(post);
    }

D. Curator

A curator is a method that curates content from various sources and clusters similar objects, i.e. blog posts. Here, content curation is performed automatically. The approach is to fetch posts from various blogs by traversing their websites via their URLs. Utilities such as wget can be used to obtain the hyperlinks available on the page at a given URL. On a blogging site, these hyperlinks include the links to blog posts. All blog posts are fetched via these hyperlinks and stored in a temporary database for refinement. The refined result is stored in the blog post database and presented to the user through the Post Visualizer, i.e. the Blog Miner Interface. The steps required to perform curation are:

Step 1. Check the database for the blog URLs, which are inserted manually by a user.
Step 2. Traverse each URL and apply the wget utilities to it.
Step 3. Through the blog URLs, these utilities fetch all blog posts and hyperlinks on that page, which are stored temporarily in a database.
Step 4. Search the temporary database for the top rated blog posts within the specified period of time.
Step 5. Store the selected blog posts in the blog post database and drop the rest of the content.
Step 6. Present the relevant blog posts to the user through the Post Visualizer (Blog Miner Interface).

E. Rating Engine

The user is allowed to rate the local blog posts. The rating is visible publicly to all users in the form of stars. Based on the star rating, a numeric value is assigned, and the blog posts are sorted in descending order of this value. Two points are given for each star: if a blog post is rated four stars, a numeric value of eight is assigned to that post. The data structure used for storing the blog posts is a simple queue. An algorithm sorts the blog posts in descending order, and the posts are displayed to the user under the tag “Top Rated”.

Step 1. All blog posts are stored in the posts database.
Step 2. When a user rates a post, the rating is shown in the form of stars.
Step 3. Each star is allotted a numeric value of 2 (5 stars means a numeric value of 10).
Step 4. When the user clicks on a star, the rating is decided according to the assigned value.
Step 5. All rated posts are enqueued. Sorting is then performed based on the numeric value assigned to the posts.
Step 6. The result is shown under a separate label, “Top rated posts”, with the blog posts in decreasing order of their rating, i.e. numeric value.
Step 7. If two or more users rate a post, the average of their numeric values is taken.

IV. WORKING OF BLOG MINER

The general working of the proposed model, Blog Miner, is divided into three parts: first the content is searched, then it is curated, and finally it is presented to the user. A basic model is shown in Figure 2.
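The Rating Engine steps of Section III-E (two points per star, the average taken when several users rate a post, and posts listed in descending order of value) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `POINTS_PER_STAR`, `post_score` and `top_rated`, the dictionary of star ratings keyed by a post identifier, and the choice to skip unrated posts are all hypothetical and do not come from the Blog Miner implementation.

```python
from statistics import mean

POINTS_PER_STAR = 2  # Step 3: each star is worth a numeric value of 2


def post_score(star_ratings):
    """Step 7: average the numeric values given by all users who rated the post."""
    return mean(stars * POINTS_PER_STAR for stars in star_ratings)


def top_rated(ratings_by_post):
    """Steps 5-6: return (post_id, score) pairs in descending order of score.

    Assumption: posts that nobody has rated are left out of the list.
    """
    scored = [
        (post_id, post_score(ratings))
        for post_id, ratings in ratings_by_post.items()
        if ratings
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical star ratings (1 to 5) collected for four posts.
ratings = {"post-a": [4, 5], "post-b": [2], "post-c": [], "post-d": [5]}
for post_id, score in top_rated(ratings):
    print(post_id, score)
```

With these sample ratings, a post rated four and five stars by two users receives the value (8 + 10) / 2 = 9, which places it below a post with a single five-star rating (value 10).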

Fig. 2. A Basic Model

The proposed system works as follows:

Step 1. The user enters a keyword in a search bar, i.e. the local search bar or the global search bar, to fetch the relevant blog posts. The working of local and global search is given separately in Table II.

TABLE II
WORKING FOR LOCAL/GLOBAL SEARCH

Step 1
    For Local Search: The user enters the keyword in the local search bar to search local blog posts.
    For Global Search: The user enters the keyword in the global search bar to search for blog posts in the blogosphere.
Step 2
    For Local Search: Local blog posts are searched for the entered keywords (the relevant blog posts are searched within the local website).
    For Global Search: Blog posts are searched for the entered keywords (blogosphere search with the help of Google blog search).
Step 3
    For Local Search: The search is applied to the local blog posts database.
    For Global Search: All Google blog search results are shown to the user.
Step 4
    For Local Search: Relevant blog posts are retrieved and presented to the user through the Blog Miner Interface.
    For Global Search: When the user clicks on any blog post in the blogosphere results, the user is redirected to the webpage of that blog post.
Step 5
    For Local Search: The user can read, comment on, share and rate the blog posts.
    For Global Search: The user can read and share that blog post, but cannot comment on or rate it.

Step 2. When a user logs in, he/she interacts only with blog posts of his/her interest. The Search Manager searches the blog posts database for posts matching the user’s interest.
Step 3. In parallel, the Curator curates blog posts automatically. It takes the URLs from the URL database and periodically fetches all blog posts from those URLs. Only the selected blog posts are stored in the blog post database, and the rest of the content is dropped.
Step 4. The user is permitted to rate the local blog posts with stars, and the rating is visible publicly to all users in the form of stars.
Step 5. All blog posts are displayed in reverse chronological order, and the user can comment on, rate, add and share the post information.

V. CONCLUSION

Blogs are among the important Web tools. Nowadays, a major part of knowledge and recent activity is shared using blogs. Various blog algorithms, designs, and the gaps in present systems have been discussed in this paper. To enhance the performance of blogs, a new approach has been discussed, which is intended to improve information searching and the knowledge experience of a user. Blog Miner is a combination of a blog search engine and an individual blog. In addition, users can customize their search results and blogging activity. To find a relevant post, the user does not need to spend a lot of time clicking through sites to get the required information. Moreover, the user gets exactly what he/she wants.

VI. FUTURE ENHANCEMENT

Blog Miner provides a new idea for automatic content curation. This approach is merged with existing searching and rating algorithms, which newer algorithms may replace in future. Collaboration of blogs with other Web tools such as Social Networking Sites (SNS), discussion forums, wikis, online communities, etc., is also possible. Integrating these tools will make the services of Web 2.0 easier to use. Proper collaboration of these Web 2.0 tools will provide a new platform for users to learn more easily, to search, and to communicate with others, i.e. friends, experts, guides, or people with similar interests. Just as Web 2.0 was an evolution in the Internet world, the collaboration of its tools will be an evolution in the informal eLearning world.

REFERENCES

[1] V. K. Singh, D. Mahata, and R. Adhikari, “Mining the Blogosphere from a Socio-political Perspective”, International Conference on Computer Information Systems and Industrial Management Applications (CISIM), pp. 365-370, 2010.
[2] Hao Lu, “The research on micro-blog public opinion index and the application of prototype system”, 9th IEEE International Conference on Networking, Sensing and Control (ICNSC), pp. 405-410, 2012.
[3] Zhou Ping, “Research on Personalized Blog Information Retrieval”, International Conference on Web Information Systems and Mining (WISM), pp. 289-292, 2010.
[4] Peter Duffy and Axel Bruns, “The use of blogs, wikis and RSS in education: A conversation of possibilities”, Learning and Teaching Conference, 2006.
[5] Malik Muhammad Saad Missen, Mohand Boughanem, and Guillaume Cabanac, “Opinion Detection in Blogs: What is still Missing?”, International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 270-275, 2010.
[6] Masahiko Itoh, Naoki Yoshinaga, Masashi Toyoda, and Masaru Kitsuregawa, “Analysis and Visualization of Temporal Changes in Bloggers’ Activities and Interests”, Pacific Visualization Symposium (PacificVis), IEEE, pp. 57-64, 2012.
[7] Hsiu-Hau Cheng, “The Creativity of Bloggers: From the Perspective of Network Embeddedness”, International Conference on Information Society (i-Society), pp. 374-376, 2011.
[8] Daros, L.C.; Marin-Garcia, J.A.; Romano, C.A.; Miralles, C.; Sabater, J.J.G.; Marin, R.P.; Sabater, J.P.G.; Mascarell, C.S.; and Carreras, P.I.V., “Using blogs in teaching postgraduate courses”, Promotion and Innovation with New Technologies in Engineering Education (FINTDI), pp. 1-6, 2011.
[9] Tse-Ming Tsai, Chia-Chun Shih, and Seng-cho T. Chou, “Personalized Blog Recommendation Using the Value, Semantic, and Social Model”, International Conference on Innovations in Information Technology, pp. 1-5, 2006.
[10] Cheng Tao, Peng Xiaobo, Feng Ping, and Du Jianming, “Research on Design of A Wiki & Blog-based Knowledge-sharing Mechanism for Virtual Enterprise”, Third International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), pp. 1133-1137, 2011.
[11] Obradovic, D., Pimenta, F., and Dengel, A., “Mining shared social media links to support clustering of blog articles”, International Conference on Computational Aspects of Social Networks (CASoN), pp. 181-184, 2011.
[12] Chau, M., Lam, P., Shiu, B., Xu, J., and Jinwei Cao, “A Blog Mining Framework”, International Journal of IT Professional, vol. 11, pp. 36-41, 2009.
[13] Mukul Joshi and Nikhil Belsare, “BlogHarvest: Blog Mining and Search Framework”, International Conference on Management of Data COMAD, 2006.
[14] Kuzar, T. and Navrat, P., “Slovak Blog Clustering Enhanced by Mining the Web Comments”, International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3, pp. 293-296, 2011.
[15] Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou, “EASE: An efficient 3-in-1 Keyword Search Method for Unstructured, Semi-structured & Structured Data”, Proceedings of the International Conference on Management of Data, ACM SIGMOD, pp. 903-914, 2008.

Harsh Khatter is a postgraduate student pursuing a Master of Technology in Computer Science and Engineering at Ajay Kumar Garg Engineering College, Ghaziabad, India. He received his bachelor’s degree in 2010. For his thesis, he is working on Web 2.0 tools, in particular blogs. His research interests include Web services, Data Mining and Databases. He is also a member of the IEEE.

Brij Mohan Kalra is currently working as Professor and Head of the Department of Computer Science and Engineering at Ajay Kumar Garg Engineering College, Ghaziabad, India. He received his B.Tech. from Delhi College of Engineering, Delhi in 1977 and completed his M.Tech. at IIT Delhi in 1991. He has 35 years of experience in academia and industry in the CSE and IT fields. He is pursuing his Ph.D. in CSE at Gautam Buddha University, Greater Noida, India. His research interests include eLearning, Computer Networks, and Digital Logic Design. He is also a member of several professional bodies: IEEE, CSI, and IET.