BIG DATA STATISTICS WITH R

AUTHOR : UDEH TOCHUKWU LIVINUS

ACKNOWLEDGEMENT I want to thank God for the grace to compile this piece of work. I also want to thank my twin brother for the support and encouragement he gave me during the work process. And most of all, thanks to my professors – Professor Natalya Zagorodna, Prof. Rachid Chelouah, Prof. Bezma Zeddini and Prof. Maria Maliek. Thank you all for the support and materials provided for me during the research process. God bless you all.

Table of Contents

Chapter 1. Twitter Authentication and Retrieval of Big Data Tweets
Chapter 2. Big Data Concepts
Chapter 3. Sources of Big Data
Chapter 4. Big Data Analytic Tools
Chapter 5. Twitter Big Data Statistical Analysis and Visualization

CHAPTER 1: TWITTER AUTHENTICATION AND TWEETS RETRIEVAL
Twitter is an online social networking service that enables users to send and read short 140-character messages called "tweets". Registered users can read and post tweets, but unregistered users can only read them. Users access Twitter through the website interface, SMS, or mobile device app. Twitter Inc. is based in San Francisco and has more than 25 offices around the world.
Tweets are publicly visible by default, but senders can restrict message delivery to just their followers. Users can tweet via the Twitter website, compatible external applications (such as for smartphones), or by Short Message Service (SMS) available in certain countries. Retweeting is when a tweet is forwarded to other users via Twitter. Both tweets and retweets can be tracked to see which ones are most popular. While the service is free, accessing it through SMS may incur phone service provider fees. Users may subscribe to other users' tweets – this is known as "following" and subscribers are known as "followers" or "tweeps", a portmanteau of Twitter and peeps. Users can check who is unsubscribing from them on Twitter ("unfollowing") via various services. In addition, users can block those who have followed them. Twitter allows users to update their profile via their mobile phone either by text messaging or by apps released for certain smartphones and tablets. Twitter has been compared to a web-based Internet Relay Chat (IRC) client. In a 2009 Time essay, technology author Steven Johnson described the basic mechanics of Twitter as "remarkably simple". As a social network, Twitter revolves around the principle of followers. When you choose to follow another Twitter user, that user's tweets appear in reverse chronological order on your main Twitter page. If you follow 20 people, you'll see a mix of tweets scrolling down the page: breakfast-cereal updates, interesting new links, music recommendations, even musings on the future of education.
Open Authentication (OAuth) is an open standard for authentication, adopted by Twitter to provide access to protected information. Passwords are highly vulnerable to theft, and OAuth provides a safer alternative to traditional authentication approaches using a three-way handshake. It also improves the confidence of the user in the application, as the user's password for his Twitter account is never shared with third-party applications. The authentication of API requests on Twitter is carried out using OAuth, and Twitter APIs can only be accessed by applications. Below we detail the steps for making an API call from a Twitter application using OAuth:
1. Applications are also known as consumers, and all applications are required to register themselves with Twitter. Through this process the application is issued a consumer key and secret, which the application must use to authenticate itself to Twitter.
2. The application uses the consumer key and secret to create a unique Twitter link to which a user is directed for authentication. The user authorizes the application by authenticating himself to Twitter. Twitter verifies the user's identity and issues an OAuth verifier, also called a PIN.

3. The user provides this PIN to the application. The application uses the PIN to request an "Access Token" and "Access Secret" unique to the user.
4. Using the "Access Token" and "Access Secret", the application authenticates the user on Twitter and issues API calls on behalf of the user.
5. The "Access Token" and "Access Secret" for a user do not change and can be cached by the application for future requests. Thus, this process only needs to be performed once, and it can be easily accomplished using the method Get User Access Key Secret.
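The handshake above can be scripted directly in R. Below is a minimal sketch using the ROAuth package (installed in the next section); the consumer key and secret are placeholders for the values issued when the application is registered, and the endpoint URLs are Twitter's standard OAuth endpoints.

# Minimal sketch of the PIN-based OAuth handshake with ROAuth.
# The consumer key/secret below are placeholders for the values issued
# when the application is registered with Twitter.
library(ROAuth)

credentials <- OAuthFactory$new(
  consumerKey    = "YOUR_CONSUMER_KEY",
  consumerSecret = "YOUR_CONSUMER_SECRET",
  requestURL     = "https://api.twitter.com/oauth/request_token",
  accessURL      = "https://api.twitter.com/oauth/access_token",
  authURL        = "https://api.twitter.com/oauth/authorize"
)

# Opens the authorization URL; the user authenticates on Twitter,
# receives a PIN and types it back into the R console.
credentials$handshake()

# The access token and secret can be cached for future sessions.
save(credentials, file = "twitter_oauth.RData")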

A verified Twitter account formally validates the identity of the person or company that owns the account—the aim of the "verified" status is to prove that a real-world person or company is not being impersonated, through the placement of a small blue checkmark by the top-right corner of a user's page, or next to the username in the platform's Search function. Twitter is responsible for assigning the blue checkmark, and it is frequently applied to the accounts of notable people in politics, music, movies, business, fashion, government, sports, media, and journalism. The owners of verified accounts can also access additional features that are not available to standard Twitter-account holders. These features include the following: 1. The ability to choose how their notifications and mentions are presented. Since verified accounts typically receive a lot of followers, account holders can filter these notices based on whether or not they are from verified accounts.

2. The ability to view information about their followers and their involvement on Twitter.
3. The ability to receive direct messages from all followers or only selected followers.
Separately, in a breach of Twitter's rules, some users placed the verified checkmark in their background image; Twitter confirmed that such conduct is invalid. Following a design update of the Twitter platform, it is more difficult for users to impersonate a verified account because of the layout. A limitation of the verified status is that if the account is hacked, the person or company can still be impersonated for a limited time, until control is regained over the account by the legitimate owners – as happened, for example, with Tesla Motors' Twitter account briefly in 2015.

Steps to start your application
1. Log in to your Twitter settings account
2. Click on Apps
3. Choose to create a Twitter application
4. Enter your personal details
5. Copy your authentication keys
6. Open RStudio

Step 2: Installation of R packages for the analysis 1. Install the following packages as shown in the figure below.

2. twitteR: Provides an interface to the Twitter web API.
3. mice: The R package mice imputes incomplete multivariate data by chained equations.
4. ROAuth: This package provides an interface to the OAuth 1.0 specification, allowing users to authenticate via OAuth to the server of their choice.
5. Rcpp: This package provides C++ classes that greatly facilitate interfacing C or C++ code in R packages using the .Call interface provided by R.
6. plyr: This package is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together.
7. SnowballC: A small string processing language designed for creating stemming algorithms for use in information retrieval.
8. stringr: A set of simple wrappers that make R's string functions more consistent, simpler and easier to use. It does this by ensuring that function and argument names (and positions) are consistent, all functions deal with NAs and zero-length characters appropriately, and the output data structure of each function matches the input data structures of other functions.
9. lattice: A powerful and elegant high-level data visualization system with an emphasis on multivariate data. It is designed to meet most typical graphics needs with minimal tuning, but can also be easily extended to handle most nonstandard requirements.
10. ggplot2: An implementation of the grammar of graphics in R. It combines the advantages of both base and lattice graphics: conditioning and shared axes are handled automatically, and you can still build up a plot step by step from multiple data sources.
11. xtable: Functions returning and displaying, or writing to disk, the LaTeX or HTML code associated with the supplied object of class xtable.
12. RColorBrewer: The package provides palettes for drawing nice maps shaded according to a variable.
13. topicmodels: Provides an interface for fitting topic models such as latent Dirichlet allocation (LDA) and the correlated topics model (CTM).
14. tm: A framework for text mining applications within R.
15. pamr: Prediction analysis for microarrays.
16. wordcloud: This package helps in creating pretty-looking word clouds in text mining.
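A short sketch of how the packages listed above can be installed and loaded, and how the Big Data tweets used in the rest of this report could be retrieved with twitteR; the keys and the search string "#bigdata" are placeholders and would need to be replaced with your own values.

# Install the packages used in this report (needed only once).
install.packages(c("twitteR", "mice", "ROAuth", "Rcpp", "plyr", "SnowballC",
                   "stringr", "lattice", "ggplot2", "xtable", "RColorBrewer",
                   "topicmodels", "tm", "pamr", "wordcloud"))

library(twitteR)

# Authenticate with the keys copied from the Twitter application page
# (placeholder values shown here).
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")

# Retrieve tweets about Big Data and convert them to a data frame.
tweets    <- searchTwitter("#bigdata", n = 10000, lang = "en")
tweets.df <- twListToDF(tweets)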

CHAPTER 2: INTRODUCTION

BIG DATA CONCEPT
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Big data can be characterized by the 3Vs: the extreme volume of data, the wide variety of types of data and the velocity at which the data must be processed. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data, much of which cannot be integrated easily. Because big data takes too much time and costs too much money to load into a traditional relational database for analysis, new approaches to storing and analyzing data have emerged that rely less on data schema and data quality. Instead, raw data with extended metadata is aggregated in a data lake, and machine learning and artificial intelligence (AI) programs use complex algorithms to look for repeatable patterns.

Big data analytics is often associated with cloud computing because the analysis of large data sets in real-time requires a platform like Hadoop to store large data sets across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources. Although the demand for big data analytics is high, there is currently a shortage of data scientists and other analysts who have experience working with big data in a distributed, open source environment. In the enterprise, vendors have responded to this shortage by creating Hadoop appliances to help companies take advantage of the semi-structured and unstructured data they own. Big data can be contrasted with small data, another evolving term that's often used to describe data whose volume and format can be easily used for self-service analytics. A commonly quoted axiom is that "big data is for machines; small data is for people."

CHARACTERISTICS OF BIG DATA

Big data can be described by the following characteristics:
1. Volume – The quantity of data that is generated is very important in this context. It is the size of the data which determines the value and potential of the data under consideration, and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence the characteristic.
2. Variety – The next aspect of Big Data is its variety. This means that the category to which Big Data belongs is also a very essential fact that needs to be known by the data analysts. This helps the people who closely analyze the data, and are associated with it, to use the data effectively to their advantage, thus upholding the importance of the Big Data.
3. Velocity – The term 'velocity' in this context refers to the speed of generation of data, or how fast the data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development.
4. Variability – This is a factor which can be a problem for those who analyze the data. It refers to the inconsistency which can be shown by the data at times, thus hampering the process of handling and managing the data effectively.
5. Veracity – The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.
6. Complexity – Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information that is supposed to be conveyed by them. This situation is therefore termed the 'complexity' of Big Data.
Factory work and cyber-physical systems may have a 6C system:

Connection (sensor and networks), Cloud (computing and data on demand), Cyber (model and memory), content/context (meaning and correlation), community (sharing and collaboration), and customization (personalization and value).

In this scenario and in order to provide useful insight to the factory management and gain correct content, data has to be processed with advanced tools (analytics and algorithms) to generate meaningful information. Considering the presence of visible and invisible issues in an industrial factory, the information generation algorithm has to be capable of detecting and addressing invisible issues such as machine degradation, component wear, etc. in the factory floor.

CHAPTER 3: SOURCES OF BIG DATA
There are various sources of Big Data; some of those listed in this paper include:
Data.gov
The US Government pledged last year to make all government data available freely online. This site is the first stage and acts as a portal to all sorts of amazing information on everything from climate to crime.
US Census Bureau
The US Census Bureau collects a wealth of information on the lives of US citizens, covering population data, geographic data and education.
European Union Open Data Portal
The EU collects data from European Union institutions.
Data.gov.uk
This includes data from the UK Government, including the British National Bibliography – metadata on all UK books and publications since 1950.
The CIA World Factbook
The CIA has information on the history, population, economy, government, infrastructure and military of 267 countries.
Healthdata.gov
125 years of US healthcare data, including claim-level Medicare data, epidemiology and population statistics.
NHS Health and Social Care Information Centre
Health data sets from the UK National Health Service.
Amazon Web Services public datasets
A huge resource of public data, including the 1000 Genome Project, an attempt to build the most comprehensive database of human genetic information, and NASA's database of satellite imagery of Earth.
Facebook Graph
Although much of the information on users' Facebook profiles is private, a lot isn't; Facebook provides the Graph API as a way of querying the huge amount of information that its users are happy to share with the world (or can't hide because they haven't worked out how the privacy settings work).

Gapminder
Compilation of data from sources including the World Health Organization and World Bank, covering economic, medical and social statistics from around the world.
Google Trends
Statistics on search volume (as a proportion of total search) for any given term, since 2004.
Google Finance
Google has 40 years' worth of stock market data, updated in real time.
Google Books Ngrams
Search and analyze the full text of any of the millions of books digitized as part of the Google Books project.
National Climatic Data Center
Huge collection of environmental, meteorological and climate data sets from the US National Climatic Data Center. The world's largest archive of weather data.
DBPedia
Wikipedia is comprised of millions of pieces of data, structured and unstructured, on every subject under the sun. DBPedia is an ambitious project to catalogue and create a public, freely distributable database allowing anyone to analyze this data.
Topsy
Free, comprehensive social media data is hard to come by – after all, their data is what generates profits for the big players (Facebook, Twitter etc.), so they don't want to give it away. However, Topsy provides a searchable database of public tweets going back to 2006, as well as several tools to analyze the conversations.
Like-button
Mines Facebook's public data – globally and from your own network – to give an overview of what people "Like" at the moment.
New York Times
Searchable, indexed archive of news articles going back to 1851.
Freebase
A community-compiled database of structured data about people, places and things, with over 45 million entries.

Million Song Data Set
Metadata on over a million songs and pieces of music. Part of Amazon Web Services.
YouTube
More than 2.9 billion hours of video are watched by users every month on YouTube.
Experiments
The CERN atomic facility generates about 40 TB of data per second.
Twitter
Twitter generates 12 TB of data daily. More than 230 million tweets are generated by users, at roughly 97,000 tweets per second at peak.
Radio Frequency Identification (RFID)
In 2005 there were about 1.5 million RFID tags; by 2012 more than 30 billion RFID tags had been generated and used by Walmart. These RFID tags are used to track items, for example UPS parcels.
Emails
300 billion emails are sent every day.
Internet Users
There are more than 2 billion people using the internet. By 2014 Cisco estimated that internet traffic exceeded 4.8 ZB.
Airbus
Airbus generates 10 TB of data every 30 minutes; about 640 TB is generated in one flight.
Trading
The NYSE produces 1 TB of data per day.

CHAPTER 4: BIG DATA ANALYTICAL TOOLS
TOOLS FOR BIG DATA ANALYSIS
Analyzing Big Data can be very cumbersome and challenging. There is no single piece of software that can be used for the analysis; different enterprises use different tools for Big Data analysis. The tool to use depends on the type of data one needs to analyze. The choice of tools can also affect the quality of your data, which can have a significant impact on your analysis. The list below covers tools that can be used to analyze both structured and unstructured data (Big Data). Some tools are open source while others are commercial and very expensive.

Open Source Big Data Analysis Platforms and Tools
1. Hadoop
You simply can't talk about big data without mentioning Hadoop. The Apache distributed data processing software is so pervasive that often the terms "Hadoop" and "big data" are used synonymously. The Apache Foundation also sponsors a number of related projects that extend the capabilities of Hadoop, and many of them are mentioned below. In addition, numerous vendors offer supported versions of Hadoop and related technologies. Operating System: Windows, Linux, OS X.
2. MapReduce
Originally developed by Google, the MapReduce website describes it as "a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes." It's used by Hadoop, as well as many other data processing applications. Operating System: OS Independent.

3. GridGain
GridGain offers an alternative to Hadoop's MapReduce that is compatible with the Hadoop Distributed File System. It offers in-memory processing for fast analysis of real-time data. You can download the open source version from GitHub or purchase a commercially supported version from the link above. Operating System: Windows, Linux, OS X.
4. HPCC
Developed by LexisNexis Risk Solutions, HPCC is short for "high performance computing cluster." It claims to offer superior performance to Hadoop. Both free community versions and paid enterprise versions are available. Operating System: Linux.
5. Storm
Now owned by Twitter, Storm offers distributed real-time computation capabilities and is often described as the "Hadoop of real-time." It's highly scalable, robust, fault-tolerant and works with nearly all programming languages. Operating System: Linux.
Databases/Data Warehouses
6. Cassandra
Originally developed by Facebook, this NoSQL database is now managed by the Apache Foundation. It's used by many organizations with large, active datasets, including Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco and Digg. Commercial support and services are available through third-party vendors. Operating System: OS Independent.
7. HBase
Another Apache project, HBase is the non-relational data store for Hadoop. Features include linear and modular scalability, strictly consistent reads and writes, automatic failover support and much more. Operating System: OS Independent.
8. MongoDB
MongoDB was designed to support humongous databases. It's a NoSQL database with document-oriented storage, full index support, replication and high availability, and more. Commercial support is available through 10gen. Operating system: Windows, Linux, OS X, Solaris.
9. Neo4j
The "world's leading graph database," Neo4j boasts performance improvements of up to 1000x or more versus relational databases. Interested organizations can purchase advanced or enterprise versions from Neo Technology. Operating System: Windows, Linux.
10. CouchDB
Designed for the Web, CouchDB stores data in JSON documents that you can access via the Web or query using JavaScript. It offers distributed scaling with fault-tolerant storage. Operating system: Windows, Linux, OS X, Android.

11. OrientDB
This NoSQL database can store up to 150,000 documents per second and can load graphs in just milliseconds. It combines the flexibility of document databases with the power of graph databases, while supporting features such as ACID transactions, fast indexes, native and SQL queries, and JSON import and export. Operating system: OS Independent.
12. Terrastore
Based on Terracotta, Terrastore boasts "advanced scalability and elasticity features without sacrificing consistency." It supports custom data partitioning, event processing, push-down predicates, range queries, map/reduce querying and processing and server-side update functions. Operating System: OS Independent.
13. FlockDB
Best known as Twitter's database, FlockDB was designed to store social graphs (i.e., who is following whom and who is blocking whom). It offers horizontal scaling and very fast reads and writes. Operating System: OS Independent.
14. Hibari
Used by many telecom companies, Hibari is a key-value, big data store with strong consistency, high availability and fast performance. Support is available through Gemini Mobile. Operating System: OS Independent.
15. Riak
Riak humbly claims to be "the most powerful open-source, distributed database you'll ever put into production." Users include Comcast, Yammer, Voxer, Boeing, SEOMoz, Joyent, Kiip.me, DotCloud, Formspring, the Danish Government and many others. Operating System: Linux, OS X.
16. Hypertable
This NoSQL database offers efficiency and fast performance that result in cost savings versus similar databases. The code is 100 percent open source, but paid support is available. Operating System: Linux, OS X.
17. BigData
This distributed database can run on a single system or scale to hundreds or thousands of machines. Features include dynamic sharding, high performance, high concurrency, high availability and more. Commercial support is available. Operating System: OS Independent.
18. Hive
Hadoop's data warehouse, Hive promises easy data summarization, ad-hoc queries and other analysis of big data. For queries, it uses a SQL-like language known as HiveQL. Operating System: OS Independent.
19. InfoBright Community Edition

This scalable data warehouse supports data stores up to 50TB and offers "market-leading" data compression of up to 40:1 for improved performance. Commercial products based on the same technology can be found at InfoBright.com. Operating System: Windows, Linux.
20. Infinispan
Infinispan from JBoss describes itself as an "extremely scalable, highly available data grid platform." Java-based, it was designed for multi-core architecture and provides distributed cache capabilities. Operating System: OS Independent.
21. Redis
Sponsored by VMware, Redis offers an in-memory key-value store that can be saved to disk for persistence. It supports many of the most popular programming languages. Operating System: Linux.
Business Intelligence
22. Talend
Talend makes a number of different business intelligence and data warehouse products, including Talend Open Studio for Big Data, which is a set of data integration tools that support Hadoop, HDFS, Hive, HBase and Pig. The company also sells an enterprise edition and other commercial products and services. Operating System: Windows, Linux, OS X.
23. Jaspersoft
Jaspersoft boasts that it makes "the most flexible, cost effective and widely deployed business intelligence software in the world." The link above primarily discusses the commercial versions of its applications, but you can find the open source versions, including the Big Data Reporting Tool, at JasperForge.org. Operating System: OS Independent.
24. Palo BI Suite/Jedox
The open source Palo Suite includes an OLAP Server, Palo Web, Palo ETL Server and Palo for Excel. Jedox offers commercial software based on the same tools. Operating System: OS Independent.
25. Pentaho
Used by more than 10,000 companies, Pentaho offers business and big data analytics tools with data mining, reporting and dashboard capabilities. See the Pentaho Community Wiki for easy access to the open source downloads. Operating System: Windows, Linux, OS X.
26. SpagoBI
SpagoBI claims to be "the only entirely open source business intelligence suite." Commercial support, training and services are available. Operating System: OS Independent.
27. KNIME
The Konstanz Information Miner, or KNIME, offers user-friendly data integration, processing, analysis, and exploration. In 2010, Gartner named KNIME a "Cool Vendor" in analytics, business

intelligence, and performance management. In addition to the open source desktop version, several commercial versions are also available. Operating System: Windows, Linux, OS X.
28. BIRT/Actuate
Short for "Business Intelligence and Reporting Tools," BIRT is an Eclipse-based tool that adds reporting features to Java applications. Actuate is a company that co-founded BIRT and offers a variety of software based on the open source technology. Operating System: OS Independent.
Other BI tools include QlikView, QlikSense and MicroStrategy.
Data Mining
29. RapidMiner/RapidAnalytics
RapidMiner claims to be "the world-leading open-source system for data and text mining." RapidAnalytics is a server version of that product. In addition to the open source versions of each, enterprise versions and paid support are also available from the same site. Operating System: OS Independent.
30. Mahout
This Apache project offers algorithms for clustering, classification and batch-based collaborative filtering that run on top of Hadoop. The project's goal is to build scalable machine learning libraries. Operating System: OS Independent.
31. Orange
This project hopes to make data mining "fruitful and fun" for both novices and experts. It offers a wide variety of visualizations, plus a toolbox of more than 100 widgets. Operating System: Windows, Linux, OS X.
32. Weka
Short for "Waikato Environment for Knowledge Analysis," Weka offers a set of algorithms for data mining that you can apply directly to data or use in another Java application. It's part of a larger machine learning project, and it's also sponsored by Pentaho. Operating System: Windows, Linux, OS X.
33. JHepWork
Also known as "jWork," this Java-based project provides scientists, engineers and students with an interactive environment for scientific computation, data analysis and data visualization. It's frequently used in data mining, as well as for mathematics and statistical analysis. Operating System: OS Independent.
34. KEEL

KEEL stands for "Knowledge Extraction based on Evolutionary Learning," and it aims to help users assess evolutionary algorithms for data mining problems like regression, classification, clustering and pattern mining. It includes a large collection of existing algorithms against which new algorithms can be compared. Operating System: OS Independent.
35. SPMF
Another Java-based data mining framework, SPMF originally focused on sequential pattern mining, but now also includes tools for association rule mining, sequential rule mining and frequent itemset mining. Currently, it includes 46 different algorithms. Operating System: OS Independent.
36. Rattle
Rattle, the "R Analytical Tool to Learn Easily," makes it easier for non-programmers to use the R language by providing a graphical interface for data mining. It can create data summaries (both visual and statistical), build models, draw graphs, score datasets and more. Operating System: Windows, Linux, OS X.
File System
37. Gluster
Sponsored by Red Hat, Gluster offers unified file and object storage for very large datasets. Because it can scale to 72 brontobytes, it can be used to extend the capabilities of Hadoop beyond the limitations of HDFS (see below). Operating System: Linux.
38. Hadoop Distributed File System
Also known as HDFS, this is the primary storage system for Hadoop. It quickly replicates data onto several nodes in a cluster in order to provide reliable, fast performance. Operating System: Windows, Linux, OS X.
Programming Languages
39. Pig/Pig Latin
Another Apache Big Data project, Pig is a data analysis platform that uses a textual language called Pig Latin and produces sequences of Map-Reduce programs. It makes it easier to write, understand and maintain programs which conduct data analysis tasks in parallel. Operating System: OS Independent.
40. R
R is a programming language and an environment for statistical computing and graphics, similar to the S language developed at Bell Laboratories. The environment includes a set of tools that make it easier to manipulate data, perform calculations and generate charts and graphs. Operating System: Windows, Linux, OS X.

41. ECL
ECL ("Enterprise Control Language") is the language for working with HPCC. A complete set of tools, including an IDE and a debugger, is included in HPCC, and documentation is available on the HPCC site. Operating System: Linux.
Big Data Search
42. Lucene
The self-proclaimed "de facto standard for search libraries," Lucene offers very fast indexing and searching for very large datasets. In fact, it can index over 95GB/hour when using modern hardware. Operating System: OS Independent.
43. Solr
Solr is an enterprise search platform based on the Lucene tools. It powers the search capabilities for many large sites, including Netflix, AOL, CNET and Zappos. Operating System: OS Independent.
Data Aggregation and Transfer
44. Sqoop
Sqoop transfers data between Hadoop and RDBMSes and data warehouses. It is now a top-level Apache project. Operating System: OS Independent.
45. Flume
Another Apache project, Flume collects, aggregates and transfers log data from applications to HDFS. It's Java-based, robust and fault-tolerant. Operating System: Windows, Linux, OS X.
46. Chukwa
Built on top of HDFS and MapReduce, Chukwa collects data from large distributed systems. It also includes tools for displaying and analyzing the data it collects. Operating System: Linux, OS X.
Miscellaneous Big Data Tools
47. Terracotta
Terracotta's "Big Memory" technology allows enterprise applications to store and manage big data in server memory, dramatically speeding performance. The company offers both open source and commercial versions of its Terracotta platform, BigMemory, Ehcache and Quartz software. Operating System: OS Independent.
48. Avro
Apache Avro is a data serialization system based on JSON-defined schemas. APIs are available for Java, C, C++ and C#. Operating System: OS Independent.

49. Oozie
This Apache project is designed to coordinate the scheduling of Hadoop jobs. It can trigger jobs at a scheduled time or based on data availability. Operating System: Linux, OS X.
50. Zookeeper
Formerly a Hadoop sub-project, Zookeeper is "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." APIs are available for Java and C, with Python, Perl, and REST interfaces planned. Operating System: Linux, Windows (development only), OS X (development only).

REFERENCES
http://www.datamation.com/data-center/50-top-open-source-tools-for-big-data-3.html
http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
http://www.datasciencecentral.com/profiles/blogs/the-free-big-data-sources-everyone-should-know
http://www.informationweek.com/big-data/big-data-analytics/16-top-big-data-analytics-platforms/d/d-id/1113609?image_number=1

CHAPTER 5: TWITTER BIG DATA STATISTICAL ANALYSIS AND VISUALIZATION MOST BIG DATA TWEETS

The plot above shows the dates on which the tweets were created; the longest bar corresponds to the 8th of December 2014.
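As a hedged illustration, a plot of this kind could be produced from the tweets.df data frame built in Chapter 1; created is the timestamp column returned by twListToDF.

library(ggplot2)

# Number of Big Data tweets per creation date.
tweets.df$date <- as.Date(tweets.df$created)

ggplot(tweets.df, aes(x = date)) +
  geom_bar() +
  labs(x = "Date tweet was created", y = "Number of tweets")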

SCREEN_NAME

The plot above shows the screen names of people who tweeted about Big Data and the number of tweets they made about the concept of Big Data. The longest bar shows the person who tweeted most about Big Data.

HASHTAGS

BIG DATA TWEETS AND RETWEETS
In order to get the number of tweets and retweets about Big Data per account, the following steps were undertaken (a condensed code sketch follows the list):

- determine the frequency of tweets per account
- create an ordered data frame for further manipulation and plotting
- create a subset of those who tweeted at least 5 times or more
- extract counts of how many tweets from each account were retweeted
- clean the Twitter messages by removing odd characters
- remove the @ symbol from user names
- pull out who the message is to
- extract who has been retweeted
- replace names with a corresponding anonymizing number
- make a table with anonymized IDs and number of RTs for each account
- subset those people RT'd at least twice
- create a data frame for plotting
- combine tweet and retweet counts into one data frame
- create a Cleveland dot plot of tweet counts and retweet counts per Twitter account
- solid data point = number of tweets, letter R = number of retweets
- install the packages "grid" and "ggplot2"
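A condensed sketch of these steps, assuming the tweets.df data frame from Chapter 1; extracting retweet targets with the "RT @user" pattern is one simple heuristic, and the object names are illustrative only.

library(stringr)
library(ggplot2)

# Frequency of tweets per account, keeping accounts with at least 5 tweets.
counts <- sort(table(tweets.df$screenName), decreasing = TRUE)
tw.df  <- data.frame(user = names(counts), tweets = as.numeric(counts))
tw.df  <- subset(tw.df, tweets >= 5)

# Extract who has been retweeted using the "RT @user" pattern.
rt.user <- str_match(tweets.df$text, "RT @([A-Za-z0-9_]+)")[, 2]
rt.tab  <- table(na.omit(rt.user))
tw.df$retweets <- as.numeric(rt.tab[as.character(tw.df$user)])
tw.df$retweets[is.na(tw.df$retweets)] <- 0

# Cleveland dot plot: solid point = number of tweets, letter R = retweets.
ggplot(tw.df, aes(x = tweets, y = reorder(user, tweets))) +
  geom_point() +
  geom_text(aes(x = retweets, label = "R")) +
  labs(x = "Count", y = "Twitter account")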

STATISTICAL ANALYSIS OF BIG DATA TWEETS
To estimate the ratio of retweets to tweets for some more insight about the data, the following process was taken (see the sketch after the list):

- make a table with counts of tweets per person
- make a table with counts of retweets per person
- combine tweet count and retweet count per person
- create a new column with the retweet/tweet ratio
- sort it to put names in order by ratio
- exclude those with 2 tweets or less
- drop unused levels left over from sub-setting
- plot nicely ordered counts of tweets by person for people with more than 5 tweets
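A small sketch of the ratio calculation, reusing the tw.df data frame from the previous sketch:

# Ratio of retweets received to tweets sent per account, excluding
# accounts with two tweets or fewer.
ratio.df <- subset(tw.df, tweets > 2)
ratio.df$ratio <- ratio.df$retweets / ratio.df$tweets

# Order accounts by their retweet/tweet ratio and inspect the top of the list.
ratio.df <- ratio.df[order(ratio.df$ratio, decreasing = TRUE), ]
head(ratio.df)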

SUMMARY TABLE OF BIG DATA TWEETS A. CHARACTER PER TWEET

B. HASHTAG PER TWEET

C. @MENTIONS PER TWEET

D. @MENTIONS PER TWEET

GRAPH OF TWEETS OF BIG DATA

The graph shows a positive relationship: there is a strong association between the number of words tweeted and the number of unique words used in the tweet. From the graph, we estimated that the correlation between the words and unique words was 0.986866. Pearson's correlation is a parametric measure of the linear association between two variables; Spearman's correlation is a non-parametric measure of the monotonic association between two numeric variables; Kendall's rank correlation is another non-parametric measure of association based on concordance and discordance of x–y pairs.

LINEAR MODEL OF OUR BIG DATA
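A minimal sketch of such a linear model, assuming a per-tweet summary data frame word.stats with columns words (total words) and uniq (unique words); these names are placeholders for the summary table built earlier.

# Linear model of unique words as a function of total words per tweet.
fit <- lm(uniq ~ words, data = word.stats)
summary(fit)

# Scatter plot with the fitted regression line overlaid.
plot(word.stats$words, word.stats$uniq,
     xlab = "Words per tweet", ylab = "Unique words per tweet")
abline(fit)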

HYPOTHESIS TESTING OF BIG DATA TWEETS – A. WORDS vs UNIQS

From the table above we observe that the t-statistic is 611.3358 with df = 9998 degrees of freedom. The p-value is less than 2.2e-16. The 95% and 99% confidence intervals for the correlation are also given in the table. The alternative hypothesis – that the true correlation is greater than zero – is therefore supported.
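The figures quoted above are the typical output of R's cor.test; a sketch of the call under the same word.stats assumption, together with the non-parametric alternatives mentioned earlier:

# Pearson correlation test between total and unique words per tweet.
cor.test(word.stats$words, word.stats$uniq, method = "pearson")

# 99% confidence interval for the same correlation.
cor.test(word.stats$words, word.stats$uniq, conf.level = 0.99)

# Non-parametric alternatives.
cor(word.stats$words, word.stats$uniq, method = "spearman")
cor(word.stats$words, word.stats$uniq, method = "kendall")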

B. WORDS vs LENGTHS OF WORDS

The plot shows a negative relationship between the length of words per tweet and the number of words. The table below shows the correlation and the hypothesis testing of the two variables.

C. WORDS vs HASHTAGS

The graph shows that there is no strong relationship between hashtags and the words in the tweet. This can be seen statistically in the table of statistical analysis and hypothesis testing of the variables.

DENSITY PLOTS OF THE DISTRIBUTION OF BIG DATA TWEETS
The density of a continuous random variable is a function that describes the relative likelihood for this random variable to take on a given value. The probability of the random variable falling within a particular range of values is given by the integral of this variable's density over that range; that is, it is given by the area under the density function but above the horizontal axis and between the lowest and greatest values of the range. The probability density function is non-negative everywhere, and its integral over the entire space is equal to one. A probability density function is most commonly associated with absolutely continuous univariate distributions. A random variable X has density f_X, where f_X is a non-negative Lebesgue-integrable function, if:

P[a ≤ X ≤ b] = ∫_a^b f_X(x) dx

Hence, if F_X is the cumulative distribution function of X, then:

F_X(x) = ∫_{-∞}^x f_X(u) du

and (if f_X is continuous at x)

f_X(x) = d F_X(x) / dx

Intuitively, one can think of f_X(x) dx as being the probability of X falling within the infinitesimal interval [x, x + dx].

Density of Characters per tweet
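A brief sketch of how a density plot of this kind could be drawn from the tweet text, using base R's kernel density estimator; chars.per.tweet is an illustrative name.

# Kernel density estimate of the number of characters per tweet.
chars.per.tweet <- nchar(tweets.df$text)

plot(density(chars.per.tweet),
     main = "Density of characters per tweet",
     xlab = "Characters per tweet")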

Density of @mentions per tweet

Density of Links

Density of Word Length

Density of Hashtags

APPLICATIONS OF DENSITY DISTRIBUTION The concept of the probability distribution and the random variables which they describe underlies the mathematical discipline of probability theory, and the science of statistics. There is spread or variability in almost any value that can be measured in a population (e.g. height of people, durability of a metal, sales growth, traffic flow, etc.); almost all measurements are made with some intrinsic error; in physics many processes are described probabilistically, from the kinetic properties of gases to the quantum mechanical description of fundamental particles. For these and many other reasons, simple numbers are often inadequate for describing a quantity, while probability distributions are often more appropriate. As a more specific example of an application, the cache language models and other statistical language models used in natural language processing to assign probabilities to the occurrence of particular words and word sequences do so by means of probability distributions.

SOCIAL NETWORK ANALYSIS PLOTTING OF THE BIG DATA TWEETS
Social network analysis (SNA) is a strategy for investigating social structures through the use of network and graph theories. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, friendship and acquaintance networks, kinship, disease transmission, and sexual relationships. These networks are often visualized through sociograms in which nodes are represented as points and ties are represented as lines. Everything is connected – people, information, events and places – all the more so with the advent of online social media. A practical way of making sense of the tangle of connections is to analyze them as networks. In a social network, we try to identify the important nodes in the network, to detect communities, to trace information diffusion and to follow opinion formation. To analyze the SNA in this project, we need the following packages:

- the igraph library
- the sna library

These graphs show the pairs of each tweet and retweet.
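A rough sketch of how such a retweet network can be built and plotted with igraph, reusing the tweets.df data frame and the "RT @user" heuristic from earlier; all object names are placeholders.

library(igraph)
library(stringr)

# Edge list of (retweeter, original author) pairs.
rt.author <- str_match(tweets.df$text, "RT @([A-Za-z0-9_]+)")[, 2]
edges     <- na.omit(data.frame(from = tweets.df$screenName, to = rt.author))

# Directed retweet graph, plotted as a sociogram.
g <- graph.data.frame(edges, directed = TRUE)
plot(g, vertex.size = 3, vertex.label = NA, edge.arrow.size = 0.2)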

APPLICATION OF SOCIAL NETWORK ANALYSIS
Social network analysis is used widely in the social and behavioral sciences, as well as in economics, marketing, and industrial engineering. The social network perspective focuses on relationships among social entities and is an important addition to standard social and behavioral research, which is primarily concerned with attributes of the social units. Social Network Analysis: Methods and Applications reviews and discusses methods for the analysis of social networks with a focus on applications of these methods to many substantive examples. The network can also be used to measure social capital – the value that an individual gets from the social network. These concepts are often displayed in a social network diagram, where nodes are the points and ties are the lines.

ATTRIBUTES OF SOCIAL NETWORKS
1. Reciprocity
Reciprocity is a quantity that specifically characterizes directed networks. Link reciprocity measures the tendency of vertex pairs to form mutual connections between each other. The measure of reciprocity defines the proportion of mutual connections in a directed graph. It is most commonly defined as the probability that the opposite counterpart of a directed edge is also included in the graph, or, in adjacency matrix notation: sum_ij (A.*A')_ij / sum_ij A_ij, where A.*A' is the element-wise product of matrix A and its transpose. This measure is calculated when the mode argument is left at its default.

2. Transitivity
The transitivity T of a graph is based on the relative number of triangles in the graph, compared to the total number of connected triples of nodes:

T = 3 × (number of triangles) / (number of connected triples of nodes)

The factor of three in the number accounts for the fact that each triangle contributes to three different connected triples in the graph, one centered at each node of the triangle. With this definition, 0≤T≤1, and T=1 if the network contains all possible edges. The transitivity of a graph is closely related to the clustering coefficient of a graph, as both measure the relative frequency of triangles.

3. Centralization Centralization is a method for creating a graph level centralization measure from the centrality scores of the vertices.

Centralization is a general method for calculating a graph-level centrality score based on a node-level centrality measure. The formula for this is C(G) = sum_v [ max_w c(w) - c(v) ], where c(v) is the centrality of vertex v. The graph-level centrality score can be normalized by dividing by the maximum theoretical score for a graph with the same number of vertices, using the same parameters, e.g. directedness, whether we consider loop edges, etc. For degree, closeness and betweenness the most centralized structure is some version of the star graph: in-star, out-star or undirected star. For eigenvector centrality the most centralized structure is the graph with a single edge (and potentially many isolates). The igraph function centralize.scores uses this general centralization formula to calculate a graph-level score from vertex-level scores.
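The three attributes above map directly onto igraph functions; a short sketch on the retweet graph g built in the previous section (degree centralization is shown as one of several node-level measures that could be centralized).

library(igraph)

# Proportion of mutual (reciprocated) edges in the directed graph.
reciprocity(g)

# Global transitivity: 3 * triangles / connected triples.
transitivity(g, type = "global")

# Graph-level degree centralization, normalized by the theoretical maximum.
centralization.degree(g)$centralization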

UNIVARIATE CONDITIONAL UNIFORM GRAPH TESTS
The conditional uniform graph test, also known as the CUG test (cugtest), tests an arbitrary GLI (computed on dat by FUN) against a conditional uniform graph null hypothesis, via Monte Carlo simulation. Some variation in the nature of the conditioning is available; currently, conditioning only on size, conditioning jointly on size and estimated tie probability (via expected density), and conditioning jointly on size and (bootstrapped) edge value distributions are implemented. Note that a fair amount of flexibility is possible regarding CUG tests on functions of GLIs (Anderson et al., 1999).
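A hedged sketch of a CUG test with the sna package, testing graph transitivity (gtrans) against a null model conditioned on size and edge count; the adjacency matrix net is derived here from the retweet graph used earlier.

library(sna)

# Convert the retweet graph to an adjacency matrix.
net <- as.matrix(get.adjacency(g))

# Monte Carlo CUG test of transitivity, conditioning on size and edges.
cug <- cug.test(net, gtrans, mode = "digraph", cmode = "edges")
cug
plot(cug)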

BUILDING A MODEL WITH OUR DATASET Topic models extend and build on classical methods in natural language processing such as the unigram model and the mixture of unigram models (Nigam, McCallum, Thrun, and Mitchell 2000) as well as Latent Semantic Analysis (LSA; Deerwester, Dumais, Furnas, Landauer, and Harshman 1990). Topic models differ from the unigram or the mixture of unigram models because they are mixed-membership models (see for example Airoldi, Blei, Fienberg, and Xing 2008). In the unigram model each word is assumed to be drawn from the same term distribution, in the mixture of unigram models a topic is drawn for each document and all words in a document are drawn from the term distribution of the topic. In mixed-membership models documents are not assumed to belong to single topics, but to simultaneously belong to several topics and the topic distributions vary over documents. In machine learning and natural language processing topic models are generative models which provide a probabilistic framework for the term frequency occurrences in documents in a given corpus. Using only the term frequencies assumes that the information in which order the words occur in a document is negligible. This assumption is also referred to as the exchangeability assumption for the words in a document and this assumption leads to bag-of-words models. The latent Dirichlet allocation (LDA; Blei, Ng, and Jordan 2003b) model is a Bayesian mixture model for discrete data where topics are assumed to be uncorrelated. The correlated topics model (CTM; Blei and Lafferty 2007) is an extension of the LDA model where correlations between topics are allowed. The R package topicmodels currently provides an interface to the code for fitting an LDA

model and a CTM with the VEM algorithm as implemented by Blei and co-authors and to the code for fitting an LDA topic model with Gibbs sampling written by Phan and co-authors. Package topicmodels builds on package tm (Feinerer, Hornik, and Meyer 2008; Feinerer 2011), which constitutes a framework for text mining applications within R. tm provides infrastructure for constructing a corpus, e.g., by reading in text data from PDF files, and transforming a corpus to a document-term matrix, which is the input data for topic models. To perform Big Data modeling the following procedure was taken (a code sketch follows the list):

- build a term-document matrix
- install the package "topicmodels"
- choose 30 topics
- find the 10 top terms of every topic
- identify the topic of each tweet
- create a data frame of those topics and the time
- build or plot the model
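A compact sketch of this procedure with tm and topicmodels, assuming the cleaned tweet text is in tweets.df$text; the choice of 30 topics and 10 terms follows the steps above, and tweets that end up with no terms may need to be dropped before fitting.

library(tm)
library(topicmodels)

# Build a corpus and a document-term matrix from the tweet text.
corpus <- Corpus(VectorSource(tweets.df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm    <- DocumentTermMatrix(corpus)

# Fit an LDA model with 30 topics and inspect the top 10 terms per topic.
lda.model <- LDA(dtm, k = 30)
terms(lda.model, 10)

# Most likely topic for every tweet (assumes no empty documents were dropped).
tweets.df$topic <- topics(lda.model)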

WORD CLOUD OF BIG DATA A tag cloud (word cloud or weighted list in visual design) is a visual representation for text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence. When used as website navigation aids, the terms are hyperlinked to items associated with the tag. There are three main types of tag cloud applications in social software, distinguished by their meaning rather than appearance. In the first type, there is a tag for the frequency of each item, whereas in the second type, there are global tag clouds where the frequencies are aggregated over all items and users. In the third type, the cloud contains categories, with size indicating number of subcategories.

A data cloud or cloud data is a data display which uses font size and/or color to indicate numerical values. It is similar to a tag cloud but instead of word count, displays data such as population or stock market prices. A text cloud or word cloud is a visualization of word frequency in a given text as a weighted list. The technique has recently been popularly used to visualize the topical content of political speeches.
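A brief sketch of producing such a word cloud with the wordcloud and RColorBrewer packages installed in Chapter 1, reusing the tweet corpus from the topic-model sketch above; the minimum frequency of 10 is an arbitrary choice.

library(tm)
library(wordcloud)
library(RColorBrewer)

# Term frequencies from a term-document matrix of the tweets.
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw the cloud: font size reflects term frequency.
wordcloud(words = names(freq), freq = freq, min.freq = 10,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))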

Tag clouds have been the subject of investigation in several usability studies. The following summary is based on an overview of research results given by Lohmann et al.:

- Tag size: Large tags attract more user attention than small tags (an effect influenced by further properties, e.g., number of characters, position, neighboring tags).
- Scanning: Users scan rather than read tag clouds.
- Centering: Tags in the middle of the cloud attract more user attention than tags near the borders (an effect influenced by layout).
- Position: The upper left quadrant receives more user attention than the others (Western reading habits).
- Exploration: Tag clouds provide suboptimal support when searching for specific tags (if these do not have a very large font size).

The term keyword cloud is sometimes used as a search engine marketing (SEM) term that refers to a group of keywords that are relevant to a specific website. In recent years tag clouds have gained popularity because of their role in search engine optimization of Web pages as well as supporting the user in navigating the content in an information system efficiently. Tag clouds as a navigational tool make the resources of a website more connected, when crawled by a search engine spider, which

may improve the site's search engine rank. From a user interface perspective they are often used to summarize search results to support the user in finding content in a particular information system more quickly.

CLUSTERING OF BIG DATA TWEETS

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

Clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, and these sets are often created by human (experts). Thus, the benchmark sets can be thought of as a gold standard for evaluation. These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only on synthetic data sets with a factual ground truth, since classes can

contain internal structure, the attributes present may not allow separation of clusters or the classes may contain anomalies. Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result. A number of measures are adapted from variants used to evaluate classification tasks. In place of counting the number of times a class was correctly assigned to a single data point (known as true positives), such pair counting metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster.
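As one illustration, the tweet terms can be clustered with k-means on the term-document matrix built earlier, in the spirit of the k-means tutorial cited in the references; the sparsity threshold and the choice of 8 clusters are arbitrary.

library(tm)

# Keep only reasonably frequent terms before clustering.
tdm.dense <- removeSparseTerms(tdm, sparse = 0.98)
m <- as.matrix(tdm.dense)

# k-means clustering of the terms into 8 groups.
set.seed(123)
km <- kmeans(m, centers = 8)

# Inspect which terms fall into each cluster.
split(rownames(m), km$cluster)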

Clustering of Tweets and Retweets

PRACTICAL APPLICATIONS OF CLUSTERS

Biology, computational biology and bioinformatics

Plant and animal ecology
Cluster analysis is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.

Transcriptomic

Clustering is used to build groups of genes with related expression patterns (also known as co-expressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.

Sequence analysis

Clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.

High-throughput genotyping platforms

Clustering algorithms are used to automatically assign genotypes.

Human genetic clustering

The similarity of genetic data is used in clustering to infer population structures.

Medicine

Medical imaging
On PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three-dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.

Analysis of antimicrobial activity
Cluster analysis can be used to analyze patterns of antibiotic resistance, to classify antimicrobial compounds according to their mechanism of action, and to classify antibiotics according to their antibacterial activity.

IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.

Business and marketing

Market research

Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers, and for use in market segmentation, product positioning, new product development and selecting test markets.

Grouping of shopping items

Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't have the concept of a SKU.)

World wide web

Social network analysis

In the study of social networks, clustering may be used to recognize communities within large groups of people.

Search result grouping
In the process of intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web-based clustering tools such as Clusty.

Slippy map optimization

Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This makes it both faster and reduces the amount of visual clutter.

Computer science


Software evolution

Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed. It is a form of restructuring and hence is a way of direct preventive maintenance.

Image segmentation
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.

Evolutionary algorithms
Clustering may be used to identify different niches within the population of an evolutionary algorithm so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies.

Recommender systems
Recommender systems are designed to recommend new items based on a user's tastes. They sometimes use clustering algorithms to predict a user's preferences based on the preferences of other users in the user's cluster.

Markov chain Monte Carlo methods
Clustering is often utilized to locate and characterize extrema in the target distribution.

Social science


Crime analysis

Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.

Educational data mining

Cluster analysis is for example used to identify groups of schools or students with similar properties.

Typologies
From poll data, projects such as those undertaken by the Pew Research Center use cluster analysis to discern typologies of opinions, habits, and demographics that may be useful in politics and marketing.

Clusters of screenname and tweets

Field robotics

Clustering algorithms are used for robotic situational awareness to track objects and detect outliers in sensor data.

Mathematical chemistry
To find structural similarity, etc.; for example, 3000 chemical compounds were clustered in the space of 90 topological indices.

Climatology
To find weather regimes or preferred sea level pressure atmospheric patterns.

Petroleum geology
Cluster analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.

Physical geography
The clustering of chemical properties in different sample locations.

SENTIMENT ANALYSIS OF BIG DATA TWEETS
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy." Early work in this area includes Turney and Pang, who applied different methods for detecting the polarity of product reviews and movie reviews respectively. This work is at the document level. One can also classify a document's polarity on a multi-way scale, which was attempted by Pang and Snyder (among others): Pang expanded the basic task of classifying a movie review as either positive or negative to predicting star ratings on either a 3- or a 4-star scale, while Snyder performed an in-depth analysis of restaurant reviews, predicting ratings for various aspects of the given restaurant, such as the food and atmosphere (on a five-star scale). Even though in most statistical classification methods the neutral class is ignored, under the assumption that neutral texts lie near the boundary of the binary classifier, several researchers suggest that, as in every polarity problem, three categories must be identified. Moreover, it has been shown that specific classifiers such as maximum entropy and SVMs can benefit from the introduction of a neutral class and improve the overall accuracy of the classification.

The accuracy of a sentiment analysis system is, in principle, how well it agrees with human judgments. This is usually measured by precision and recall. However, according to research, human raters typically agree only about 79% of the time (see inter-rater reliability). Thus, a program that is 70% accurate is doing nearly as well as humans, even though such accuracy may not sound impressive. If a program were "right" 100% of the time, humans would still disagree with it about 20% of the time, since they disagree that much about any answer. More sophisticated

measures can be applied, but evaluation of sentiment analysis systems remains a complex matter. For sentiment analysis tasks returning a scale rather than a binary judgement, correlation is a better measure than precision because it takes into account how close the predicted value is to the target value.
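As an illustration of lexicon-based polarity scoring, a commonly used sketch with plyr and stringr is shown below; pos.words and neg.words are placeholder vectors of positive and negative opinion words that would have to be supplied.

library(plyr)
library(stringr)

# Simple polarity score: (# positive words) - (# negative words) per tweet.
score.sentiment <- function(sentences, pos.words, neg.words) {
  laply(sentences, function(sentence) {
    clean <- tolower(gsub("[[:punct:]]", "", sentence))
    words <- unlist(str_split(clean, "\\s+"))
    sum(words %in% pos.words) - sum(words %in% neg.words)
  })
}

tweets.df$score <- score.sentiment(tweets.df$text, pos.words, neg.words)
table(sign(tweets.df$score))   # counts of negative / neutral / positive tweets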

SOURCES OF TWEETS

Dashboards

REFERENCES
1. Peter Turney (2002). "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews". Proceedings of the Association for Computational Linguistics, pp. 417–424. arXiv:cs.LG/0212032.
2. Bo Pang, Lillian Lee and Shivakumar Vaithyanathan (2002). "Thumbs up? Sentiment Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–86.
3. http://en.wikipedia.org/wiki/Cluster_analysis
4. http://en.wikipedia.org/wiki/Tag_cloud
5. Chang J (2010). lda: Collapsed Gibbs Sampling Methods for Topic Models. R package version 1.2.3, URL http://CRAN.R-project.org/package=lda.
6. Daumé III H (2008). HBC: Hierarchical Bayes Compiler. Pre-release version 0.7, URL http://www.cs.utah.edu/~hal/HBC/.
7. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990). "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science, 41(6), 391–407.
8. Dempster AP, Laird NM, Rubin DB (1977). "Maximum Likelihood from Incomplete Data Via the EM-Algorithm." Journal of the Royal Statistical Society B, 39, 1–38.
9. Feinerer I (2011). tm: Text Mining Package. R package version 0.5-5, URL http://CRAN.R-project.org/package=tm.
10. Feinerer I, Hornik K, Meyer D (2008). "Text Mining Infrastructure in R." Journal of Statistical Software, 25(5), 1–54. URL http://www.jstatsoft.org/v25/i05/.
11. Griffiths TL, Steyvers M (2004). "Finding Scientific Topics." Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.
12. Tutorial on Discovering Multiple Clustering Solutions: http://dme.rwth-aachen.de/en/DMCS
13. Time-Critical Decision Making for Business Administration: http://home.ubalt.edu/ntsbarsh/stat-data/Forecast.htm
14. A paper on open-source tools for data mining, published in 2008: http://eprints.fri.uni-lj.si/893/1/2008-OpenSourceDataMining.pdf
15. An overview of data mining tools: http://onlinelibrary.wiley.com/doi/10.1002/widm.24/pdf
16. Textbook on introduction to social network methods: http://www.faculty.ucr.edu/~hanneman/nettext/
17. Information Diffusion in Social Networks: Observing and Influencing Societal Interests, a tutorial at VLDB'11: http://www.cs.ucsb.edu/~cbudak/vldb_tutorial.pdf
18. Tools for Large Graph Mining: Structure and Diffusion, a tutorial at WWW 2008: http://cs.stanford.edu/people/jure/talks/www08tutorial/
19. Graph Mining: Laws, Generators and Tools: http://www.stanford.edu/group/mmds/slides2008/faloutsos.pdf
20. http://thinktostart.com/cluster-twitter-data-with-r-and-k-means/
21. http://www.rdatamining.com/