International Journal of Trend in Research and Development, Volume 2(5), ISSN 2394-9333 www.ijtrd.com
XML Object: Universal Data Structure for Big Data
Manisha Shinde-Pawar, Assistant Professor, Department of Management, Institute of Management and Rural Development Administration, Sangli, Bharati Vidyapeeth University, Pune, India.
Abstract: Choosing a data structure for big data, and reducing unstructured data to a proper structure that fits the available memory and speeds up data retrieval, is a most challenging task. To enhance search performance and derive knowledge for efficient decisions, tree-based data structures are the most useful. The approach here is to study the available data structures applied to big data, with the main aim of designing the XML object as a universal data structure for big data. Using the XML structure specification, unstructured text can be transformed into the well-formed text structure already implemented widely in software development and deployment. The powerful features of the XML language are decomposed accordingly into structured form: a regular grammar is converted to elements, and an additional leaf node for data provides a meaningful structured context to sentences.

Keywords: Big Data, Data Node, Data Structure, Decision, Regular Grammar, Unstructured Data, XML.
I. INTRODUCTION
How will a new smart algorithm help to solve the problem? It is about combining and analysing data structures so you can take the right action, at the right time, and at the right place.

A. What Is a Data Structure?

A data structure is a particular way of organizing data in a computer so that it can be used efficiently. Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, databases use B-tree indexes for retrieving small percentages of data, and compilers and databases use dynamic hash tables as lookup tables.
Data structures provide a means to manage large amounts of data efficiently, for uses such as large databases and internet indexing services. Usually, efficient data structures are key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. Storing and retrieving can be carried out on data held in both main memory and secondary memory.

Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by a pointer – a bit string representing a memory address that can itself be stored in memory and manipulated by the program. Thus the array and record data structures are based on computing the addresses of data items with arithmetic operations, while the linked data structures are based on storing addresses of data items within the structure itself. Many data structures use both principles, sometimes combined in non-trivial ways. There are numerous types of data structures, generally built upon simpler primitive data types (a short code sketch follows this list):

• An array is a number of elements in a specific order, typically all of the same type. Elements are accessed using an integer index to specify which element is required (although the elements may be of almost any type). Typical implementations allocate contiguous memory words for the elements of arrays (but this is not always a necessity). Arrays may be fixed-length or resizable.
• A record (also called a tuple or struct) is an aggregate data structure. A record is a value that contains other values, typically in fixed number and sequence and typically indexed by names. The elements of records are usually called fields or members.
• An associative array (also called a dictionary or map) is a more flexible variation on an array, in which name-value pairs can be added and deleted freely. A hash table is a common implementation of an associative array.
• A union type specifies which of a number of permitted primitive types may be stored in its instances, e.g. float or long integer.
Contrast this with a record, which could be defined to contain a float and an integer; in a union, there is only one value at a time, and enough space is allocated to contain the widest member datatype. A tagged union (also called a variant, variant record, discriminated union, or disjoint union) contains an additional field indicating its current type, for enhanced type safety.
• A set is an abstract data structure that can store specific values, in no particular order and with no duplicate values.
• Graphs and trees are linked abstract data structures composed of nodes. Each node contains a value and one or more pointers to other nodes. Trees arrange their nodes in a hierarchy; graphs can be used to represent networks, while variants of trees can be used for sorting and searching, with their nodes arranged in some relative order based on their values.
• An object contains data fields, like a record, as well as various methods which operate on the contents of the record. In the context of object-oriented programming, records are known as plain old data structures to distinguish them from objects.
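To make the preceding catalogue concrete, here is a minimal illustrative sketch in Python (the paper itself presents no code; the names and values are invented for illustration):

from dataclasses import dataclass

# Array: elements in a specific order, accessed by integer index.
scores = [71, 84, 90]            # a resizable array (Python list)
print(scores[1])                 # -> 84

# Record (tuple/struct): a fixed set of fields indexed by name.
@dataclass
class Student:
    name: str
    roll_no: int

s = Student(name="Asha", roll_no=12)

# Associative array (dictionary/map): name-value pairs that can be
# added and deleted freely; implemented as a hash table.
marks = {"Asha": 84}
marks["Ravi"] = 71
del marks["Asha"]

# Set: specific values, no particular order, no duplicates.
subjects = {"maths", "physics", "maths"}   # stores only {'maths', 'physics'}

# Tagged union: the value carries a field indicating its current type.
tagged = ("float", 3.14)
kind, value = tagged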
B. Natural Language Processing

Natural language processing gives machines the ability to read and understand the languages that humans speak. A sufficiently powerful natural language processing system would enable natural-language user interfaces and the acquisition of knowledge directly from human-written sources, such as newswire texts. Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation. As shown in Figure No. 1, a parse tree represents the syntactic structure of a sentence according to some formal grammar. A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem. Figure No. 1 represents a parse tree with text as an in-memory data structure: a collection of tokens supporting quick text analysis with frequency, collocation, similarity, and simple regex-based searching. A text collection is a grouping of text instances that allows corpus-wide calculations (frequency, inverse document frequency, etc.).
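As an illustration of this in-memory parse-tree structure, the following sketch uses the nltk library (a toolkit chosen here for illustration; the paper does not name one) to build a parse tree and run the kind of frequency calculation mentioned above:

from collections import Counter
from nltk import Tree   # assumes the nltk package is installed

# A parse tree for "the dog saw the cat" in bracketed notation.
sentence = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT the) (NN cat))))"
)

print(sentence.label())    # -> 'S', the root of the syntactic structure
print(sentence.leaves())   # -> ['the', 'dog', 'saw', 'the', 'cat']

# A corpus-wide frequency calculation over the tokens (leaf nodes).
frequency = Counter(sentence.leaves())
print(frequency["the"])    # -> 2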
Figure 1: Natural Language Processing (NLP)

C. Big Data

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance, and business informatics. The easiest definition of big data, given by Sam Madden in the white paper "From Databases to Big Data", is "data that is too big, too fast, or too hard for existing tools to process". Big data is data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast, or does not fit the strictures of your database architectures. Big data is distributed data, meaning the data is so massive it cannot be stored or processed by a single node. It has been proven by Google, Amazon, Facebook, and others that the way to scale fast and affordably is to use commodity hardware to distribute the storage and processing of massive data streams across several nodes, adding and removing nodes as needed. Data is said to be big data if it is characterized by Volume, Velocity, and Variety; in addition, the purpose of the data is to create value, and its complexity increases with the degree of interconnectedness and interdependence in the data. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, simulation, time series analysis and visualization.
Issues & Challenges with Big Data

As discussed above, according to Bill Franks the data in big data can be categorized as: automatically generated by a machine; typically an entirely new source of data (e.g. blogs); not designed to be friendly (e.g. text streams); and possibly of little value in bulk (one needs to focus on the important part). From this categorization, the author classifies big data as: structured (most traditional data sources), semi-structured (many sources of big data), and unstructured (video data, audio data). According to Stephen Kaisler et al., the data stored in a machine plays a very important role in decision making and knowledge discovery. A major challenge for IT researchers and practitioners is that the growth rate is fast exceeding our ability to both (1) design appropriate systems to handle the data effectively and (2) analyze it to extract relevant meaning for decision making. According to Michael Cooper and Peter Mell from NIST, the issues associated with big data can be given as:
• Taxonomies – ontologies, schemas, workflow
• Perspectives – backgrounds, use cases
• Bits – raw data formats and storage methods
• Cycles – algorithms and analysis
• Screws – infrastructure to support Big Data

D. Object Databases

An object database (also object-oriented database management system, OODBMS) is a database management system in which information is represented in the form of objects, as used in object-oriented programming. Object databases differ from relational databases, which are table-oriented; object-relational databases are a hybrid of both approaches. As the use of web-based technology increases with the implementation of intranets and extranets, companies have a vested interest in OODBMSs for displaying their complex data.

E. XML
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. The XML specification defines an XML document as a well-formed text – meaning that it satisfies a list of syntax rules provided in the specification. Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It is a structural schema language expressed in XML using a small number of elements and XPath. XPath defines a syntax named XPath expressions which identifies one or more of the internal components (elements, attributes, and so on) included in an XML document. XPath is widely used in other core-XML specifications and in programming libraries for accessing XML-encoded data. XSLT is a language with an XML-based syntax that is used to transform XML documents into other XML documents, HTML, or other, unstructured formats such as plain text or RTF. XSLT is very tightly coupled with XPath, which it uses to address components of the input XML document, mainly elements and attributes. XQuery is an XML-oriented query language strongly rooted in XPath and XML Schema. It provides methods to access, manipulate and return XML, and is mainly conceived as a query language for XML databases.
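For illustration, the sketch below uses Python's standard xml.etree.ElementTree module (an implementation choice made here, not in the paper) to parse a well-formed document and address its internal components with an XPath-style expression:

import xml.etree.ElementTree as ET

# A well-formed XML document: it satisfies the syntax rules of the
# XML specification (a single root and properly nested elements).
document = """
<reviews>
  <review id="1"><text>Service was excellent.</text></review>
  <review id="2"><text>Delivery was late.</text></review>
</reviews>
"""

root = ET.fromstring(document)

# An XPath-style expression selecting internal components (elements).
for review in root.findall(".//review"):
    print(review.get("id"), review.find("text").text)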
II. LITERATURE REVIEW
To date, different data structures have been implemented, with different aspects of evaluation, for big data storage, retrieval, and analysis.

Table No. 1: Comparative Probabilistic Data Structure Implementations for Big Data. For each data structure (Sr. No. 1-7), the numbered points describe its big data implementation level.

1. Linear Counting
   i. A linear counter is just a bit set, and each element in the data set is mapped to a bit.

2. Loglog Counting
   i. A more powerful and much more complex technique.
   ii. The unstable estimation is handled by using multiple independent observations and averaging them.
   iii. Incoming values are routed to a number of buckets using their first bits as a bucket address; each bucket maintains the maximum rank of the received values.

3. Count-Min Sketch
   i. A family of memory-efficient data structures that allow one to estimate frequency-related properties of the data set.
   ii. The Count-Min algorithm estimates the frequency of a given value as the minimum of the corresponding counters in each row; because the estimation error is always positive, it performs well on highly skewed data.

4. Count-Mean-Min Sketch
   i. It estimates the noise for each hash function as the average value of all counters in the row that correspond to this function (except the counter that corresponds to the query itself), deducts it from the estimation for this hash function, and finally computes the median of the estimations over all hash functions.
   ii. It uses the fact that the sum of all counters in a sketch row equals the total number of added elements.

5. Stream-Summary
   i. Stream-Summary allows one to detect the most frequent items in the data set and estimate their frequencies with an explicitly tracked estimation error.
   ii. Stream-Summary traces a fixed number (a number of slots) of elements that are presumably the most frequent ones. If one of these elements occurs in the stream, the corresponding counter is increased. If a new, non-traced element appears, it replaces the least frequent traced element, and this kicked-out element becomes non-traced.
   iii. Stream-Summary groups all traced elements into buckets, where each bucket corresponds to a particular frequency, i.e. to a number of occurrences.

6. Array of Count-Min Sketches
   i. A range query (something like SELECT count(v) WHERE v >= c1 AND v < c2) can be answered with a Count-Min sketch by enumerating all points within the range and summing the estimates of the corresponding frequencies.
   ii. To make this efficient, one maintains a number of sketches with different "resolutions": one sketch that counts frequencies for each value separately, one sketch that counts frequencies for pairs of values (one can simply truncate one bit of a value on the sketch's input), one sketch with 4-item buckets, and so on.

7. Bloom Filter
   i. The most famous and widely used probabilistic data structure.
   ii. A Bloom filter is similar to Linear Counting, but it is designed to maintain an identity of each item rather than statistics. Like a linear counter, the Bloom filter maintains a bitset, but each value is mapped not to one but to some fixed number of bits, by using several independent hash functions.
   iii. If the filter has a relatively large size in comparison with the number of distinct elements, each element has a relatively unique signature, and it is possible to check whether a particular value is already registered in the bit set. If all the bits of the corresponding signature are ones, then the answer is yes.
   iv. A query returns either "possibly in set" or "definitely not in set".
Table No. 1 above shows a comparative analysis of different probabilistic data structures for big data, and what each takes into account as its implementation basis [8].
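As a concrete instance of row 7 of Table No. 1, here is a minimal Bloom filter sketch (an illustrative implementation, not one proposed in the paper): a bit set in which each value switches on a fixed number of bits chosen by independent hash functions, so that a membership query answers either "possibly in set" or "definitely not in set":

import hashlib

class BloomFilter:
    # Minimal Bloom filter: an m-bit set and k independent hash functions.

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, value):
        # Derive k independent bit positions by salting SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def query(self, value):
        # All signature bits set -> the value was possibly added;
        # any bit still clear -> the value was definitely never added.
        if all(self.bits[pos] for pos in self._positions(value)):
            return "possibly in set"
        return "definitely not in set"

bf = BloomFilter()
bf.add("user-42")
print(bf.query("user-42"))   # -> possibly in set
print(bf.query("user-99"))   # -> definitely not in set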
III. STATEMENT OF THE PROBLEM
As information technology systems become less monolithic and more distributed, real-time big data analysis will become less exotic and more commonplace. At that point, the focus will shift from data science to the next logical frontier: decision science. The researcher would like to focus on big data by applying data structures, with the aim of designing an efficient, iterative, smart algorithmic data model for unstructured big data. Such an algorithm is very useful to store, retrieve, search, and analyse big data.
IV. OBJECTIVES

• To carry out a comparative study of different data structures applied to big data.
• To design an iterative smart algorithmic data model for unstructured big data.
• To generalize the algorithm: while considering multiple parameters, and with accuracy, it is expected to be fast, precise, and improved. It will help to design strategies and reduce business loss. The techniques will be generalized, useful not only for the Indian educational system but for any area where voluminous data must be accessed within the shortest time.

V. RESEARCH METHODOLOGY

The researcher has planned to follow the Design and Creation research strategy (Figure No. 2). The strategy focuses on the formation of a new XML-based processing technique for big data analytics. As shown in Figure No. 2, unstructured big data can be collected and integrated together in the XML Object data structure. The researcher would like to explore the strength of searching, sorting, and processing of XML structures, which are aimed at extracting relevant information from a possible set of heterogeneous documents in the form of unstructured text. Not all data users may be capable of reading and analysing information from the many varieties of structured data.

Figure 2: XML Based Processing of Big Data

All such different data sources are also integrated to obtain a combination of partially structured data and structured data. XML trees, subtrees, or graph structures in the form of an XML object can extend the XML document structure, and XML documents can be scanned for matching XML patterns. Such a data structure design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. As shown in Figure No. 3, the proposed XML-based processing of big data involves three different stages to transform less structured data into highly structured data:
i. Collection of distributed data
ii. Transformation into the XML Object data structure
iii. Storage to / retrieval from a database

Figure 3: XML Based Processing of Big Data

Figure No. 3 shows that distributed file systems may hold data in one of many structured forms, but because of the heterogeneous structure forms, a user may not analyse it with accuracy and speed. The approach therefore involves the integration of heterogeneous structured data, less structured data,
and unstructured data with an XML structure, so that XML objects can be stored and retrieved in a possibly universal format. A grammar-based XML structure can help to decompose and synthesise sentences from a document so as to analyse opinion, which will support efficient and significant analytics development.
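A minimal sketch of this integration step follows (the element names and envelope layout are assumptions made for illustration): a record from a structured source and a free-text sentence are wrapped into one XML object so that both can be stored and retrieved in a single format:

import json
import xml.etree.ElementTree as ET

# Heterogeneous inputs: structured JSON and unstructured text.
structured = json.loads('{"student": "Asha", "grade": "A"}')
unstructured = "The course content was engaging and useful."

# Wrap both into one XML object, a hypothetical universal envelope.
root = ET.Element("BigDataObject")

record = ET.SubElement(root, "StructuredData")
for key, value in structured.items():
    ET.SubElement(record, key).text = value

text = ET.SubElement(root, "UnstructuredData")
text.text = unstructured            # the data sits at a leaf node

print(ET.tostring(root, encoding="unicode"))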
In the XML tree as a big data object data structure, a sentence can be decomposed with grammar rule bases to obtain more precise nodes on the path to a leaf node. An objective node is classified as a positive, negative, or neutral node.
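The fragment below sketches this idea (the word lists and node names are illustrative assumptions, not the paper's final rule base): a sentence is decomposed into element nodes, its text is kept in a leaf data node, and the objective node is classified as positive, negative, or neutral by a trivial rule standing in for the grammar rule base:

import xml.etree.ElementTree as ET

POSITIVE = {"good", "excellent", "engaging"}
NEGATIVE = {"bad", "late", "poor"}

def classify(sentence):
    # A trivial stand-in for the grammar/rule-based classifier.
    words = {w.strip(".,!?") for w in sentence.lower().split()}
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"

def to_xml_object(sentence):
    doc = ET.Element("OpinionatedDocument")
    node = ET.SubElement(doc, "ObjectiveNode", polarity=classify(sentence))
    ET.SubElement(node, "Data").text = sentence   # data as a leaf node
    return doc

tree = to_xml_object("Delivery was late.")
print(ET.tostring(tree, encoding="unicode"))
# -> <OpinionatedDocument><ObjectiveNode polarity="negative">...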
Figure 4: Classification of XML Structure Big Data

Figure No. 4 shows that the large and complex data sets need to be simplified into smaller but logically related container sets: all documents will first be divided into opinionated and non-opinionated documents, so that only the opinionated documents are synthesized further. The researcher would like to apply this universal data structure to store and retrieve big data.

Then the subjective and objective sentences of documents can be compared using an automated rule-based system to make predictions and suggest decisions. Predicted results can be classified into three categories: positive, negative, and neutral sets of opinions. XML has proved its power in many different areas; since element and attribute values can be stored and transmitted using XML, software development stakeholders have embraced XML implementations. Data nodes are always leaf nodes in an XML tree structure, but as shown in Figure No. 4, big data can be classified further to have more levels of nodes, representing data according to opinionated and non-opinionated data tree nodes. The Opinionated Document element then has a hierarchy with further levels of Subjective and Objective Document nodes.

VI. CONCLUSION

In this paper, the XML object is presented as a universal and very simple data structure. The linearization of a tree structure, having namespaces and the expressiveness of a structured grammar for regular grammar, together with the International Unicode Standard, XSLT, XSL Formatting Objects, and the XQuery specification, provides a flexible, strong but simple tree structure. XSLT is stronger in its handling of narrative documents with more flexible structure, while XQuery is stronger in its data handling. The proposed XML tree structure aims to provide an appropriate data structure for big data handling and to use it to analyse and retrieve data in a meaningful way, so as to take the right decisions at the right time.

References

1. Edd Dumbill, "Big Data, 2012 Edition", O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA.
2. Soumendra Mohanty, Madhu Jagadeesh, Harsha Srivatsa, "Big Data Imperatives: Enterprise 'Big Data' Warehouse, BI Implementations and Analytics", Apress.
3. Vincenzo Loia, Masoud Nikravesh, Lotfi A. Zadeh, "Fuzzy Logic and the Internet", Studies in Fuzziness and Soft Computing, Springer-Verlag, Berlin Heidelberg New York, ISBN 3-540-20180-7.
4. Felipe Bravo-Marquez, Marcelo Mendoza, Barbara Poblete, "Meta-Level Sentiment Models for Big Social Data Analysis", Knowledge-Based Systems, May 2014.
5. "Demystifying Big Data: A Practical Guide to Transforming the Business of Government", TechAmerica Foundation's Federal Big Data Commission, 2012.
6. George Gilbert, "A Guide to Big Data Workload Management Challenges", DataStax, May 2012.
7. Manisha Shinde, "Formation of Smart Sentiment Analysis Technique for Big Data", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 2, Issue 12, December 2014.
8. "Probabilistic Data Structures for Web Analytics and Data Mining", posted May 1, 2012, https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining
9. http://en.wikipedia.org/wiki/Big_data
10. http://en.wikipedia.org/wiki/Data_structure

Note: All Internet references are active as on 2nd January, 2015.

Biography

Manisha Shinde-Pawar received the B.Sc. degree in Computer Science in 2003 and the MBA degree in Information Technology and Management in 2008. She has also received the Master of Computer Applications (MCA) degree. She joined BVDU, IMRDA, Sangli, MS, India as an Assistant Professor in Information Technology. Her research interests include distributed systems, mobile computing, and big data.