Hypergraph Based Abstraction for File-Less Data Management
Bartosz Kryza and Jacek Kitowski
Department of Computer Science, Faculty of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology, Krakow, Poland
[email protected]
Abstract. Data management is currently one of the predominant issues in both large-scale and consumer computing systems. While most data is still stored in regular files and managed by various filesystems, current trends show that users no longer treat their data as files but rather as objects, which is particularly evident on mobile devices and in Cloud-based applications powered by distributed, highly scalable databases. In this paper we consider the possibility of abandoning the concept of files and filesystems in favor of a more elastic and elementary approach to data management and storage, and propose a new approach to distributed data modelling which does not require filesystems. In our opinion, filesystems are a legacy paradigm that does not fit modern data management use cases.
Keywords: File systems · Data management · Hypergraphs
1 Introduction
Data in computer systems has been organized into files managed by filesystems since their inception. This made sense as long as data was mainly stored in block-based storage systems such as tape or hard drives. Currently, however, SSD-based memories as well as emerging NVRAM technologies do not require data to be stored in sequentially accessed blocks. More importantly, files introduce several issues and bottlenecks which limit the way applications and operating systems can be developed. First of all, data in files is unnaturally clustered, causing high duplication [9,14] and unnecessary transfer overhead (e.g. consider a document containing various images and charts, when we only need a single formula from a selected section). Furthermore, filesystems themselves organize files in large tree-based namespaces. For small numbers of files such namespaces remain comprehensible; beyond a certain number of directories, however, paths become useless for actually finding information, as it is not possible to remember all the branches where a certain file may be stored. Thus modern operating systems provide their users with real-time file scanning tools, such as Spotlight, which make it possible to find files using keyword-based search without manually traversing the directory structure. Another important issue is that files create a barrier for the operating system, which makes it impossible, or at least difficult, to address a specific data item inside a file, as the operating system would have to understand all possible file formats. This makes it difficult to provide, in a generic way, functionality such as accessing and referencing metadata stored inside multimedia files.
In our previous work [17,18], we have proposed a novel approach to data management and representation, called Filess, where the building blocks managed by the operating system are not files but objects, interrelated in the form of a hypergraph, a generalization of graphs where edges can connect more than 2 nodes. Our vision assumes the following objectives:
– There are no files, neither in the application nor in the operating system layer.
– All data items are represented in the form of objects interrelated using relations forming a hypergraph structure.
– Data and metadata exist at the same level; for instance, there is no difference between the 'Image' object and the object describing its author or authorization policy.
– Filess is not a metadata formalism such as Dublin Core or even Semantic Web, which impose strong restrictions on how data can be modeled.
– Data replication is managed by the Filess middleware; it is not necessary for users to copy and store the information for either security or efficiency reasons. As a consequence, data redundancy is minimized.
– Objects can be split into subgraphs representing their content in more detail, and multiple objects can be aggregated into other objects by connecting them using edges.
In this position paper we focus on presenting the hypergraph data model for storing data in Filess-based systems. The paper discusses the motivation and the theoretical model, as implementation and performance details are not yet the focus of this research.
© Springer International Publishing Switzerland 2016. R. Wyrzykowski et al. (Eds.): PPAM 2015, Part I, LNCS 9573, pp. 322–331, 2016. DOI: 10.1007/978-3-319-32149-3_31
2 Related Work
Most research on making the existing directory-based file systems more flexible can be classified into the area of semantic file systems [10], i.e. file systems where files have attached meaning. That work sketches a vision of file systems where files can be annotated in some way, and the basic file system operations such as copy or delete do not take directory paths as arguments but semantic descriptions of the files. The problem with these solutions is that all the information is still either fragmented or clustered into files, and the semantics deal only with metadata attached to these files in the form of attributes. Nevertheless, these solutions are very important for our work, as they address important issues, mainly how information can be found in file-based systems. One of the formal attempts at a file system implementation on a set-theoretical basis is a file system using Formal Concept Analysis [7], which employs the FCA formal model of classification, neighborhood estimation and Boolean querying. A similar approach, although still bounded by
the constraints of regular files, is the Logical File System project [22]. The basic role of this file system is to allow searching for files using first-order logic formulas instead of conventional directory paths. Unfortunately, the use of first-order logic inference can seriously impair the scalability of the system in highly distributed settings. Until now, one major industrial attempt at abstracting the file concept away from the operating system was WinFS (Windows File System), a research effort from Microsoft [11]. Its basic assumption is to store all information about data in the system, including what would usually be referred to as a file, in a relational database. With respect to high scalability, an interesting approach is represented by the Google BigTable system [4], which allows storing up to petabytes of information about URLs indexed by Google. However, the information is stored in tables, columns and rows and is accessed on a key-value basis in order to be optimized for storing information about URLs, and the implementation itself is based on the Google File System, so all the information is still inherently chunked into files. With respect to more flexible user interfaces, which could naturally evolve from the proposed file-less environments, several research projects have already addressed that issue, although they were still limited by the file-centered nature of current information systems. NEPOMUK [12] is a project whose goal is to provide knowledge workers with a semantic desktop based on Semantic Web technologies, by extending the most popular applications with the ability to process semantic annotations of data. Currently several new formalisms and technologies are beginning to emerge, such as technologies related to the Semantic Web [3], including RDF and OWL [13], and various knowledge base solutions which allow annotating web 'files', that is web pages [16,19,20].
Although created for the purpose of annotating existing information, these technologies by themselves could possibly be used to provide a basis for our vision. Additionally several formal models of information categorization and abstraction have emerged, which can be useful in this research. One such example is Rough Set theory [23], which provides means of automatic reduction of attribute space required for generating an equivalent classification of objects, and thus could be used for some form of optimization of indexing of information within the system.
3 Data Model
The underlying model proposed in this research for universal representation of data is the hypergraph [2,8]. Hypergraphs are a generalization of regular graphs with the property that each edge can connect any number of vertices, i.e. edges are simply subsets of the vertex set of the hypergraph (undirected hypergraph) or pairs of subsets of vertices (directed hypergraph). This enables more natural modelling of complex relations, such as n-m joins between objects in the data model. All data in this model is stored in data objects, which are vertices of the hypergraph, while relations and properties are linked using hyperedges.
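For reference, the two variants described above can be written down as follows. This is a sketch of the usual notation, assuming the standard definitions found in [2,8]; the names tail(e) and head(e) are chosen here to match the Edge properties in Fig. 1.

```latex
% Undirected hypergraph: every edge is a subset of the vertex set.
\[
H = (V, E), \qquad E \subseteq 2^{V}
\]
% Directed hypergraph: every edge is an ordered pair of vertex subsets.
\[
H = (V, E), \qquad E \subseteq 2^{V} \times 2^{V}, \qquad
e = (\mathrm{tail}(e), \mathrm{head}(e)), \quad \mathrm{tail}(e), \mathrm{head}(e) \subseteq V
\]
```

An ordinary directed graph is the special case in which both tail(e) and head(e) contain exactly one vertex.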
Each node's value can be any of the following data types (Fig. 1):
– Number - a union data type which can store any numeric data type while providing users with a simple API that handles actual data type identification at the library level,
– String - this data object provides means for storing any text in UTF-8,
– List - most graph data modeling frameworks do not provide lists or arrays, and emulating them with graph nodes can be very inefficient. This data object provides a simple means of composing a set of data objects into an ordered structure,
– Binary - this data object provides means for storing large binary data such as videos, where the actual data is hashed and stored in a separate distributed key-value store,
– Composite - composite data objects are objects which do not need to store any actual value in their node, but provide links to other data objects. Any object containing a value can also be a Composite object, in which case the value is a flattened representation of the object's structure. This situation can occur during decomposition of an object into a graph,
– Stream - stream (buffer) objects provide an abstraction over the I/O functionality of the operating system. These objects cannot be transferred between nodes and are volatile, i.e. their state and value cannot be synchronized, no version information is maintained for them, and only read or write operations are allowed. These objects enable complete removal of file and filesystem concepts from application code.
In order to support maximum flexibility, no high-level typing mechanism is enforced by the data model; custom typing can be achieved using application-specific properties attached to the graph nodes.
[Figure 1: class diagram. Node (namespace : UUID); Edge (tail : Set, head : Set); DataObject (id : UUID); Number (value : Number); String (value : String); List (value : [DataObject]); Binary (value : Hash); Composite; Stream.]
Fig. 1. Data model structure and properties
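The structure in Fig. 1 can be illustrated with a minimal Python sketch. It is not the Filess implementation: the class layout, the `kind` field, the `label` on edges, the helper `make_binary` and the use of SHA-256 are all assumptions made for illustration, and the `namespace` property (which Fig. 1 attaches to Node) is simply made available on both vertices and edges.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Any, FrozenSet, Optional

# The six value types listed above.
KINDS = {"Number", "String", "List", "Binary", "Composite", "Stream"}


@dataclass
class DataObject:
    """A hypergraph vertex holding a value of one of the six kinds."""
    kind: str
    value: Any = None                      # Composite objects may carry no value
    namespace: Optional[uuid.UUID] = None  # assumption: optional on every node
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown data object kind: {self.kind}")


@dataclass(frozen=True)
class Edge:
    """A directed hyperedge: a labelled pair (tail, head) of sets of object ids."""
    label: str                             # assumption: edges are distinguished by a label
    tail: FrozenSet[uuid.UUID]
    head: FrozenSet[uuid.UUID]
    namespace: Optional[uuid.UUID] = None


def make_binary(payload: bytes, blob_store: dict) -> DataObject:
    """Store the payload in a separate key-value store; keep only its hash in the node."""
    digest = hashlib.sha256(payload).hexdigest()
    blob_store[digest] = payload
    return DataObject("Binary", digest)


# Usage: an image object whose author and content are ordinary data objects.
blobs = {}
author = DataObject("String", "John")
content = make_binary(b"raw image bytes", blobs)
image = DataObject("Composite")
edges = [
    Edge("author", frozenset({image.id}), frozenset({author.id})),
    Edge("content", frozenset({image.id}), frozenset({content.id})),
]
```

In this sketch the author is reachable from the image through an edge in exactly the same way as the binary content, which mirrors the statement that data and metadata exist at the same level.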
4 Example Hypergraph Representation of Simple Data Model
In order to present how hypergraphs can be used to represent various data models, let's consider a simple model with 3 representations: relational, JSON and XML. The model contains a very simple relationship between people and their hobbies. In the relational model (see Fig. 2) [5], this requires an intermediate relation which makes it possible to create an n-m mapping between these entities. The main concepts of this model are relations and columns: a relation is a subset of a Cartesian product of a sequence of sets. A given relation (table) has a name and provides a list of names (columns) for the sets in the Cartesian product which form the domain of the relation.
Person
id   name
1    John
2    Mary
3    Sam

Hobbies
personId   hobbyId
1          1
2          1
2          2
3          2
3          3

Hobby
id   name
1    sailing
2    photography
3    movies

Fig. 2. Relational representation of the simple model
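To make the role of the intermediate relation concrete, the following sketch treats each relation of Fig. 2 as a set of tuples and resolves a person's hobbies by joining through Hobbies. The variable names and the function `hobbies_of` are ours and serve only as an illustration.

```python
# Each relation from Fig. 2 as a set of tuples (a subset of a Cartesian product).
person = {(1, "John"), (2, "Mary"), (3, "Sam")}             # (id, name)
hobby = {(1, "sailing"), (2, "photography"), (3, "movies")}  # (id, name)
hobbies = {(1, 1), (2, 1), (2, 2), (3, 2), (3, 3)}           # (personId, hobbyId)


def hobbies_of(name):
    """Join Person -> Hobbies -> Hobby and return the hobby names, sorted."""
    person_ids = {pid for pid, pname in person if pname == name}
    hobby_ids = {hid for pid, hid in hobbies if pid in person_ids}
    return sorted(hname for hid, hname in hobby if hid in hobby_ids)


print(hobbies_of("Mary"))  # ['photography', 'sailing']
```

Every lookup has to pass through the Hobbies relation, which exists only to encode the n-m mapping and carries no information of its own.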
JSON is a text format used to represent key-value pairs, where keys are always strings and values can be any of the following types: Number, String, Boolean, Array, Object and null. The JSON representation of the same model is presented in Fig. 3. It is more flexible than the relational one, as it allows nesting multiple subobjects within the properties of an object, at the price of duplication of information. XML (eXtensible Markup Language) is a W3C recommendation [6] which provides a tree-based model for representing structured data on the Internet. In contrast to JSON, it provides means for specifying unique namespaces for all elements, ordering of the nodes, as well as assigning (unordered) attributes to nodes. The XML representation of our simple model is presented in Fig. 4.
Finally, the hypergraph representation of the model is presented below. According to the definition, a hypergraph is a pair of sets representing the vertices and edges. The vertex set contains all entities from the model. The id properties have been skipped as they are not important in this model.
$$H = (V, E)$$
$$V = \{\text{"John"}, \text{"Mary"}, \text{"Sam"}, \text{"sailing"}, \text{"photography"}, \text{"movies"}, \text{"Person"}, \text{"Hobby"}\}$$
[{
    "id": 1, "type": "Person", "name": "John",
    "hobby": [{"id": 1, "type": "Hobby", "name": "sailing"}]
}, {
    "id": 2, "type": "Person", "name": "Mary",
    "hobby": [{"id": 1, "type": "Hobby", "name": "sailing"},
              {"id": 2, "type": "Hobby", "name": "photography"}]
}, {
    "id": 3, "type": "Person", "name": "Sam",
    "hobby": [{"id": 2, "type": "Hobby", "name": "photography"},
              {"id": 3, "type": "Hobby", "name": "movies"}]
}]
Fig. 3. JSON representation of the model
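[Figure 4: XML listing; the element markup was lost in extraction. Recoverable content: John (sailing); Mary (sailing, photography); Sam (photography, movies).]
Fig. 4. XML representation of the model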
The edge set is divided into subsets based on the label of the edge.
$$E = E_{\mathrm{type}} \cup E_{\mathrm{hobby}} \cup E_{\mathrm{people}}$$
$$E_{\mathrm{type}} = \{e_1 = (\{\text{"John"}, \text{"Mary"}, \text{"Sam"}\}, \{\text{"Person"}\}),\ e_2 = (\{\text{"sailing"}, \text{"photography"}, \text{"movies"}\}, \{\text{"Hobby"}\})\}$$
$$E_{\mathrm{hobby}} = \{e_3 = (\{\text{"John"}, \text{"Mary"}\}, \{\text{"sailing"}\}),\ e_4 = (\{\text{"Mary"}, \text{"Sam"}\}, \{\text{"photography"}\}),\ e_5 = (\{\text{"Sam"}\}, \{\text{"photography"}, \text{"movies"}\})\}$$
Finally we have to connect the objects composing the model into a set:
$$E_{\mathrm{people}} = \{e_6 = (\{\text{"John"}, \text{"Mary"}, \text{"Sam"}\}, \emptyset)\}$$
From this representation, we can see that directed hyperedges represent relations between nodes in such a way that it is not necessary to repeat the nodes in different relations, which is the case in the JSON and XML representations. Furthermore, it does not require multiple edges to connect nodes which are the same, as a hyperedge can directly connect all source and target nodes which are in a particular relation.
In general, the mapping from these formats to the hypergraph model can be achieved as follows. For the relational model, each row is simply mapped to a single composite data object, with edges representing the columns and their particular values as target data objects. Each relation (i.e. table) can be represented as a set of data objects representing rows. More interesting is the case of foreign key dependencies. In the relational model it is impossible to directly create n:m relations.
The mapping of XML data into a directed hypergraph can be achieved as follows. All simple tags (containing only values) are converted to simple data objects. All complex tags, which contain child tags, are converted to composite data objects. All tag attributes are added to the respective data objects using edges.
In the case of JSON, Boolean values can be modelled using the Number data object type, Arrays by creating lists, and null values can be represented using hyperedges with empty head sets. Object values can be directly represented using Composite data objects. One issue is that of namespaces, as the edges created from the JSON keys must be attached to some namespace in order to disambiguate them from other edges.
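The JSON mapping rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `json_to_hypergraph`, the integer vertex ids, the `namespace:key` edge labels and the sample document (including its `active` and `nickname` keys) are hypothetical and not part of Filess. The sketch creates a fresh vertex for every value; merging equal values into one vertex, as in the set V of the example, would be a separate step that is not shown.

```python
import itertools
import json


def json_to_hypergraph(doc, namespace):
    """Map a parsed JSON value to (root id, vertices, edges).

    vertices: id -> (kind, value); edges: (label, tail ids, head ids).
    """
    vertices = {}
    edges = []
    ids = itertools.count(1)

    def add(kind, value=None):
        vid = next(ids)
        vertices[vid] = (kind, value)
        return vid

    def visit(value):
        if isinstance(value, bool):            # checked before int: bool is an int subclass
            return add("Number", int(value))   # Boolean -> Number
        if isinstance(value, (int, float)):
            return add("Number", value)
        if isinstance(value, str):
            return add("String", value)
        if isinstance(value, list):            # Array -> List of child object ids
            return add("List", [visit(item) for item in value])
        if isinstance(value, dict):            # Object -> Composite
            obj = add("Composite")
            for key, child in value.items():
                label = f"{namespace}:{key}"   # keys are qualified by a namespace
                # null -> hyperedge with an empty head set
                head = frozenset() if child is None else frozenset({visit(child)})
                edges.append((label, frozenset({obj}), head))
            return obj
        # null is only handled as an object property in this sketch
        raise TypeError(f"unsupported JSON value: {value!r}")

    root = visit(doc)
    return root, vertices, edges


doc = json.loads(
    '{"name": "Mary", "active": true, "nickname": null,'
    ' "hobby": ["sailing", "photography"]}'
)
root, vertices, edges = json_to_hypergraph(doc, "urn:example")
```

Two documents that use the same key with different meanings would be mapped with different namespace arguments, so their edges remain distinguishable.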
5 Filess Prototype
The Filess prototype is currently being developed using available technologies, enabling evaluation of the idea at the proof-of-concept level. The implementation is written in Java. We have evaluated several solutions, including [15,21]. HyperGraphDB provides a native directed hypergraph database implemented in Java with a backend based on BerkeleyDB. Finally we chose OrientDB, which is a multi-model database enabling modeling using document, key-value and graph paradigms simultaneously. In order to support legacy applications, an intermediate FUSE filesystem plugin was implemented, which allows applications to access the information in the form of files that are composed on demand from the underlying graph when an application tries to access a data object. The implementation is based on the fuse-jna Java FUSE provider, which allowed us to use the OrientDB Java bindings directly. Filess exposes an abstract API enabling basic operations on data objects, such as searching, creating and opening, which is available to applications through the libfiless library. As mentioned above, each user sees the global data space from their own perspective, which is identical on all devices from which they access the system. The current Filess prototype is implemented as an intermediate layer between user applications and a distributed graph database backend (see Fig. 5). Binary, read-only data objects are stored in a separate distributed store, IPFS (InterPlanetary File System) [1], which provides efficient hashing and distribution of large binary files between multiple nodes by dividing them into blocks and maintaining a tree structure based on the blocks' hash values.
Fig. 5. Architecture of the Filess prototype
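The idea of dividing binary data into blocks and organizing their hashes in a tree can be illustrated with a minimal sketch. It does not reproduce the IPFS object format or its API: the in-memory `store` dictionary, the function names `put` and `get`, the tiny block size and the use of SHA-256 over concatenated child hashes are all simplifying assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example produces several blocks


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def put(payload: bytes, store: dict, block_size: int = BLOCK_SIZE) -> str:
    """Split payload into blocks, store each under its hash, return the root hash."""
    blocks = [payload[i:i + block_size] for i in range(0, len(payload), block_size)]
    if not blocks:
        blocks = [b""]                       # an empty payload is a single empty leaf
    level = []
    for block in blocks:
        digest = sha(block)
        store[digest] = ("leaf", block)      # identical blocks are stored only once
        level.append(digest)
    while len(level) > 1:                    # build parent levels until one root remains
        parents = []
        for i in range(0, len(level), 2):
            children = tuple(level[i:i + 2])
            digest = sha("".join(children).encode())
            store[digest] = ("node", children)
            parents.append(digest)
        level = parents
    return level[0]


def get(root: str, store: dict) -> bytes:
    """Reassemble the payload by walking the hash tree from the root."""
    kind, body = store[root]
    if kind == "leaf":
        return body
    return b"".join(get(child, store) for child in body)


store = {}
root = put(b"abcdabcdxy", store)
assert get(root, store) == b"abcdabcdxy"
```

Because a Binary data object only needs to keep the root hash, the same content always resolves to the same identifier, and repeated blocks (here the two `abcd` blocks) occupy a single entry in the store.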
6 Conclusions
In this paper we have presented a novel approach to data management and representation, in which we propose to abandon the concept of files and filesystems altogether in future IT infrastructures. Files and filesystems introduce several issues: they cluster data unnaturally, they are a barrier preventing operating systems and generic services from operating directly on data, and they require users to navigate large hierarchical namespaces.
The presented approach has the potential to enable much more natural access to information while minimizing redundancy and data transfer on a global scale. At the same time it allows for highly fine-grained access control, based not on files but on actual data elements, which will enable the creation of much more sophisticated and natural computing infrastructures able to handle information processing tasks on a global scale. Future work will focus on further evaluation of the prototype, the design and implementation of a security mechanism enabling multiple users to securely share and operate on the global data graph, and the design of an operating system without a filesystem.
Acknowledgment. This research has been funded by Polish National Science Centre grant File-less architecture of large scale distributed information systems number: DEC2012/05/N/ST6/03463.
References
1. Benet, J.: IPFS - content addressed, versioned, P2P file system. CoRR abs/1407.3561 (2014). http://arxiv.org/abs/1407.3561
2. Berge, C.: Hypergraphs: Combinatorics of Finite Sets, vol. 45. North-Holland, North-Holland Mathematical Library (1989)
3. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Sci. Am. 284(5), 34–43 (2001)
4. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. In: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2006, vol. 7, p. 15. USENIX Association, Berkeley, CA, USA (2006)
5. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
6. Cowan, J., Tobin, R.: XML information set, 2nd edn. Technical report, February 2004. http://www.w3.org/TR/2004/REC-xml-infoset-20040204
7. Ferré, S., Ridoux, O.: A file system based on concept analysis. In: Palamidessi, C., Moniz Pereira, L., Lloyd, J.W., Dahl, V., Furbach, U., Kerber, M., Lau, K.-K., Sagiv, Y., Stuckey, P.J. (eds.) CL 2000. LNCS (LNAI), vol. 1861, pp. 1033–1047. Springer, Heidelberg (2000)
8. Gallo, G., Longo, G., Pallottino, S., Nguyen, S.: Directed hypergraphs and applications. Discrete Appl. Math. 42(2–3), 177–201 (1993)
9. Gantz, J., Reinsel, D.: The digital Universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. International Data Corporation, December 2010
10. Gifford, D.K., Jouvelot, P., Sheldon, M.A., O'Toole Jr., J.W.: Semantic file systems. SIGOPS Oper. Syst. Rev. 25(5), 16–25 (1991)
11. Grimes, R.: Code name WinFS: revolutionary file storage system lets users search and manage files based on content. MSDN Magazine 19(1) (2004). http://msdn.microsoft.com/msdnmag/issues/04/01/WinFS/
12. Groza, T., Handschuh, S., Moeller, K., Grimnes, G., Sauermann, L., Minack, E., Mesnage, C., Jazayeri, M., Reif, G., Gudjonsdottir, R.: The NEPOMUK project - on the way to the social semantic desktop. In: Pellegrini, T., Schaffert, S. (eds.) Proceedings of I-Semantics 2007, pp. 201–211. JUCS (2007)
13. Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to OWL: the making of a web ontology language. Web Semantics: Science, Services and Agents on the World Wide Web 1(1), 7–26 (2003). http://www.sciencedirect.com/science/article/pii/S1570826803000027
14. IDC iView: The Digital Universe Decade - Are You Ready? International Data Corporation, Framingham, MA, USA (2010). http://www.emc.com/digital universe
15. Iordanov, B.: HyperGraphDB: a generalized graph database. In: Shen, H.T., Pei, J., Özsu, M.T., Zou, L., Lu, J., Ling, T.-W., Yu, G., Zhuang, Y., Shao, J. (eds.) WAIM 2010. LNCS, vol. 6185, pp. 25–36. Springer, Heidelberg (2010)
16. Kryza, B., Pieczykolan, J., Kitowski, J.: Grid organizational memory: a versatile solution for ontology management in the grid. In: e-Science 2006, Second IEEE International Conference on e-Science and Grid Computing, 2006, p. 16 (2006)
17. Kryza, B., Kitowski, J.: Filess - file-less architecture for future information systems. In: 2014 IEEE Fourth International Conference on Big Data and Cloud Computing, BDCloud 2014, Sydney, Australia, pp. 281–282. 3–5 December 2014
18. Kryza, B., Kitowski, J.: Comparison of information representation formalisms for scalable file agnostic information infrastructures. Comput. Inf. 34(2), 473–494 (2015)
19. Kryza, B., Slota, R., Majewska, M., Pieczykolan, J., Kitowski, J.: Grid organizational memory-provision of a high-level grid abstraction layer supported by ontology alignment. Future Gener. Comput. Syst. 23(3), 348–358 (2007)
20. Mylka, A., Mylka, A., Kryza, B., Kitowski, J.: Integration of heterogeneous data sources in an ontological knowledge base. Comput. Inf. 31(1), 189–223 (2012)
21. Orien Technologies: OrientDB project website. http://www.orientechnologies.com
22. Padioleau, Y., Ridoux, O.: A logic file system. In: Proceedings of the General Track: 2003 USENIX Annual Technical Conference, San Antonio, Texas, USA. pp. 99–112. 9–14 June 2003
23. Pawlak, Z.: Rough set approach to knowledge-based decision support. Eur. J. Oper. Res. 99, 48–57 (1995)