SCHEMA DECISION TREES FOR HETEROGENEOUS JSON ARRAYS

by

Davis Goulet

B.Sc., University of Northern British Columbia, 2018

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

August 2020

© Davis Goulet, 2020

Abstract

Due to the popularity of the JavaScript Object Notation (JSON), a need has arisen for the creation of schema documents for the purpose of validating the content of other JSON documents. Existing automatic schema generation tools, however, have not adequately considered the scenario of an array of JSON objects with different types of structures. These tools work under the assumption that all objects have the same structure and thus generate only a single schema combining them together. To address this problem, this thesis looks to improve schema generation for heterogeneous JSON arrays. We develop an algorithm to determine a set of keys that identifies what type of structure each element has. These keys are then used as the basis for a schema decision tree. The objective of this tree is to help in the validation process by allowing each element to be compared against a single, more tailored, schema.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements

1 Introduction and Motivation
  1.1 Overview
  1.2 Applications
    1.2.1 NoSQL Databases
    1.2.2 Similarity Calculations
  1.3 Contributions
  1.4 Thesis Layout

2 Overview of Semi-structured Data
  2.1 Types of Data
  2.2 XML
  2.3 Moving Beyond XML
    2.3.1 Element vs. Attribute
    2.3.2 Support for Arrays
    2.3.3 Restricting an Element's Content
  2.4 JSON
    2.4.1 AJAX
    2.4.2 JSON Format
  2.5 Schema Documents
    2.5.1 JSON Schema
  2.6 Popularity of JSON

3 Problem Characterization
  3.1 JSON Arrays and Data Types
  3.2 Problem Overview
  3.3 Driving Observation
  3.4 Schema Decision Trees
  3.5 Problem Statement
  3.6 Assumptions
    3.6.1 Common Structures
    3.6.2 Existence of Identification Keys
    3.6.3 Complete Input Data
    3.6.4 Array of Objects

4 Related Works
  4.1 Related Works Involving JSON
  4.2 Related Works Involving XML
  4.3 Related Works Overview

5 Solution Details
  5.1 Syntactic Similarity Scores
    5.1.1 Tree-edit Distance
    5.1.2 Disadvantages of Tree-edit Distance
      5.1.2.1 Similarity of More than Two Trees
      5.1.2.2 Comparing Unordered Trees
    5.1.3 Path Distance
    5.1.4 Path Notation for Nested Arrays
  5.2 Similarity of a Group
  5.3 Similarity of a Grouping
  5.4 Refining the Scoring Criteria

6 Algorithm
  6.1 Starting the Algorithm
  6.2 GroupNode Algorithm
    6.2.1 Initializing a New GroupNode
    6.2.2 Generating Groupings for a GroupNode
  6.3 SplitNode Algorithm
    6.3.1 Initializing a New SplitNode
    6.3.2 Generating Groupings for a SplitNode
  6.4 Algorithm Remarks
  6.5 Constructing a Schema Decision Tree
  6.6 Integrating into JSON Schema

7 Evaluation and Analysis
  7.1 Algorithm Walkthrough
  7.2 Evaluation
    7.2.1 Generating the Datasets
    7.2.2 iTunes Search API Dataset
      7.2.2.1 Schema Generation Time Comparison
      7.2.2.2 Schema Validation Time Comparison
    7.2.3 Open Movie Database (OMDb) API Dataset
      7.2.3.1 Schema Generation Time Comparison
      7.2.3.2 Schema Validation Time Comparison
    7.2.4 Spotify Search API Dataset
      7.2.4.1 Schema Generation Time Comparison
      7.2.4.2 Schema Validation Time Comparison
  7.3 Runtime Complexity Analysis

8 Conclusion
  8.1 Future Work
    8.1.1 Validating Identification Keys
    8.1.2 Dynamic Similarity Threshold
    8.1.3 Expanded API Study

Bibliography

LIST OF TABLES

5.1 A table showing the three scoring criteria being applied to 9 different groupings. The similarity threshold for this example is 0.85.
6.1 A list of the groups in the best grouping, along with their corresponding splitKeys list.
7.1 A list of split operations for a single group containing all objects.
7.2 A list of different split operations for one of the groups generated by the wrapperType split.
7.3 The API Request Endpoint(s) used for each dataset.
7.4 An overview of the number of requests made to each API, and the number of objects returned for each request.

LIST OF FIGURES

1.1 An illustration comparing an application using a schema and an application not using a schema.
1.2 Illustrating the issue with schema design when all elements are assumed to have the same structure.
2.1 A valid nesting of HTML tags.
2.2 An invalid nesting of HTML tags.
2.3 An element containing attributes.
2.4 A sample XML file containing information pertaining to a library.
2.5 An excerpt of figure 2.4 showing one of the book elements.
2.6 A modified figure 2.5 showing id as an element rather than an attribute.
2.7 An excerpt of figure 2.4 showing the books element and its content.
2.8 A valid XML element showing how the content can contain both text and elements.
2.9 A comparison between a traditional web application and an AJAX web application.
2.10 A sample JSON document inspired by the iTunes Search API [1].
2.11 Part of the schema document generated by [2] for the JSON document in figure 2.10. This schema shows how the results array is being interpreted. The array is called Result as an auto-generated title referenced in another part of the schema that is not shown.
2.12 Data from Google Trends [3] showing the relative interest of JSON and XML.
2.13 Data from Stack Overflow Trends [4] showing the percentage of questions involving JSON or XML for each month.
3.1 Three JSON arrays illustrating the difference between homogeneous and heterogeneous data structures.
3.2 Comparison between validating an array through a linear search approach versus validating an array using a schema decision tree.
5.1 JSON document in text-based format.
5.2 Tree representation of the JSON document in figure 5.1.
5.3 Path representation of the JSON document in figure 5.1.
5.4 JSON document containing an array.
5.5 Three path notations for the JSON document in figure 5.4.
6.1 Visualization of the alternating layers of GroupNodes and SplitNodes.
6.2 Constructing a schema decision tree for the groups in table 6.1.
6.3 The section of the resulting JSON schema document containing the definitions for the different sub-schemas tailored to the different types of structure.
6.4 Section of the resulting JSON schema document containing a decision node.
7.1 Schema decision tree generated from the iTunes Search API dataset.
7.2 Comparing the computation time of the iTunes Search API dataset for various similarity thresholds (T) and various input array sizes.
7.3 Comparing the validation time of the iTunes Search API dataset for three validation methods.
7.4 Schema decision tree generated from the Open Movie Database dataset.
7.5 Comparing the computation time of the Open Movie Database dataset for various similarity thresholds (T) and various input data sizes.
7.6 Comparing the validation time of the Open Movie Database API dataset for three validation methods.
7.7 Schema decision tree generated from the Spotify Search API dataset.
7.8 Comparing the computation time of the Spotify Search API dataset for various similarity thresholds (T) and various input data sizes.
7.9 Comparing the validation time of the Spotify Search API dataset for three validation methods.

LIST OF ACRONYMS

API   Application Programming Interface
AJAX  Asynchronous JavaScript and XML
AWS   Amazon Web Services
XML   Extensible Markup Language
ECMA  European Computer Manufacturers Association
HTML  Hypertext Markup Language
ISO   International Organization for Standardization
IETF  Internet Engineering Task Force
JSON  JavaScript Object Notation
OOP   Object Oriented Programming
OLAP  Online Analytical Processing
SGML  Standard Generalized Markup Language
SQL   Structured Query Language
UML   Unified Modelling Language
XSD   XML Schema Definition

Acknowledgements

I would first like to express my sincere gratitude to my supervisor, Dr. Alex Aravind, for his support and guidance throughout both my bachelor's and master's degrees. The opportunities and advice he has provided me are invaluable and will continue to benefit me throughout life. I would also like to thank my committee members, Dr. Baljeet Malhotra and Dr. Alia Hemieh, for their advice and insights. In addition, I would like to thank my friends, Daniel O'Reilly and Rory McClenagan, for helping edit my thesis, and Conan Veitch for always pushing me to pursue new things. Finally, I would like to thank my family for their support and motivation throughout my education.

Chapter 1

Introduction and Motivation

The ever-increasing connectedness of the world has resulted in a vast amount of machine-to-machine communication. Many applications now involve user authentication and the dynamic updating of content; internet-of-things devices constantly upload collected data to a central server, and web APIs have emerged as a way for developers to integrate third-party services into their applications. Fundamentally, the communication and transmission of information is an essential component of modern software development. With this, semi-structured data has emerged as the prominent method for structuring the messages used for this communication, largely due to its flexibility and extensibility. This flexibility comes as a result of integrating the structural metadata into the content of the message itself. In this sense, semi-structured data is often characterized as self-describing.
Of this type of data, the JavaScript Object Notation (JSON) has become one of the leading formats. This format is primarily based on two data structures: sets of key-value pairs called JSON objects and ordered lists of values called JSON arrays. The true flexibility of the format comes from nesting these structures inside each other, resulting in a tree-like structure.

Because messages (i.e. JSON documents) can be received from a multitude of different sources (oftentimes sources that you do not even have control over), the need has arisen to first validate the content and structure of a document before processing it. Incorrectly formatted documents will, at best, be detected and handled accordingly; at worst, they can cause the entire system to crash or be unknowingly processed and cause logical consistency errors in the application or database. To address this problem, the idea of a schema document emerged as a way of describing what is considered a valid JSON document. Incoming documents are then compared against this schema. If their content and structure match that defined in it, they are passed into the rest of the program to be processed; if not, they are rejected.

Figure 1.1 illustrates the problem of processing incoming documents with and without using a schema. Here, a document's structure is visualised through its shape, i.e. two circles have the same structure, whereas a circle and a triangle have different structures. In figure 1.1a, any document received is processed by the application. Issues may arise if the application is expecting a circle structure but instead receives an ill-structured triangle document. Comparing this to figure 1.1b, documents are first validated against a schema containing information about what is an acceptable structure. Documents satisfying these constraints are passed on to the application.

As creating these schemas is usually tedious and error-prone for humans to do manually, programs have been created to automate the process (examples being [2] and [5]). These programs work by accepting one or more JSON documents as input and generating a schema based on the existing structures found in them. While these programs do generate valid schemas, they run into ambiguity issues when documents contain an array of JSON objects with differing structures.

Figure 1.1: An illustration comparing an application using a schema and an application not using a schema. (a) An application expecting messages to have a circle structure. (b) An application that preemptively filters out non-circle structures.

Consider the array of shapes in figure 1.2a, where each element has one of two possible structure types: square or diamond. Comparing the two types, some parts overlap (the middle portion), whereas some parts only exist in specific types (the points of the diamond or the corners of the square). If a schema for this array were to be generated using current automatic tools, the schema in figure 1.2b would likely be the result. Here, the portion of the structure that is common to all shapes is specified as required (denoted in black), with all other parts being specified as optional (denoted in grey). The reason this schema in particular is generated is that current automated tools work under the assumption that all array elements should have the same structure, and any deviations from that structure should be considered optional.
These tools do not consider the possibility that elements in an array may have different structures that are not related. While this assumption always results in a valid schema, it also leads to unintended side effects in what other structures the schema accepts. For example, the shapes outlined in figure 1.2c also satisfy the schema in figure 1.2b. They all contain the portion of the schema that is specified as required, with any other components being part of the optional portion. The main issue here is that the schema in figure 1.2b does not specify the relationship between the different optional components—resulting in ambiguity. A better option would be to generate two schemas: one schema for diamonds, and one schema for squares. An element in an array would then be valid only if it satisfies one of the possible schemas.

Figure 1.2: Illustrating the issue with schema design when all elements are assumed to have the same structure. (a) Array of elements consisting of diamond/square structures. (b) Generated schema for the array in (a). (c) Shapes that also satisfy the ambiguous schema in (b).

One potential solution to this problem could be to apply cluster analysis to the elements of an array. The result of this clustering process is a set of groups, where each group contains elements with related structures. A schema can then be generated for each group. This problem has previously been explored in the literature; however, previous work has largely only looked at the scenario of clustering a set of arbitrary JSON documents. Two downsides exist with this solution. First, the number of groups generated may not match the number of different types of structures that exist. If two types of structures are very similar to each other, they may get grouped together rather than being placed into their own groups. Second, an increase in the time it takes to validate a document may occur. This increase is due to now potentially having to compare each array element against multiple schemas to determine which one it satisfies. For example, if there are 100 elements in an array and 10 possible types of structures, up to 1000 schema validation attempts may occur.

This thesis introduces a different approach to clustering the elements of a JSON array. We base this approach on the observation that a JSON array is designed to be processed as a single entity, i.e. a program takes a JSON array and iterates over each element, processing it. When all array elements have the same structure, this is a relatively simple program, as each element is processed in exactly the same way. When elements have different structures, however, the program becomes much more complex. Different structures may now require unique ways of being processed. A common method of specifying an element's structure is to include a set of keys within each element whose values identify the type of structure. We call such keys identification keys. A program can then use these keys to determine how it should process each element. An example of such a key might be a version number, where the key's value can be used to determine which version of a structure an element has. Using this idea, we cluster the elements of a JSON array by choosing a key common to all the elements and partitioning the elements into a set of groups based on the value of that key. The main advantage of this approach over previous clustering methods is that future elements can be assigned to a cluster in constant time.
To do so, a lookup is done on an element's identification keys and, based on their values, the element is assigned to one of the clusters. In the context of schemas, this method allows each element to be validated against a single schema versus potentially being compared against multiple schemas. To determine which schema an element should be compared against, we introduce the concept of a Schema Decision Tree based on decision-tree classifiers. With this, each element starts at the root node of the tree and moves to a child node based on the value of the key specified at the current node. This process repeats until the element eventually reaches a leaf node containing the schema it should be compared against.

To construct this schema decision tree, this thesis designs and implements an algorithm that takes a JSON array and recursively partitions the elements based on an operation we introduce called splitting. This operation takes a key that is common to all elements and partitions them based on what value they have for that key; all elements with the same value are placed in the same group, and elements with different values are placed in different groups. As more than one key may be common to all the elements, we measure how good a split operation is by measuring the structural similarity of the resulting groups. This is done through a calculation based on the Jaccard Similarity Index previously discussed in the literature. Furthermore, this algorithm works recursively, as groups generated from one split operation may have to be partitioned again. This is to account for the existence of multiple identification keys.

1.2 Applications

The main motivation for the work done in this thesis is to improve schema generation for JSON arrays containing JSON objects with differing structures. Thus, the main application is any problem already involving schema documents. The next two subsections outline two specific applications that would benefit from better schema generation.

1.2.1 NoSQL Databases

One issue with JSON (and semi-structured data in general) is that it cannot easily be stored in traditional relational databases. The reason for this is twofold. First, its irregular structure means that different documents can contain different keys. To account for this, a relational table column would have to be created for each possible key. Null values would then be used for documents not having that key. Approaching database design this way is not recommended, however [6]. Second, JSON's tree-like structure does not match the flat table structure found in the relational model. Existing techniques for storing JSON in relational databases involve either storing the entire document as a single string in one of the columns or breaking up the tree-like structure and storing each layer separately along with new fields to keep track of parent-child relationships [7].

To address this issue, another kind of database—commonly known as a NoSQL database—quickly gained in popularity. The defining feature of this type of database is that it can easily store semi-structured data, as well as provide features tailored to interacting with and managing it. Of this kind of database, MongoDB is one of the most prominent. It is built around the concept of storing semi-structured data in documents contained in collections [8, 9].
Compared to relational databases, this is analogous to storing rows in tables; the major difference is that a collection can contain documents with differing structures, whereas a relational table requires a rigid structure with a set number of specific values. For this reason, collections in MongoDB are commonly referred to as having a dynamic schema [9].

While this does give great flexibility in how data can be stored, it also leads to challenges when designing programs and queries to interact with a collection. This is because the structure of all the documents in the collection has to be considered. For example, if a document was entered into a collection with missing fields, queries designed to extract documents having those fields would pass over this document, simply because the query requirements did not account for this ill-structured document. To help with this, MongoDB has a feature that allows a collection to have a schema defined for it. This schema acts like a filter on the collection and only allows documents whose structure meets certain criteria to be entered into it. Improving the schema generation process would be beneficial in preemptively detecting ill-structured documents.

1.2.2 Similarity Calculations

Another application for better schema generation can be found in the problem of calculating the similarity between JSON documents. The goal of a similarity calculation is to take some number of documents and generate a numerical representation based on some notion of similarity. One common way of doing this is to create a mapping between related keys found in the documents. The larger the mapping, the more similar the documents are. A challenge with this approach is how to handle the scenario of two JSON arrays, as the number of elements in each array may differ. Not taking this into account can greatly throw off any calculations, even when all the elements in the array have the same structure. To solve this problem, previous literature first generated a schema for each of the documents as a way of reducing the elements of an array. Similarity calculations were then applied to the schemas instead of the documents themselves. Improved schema generation could be applied here to better capture the different types of structures found in each array.

1.3 Contributions

The work in this thesis has four main contributions.

1. We give a comprehensive overview of XML and JSON. This includes looking at the general syntax, as well as the history and motivation behind their creation.
2. We define a new problem, related to cluster analysis, based on the idea of partitioning elements by identification keys.
3. We introduce the concept of a Schema Decision Tree to help validate JSON arrays, and we show how this method of partitioning can be used to construct this tree.
4. We design and implement an algorithm capable of determining the best partition from which to construct the schema decision tree.

1.4 Thesis Layout

The rest of this thesis is organized as follows. To begin, Chapter 2 overviews what semi-structured data is and outlines the syntaxes of XML and JSON—the two major formats. The history behind each of these is also discussed, along with issues present in XML that led to the creation and adoption of JSON. This chapter concludes with a comparison between the popularity of the two formats. Chapter 3 then characterizes and defines the problem addressed in this thesis—namely, better schema generation for heterogeneous JSON arrays.
We explain the main observation that this work is based on and introduce the concept of a schema decision tree. Based on the problem defined in this chapter, Chapter 4 then examines existing literature involving both JSON and XML. Next, Chapters 5 and 6 look at the solution we have created. Chapter 5 first discusses the background information that the solution is based on. This includes how we measure the similarity between JSON objects, as well as a set of criteria for what is considered the best partition. Chapter 6 then presents an algorithm to compute this partition. To evaluate our algorithm, Chapter 7 first walks through the algorithm being applied to a simplified dataset. We then analyze the time it takes the algorithm to run for varying array sizes, as well as how long it takes arrays of varying sizes to be validated against the resulting schema decision tree. An analysis of the runtime complexity is also performed. Finally, this thesis is concluded in Chapter 8 along with a discussion of possible future directions for this research.

Chapter 2

Overview of Semi-structured Data

A major component of nearly all software applications involves some variation of storing, manipulating, or transmitting data. Take the example of a banking web application that makes a request to a server to increase the balance of an account. In just a single application, the same data is likely being modelled in three different ways. First, there is the web application itself within the user's web browser, which has an internal representation of the data it needs. For it to then send a request to the server, the application has to structure the relevant data in a format that both the client and server will understand. This involves taking the internally stored data and representing it in a format that can be serialized for platform independence (e.g. a memory pointer cannot be passed in a request, since the receiving end does not have the data stored at the same memory location). The server then processes the request by manipulating the data stored in a structured database before sending a response back.

Data is so important to computer science that entire subfields are dedicated to studying it in different ways. Database management looks at efficient ways of organizing, storing, and retrieving data [6, 10]. Data science looks at methods of extracting patterns and information from potentially unorganized or messy data [11]. Big data looks at how to process and analyse large data sets [12]. Parallel systems look at methods for processing data simultaneously [13]. Fundamentally, computer science is a discipline based on data.

2.1 Types of Data

The category of data can be divided into three main types: structured, unstructured, and semi-structured [14, 15]. Structured data means that the data follows a specific rigid format that has been predetermined [16]. Having this fixed structure results in data that is well suited to being stored in a database [14]. In a relational database, for example, each row of a table consists of an n-tuple where the ith element of the tuple has one of the values specified by the ith column of the relation [6, 10]. This row has a rigid structure in the sense that it has exactly n elements arranged in a specific order. Swapping two elements of the row changes their meaning, as each column of the relation also has a specific real-world context behind what it represents.
As all values for a column follow the same format, the metadata associated with it can be stored a single time outside the table itself. This is in contrast to storing a copy of it in each row individually. In the context of SQL, this is done through a data definition language that defines the columns of a table, its connection to other tables, constraints, data types, etc. [6].

The second type of data is unstructured. This type of data differs from structured data in that there is no predefined structure dictating what the data consists of or how it is arranged [12]. Up to an estimated 80% of all data falls under the unstructured category [17] due to the prevalence of online media formats such as video, text, photo, and audio. While these formats all have structure dictating how they encode data in binary, there is no structure regarding what their contents actually consist of. To illustrate this, consider the example of a plain text document [14]. Given a paragraph, specific information may be contained within a sentence; however, there is nothing outlining where the sentence occurs in the paragraph or what words were used in it. The same piece of information may even be represented in two different sentences using completely different words.

Finally, semi-structured data falls in between structured and unstructured data. While structured data showcases the extreme of rigid guidelines and unstructured data showcases the extreme of no guidelines, semi-structured data provides structure through a more flexible format [14, 15]. It is because of this flexibility that Serge Abiteboul refers to it as irregular data [14]. The flexibility of semi-structured data comes from including the structural information in each instance itself. This way, two instances can have differing structures. As long as each instance specifies where the common data is located, they can be treated the same. For this reason, semi-structured data is commonly referred to as self-documenting [15, 14, 16]; the information needed to interact with the data is contained within the data itself. This differs from structured data, where the structural information is stored outside each instance of the data.

To illustrate the importance of semi-structured data, consider the scenario of having to integrate data related to bank accounts coming from two different banking sources [14]. Creating a single structured data format that encompasses both sources is a challenging problem. While the general information related to a bank account is likely to be the same between sources (e.g. account owner, account balance), some data will differ. One bank, for example, may include the field time-since-last-accessed, whereas the other bank may not. Creating a single data format to encompass both banking sources leads to two options. Option one is to include all possible fields in the common format and set a field to null for any data instance that does not have it. Conversely, option two is to leave the field out and discard it for data that has it. Neither scenario is ideal, as the two data sources are inherently different. An alternative solution to this problem would be to use semi-structured data. With this, each data source could maintain its original structure without the need to combine them together. Two popular formats for representing semi-structured data are XML and JSON; these are discussed in more detail in the following sections.
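As a concrete illustration of this scenario, the two sources might represent the same account as the following JSON objects (a hypothetical sketch; only the time-since-last-accessed field is taken from the discussion above, and the remaining key names and values are invented for the example):

    Bank A:
    {
      "account-owner": "Alice",
      "account-balance": 2500.75,
      "time-since-last-accessed": 129600
    }

    Bank B:
    {
      "account-owner": "Alice",
      "account-balance": 2500.75
    }

Because each object carries its own structure, a program that only needs the owner and balance can process either record and simply ignore the extra field when it is present.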
This banking example also showcases one of the biggest areas where semi-structured data is used; that is, in situations where data is transmitted between machines over a network [14, 16]. In order for two machines (call them A and B) to communicate, they have to agree on a common format (or "contract") for how messages will be sent between them. Machine A needs to know how to send data that machine B can correctly interpret. Likewise, machine B needs to know what data is included in a message from machine A and how to process it. One way to define this communication could be through a hard contract. With this, the exact structure of a message is set, and both machines know how to construct a message and extract the information from it. An example of such a contract might be that a message consists of 20 bits evenly split between two 10-bit values. Now suppose that machine A wants to add a new field to the message so that it can also send it to other machines. As machine B is expecting a message with a very specific structure, any attempt by machine A to change the structure will result in machine B failing. Thus, machine B would also have to be updated to support the new format even though the new fields do not affect it. If this communication contract had been outlined using semi-structured messages, machine A would be able to include additional fields without affecting machine B. All the information machine B needs to work from would still be included in each message, and it could ignore the parts it does not need.

2.2 XML

One of the first major semi-structured formats to emerge was the Extensible Markup Language (XML). The goal behind this language was to provide a way of arranging data into structures along with annotations about the data [18]. The creators also wanted a general-purpose format that could be applicable to many types of applications. In addition, it should be fast and easy to create concise documents that both humans and machines could understand [18, 19]. Particular emphasis was also put on using it to transmit information over the world wide web [18].

Discussing the origin of XML and why it became popular first requires discussing the history of the web. When Tim Berners-Lee was developing the world wide web around 1989, he needed a way of marking up documents with hypertext (text containing links to other resources that are directly available from the link itself—usually by clicking [20, 21, 22]). In particular, he needed a format that would be independent of a particular computer, as previous hypertext packages were too computer-specific to be used in a global network run on different types of machines [23]. To solve this problem, he created a computer-independent document format called the Hypertext Markup Language (HTML). This format had primarily two functions: (1) a way of specifying hyperlinks in text, and (2) a way of formatting information for visualization purposes. With this format, an application (i.e. a web browser) could be created for any computing system to take an HTML file and display it to the user. Even today, the HTML format is the ubiquitous way of displaying information in a web browser.

When Berners-Lee was in the process of creating HTML, rather than inventing his own format, he chose to base it on a previously established standard called the Standard Generalized Markup Language (SGML) [23]. This was a standard from the International Organization for Standardization (ISO) that outlined a format for defining
SGML allowed for documents to be created that separated the content of the document from annotations about the content [24, 25]). For example, the title of a document serves a different purpose then a paragraph; by including this meta-information in the document itself, whoever/whatever is processing the document can be aware of what information falls in what category and, as a result, act accordingly. The basic structure of an HTML document (and thus an SGML document) consists of a series of elements [25, 23]. For the most part, an element has a structure defined by 1. an initial start-tag in the format of , 2. followed by some arbitrary content, 3. and closed by an end-tag in the format of . Further, the TAG-NAME of the end-tag must be identical to the TAG-NAME of the starttag for the element to be valid. Elements can also be nested inside each other as long as each start-tag has a corresponding end-tag, and that all inner elements are closed before the outer elements end-tag [26, 25, 24]. For example, figure 2.1 is a valid nesting since the inner tag is fully contained within the content of the outer tag. Figure 2.2 on the other hand is not a valid nesting since the inner tag is started within the content of the outer tag but not closed. . . . Content . . . Figure 2.1: A valid nested of HTML tags. . . . Content . . . Figure 2.2: An invalid nesting of HTML tags 16 Additionally, elements can have attributes included inside a start tag for data that is related to the element [26, 25, 24]. For instance, a title tag may have a font size attribute specifying how large the title should be. Figure 2.3 shows an example of an element whose start-tag contains one attribute. . . . Content . . . Figure 2.3: An element containing attributes. One of the main distinctions between HTML and SGML involves what is considered a valid tag name for an element [23]. In SGML, there are no restrictions for a tag’s name except that it has to consist of characters, begin with a character or underscore, and contain no spaces; any name satisfying these rules is considered valid. HTML on the other hand restricts the name of a tag to a small set of specific strings [26, 23]. The reason for this restricted set is that, in HTML, a tags name has external influence behind how it is visualized in a web browser. For example, the < h1 > tag represented a large heading; < p > represented a paragraph of text; < a > represented an anchor containing a hyperlink. HTML had to restrict the set of valid tag names in its specification because the companies in charge of the web browsers had to program this external information into them. As the world wide web became more popular over time, a need started to arise for a better format for structuring data. HTML was too limited by its restricted set of tags to account for all the different data scenarios [23]. Further, having each HTML tag associated with some aspect of visualization was not needed for the purpose of strictly structuring data. SGML, on the other hand, was too complex in what was considered valid syntax [19, 27, 28]. As a result, many software implementations only ever accounted for a subset of SGMLs entirety. This led to many implementations not being fully compatible with each other [27]. In essence, both 17 HTML and SGML were not suited to being general purpose data formats. To find a solution, a group known as the SGML Editorial Review Board (later known as the XML Working Group) was formed in mid 1996 with the aim of creating a “slimmed-down SGML” [19, 27, 18]. 
This group consisted of members such as Jon Bosak from Sun Microsystems, Tim Bray of Textuality and Netscape, Jean Paoli of Microsoft, Steve DeRose, and other prominent members who either were experts on SGML or came from various companies in the field [18]. Many of these individuals had already been working on, and even proposed, more concise versions of SGML [19, 28]; however, these were confined to more specific applications and never resulted in widespread adoption or the creation of new standards. On February 10th, 1998 (about a year and a half after the group's formation), the first version of the XML specification was released [18] and quickly gained in popularity.

Like HTML and SGML, an XML document for the most part consists of a series of elements, each containing a start-tag and corresponding end-tag. Figure 2.4 shows an example of a typical XML file. Here, the library element is known as the root element since everything else is nested inside it [18]. The content of this element then consists of the location and books elements. These in turn each have more inner elements. Further, each book element contains an attribute called id. An important point to also note is that an element can have multiple inner elements with the same name; in the case of books, three elements having the name book are nested within it.

    <library>
      <location>
        <city>Vancouver</city>
        <address>123 Park Way</address>
      </location>
      <books>
        <book id="140">
          <title>XML in a Nutshell</title>
          <availability>true</availability>
        </book>
        <book id="141">
          <title>The SGML FAQ Book</title>
          <availability>false</availability>
        </book>
        <book id="142">
          <title>The SGML Handbook</title>
          <availability>true</availability>
        </book>
      </books>
    </library>

Figure 2.4: A sample XML file containing information pertaining to a library.

2.3 Moving Beyond XML

From its initial 1.0 specification release, XML would go on to become the dominant semi-structured data format in the software community [19]. As a result, an entire ecosystem of tools has been developed around it. Major tools include XPath (a query language for finding data within an XML file [29]), XML Namespaces (a way of preventing collisions between element names in different XML files [30]), and XML Schema (a way of describing the structure of an XML document for validation purposes [31]).

Regardless of its popularity, not everyone in the community felt that XML was the best format for structuring data. One of the main issues with XML stems from the fact that it is first and foremost a markup language for document formatting rather than a data format. This distinction can be found in many of the core features of XML. Three major ones are discussed next.

2.3.1 Element vs. Attribute

In the context of markup languages, having a distinction between an element and an attribute is logical, as attributes can be thought of as properties of the element. In HTML, for example, an element defines the general way the content will be displayed; attributes can then be added to the element to influence things like font color, font size, margins, etc. In the context of only structuring data, this distinction is not nearly as important, as for the most part, elements only serve to arrange data into logical units. To illustrate this difference, consider figures 2.5 and 2.6, where id is considered an attribute in one and an element in the other.

    <book id="142">
      <title>The SGML Handbook</title>
      <availability>true</availability>
    </book>

Figure 2.5: An excerpt of figure 2.4 showing one of the book elements.

    <book>
      <id>142</id>
      <title>The SGML Handbook</title>
      <availability>true</availability>
    </book>

Figure 2.6: A modified figure 2.5 showing id as an element rather than an attribute.

Both cases make logical sense for modelling the data. Because of this, there is much debate in the community over what data should be placed in an attribute versus an element. One suggestion is that elements should be "the essential material that is being expressed or communicated in the XML" [32]; however, there are no hard rules, and much of the choice is left up to the developer. For example, the title and availability elements could also be modelled as attributes of the book elements. However, doing this would decrease the readability of the document, as all information is now being stored in a single element.

2.3.2 Support for Arrays

The second issue with XML primarily being a markup language is that there are no native implementations of common data structures such as arrays. Figure 2.7 shows how a makeshift array can be constructed in XML. An outer element is created to act like a wrapper for the array elements. Each inner element then represents one of the array values. While this method does allow an array to be encoded, it also requires any people or tools using the document to have this background information. To illustrate this point, suppose the books element only had a single inner book element. By just looking at the data, it would be impossible to determine whether books was an array or not without knowing in advance.

    <books>
      <book id="140">
        <title>XML in a Nutshell</title>
        <availability>true</availability>
      </book>
      <book id="141">
        <title>The SGML FAQ Book</title>
        <availability>false</availability>
      </book>
      <book id="142">
        <title>The SGML Handbook</title>
        <availability>true</availability>
      </book>
    </books>

Figure 2.7: An excerpt of figure 2.4 showing the books element and its content.

2.3.3 Restricting an Element's Content

Finally, XML places no restrictions on what the content of an element can contain.
For the most part, an element's content consists of either a string or a sequence of inner elements, as shown in figure 2.4; however, this is not a restriction the language places on the data, and as a result, a mixture of both can occur and is considered valid. Figure 2.8 shows a valid XML element that contains a mixture of text and elements as its content. Again, this syntax is useful in a markup language to annotate text that appears mid-sentence. As a data format, though, it is not needed.

    <location>Located in <city>Vancouver</city> <country>Canada</country></location>

Figure 2.8: A valid XML element showing how the content can contain both text and elements.

2.4 JSON

In 2001, Douglas Crockford was working at a company he co-founded called State Software. They were attempting to build a web framework for creating dynamic web pages—an idea that was just starting to emerge in the software community at the time. Part of this project involved needing a way of structuring data when transmitting it between the web browser and a server [33]. When considering XML, however, he found it to be an inefficient format because it lacked the data structures common to nearly all programming languages [34, 35]. Instead, he chose to create a format based on the notation JavaScript uses for its data structures. In this sense, the JSON format is a subset of the ECMA-262 specification that JavaScript itself is based on [36, 33, 34, 37, 38]. One of the main benefits of this approach is that it would make parsing JSON very easy on the browser side. When State Software approached companies about their framework, the companies were reluctant to use the format because there was no specification attached to it [34, 35, 33]; thus, Crockford gave it the name JavaScript Object Notation (JSON) and created a simple website (json.org). This website gave a description of what JSON was and provided a simple grammar outlining the format.

Crockford claims to have "discovered" JSON rather than "invented" it, as the main ideas had already existed in various forms before him. For instance, he knew of Netscape using similar ideas with different syntax as early as 1996 [35]. Crockford's contributions mainly came in formalizing a specification for it and giving it a name and website. Besides that, however, he did not do any promotion of the format, such as trying to get it adopted at other companies. JSON's popularity mainly grew as people came across it and used it for its simplicity. Because of its small grammar, JSON parsers were quickly created in all the main programming languages. Further, its two main data structures (objects and arrays) exist in some form in nearly every programming language; this made it very easy to encode and decode information [37]. However, Crockford believes the biggest contributing factor behind its popularity was likely its use in AJAX [35, 33].

2.4.1 AJAX

During the early 2000s, a series of new tools and techniques were starting to emerge that allowed developers to create more dynamic web applications—similar to what State Software was doing in 2001. The collective name for this new approach to web development would become known as Asynchronous JavaScript and XML (AJAX) as a result of a blog post by Jesse James Garrett discussing the new trend [39]. The goal behind AJAX was to improve the user experience by having the web page dynamically update with new information received from the server. This was in contrast to the previous approach, where an entirely new web page would be requested for each UI update [39, 40].
For example, Gmail could now refresh the user's inbox seamlessly while the user was reading another email; this gave a better experience compared to the user having to manually refresh it and wait for a new web page to be displayed. The basic idea for achieving this was to create a middleman that would act as a liaison between the web browser and the server. Figure 2.9 shows the difference between the traditional approach (figure 2.9a) and the AJAX approach (figure 2.9b). In the traditional approach, whenever the web browser wants new information, it sends a request to the server. The server then returns an entirely new web page and possibly all of its attached resources (CSS, JavaScript, images, etc.) if they were not cached. Comparing this to the AJAX approach, when the browser makes a request for new information, it instead calls a JavaScript function that initiates an asynchronous request. The request being asynchronous is what allows the web page to remain active while the request is being processed. Since the web page will be dynamically updated, the server now only has to return the required data using a format like XML or JSON. The asynchronous request then receives the data and manually updates the existing web page's HTML code.

Figure 2.9: A comparison between a traditional web application and an AJAX web application. (a) Traditional web application: the web browser sends a request, and the server returns a new HTML file for the browser. (b) AJAX web application: the web browser calls a JavaScript function, the AJAX layer requests the data from the server, the server returns data in XML or JSON, and the AJAX layer updates the existing HTML page.

While AJAX was originally used with XML (the 'X' in AJAX even standing for XML), JSON has quickly gained in popularity as a replacement. A large part of this is due to the synergy JSON has with JavaScript and its shared notation for data structures [33].

2.4.2 JSON Format

JSON is a semi-structured data-interchange format, primarily used for moving information between different systems. Like XML, JSON is a platform-independent format. This means that different systems can use it regardless of what operating system is running, what software is interacting with it, and what programming language the software was written in [35, 37, 41, 38]. Fundamentally, the format is based on two constructs [37].

1. JSON Object: A set of key-value pairs enclosed within two curly braces. The order of key-value pairs does not matter. Further, each key must be unique among all other keys in the object, and a value is accessed by performing a lookup on the corresponding key.

2. JSON Array: An array of values enclosed within two square brackets. Compared to the JSON object, the order of elements does matter, and a value is accessed through a lookup on its array index.

While a key is limited to a character string, a value can be any of the available data types. This includes the simple data types listed below [37, 38] in addition to the two complex data types (JSON objects and JSON arrays) discussed above.

1. Integer: Signed decimal number which may include scientific notation.
2. String: A sequence of zero or more Unicode characters surrounded by double quotes.
3. Boolean: Either true or false.
4. Null: The empty value designated by null.

This ability to nest objects and arrays within each other is what gives JSON its tree-like structure so common to semi-structured data. A JSON document then consists of either a single JSON object or JSON array as the root element.
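Because these two constructs map directly onto the dictionaries and lists built into most languages, encoding and decoding a document typically takes a single library call. The short Python sketch below (an illustration written for this discussion, using only the standard json module and a made-up document) parses a JSON string into native data structures and serializes it back:

    import json

    # A small JSON document as a string: an object containing an array of objects.
    text = '{"resultCount": 2, "results": [{"kind": "song"}, {"kind": "podcast"}]}'

    doc = json.loads(text)            # JSON object -> dict, JSON array -> list
    print(doc["results"][0]["kind"])  # prints: song
    print(json.dumps(doc))            # serialize the structures back to a JSON string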
Figure 2.10 contains an example of a typical JSON document. Here, date, successful, and resultCount are examples of key-value pairs with simple data types; metadata is a key-value pair whose value is another JSON object, and results is an array containing two JSON objects as elements. It is important to note that arrays in JSON are heterogeneous, meaning that the elements of an array do not all need to have the same data type [38].

    {
      "metadata": {
        "date": "11/17/19",
        "successful": true,
        "resultCount": 2
      },
      "results": [
        {
          "kind": "song",
          "collectionId": 585972750,
          "isStreamable": true,
          "trackTimeMillis": 241562
        },
        {
          "kind": "podcast",
          "collectionId": 394775318,
          "genres": ["Design", "Podcasts"]
        }
      ]
    }

Figure 2.10: A sample JSON document inspired by the iTunes Search API [1].

The contents of the JSON document in figure 2.10 are inspired by the results returned from the iTunes Search API [1]—a web API that returns information on products available in the iTunes store. The main component of the API's response is an array of JSON objects, where each object of the array represents a different product. As iTunes sells more than one type of product (examples being songs, movies, podcasts, etc.), the structure of each object depends on what type of product it is trying to model. For example, objects representing songs have a different structure compared to objects representing podcasts. (Actual results from the API were not included due to their sizeable length taking up multiple pages; figure 2.10 is instead inspired by the results from the API to showcase the problem tackled by the thesis in a real-world scenario.)

2.5 Schema Documents

While the main idea behind semi-structured data formats is that their structure is flexible in what information can be included and how it is arranged, a need has still arisen to put restrictions on what qualifies as a valid structure within a certain application. For example, consider the scenario where a document is passed around to different applications in a distributed system, where each application is able to modify the document before passing it on to the next application. Once all the modifications have been completed, it would be very beneficial to have a way of verifying that all the changes have still resulted in a document that the program knows how to process. The document may be syntactically valid, but its structure may not be what the program processing it is expecting. If the program were to process the document, it could potentially produce logical consistency errors or even crash. To combat this, the idea of a schema document emerged as a way of specifying what is considered a valid structure for a specific application. Other semi-structured documents can then be compared against this schema to see if their structure and content match that defined in it. If they do match, the document is considered valid; otherwise, it is considered invalid [31, 42].

2.5.1 JSON Schema

The JSON Schema Specification is a working specification that outlines a format for creating schema documents based on the JSON format. Each schema document describes the structure and content that it considers valid. Other JSON documents can then be compared against the schema to see if their structure and content match. If they do, the document is considered valid.
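In practice, this comparison is usually a single call to a validation library. As a minimal illustration (a sketch written for this discussion rather than code from the thesis, assuming the third-party Python package jsonschema is installed), the following checks one object from figure 2.10 against a small schema in the style of figure 2.11, shown later:

    from jsonschema import ValidationError, validate  # third-party package: jsonschema

    schema = {
        "type": "object",
        "properties": {
            "kind": {"type": "string"},
            "collectionId": {"type": "integer"},
        },
        "required": ["collectionId", "kind"],
    }

    element = {"kind": "song", "collectionId": 585972750}

    try:
        validate(instance=element, schema=schema)  # raises ValidationError if it does not conform
        print("valid")
    except ValidationError as err:
        print("invalid:", err.message)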
The JSON Schema Organization is the organization behind the widely accepted implementation [43]. As of September 2019, they are on their ninth draft, with the goal of being standardized by the Internet Engineering Task Force (IETF) [42].

Figure 2.11 shows part of the schema generated for the JSON document in figure 2.10—in particular, for the objects in the results array. What this schema outlines is that an array object is only valid if all of the keys within it match one of the five specified. If a key is present in the object, then the data type of its corresponding value must also match the data type listed in the schema. Finally, every object has to have the keys listed in the required section (namely collectionId and kind), with the others being optional. If an object breaks any of these rules, then the entire JSON document is rejected, as its structure does not conform to the schema. This schema was generated using the Quicktype automated program [2]. This program is listed as one of the generators on the JSON Schema Organization's website [42].

...
"Result": {
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "kind": {
      "type": "string"
    },
    "collectionId": {
      "type": "integer"
    },
    "isStreamable": {
      "type": "boolean"
    },
    "trackTimeMillis": {
      "type": "integer"
    },
    "genres": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "collectionId",
    "kind"
  ],
  "title": "Result"
}
...

Figure 2.11: Part of the schema document generated by [2] for the JSON document in figure 2.10. This schema shows how the results array is getting interpreted. The array is called Result as an auto-generated title referenced in another part of the schema that is not shown.

2.6 Popularity of JSON

Since its initial release in 2001, JSON has slowly become one of the two dominant formats (the other being XML) for semi-structured data. Further, JSON has likely overtaken XML as the most popular format based on internet search data.

Figure 2.12: Data from Google Trends [3] showing the relative interest of JSON and XML by month (relative interest from 0 to 100, 2004 to 2020).

Figure 2.13: Data from Stack Overflow Trends [4] showing the percentage of questions involving JSON or XML for each month (2010 to 2020).

Figure 2.12 shows the popularity of the JSON and XML search terms on Google Search since 2004. A score of 100 represents the peak popularity of either term, with all other points being in proportion to it. Likewise, figure 2.13 shows the percentage of questions posted to Stack Overflow involving the JSON or XML question tags. In general, what these figures show is that JSON has slowly been gaining in popularity since its creation, while XML has been decreasing in popularity. As the data does not go back before the creation of JSON, conclusive results cannot be drawn from it. XML's large decrease in popularity may have been caused by influences not related to JSON. For example, in the early 2000s, XML could have seen exceptionally high popularity due to its novelty and the state of the world wide web.
The sharp decrease in figure 2.12 could then be a result of new trends emerging and shifting interests in the development community. Support for this interpretation also comes from comparing the relatively small interest in JSON in the late 2010s with the interest in XML in the early 2000s; even taking the combination of XML and JSON at the present date does not match the interest in XML at its peak.

Chapter 3

Problem Characterization

In the shape motivation presented in Chapter 1, the main issue was that a single schema was generated for unrelated types of shapes. Consequently, this schema was ambiguous as it also accepted any combination of the different types. If we were to instead treat each shape type as a JSON object with a particular structure, we see that the same problem arises. Existing JSON schema generation tools work off the assumption that all objects in an array have the same structure, and thus, generate a single schema by merging together the structures of all the objects. This process also results in an ambiguous schema, as information on the individual structures is lost.

For example, in the JSON document of figure 2.10, the results array has two objects with different structures. When a schema was generated using the Quicktype program [2], the two structures got merged together. As a result, objects having the keys kind, collectionId, isStreamable, and genres are also considered valid, regardless of the fact that they consist of a combination of both structures. Even worse is the schema generator in [5],¹ which generates a schema for the entire JSON array based only on the object in the first position; every other object—regardless of whether it has a different structure—is ignored.

¹ Results are not shown in the thesis due to their length.

3.1 JSON Arrays and Data Types

The primary issue with existing schema generation methods comes as a result of treating an array of JSON objects as a homogeneous data structure rather than a heterogeneous data structure. A homogeneous data structure is a data structure that contains data of the same type (e.g. arrays in Java or Golang) [44, 45]. In contrast, heterogeneous data structures can contain data with different types (e.g. lists in Python or JavaScript) [44, 45].

(a) Heterogeneous array:
[
  5,
  [1, 2, 3],
  "array",
  true,
  {
    "id": 123,
    "on": true
  }
]

(b) Homogeneous array:
[
  "XML", "JSON", "SGML", "Java",
  "Python", "C", "C++", "Go"
]

(c) Array of unknown type:
[
  {
    "name": "Bob",
    "age": 35
  },
  {
    "id": "123",
    "name": "Jim"
  }
]

Figure 3.1: Three JSON arrays illustrating the difference between homogeneous and heterogeneous data structures.

Figure 3.1 shows three examples of a JSON array. The first is a heterogeneous array, as it includes a combination of integers, strings, arrays, and objects; the second is a homogeneous array containing only strings; and the third is an array of two objects. This third array could fall into either the homogeneous or heterogeneous category depending on what is considered the data type of a JSON object.

One way to view the data type of a JSON object is to assume that all objects have the same type; that is, the type of a JSON object is a JSON object regardless of the contents inside of the object. When viewed this way, the array in figure 3.1c is a homogeneous data structure. Constructing a schema for the general JSON object data type then involves taking the union of all possible keys in the objects.
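A minimal sketch of this union-of-keys view, assuming the objects are represented as Python dicts (the merge below records each key's observed types and is only a simplified stand-in for a full schema generator):

    # Merge an array of JSON objects into one schema-like summary by taking the
    # union of their keys and remembering which data types were seen for each key.
    def merge_keys(objects):
        merged = {}
        for obj in objects:
            for key, value in obj.items():
                merged.setdefault(key, set()).add(type(value).__name__)
        return merged

    array = [
        {"name": "Bob", "age": 35},   # the two objects from figure 3.1c
        {"id": "123", "name": "Jim"},
    ]
    print(merge_keys(array))
    # {'name': {'str'}, 'age': {'int'}, 'id': {'str'}}
    # An object mixing all three keys would also satisfy a schema built this way.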
This is how most schema generation tools work due to the fact that it always results in a valid schema. If an array element is a JSON object, it is valid only if each key in the object matches one of the keys in any of the objects used to generate the schema. The issue with this method is that it is possible to have two JSON objects with no relation to each other. In this scenario, it instead makes sense to treat JSON objects with different structures as if they were different data types. To illustrate this scenario, consider two JSON objects: one having the keys city, province, and country, and the other having the keys title, author, and publisher. It is obvious that the two objects represent different entities, namely, a location and a book. Thus, it would be beneficial to treat them as separate data types. This idea of two JSON objects having different data types captures the principles found in object oriented programming (OOP). In this paradigm, the object is the central method of “encapsulating state and behaviour” [46]. This is commonly implemented through the construction of abstract templates known as classes. Each class consists of a set of properties used to store data (the state), and a set of methods used to manipulate data (the behaviour). Objects are then instantiated from a class, where each object has its own values for the properties. In this regard, all objects created from the same class have the same data type—the class itself [46]. Two classes can then have the same properties with the same names but still be treated as different data types. When converting the data encapsulated within an OOP object directly to a JSON object, this data type information is lost leading to the creation of heterogeneous arrays. The receiving end of the JSON object is then responsible for processing each object with the assumption that it knows the data type of the object. In this thesis, we are assuming that each element of the array given as input 35 consists of a JSON object that is associated with a particular structure type. A structure type represents a particular structure a JSON document can have, along with some external reason for why that information is grouped together. Two structure types can have similar structures but still be considered different structure types. This is due to the different meaning behind what the structures represent. For example, as discussed in section 2.4.2, the iTunes Search API returns data containing a JSON array, where each object in the array represents a type of product available in the iTunes store. All objects representing the same type of product have very similar structures, as that structure is used to model that product. 3.2 Problem Overview A better approach to schema design for JSON arrays would be to generate a schema for each of the possible structure types. An object in an array would then be valid if it satisfies at least one of the possible schemas. The benefit of this approach is that it reduces the chance of ambiguity as keys belonging to different types are no longer being mixed together in a single schema. If an object now tries to mix together two structures (as those in the shape example in figure 1.2c), it would be rejected as it does not satisfy the schema for either type. A potential issue with this method is that it relies on the structure type of each object being known in advance. If this information is not known, then the problem becomes significantly more challenging. 
This is because there is no completely accurate method of deducing what structure type an object is trying to model— largely because objects may contain optional keys. If two objects have different keys, it cannot be determined whether they represent different structure types or the same type. In the second case, the differences between their structures are due to them including different optional keys. For example, the two objects in figure 36 2.10’s results array may actually be the same structure with different optional keys. Without having this external information to influence the decision, any method can only guess with varying confidence what structure type an object has. One way to avoid this issue is to draw from the existing field of cluster analysis [47]. The goal of this field is to “divide [a] set of objects into homogeneous groups [such that] two arbitrary objects belonging to the same group are more similar to each other than two arbitrary objects belonging to different groups” [48]. Clustering algorithms could be applied to a set of objects to partition them into a set of groups. A schema could then be generated for each group. The assumption behind this method is that two objects with the same structure type should have structures that are more similar compared to two objects having different structure types. This avoids the issue of needing to identify each object’s type, as most clustering algorithms work off similarity calculations that compare the objects structures. Applying clustering algorithms alone does not guarantee that each structure will be identified. If two types of structures are very similar to each other, it is possible that clustering algorithms may place the objects into the same group. Any differences between their structures would then be considered optional. How often this occurs largely depends on the number of groups generated in the clustering process. Regardless, this method does improve the accuracy of a schema as multiple tailored schemas are generated rather then a single schema. Another downside to this approach is that it may come with a decrease in performance. With a single schema (as in the shape motivation in figure 1.2b or the JSON schema in figure 2.11), each object only has to be compared against one schema to determine its validity. By generating a schema for each possible structure type, an object may have to be compared against multiple schemas to determine which one it satisfies (if any). For example, if there are 1000 objects and 10 different schemas, up to 10000 schema validation attempts may be performed. 37 3.3 Driving Observation We introduce a different approach in this thesis based on the observation that heterogeneous JSON arrays usually contain a set of keys within each object for the purpose of identifying its structure type. We will call these keys identification keys. For example, each object may have an identification key containing a version number in scenarios where the format has changed over time. This is the major difference between an array of JSON objects and a collection of random JSON objects— the array is designed to hold related content and be processed as a single unit. This assumption of the existence of identification keys comes from JSON being a data interchange format for transmitting information between two machines. Because of this, a program has to exist on the receiving end to accept the JSON document and automatically process it. 
With a homogeneous array, this is a simple program as each object has the same data type and can be processed in the same way. When working with a heterogeneous array, different objects may have to be processed in different ways. In the iTunes example, song and movie objects contain different keys representing different concepts. How a program processes a song will be different from how it processes a movie. One alternative to identification keys could be to hardcode a set of key lookups into the program. This action is performed before deciding how to process an object. Based on which keys are present in the object (i.e. considering only if a key is present or not rather then looking at the value of a key), the program would process the object accordingly. For example, objects with keys A and B would be processed one way, whereas objects with keys C and D would be processed another. This approach has two disadvantages. First, additional key lookups have to be performed on each object to check the different scenarios. Second, this method is not “future proof”. For example, suppose that a new structure type with keys A, B, 38 and E is added in the future that requires different processing compared to objects with just A and B. Not only would every program have to be manually modified to account for this new structure, missing a program would lead to unintended errors. Objects with keys A, B, and E will be processed as if they only had keys A and B. Over time, this method will likely lead to large if-else chains that are needed to check all possible edge cases that arise. Thus, the easiest method to identify an objects structure is to use identification keys. 3.4 Schema Decision Trees In both the clustering approach and the approach discussed in the previous section, the result is a set of groups that partition an array of JSON objects. A schema can then be generated for each group. While both methods ultimately result in a set of schemas, the second approach has an advantage in that it also determines the identification keys of the array. In the traditional clustering approach, this information is unknown, as each group may not correspond to a single structure type. Having this information is beneficial because a schema can be generated for each structure type, and all objects of that type can be validated against that specific schema. This combines the advantages of the other approaches; each object is only compared against a single schema, but that schema more accurately represents one of the possible structure types. To determine which schema an object needs to be compared against, we introduce the concept of a Schema Decision Tree—based on the idea of decision-tree classifiers. This type of classifier consists of a tree structure where each leaf node is associated with a class, and each inner node is associated with a decision. The process of determining which class a piece of data falls under consists of starting at the root node and working down through the tree. At each inner node, the data 39 is compared against the decision at that node; the result of the decision then effects which child node the data moves to next [6]. Similar to decision-trees, a schema decision tree is a tree structure that holds information about the identification keys of a given array as well as the schemas for the different structure types. Within this tree, each leaf node contains one of the possible schemas. Each inner node contains a decision involving one of the identification keys. 
This decision is based on the value of the given key, where each possible value is associated with one of the child nodes. An object only needs to compare its values for the specified keys against those found in the tree to determine which schema the object should be compared against.

Figure 3.2: Comparison between validating an array [J1, J2, ..., Jn] through a linear search approach versus validating it using a schema decision tree. (a) Linear search: each object Ji is compared against the schemas S1 through S5 in turn. (b) Schema decision tree: inner nodes branch on the values of the keys wrapperType (e.g. "track", "audiobook") and kind (e.g. "song", "podcast", "feature-movie", "tv-episode"), leading to one of the schemas S1 through S5. This tree is part of the final schema decision tree generated for the iTunes Search API.

Figure 3.2 shows an array of objects being validated using the two different methods. In figure 3.2a, each array object Ji is compared against the first schema. If the object satisfies the schema, the program moves on to the next object in the array. If the object does not satisfy the schema, the program compares it against the second schema. This process repeats until the object eventually satisfies one of the schemas or is rejected. With the schema decision tree in figure 3.2b, though, a lookup is first performed on the key wrapperType. If that key does not exist or its value is not one of the possibilities outlined, the object is rejected. Otherwise, the object moves down the tree based on the value of the key. This process repeats until the object eventually reaches a leaf node containing the schema it should be compared against.

3.5 Problem Statement

The problem explored in this thesis is improving automatic schema generation for heterogeneous JSON arrays. To do this, we aim to first determine a set of keys that identifies an object's structure. A schema can then be generated for each type of structure that more accurately reflects its content. Based on these keys and the resulting schemas, we generate a schema decision tree to help in the validation process.

In order to determine an array's identification keys, we introduce a variation of cluster analysis based on an operation we define called splitting. Let the notation val(k, J) represent the value of the key k in JSON object J. A split operation on a group of objects² g is then defined by choosing a key k common to all the objects in g, and partitioning g into g1, g2, ..., gn such that the following two properties are satisfied.

1. ∀ J1, J2 ∈ gi : val(k, J1) = val(k, J2)

2. ∀ J1 ∈ gi, J2 ∈ gj, i ≠ j : val(k, J1) ≠ val(k, J2)

² In this context, a group is considered an array where an object's index position is ignored.

This definition says that, based on the value of this key k, the objects are partitioned into one or more groups. All objects with the same value for the key are placed in the same group (property 1), and objects in different groups have different values for the key (property 2).

Applying the split operation one time assumes that a single identification key exists. As we have seen from the iTunes example in figure 3.2b, though, this is not a valid assumption. Based on this, we define two issues that have to be addressed.

1. More than one identification key may exist. For example, the combination of two keys may identify an object's structure.

2. An identification key may only be present in a subset of the objects.
For example, one identification key partitions the objects into a set of groups. A second identification key then exists for one of these groups to further partition it. This key may only be present in the objects of that group, and not in the objects of other groups.

As a result of these issues, any algorithm designed to determine a group's identification keys needs to be able to recursively apply the split operation; that is, further apply the split operation to one or more of the groups generated from another split operation. Applying a recursive solution to this problem resolves the aforementioned issues. For issue one, the split operation would be applied to one of the keys. Each of the resulting groups could then be partitioned again based on the second key. This is equivalent to first partitioning on the second key and then on the first key. In both cases, the result is a series of groups having the same values for the given keys. Furthermore, issue two is solved because the split operation could be applied to one group in particular. All other groups, not having the second identification key, would be left alone.

While the split operation tells us how to partition the objects for a given key k, it does not tell us how to choose this key. Contained within the objects may be many potential options for k, with only a few of these being used for identification purposes. Other keys may just be common to all structure types. Thus, we need a way of determining whether a key falls under the identification key category or the non-identification key category. To solve this problem, we use the assumption that objects with the same structure type should have more similar structures when compared to objects with different structure types. To illustrate this concept, suppose all objects have two keys: version and date. Splitting based on the version key should result in groups where objects in the same group have similar structures. On the other hand, splitting based on the date key should result in groups of random objects, as there is no connection between the key's value and its structure type.

Just considering this assumption alone is not enough to determine if a key is used for identification purposes. Consider a scenario where all objects in a group contain an id key that acts like a unique identifier (like a primary key in a relational database). Due to being unique, splitting on this key results in a partition where each object is placed into its own group. Going only off the assumption that splitting on an identification key results in groups of similar objects, id should be used for identification purposes: each of the resulting groups has a perfect similarity score due to only containing a single object (i.e. there is no variance in the structure of the objects when the group only contains a single object). For this reason, maximizing the similarity of objects in the same group is not sufficient for determining whether a key is an identification key. The number of groups generated also has to be considered.

Another problem arises if we were to instead strictly focus on trying to minimize the number of groups. By not considering a group's similarity anymore, the best way to partition the objects would be to simply not partition them at all, and instead, leave them in a single group—regardless of whether they have the same structure type or not. This effectively renders no results, as we are back to generating a single schema for the entire array.
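As a rough sketch of the split operation defined above (assuming the objects are represented as Python dicts and that the chosen key k is present in every object of the group; the version, kind, and id keys below are illustrative only), the function partitions a group by the value of a chosen key. Splitting on a version-like key yields a few structurally coherent groups, while splitting on a unique id yields one group per object, which is exactly the overfitting behaviour discussed above.

    from collections import defaultdict

    # Split a group of JSON objects on a key k common to all of them:
    # objects with equal values for k end up in the same group (property 1),
    # and objects in different groups have different values for k (property 2).
    def split(group, k):
        partition = defaultdict(list)
        for obj in group:
            partition[obj[k]].append(obj)
        return list(partition.values())

    group = [
        {"id": 1, "version": "v1", "kind": "song", "trackTimeMillis": 241562},
        {"id": 2, "version": "v1", "kind": "song", "isStreamable": True},
        {"id": 3, "version": "v2", "kind": "podcast", "genres": ["Design"]},
    ]

    print(len(split(group, "version")))  # 2 groups of structurally similar objects
    print(len(split(group, "id")))       # 3 groups, one per object (overfits)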
In order to generate useful results, a balance between maximizing the similarity of each group and minimizing the number of groups needs to be found. To do this, we introduce a similarity threshold T that represents the minimum similarity a group must satisfy to be considered valid. This allows the best split to be the one with the minimum number of groups such that the similarity of each group is above the threshold. Furthermore, having a similarity threshold provides a base condition for when to stop recursively splitting. If a split operation results in all but one group being above the threshold, only that one group needs to be further partitioned. The rest of the groups can be left alone, as they have all reached the base condition. 3.6 Assumptions 3.6.1 Common Structures Related keys have the same name between objects. This removes the need to consider semantic differences when comparing keys. For example, the keys latitude and lat are assumed to have no relation to one another even though they are semantically related (one is an abbreviation of the other). Further, related keys will appear in the same location between objects. This removes the need to match sub-structures that appear in different locations. For 44 example, one object containing the keys street, city, and country in the root object and another containing the same keys nested within an address object are assumed to have no relation to one another. We believe these assumptions to be valid because all objects in an array are originating in the same document from the same source. If a single program is automatically creating the document, it should have common formats for representing information. Furthermore, the document is designed to be processed by a single problem. This means that the program needs to know which keys contain certain information and where they are located in the document. Achieving this can only be done using common names for keys and storing keys in common locations. 3.6.2 Existence of Identification Keys Each object is associated with a particular structure type and has a set of keys that identifies its structure. The aim of this thesis is to then identify such keys. Further, we assume that two objects of the same structure type have more similar structures compared to objects of different structure types. The reasoning behind this assumption was discussed in sections 3.2 and 3.3. We leave the possibility of removing this restriction as future work, and we discuss a possible method on how this could be done in section 8.1. 3.6.3 Complete Input Data In order to generate the schema, we assume that the input data is complete in the sense that all possible structure types are included in array. In addition, all optional keys related to a structure type are included in at least one object of that type. We believe this assumption to be appropriate because a schema cannot be gen- 45 erated for data that is unknown. If we assume incomplete data (i.e. the data does not include all types of structures), then the possibility arises of the schema rejecting objects that should be considered valid. We cannot automatically determine if a structure that failed to validate against the schema was because the schema did not include its structure type or if it actually was ill-structured. The only other option could be to forward invalid objects to an admin to manually decide which scenario it falls under. 
Regardless, there is no way of dynamically updating a schema, as doing so defeats the purpose of having a schema in the first place. 3.6.4 Array of Objects The JSON array given as input contains only JSON objects as elements. Each object can have an arbitrary number of keys with arbitrary data types and be arranged in an arbitrary nested structure. The reason for this assumption is that each element in an array falls into one of three categories. 1. The element is one of the simple data types (integer, string, etc.). 2. The element is a JSON array. 3. The element is a JSON object. In the first category, a schema can be easily generated by looking directly at its data type. In the second category, the schema for that element would be computed separately from all other elements—including other JSON arrays. The reason for this is that we cannot say if two arrays contain the same identification keys or not. As such, we do not focus on this type of element as generating a schema for it forms an identical sub-problem. By improving schema generation for an array of JSON objects, we inadvertently improve schema generation for elements that are arrays themselves. Finally, the third category is the topic explored in this thesis. 46 Based on our observations, the vast majority of situations have an array consisting of either all simple data types or all JSON objects. As computing the schema of an array of simple data types is trivial, we focus on improving schema generation for an array of JSON objects, as that is the core problem. 47 Chapter 4 Related Works To our knowledge, no previous work has directly looked at the problem of schema generation for heterogeneous arrays contained within JSON documents. Thus, the works analyzed in this chapter mainly fall into the two categories of general clustering of semi-structured data and software tools involving schema generation. Section 5.1 in the following chapter further discusses different similarity calculations used; the focus of this chapter is instead on the big picture problem. 4.1 Related Works Involving JSON The first related work is by Izquierdo and Cabot, and it focuses on generating a UML-like model for a collection of JSON documents [49]; in particular, documents returned from the various endpoints of a web API. The goal of their work is to help developers better understand and visualize the global data model hidden behind the API. Developers do not usually have direct access to this model. Instead, they interact with it through overlapping snippets of data returned from the APIs endpoints. By piecing together these related snippets, the authors hope to better understand the entire data model. The running example they use throughout their paper is a transportation API. In this API, one endpoint returns identifiers for the 48 closest train stops to a given location. These identifiers can then be passed into a second endpoint to get more details on each stop. Although the information returned from the two endpoints is different, it is still related due to a common application domain. Their algorithm works in three stages. First, each JSON document is converted into a non-JSON representation they call a model format. Secondly, all the models from the same endpoint are combined together to create a new model. The difference with this new model is that it captures the structural variation present in the documents. For example, some documents may contain optional keys not included in other documents. 
Finally, models from the same endpoint are combined together based on overlapping substructures. The result of this process should be a complete view of how the information and endpoints are connected. This work is then the bases for their online tool called JSONDiscoverer [50]. Of particular note is the first stage where they extract a single schema for a JSON array. Like other schema generation mechanisms, their tool considers all JSON objects to represent the same structure type and only generates a single schema for all of them. In Klettke et. al.’s work, they develop a schema extraction algorithm targeted at collections of JSON documents found in NoSQL databases [51]. The first step of their solution involves taking a group of JSON documents and building a treelike graph based on them. The purpose of this graph is to summarize the parentchild relationships of all the key-value pairs, as well as track which documents each relationship occurs in. Based on this graph, a schema can be generated that accounts for all structural variations (e.g. optional key). In addition to this, the generated graph can also be used for determining structural outliers. These are defined as patters of keys occurring in only a few of the documents. The database admin can then use this structural outlier information to classify if the documents 49 are ill-structured or not. Finally, this paper discussed a series of calculations for measure the similarity of groups of JSON documents. This is further discussed in section 5.1. One point of significance briefly discussed in this paper is a preprocessing stage where the document collection is partitioned into smaller groups based on some key (date, timestamp, etc). Their algorithm can then be applied to the resulting groups to improve the accuracy of the generated schemas. This idea is similar to that presented in this thesis; however, they only mention it as something that can be done if those keys are already known. They do not discuss any ways of actually determining said keys. Next, Spoth et. al. present an OLAP tool called SchemaDrill [52]. The aim of this tool is to help users visualize a collection of JSON documents. From this visualization, users can then create a relational mapping of the documents. Their application works by first displaying a list of all the keys whose value is not a JSON object or array (i.e. the leaf nodes of the tree). Furthermore, each key includes the path from itself to the root node. Users then groups together related keys with these groups forming the bases of a set of relational tables. As the number of keys in the list could range in the hundreds or thousands and overwhelm the user, their application preemptively groups the keys based on two calculation they define. These calculations are called correlation and anti-correlation. Correlation looks at how often two keys appear in the same JSON object. Keys appearing frequently together should then be placed in the same group. Similarity, anti-correlation looks at keys that rarely occur together, and instead, tries to place them in different groups. While some of the ideas in this paper carry over to our work (such as how they represent keys), the focus is on flattening JSON documents into relational schemas rather then on semi-structured data schema generation. A consequence of this is 50 that the authors do not give any special consideration for JSON arrays. 
In fact, they treat arrays as normal JSON objects where each element in the array is assigned a key; the name of this key is just the index of the element in the array. In [53], [54] and [55], Bazizi et. al. look at the problem of schema generation for very large JSON datasets (in the order of millions of documents). Because of how time consuming it is to sequentially process a collection this large, they develop an algorithm capable of generating a schema in parallel using Apache Spark and the MapReduce paradigm. Their algorithm works in two phases. In phase one (the map phase), each JSON document has its values reduced down to their datatypes. For example, the JSON object {”A”:123} is reduced down to {”A”:Num}; similarly, {”D”: [123, true, ”abc”]} is reduce down to {”D”:[Num, Bool, Str]}. Phase two (the reduce phase) then takes these simplified documents and repeatedly merges them together in a process they call fusing. Fusing works by taking the union of all datatypes at each layer of the documents. If the same key appears in two documents with two different data types, the key gets assigned the union of the two data types. The result of phase two is a single simplified document capturing all the possible structures in the collection. A schema is then generated from this document. When their algorithm reaches an arrays, it is first passed through a simplification process that reduces the elements datatypes down to their simplest form. This is done by combining all occurrences of the same datatype together. For example, an array containing two strings and an integer gets simplified down to one-ormore strings and an integer. When there is more then one JSON object in the array, they are merged together into a single JSON object. So while this work does consider heterogeneous arrays, they treat all JSON objects as the same type and merge them together. 51 Next, Perzoa et. al. define a formal definition of the syntactic and semantic meaning behind the JSON Schema format [43]. The purpose of this formal definition is to provide a uniform way for applications to interpret schema documents— especially involving some of the gray areas where interpretation of a schema document may vary between implementations (e.g. recursive schema definitions). To create this formal definition, they create a grammar for the JSON Schema specification outlining how the specification’s features should interact with each other. Finally, DiScala and Abadi design an algorithm to transform hierarchically structured JSON documents into flat structures suitable for relational databases [56]. Their algorithm works by applying unsupervised machine learning to group together keys that likely correspond to relational entities. This way, all instances where those keys occur together can be stored in the same relation. In order to detect related keys, they first perform functional dependency detection between pairs of keys. The purpose of this is it look for scenarios where the value of one key effects the value of another, as this may correspond to primary key–foreign key relationships. Phase two of their algorithm then looks for reoccurring substructures. Finally, phase three generates a relational schema that maps substructures to tables and connects them through foreign key relationships. Regarding arrays, this work treats them as independent sub-problems that are recursively computed. 
4.2 Related Works Involving XML In addition to the above literature on JSON, it is also advantageous to examine other related work based on different semi-structured data formats. In particular, we now look at XML due to its widespread popularity in both industry and academia. While the two formats are not identical in how they represent informa- 52 tion, they are similar enough such that work done for one format is still applicable to the other in some fashion. To begin, one prominent area of research is developing clustering algorithms to group related XML documents together. Motivation for this can be found in applications such as document retrieval systems. These system work by accepting one or more XML documents as input and return other documents related to them. As discussed in the overview paper by Piernik et. al., most clustering applications involve three phases [57]. First, a preprocessing stage occurs to transform the text-based XML documents into more computationally compatible formats. For example, an XML document may be turned into a tree data structure that is implemented for a specific programming language. Second, a similarity calculation can be developed as a way of measuring how related two or more XML documents are to each other. Finally, a clustering algorithm is applied that uses this similarity calculation to group related documents [57]. This problem has been thoroughly explored under the context of different similarity calculations and clustering algorithms. A few examples of such works include [47] where they use a tree-edit distance calculation and hierarchical agglomerative clustering to group XML documents. [58] also takes a tree-edit distance approach, except they restrict edit operations to only leaf nodes and not inner nodes. In [59], the authors also consider the semantic similarity between XML tags through the use of the WordNet English lexical database. Similarly, [60] considers the semantics of entire sub-trees rather then just elements themselves. Finally, [61] approaches the problem by first grouping together all nodes at each level of the tree. These level structures, as the authors call them, are then compared together rather then the trees themselves. [62] expands on this idea but instead considers the edges of the trees rather then the nodes. 53 Related to clustering, [63] and [64] both look at grouping together XML schemas rather than XML documents. The goal behind their work is to reduce the number of schemas in a collection by merging related schemas together. To determine which schemas should be merged, the authors apply a clustering algorithm to create groups of related schemas. The similarity calculation they consider here is based on both the syntactic and semantic aspects of a tag’s name. All schemas in a group are then merged together into a single schema. This approach can be recursively applied depending on how many schemas is desired. The final area of related work discussed in this chapter deals with the problem of duplicate detection. Based on data cleansing, this problem looks to remove duplicate elements that appear within an XML document. In terms of JSON, this problem can be thought of as the removal of duplicate objects in an array. In [65], two elements are classified as duplicates if they have the same parent element, the same tag name, and similar content. Their algorithm works through a top-down approach. 
When it comes across two elements having the same parent element and same tag name, their contents are compared using syntactic and semantic similarity calculations. This differs from the approach taken in [66]. In their algorithm, they instead employ a bottom-up approach by first finding matching leaf nodes. Substructures are then built up from these leaf nodes to see how similar their parent elements are. 4.3 Related Works Overview As discussed in this chapter, all related works have either dealt with the general problem of schema generation, clustering of semi-structures data, or measuring the similarity of JSON or XML documents. The closest work we have found to the problem discussed in this thesis is by Klettke et. al. in [51]. In their paper, they 54 briefly discuss how the accuracy of a schema can be improved by first partitioning the documents based on some key. However, they only mention this partitioning as something that can be done if the key is already known. Furthermore, they do not consider many of the related problems explored in this thesis. These problems include actually determining identification keys, connecting the resulting schemas together (e.g. schema decision tree), or multiple identification keys existing. 55 Chapter 5 Solution Details Based on the problem definition in chapter 3, the main issue to address is partitioning an array of JSON objects into a set of groups by recursively applying the split operation. The smallest set of groups, such that each group has a similarity score over a given threshold, is then the best partition. A schema decision tree can then be generated based on it. 5.1 Syntactic Similarity Scores The first issue to discuss in our solution is defining the concept of similarity in the context of JSON objects. A similarity score (or similarity measure) is a function that takes two or more items as input and returns an integer representing the similarity between them [67]. What similarity is defined as depends on the application and what type of data the items consists of. To illustrate, consider an application involving 3D coordinates. One way to measure the similarity between two points could be to take the euclidean distance between them. Another method could be to measure the angle between two lines generated by connecting each point to the origin. Both of these calculations, however, are restricted to this type of data and could not be used to measure the similarity of semi-structured data without first 56 transforming it. Creating similarity calculations tailored to semi-structured data has been an active research problem ever since the concept first emerged. Most of the calculations created can be classified into either the semantic or syntactic categories. Syntactic calculations focus on measuring the differences in how the data is arranged in structures. They usually assume that two keys are related if they have the same name and not related if they have different names. Similarity is then based on how often the same keys appear in comparable locations. In figure 5.1 for example, the keys firstName and lastName are nested within the name key; however, they could just as easily be placed directly in the author key without the meaning of the rest of the document drastically changing. Syntactic calculations take this structural information into account and would likely score this scenario higher than if firstName and lastName had instead been nested within the location key. 
{
  "location": {
    "city": "Vancouver",
    "country": "Canada"
  },
  "author": {
    "id": 1753,
    "name": {
      "firstName": "John",
      "lastName": "Smith"
    }
  }
}

Figure 5.1: JSON document in text-based format.

On the other hand, semantic calculations focus on measuring the differences between the meaning of the data. Given two keys, these calculations look at properties such as their names, values, and locations when trying to measure how similar they are. Semantic calculations are particularly useful for comparing data originating from different sources. Different companies often have different naming practices. Thus, assuming two keys are related only if they have the same name is no longer sufficient. For example, one company may use the keys latitude and longitude, whereas another company may use the common abbreviations lat and long. There is obviously a shared meaning behind what these keys represent even though they are syntactically different. Semantic calculations look to identify this commonality.

As a result of key names in JSON being restricted to strings, semantic calculations are significantly harder to create due to the involvement of human language. These calculations have to account for different scenarios such as abbreviations (id vs. identification), synonyms (city vs. town), conjoined words (e.g. firstName), etc. Furthermore, common techniques used in natural language processing are based on large data sets. By comparison, semi-structured data often only consists of a small number of keys. Additional information behind the meaning of the data may be provided through external resources. For example, one of the keys in the iTunes Search API data is artworkUrl60, which contains a URL for related artwork having a size of 60x60 pixels. The meaning behind what the key represents is only provided through external developer documentation not accessible to the similarity calculation. Even if the documentation is provided—and even exists in the first place—the formats of these documents range widely with no common page structure or content. Thus, semantic calculations have to try to extract this information from the key's name alone.

For the purpose of this thesis, we are only interested in syntactic similarity calculations. The reason for this is that the primary purpose of a schema document is to validate the structure of other JSON documents. This involves making sure that specific keys appear in specific locations and have specific data types. The focus is not on trying to match common meanings as semantic similarity calculations do.

5.1.1 Tree-edit Distance

One of the popular similarity calculations for semi-structured data is the tree-edit distance. This calculation is based on the inherent tree-like structure that results from nesting information inside other information [68, 69, 70, 47]. Each time this nesting occurs, a parent-child relationship emerges that can easily be modelled in a tree. For example, figure 5.2 shows the JSON document in figure 5.1 as a tree structure. At the top level, each JSON document consists of a single nameless root object. This becomes the root of the tree and is given a default name of root. Nested directly within this root object are the location and author keys, and appropriately, they become the direct children of the root node. This process repeats itself until it eventually reaches a node containing a single value rather than more nested information (these are known as leaf nodes).
The value of this node can either be stored directly in the node itself (as depicted) or as a single additional child node.

Originally introduced for general tree structures in the early 1970s, the tree-edit distance between two trees is defined as the length of the minimum sequence of node insertions and deletions needed to transform one of the trees into the other [68, 69, 70]. When a node is deleted, all of its children have their parent node set as the deleted node's parent. This idea of edit operations has further been expanded upon over time to include other operations such as renaming a node and the insertion or deletion of entire subtrees as a single operation [70]. Most solutions for calculating the tree-edit distance are based on the previously explored string-edit distance problem [69]. This problem looks at how many character insertions and deletions are needed to transform one string into another.

Figure 5.2: Tree representation of the JSON document in figure 5.1. The root node has two children, location and author. The location node has leaf children city (value = "Vancouver") and country (value = "Canada"). The author node has a leaf child id (value = 1753) and a child name, which in turn has leaf children firstName (value = "John") and lastName (value = "Smith").

5.1.2 Disadvantages of Tree-edit Distance

While the tree-edit distance is a popular similarity calculation in previous literature, we find that it is not suited for the problem explored in this thesis; the major reasons for this are outlined next.

5.1.2.1 Similarity of More than Two Trees

The tree-edit distance is designed to measure the similarity between two trees. In previous clustering approaches, this was not an issue, as a similarity matrix was usually first created. This matrix contained the similarity scores for each pair of trees, thus allowing the tree-edit distance to be applicable. A clustering algorithm (such as hierarchical agglomerative clustering as used in [47]) would then work off this matrix when creating the clusters. This differs from our approach, as groups are first generated using the split operation. How good a split operation was then depends on the similarity of these groups. As such, we need a similarity calculation capable of measuring the similarity of more than two documents simultaneously.

5.1.2.2 Comparing Unordered Trees

Most algorithms that compute the tree-edit distance are based on ordered trees. This is a type of tree where, for each node, every child is assigned a value indicating its position among the rest of the children. Two nodes are then considered equal only if they have the same children appearing in the same order. On the other hand, unordered trees would consider two nodes equal as long as they have the same children—regardless of their order. Previous literature on XML has used ordered tree-edit distance algorithms because XML documents are considered ordered trees [18]. XML schemas, for example, have separate notation for when the order of elements does not matter. This is compared to JSON, where objects are treated as unordered sets of key-value pairs. As such, tree-edit distance algorithms for ordered trees cannot be applied. Creating tree-edit distance algorithms for unordered trees has also been previously explored [71, 72]; however, these algorithms are more conceptually complex and computationally expensive. The reason for this is that the assumption that a node's children appear in a specific order allows efficient dynamic programming techniques to be applied. For example, comparing two ordered nodes means that the first child of each node has to match, the second child of each node has to match, etc.
Compared to unordered trees, the first child of a node may match any child of the other node. This results in significantly more permutations to consider.

5.1.3 Path Distance

Because of these reasons, we have decided to use a different similarity score that has also been previously discussed in literature. This score is based on the well-known Jaccard index used for measuring the similarity between two sets [51, 57]. Given two sets, the Jaccard index returns a rational number inclusively between 0 and 1 based on dividing the size of the intersection by the size of the union. A score of 1 represents the two sets having exactly the same elements. A score of 0 represents the two sets having no element in common.

Before the Jaccard index can be applied to JSON documents, one problem has to be addressed; namely, a JSON document takes the form of a tree-like structure, whereas the Jaccard index requires sets as input. To address this, we first apply a transformation to each JSON object to remove its nested structure and "flatten" it down to a linear structure. This notion has also been discussed in past literature [51, 52, 43]. The main concept behind flattening a JSON object is that the useful information within the object is contained in the leaf nodes (i.e. the key-value pairs whose value is an integer, boolean, string, or null). All inner nodes (i.e. keys whose value is an object or array) mainly exist to arrange the leaf nodes into smaller, more meaningful, structures. Using this concept, a JSON document can be flattened by taking each key that is a leaf node and appending onto it the names of the keys that occur when traversing to it from the root of the tree. For example, the JSON document in figure 5.1 can be transformed into the set of key-value pairs depicted in figure 5.3.

location/city: "Vancouver"
location/country: "Canada"
author/id: 1753
author/name/firstName: "John"
author/name/lastName: "Smith"

Figure 5.3: Path representation of the JSON document in figure 5.1.

More formally, let a JSON object J, in path notation, be defined by the following equation.

J = {k1 : v1, k2 : v2, ..., kn : vn | ki ≠ kj for i ≠ j}    (5.1)

Here, ki : vi is a key-value pair in the flattened structure (e.g. in figure 5.3, location/city: "Vancouver" is a key-value pair where the key is location/city and the value is "Vancouver"). Because J is a set, the set intersection and set union operations can be defined for it. Let the intersection of two JSON objects be defined by equation 5.2.

J1 ∩ J2 = {ki | ki ∈ J1 ∧ ki ∈ J2}    (5.2)

This says that the intersection of two JSON objects is the set of keys common to both objects. The resulting set is not a JSON object, however, as a key is no longer associated with a specific value. The reason for this is that while a key may appear in both J1 and J2, the value of the key is likely to differ between them. Thus, it does not make sense to include a value in the resulting intersection, as the two values would either have to be combined together, or two separate key-value pairs would have to be included. The union of two JSON objects is defined by equation 5.3.

J1 ∪ J2 = {ki | ki ∈ J1 ∨ ki ∈ J2}    (5.3)

This says that the union of two JSON objects is the set of keys appearing in either of the objects. Similar to the intersection operation, the union also results in a set of keys with no values.
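The following minimal Python sketch (assuming objects are nested dicts; nested arrays are simply kept as leaf values here, a simplification of the treatment discussed in section 5.1.4) illustrates the flattening transformation and the key-set intersection and union of equations 5.2 and 5.3.

    # Flatten a nested JSON object into path notation: each leaf key becomes the
    # path of keys from the root, joined with "/", mapped to its leaf value.
    def flatten(obj, prefix=""):
        paths = {}
        for key, value in obj.items():
            path = f"{prefix}/{key}" if prefix else key
            if isinstance(value, dict):
                paths.update(flatten(value, path))
            else:
                paths[path] = value
        return paths

    doc = {"location": {"city": "Vancouver", "country": "Canada"},
           "author": {"id": 1753, "name": {"firstName": "John", "lastName": "Smith"}}}

    j1 = flatten(doc)
    j2 = flatten({"location": {"city": "Prince George", "country": "Canada"}})

    common = j1.keys() & j2.keys()   # intersection: keys present in both objects
    either = j1.keys() | j2.keys()   # union: keys present in either object
    print(sorted(common))            # ['location/city', 'location/country']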
With the intersection and union operations now defined, the Jaccard index for two JSON objects is defined in equation 5.4. We call this equation sim for similarity.

sim(J1, J2) = |J1 ∩ J2| / |J1 ∪ J2|    (5.4)

5.1.4 Path Notation for Nested Arrays

One concept about path notation that has not yet been discussed is how to represent nested arrays. When only considering objects, as in figure 5.1, a unique path can be created for each key, since two keys cannot have the same name in the same object. As the values in an array are nameless, this property does not hold. Previous literature has dealt with arrays in different ways. Both [52] and [43] insert the element's array index into the path to keep it unique. [51] generates a single schema for the entire array, and thus, merges all objects together, resulting in unique paths.

Three main ways exist to deal with nested arrays. Figure 5.5 shows the resulting path notation based on the array in figure 5.4 for each of the three ways. The first approach is to insert the element's array index into the path (like [43] and [52] did). Figure 5.5a showcases the resulting path notation. The downside of this method is that the number of elements in the array then plays a factor when computing the intersection and union operations. For example, one array containing 10 elements and another containing 100 elements would result in a low similarity score regardless of whether each element had the same structure.

The second approach is to instead take the union of all keys and treat the array as a single object (like [51] did). Figure 5.5b shows this option. While this does solve the issue of having arrays of different sizes, it also introduces its own problems. First, the value of the key is lost as a result of each key likely having a different value in each occurrence. Second, by merging together all array elements and treating the result as an object, we lose the distinction of whether a key originally existed in an array or an object.

{
  "books": [
    {
      "title": "Intro to JSON",
      "author": "John Smith",
      "numSold": 173
    },
    {
      "title": "Intro to XML",
      "author": "Jane Doe"
    }
  ]
}

Figure 5.4: JSON document containing an array.

(a) Path notation that includes an object's array index:
books/0/title: "Intro to JSON"
books/0/author: "John Smith"
books/0/numSold: 173
books/1/title: "Intro to XML"
books/1/author: "Jane Doe"

(b) Path notation that takes the union of the keys:
books/title
books/author
books/numSold

(c) Path notation that treats an entire array as the key's type:
books: Array

Figure 5.5: Three path notations for the JSON document in figure 5.4.

The third approach, and the method used in this thesis, is to treat the entire array as a single JSON array data type and not break it down any further. This is shown in figure 5.5c. The reasoning for this is that there is an intrinsic difference between a JSON object and a JSON array. The purpose of an array is to act as a container for when the number of elements is unknown. An object, on the other hand, has a more fixed structure with more meaning behind what keys are included. Furthermore, identification keys are unlikely to exist in a nested array, as they would have to appear in every element. For our purposes, we find it satisfactory that two keys having the same name and an array as a value should represent the same concept.

5.2 Similarity of a Group

Unlike the tree-edit distance calculation, the Jaccard index can easily be expanded to measure the similarity of more than two sets.
5.2 Similarity of a Group

Unlike the tree-edit distance calculation, the Jaccard index can easily be expanded to measure the similarity of more than two sets, because the set intersection and set union operations are associative and commutative. Let a group g of JSON objects be defined as follows:

g = {J1, J2, ..., Jn}    (5.5)

The similarity of g can then be defined by expanding equation 5.4 to consider the intersection and union of all the sets in g rather than just two sets. In both cases, 0 ⩽ sim ⩽ 1, as the size of the intersection has a minimum value of 0 and a maximum value of the size of the union. This expanded equation is defined by:

sim(g) = | ⋂_{Ji ∈ g} Ji | / | ⋃_{Ji ∈ g} Ji |    (5.6)

Klettke et al. discuss this equation in the context of JSON [51]. One point they bring up is that this equation can be highly influenced by a few objects in a group. If one object in a group has no keys in common with any of the other objects, the score will always be 0—regardless of how similar the remaining objects are. The reason for this is that the equation involves the intersection operation. Thus, if one object has nothing in common with any of the other objects, the intersection will always result in the empty set and a size of 0.

While the authors argue this is a downside to the equation for the purpose of measuring a group's similarity, we consider it an advantage. Schema documents have to consider the structure of a group as a whole. Whether a group contains 2 objects with identical structures or 1000, the result is still a single schema. If a group contains two objects with completely different structures, the resulting schema has to include all the keys in both. By having a similarity score that considers the differences in structures within a group rather than how many objects have the same structure, we are able to generate more accurate schemas.

5.3 Similarity of a Grouping

As discussed in chapter 3, we assume that an identification key is a key whose value corresponds to the type of structure an object has. Under this assumption, a set of split operations is considered good if the resulting groups each have a high similarity score. The higher the score, the more accurate the generated schema for that group will be. We now define a calculation to measure the similarity among a set of groups, which we call a grouping. A grouping G is defined as:

G = {g1, g2, ..., gn}    (5.7)

We use the notation of lowercase letters to represent groups and uppercase letters to represent groupings. The similarity of a grouping G is then defined as:

simGroup(G) = 0,                            if ∃ g ∈ G : sim(g) < T
simGroup(G) = ( Σ_{g ∈ G} sim(g) ) / |G|,   otherwise                    (5.8)

Here, T is the given similarity threshold that a group must satisfy, where 0 ⩽ T ⩽ 1. What this equation says is that if all the groups in the grouping meet the similarity threshold, then the similarity score of the grouping as a whole is just the average of the similarity scores of the groups. However, if one of the groups does not meet the threshold, then the similarity score of the grouping is just 0.

The reason for introducing a similarity threshold is that a scenario would otherwise arise where a grouping with one large group of low similarity and many small groups of high similarity would still have an overall high grouping similarity score, because each group is weighted equally. The large number of small groups artificially brings up the average.
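For reference, the group and grouping scores of equations 5.6 and 5.8 can be sketched in the same illustrative Python style as before (again, the names are ours and this is not the thesis implementation); sim_grouping returns 0 as soon as any group falls below the threshold T.

def sim_group(group):
    """Equation 5.6: similarity of a group of flattened JSON objects,
    computed as |intersection of key sets| / |union of key sets|."""
    key_sets = [set(obj) for obj in group]
    common = set.intersection(*key_sets)   # keys present in every object
    total = set.union(*key_sets)           # keys present in at least one object
    return len(common) / len(total)


def sim_grouping(grouping, threshold):
    """Equation 5.8: 0 if any group misses the threshold T, otherwise the
    average of the group similarity scores."""
    scores = [sim_group(g) for g in grouping]
    if any(s < threshold for s in scores):
        return 0.0
    return sum(scores) / len(scores)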
Another option was to use a weighted average, where the score considers the number of objects in each group; however, this calculation has the opposite issue, where a grouping with one large group of high similarity would drown out small groups of low similarity. Introducing a similarity threshold gave two advantages:

1. All groups in the grouping must have a satisfactory similarity score.

2. Fewer groupings have to be considered when computing the best grouping. If a group does not satisfy the similarity threshold, then we know it is not part of the best grouping.

5.4 Refining the Scoring Criteria

Building off these definitions, we can now refine our criteria for selecting the best grouping. As discussed in the problem statement, the best grouping is the set of splits that results in the fewest possible groups, such that each group has a similarity score over a given threshold. Of all the possible groupings, we find the best grouping by applying these three filtering criteria in the given order:

1. Filter out groupings whose simGroup score is below the specified threshold.

2. Filter out the groupings that do not have the fewest number of groups.

3. Filter out the groupings that do not have the highest similarity score.

Criterion one removes all invalid groupings. Criterion two then removes all the groupings that are likely to overfit the data to the groups (i.e. it removes groupings from keys like id that result in one group per object). Criterion three then chooses the grouping with the highest similarity score. Although unlikely, it is possible for more than one grouping to meet all three criteria. In this case, any of the resulting groupings can be used as the basis for the schema decision tree.

Table 5.1 shows the criteria being applied to 9 different groupings. While the first 3 groupings have a low number of groups, they do not meet the similarity threshold and are filtered out by the first criterion. Of the remaining groupings, the last 4 do not have the minimum number of groups and are filtered out by the second criterion. The grouping with the higher similarity score is then chosen as the best grouping.

Num. Groups   Similarity Score   Criterion One   Criterion Two   Criterion Three
 1            0.2                ✗               —               —
 5            0.6                ✗               —               —
 7            0.72               ✗               —               —
 8            0.87               ✓               ✓               ✗
 8            0.90               ✓               ✓               ✓
 9            0.86               ✓               ✗               —
15            0.92               ✓               ✗               —
32            0.98               ✓               ✗               —
44            0.99               ✓               ✗               —

Table 5.1: The three scoring criteria applied to 9 different groupings. The similarity threshold for this example is 0.85. A check mark indicates that a grouping survives the given criterion.
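Continuing the illustrative Python sketches (and reusing sim_group and sim_grouping from above), the split operation and the three filtering criteria could be expressed as follows; best_grouping assumes the candidate groupings have already been enumerated, which is the task of the algorithm developed in the next chapter.

from collections import defaultdict


def split(group, key):
    """Split operation: partition a group of flattened objects by the value of `key`;
    objects sharing the same value for the key end up in the same group."""
    parts = defaultdict(list)
    for obj in group:
        parts[obj[key]].append(obj)
    return list(parts.values())


def best_grouping(candidates, threshold):
    """Apply the three criteria of section 5.4, in order, to candidate groupings."""
    # Criterion 1: drop groupings whose simGroup score is below the threshold
    # (simGroup is 0 whenever any group in the grouping misses the threshold).
    valid = [g for g in candidates if sim_grouping(g, threshold) >= threshold]
    if not valid:
        return None
    # Criterion 2: keep only the groupings with the fewest groups.
    fewest = min(len(g) for g in valid)
    valid = [g for g in valid if len(g) == fewest]
    # Criterion 3: of those, take a grouping with the highest similarity score.
    return max(valid, key=lambda g: sim_grouping(g, threshold))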
Chapter 6

Algorithm

Based on the calculations and scoring criteria defined in the previous chapter, an algorithm can now be developed that takes an array of JSON objects and computes the best grouping. One simple approach could be to use a brute force method and first generate every possible grouping. The filtering criteria could then be applied to narrow this list of groupings down to only the best one. Generating all possible groupings could be done by initially placing all objects into a single group and applying the split operation on each key common to all of them. This process then repeats for any group with a similarity score below the threshold until every group is above it, or there are no more keys left to split upon. The possible groupings are then generated by taking all the combinations of the different split operations, such that no two sibling split operations are included (i.e. if two different split operations are applied to the same group, a grouping cannot contain groups from both splits, as that would duplicate the objects).

The results of this approach can be visualized as a tree structure exemplified by figure 6.1. Each layer of this tree alternates between two types of nodes.

• GroupNodes (denoted by gi) that represent a group containing a subset of all the objects.

• SplitNodes (denoted by si) that represent the groups generated by applying the split operation on a specific key.

In figure 6.1, g1 is the initial group containing all the objects. Among these, two keys are common to all the objects, and a split operation is applied for each one. Splitting on s1 partitions g1 into g2, g3, and g4, whereas splitting on s2 partitions g1 into g5 and g10. g5 is then further split based on s3 and s4. Assuming this is the final tree, three possible groupings can be generated from it. (g2, g3, g4) is generated by splitting on s1; (g6, g7, g10) is generated by taking all the groups in s3 and the right group of s2; and (g8, g9, g10) is generated by taking all the groups in s4 and the right group of s2. Note that (g8, g9) alone cannot be a valid grouping, as it does not contain all the objects in g1. This is because g1 was split into g5 and g10, but the grouping (g8, g9) is only derived from g5. Furthermore, the grouping (g2, g3, g4, g10) is invalid because both (g2, g3, g4) and (g5, g10) contain the same objects—just partitioned into different groups. Any combination of these groups then results in objects being duplicated.

Figure 6.1: Visualization of the alternating layers of GroupNodes and SplitNodes (the root group g1 is split by s1 into g2, g3, g4 and by s2 into g5, g10; g5 is further split by s3 into g6, g7 and by s4 into g8, g9).

While the brute force approach does generate the desired results, more efficient algorithms can be developed based around the same GroupNode/SplitNode tree structure. Instead of first generating all valid groupings and then applying the scoring criteria, we instead integrate the criteria into the generation process itself. This allows us to preemptively filter out groupings that we know will not be the best grouping and reduce the amount of computation required.

To increase the efficiency of the algorithm, we use the optimization that once a grouping is found whose simGroup score is above the similarity threshold, any other partially created grouping that already contains a larger number of groups can be preemptively discarded. The reason for this is that the second scoring criterion is minimizing the number of groups. Thus, if the size of a grouping is already larger than a previously found valid grouping, it cannot be the best one. By considering this optimization, entire branches/subtrees do not have to be generated.

In order to incorporate this optimization, we introduce an upper bound variable into our algorithm that keeps track of the maximum number of groups a grouping can have. When the algorithm is generating the GroupNode/SplitNode tree structure, it can compare the current number of groups to the upper bound. If the number of groups is greater, the algorithm can preemptively stop recursively splitting down the current branch. Furthermore, if better results are found mid-computation, the upper bound can be lowered to reflect the new maximum.

Another optimization to improve efficiency is to first examine the splits containing the fewest number of groups. The reason for this is that those splits will initially give the best chance of lowering the upper bound the furthest.
For example, if two groupings exist containing 2 and 20 groups respectfully, it makes sense to first examine the split containing only 2 groups. Applying a split operation to a group always results in an least one group. Thus, the grouping with 20 groups can only ever result in more then 20 groups. By first examining the grouping with 73 only 2 groups, we run the chance of finding a valid grouping with a size under 20. If this occurs, the grouping with 20 groups can then be discarded without computing it, as it will never be the minimum. On the other hand, if we had first examined the grouping with 20 groups, all of the work would be rendered useless once the grouping with 2 groups is examined. This optimization of starting with the smallest groupings can be thought of as working up from a lower bound. By using the combination of lower and upper bounds, an algorithm can work up from the lower bound while simultaneously trying to lower the upper bound. Once the two bounds meet, the best grouping has been found. More details regarding the specifics of this method are now explained in the pseudocode shown in the following sections. Section 6.1 first describes the initialization of the algorithm. Section 6.2 shows the algorithms for initializing a new GroupNode and generating the possible groupings from it. Section 6.3 then shows the algorithms for initializing a new SplitNode and generating the groupings from it. Based on the best groupings generated by the algorithm, we then construct a schema decision tree. This is shown in section 6.5. Section 6.6 then shows how to integrate a schema decision tree into the JSON schema specification. 6.1 Starting the Algorithm When the algorithm is started, a single GroupNode is created and initialized with all the objects of the array. It is assumed that each object has already been converted into its path notation representation. In addition to the objects, an empty list is also passed in for the splitKeys parameter. This list represents that no split operations have been applied to the group yet. A function call to the nodes genGroupings function is then performed to generate the best groupings. An initial upper bound of infinity is passed in as an argument—representing that no best group- 74 ings have yet to be generated. Pseudocode for this startup procedure is given in the following algorithm. Algorithm 1: Initializing and Starting the Algorithm Input: objects– List of JSON objects in path notation. Output: List containing the best groupings. root ← A new GroupNode initialized with objects and an empty list for splitKeys. 2 results ← root.genGroupings(∞) 1 6.2 GroupNode Algorithm The purpose of a GroupNode is to represent a set of objects. When a split operation is performed on a GroupNode, the result is a set of GroupNodes. Generating the possible groupings for a GroupNode consists of performing a split operation on each key common to all objects in the node. The set of possible groupings is then the union of possible groupings generated for each split operation. 6.2.1 Initializing a New GroupNode Algorithm 2: GroupNode: Initialization Input: objects– List of JSON objects in path notation. splitKeys– List of keys that have been used to filter all the array objects down to this group. this.objects ← objects 2 this.splitKeys ← splitKeys 3 this.similarity ← sim(objects) 1 • Lines 1–2: Initialize objects and splitKeys to those passed into the function. • Line 3: Calculate the similarity score of the objects in this group. 
This is performed once on initialization rather then computing it every time the similarity score is needed. 75 6.2.2 Generating Groupings for a GroupNode Algorithm 3: GroupNode: genGroupings() Input: upperBound– Max number of groups this group can be split into. Output: List containing the best groupings. Global : threshold– Minimum similarity requirement. if this.similarity >= threshold then 2 Return 1 and a list containing a list with this group in it. 3 end 4 groupings ← empty-list, splits ← empty-list T 5 splitOptions ← ( J∈this.objects J) – this.splitKeys 6 foreach key in splitOptions do 7 Add a new SplitNode to splits passing in this.splitKeys, key, and this.objects. 8 end 9 foreach split in splits do 10 if split.similarity >= threshold ∧ split.numGroups < upperBound then 11 upperBound ← split.numGroups 12 end 13 end 14 foreach split in splits do 15 if split.numGroups > upperBound ∨ split.numGroups = 1 then 16 Remove split from splits. 17 end 18 end 19 Sort splits in ascending order by numGroups. 20 foreach split in splits do 21 splitNum, splitGroupings ← split.genGroupings() 22 if splitNum = 0 then 23 Continue to next iteration. 24 end 25 if splitNum = upperBound then 26 Add splitGroupings to groupings. 27 else if splitNum < upperBound then 28 upperBound ← splitNum 29 groupings ← splitGroupings 30 end 31 Filter groupings to only contains those with the highest similarity score. 32 if groupings is empty then return 0, empty-list 33 else return upperBound, groupings 1 76 • Lines 1–3: If the group itself already has a similarity score over the threshold, there is no need to further apply the split operation. Return a list containing a list with this group inside it. This represents that there is one possible grouping consisting of all the objects in one group. If the group does not have a similarity score over the threshold, then the group has to be split further. In this case, continue on with the algorithm. • Lines 5–8: Determine the possible keys to split on by finding the set of keys that are common to all the objects. Remove keys that have already been split on to reach this point. For each of these keys, generate a new SplitNode. • Lines 9–13: Check to see if any of the newly created SplitNodes satisfy the similarity threshold without needing to be split any further. Of the ones that due, find the one with the fewest groups, and set the number of groups in it as the new upper bound. • Lines 14–18: Remove any SplitNode that results in more groups than the upper bound. These can be discarded as a better best grouping has already been found. Also, remove any SplitNode that results in only one group, as there is no reason to split on a path that results in the same group. • Line 19: Sort the remaining SplitNodes in ascending order by the number of groups in the SplitNode. This results in the following loop first examining the SplitNode with the fewest groups. It then works its way up to the SplitNode with the most groups. This method increases the likelihood of lowering the upper bound the most, allowing more SplitNodes to be preemptively discarded. • Line 21: Generate the possible groupings of the given SplitNode by calling its genGrouping function. Pseudocode for this algorithm is given in algorithm 77 5. This function returns the number of groups in the possible groupings and a list of the groupings. • Lines 22–24: A return value of 0 for splitNum indicates that all valid groupings that can be generated by the current SplitNode have more groups then the current upper bound. 
All results from this SplitNode can be discarded as a better best grouping has already been found. • Lines 25–26: If the SplitNode has resulted in groupings with the same number of groups as the upper bound, add these to the list of current groupings. • Lines 27–29: If the SplitNode has resulted in groupings containing fewer grounds then the current upper bound, discard the current list of groupings, and set the list to the list of newly generated groupings. Lower the upper bound to the value of splitNum to reflect the new best grouping. • Line 31: Filter out groupings that do not have the highest similarity score. This satisfies criteria 3 of the scoring criteria. • Lines 32–33: If all the groupings with a similarity score over the threshold have more groups then the upper bound, return 0 and an empty list to indicate that this GroupNode cannot generate a better best grouping. Otherwise, return the new upper bound and the new groupings. 6.3 SplitNode Algorithm The purpose of a SplitNode is to represent the GroupNodes generated when applying the split operation on a specific GroupNode for a given key. Generating the possible groupings for a SplitNode consists of generating the different combinations of the groups resulting from the split. For each group, either the group 78 itself can be included, if it satisfies the similarity threshold, or the group can be partitioned further with the resulting groups being included. 6.3.1 Initializing a New SplitNode Algorithm 4: SplitNode: Initialization Input: splitKeys– List of keys that have filtered all of the array objects down to the group this split operation is being applied on. key– Key to split objects on. objects– List of JSON objects in path notation. this.splitKeys ← splitKeys 2 this.key ← key 3 this.objects ← objects 4 this.groups ← empty-list 5 foreach group in partition(objects, key) do 6 Add a new GroupNode to this.groups—passing in group and the concatenation of splitKeys with key. 7 end 8 this.numGroups ← |groups| 9 this.similarity ← simGroup(this.groups) 1 • Lines 1–3: Initialize splitKeys, key, and objects to the corresponding value passed into the function. • Line 4: Initialize the groups variable to en empty list. • Lines 5–7: Partition the objects based on the value of the given key. This means that all objects having the same value for the given key form a group. Create a new GroupNode for each of the resulting groups—passing in the objects of the new group, along with a list of the past split keys concatenated with the new key. • Line 8: Set numGroups to the number of groups in the partition. • Line 9: Calculate the similarity score for the set of groups. This is performed once on initialization and used whenever the similarity score is required. 79 6.3.2 Generating Groupings for a SplitNode Algorithm 5: SplitNode: genGroupings() Input: upperBound– Maximum number of groups that any group in this split can be partitioned into. Output: List containing the best groupings. Global : threshold– Minimum similarity requirement. 
if this.numGroups > upperBound then 2 return 0, empty-list 3 end 4 if this.numGroups = upperBound ∧ this.similarity >= threshold then 5 return upperBound, this.groups 6 end 7 numAccounted ← this.numGroups 8 groupGroupings ← empty-list 9 foreach group in this.groups do 10 bound ← upperBound − numAccounted + 1 11 splitNum, validGroupings ← group.genGroupings(bound) 12 if splitNum = 0 then 13 return 0, empty-list 14 end 15 numAccounted ← numAccounted + splitNum − 1 16 Append validGroupings to groupGroupings 17 end 18 groupings ← empty-list 19 Add a list of groups to groupings for each permutation of groups in groupGroupings. 20 return numAccounted, groupings 1 • Lines 1–3: If the number of groups in this SplitNode is already greater then the upper bound, return 0 and an empty list to indicate that no better grouping can be generated. • Lines 4–6: If the number of groups in this SplitNode is equal to the upper bound, then return a grouping consisting of the groups in this SplitNode. • Line 7: numAccounted keeps a running tally of the number of groups already accounted for. Initially, this is set to the number of groups in the SplitNode 80 as splitting a group can only in more groups. • Line 8: groupGrouping stores the possible groupings for each of the groups in the SplitNode. If a group itself is above the similarity threshold, groupGrouping just stores the group. If the group is below the threshold, it stores the best groupings for that group. For example, the first element of groupGrouping represents the possible groupings for the first group of the SplitNode, the second element represents the possible groupings for the second group, etc. • Line 10: Calculate the new upper bound for the group currently being examined. This is done by taking the existing upper bound and subtracting the number of groups already allocated to the previously examined groups. Add 1 since one group is already accounted for (i.e. taking one group and splitting it into two groups only yields one additional group). For example, if the upper bound is ten and the first group examined requires three groups to get above the similarity threshold, any partition of second group can only consist of a maximum of seven groups. • Lines 12–14: A return value of 0 for splitNum from the function indicates that the group cannot be partitioned into a grouping that is above the similarity threshold without exceeding the upper bound. Because of this, return 0 and an empty list indicating that this SplitNode cannot yield better results. • Lines 15–16: Update the number of groups accounted for and add the list of possible groupings to groupGrouping. • Lines 18-20: groupGrouping now contains a list of possible groupings for each group. To construct the possible groupings for the SplitNode, create a grouping for each combination of groups. This is done by choosing one grouping option for each group in the SplitNode. 81 6.4 Algorithm Remarks A benefit of having the root node contain all objects in the array is that minimal computation is required for homogeneous arrays (ie. arrays where the objects all have the same structure). This is because the root node itself will have a similarity score over the threshold, and thus, return itself as the best grouping right away. Because of this, there is little overhead for checking most arrays. Only if the array is heterogeneous does the algorithm start splitting the group. Another note about the algorithm is that it is possible for multiple best groupings to be returned; however, this is rare. 
The algorithm only returns multiple best groupings if two groupings have the same number of groups and the same similarity scores. In this scenario, either grouping can be used as the basis for the schema decision tree. 6.5 Constructing a Schema Decision Tree Given an best grouping, a schema decision tree can be generated using the list of splitKeys contained within each GroupNode. The basic idea behind the construction process is to start with an empty decision tree, and add each group to the tree one at a time. Adding a group to the tree consists of starting at the initial decision node and iterating through the GroupNode’s splitKeys list. Every time the group was partitioned during the generation process, the key used to partition it was appended to its splitKeys list. This means that the list is, in a sense, ordered; the first key in every splitKeys list is the key used to partition the initial GroupNode that contained all the objects. For each key in a group’s splitKeys, if the key (and corresponding value) is already present in the tree, then the group moves down the tree to the next decision node. If the key is not present in the tree, a new branch is added to the current de82 cision node with the give key. If this was the last key in the list, the branch points to a schema generated for the objects in the group. Otherwise, the branch points to another decision node. Figure 6.2 shows an example of the construction process for the 4 groups shown in table 6.1. Initially, the schema decision tree starts with an empty decision node as shown in 6.2a. In figure 6.2b, the first group, g1 , is added to the tree by taking the first key in its splitKeys list and checking if it is already present at the decision node. Since it is not present, a new branch is created pointing to another decision node. This process repeats for the second key, except this time, the branch points to a schema, as it is the last key in the list. This schema is generated by passing the objects of the group into a traditional schema generation tool. When adding g2 to the tree in figure 6.2c, the group sees that the first key is already present at the initial decision node. It instead uses the existing branch to move down the tree rather then creating a duplicate branch. Once this process has been performed on every group, the final schema decision tree is generated (as shown in figure 6.2e). Group splitKeys list from the GroupNode g1 Key1 :A, Key2 :1 g2 Key1 :A, Key2 :2 g3 Key1 :B g4 Key1 :C Table 6.1: A list of the groups in the best grouping, along with their corresponding splitKeys list. 83 (a) Initial decision node. Key2 :1 A Key 1: Sc1 (b) Adding g1 to the schema decision tree. Key2 :1 Sc1 Key2 :2 Sc2 A Key 1: (c) Adding g2 to the schema decision tree. Key2 :1 Sc1 Key2 :2 Sc2 A Key 1: Key1 :B Sc3 (d) Adding g3 to the schema decision tree. Key2 :1 Sc1 Key2 :2 Sc2 A Key 1: Key1 :B Key : 1 C Sc3 Sc4 (e) Adding g4 to the schema decision tree. Figure 6.2: Constructing a schema decision tree for the groups in table 6.1. 6.6 Integrating into JSON Schema Based on the schema decision tree constructed in figure 6.2, we now show how it can be integrated into a JSON schema document. As the actual document would span multiple pages and be difficult to read, we instead include snippets showcasing the two main components. These components are the definition section where individual schemas are stored, and decision nodes specifying how to move down the tree. 
84 First, the definition section of a JSON schema document is designed for defining sub-schemas. These can then be referenced throughout the rest of document using the notation ”$ref”: ”#/definitions/SubSchemaName”. As a schema is generated for each type of structure, we choose to store these schemas in this section and reference them throughout the document. The purpose for this is to break up the document into more manageable pieces and improve readability. Figure 6.3 contains the definition section for two sub-schemas named Schema1 and Schema2. Schema1 is defined in lines 2–17, and schema2 is defined in lines 18–33. Lines 3 and 19 specify that each schema is for a JSON object. Lines 4–12 and 20– 28 then contain a list of properties that an object must contain to satisfy the given schema. The identification keys are specified in lines 5–10 and 21–26. Their corresponding enum value specifies the value of the identification key for the given schema. Lines 11 and 27 are where the rest of the schema for the given structure type would go. Finally, lines 13–16 and 29–32 specify that the identification keys are required to appear in the object. Additional keys can be included here depending on what the rest of the schema consists of. 85 1 ” definitions ”: { 2 ”Schema1 ” : { 3 ” type ” : ” o b j e c t ” , 4 ” properties ”: { 5 ” key1 ” : { 6 ”enum ” : [ ”A” ] , 7 }, 8 ” key2 ” : { 9 ”enum ” : [ ” 1 ” ] , 10 }, 11 ... 12 }, 13 ” required ” : [ 14 ” key1 ” , 15 ” key2 ” 16 ], 17 }, 18 ”Schema2 ” : { 19 ” type ” : ” o b j e c t ” , 20 ” properties ”: { 21 ” key1 ” : { 22 ”enum ” : [ ”A” ] , 23 }, 24 ” key2 ” : { 25 ”enum ” : [ ” 2 ” ] , 26 }, 27 ... 28 }, 29 ” required ” : [ 30 ” key1 ” , 31 ” key2 ” 32 ], 33 } , Figure 6.3: The section of the resulting JSON schema document containing the definitions for the different sub-schemas tailored to the different types of structure. 86 Second, each decision node is implemented using a combination of the anyOf and allOf syntaxes. The anyOf syntax specifies that at least one of the following schemas has to be satisfied for the data to be valid. Likewise, the allOf syntax specifies that all of the following schemas must be satisfied for the data to be valid. By nesting these within each other, we are able to group validation conditions together. Figure 6.4 shows the schema for the second decision node in figure 6.2. For any JSON object to be valid, it has to satisfy either the schema defined in lines 2–17 or the schema defined in lines 18–33. Both of these schemas then consist of an allOf syntax with two further schemas defined. Lines 4–10 and 20–26 are first used to perform a lookup on the given identification key. If its value matches that specified in the enum, it moves on to the second part of the allOf syntax in lines 13–15 and 29–31. This schema consists of a reference that sends the document on to either another decision node or one of the schemas defined in the definition section. 
 1 "anyOf": [
 2   {
 3     "allOf": [
 4       {
 5         "type": "object",
 6         "properties": {
 7           "key2": {
 8             "enum": ["1"]
 9           }
10         },
11         "required": ["key2"]
12       },
13       {
14         "$ref": "#/definitions/Schema1"
15       }
16     ]
17   },
18   {
19     "allOf": [
20       {
21         "type": "object",
22         "properties": {
23           "key2": {
24             "enum": ["2"]
25           }
26         },
27         "required": ["key2"]
28       },
29       {
30         "$ref": "#/definitions/Schema2"
31       }
32     ]
33   }
34 ]

Figure 6.4: Section of the resulting JSON schema document containing a decision node.

Chapter 7

Evaluation and Analysis

7.1 Algorithm Walkthrough

To showcase the logic behind the algorithm, we now work through the process the algorithm takes when partitioning a dataset. This dataset consists of an array of 50 JSON objects returned from the iTunes Search API. To simplify the problem for explanation purposes, only 5 structure types were included in the dataset; however, the process is the same regardless of the number of structure types. For this example, we assume a similarity threshold of 0.70.

To begin, all objects of the array are placed into a single group. As this group has a similarity score of 0.2 and a simGroup score of 0, we know that the split operation has to be applied at least once. Common to all the objects are 8 keys, resulting in 8 SplitNodes being generated—one for each key. Table 7.1 shows what key each SplitNode was generated from, how many groups resulted from the split, what the average similarity of the groups was, and what the simGroup score was.

Looking at this table, we see that the splits based on collectionExplicitness and wrapperType each resulted in only a few groups being generated; however, they do not satisfy the similarity threshold. Of the options that do satisfy the threshold, collectionPrice has the lowest number of groups at 18. As such, this grouping becomes the current best grouping, and the upper bound is reduced to 18. Any grouping having more than 18 groups can be discarded. Two scenarios now exist. Scenario one is that the groups generated by splitting on either wrapperType or collectionExplicitness can be further split into fewer than 18 groups that all satisfy the threshold. In this case, the results from collectionPrice can be discarded, as better groupings have been found. Scenario two is that none of the groups generated by splitting on wrapperType or collectionExplicitness can be split in this way. In this case, collectionPrice is the best grouping.

Split Key                Num. Groups   Average Similarity   SimGroup Score
collectionPrice          18            0.905                0.905
artistName               35            0.986                0.986
releaseDate              41            0.984                0.984
collectionExplicitness    3            0.461                0
artworkUrl60             46            0.988                0.988
wrapperType               2            0.672                0
primaryGenreName         24            0.972                0.972
artworkUrl100            44            0.988                0.988

Table 7.1: A list of split operations for a single group containing all objects.
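The per-key statistics in tables 7.1 and 7.2 could be reproduced along the following lines, reusing split, sim_group, and sim_grouping from the earlier illustrative sketches; the helper below and the wrapperType output noted in the comment are shown only as an assumed example consistent with table 7.1, not as the measurement code used in this thesis.

def split_statistics(group, threshold):
    """For each key common to all objects in the group, report the number of groups,
    the average group similarity, and the simGroup score produced by splitting on it
    (the three columns of tables 7.1 and 7.2)."""
    common_keys = set.intersection(*(set(obj) for obj in group))
    stats = {}
    for key in sorted(common_keys):
        groups = split(group, key)
        scores = [sim_group(g) for g in groups]
        stats[key] = (len(groups), sum(scores) / len(scores), sim_grouping(groups, threshold))
    return stats

# stats = split_statistics(objects, threshold=0.70)
# stats["wrapperType"]  ->  (2, 0.672, 0), matching the wrapperType row of table 7.1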
The only way to determine which scenario is true involves further applying the split operation to the groups generated by wrapperType and collectionExplicitness. We first start with wrapperType, as it has the fewest groups. Splitting on wrapperType results in two groups. One has an average similarity of 0.391, and the other has an average similarity of 0.952. As the second group already satisfies the threshold, it does not have to be split further; only the first group does. Furthermore, since the combination of the 2 groups has an upper bound of 18 groups, the first group can, at most, be split into 17 groups. Because of this, 17 can be thought of as the new upper bound when examining that group.

Table 7.2 shows the different split operations that can be applied to this group. Of note is the increase in options for split keys. In addition to the unused keys carried down from table 7.1, additional keys now exist. This is due to the existence of keys that are common to all the objects of this group but not to all the objects in general. As such, they did not appear in the first table but do in this table.

Looking at this table, we see that 6 split options have resulted in fewer than 18 groups. Of these, splitting on kind has resulted in only 4 groups that are all over the threshold. Before we can say this is the new best grouping and return to the previous recursive layer, we need to check two other options. Both trackExplicitness and collectionExplicitness resulted in fewer groups but did not meet the similarity threshold. Thus, we need to verify that they cannot be further split.

We first check to see if trackExplicitness can be split into at most 4 groups. This grouping consists of 3 groups having average similarity scores of 0.537, 0.600, and 0.914. Both the first and second groups need to be split further, as they are currently below the threshold. However, doing this results in at least 5 groups, which already exceeds the 4-group upper bound. As such, splitting on trackExplicitness cannot result in a better grouping than we have already found. Likewise, splitting on collectionExplicitness also results in 3 groups having average similarity scores of 0.537, 0.600, and 0.914. For the same reasons as with trackExplicitness, splitting on collectionExplicitness cannot result in a better grouping either. Thus, no groupings exist with fewer than 4 groups, meaning that splitting on kind gives the current best grouping.

Split Key                Num. Groups   Average Similarity   SimGroup Score
trackName                28            0.977                0.977
trackViewUrl             30            0.984                0.984
trackExplicitness         3            0.684                0
collectionPrice           9            0.939                0.939
trackCensoredName        28            0.977                0.977
artistName               23            0.978                0.978
kind                      4            0.928                0.928
trackId                  30            0.984                0.984
trackPrice                6            0.933                0.933
releaseDate              25            0.976                0.976
artworkUrl30             24            0.978                0.978
collectionExplicitness    3            0.681                0
artworkUrl60             24            0.978                0.978
primaryGenreName         17            0.965                0.965
artworkUrl100            24            0.978                0.978

Table 7.2: A list of different split operations for one of the groups generated by the wrapperType split.

Going back up a layer of recursion to table 7.1, we had originally split the initial group on wrapperType into two groups. The second group already satisfied the threshold, and we just found that the first group can be split into 4 groups. This means that a valid grouping now exists with 5 groups, and the upper bound can be further lowered from 18 down to 5. The only remaining option left to check is now collectionExplicitness. It currently has 3 groups with average similarity scores of 0.295, 0.600, and 0.486. As all these groups require at least one further split operation, the minimum number of groups we could receive is 6. As such, we know that collectionExplicitness cannot result in a better split than we have already found.

At this point, we have explored every option that could result in fewer groups, meaning that the lower bound has reached the upper bound. Thus, the best grouping is generated by first splitting on wrapperType and then splitting one of the groups on kind.

7.2 Evaluation

We are interested in examining two main areas.
The first area is the time it takes the algorithm in chapter 6 to generate the best grouping, and the second area is the time it takes to validate a JSON array using a schema decision tree. To analyze these areas, we run our algorithm against three different datasets and analyze the results in the following sections. Dataset one is based on the iTunes Search API that has been used as a running example throughout this thesis; dataset two is based on the Open Movie Database API, and dataset three is based on the Spotify Search API. Web APIs were the main source of data due to their ability to generate dataset of differing sizes with relative ease. For each dataset, we present two graphs. The first graph analyzes the runtime of the algorithm for various similarity thresholds and array sizes. This is done in figures 7.2, 7.5, and 7.8. The second graph analyzes how long it it takes JSON arrays of varying sizes to be validated. This is done in figures 7.3, 7.6, and 7.9. In each of these graphs, each array was validated using one of three different methods. Method one consists of a single schema. This schema was generated by taking the schemas created for the different structure types and arranging them in a linear order. In this sense, the schema for each structure type can be thought of as a subschema. An object is validated against the schema by comparing it to the first 93 sub-schema. If that sub-schema does not validate the object, it is compared against the second sub-schema. This process repeats until the object either satisfies one of the sub-schemas or is rejected. Method two again consists of a single schema. Instead of arranging the subschemas in a linear order, this schema integrates the schema decision tree into it. This is done using the approach outlined in section 6.6. The sub-schemas used in this tree are the same sub-schemas used in the previous method. One major drawback with the JSON schema format is that is does not support variables. This is important because it prevents the full potential of the schema decision tree from being used. Instead of performing a single key lookup and reusing the value, the schema in method two can only check to see if a key has a specific value. Because of this, a key lookup has to be performed for each value an identification key can have. Based on this drawback, method three consists of a program that models the schema decision tree. For each object, the program first performs a lookup on the identification key and stores the result in a variable. This variable is then compared against the possible values the identification key can have to determine which branch of the tree it should take. If the branch leads to another decision node, the process repeats. If the branch leads to a schema, then that schema is used to validate the object. This method differs from method two in that there is no longer a single schema document. Instead, the sub-schemas are kept separate, and the program decided which schema an object should be compared against. All experiments were performed on a 2014 MacBook Air containing a 1.4 GHz Intel Core i5 processor and 8GB memory. Each data point consists of the average time of 10 runs. Each dataset was also verified to contain all structure types. The generated schema decision trees are based on the best groupings found using a similarity threshold of 0.7. This number was chosen based on observations during 94 the algorithm development process. 
We find that it gives a good balance between determining the best grouping while still allowing the possibility of some variation within a particular structure type. 7.2.1 Generating the Datasets For each of the three APIs we are using in our analysis, we first generated a dataset consisting of 15000 JSON objects obtained from heterogeneous arrays in the API’s responses. As we are only focusing on JSON arrays, we only extracted the array from the response and discarded the rest. Table 7.3 shows the API endpoint URLs used for each of the datasets. For the iTunes Search API, we used the search endpoint and the term parameter. The value of this parameter is the search query, and the API will return results that it thinks are relevant to it. No authentication is required to use this API Dataset API Endpoint URL iTunes Search API • https://itunes.apple.com/search?term= OMDB API • http://www.omdbapi.com/?s= &apikey= &page= • http://www.omdbapi.com/?i= &apikey= Spotify Search API • https://api.spotify.com/v1/search?q= &type=album,track,artist,playlist,show,episode Table 7.3: The API Request Endpoint(s) used for each dataset. For the OMDB API, we used two endpoints. The first is the search endpoint, and it returns information that the API thinks is relevant to the given term. Each response only contains an array of 10 objects with more being available through subsequent requests. This is done by including the optional page parameter. For 95 example, setting page to 2 gives the next 10 results. This endpoint only returns a few value for each object in the response, namely, a title, year, ID, type, and poster. To get the rest of the information for each item, we performed an addition request to their ID endpoint that returns the full JSON object for a given ID. An API key is required to use this API, and can be obtained for free through their website. This key is passed in through the apikey parameter in each request. For the Spotify Search API, we used the search endpoint and the q parameter to pass in a search query term. In addition, we also specified a type parameter. This parameter tells the API what type of information we want in our results. As we are collecting a dataset of all structure types, we list all options so that we receive information for each type. An API key is required to use this endpoint, and can be obtained for free through their website. This key is passed in as a HTTP header for each API request. Table 7.4 shows the number of request made to each API, as well as the number of objects returned in a response. For the iTunes Search API, we made 300 requests and received 50 objects back per response (50 is the default number returned). Dataset API Requests Objects Per Response JSON Objects iTunes Search API 300 50 15000 OMDB API 27 23–3080 15000 Spotify Search API 50 300 15000 Table 7.4: An overview of the number of requests made to each API, and the number of objects returned for each request. For the OMDB API, we made 17 requests to the search API and received between 23 and 3080 objects per response. The term blue gave us the largest at 3080, and the term sponge gave us the smallest at 23. Finally, for the Spotify Search API, 96 we made 50 requests and received 300 objects per response (50 objects for each of the 6 types we listed in the request). Search terms for all APIs were made to ensure that we got a variety of data that included all structure types. 
Furthermore, when selecting which data to pass into our algorithm, we randomly chose n objects from the dataset and then verified that all structure types were included. If not, we repeated the process until this condition was satisfied.

7.2.2 iTunes Search API Dataset

The iTunes Search API is a web API that returns information regarding products available in the iTunes store [1]. Figure 7.1 shows the generated schema decision tree, which consists of two identification keys. Objects are first split based on the wrapperType key. Objects having the value track are then split again based on the kind key.

Figure 7.1: Schema decision tree generated from the iTunes Search API dataset. Objects are first routed on the wrapperType key; objects with the value track are routed again on the kind key, leading to the thirteen tailored schemas S1–S13.

7.2.2.1 Schema Generation Time Comparison

Figure 7.2: Comparing the computation time of the iTunes Search API dataset for various similarity thresholds (T) and various input array sizes (x-axis: number of objects in the array, 0–1,000; y-axis: computation time in seconds).

Figure 7.2 shows how long it took the algorithm in chapter 6 to generate the best grouping for various similarity thresholds and input array sizes. Setting the threshold to 0.2 results in a single group containing all the objects. Because of this, the only computation involved is a single similarity calculation, resulting in a near-constant runtime. When the threshold is increased to 0.35, the best grouping consists of four groups generated by splitting only on wrapperType. A similarity threshold of 0.5 results in six groups generated by first splitting on wrapperType and then further splitting one of the groups based on trackExplicitness. Finally, a similarity threshold of 0.7 results in the groups used to generate the schema decision tree in figure 7.1.

7.2.2.2 Schema Validation Time Comparison

Figure 7.3: Comparing the validation time of the iTunes Search API dataset for three validation methods (linear search schema, schema decision tree schema, and schema decision tree program).

Figure 7.3 now shows the time it took to validate arrays of varying sizes using the three methods. Based on this graph, we see a significant improvement when the schema integrates the schema decision tree compared to when the schema arranges the sub-schemas in a linear order. This is likely due to a few factors. First, there is a significant overlap of keys among the different structures. This means that the time it takes to invalidate an object increases, as the overlapping part may be checked before the non-overlapping part. Second, having a decision node allows multiple schemas to be skipped over.

Comparing the schema implementation of the schema decision tree to the program implementation, we see another improvement. This is likely due to the reduction in key lookups that have to be performed. For example, in the second decision node, only a single key lookup has to be performed to determine which schema the object should be compared against. In the schema implementation, up to 10 key lookups may have to be performed—one for each sub-schema.
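Method three, the program that models the schema decision tree, could be structured roughly as follows. This is a sketch only: the tree layout, key values, and placeholder sub-schemas below are illustrative rather than the exact structures used in the experiments, and the third-party Python jsonschema package is assumed as the underlying validator (the thesis does not prescribe a particular one).

from jsonschema import validate, ValidationError   # third-party "jsonschema" package

# Placeholder sub-schemas standing in for the generated, tailored schemas.
SONG_SCHEMA      = {"type": "object", "required": ["trackName", "trackPrice"]}
PODCAST_SCHEMA   = {"type": "object", "required": ["trackName", "primaryGenreName"]}
AUDIOBOOK_SCHEMA = {"type": "object", "required": ["collectionPrice", "artistName"]}

# A decision node stores an identification key and one branch per observed value;
# a branch is either another decision node or a sub-schema.
DECISION_TREE = {
    "key": "wrapperType",
    "branches": {
        "track": {"key": "kind",
                  "branches": {"song": SONG_SCHEMA, "podcast": PODCAST_SCHEMA}},
        "audiobook": AUDIOBOOK_SCHEMA,
    },
}

def validate_with_tree(obj, node=DECISION_TREE):
    """Walk the decision tree with one key lookup per decision node, then validate
    the object against the single tailored sub-schema it reaches."""
    try:
        while "branches" in node:
            node = node["branches"][obj[node["key"]]]   # single lookup chooses the branch
        validate(instance=obj, schema=node)
        return True
    except (KeyError, ValidationError):                 # unknown value or failed schema
        return False

Unlike the anyOf/allOf encoding of figure 6.4, the value read from the identification key is reused directly here, so each decision node costs one lookup regardless of how many branches it has.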
7.2.3 Open Movie Database (OMDb) API Dataset

Dataset two is based on the Open Movie Database API [73]. This is a web API that provides information about different movies and television series—such as actors, producers, etc. Figure 7.4 shows the resulting schema decision tree. Compared to the tree generated for the iTunes dataset, this tree is relatively simple, with only a single identification key and three structure types.

Figure 7.4: Schema decision tree generated from the Open Movie Database dataset (a single decision node on the Type key, with branches for movie, series, and episode leading to schemas S1–S3).

7.2.3.1 Schema Generation Time Comparison

Figure 7.5: Comparing the computation time of the Open Movie Database dataset for various similarity thresholds (T) and various input data sizes.

Figure 7.5 shows the time it took the algorithm to generate the best grouping. When the similarity threshold was set to 0.2, 0.35, or 0.5, all objects were placed into a single group. Because of this, they all have near-identical execution times. When the similarity threshold was 0.7, a single split based on Type occurred, resulting in three groups.

7.2.3.2 Schema Validation Time Comparison

Figure 7.6: Comparing the validation time of the Open Movie Database API dataset for three validation methods.

Figure 7.6 shows the time it took to validate arrays of varying sizes using the three methods. We see that the schema decision tree version only had slightly reduced validation times. We believe the reason for this is due to two factors. First, only having three structure types means that an object can be invalidated relatively quickly. Second, the schema decision tree version still has to perform a key lookup to determine the identification key's value. Comparing the program implementation of the schema decision tree to the other two, we see a significant improvement. This supports the second factor being a major contributor to the similar runtimes of the two schema versions.

7.2.4 Spotify Search API Dataset

The third dataset was generated from the Spotify Search API [74]. Like the iTunes Search API, this API returns information related to different songs, artists, shows, etc. available on the Spotify platform. Figure 7.7 shows the resulting schema decision tree. As in the previous dataset, this decision tree consists of a single identification key, here with six structure types. One note for this dataset is that Spotify does not combine different types of structure into a single array. Instead, an API response contains an array for each type of structure (i.e. an array just for objects of the artist structure type, an array just for objects of the album structure type, etc.). As each structure also contains a key identifying its type, we still felt this was a useful dataset for evaluating the schema decision tree concept. Thus, the dataset was generated by combining the different arrays into a single array.
Figure 7.7: Schema decision tree generated from the Spotify Search API dataset (a single decision node on the type key, with branches for episode, show, playlist, track, artist, and album leading to schemas S1–S6).

7.2.4.1 Schema Generation Time Comparison

Figure 7.8: Comparing the computation time of the Spotify Search API dataset for various similarity thresholds (T) and various input data sizes.

Figure 7.8 shows the time it took to generate the best grouping for various similarity thresholds and various input array sizes. All similarity thresholds resulted in roughly equal execution times due to there being very little overlap among the different types of structures. For instance, only four keys are common to all the objects.

7.2.4.2 Schema Validation Time Comparison

Figure 7.9: Comparing the validation time of the Spotify Search API dataset for three validation methods.

Figure 7.9 shows the time it took to validate arrays of varying sizes using the three methods. Looking at the graph, we see very similar validation times between the two schema implementations; the slight reduction is comparable to that seen for the schemas in the OMDB dataset (figure 7.6), and we believe it is due to similar reasons as discussed there. Unlike the OMDB dataset, however, the validation times for the two schemas in the Spotify dataset are nearly identical. This is likely due to very little overlap among the structure types. Because of this, an object can be invalidated in roughly the same time it takes to check the value of the identification key. This is shown when looking at the validation times for the program implementation. The program used the same sub-schemas but sees a significant improvement due to no longer having to check each value for an identification key.

7.3 Runtime Complexity Analysis

The challenge with determining the runtime complexity of the algorithm is that it is largely dependent on the structure of the data passed into it. Let J be the number of JSON objects in an array and n be the maximum number of keys in one of the objects. The best case scenario occurs when the initial group (containing all the objects) already has a similarity score above the threshold. In this case, no split operations have to be performed. Calculating the runtime complexity of the algorithm is then simply a matter of calculating the runtime complexity of the similarity calculation. As each JSON object is treated as a set of keys, this problem is equivalent to determining the complexity of the intersection and union of J sets. Both of these operations can be done in O(Jn) time through the use of data structures that have constant-time key lookups (e.g. hash sets in Java). This process works by treating the first set as the base set. All other J − 1 sets are then iterated through. Each key within a set is compared against the base set to see if it is present. In the case of the union operation, the key is added to the base set if it is not already present. Once all the sets have been iterated over, the base set contains the union.
In the case of the intersection operation, an integer count is associated with each key in the base set and is incremented when a lookup on that key is performed. After iterating over all the sets, the intersection of the sets consists of the keys in the base set having a count equal to the number of sets. As each set has at most n keys, the runtime complexity of the similarity calculation is O(Jn). Based on this, the runtime complexity of a single split operation can now be calculated. For one split operation, there is a cost of O(J) to partition all the objects into their corresponding groups. Calculating the similarity score of the resulting groups then has a cost of O(Jn) as all the objects of the original group are still present. Adding these two complexities together gives a final complexity of O(Jn). 107 When we are dealing with the same dataset, we can treat n as a constant, as objects of the same structure type have roughly the same number of keys. With this assumption, the runtime complexity of the entire algorithm would then be dependent on the number of objects in the array and the number of split operations performed. We can again assume that the number of split operations will be a constant value due to the same groups being generated every time. If we let s be the number of split operations performed for a given dataset, we end up with a linear runtime complexity of O(Jns) with J and s being constants. This is shown in figures 7.2, 7.5, and 7.8, where all datasets have a near linear runtime. Trying to define s in terms of J and n is a significantly more challenging problem. As such, we leave it as a future research direction and instead discuss some of the major factors that would need to be considered. The main issue is that multiple factors influence how many split operations are performed. The first factor is comparing the size of the intersection to the size of the union for a given group. This is important because these are the two components of the similarity score. Maximizing the number of splits for a group involves maximizing the size of the intersection. If the intersection is too large, the groups similarity score will already be above the threshold, and thus, does not need to be split further. This means that the most split operations occur when the similarity score is just underneath the threshold. The issue with only considering this is that a split operation always results in at least two groups—both having a higher similarity score. If the original group’s similarity score was just underneath the threshold, it is very likely that the resulting groups will have similarity scores over the threshold, and thus, no further split operations are performed. This is largely due to the fact that the size of the union set will shrink, as objects containing unique keys are separated into different groups. 108 Consequently, there is a trade-off between having a high initial similarity score that results in many split operations occurring early on, and having a low initial similarity score that results in many layers of recursion. In essence, the first option results in a wide but short tree, whereas the second option results in a thin but tall tree. The second factor that influences the number of split operations is how many groups are generated for each split. To maximize the number of splits, it is ideal to have a large number of groups that can each be split further. However, each split has to partition the same number of objects. 
When we are dealing with the same dataset, we can treat n as a constant, as objects of the same structure type have roughly the same number of keys. With this assumption, the runtime complexity of the entire algorithm depends on the number of objects in the array and the number of split operations performed. We can again assume that the number of split operations will be a constant value, since the same groups are generated every time. If we let s be the number of split operations performed for a given dataset, we end up with a runtime complexity of O(Jns), which is linear in J when n and s are treated as constants. This is shown in figures 7.2, 7.5, and 7.8, where all datasets have a near-linear runtime.

Trying to define s in terms of J and n is a significantly more challenging problem. As such, we leave it as a future research direction and instead discuss some of the major factors that would need to be considered. The main issue is that multiple factors influence how many split operations are performed.

The first factor is the size of the intersection compared to the size of the union for a given group. This is important because these are the two components of the similarity score. Maximizing the number of splits for a group involves maximizing the size of the intersection. If the intersection is too large, however, the group’s similarity score will already be above the threshold, and thus the group does not need to be split further. This means that the most split operations occur when the similarity score is just underneath the threshold. The issue with considering only this is that a split operation always results in at least two groups, both having a higher similarity score. If the original group’s similarity score was just underneath the threshold, it is very likely that the resulting groups will have similarity scores over the threshold, and thus no further split operations are performed. This is largely because the size of the union set shrinks as objects containing unique keys are separated into different groups.

Consequently, there is a trade-off between having a high initial similarity score, which results in many split operations occurring early on, and having a low initial similarity score, which results in many layers of recursion. In essence, the first option results in a wide but short tree, whereas the second option results in a thin but tall tree.

The second factor that influences the number of split operations is how many groups are generated for each split. To maximize the number of splits, it is ideal to have a large number of groups that can each be split further. However, each split has to partition the same number of objects. This results in either many groups containing a few objects each or a few groups containing many objects each. The importance of this consideration is that the maximum number of split operations that can be recursively performed on a large group is greater than on a small group, because a split operation always results in groups with fewer objects. As such, the number of objects in a group influences the maximum number of layers of recursion that could be applied. Similar to the first factor, initially generating many groups in a split results in a wide but short tree, whereas generating a few groups with many objects results in a thin but tall tree. In both factors, the solution that results in the most split operations is likely somewhere between the two extremes.

Chapter 8

Conclusion

The JavaScript Object Notation (JSON) document format is one of the prominent formats for modeling semi-structured data. Because of its popularity, the idea of a schema document emerged as a method of validating the structure and content of other JSON documents. While the general problem of schema generation has been previously explored, the problem of schema generation for heterogeneous JSON arrays has not been adequately addressed. Existing schema generation tools have worked off the assumption that all JSON objects in an array should have the same structure. As a result, these tools only generate a single schema that combines all objects together, regardless of whether they have different types of structures.

This thesis looks to address this problem by instead generating a schema for each of the structure types found in a JSON array. As an array is designed to be processed as a single unit, objects in a heterogeneous array usually contain a set of keys whose purpose is to identify each object’s structure. This allows programs processing the array to know how to proceed for each object. In order to detect these identification keys, we design an algorithm that recursively partitions the objects of an array based on a split operation we define. In essence, this operation chooses a key and partitions the objects based on their values for that key; objects having the same value are placed in the same group. Using the identification keys generated by our algorithm, we build a schema decision tree to help in the validation process.

8.1 Future Work

8.1.1 Validating Identification Keys

The main challenge addressed in this thesis is determining which keys are used for identification purposes. To do this, we work off the assumption that the groups generated by splitting on an identification key will have a greater similarity score than groups generated by splitting on a non-identification key. While we believe this assumption to be largely valid, it is still possible for the algorithm to return a grouping that split on a non-identification key. Whether this occurs depends predominantly on the structure and values of the data passed into the algorithm.

One idea to help detect when this occurs is to adopt an approach commonly used in machine learning and statistics, where the input data is first partitioned into two sets: the first set is used to train the model, and the second set is used to test its accuracy [75]. Using this idea, the objects of a given JSON array could first be partitioned into two sets. The first set would be used to generate the schema decision tree, and the objects of the second set would then be passed through the tree to see if they are validated correctly. If one of the objects is not validated, then either the resulting grouping applied the split operation on a non-identification key, or the object represents an entity that was not part of the training data set. Determining which scenario is true could be based on how many objects were not validated: if many objects were not valid, the first scenario is likely; if only a few were not valid, the second is more likely.
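A minimal sketch of this hold-out idea is given below. It is purely illustrative: SchemaDecisionTree, TreeBuilder, and failureRate are hypothetical names standing in for the tree construction and validation procedures described in this thesis.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class HoldoutValidation {

    // Hypothetical interface standing in for the schema decision tree: validate
    // returns true if the object matches the sub-schema selected by the tree.
    interface SchemaDecisionTree {
        boolean validate(Map<String, Object> object);
    }

    // Hypothetical builder that constructs the tree from a training set of objects.
    interface TreeBuilder {
        SchemaDecisionTree build(List<Map<String, Object>> trainingObjects);
    }

    // Splits the array into a training and a test set, builds the tree on the
    // training portion, and reports the fraction of test objects that fail
    // validation. A high failure rate suggests a split on a non-identification
    // key; a low rate suggests a few structure types absent from the training data.
    public static double failureRate(List<Map<String, Object>> objects,
                                     double trainFraction, TreeBuilder builder) {
        List<Map<String, Object>> shuffled = new ArrayList<>(objects);
        Collections.shuffle(shuffled);

        int trainSize = (int) (shuffled.size() * trainFraction);
        List<Map<String, Object>> train = shuffled.subList(0, trainSize);
        List<Map<String, Object>> test = shuffled.subList(trainSize, shuffled.size());

        SchemaDecisionTree tree = builder.build(train);

        long failures = test.stream().filter(obj -> !tree.validate(obj)).count();
        return test.isEmpty() ? 0.0 : (double) failures / test.size();
    }
}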
8.1.2 Dynamic Similarity Threshold

The algorithm presented in this work assumes a static similarity threshold. This, however, runs the risk of the threshold either being too low, and not detecting all identification keys, or being too high, and having non-identification keys become part of the best grouping. Whether this occurs is mainly dependent on the dataset. A different approach could instead be to make the threshold dynamic. This would allow the threshold to take on different values for different datasets, or even to change during the algorithm itself.

This idea of a dynamic threshold could also be used in conjunction with the validation method discussed above to better refine the results. If it were determined that the best grouping includes a non-identification key, the dynamic threshold could be lowered to see if more accurate results can be found. Likewise, if the best grouping validates all the objects in the test data, the dynamic threshold could be raised to see if more identification keys could be detected. Furthermore, the algorithm itself could potentially be extended to return groupings at different similarity thresholds.

8.1.3 Expanded API Study

For the analysis done in this thesis, we focused on datasets generated by three web APIs, namely, the iTunes Search API, the Open Movie Database API, and the Spotify Search API. While these datasets do showcase how our algorithm performs on real-world data, we think it would be beneficial to perform a more in-depth analysis on a larger sample size. In particular, the three data sources for our analysis are focused on the music and entertainment sectors. We would like to expand this to include other industries and vendors, as they might use the JSON format in different ways.

Furthermore, the focus of our work was on heterogeneous JSON arrays, and as such, we only generated the schema for that part of the JSON documents. Another area we would like to look at in the future is expanding our work to identification keys in general, rather than focusing only on identification keys present in arrays. For example, Amazon Web Services (AWS) allows users to define Identity and Access Management (IAM) policies through JSON documents. Over time, the format for these documents has evolved, and now two different versions exist. Users can declare which version of the format they are using through a version key in the root of the JSON document that contains either the date 2008-10-17 or 2012-10-17. This key acts like a unique identifier for the JSON document even though the data is not part of a heterogeneous array. We think it would be very beneficial to expand our work to also cover scenarios like this, as it would open up more data sources that we could use for analysis.

Bibliography

[1] Apple Inc., “iTunes Search API,” 2019. Accessed on 02-11-2019.
[2] Quicktype, “Json schema generator,” 2019. Accessed on 18-11-2019.
[3] Google Inc., “Google Trends,” 2019. Accessed on 04-11-2020.
[4] Stack Overflow Inc., “Stack Overflow Trends,” 2019. Accessed on 04-11-2020.
[5] JSONschema.net, 2012. Accessed on 18-11-2019.
[6] A. Silberschatz, H. Korth, and S. Sudarshan, Database System Concepts. USA: McGraw-Hill, Inc., 6 ed., 2011.
[7] R. Bahta and M. Atay, “Translating json data into relational data using schema-oblivious approaches,” in Proceedings of the 2019 ACM Southeast Conference, ACM SE ’19, (New York, NY, USA), p. 233–236, Association for Computing Machinery, 2019.
[8] “MongoDB.” https://www.mongodb.com/. Accessed: 2020-03-28.
[9] K. Chodorow and M. Dirolf, MongoDB: The Definitive Guide. O’Reilly Media, Inc., 2010.
[10] E. F. Codd, “A relational model of data for large shared data banks,” Commun. ACM, vol. 13, p. 377–387, June 1970.
[11] V. Dhar, “Data science and prediction,” Commun. ACM, vol. 56, p. 64–73, Dec. 2013.
[12] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[13] G. S. Almasi and A. Gottlieb, Highly Parallel Computing. USA: Benjamin-Cummings Publishing Co., Inc., 1989.
[14] S. Abiteboul, “Querying semi-structured data,” in Proceedings of the 6th International Conference on Database Theory, ICDT ’97, (Berlin, Heidelberg), p. 1–18, Springer-Verlag, 1997.
[15] P. Buneman, “Semistructured data,” in Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’97, (New York, NY, USA), p. 117–121, Association for Computing Machinery, 1997.
[16] S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML. USA: Morgan Kaufmann Publishers, 2000.
[17] C. C. Shilakes and J. Tylman, “Enterprise information portals,” Nov 1998. https://web.archive.org/web/20110724175845/http://ikt.hia.no/perep/eip_ind.pdf.
[18] World Wide Web Consortium (W3C), Extensible Markup Language (XML) 1.0, 2 1998. Available at https://www.w3.org/TR/1998/REC-xml-19980210.html.
[19] D. C. Denison, “The road to xml: Adapting sgml to the web,” World Wide Web Journal, vol. 2, pp. 5–12, 1997.
[20] J. Bollen, A cognitive model of adaptive web design and navigation. PhD thesis, Vrije Universiteit Brussel, 10 2001. Available at https://www.cs.odu.edu/~jbollen/diss_pdfs/chapter2.pdf.
[21] V. Bush and J. Wang, “As we may think,” Atlantic Monthly, vol. 176, pp. 101–108, 1945.
[22] Hypertext, Merriam Webster Dictionary. Merriam Webster Inc. Available at https://www.merriam-webster.com/dictionary/hypertext. Accessed: 2020-05-12.
[23] D. Raggett, J. Lam, I. Alexander, and M. Kmiec, Raggett on HTML 4 (2nd Ed.). USA: Addison-Wesley Longman Publishing Co., Inc., 1998.
[24] ISO Central Secretary, “Information processing — Text and office systems — Standard Generalized Markup Language (SGML),” Standard ISO 8879:1986(en), International Organization for Standardization, Geneva, CH, 1986.
[25] C. F. Goldfarb, The SGML Handbook. USA: Oxford University Press, Inc., 1991.
[26] HTML Working Group, “HyperText Markup Language Specification - 2.0,” Standard RFC 1866, Internet Engineering Task Force (IETF), Fremont, USA, 1995.
[27] E. R. Harold, W. S. Means, and L. Petrycki, XML in a Nutshell: A Desktop Quick Reference. USA: O’Reilly Associates, Inc., 3 ed., 2004.
[28] S. J. DeRose, The SGML FAQ Book: Understanding the Foundation of HTML and XML. USA: Kluwer Academic Publishers, 1997.
[29] World Wide Web Consortium (W3C), XML Path Language (XPath) 1.0, 11 1999. Available at https://www.w3.org/TR/1999/REC-xpath-19991116/.
[30] World Wide Web Consortium (W3C), Namespaces in XML 1.0 (Third Edition), 12 2009. Available at https://www.w3.org/TR/REC-xml-names/.
[31] World Wide Web Consortium (W3C), W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures, 4 2012. Available at https://www.w3.org/TR/xmlschema11-1/.
[32] U. Ogbuji, “Principles of xml design: When to use elements versus attributes,” Mar 2004.
[33] Douglas Crockford, “The JSON Saga,” 2011. [Online] Available: https://www.youtube.com/watch?v=-C-JoyNuQJs, last accessed on 10/23/2019.
[34] D. Crockford, How JavaScript Works. USA: Virgule-Solidus LLC, 2018.
[35] C. Severance, “Discovering javascript object notation,” Computer, vol. 45, pp. 6–8, Apr. 2012.
[36] “ECMAScript 2019 Language Specification,” Standard ECMA-262, ECMA International, Geneva, CH, 1995.
[37] Douglas Crockford, “Introducing JSON.” [Online] Available: https://www.json.org/, last accessed on 04/23/2020.
[38] ECMA International, The JSON Data Interchange Syntax, 12 2017. Available at http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf.
[39] J. J. Garrett, “Ajax: A new approach to web applications,” Feb 2005.
[40] N. C. Zakas, J. McPeak, and J. Fawcett, Professional Ajax, 2nd Edition. GBR: Wrox Press Ltd., 2007.
[41] L. Bassett, “Introduction to javascript object notation: A to-the-point guide to json,” (USA), O’Reilly Associates, Inc., 2015.
[42] JSON Schema Org, “Json schema,” 2019. Accessed on 12-11-2019.
[43] F. Pezoa, J. L. Reutter, F. Suarez, M. Ugarte, and D. Vrgoč, “Foundations of json schema,” in Proceedings of the 25th International Conference on World Wide Web, WWW ’16, (Republic and Canton of Geneva, CHE), p. 263–273, International World Wide Web Conferences Steering Committee, 2016.
[44] B. Kommadi, Learn Data Structures and Algorithms with Golang. Birmingham, UK: Packt Publishing, March 2019.
[45] G. J. Bronson, A First Book of C++. Boston, MA, USA: Course Technology Press, 4th ed., 2011.
[46] K. B. Bruce, Foundations of Object-Oriented Languages: Types and Semantics. Cambridge, MA, USA: MIT Press, 2002.
[47] A. Nierman and H. V. Jagadish, “Evaluating structural similarity in xml documents,” vol. 2, pp. 61–66, 2002.
[48] S. Wierzchon and M. Klopotek, Modern Algorithms of Cluster Analysis. Cham, CH: Springer International Publishing, 2018.
[49] J. L. C. Izquierdo and J. Cabot, “Discovering implicit schemas in json data,” in Proceedings of the 13th International Conference on Web Engineering, ICWE ’13, (Berlin, Heidelberg), pp. 68–83, Springer-Verlag, 2013.
[50] J. Canovas Izquierdo and J. Cabot, “JSONDiscoverer: Visualizing the schema lurking behind JSON documents,” Knowledge-Based Systems, vol. 103, pp. 52–55, 04 2016.
[51] M. Klettke, U. Störl, and S. Scherzinger, “Schema extraction and structural outlier detection for json-based nosql data stores,” 03 2015.
[52] W. Spoth, T. Xie, O. Kennedy, Y. Yang, B. Hammerschmidt, Z. H. Liu, and D. Gawlick, “Schemadrill: Interactive semi-structured schema design,” in Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA ’18, (New York, NY, USA), pp. 11:1–11:7, ACM, 2018.
[53] M.-A. Baazizi, D. Colazzo, G. Ghelli, and C. Sartiani, “Counting types for massive json datasets,” in Proceedings of The 16th International Symposium on Database Programming Languages, DBPL ’17, (New York, NY, USA), Association for Computing Machinery, 2017.
[54] M.-A. Baazizi, D. Colazzo, G. Ghelli, and C. Sartiani, “Parametric schema inference for massive json datasets,” The VLDB Journal, vol. 28, pp. 497–521, Aug 2019.
[55] M.-A. Baazizi, H. Ben Lahmar, D. Colazzo, G. Ghelli, and C. Sartiani, “Schema Inference for Massive JSON Datasets,” in Extending Database Technology (EDBT), (Venise, Italy), Mar. 2017.
[56] M. DiScala and D. J. Abadi, “Automatic generation of normalized relational schemas from nested key-value data,” in Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, (New York, NY, USA), pp. 295–310, ACM, 2016.
[57] M. Piernik, D. Brzezinski, T. Morzy, and A. Lesniewska, “Xml clustering: a review of structural approaches,” The Knowledge Engineering Review, vol. 30, no. 3, p. 297–323, 2015.
[58] T. Dalamagas, T. Cheng, K.-J. Winkel, and T. Sellis, “A methodology for clustering xml documents by structure,” Inf. Syst., vol. 31, p. 187–228, May 2006.
[59] A. Tagarelli and S. Greco, “Semantic clustering of xml documents,” ACM Trans. Inf. Syst., vol. 28, Jan. 2010.
[60] J. Tekli and R. Chbeir, “A novel xml document structure comparison framework based-on sub-tree commonalities and label semantics,” Web Semant., vol. 11, p. 14–40, Mar. 2012.
[61] R. Nayak and S. Xu, “Xcls: A fast and effective clustering algorithm for heterogenous xml documents,” in Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD ’06, (Berlin, Heidelberg), p. 292–302, Springer-Verlag, 2006.
[62] P. Antonellis, C. Makris, and N. Tsirakis, “Xedge: Clustering homogeneous and heterogeneous xml documents using edge summaries,” in Proceedings of the 2008 ACM Symposium on Applied Computing, SAC ’08, (New York, NY, USA), p. 1081–1088, Association for Computing Machinery, 2008.
[63] A. Algergawy, E. Schallehn, and G. Saake, “A schema matching-based approach to xml schema clustering,” in Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, iiWAS ’08, (New York, NY, USA), pp. 131–136, ACM, 2008.
[64] M. L. Lee, L. H. Yang, W. Hsu, and X. Yang, “Xclust: Clustering xml schemas for effective integration,” in Proceedings of the Eleventh International Conference on Information and Knowledge Management, CIKM ’02, (New York, NY, USA), p. 292–299, Association for Computing Machinery, 2002.
[65] M. Weis and F. Naumann, “Detecting duplicate objects in xml documents,” in Proceedings of the 2004 International Workshop on Information Quality in Information Systems, IQIS ’04, (New York, NY, USA), pp. 10–19, ACM, 2004.
[66] S. Puhlmann, M. Weis, and F. Naumann, “Xml duplicate detection using sorted neighborhoods,” in Proceedings of the 10th International Conference on Advances in Database Technology, EDBT ’06, (Berlin, Heidelberg), p. 773–791, Springer-Verlag, 2006.
[67] X. L. Wang and X. Wang, “An efficient approximate emst algorithm for color image segmentation,” in Machine Learning and Data Mining in Pattern Recognition (P. Perner, ed.), (Cham), pp. 147–159, Springer International Publishing, 2018.
[68] S. Lu, “A tree-to-tree distance and its application to cluster analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 219–224, April 1979.
[69] K. Zhang and D. Shasha, “Simple fast algorithms for the editing distance between trees and related problems,” SIAM Journal on Computing, vol. 18, pp. 1245–1262, Feb 1989.
[70] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom, “Change detection in hierarchically structured information,” SIGMOD Rec., vol. 25, pp. 493–504, June 1996.
[71] T. Akutsu, T. Mori, T. Tamura, D. Fukagawa, A. Takasu, and E. Tomita, “An improved clique-based method for computing edit distance between unordered trees and its application to comparison of glycan structures,” in Proceedings of the 2011 International Conference on Complex, Intelligent, and Software Intensive Systems, CISIS ’11, (USA), p. 536–540, IEEE Computer Society, 2011.
[72] T. Mori, T. Tamura, D. Fukagawa, A. Takasu, E. Tomita, and T. Akutsu, “A clique-based method using dynamic programming for computing edit distance between unordered trees,” Journal of computational biology: a journal of computational molecular cell biology, vol. 19, pp. 1089–104, 10 2012.
[73] Brian Fritz, “The Open Movie Database API,” 2020. Accessed on 05-23-2020.
[74] Spotify Inc., “Spotify Search API,” 2020. Accessed on 05-23-2020.
[75] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated, 2014.