[0001] The present invention relates to the encoding of XML (Extensible Markup Language) documents and, in particular, to at least one of the compression, streaming, searching and dynamic construction of XML documents.
[0002] To make streaming, downloading and storing MPEG-7 descriptions more efficient, the description can be encoded and compressed. An analysis of a number of issues relating to the delivery of MPEG-7 descriptions has involved considering the format to be used for binary encoding. Existing encoding schemes for XML, including the WBXML proposal from WAP (the Wireless Application Protocol Forum), the Millau algorithm and the XMill algorithm, have each been considered.
[0003] With WBXML, frequently used XML tags, attributes and values are assigned a fixed set of codes from a global code space. Application specific tag names, attribute names and some attribute values that are repeated throughout document instances are assigned codes from some local code spaces. WBXML preserves the structure of XML documents. The content as well as attribute values that are not defined in the Document Type Definition (DTD) can be stored in-line or in a string table. It is expected that tables of the document's code spaces are known to the particular class of applications or are transmitted with the document.
[0004] While WBXML tokenizes tags and attributes, there is no compression of the textual content. This is probably sufficient for the Wireless Markup Language (WML) documents proposed for use under the WAP, for which WBXML is designed, since such documents usually have limited textual content. However, WBXML is not considered to be a very efficient encoding format for typical text-laden XML documents. The Millau approach extends the WBXML encoding format by compressing text using a traditional text compression algorithm. Millau also takes advantage of the schema and datatypes to enable better compression of attribute values that are of primitive datatypes.
[0005] The authors of the XMill algorithm have presented an even more complex encoding scheme, although it was not based on WBXML. Apart from separating structure and text encoding and using type information in DTD and schema for encoding values of built-in datatypes, that scheme also:
[0006] (i) grouped elements of the same or related types into containers (to increase redundancy),
[0007] (ii) compressed each container separately using a different compressor,
[0008] (iii) allowed atomic compressors to be combined into more complex ones, and
[0009] (iv) allowed the use of new specialized compressors for highly specialized datatypes.
[0010] Nevertheless, existing encoding schemes are only designed for compression. They do not support the streaming of XML documents. In addition, elements still cannot be located efficiently using the XPath/XPointer addressing scheme and a document cannot be encoded incrementally as it is being constructed.
[0011] In accordance with one aspect of the present disclosure, there is provided a method of communicating at least part of a structure of a document described by a hierarchical representation, said method comprising the steps of:
[0012] identifying said representation of said document;
[0013] packetizing said representation into a plurality of data packets, said packets having a predetermined size, said packetizing comprising creating at least one link between a pair of said packets, said link representing an interconnection between corresponding components of said representation; and
[0014] forming said data packets into a stream for communication wherein said links maintain said representation within said packets.
[0015] In accordance with another aspect of the present disclosure, there is provided a method of communicating at least part of the structure of a document described by a hierarchical representation, said method comprising the steps of:
[0016] identifying at least one part of said representation and packetizing said parts into at least one packet of predetermined size, characterised in that where any one or more of said parts of said representation do not fit within one said packet, defining at least one link from said one packet to at least one further said packet into which said non-fitting parts are packetized, said link maintaining the hierarchical structure of said document in said packets.
[0017] In accordance with another aspect of the present disclosure, there is provided a method of facilitating access to the structure of an XML document, said method comprising the steps of:
[0018] identifying a hierarchical representation of said document;
[0019] packetizing said representation into a plurality of packets of predetermined packet size;
[0020] forming links between said packets to define those parts of said representation not able to be expressed within a packet thereby enabling reconstruction of said representations after de-packetizing.
[0021] The presently disclosed encoding and decoding schemes separate structure and text encoding and use the schema and datatypes for encoding values of built-in datatypes. In addition, the disclosure provides support for streaming and allows efficient searching using an XPath/XPointer-like addressing mechanism. It also allows an XML document to be encoded and streamed as it is being constructed. These features are important for broadcasting and mobile applications. The presently disclosed encoding scheme also supports multiple namespaces and provides EBNF definitions of the bitstream and a set of interfaces for building an extensible encoder.
[0022] One or more embodiments of the present invention will now be described with reference to the drawings and Appendix, in which:
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033] Appendix provides a definition useful for the encoded bitstream and the parameters thereof.
[0034] The methods of encoding and decoding XML documents to be described with reference to FIGS.
[0035] The computer system
[0036] The computer module
[0037] Typically, the application program is resident on the hard disk drive
[0038] In operation the XML document encoding/decoding functions are performed on one of the server computer
[0039] The methods of encoding and decoding may alternatively be implemented in part or in whole by dedicated hardware such as one or more integrated circuits performing the functions or sub functions of encoding and/or decoding. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.
[0040] Encoding and Compressing XML
[0041] Separating Structure and Text
[0042] Traditionally, XML documents are mostly stored and transmitted in their raw textual format. In some applications, XML documents are compressed using some traditional text compression algorithms for storage or transmission, and decompressed back into XML before they are parsed and processed.
[0043] According to the present disclosure, another way for encoding an XML document is to encode the tree hierarchy of the document (such as the DOM representation of the document). The encoding may be performed in a breadth-first or depth-first manner. To make the compression and decoding more efficient, the XML structure, denoted by tags within the XML document, can be separated from the text of the XML document and encoded. When transmitting the encoded document, the structure and the text can be sent in separate streams or concatenated into a single stream.
[0044] As seen in
[0045] The approach shown in
[0046] In general, the volume of structural information is much smaller than that of textual content. Structures are usually nested and repeated within a document instance. Separating structure from text allows any repeating patterns to be more readily identified by the compression algorithm which, typically, examines the input stream through a fixed-size window. In addition, the structure and the text streams have rather different characteristics. Hence, different and more efficient encoding methods may be applied to each of the structure and text.
[0047] The structure is critical in providing the context for interpreting the text. Separating structure and text in an encoder allows the corresponding decoder to parse the structure of the document more quickly thereby processing only the relevant elements while ignoring elements (and descendants) that it does not know or require. The decoder may even choose not to buffer the text associated with any irrelevant elements. Whether the decoder converts the encoded document back into XML or not depends on the particular application to be performed (see the discussion below on Application Program Interfaces—API's).
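As a minimal sketch of the separation described above (not the disclosed bitstream format), the following Python walks a document depth-first using the standard library's xml.etree.ElementTree and emits a structure stream and a text stream. The function name split_structure_and_text and the tuple-based event layout are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

def split_structure_and_text(xml_source):
    """Walk the tree depth-first and return (structure, text) streams.

    structure holds ("start", tag), ("text", index) and ("end", tag)
    events, where index points into the separate text list.
    """
    structure, text = [], []

    def visit(elem):
        structure.append(("start", elem.tag))
        if elem.text and elem.text.strip():
            text.append(elem.text.strip())
            structure.append(("text", len(text) - 1))
        for child in elem:
            visit(child)
            # Text that follows a child element belongs to the parent.
            if child.tail and child.tail.strip():
                text.append(child.tail.strip())
                structure.append(("text", len(text) - 1))
        structure.append(("end", elem.tag))

    visit(ET.fromstring(xml_source))
    return structure, text

structure, text = split_structure_and_text(
    "<doc><title>Overview</title><para>Hello <b>XML</b> world</para></doc>"
)
```

A decoder interested only in the hierarchy could scan the structure list and never touch the text list, which is the benefit the paragraph above attributes to the separation.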
[0048] Code Tables
[0049] The elements of a document description and their attributes are defined in DTD's or schemas. Typically, a set of elements and their associated attributes are repeatedly used in a document instance. Element names as well as attribute names and values can be assigned codes to reduce the number of bytes required to encode them.
[0050] Typically, each application domain uses a different set of elements and types defined in a number of schemas and/or DTD's. In addition, each schema or DTD may contain definitions for a different namespace. Even if some of the elements and types are common to multiple classes of applications, they are usually used in a different pattern. That is, an element X, common to both domains A and B, may be used frequently in domain A, but rarely in domain B. In addition, existing schemas are updated and new schemas are created all the time. Hence, it is best to leave the code assignment to organisations that oversee interoperability in their domains. For instance, MPEG-7 descriptions are XML documents. MPEG may define the codespaces for its own descriptors and description schemes as well as external elements and types that are used by them. MPEG may also define a method for generating codespaces. Ideally, the method should be entropy based, that is, based on the number of occurrences of the descriptors and description schemes in a description or a class of description (see the section on generating codespaces).
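One simple way to approximate an occurrence-based code assignment is to rank names by frequency and give the most frequent names the smallest codes. The sketch below assumes that approach; the function name assign_codes is hypothetical, and a real codespace generator might instead use variable-length codes such as Huffman codes.

```python
from collections import Counter

def assign_codes(element_names):
    """Assign small integer codes to names, most frequent first.

    Ties are broken alphabetically so the assignment is deterministic.
    """
    counts = Counter(element_names)
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: code for code, name in enumerate(ranked)}

codes = assign_codes(["para", "section", "para", "title", "para", "section"])
```

Here "para" occurs most often and so receives code 0, followed by "section" and then "title".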
[0051] Separating Element and Attributes
[0052] An XML tag typically comprises an element name and a set of attribute name/value pairs. Potentially, a large set of attributes can be specified with an element instance. Hence, separating an element name from the attributes will allow the document tree to be parsed and elements to be located more quickly. In addition, some attributes or attribute name/value pairs tend to be used much more frequently than the others. Grouping attribute name, value and name/value pairs into different sections usually results in better compression.
[0053] Encoding Values of Built-In Datatypes and Special Types
[0054] The encoder operates to encode the values of attributes and elements of built-in (or default) datatypes into more efficient representations according to their types. If the schema that contains the type information is not available, the values are treated as strings. In addition, if a value (for instance, a single-digit integer) is more efficiently represented as a string, the encoder may also choose to treat it as a string and not to encode it. By default, strings are encoded as UTF-8 (Unicode Transformation Format) strings, which provide a standard and efficient way of encoding a string of multi-byte characters. In addition, the UTF string includes length information, avoiding the problem of finding a suitable delimiter and allowing one to skip to the end of the string easily.
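The following sketch illustrates the two ideas in the paragraph above: a length-prefixed UTF-8 string that can be skipped without scanning for a delimiter, and a fallback to the string form when it is no larger than the typed form. The 2-byte length prefix, the 4-byte integer layout and all function names are assumptions for illustration, not the disclosed format.

```python
import struct

def encode_string(value):
    """Encode as a 2-byte big-endian length followed by UTF-8 bytes."""
    data = value.encode("utf-8")
    if len(data) > 0xFFFF:
        raise ValueError("string too long for a 2-byte length prefix")
    return struct.pack(">H", len(data)) + data

def skip_string(buffer, offset):
    """Return the offset just past the string that starts at offset."""
    (length,) = struct.unpack_from(">H", buffer, offset)
    return offset + 2 + length

def encode_value(text, datatype=None):
    """Encode by type when that is smaller; otherwise keep the string."""
    # A 4-byte integer only wins once the text exceeds 2 characters,
    # because the string form costs 2 bytes of length plus the text.
    # This sketch assumes the value fits in a signed 32-bit integer.
    if datatype == "int" and len(text) > 2:
        return struct.pack(">i", int(text))
    return encode_string(text)
```

In a complete encoder the decoder would also need to know which representation was chosen, for example through a type encoder identifier written before the value.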
[0055] Special type encoders can be used for special data types. These special type encoders can be specified using the setTypeEncoder( ) interface of the Encoder API (as discussed below). Information about the special type encoders is preferably stored in the header of the structure segment, advantageously as a table of type encoder identifiers. Further, the default type encoders (for the built-in datatypes) can be overridden using the same mechanism. As such where some built-in data type would ordinarily be encoded using a default encoder, a special encoder may alternatively be used, such necessitating identification within the bitstream that an alternative decoding process will be required for correct reproduction of the XML document. Each encoded value is preceded by the identifier of the type encoder that was used to encode the value.
[0056] In this fashion, an XML document encoder implemented according to the present disclosure may include a number of encoding formats for different types of structure and text within the XML document. Certain encoding formats may be built-in or default and used for well known or commonly encountered data types. Special type encoders may be used for any special data types. In such cases, an identification of the particular type encoder(s) used in the encoding process may be incorporated into the header of a packet, thereby enabling the decoder to identify those decoding processes required to be used for the encoded types in the encoded document. Where appropriate, the particular type encoders may be accessible from a computer network via a Uniform Resource Identifier (URI). Where the decoder is unable to access or implement a decoding process corresponding to an encoded type encountered within a packet in the encoded document, a default response may be to ignore that encoded data, possibly resulting in the reproduction of null data (eg. a blank display). An alternative is where the decoder can operate to fetch the special type decoder, from a connected network, for example using a URI that may accompany the encoded data. The URI of an encoder/decoder format may be incorporated into the table mentioned above and thereby included in the bitstream (see the Appendix).
[0057] In a further extension of this approach, multiple encoding formats may be used for a single data type. For example, text strings may be encoded differently based upon the length of the string, such representing a compromise between the time taken to perform a decoding process and the level of compression that may be obtained. For example, text strings with 0-9 characters may not be encoded, whereas strings with 10-99 and 100-999 characters may be encoded with respective (different) encoding formats. Further, one or more of those encoding formats may be for a special data type. As such the encoder when encoding text strings in this example may in practice use no encoding for 0-9 character strings, a default encoder for 10-99 character strings, and a special encoder for strings having 100 or more text characters.
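The length-based selection described above can be sketched as follows. The one-byte identifiers RAW, DEFAULT and SPECIAL are hypothetical, and two zlib compression levels merely stand in for the "default" and "special" encoders; the point is that the identifier written before each value tells the decoder which process to apply.

```python
import zlib

# Hypothetical one-byte type encoder identifiers.
RAW, DEFAULT, SPECIAL = 0, 1, 2

def encode_text(text):
    """Pick an encoding by string length and prefix its identifier."""
    data = text.encode("utf-8")
    n = len(text)
    if n < 10:
        return bytes([RAW]) + data  # too short to be worth encoding
    if n < 100:
        return bytes([DEFAULT]) + zlib.compress(data, 1)
    return bytes([SPECIAL]) + zlib.compress(data, 9)

def decode_text(encoded):
    """Dispatch on the leading identifier to reverse encode_text."""
    kind, payload = encoded[0], encoded[1:]
    if kind == RAW:
        return payload.decode("utf-8")
    if kind in (DEFAULT, SPECIAL):
        return zlib.decompress(payload).decode("utf-8")
    raise ValueError(f"unknown type encoder identifier {kind}")
```

A decoder that meets an identifier it cannot handle could, as the text suggests, ignore the value or fetch a suitable decoder; this sketch simply raises an error.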
[0058]
[0059] The Structure Segment (or Structure Stream)
[0060]
[0061] Each section
[0062] An ID table section
[0063] A section
[0064] There are sections for the code tables for namespaces
[0065] The local code tables are usually followed by a section containing a table of attribute name/value pairs
[0066] The document hierarchy section
[0067] Apart from using code tables and type encoders for encoding, in most cases, the encoder also compresses each section using a compressor. Instead of compressing each section of the body of the structure segment
[0068] Potentially, a different compressor can be used for each section taking into account the characteristics of the data in each section. Information about the compressors used is provided in the header. The default is to use ZLIB for compressing all the sections in the structure segment as well as the text segment. The ZLIB algorithm generates a header and a checksum that allow the integrity of the compressed data to be verified at the decoder end.
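As an illustration of the default behaviour, the sketch below compresses each section independently with Python's zlib module, whose output carries the ZLIB header and Adler-32 checksum mentioned above. The section names are borrowed from the Section enumeration of the Encoder interface; the sample section contents and function names are invented for the example.

```python
import zlib

def compress_sections(sections):
    """Compress each named section independently with zlib."""
    return {name: zlib.compress(data) for name, data in sections.items()}

def decompress_section(blob):
    """Decompress one section; zlib.error is raised if the data is corrupt."""
    return zlib.decompress(blob)

sections = {
    "ELEMENT_SECT": b"chapter\x00section\x00para\x00" * 20,
    "DOC_HIERARCHY_SECT": bytes([1, 2, 3, 3, 3, 2, 3, 3]) * 20,
}
packed = compress_sections(sections)
```

Because each section is compressed on its own, a decoder can inflate only the sections it needs, and a checksum failure is confined to the affected section.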
[0069] The Text Segment (or Text Stream)
[0070] The text segment
[0071] The Encoder and Decoder Models
[0072] The Encoder Model
[0073]
[0074] The Decoder Model
[0075]
[0076] In most cases, the decoder
[0077] Locating Elements
[0078] XML elements can be referenced and located using ID's or XPath/XPointer fragments. As mentioned earlier, the ID table
[0079] Below are some examples of XPath fragments that can be appended to a Uniform Resource Identifier (URI):
[0080] /doc/chapter[2]/section[3]
[0081] selects the third section of the second chapter of doc
[0082] chapter[contains(string(title),“Overview”)]
[0083] selects the chapter children of the context node that have one or more title children containing the text “Overview”
[0084] child::*[self::appendix or self::index]
[0085] selects the appendix and index children of the context node
[0086] child::*[self::chapter or self::appendix] [position( )=last( )]
[0087] selects the last chapter or appendix child of the context node
[0088] para[@type=“warning”]
[0089] selects all para children of the context node that have a type attribute with value “warning”
[0090] para[@id]
[0091] selects all the para children of the context node that have an id attribute.
[0092] An XPath/XPointer fragment consists of a list of location steps representing the absolute or relative location of the required element(s) within an XML document. Typically, the fragment contains a list of element names. Predicates and functions may be used, as in the examples above, to specify additional selection criteria such as the index of an element within an array, the presence of an attribute, matching attribute value and matching textual content.
[0093] The compactness of the encoded document hierarchy allows it to be parsed (and instantiated) without expanding into a full object tree representation. The fragment address is first translated into an encoded form. One of the consequences of such a translation process is that it allows one to determine immediately whether the required element(s) actually occurred in the document. Matching the components of the encoded fragment address is also much more efficient than matching sub-strings. The design allows simple XPath/XPointer fragments (which are most frequently used) to be evaluated quickly. Searching the document hierarchy first also greatly narrows the scope of subsequent evaluation steps in the case of a more complex fragment address.
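The translation step can be sketched for the simplest kind of fragment, a plain list of element names. In the sketch below the code table is a dictionary, each node is a (code, children) pair, and the names encode_path and find are hypothetical; predicates and functions are not handled.

```python
def encode_path(path, element_codes):
    """Translate a simple /a/b/c path into a tuple of element codes.

    Returns None when any step names an element that is absent from the
    code table, meaning the element cannot occur in the document.
    """
    codes = []
    for step in path.strip("/").split("/"):
        if step not in element_codes:
            return None
        codes.append(element_codes[step])
    return tuple(codes)

def find(tree, encoded_path):
    """Return all nodes matching encoded_path, starting at the root."""
    code, children = tree
    if not encoded_path or code != encoded_path[0]:
        return []
    if len(encoded_path) == 1:
        return [tree]
    matches = []
    for child in children:
        matches.extend(find(child, encoded_path[1:]))
    return matches

element_codes = {"doc": 0, "chapter": 1, "section": 2}
tree = (0, [(1, [(2, []), (2, [])]), (1, [(2, [])])])
```

A path such as /doc/appendix is rejected during translation, before the hierarchy is touched, while /doc/chapter/section is matched by comparing small integers rather than sub-strings.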
[0094] Packetizing the Bitstream for Streaming
[0095] Streaming XML
[0096] Traditionally, XML documents are mostly stored and transmitted in their raw textual format. In some applications, XML documents are compressed using some traditional text compression algorithms for storage or transmission, and decompressed back into XML before they are parsed and processed. Although compression may greatly reduce the size of an XML document, under such circumstances an application still must receive the entire XML document before parsing and processing can be performed.
[0097] Streaming an XML document implies that parsing and processing can start as soon as a sufficient portion of the XML document is received. Such capability will be most useful in the case of a low bandwidth communication link and/or a device with very limited resources.
[0098] Because an ordinary XML parser expects an XML document to be well-formed (ie. having matching and non-overlapping start-tag and end-tag pairs), the parser can only parse the XML document tree in a depth-first manner and cannot skip parts of the document unless the content of the XML document is reorganized to support it.
[0099] Packetizing the Bitstream
[0100] Encoding an XML document into a complete structure segment
[0101] Apart from the need for processing a document while it was being delivered, an encoder/decoder typically has an output/input buffer of fixed size. Accordingly, except for very short documents, the encoder
[0102] For each structure packet
[0103] If an ID, an element name, an attribute name, or an attribute value is longer than a pre-defined length, it will be encoded in a text packet and a string locator rather than the actual string will appear in the tables.
[0104] The document hierarchy section of a structure packet contains a sequence of nodes. Each node has a size field that indicates its (encoded) size in bytes including the total size of its descendant nodes encoded in the packet. The node can be an element node, a comment node, a text node or a node locator. Each node has a nodeType field that indicates its type.
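The value of the size field is that a decoder can step over an entire sub-tree without parsing it. The sketch below assumes a hypothetical 4-byte node header (2-byte size, 1-byte nodeType, 1-byte element code); the actual field widths are defined by the bitstream definition, not by this example.

```python
import struct

ELEMENT = 1  # hypothetical nodeType value
HEADER = struct.Struct(">HBB")  # size, nodeType, element code

def encode_node(code, children=()):
    """Encode an element node; size covers the node and its descendants."""
    body = b"".join(children)
    return HEADER.pack(HEADER.size + len(body), ELEMENT, code) + body

def top_level_codes(buffer):
    """List the codes of sibling nodes, skipping each node's descendants."""
    codes, offset = [], 0
    while offset < len(buffer):
        size, _node_type, code = HEADER.unpack_from(buffer, offset)
        codes.append(code)
        offset += size  # jump over the whole sub-tree in one step
    return codes

# Node 1 has two children (code 2); node 3 is its sibling.
siblings = encode_node(1, [encode_node(2), encode_node(2)]) + encode_node(3)
```

Scanning siblings visits only nodes 1 and 3; the two child nodes are never decoded.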
[0105] The document hierarchy may contain:
[0106] (i) a complete document tree: this is only possible for very short documents;
[0107] (ii) a complete sub-tree: the sub-tree is the child of another node encoded in an earlier packet; and
[0108] (iii) an incomplete sub-tree: the sub-tree is incomplete because the whole sub-tree cannot be encoded into one packet due to time and/or size constraints.
[0109] Node locators are used in the manner shown in
[0110] Each element node preferably contains a namespace code, an element (name) code, and, if the element has attributes, the byte offset of the first attribute in the attribute name/value pair table and the number of attributes.
[0111] Each text node or comment node typically contains a text locator rather than the actual text. The text locator specifies the packet number of a text packet and a byte offset into the text packet.
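Resolving a text locator amounts to selecting a text packet and reading the string at the given byte offset. The sketch below assumes 2-byte fields for the packet number and offset and assumes each string in a text packet carries a 2-byte length prefix; these widths and the name read_text are illustrative only.

```python
import struct

LOCATOR = struct.Struct(">HH")  # packet number, byte offset (assumed widths)

def read_text(text_packets, locator):
    """Resolve a text locator to the string it points at."""
    packet_number, offset = LOCATOR.unpack(locator)
    packet = text_packets[packet_number]
    (length,) = struct.unpack_from(">H", packet, offset)
    start = offset + 2
    return packet[start:start + length].decode("utf-8")

# Text packet 7 holds two length-prefixed strings back to back.
text_packets = {7: b"\x00\x05Hello\x00\x05World"}
locator = LOCATOR.pack(7, 7)  # second string starts at byte offset 7
```

Because the locator is small and fixed in size, the document hierarchy stays compact and the text need only be fetched when an application actually asks for it.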
[0112] In some cases, a string may exceed the maximum size of a packet. Where such occurs, the string is stored as fragments over multiple text packets, as shown in
[0113] Commands for Constructing Document Tree
[0114] An XML document may be packetized for streaming to the receiver as it is being encoded or even generated (according to some pre-defined DTD or schema). In this case, the XML document is typically constructed in real-time using an API such as a DOM API. Instead of parsing an XML file, the encoder
[0115] Since the nodes transmitted are parts of the same document (that conforms to some pre-defined DTD or schema) and the document is on-line and in-sync between the encoder
[0116] A command packet contains the path of (the root of) the sub-tree to be appended or inserted and the packet number of the structure packet that contains the sub-tree. For example, returning to
[0117] The Definition of the Bitstream
[0118] The bitstream
[0119] API
[0120] API for Documents and Schemas
[0121] It is not always necessary for the decoder
[0122] An application may also have to access information stored in schemas. As schemas are also XML documents, they can be encoded in the same way. Using existing SAX or DOM API for accessing and interpreting schema definitions is extremely tedious. A parser that supports a schema API, such as the Schema API defined in Wan E., Anderson M., Lennon A.,
[0123] To allow the values of built-in datatypes and special types to be encoded efficiently, an encoder has to be able to obtain type information from the schemas. Hence, a schema API is also extremely important to the encoder
[0124] API for Encoders
[0125] The binary format proposed below allows for the implementation of encoders of various capabilities and complexity. The interfaces described in this section allow one to construct a basic encoder that can be extended to provide the more complicated features supported by the encoding scheme.
[0126] Encoder Interface
[0127] void SetMaxPacketSize(in unsigned long maxPacketSize)
[0128] Set the maximum packet size in bytes.
[0129] void SetMaxPrivateDataSize(in unsigned long maxPrivateDataSize)
[0130] Set the maximum size of the private data in bytes. Note that the amount of private data that can be included in a packet is limited by the maximum size of the packet. A large amount of private data is not expected, as it works against the objective of reducing the size of the bitstream.
[0131] void SetHeaderUserData(in ByteArray headerData)
[0132] Write the user data to the header packet. Any existing data will be overwritten.
[0133] void UseCodeTable(in CodeTable codeTable, in Boolean encodeIt)
[0134] Inform the encoder of a pre-defined code table and whether the code table should be encoded with the data.
[0135] void SetCompressor(in Section section, in Inflater compressor)
[0136] Instruct the encoder to use the specified compressor for the specified section. Section is an enumeration with the following values: STRUCT_BODY=1, TEXT_BODY=2, ID_TABLE=3, NS_SECT=4, ELEMENT_SECT=5, ATTR_NAME_SECT=6, ATTR_VALUE_SECT=7, ATTR_PAIR_SECT=8, DOC_HIERARCHY_SECT=9. Inflater has the same interface as Inflater of the java.util.zip package.
[0137] void Flush( )
[0138] Flush the packets in the buffer to the output stream.
[0139] void OnOutput( )
[0140] Receive notification before the set of packets in the buffer is output, to allow the application to insert application-specific data into the packets.
[0141] void SetPacketUserData(in ByteArray userData)
[0142] Write the user data to each of the packets except any header packet in the buffer. Any existing user data will be overwritten.
[0143] Code Table Interface
[0144] unsigned short GetSize( )
[0145] Get the number of entries in the code table.
[0146] wstring GetNamespace(in unsigned short i)
[0147] Get the namespace of the value associated with the ith entry of the code table.
[0148] wstring GetValue(in unsigned short i)
[0149] Get the value associated with the ith entry of the code table.
[0150] wstring GetType(in unsigned short i)
[0151] Get the type of the value associated with the ith entry of the code table.
[0152] ByteArray GetCode(in unsigned short i)
[0153] Get the code associated with the ith entry of the code table.
[0154] unsigned short GetIndexByCode(in ByteArray code)
[0155] Get the index of the code table entry associated with a code.
[0156] unsigned short GetIndexByValue(in wstring value)
[0157] Get the index of the code table entry associated with a value.
[0158] unsigned short GetMaxCodeValue( )
[0159] Get the maximum code value reserved by the code table. The encoder is free to use code values above the maximum code value. Depending on the application, an encoder may also be implemented to use holes left by a pre-defined code table.
[0160] Type Encoder Interface
[0161] ByteArray Encode(in wstring text)
[0162] Encode the value into a byte array given its text representation.
[0163] wstring Decode(in ByteArray encodedText)
[0164] Decode an encoded value into the text representation of the value.
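A type encoder satisfying this interface might look like the following sketch for decimal integers. The zig-zag variable-length integer scheme used here is a stand-in chosen for illustration, not an encoding specified by the disclosure, and the class name IntTypeEncoder is hypothetical; the methods mirror Encode and Decode in lower case, following Python convention.

```python
class IntTypeEncoder:
    """Illustrative type encoder for decimal integers (zig-zag varint)."""

    def encode(self, text):
        value = int(text)
        # Zig-zag maps signed to unsigned so small magnitudes stay short.
        n = value * 2 if value >= 0 else -value * 2 - 1
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)  # more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    def decode(self, encoded):
        n = 0
        for shift, byte in enumerate(encoded):
            n |= (byte & 0x7F) << (7 * shift)
        value = n // 2 if n % 2 == 0 else -(n + 1) // 2
        return str(value)
```

Note that decode returns the canonical decimal form, so a lexical variant such as "007" would come back as "7"; an encoder that must preserve the exact text would have to treat such values as strings.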
[0165] Encoding the XML Data, in Particular MPEG-7 Descriptions of a Presentation
[0166] If (fragments of) XML data including MPEG-7 descriptions (which are XML data used for describing audio-visual (AV) content) are to be streamed and presented with AV content, the timing of and the synchronization between the media objects (including the XML data) have to be specified. Like XML, the DDL (the description definition language of MPEG-7) does not define a timing and synchronization model for presenting media objects. As mentioned above, a SMIL-like MPEG-7 description scheme called herein Presentation Description Scheme is desired to provide the timing and synchronization model for authoring multimedia presentations.
[0167] It has been suggested that MPEG-7 descriptions can be treated in the same way as AV objects. This means that each MPEG-7 description fragment, like AV objects, used in a presentation will be tagged with a start time and a duration defining its temporal scope. This allows both MPEG-7 fragments and AV objects to be mapped to a class of media object elements of the Presentation Description Scheme and subjected to the same timing and synchronization model. Specifically, in the case of a SMIL-based Presentation Description Scheme, a new media object element such as an <mpeg7> tag can be defined. Alternatively, MPEG-7 descriptions can also be treated as a specific type of text.
[0168] It is possible to send different types of MPEG-7 descriptions in a single stream or in separate streams. It is also possible to send an MPEG-7 description fragment that has sub-fragments of different temporal scopes in a single data stream or in separate streams. This is a role for the presentation encoder, in contrast to the XML encoder
[0169] The presentation encoder wraps an XML packet with a start time and a duration signalling when and for how long the content of the packet is required or relevant. The packet may contain:
[0170] (i) multiple short description fragments (each with their own temporal scope) concatenated together to achieve high compression rate and minimize overhead;
[0171] (ii) a single description fragment; and
[0172] (iii) part of a large description fragment.
[0173] In the case where the packet contains multiple description fragments, the start time of the packet is the earliest of the start times of the fragments while the duration of the packet is the difference between the latest of the end time of the fragments (calculated by adding the duration of the fragment to its start time) and the start time of the packet.
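The rule above reduces to a minimum and a maximum over the fragments. A small sketch, with fragments represented as (start, duration) pairs in arbitrary time units and a hypothetical function name:

```python
def packet_timing(fragments):
    """Return (start, duration) for a packet of (start, duration) fragments."""
    if not fragments:
        raise ValueError("a packet needs at least one fragment")
    start = min(s for s, _ in fragments)
    end = max(s + d for s, d in fragments)  # latest fragment end time
    return start, end - start
```

For fragments (10, 5), (12, 10) and (8, 3), the earliest start is 8 and the latest end is 22, so the packet has start time 8 and duration 14.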
[0174] In broadcasting applications, to enable users to tune into the presentation at any time, relevant materials have to be repeated at regular intervals. While only some of the XML packets have to be resent, as some of the XML packets sent earlier may no longer be relevant, the header packet needs to be repeated. This means that, in the case of broadcasting applications, the header packet may be interspersed among structure, text and command packets to reset the transmission to a known state.
[0175] It is apparent from the above that the arrangements described are applicable to the computer and data processing industries and to the efficient use of communication resources associated therewith whilst affording the ability to work with partially received information.
[0176] The foregoing describes only one or more embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiment(s) being illustrative and not restrictive. For example, whilst described with reference to XML documents, the procedures disclosed herein are applicable to any hierarchical representation, such as a tree representation of a document.