Log in / create account | Login with OpenID
DocForge
Programmer's Wiki

XML

From DocForge

XML (eXtensible Markup Language) is a W3C-recommended general-purpose markup language that's useful in a wide variety of applications. XML languages, or dialects, may be designed by anyone and may be processed by conforming software. XML is designed to be reasonably human-legible. XML is a simplified subset of Standard Generalized Markup Language (SGML). Its primary purpose is to facilitate the sharing of data across different information systems, particularly systems connected through the Internet. Formally defined languages based on XML allow diverse software to reliably understand information formatted and passed in these languages.

Contents

[edit] Features

XML provides a text-based means to describe and apply a tree-based structure to information. At its base level, all information manifests as text, interspersed with markup that indicates the information's separation into a hierarchy of character data, container-like elements, and attributes of those elements. In this respect, it is similar to the Lisp programming language's S-expressions, which describe tree structures wherein each node may have its own property list.

The fundamental unit in XML is the character, as defined by the Universal Character Set. Characters are combined to form an XML document. The document consists of one or more entities, each of which is typically some portion of the document's characters, stored in a text string or file.

XML files may be served with a variety of Media types. RFC 3023 defines the types "application/xml" and "text/xml", which say only that the data is in XML, and nothing about its semantics. The use of "text/xml" has been criticized as a potential source of encoding problems but is now in the process of being deprecated. RFC 3023 also recommends that XML-based languages be given media types beginning in "application/" and ending in "+xml"; for example "application/atom+xml" for Atom.

The ubiquity of text file authoring software facilitates rapid XML document authoring and maintenance. Prior to the advent of XML, there were very few data description languages that were general-purpose, Internet protocol-friendly, and very easy to learn and author. In fact, most data interchange formats were proprietary, special-purpose, "binary" formats (based foremost on bit sequences rather than characters) that could not be easily shared by different software applications or across different computing platforms, much less authored and maintained in common text editors.

By leaving the names, allowable hierarchy, and meanings of the elements and attributes open and definable by a customizable schema, XML provides a syntactic foundation for the creation of custom, XML-based markup languages. The general syntax of such languages is rigid; documents must adhere to the general rules of XML, assuring that all XML-aware software can at least parse and understand the relative arrangement of information within them. The schema supplements the syntax rules with a set of constraints. Schemas typically restrict element and attribute names and their allowable containment hierarchies, such as only allowing an element named 'birthday' to contain 1 element named 'month' and 1 element named 'day', each of which has to contain only character data. The constraints in a schema may also include data type assignments that affect how information is processed; for example, the 'month' element's character data may be defined as being a month according to a particular schema language's conventions, perhaps meaning that it must not only be formatted a certain way, but also must not be processed as if it were some other type of data.

In this way, XML contrasts with HTML, which has an inflexible, single-purpose vocabulary of elements and attributes that, in general, cannot be repurposed. With XML, it is much easier to write software that accesses the document's information, since the data structures are expressed in a formal, relatively simple way.

XML makes no prohibitions on how it is used. Although XML is fundamentally text-based, software exists to abstract it into other, richer formats, largely through the use of datatype-oriented schemas and object oriented programming paradigms (in which the document is manipulated as an object). Such software might treat XML as serialized text only when it needs to transmit data over a network, and some software doesn't even do that much. Such uses have led to "binary XML", the relaxed restrictions of XML 1.1, and other proposals that run counter to XML's original spirit and thus garner criticism.

[edit] Benefits

  • It is a simultaneously human and machine-readable format.
  • It supports Unicode, allowing almost any information in any written human language to be communicated.
  • It can represent the most general computer science data structures: records, lists and trees.
  • Its self-documenting format describes structure and field names as well as specific values.
  • The strict syntax and parsing requirements make the necessary parsing algorithms relatively simple, efficient, and consistent.
  • Its robust, logically-verifiable format is based on international standards.
  • The hierarchical structure is suitable for many types of documents.
  • It is platform-independent, thus relatively immune to changes in technology.

[edit] Criticisms

  • Its syntax is verbose and redundant. This was intentional and contributes to ease of parsing and error detection, but also poses some problems. Intensively marked-up documents can become so cluttered as to be difficult to read. XML is favorable at least when compared to binary formats, which may not even render correctly in a text editor at all.
  • Redundancy may affect application efficiency through higher storage and transmission costs, even though compression is commonly very effective for XML data (precisely because its redundancy leads to low entropy). This is particularly problematic for multimedia applications running on small devices which attempt to use XML to describe images and video.
  • Generic XML parsers must be able to recurse arbitrarily nested data structures and may perform additional checks to detect improperly formatted or differently ordered syntax or data (this is because the markup is descriptive and partially redundant, as noted above). This amounts to overhead for very basic uses of XML, which may be significant where resources are scarce - for example in embedded systems. Furthermore, additional security considerations arise when XML input is fed from untrustworthy sources and resource exhaustion or stack overflows are possible.
  • Some consider the syntax to contain a number of obscure, unnecessary features born of its legacy of SGML compatibility. However, an effort to settle on a subset called "Minimal XML" led to the discovery that there was no consensus on which features were in fact obscure or unnecessary.
  • The basic parsing requirements do not directly support data types; XML provides no specific notion of "integer", "string", "boolean", "date", and so on. For example, there is no provision for mandating that "3.14159" be a floating-point number rather than a seven-character string. Such conversions require processing. Again this is intentional: expressing the quantity ten as the string "10" completely avoids problems such as the integer storage formats of various computers (see endianness), which also require conversions when data moves between machines. Consequently, this can be an advantage or disadvantage depending on the application. Most XML schema languages add datatype functionality.
  • The hierarchical model for representation is limited when compared to the relational model, since it only gives a fixed, monolithic view of the tree structure. This is countered with the thought that the hierarchical model for representation is more powerful because it can directly represent recursive and iterative structures without resorting to serial number fields, recursive join operations, and sorting; which model holds the practical advantage depends on the data and operations required. This objection does not apply to the use of XML to serialize relational data, but pertains to the use of XML to express data relations directly, because XML naturally expresses the hierarchical syntax of the SQL query language which is used to serialize relational databases as one method of database backup.
  • Expressing overlapping (non-hierarchical) node relationships requires extra effort. Such techniques are a current research topic among researchers interested in XML.
  • Mapping XML to the relational or object oriented paradigms is often cumbersome, although the reverse is typically easy. Some take this as evidence that XML's model is indeed the more powerful.

[edit] Processing XML

SAX and DOM are object oriented programming APIs widely used to process XML data. The first XML parsers exposed the contents of XML documents to applications as SAX events or DOM objects.

SAX is a lexical, event-driven interface in which a document is read serially and its contents are reported as "callbacks" to various methods on a handler object of the user's design. SAX is fast and efficient to implement, but difficult to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of what part of the document is being processed. It is better suited to situations in which certain types of information are always handled the same way, no matter where they occur in the document.

DOM is an interface-oriented API that allows for navigation of the entire document as if it were a tree of Node objects representing the document's contents. A DOM document can be created by a parser, or can be generated manually by users (with limitations). Data types in DOM Nodes are abstract; implementations provide their own programming language-specific bindings. DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded into memory and constructed as a tree of objects before access is allowed. A form of XML access that has become increasingly popular in recent years is push parsing, which treats the document as if it were a series of items which are being read in sequence. This allows for writing of recursive-descent parsers in which the structure of the code performing the parsing mirrors the structure of the XML being parsed, and intermediate parsed results can be used and accessed as local variables within the methods performing the parsing, or passed down (as method parameters) into lower-level methods, or returned (as method return values) to higher-level methods. For instance, in the Java programming language, the StAX framework can be used to create what is essentially an 'iterator' which sequentially visits the various elements, attributes, and data in an XML document. Code which uses this 'iterator' can test the current item (to tell, for example, whether it is a start or end element, or text), and inspect its attributes (local name, namespace, values of XML attributes, value of text, etc.), and can also request that the iterator be moved to the 'next' item. The code can thus extract information from the document as it traverses it. One significant advantage of push-parsing methods is that they typically are much more speed- and memory-efficient than SAX and DOM styles of parsing XML. Another advantage is that the recursive-descent approach tends to lend itself easily to keeping data as typed local variables in the code doing the parsing, while SAX, for instance, typically requires a parser to manually maintain intermediate data within a stack of elements which are parent elements of the element being parsed. This tends to mean that push-parsing code is often much more straightforward to understand and maintain than SAX parsing code. Some potential disadvantages of push parsing are that it is a newer approach which is not as well known among XML programmers (although it is by far the most common method used for writing compilers and interpreters for languages other than XML), and that most existing push parsers cannot yet perform advanced processing such as XML schema validation as they parse a document.

Another form of XML processing API is data binding, where XML data is made available as a custom, strongly typed programming language data structure, in contrast to the interface-oriented DOM. Example data binding systems are the Java Architecture for XML Binding (JAXB) and the Strathclyde Novel Architecture for Querying XML (SNAQue).

A filter in the Extensible Stylesheet Language (XSL) family can transform an XML file for displaying or printing.

  • XSL-FO is a declarative, XML-based page layout language. An XSL-FO processor can be used to convert an XSL-FO document into another non-XML format, such as PDF.
  • XSLT is a declarative, XML-based document transformation language. An XSLT processor can use an XSLT stylesheet as a guide for the conversion of the data tree represented by one XML document into another tree that can then be serialized as XML, HTML, plain text, or any other format supported by the processor.
  • XQuery is a W3C language for querying, constructing and transforming XML data.
  • XPath is a DOM-like node tree data model and path expression language for selecting data within XML documents. XSL-FO, XSLT and XQuery all make use of XPath. XPath also includes a useful function library.


Additional copyright notice: Some content of this page is a derivative work of a Wikipedia article under the GNU FDL. The original article and author information can be found at http://en.wikipedia.org/wiki/XML.