format	bandm ^meta_tools	xantlr

dtd, a Versatile Model of w3c-xml-dtds

(related API documentation: package dtd, package dtm )

1          Purposes and Principles
2          Structure and Fundamental Operations of the DTD and DTM models.
2.1          Parsing and Checking
2.2          Writing out
2.3          Important Details
2.3.1          Entity Expansion
3          Auxiliary Tools
3.1          Content Simplifier
3.2          Whitespace Ignorer
3.3          Utilities
3.4          Entity Classifier
4          Main dtd Tool
4.1          Restrictions On PERef Placement
4.2          Sample Applications

^{^ToC} 1 Purposes and Principles

The structure of the XML documents processed in and by ^meta_tools is defined by "Document Type Definitions" ("DTDs"), as defined by the w3c [xml].

For processing and code generation a computer-internal representation is needed. This is contained in the ^meta_tools package dtd .

It allows easy parsing, editing, visiting and serializing of DTD models. It includes consistency checks and the resolution of so-called "entities".

^{^ToC} 2 Structure and Fundamental Operations of the DTD and DTM models.

Basically, our dtd model is an application of the model generator umod tool. So the easiest way of explaining it is using the umod-style model definition, which can be found in the appendix source file dtd.umod .

Please note the possibly confusing naming convention:

eu.bandm.tools.dtd is the package,
eu.bandm.tools.dtd.DTD is the umod-generated model, containing classes for element declarations, attribute lists, entity defs, and all the other stuff which makes up a classical dtd,
and eu.bandm.tools.dtd.DTD.Dtd is the class representing one single given dtd as a whole.

Since it is a umod model instance, each dtd model can be constructed, visited, rewritten, serialized, visualized, etc. like any other, as explained in the umod user doc.

The data of a DTD model represents verbatim the textual elements of a given DTD text in its original sequential order, without any simplifications, expansions or consistency checks applied. In many cases this is not appropriate for further processing.

Therefore there exists a second class, dtm.DTM, which reads "Document Type Model". First, it realizes a "pure algebraic" model of a W3C-DTD: attributes are linked to elements, all entities and comments are eliminated, and processing instructions are sorted by targets. Secondly, all naming can be interpreted according to XML name space logic. In most cases, a DTM is much more appropriate for further processing.

The name space support is controlled by tdom processing instructions ("<?tdom xmlns:PREF="URL" ..") in the original dtd text, or programmatically.

Class dtm.DTD2DTM is a converter which also performs some consistency checks.

^{^ToC} 2.1 Parsing and Checking

For parsing a DTD which comes in textual representation, there is the parser class dtd.TunedDTDParser.

It works as usual, i.e. it takes an input source and a message channel for its error messages and warnings, and creates and fills in a new instance of class dtd.DTD.Dtd.

If this parsing is successful, the object structure reflects the verbatim appearance of the source file, which is not appropriate for further processing.

For further processing there are the auxiliary classes ElementIndex, AttlistIndex and Statistics, the latter relying on the two former.

These classes offer indexing functions, consistency checks and further analyses, in case the user does not want to convert the DTD into a DTM as a whole.

^{^ToC} 2.2 Writing out

The generation of an external textual representation is achieved by "FORMAT" directives in the umod source. These are processed by the format front-end language compiler and thus define the code for printing dtds in the usual layout and in compliance with the standard.

Simply calling public Format DTD.toFormat(Dtd) will result in a format object, which in turn can be printed as usual (cf format user doc), e.g. by "format.printFormat(PrintWriter,int)".

^{^ToC} 2.3 Important Details

^{^ToC} 2.3.1 Entity Expansion

The expansion of "entities in the w3c-xml-sense" (see [xml] ) is a complicated process: There are five different flavours of entities and many different categories of syntactical positions where expansion may happen. For nearly all of these combinations the rules are different.
(This documentation includes a dedicated white paper on this complicated subject!)

"Internal parameter entities" ("PEs" in the following) are macros whose value is defined with their declaration (as opposed to being contained in an external file) and which are intended for use within the dtd's declaration text.

When parsing a dtd, nearly all references to these PEs are immediately replaced by their text contents, which then become subject to further parsing instead. The only exceptions are PEs which appear in a content model and which yield a sub-expression of a content model. These are translated into an Abbrev object, which contains the content model and memorizes the entity's name.
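The immediate textual replacement described above can be illustrated by a minimal sketch. It is not the code of dtd.TunedDTDParser: the class name PeExpansionSketch and its method are hypothetical, the name syntax is simplified, and the finer rules of [xml] are ignored. It only shows that the replacement text is itself scanned again for further references.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the dtd package. */
final class PeExpansionSketch {
  // A PE reference has the form %name; (name syntax simplified here).
  private static final Pattern PE_REF =
      Pattern.compile("%([A-Za-z_:][A-Za-z0-9._:-]*);");

  /** Replaces every PE reference in the text by its value, recursively. */
  static String expand(String text, Map<String, String> entities) {
    return expand(text, entities, 0);
  }

  private static String expand(String text, Map<String, String> entities, int depth) {
    if (depth > 32) {
      throw new IllegalStateException("entity nesting too deep (cyclic definition?)");
    }
    Matcher m = PE_REF.matcher(text);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      String value = entities.get(m.group(1));
      if (value == null) {
        throw new IllegalArgumentException("undeclared entity: " + m.group(1));
      }
      out.append(text, last, m.start());
      // The replacement text is subject to further expansion.
      out.append(expand(value, entities, depth + 1));
      last = m.end();
    }
    out.append(text, last, text.length());
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, String> pes = Map.of("inline", "em | strong", "flow", "p | %inline;");
    // prints <!ELEMENT div (p | em | strong)*>
    System.out.println(expand("<!ELEMENT div (%flow;)*>", pes));
  }
}
```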

As clearly indicated in the umod file, a content model consists either of some special forms (ANY, mixed content, EMPTY), or of a term of a freely compositional expression language, representing extended regular expressions. In our model these are represented by subclasses of CP.

Variants of CP are the sequence and the choice constructor, as usual. Besides these, there is a distinction between Singleton, which is a reference to some element declaration, and Abbrev, which is a reference to an entity definition, the expansion of which must in turn be a content model.

In our model the original name of the PE and the content model resulting from its expansion are both contained therein. This is a redundancy, since the entity declaration as such is additionally modeled by an Entity and an EntityValue object. Nevertheless this solution is appropriate in so far as on one hand not every entity corresponds to a content model, but on the other hand all finally resulting content models of all element declarations should be visitable in a transparent way.
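The idea that an Abbrev carries both the entity name and the expanded content model, so that a traversal can look through it transparently, can be sketched as follows. This is only a hand-written mirror for illustration: the real classes are generated by umod from dtd.umod, and the field names, constructors and the class CpSketch used here are assumptions. Occurrence operators are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical mirror of the CP variants; the real classes are umod-generated. */
final class CpSketch {
  interface CP {}

  /** Reference to an element declaration. */
  static final class Singleton implements CP {
    final String element;
    Singleton(String element) { this.element = element; }
  }

  static final class Seq implements CP {
    final List<CP> parts;
    Seq(CP... parts) { this.parts = Arrays.asList(parts); }
  }

  static final class Choice implements CP {
    final List<CP> alternatives;
    Choice(CP... alternatives) { this.alternatives = Arrays.asList(alternatives); }
  }

  /** Reference to a PE: keeps the entity's name and the content model it expands to. */
  static final class Abbrev implements CP {
    final String entityName;
    final CP expansion;
    Abbrev(String entityName, CP expansion) {
      this.entityName = entityName;
      this.expansion = expansion;
    }
  }

  /** Collects all referenced element names in textual order. */
  static List<String> referencedElements(CP cp) {
    List<String> out = new ArrayList<>();
    collect(cp, out);
    return out;
  }

  private static void collect(CP cp, List<String> out) {
    if (cp instanceof Singleton) {
      out.add(((Singleton) cp).element);
    } else if (cp instanceof Seq) {
      for (CP part : ((Seq) cp).parts) collect(part, out);
    } else if (cp instanceof Choice) {
      for (CP alt : ((Choice) cp).alternatives) collect(alt, out);
    } else if (cp instanceof Abbrev) {
      collect(((Abbrev) cp).expansion, out); // look through the abbreviation
    }
  }

  public static void main(String[] args) {
    // Models (head, %block;) with <!ENTITY % block "p | div">
    CP model = new Seq(new Singleton("head"),
        new Abbrev("block", new Choice(new Singleton("p"), new Singleton("div"))));
    System.out.println(referencedElements(model)); // prints [head, p, div]
  }
}
```

A visitor that needs the entity name (e.g. for rendering the unexpanded text) reads entityName; one that only needs the structure descends into expansion, as collect does.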

^{^ToC} 3 Auxiliary Tools

^{^ToC} 3.1 Content Simplifier

The dtd.ContentSimplifier transforms content models into the canonical, shortest form. This can be very useful in all cases where dtd models are generated automatically.

In particular, the resulting transformation is the fixpoint of the repeated application of the following transformations:

                                                (outer term)
   ,  a  a? a+  a*      |  a  a? a+  a*      o  x  x? x+  x*         
   a  -  -  -   a+      a  -  a? a+  a*      a  -  a? a+  a*   
   a? -  -  a+  a*      a? a? -  a*  a*      a? -  a? a*  a*   
   a+ -  a+ aa+ a+      a+ a+ a* -   a*      a+ -  a* a+  a*   
   a* a+ a* a+  a*      a* a+ a* a*  -       a* -  a* a*  a*     
   
   "-" means "no change", and out=in

The content simplifier is realized as a rewriter, derived from the basic rewriter code generated by umod (see umod rewriter doc).
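The rightmost of the three tables above (the "outer term" table, i.e. an occurrence operator applied to an already decorated term) can be reproduced by a small sketch. It is not the code of dtd.ContentSimplifier; the names OccurrenceSketch, Occ and nest are hypothetical. The sketch rests on the observation that the result is optional if either operator allows zero occurrences, and repeatable if either allows more than one.

```java
/** Hypothetical helper reproducing the "outer term" table; not part of the dtd package. */
final class OccurrenceSketch {
  enum Occ {
    ONE("", false, false),
    OPT("?", true, false),
    PLUS("+", false, true),
    STAR("*", true, true);

    final String suffix;
    final boolean optional;    // zero occurrences allowed
    final boolean repeatable;  // more than one occurrence allowed

    Occ(String suffix, boolean optional, boolean repeatable) {
      this.suffix = suffix;
      this.optional = optional;
      this.repeatable = repeatable;
    }

    static Occ of(boolean optional, boolean repeatable) {
      if (optional) return repeatable ? STAR : OPT;
      return repeatable ? PLUS : ONE;
    }
  }

  /** Occurrence of "a" in ((a inner) outer), e.g. (a+)? yields a*. */
  static Occ nest(Occ inner, Occ outer) {
    return Occ.of(inner.optional || outer.optional,
                  inner.repeatable || outer.repeatable);
  }

  public static void main(String[] args) {
    // Prints one row per inner form, one column per outer operator.
    for (Occ inner : Occ.values()) {
      StringBuilder row = new StringBuilder("a" + inner.suffix + " :");
      for (Occ outer : Occ.values()) {
        row.append(" a").append(nest(inner, outer).suffix);
      }
      System.out.println(row);
    }
  }
}
```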

Please note that in tdom the simplification is not employed, i.e. the nomenclature of tdom-generated classes follows verbatim the dtd as given to the tool, even when it contains things like

      ( a , a* )          // allowed
      ( a | (a+)? | a* )  // non LL(1), forbidden

^{^ToC} 3.2 Whitespace Ignorer

Often we employ reading routines which parse an XML source text and convert it to sax events. These routines can come from different vendors and operate in different contexts. What they all have in common is that they must run in "validating mode", or at least have access to the DTD, if you want them not to deliver "ignorable whitespace" as character data.

Since the consumer behind the sax interface (e.g. a tdom generated model) already contains all the validating information and routines, this would be a (possibly tedious) duplication.

The whitespace ignorer allows running the third-party parser in non-validating mode (letting it produce character data whenever there are characters in its input), and then throws away the sax events which correspond to "ignorable whitespace". The decision to do so is based on the dtd.
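The principle can be sketched with the standard SAX filter class org.xml.sax.helpers.XMLFilterImpl. This is not the whitespace ignorer of the dtd package: the class WhitespaceFilterSketch is hypothetical, and the set of elements with pure element content, which the real tool derives from the dtd, is simply passed in by hand here.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical sketch: drops whitespace-only text inside element-only content. */
final class WhitespaceFilterSketch extends XMLFilterImpl {
  private final Set<String> elementOnly;            // assumed to be derived from the dtd
  private final Deque<String> open = new ArrayDeque<>();

  WhitespaceFilterSketch(XMLReader parent, Set<String> elementOnly) {
    super(parent);
    this.elementOnly = elementOnly;
  }

  @Override
  public void startElement(String uri, String local, String qName, Attributes atts)
      throws SAXException {
    open.push(qName);
    super.startElement(uri, local, qName, atts);
  }

  @Override
  public void endElement(String uri, String local, String qName) throws SAXException {
    open.pop();
    super.endElement(uri, local, qName);
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    boolean ignorable = !open.isEmpty() && elementOnly.contains(open.peek())
        && isWhitespace(ch, start, length);
    if (!ignorable) super.characters(ch, start, length);
  }

  private static boolean isWhitespace(char[] ch, int start, int length) {
    for (int i = start; i < start + length; i++) {
      char c = ch[i];
      if (c != ' ' && c != '\t' && c != '\n' && c != '\r') return false;
    }
    return true;
  }

  /** Parses non-validating and returns the character data that survives the filter. */
  static String survivingText(String xml, Set<String> elementOnly) {
    try {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      WhitespaceFilterSketch filter = new WhitespaceFilterSketch(reader, elementOnly);
      StringBuilder text = new StringBuilder();
      filter.setContentHandler(new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) {
          text.append(ch, start, length);
        }
      });
      filter.parse(new InputSource(new StringReader(xml)));
      return text.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<list>\n  <item> a b </item>\n  <item>c</item>\n</list>";
    System.out.println("[" + survivingText(xml, Set.of("list")) + "]"); // prints [ a b c]
  }
}
```

Whitespace inside "item" is kept, because "item" is not declared as element-only in this example; only the line breaks and indentation directly inside "list" are dropped.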

^{^ToC} 3.3 Utilities

The class Utilities currently supports the automated insertion of tdom-pis and the extraction of pis for a certain target and other small tasks.

^{^ToC} 3.4 Entity Classifier

The notion of "entity" in the xml standard [xml] is rather awkward. (See the dedicated white paper on this complicated subject!) First of all, the same word is used for very different things. Here we talk about "parameter entities", which start with a percent sign "%", and are a kind of macro to be expanded in a document type definition, not in a document itself.

These are identifiers which have to be replaced by their "body", i.e. the text they stand for. This text can have widely varying syntax, and different entities DO HAVE a kind of "type", namely the syntactical positions where they can be used.
E.g. if the expansion text is "IGNORE", the entity can be used as the name of a parameter, as part of a type definition of a parameter (enumeration type), as the reference to an element declaration in a content model, or as a switch in a "conditional section".
But if the expansion text is "|a|b", it can be either part of a content model, or of an enumeration type, but not an attribute name.
And finally, if the expansion text is "a CDATA #IMPLIED b CDATA", it is exactly one-and-two-thirds lines of an attribute list.

Obviously it is not sensible to define a strict and correct type system. But what you can do is apply some heuristics.

The class dtd.EntityRole tries to define some of the more sensible combinations, as sketched above, and the class dtd.EntityClassifier assigns one of them to each entity (in a given dtd) for which it is called.

This is done by heuristics: the tuned dtd parser, as described above, is applied to the expansion text assuming very different "start symbols", and the resulting error messages are then monitored to determine the syntactic class of the entity's expansion text.
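The "try every start symbol and see which one accepts" idea can be sketched as follows. Please note that this is a strongly simplified stand-in: regular expressions replace the tuned dtd parser, and the role names are hypothetical and do not claim to match the constants of dtd.EntityRole. The three expansion texts are the examples given above.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch; regex recognizers stand in for parser start symbols. */
final class EntityRoleSketch {
  private static final String NAME = "[A-Za-z_:][A-Za-z0-9._:-]*";

  enum Role {
    NAME_ONLY(NAME),
    CONDITIONAL_SWITCH("INCLUDE|IGNORE"),
    ENUMERATION_PART("\\|?\\s*" + NAME + "(\\s*\\|\\s*" + NAME + ")*\\s*\\|?"),
    CONTENT_MODEL_PART("[|,]?\\s*" + NAME + "[?+*]?(\\s*[|,]\\s*" + NAME + "[?+*]?)*"),
    ATTLIST_PART("(\\s*" + NAME + "\\s+(CDATA|ID|IDREF|NMTOKEN)(\\s+#(IMPLIED|REQUIRED))?)+\\s*");

    private final Pattern recognizer;

    Role(String regex) { this.recognizer = Pattern.compile(regex); }

    /** True if the whole expansion text is accepted for this role. */
    boolean accepts(String expansionText) {
      return recognizer.matcher(expansionText).matches();
    }
  }

  /** Tries every role in turn and returns those that accept the text. */
  static Set<Role> classify(String expansionText) {
    Set<Role> roles = EnumSet.noneOf(Role.class);
    for (Role role : Role.values()) {
      if (role.accepts(expansionText)) roles.add(role);
    }
    return roles;
  }

  public static void main(String[] args) {
    System.out.println(classify("IGNORE"));
    System.out.println(classify("|a|b"));
    System.out.println(classify("a CDATA #IMPLIED b CDATA"));
  }
}
```

As in the text above, "IGNORE" is accepted for several roles at once, "|a|b" only as part of an enumeration or of a content model, and the third text only as a fragment of an attribute list.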

We do not know yet whether that makes much sense, but it is the only thing you CAN do about these things!

^{^ToC} 4 Main dtd Tool

The class eu.bandm.tools.dtm.Tool provides very different ways for visualizing, transforming and analysing a given dtd. It is called "dtd tool" in this section, and "bandmDtdTool" when appearing in a file name.

The dtd tool is intended especially for cases where a large dtd must be analysed and understood by a human reader. We need it ourselves when the processing instructions for tdom must be constructed, to control content model abstractions, common attribute factorization, etc. For this purpose the tool takes a DTD, parses it into the internal model, and outputs a rendered version: an XHTML document which looks exactly like the original DTD document, but has colors for orientation, clickable links, explanatory tool-tips and graphical representations of the "called-by" relations. The generated XHTML document may be dynamic using JavaScript, so that parts can be collapsed and expanded.

But the tool can also be employed for fully automated processing, e.g. by expanding all parameter entities, eliminating all comments, filtering PIs, etc.

The dtd tool can be run from the command line and by a GUI. Its parameters are ...

( definitions from file ../../src/eu/bandm/tools/dtm/toolOptions.xml )

Visibilities for the different categories of DTD components.
	--PIs	(on | off | onOff | offOn) (=on)
	how to treat processing instructions in html rendering
	--comments	(on | off | onOff | offOn) (=onOff)
	how to treat comments in html rendering
	--inserts	(on | off | onOff | offOn) (=offOn)
	how to treat file insertions (external 'parameter entities') in html rendering
	--elements	(on | off | onOff | offOn) (=on)
	how to treat element declarations in html rendering
	--attlists	(on | off | onOff | offOn) (=on)
	how to treat attlist declarations in html rendering
	--PEs	(on | off | onOff | offOn) (=onOff)
	how to treat parameter entity declarations in html rendering
	--GEs	(on | off | onOff | offOn) (=onOff)
	how to treat general entity declarations in html rendering
Switches for additional surveys and analyses.
	--entityGraph	(on | off | onOff | offOn) (=onOff)
	whether to include the entity usage graph
	--elementGraph	(on | off | onOff | offOn) (=onOff)
	whether to include the element contents reference graph
	--alphabeticDir	(on | off | onOff | offOn) (=off)
	whether an alphabetic directory for elements, attributes and entities shall be included
	--analyses	(on | off | onOff | offOn) (=off)
	whether to include diverse analyses' results
	--commonAttThreshold	int(=1)
	from which multiplicity on a common attribute shall be listed
Include meta information.
	--showFullInstructions	bool(=true)
	whether to include the full instruction text at the beginning
	--showCommandLine	bool(=true)
	whether to print the full command line
	--showCreationDate	string
	whether to show date and time, plus the additional text
Select between source text or expanded text.
	--expandDefNames	bool(=false)
	whether entities standing for names in definitions shall be expanded
	--expandRefTooltips	bool(=false)
	whether the text in tool-tips for REFERENCES (to elements and entities) shows the expanded definition
	--expandContents	bool(=false)
	whether the text in content model declarations is expanded (the not-selected form is always in the tool-tip of the defined name!)
	--expandAttlists	bool(=false)
	whether the text in attribute lists is expanded (the not-selected form is always in the tool-tip of the defined name!)
	--expandEntities	bool(=false)
	whether the text in entity declarations is expanded (the not-selected form is always in the tool-tip of the defined name!)
Overall operational parameters and style variants.
	--collapseWS	bool(=true)
	whether to collapse all subsequent white-space when collapsing a declaration
	--acceptIncomplete	bool(=false)
	indicates when set that an incomplete DTD, with dangling references (and possibly syntax errors), is nevertheless accepted and rendered, producing warnings instead of errors.
	--outputFormat	(xhtml | xhtml_stand_alone | xml | text) (=xhtml)
	whether to generate colorful dynamic xhtml, xhtml including script and css, xml for further processing, or pure text like the input format.
	--gui	bool(=false)
	whether to show the graphic user interface before any processing (this is also done when command line is totally empty!)
	--source	uri
	source dtd for processing
	--windowtitle	string
	window title, iff not simply dtd xml coordinates
	--result	uri
	file where to put the result
	--linewidth	int(=78)
	number of columns for generated output

The enumeration type "on|off|onOff|offOn" indicates for the different syntactical elements whether they are included, left out, included and collapsible, or initially collapsed and expandable.
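The four values can be read as combinations of three properties. The following sketch only spells out this reading; the class VisibilitySketch and its field names are hypothetical and are not taken from the tool's source.

```java
/** Hypothetical model of the option values on|off|onOff|offOn. */
final class VisibilitySketch {
  enum Visibility {
    on(true, false, false),     // included, always shown
    off(false, false, false),   // left out
    onOff(true, true, false),   // included, shown first, collapsible
    offOn(true, true, true);    // included, collapsed first, expandable

    final boolean included;
    final boolean collapsible;
    final boolean initiallyCollapsed;

    Visibility(boolean included, boolean collapsible, boolean initiallyCollapsed) {
      this.included = included;
      this.collapsible = collapsible;
      this.initiallyCollapsed = initiallyCollapsed;
    }
  }

  /** Maps an option value as written on the command line to its meaning. */
  static Visibility parse(String optionValue) {
    return Visibility.valueOf(optionValue);
  }

  public static void main(String[] args) {
    Visibility comments = parse("onOff"); // the default of --comments
    System.out.println(comments.included + " " + comments.collapsible
        + " " + comments.initiallyCollapsed); // prints true true false
  }
}
```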

(Please note that a comment in which the word "copyright" or the character "©" appears can neither be collapsed nor eliminated!)

Whenever a declaration value contains PERefs, the boolean value of the option "expand<>" indicates whether the declaration text shall be expanded. The tooltip appearing on the name of the declared structure shows the other state of the declaration text.

The output format for the human reader is "xhtml". This generates one xhtml file which, depending on the parameter settings, will probably be dynamic, using JavaScript. In most cases the output includes additional files, e.g. the general files ".js", ".css" and the specially generated ".svg" graphic files.

When selecting "xhtml_stand_alone", all these files will be included in one single file, for ease of transportation.

When selecting "text", only the text contents of the generated output are written to the output file. By this selection an automated transformation of a dtd file is possible.

When selecting "xml", the structure of the dtd and the results of the analyses will be written out in an encoded format, for further automated processing. ((FIXME still missing))

((FIXME the LL(1) analysis from bt/tdom is still missing))

When the tool is called from the command line without any parameter, the GUI is started. When the option --gui is given in addition to other parameters, these are first parsed and copied into the GUI's widgets as their initial values, before the GUI is started.

The dtd tool can be installed and run by downloading the bandmDtdTool.jnlp file, see also our download page.
Have some fun!

(Please note that this tool comes "as is" and without any guarantee to fit any purpose. For security reasons you can additionally check the down-loaded jar file against its MD5-checksum f2ff38705c77ef75ade4b5a77461dc75 )

^{^ToC} 4.1 Restrictions On PERef Placement

Please note that our dtd tool (currently !-) cannot process "very ugly" entity usage. As can be read in the white paper, the places where references to "parameter entities" may occur are somewhat limited by the rules from [xml] , which require that of each pair of parentheses of any kind either both or none are contained in the replacement text. Nevertheless there are ugly ways of using PERefs, not prohibited by the specs, which our tool does not support.
These places are ...

the name position in any declaration (either expanding to the name or to the name and the subsequent definition value),
the percent character in a PE declaration,
for PEs which are external, eg. "files", any position below the top-level grammar of "dtd/external subset". (Indeed, the spec does not forbid to put each simple element reference from a content model declaration in a file of its own !-)

None of these (ab-)usages will appear in practical applications. Our tool would probably crash on them, and we still have no idea how to repair this with a reasonable amount of effort.

^{^ToC} 4.2 Sample Applications

As bonus material we present three versions of the xhtml dtd rendered with our tool:

xhtml_10_rendered.html. This file contains all entities, analyses and graphs.
xhtml_10_expanded.html. Here the entities are still contained, but their declarations, and all content models and attribute lists are expanded. The un-expanded version, iff it differs, is always in the tooltip.
xhtml_10_compact.text. Here content models and attribute lists are expanded, and entities, all (our) PIs and all comments are deleted. This text can be fed into any tool without loss of structural information, apart from the PEs.

Please note that the copyright with the XHTML-DTD is with w3c.org. The license text can be found at http://www.w3.org/Consortium/Legal/copyright-software-19980720

Please note further that our rendered versions come without any guarantee to fit any special purpose !-)

On our website about "models of music" you find different rendered versions of the MusicXML dtd . (See there for details.)



made 2018-12-30_10h54 by lepper on linux-q699.site

produced with eu.bandm.metatools.d2d and XSLT
