CS99I Meeting 3 Notes: HTML
By Gio Wiederhold,
Updated 20 Jan 2001.
Topics Covered briefly
HTML
Hyper (multi-linked) Text (documents) Markup (with format annotations) Language.
Used to mark up documents so they can be easily shown on a variety of computer devices, and to reference (HREF) local and remote documents and images.
Remote documents require a computer address (http://www.somewhere.xxx) so they can be found.
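As a minimal sketch of how references sit inside marked-up text, the Python below collects the HREF (and image SRC) values from a small HTML fragment and separates local from remote ones. The fragment, the file names in it, and the names LinkCollector and is_remote are made up for illustration; only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def is_remote(ref):
    """A reference is remote when it names a computer (network location)."""
    return urlsplit(ref).netloc != ""


# Hypothetical fragment: one local document, one remote document, one local image.
SAMPLE = """<html><body>
<a href="notes.html">local notes</a>
<a href="http://www.somewhere.xxx/page.html">remote page</a>
<img src="figure.gif">
</body></html>"""

collector = LinkCollector()
collector.feed(SAMPLE)
remote = [ref for ref in collector.links if is_remote(ref)]
local = [ref for ref in collector.links if not is_remote(ref)]
print("remote:", remote)
print("local:", local)
```

Only the remote reference carries a computer address; the local ones are resolved relative to the document that contains them.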
Document Formats
Paper: arbitrarily structured/unstructured; physical order.
Books: somewhat structured/unstructured; layout order; metadata: ToC, index.
Tables: very structured. Exceptions are awkward and end up in footnotes.
Databases: very structured. Machine processable, queryable. Exceptions awkward.
relational: table-based, links by references, combined with the join operator; unordered. Example: student |><| course-info
object-oriented: tree-based, structural (and optional reference) links; ordered (often)
SGML: for document printing, hierarchically structured; ordered
HTML: for document transmittal, varied presentation, hierarchically structured + links; ordered
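To make the relational entry concrete, here is a small sketch of the join student |><| course-info in Python. The rows, the column names, and the helper natural_join are invented for illustration and are not taken from any course database.

```python
def natural_join(left, right, key):
    """Join two lists of rows (dicts) on a shared column, as in left |><| right."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    # Each left row is linked to matching right rows by the reference value.
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]


# Hypothetical tables; row order carries no meaning in the relational model.
student = [
    {"name": "Ann", "course": "C1"},
    {"name": "Bob", "course": "C2"},
    {"name": "Cho", "course": "C1"},
]
course_info = [
    {"course": "C1", "room": "R101"},
    {"course": "C2", "room": "R202"},
]

joined = natural_join(student, course_info, "course")
for row in joined:
    print(row["name"], row["course"], row["room"])
```

The link between the two tables exists only through the shared value in the course column, which is what "links by references" means here; a tree-based (object-oriented, SGML, HTML) format would instead nest the related items.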
Components
Three older inventions combined:
- Document Markup for typesetting: SGML [IBM -- Air Force about 1975].
Markups are metadata for presentation (see HTML intro).
- Hypertext linkages to create a hierarchical document [Nelson, about 1960].
Uses Hyperlinks: http://computer/directory/file+/entrypoint$ (see Regular expression syntax)
- Simplified FTP with an embedded site address (http://cs.stanford.edu/account/...), which avoids having to log in [Berners-Lee at CERN]; uses
Internet-based addressing for remote documents
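The hyperlink pattern above can be taken apart mechanically. The sketch below uses Python's urllib.parse and assumes that the optional entry point corresponds to the part after "#", which is how URLs usually express it; the example address and the helper describe_link are hypothetical.

```python
from urllib.parse import urlsplit


def describe_link(url):
    """Split a hyperlink into the parts named in the pattern
    http://computer/directory/file (plus an optional entry point)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "protocol": parts.scheme,
        "computer": parts.netloc,
        "directories": segments[:-1],
        "file": segments[-1] if segments else "",
        "entrypoint": parts.fragment,  # empty string when absent
    }


# Hypothetical address, built on the placeholder host used in these notes.
example = describe_link("http://www.somewhere.xxx/dir/sub/page.html#section2")
for part, value in example.items():
    print(part, "=", value)
```

Because the computer name is embedded in the link itself, the reader never has to connect and log in separately, as was needed with plain FTP.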
Two Technologies:
- A means to access documents remotely: hypertext transfer (HTTP),
an FTP that includes links
- A browser [Mosaic, by Andreessen and Bina at the Univ. of Illinois supercomputing center (NCSA)].
A browser program interprets HTML, fetched with HTTP, and integrates text, images, and remote references (hyperlinks)
and a business requisite
A community of high-energy physicists who
- benefited from rapid access to complex documents and
- had the computers on which the (free) browsers could be installed.
Browser competition [Clark-Netscape] [Gates-Microsoft]
Learn by reading and doing
Reading: Open a simple HTML web document (like this one) in a browser,
and see what its source looks like
- in Netscape: [View] [Page Source]
- in MS Internet Explorer: [View] [Source]
If you look at a 'commercial' web page you will find many markups that we
won't have to care about. Make notes about the ones that puzzle you and
discuss them in class. The essential ones are listed in our CS99I HTML notes.
Doing, indirectly: Create a document with, say, Microsoft Word,
save it as HTML, and look at it.
Doing, directly: Create a document with HTML markups yourself, as shown
in the notes, and then save it as text. Rename the file so that its postfix
(extension) changes from .txt to .html, and then open it in a browser to look at what you have created.
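The direct route can also be scripted. The sketch below, in Python, saves a tiny hand-written page as text, renames it from .txt to .html, and lists the markups it contains. The page content, the file name first.txt, and the class TagLister are made up for illustration; the file is written to a temporary directory.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# A hypothetical first page, typed as plain text.
PAGE = """<html>
<head><title>My first page</title></head>
<body>
<h1>Hello</h1>
<p>This page was typed as plain text.</p>
<a href="http://www.somewhere.xxx">A remote link</a>
</body>
</html>
"""


class TagLister(HTMLParser):
    """Records the name of every opening markup (tag) it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


workdir = Path(tempfile.mkdtemp())
txt_file = workdir / "first.txt"
txt_file.write_text(PAGE, encoding="utf-8")   # step 1: save as text

html_file = txt_file.with_suffix(".html")
txt_file.rename(html_file)                    # step 2: rename .txt to .html

lister = TagLister()
lister.feed(html_file.read_text(encoding="utf-8"))
print(html_file.name, "contains the markups:", lister.tags)
```

Opening the resulting .html file in a browser shows the rendered page; the list printed here shows the same markups that [View] [Source] would reveal.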
Some preliminary hints for future meetings
Role of HTML
Selling over the Internet
- fungible versus unique goods
- return policies and problems
Regular expression syntax
Important for formulating
- Representation grammars
- queries (getting some subset of the representation)
sequence: (a,b,c)
alternatives: (x|y), in combination (x|y, b,c) {x,b,c or y, b, c}
optional: q$ {q | nothing}
any: r* {nothing | r | rr | rrr | ... }
repeats: s+ { s | ss | sss | ... }
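The operators listed above can be tried out with Python's re module. This is a sketch under two assumptions: the notes' q$ (optional) is written q? in Python, and a sequence (a,b,c) is written by simple juxtaposition. The helper accepts is a made-up name.

```python
import re


def accepts(pattern, text):
    """True when the whole text belongs to the language of the pattern."""
    return re.fullmatch(pattern, text) is not None


# Notation from the notes -> assumed Python equivalent.
PATTERNS = {
    "sequence (a,b,c)": r"abc",
    "alternatives (x|y, b,c)": r"(x|y)bc",
    "optional q$": r"q?",
    "any r*": r"r*",
    "repeats s+": r"s+",
}

for candidate in ["abc", "xbc", "ybc", "", "q", "rrr", "sss"]:
    matching = [name for name, pat in PATTERNS.items() if accepts(pat, candidate)]
    print(repr(candidate), "->", matching)
```

Note the difference between the last two operators: r* accepts the empty string, while s+ requires at least one s.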
Example:
(((S|s)ection|paragraph(s$) )*.)
matches all citations looking like
Section xx., section xx., paragraph xx., paragraphs xx.
By setting a marker for xx, that text can be retrieved for display or processing.
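A rough translation of the citation example into Python's re syntax follows. It assumes that xx stands for a single word or number, that s$ becomes s?, and that the final "." is a literal period; the named group xx plays the role of the marker. The sample sentence is invented.

```python
import re

# (?P<xx>...) is the marker: it names the part of each match to retrieve.
CITATION = re.compile(r"\b(?:[Ss]ection|paragraphs?) (?P<xx>\w+)\.")

# Hypothetical text containing all four citation forms.
text = "See Section 12. and section 3a. Also paragraph 7. and paragraphs 8."

citations = [m.group(0) for m in CITATION.finditer(text)]
marked = [m.group("xx") for m in CITATION.finditer(text)]
print(citations)
print(marked)
```

The first list holds the full citations that matched; the second holds only the marked xx parts, ready for display or further processing.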
A regular language is capable, but not really user-friendly.
- Would such a query language help your browsing?
- Would such a language help in screen-scraping?
Notes
See Brief intro to HTML.
See also the references.