CS99I Freshman Seminar

Winter 1997/1998.

Traveling the Information Highways: Libraries

Maps, Encounters, and Directions

Master copy on Earth
Draft 27Nov1993, rev 1feb94,4Apr94, 9May94, 30May94, 20Jun94, 19Oct1994, 1Feb1998. 25Jan2000, minor update 28dec03, needs revision.
This material is

©Gio Wiederhold and CS99I students, Stanford University, 1998.

Chapter: Digital Libraries

Why do you rob libraries, publishers and authors? Because that's where the information is. [modern corollary of Willie Sutton's reason for robbing banks]

LIBRARY.Intro

A Digital Library is intended to be an electronic version of a public library. But replacing paper by electronics leads to three major differences, and these differences in turn lead to a plethora of further differences, so that eventually the digital library no longer mimics the traditional library. With increased use of digital libraries, major changes will occur in how readers, librarians, publishers, and authors interact. How soon these changes will occur is hard to predict, but the technological foundations for digital library structures are in place and the social changes are sure to follow. Ignoring the emergence of digital library structures because paper is still dominant is unwise for authors, publishers, booksellers, and librarians. After all, the invention of book printing is barely 500 years old, and it made obsolete the armies of scribes that existed in the Middle Ages. The diminished role of monks in turn weakened the dominance of religious institutions.

  1. The initial difference is that the works (books, magazines, newspapers, records, videos, and the like) are stored in digital form. New formats and standards are being devised to deal with the variety and complexity of the information. To what extent traditional *Good Old Paper (GOP) formats should be imitated is still unclear; there is much good experience in print media, but electronic access will rapidly supersede many old habits.
  2. The second difference is that readers no longer need to come to a library; they use the electronic highways to have the material delivered to them. That removes the important social aspect of congregating in a real library building, where one can learn from the librarian, from friends and colleagues, drop kids off for a story-telling afternoon, or wait out a traffic jam in a pleasant atmosphere. Electronic surrogates for these social functions are likely to appear, but will no longer be coupled to the library. Technical problems remain. Giving everyone fair access to the roads leading to the public libraries must be a concern. Travel on electronic roads can also be clogged. Complex documents, specifically images, occupy wide information lanes, so that traffic and distribution of digital libraries along the highways remains an issue.
  3. The third difference is that the material is always copied from the master version of the work in the electronic library. The copy need not be a literal image, but is likely to be a composite of items selected from source works. That means that the mechanical aspect of publishing and printing an inventory of books or magazines has disappeared. If the loss of the production function causes publishers to disappear, then an author will no longer have them as a resource to turn to for advice and income. Without librarians and publishers the reader is deprived of guidance to trustworthy material. The need for editing, selection, abstracting, and marketing literary works continues to exist, but these functions might be very differently distributed.
The introduction of digital libraries is having a revolutionary impact on the traditional functions of authorship, editing, refereeing, publication, critical review, book-selling, and libraries, and on the readers, as shown in Figure Library participants. The revolution will not be instantaneous. Its impact is likely to start slowly but accelerate over time, permitting many participants to remain unaware until they are trapped. There are many examples of this pattern, especially in our analogy, where we recall how trains were affected by airlines, big American cars were affected by imports, and businesses along the highways were affected by the interstate highway network.


Figure Library participants: Current Participants in the Creation, Dissemination, and Use of Information.

As the redistribution of functions in the digital libraries takes place, each participant has to consider what their information product is apart from its physical manifestation, i.e., what added information value is generated by their efforts. An author, while viewed as primarily being creative, also depends greatly on abstracting and reformulating earlier work, especially in scientific areas. An editor makes the work of a good author accessible, and filters out poor work. A publisher advances money to an author, enabling more rapid generation of material, employs reviewers to help with selecting authors' submissions, hires editors, and invests in typesetting, printing, and inventory. Critics assess new publications, and disseminate their judgements to a wide audience. The booksellers focus on market understanding and distribution, and provide a pleasant environment that encourages people to browse.

LIBRARY.History

The history of passing on information orally and by writing extends into pre-history, since the concept of history itself is bound to our ability to record events, so that later generations can connect their life to that of their ancestors. We cite three historical analogies.

  1. The ability to read and write gave priests and kings power. Today, the ability to create and understand computer codes empowers computer *hackers. The autonomy they gain often frustrates their managers, if they are employed on projects defined by others. Using the electronic highways, * digital gurus can collect followers anywhere, loosening the traditional binding of influence to land and locality.
  2. Sharing a spoken language defined communities and sharing a written language defined nations. Today, some computer communities are defined by the computer systems and languages they use. Fierce intellectual battles are being fought to establish that one language is superior to another, even if they both can express the same *computer program, albeit in more or fewer sentences.
  3. The invention of the printing press broadened access to writing and led to religious reformations and political revolutions. Dissemination of information by facsimile machines (fax) has enabled political protests. The innovation brought by digital information management is a broadening of active participation. The technical barriers to becoming a popular author are disappearing, although intellectual ones remain.
Early computers lacked the storage capability and reliability to serve the library domain. While there were a number of early computer languages oriented towards manipulating data represented by characters (COMIT, SNOBOL), they did not have large bodies of text to work with, and remained isolated.

LIBRARY.History.SDC

!check biblio for TDMS and successors
These systems demonstrated that computers could be used for processing textual data. They spawned a variety of successor systems, but the benefits of all of them were limited because, without widespread communication linkages, they could only reach a small audience.

LIBRARY.History.cataloguing

As libraries grew, a need evolved to locate books other than by scanning the shelves [Baker:94]. Librarians made lists cataloguing the books. In 1791 the French revolutionary government had confiscated many private and monastic libraries, and to create a union catalog they initiated a collection of reference information by listing every individual book on a playing card, reserving aces and deuces for lengthy titles. The cards were filed in drawers. Eventually those cards became resources themselves. The cards would be copied, so they could be filed in drawers arranged by the authors' names, in drawers arranged by title, and in subject drawers. Remington Rand, a predecessor (via Rand Univac) of the UNISYS computer company, made many of these drawer cabinets.

In 1901 the Library of Congress (LoC) started printing its cards, and made copies available to other libraries, providing a consistent basis for the majority of new !!local information publications!!. Onto the LoC cards specific local information was typed, and sometimes useful observations by the librarians. Over the years complex cataloging rules have emerged: the "Rule Interpretations" (RI) issued for U.S. librarians by the LoC. Examples of rules for alphabetization appear in [Knuth:73]. The paper card catalogues grew at increasing rates and, as more people used them, became increasingly awkward, as well as targets for vandalism.

In the 1960's the LoC devised a format for computerizing catalog entries, and started creating Machine Readable Cataloging (MARC) records. By the end of the decade most other libraries found that they could not keep up with cataloging their new acquisitions, and started using the MARC records for their own books, just adding some retrieval and perhaps purchase or gift information. The MARC-linked records use the LoC provided ISBN number so that books can be mechanically cross referenced among all libraries. About 8M MARC records have been produced.

In 1968 the Ohio College Library Center (OCLC), succeeded by the Online Computer Library Center, started printing MARC cards using computers. OCLC cards are slightly more expensive, but can be delivered sorted according to many criteria, reducing labor for the libraries. When member libraries acquired new books they could check online, for a modest fee, whether there was a pre-existing MARC or OCLC record, and save the effort of making a completely new catalog record according to the LoC RI. If no prior reference existed, or if the record was inadequate, the librarian could create a new catalog record, and receive a credit. The distributed nature of updating, however, meant that the quality of entry varied so that, for instance, many authors' names, especially foreign ones, appear inconsistently, making it hard to assure thorough recall. We encounter such inconsistencies in other efforts using networks, and will present technology addressing such problems in the Chapter on MEDIATOR Technologies.

Subsequently !year! OCLC started providing a retrospective conversion service (RETROCON) to libraries which converts their older paper catalog card data to computer records as well. Through these processes it created a huge database of catalog information. By 1994 the OCLC database contained nearly 30M entries. Local shelving data is now added to the online OCLC records. However, most supplemental information from specific cards is lost, except from a few libraries (The New York Public Library, Harvard) who microfilmed their cards before shipping them to OCLC.

Today most card catalogues are gone and the cabinets are being recycled. In 1993 the University of California at Berkeley replaced its card catalog, superseded by its MELVYL system, with eight study tables.

LIBRARY.History.MEDLINE

!linkage to SDC? via Batelle Orbit
The National Library of Medicine (NLM) has operated since 19!!xx a computerized cataloging system for the medical literature. It distinguished itself in two aspects from most contemporary library systems: individual journal articles were catalogued, and their indexing was based on a * controlled vocabulary.

Most libraries put only entries for entire works, such as books, into their catalogs. When an entire year's worth of a serial journal has appeared, the issues will be bound and indexed as a new volume of the serial title. That approach is fine for management of the holdings on the shelves of a library, but does little to help an individual researcher who is interested in a specific topic. A journal volume contains a wide range of articles within a scientific area, and the articles will not be in any meaningful sequence, so that without further assistance entire volumes must be browsed to find relevant information.

Creation of an index to topics contained in serial volumes provides the needed assistance for researchers. Such an index is not easy to construct. Authors should follow the terminology established by their predecessors, but when they move into new areas, they invent their own terms. When topics merge, use of terms becomes inconsistent. New terms are often essential to make crucial distinctions in domain or applicability, but they also isolate work from work of others that may be similar in result, even though different in approach. To help readers locate all relevant materials, * thesauri can be constructed that place the terms used into a hierarchy, so that a researcher can use a more general term to locate more specific material indexed under its own terms. The * vocabulary for access is now controlled. A controlled vocabulary also helps in resolving simple confusions in word usage, such as * synonyms, where two words have been used to denote the same concept, and overloading, where a term means two different things based on context, as shown in Figure: Disambiguation.


Figure: Synonyms and terms requiring disambiguation.

Within a domain, say woodworking, there should be no ambiguity, but when the domain is unknown, or when discourse covers multiple domains, then disambiguation is often needed. When we have a bishop who enjoys woodworking, then the term miter is ambiguous.
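The role of a thesaurus described above can be illustrated with a small sketch. It is only a toy: the terms, the dictionaries NARROWER, SYNONYMS, and OVERLOADED, and the functions preferred and expand are all invented for this illustration and are not taken from MeSH or any other real thesaurus.

```python
# Hypothetical miniature thesaurus; all terms are illustrative only.
NARROWER = {
    "cardiovascular disease": ["heart disease", "hypertension"],
    "heart disease": ["myocardial infarction"],
}
# Synonyms map a variant wording onto the preferred (controlled) term.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
}
# An overloaded term has a different preferred term in each domain.
OVERLOADED = {
    "miter": {"woodworking": "miter joint", "clergy": "bishop's headdress"},
}


def preferred(term, domain=None):
    """Map a searcher's word onto the controlled vocabulary."""
    term = term.lower()
    if term in OVERLOADED:
        senses = OVERLOADED[term]
        if domain not in senses:
            raise ValueError(f"'{term}' is ambiguous; choose one of {sorted(senses)}")
        return senses[domain]
    return SYNONYMS.get(term, term)


def expand(term, domain=None):
    """Return the preferred term plus every narrower term below it."""
    found, stack = [], [preferred(term, domain)]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(NARROWER.get(current, []))
    return found


print(expand("cardiovascular disease"))
print(expand("heart attack"))
print(preferred("miter", domain="woodworking"))
```

A query on the general term retrieves material indexed under the more specific terms, a synonym is silently mapped to its preferred term, and the overloaded term "miter" is only accepted once a domain is given.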

Indexing of published articles using a thesaurus requires understanding the domain. To train indexers in using the Medical Subject Headings (MeSH) thesaurus requires !!6 months!! and to remain certified a MeSH indexer participates annually in a !!3 week!! update course. Since science and scientific terminology change, the ongoing certification enables the MeSH thesaurus to be updated as needed.

The indexing of every article in a large number (!n=?) of medical journals provides a very high added value to the users of NLM resources and a sound basis for ongoing research. Duplication of research effort is minimized, results can be rapidly accessed by a broad community, and, when errors are published, that information is also rapidly disseminated. The technical means of dissemination is of course a computer network. Initially NLM operated its own network, with terminals in the libraries of many medical schools in the country (MEDLINE). As computer capabilities increased MEDLINE was improved, so that now abstracts for most of the indexed articles are also available on-line. Today more and more access to MEDLINE is provided through the Internet, although the primary MEDLINE interface is oriented towards expert librarians, and hard to use casually.

!also MeSHT , Index Medicus T. MEDLARS T, The National Network of Libraries in Medicine (NN/LM T), TOXNET T, DOCLINE T


Figure: A query issued to MEDLINE for papers of interest. (not yet in)

To make the entire contents of the articles available requires overcoming several barriers: the volume of material is significantly larger, probably beyond the capacity of the mainframe technology now used for MEDLINE, the inclusion of formulas and figures would require the adoption of standards for their transmission, and, last but not least, the management of copyrights for the journals and their authors. We address copyright issues later, and now focus on tools for authors.

The National Library of Medicine has sponsored a major effort to unify medical terminologies. The outcome of this research is UMLS, described in the Chapter on HEALTH-CARE.


Figure: Smoking: An illustration of negative feedback in a chain of causal relationships. (not yet in)
smoking … increased blood pressure … heart attack … reduction of smoking


A total of 15 other thesauri and vocabularies contributed information to UMLS. Major, active sources are the * LoC list of Subject Headings (LCSH), and * MEDLINE, * MeSH T.
!also used were AI/RHEUM, Physicians Data Query (PDQ), DXplain, Quick Medical Reference (QMR), Dorland's Illustrated Medical Dictionary, Online Mendelian Inheritance In Man (OMIM), ICD9-CM

Like any ontology, UMLS requires long-term knowledge maintenance. It is now updated annually, but quarterly updates are contemplated. Some of the changes needed are to correct errors and omissions, but others will account for the *evolution in the use of the terms. The existence and utilization of a sharable ontology brings the community and its use of terms together through practice. The latter is probably the most important result in the long run. By having tools, more people will share concepts and naturally evolve towards common meanings for terms, just as the King James Bible provided a major impetus towards consistent spelling in the 17th century. The focus of UMLS research is indeed now the development of applications, to obtain user coherence and feedback. As such UMLS, in its domain, is a good precedent setter for HPCC and the general effort to establish highways of the future. The medical domain does have the advantage that its participants are well-educated and demanding of high quality services. Since healthcare delivery is costly, even relatively modest benefits can have a high financial impact.

LIBRARY.History.printing

Proper presentation of results is nearly as important as the contents. Around 1965 the first high-capacity computer printers appeared which could print both upper- and * lower-case, removing one barrier from the acceptance of computers for the production of literary works. But these printers were costly and slower, and rarely used. I recall an argument from the early 1970's where I was told that the computed results I presented were faked, since a computer `could not produce text in lower case'.

However, timesharing systems in academic settings were already adapting teletypewriters to their needs. Later-model teletypewriters and other electric typewriters were able to print text in lower case. The character sets of both these typewriters and printers were limited to fewer than 120 letters and symbols, and only one font type and size was available without physically changing printing heads. Boldface printing had to be imitated by printing the same characters twice, without advancing the paper.

Printing based on Xerography, eventually controlled by lasers, provided the flexibility needed for attractive documents. Laser printing was also pioneered in the 1970's at XEROX *PARC, which had the technological resources to innovate in document production. Soon other companies with expertise in making copiers entered the field. Today there are also printer mechanisms based on inkjets that provide high-quality output at low cost and lower speed. Since they can squirt inks of several colors they are popular for pretty presentations. For black-and-white printing laser printing dominates.

LIBRARY.History.TEX

Around 197x Donald Knuth learned that the publisher of his epic series "The Art of Computer Programming" [Knuth:73], Addison-Wesley, was discarding hot-lead linotype equipment in favor of phototypesetting, and could no longer set mathematical formulas as beautifully as before. There is, of course, no fundamental reason for a reduction in quality due to the use of computers, and Prof. Knuth proceeded to learn what made books and type beautiful and to create software that allowed computers to do the same. His Metafont and TeX software revolutionized the production of mathematical and academic works [ref]. Authors could now specify typefaces and document layout in exquisite detail, or let *default choices take over, which embodied standard conventions and simplified document preparation.

Figure An example of \TeX as written by an author and its result in print. (not yet in)

The TeX language uses *macros to assemble its primitive commands into easy-to-use commands for document preparation. An example of TeX macro use and its output is shown in Fig. TeX; the corresponding macro definition is shown below in Fig. Texmacro. A popular default collection of TeX macros, LaTeX, was established by Leslie Lamport of SRI International. Donald Knuth donated his rights to TeX to the American Mathematical Society (AMS), which today expects all manuscripts to be supplied in TeX format. Other academic *publishers, even those that focus on computing, such as the *ACM and *IEEE, are still struggling with electronic submission of manuscripts, trying to follow the leadership of the mathematicians. This is an example where a group of users with a valid need and leadership were empowered by technology, while the technologists fell behind.

Today TeX is available for nearly any combination of computer, operating system, and printer. Processing a source document through TeX creates a file in a * device-independent format (.DVI). Today the .DVI format is most often translated for printers that have * PostScript (.PS) capabilities. PostScript is a commercial * standard, marketed by Adobe, which was derived from earlier work in high-quality word-processing at * PARC.

Unfortunately web browsers do not routinely accept TeX as input. A proposal for HTML 3.0 did include a useful subset, but has not been implemented. Until this lacuna is remedied, it remains difficult to present mathematical text directly in web pages. The best solution available today (1998) is for the author to process TeX into PostScript form, and for the reader to obtain a plugin for the browser that can handle PostScript, such as Ghostscript.

LIBRARY.History.WYSIWYG

A disadvantage for a casual author in using TeX is that a manuscript in TeX form does not look on the screen as it will be presented in print, as shown above in Figure TeX. Much preferable is an authoring environment where * `What-You-See-Is-What-You-Get' (WYSIWYG, pronounced "wizzy-wig") holds. Early WYSIWYG approaches were hindered by the lack of processing capability and storage of their computers, so that formats and typefaces were quite limited. It was easy to tell when a document had been prepared by a computer. The output might be legible, and some cute fonts were included, but in general the results were ugly.

These limits are now largely overcome. Authoring tools such as Microsoft Word and WordPerfect are WYSIWYG systems. However, when their output is exported to Web pages as HTML, they have to cede the specific formatting to the browser, so that the results only approximate `What You Saw'.

Today computer storage, processing capacity, transmission, and presentation of literary text greatly exceed what is possible without computers. The entry of pre-existing documents remains tedious; however, most current works are prepared using computers, so that recent material can be made available.

LIBRARY.Functions

The functions of a library are to acquire works, store them, make the works available to the reader, and reimburse the author and the publisher for their efforts. We discuss these four functions for a Digital Library now, building on the support functions presented in earlier chapters. An important, informal mode of accessing digital information, *browsing, was covered in Browsing.Functions.browsing.

When we discuss libraries here, we consider the entire process of getting information from concept to the reader. Building libraries on the Information Highways will affect all phases of the publication path.



Figure: Electronic displacements in the publication path.

Two important metrics are used in library science: *recall and *precision. Precision indicates how much of what is retrieved by a query is of interest and benefit to the searcher. Recall indicates how few relevant documents have been missed by the searcher. These metrics are interdependent. If the system presents all the documents it has, then the desired material is certainly included: the recall is high, but the precision is low. A system which presents a single relevant document has a high precision, but is likely to have missed other relevant material.
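The trade-off between the two metrics can be checked on a toy collection. The sketch below uses the names common in information retrieval: precision is the share of retrieved documents that are of interest, and recall is the share of interesting documents that were retrieved. The function name precision_recall and the ten-document collection are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# A hypothetical collection of ten documents, two of which are of interest.
collection = [f"doc{i}" for i in range(1, 11)]
relevant = ["doc1", "doc2"]

# Presenting everything misses nothing, but most of it is noise.
print(precision_recall(collection, relevant))   # (0.2, 1.0)

# Presenting one relevant document is all signal, but half is missed.
print(precision_recall(["doc1"], relevant))     # (1.0, 0.5)
```

The two extreme cases reproduce the interdependence described above: returning the whole collection misses nothing but is mostly noise, while returning a single relevant document is all signal but incomplete.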

LIBRARY.Functions.acquisition

If the library does not contain much, it will be hard to retrieve relevant material. Complete acquisition within its domain must be an initial goal of a library. A digital library must acquire material in digital form. There is a variety of formats available for text as well as for images.

An efficient representation for text is * character encoding. A sequence of 7 bits can represent 2^7 = 128 symbols. The common * ASCII standard uses 7 bits to represent 95 printing characters; the remainder are available for control and checking purposes. While 95 characters comfortably cover the alphabet, digits, and punctuation, this limits literary and scientific expression. Since most character transmission today is in terms of 8-bit units, * bytes, more symbols (up to 2^8 = 256) can be represented in a byte-sized character. However, no 8-bit * standard has been widely adopted, so that common non-ASCII characters in DOS, the Apple Macintosh, UNIX-based systems, and in much word-processing software are differently represented. In Japan 16-bit representations are used for * Kanji (Chinese-based) symbols. A five- and six-digit * telegraph code is sometimes used in China, where a larger set of symbols is needed.
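The counts quoted above are easy to verify. The following sketch tallies the 7-bit code space and shows one accented letter encoded under three 8-bit conventions; the sample word and the choice of the three codecs (cp437 for DOS, mac_roman for the Macintosh, latin-1 as commonly found on UNIX systems) are assumptions made for illustration.

```python
SEVEN_BIT = 2 ** 7   # number of distinct 7-bit codes
EIGHT_BIT = 2 ** 8   # number of distinct byte values

# ASCII printing characters: codes 32 (space) through 126 (tilde).
printing = [chr(code) for code in range(SEVEN_BIT) if 32 <= code <= 126]
control = SEVEN_BIT - len(printing)
print(SEVEN_BIT, EIGHT_BIT, len(printing), control)   # 128 256 95 33

# Each character of an ASCII text fits into 7 bits.
word = "Library"
bits = [format(ord(ch), "07b") for ch in word]
print(bits)

# The same non-ASCII letter (e with acute accent) gets a different
# byte value under each 8-bit convention.
accented = "\u00e9"
encodings = {name: accented.encode(name) for name in ("cp437", "mac_roman", "latin-1")}
print(encodings)
```

A file written under one of these conventions and read under another shows the wrong symbol for every character outside the ASCII range, which is the practical consequence of the missing 8-bit standard.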

Using more bits per character does not address the general problem found in a library. The works in a library use a variety of fonts, underlining, subscripts, mathematical symbols, Chinese ideographs, right-to-left writing, and the like. The solution is to use * embedded commands, consisting of sequences of control and printing characters to indicate changes of font, font size, spacing, and characters beyond the defined ASCII set, but again, no general standard exists. In Section LIBRARY.History.TEX you can find some examples of embedded commands. TeX uses the backslash character '\' as a control character to prefix its commands, as seen in Fig. TeX.

Transforming published text from its printed pages into ASCII is tedious and error-prone. The paper sheets have to be placed on a * scanner and the scanned image has to be analyzed to obtain ASCII. If the text contains unusual characters, equations, tables, drawings, photographs, or handwriting, the limits of automated analysis are soon reached, although ongoing research is steadily increasing the flexibility and capacity of *optical character recognition and reducing the error rate.

When character recognition is not feasible, the scanned images can be directly stored in the digital library. Now, however, there is no indexable material, so that ancillary information has to be attached to each image so that it can be retrieved. Finding a page by content is much more powerful than by attribute, as shown in Fig. Associate.


Figure Querying a Digital Library a) by Content and b) by Attribute. (not yet)

Most modern writing is done using computer systems, so that scanning and optical character recognition are not needed. However, most *word-processing software packages have their own conventions for controlling layout, font changes, and the like. Layout standards are starting to take hold; the Standard Generalized Markup Language (SGML) is introduced in Section LIBRARY.Technology.SGML below. A language for representing text of publishable quality is TeX, as described in Section LIBRARY.Technology.TeX, but it is not the only candidate for preparing material for publishing. The user-friendliness of * What-you-see-is-what-you-get (WYSIWYG) interfaces for word-processing hides internal complexities from authors, but also reduces the need for internal consistency. When WYSIWYG documents enter digital libraries, considerable conversion and editing may be needed to make the contents of the library consistent in form. Inconsistencies hinder search and further processing.

Pictures and graphs are poorly integrated in most text-processing systems. Pictures scanned into a computer system are acquired as *digital images. They are often *compressed to save space. Some images are created by digital technology, such as Computer-Aided Tomography (CAT) for medical diagnosis, and handled similarly. The capability of optical and related systems is increasing rapidly, with better resolution, i.e., more *pixels, and greater depth, i.e., more bits per pixel. Images used by intelligence agencies can require nearly 1 Gigabyte each, or 3000 times as much storage as a TV image. Compression, as was presented in ENTEDU.Technologies.compression, can reduce this volume drastically, but also requires the integration of compression software.

Graphics produced on a computer can be captured by recording the sequence of commands that created them. Engineering drawings, mainly generated by interaction with a computer, are stored according to graphical standards, which reduce the drawing to a collection of lines and curves, each described by its location in the drawing and some attributes, such as thickness, curvature, color, and the like. We present the emerging technologies in ENTEDU.Multimedia.

Eventually the problems of acquisition of old, printed material will be minor. Most past works will have been entered and be stored somewhere, as now is being accomplished with library catalogs. Nearly all new material is produced digitally, and problems to be faced will be compliance with standards and protection of intellectual property when no paper master copies exist.

Intellectual Property: A Cross-National Perspective [David Arulanantham]

LIBRARY.Functions.storage

A major function of any library is storage of its contents and the indexes needed to access its contents. Storage capacities of computers have been increasing so rapidly that storage was not even deemed to be a crucial issue in the HPCC technology initiatives. Still, the digital library must be concerned about storage, especially if it acquires much existing material in image form.

Storing the content of a page in image form requires approximately 1 Megabyte of storage prior to compression, about three times that amount if it is in color. This estimate is greatly affected by higher resolution, needed for fine print and mathematics, and levels of greyscale. Acquisition or conversion of material into character-encoded form (ASCII) greatly reduces its bulk, and also permits automation in indexing, retrieval analysis, and processing of literary material. Indexes permit rapid access to selected, relevant documents in large collections. Indexes, and the tools to use them, represent the value added by a library, and can make use of a library effective even for works stored in one's own studio. Comprehensive indexes can be large, often equaling in size the volume of the documents themselves.
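The figure of roughly 1 Megabyte per page can be reproduced with a little arithmetic. The sketch below assumes a letter-size page (8.5 by 11 inches) scanned at 300 dots per inch with one bit per pixel, and about 3000 characters of text on the page at one byte each; none of these numbers comes from the text above, and the function name page_image_bytes is likewise invented for the example.

```python
def page_image_bytes(width_in=8.5, height_in=11.0, dpi=300, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page (assumed page geometry)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


CHARS_PER_PAGE = 3000  # assumed amount of text on a page, one byte per character

bitonal = page_image_bytes()                    # black-and-white scan
finer = page_image_bytes(dpi=600)               # higher resolution for fine print
greyscale = page_image_bytes(bits_per_pixel=8)  # 256 levels of grey

print(bitonal)                    # 1051875 bytes, about 1 Megabyte
print(finer // bitonal)           # 4: doubling the resolution quadruples the size
print(greyscale // bitonal)       # 8
print(bitonal // CHARS_PER_PAGE)  # 350: the image is some 350 times bulkier than the text
```

Under these assumptions the character-encoded page is smaller than its image by a factor of a few hundred, before any compression is applied to either form.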

Digital indexes must be processable to be effective, and hence should never be stored in image form. Indexes can refer to data in text or in multimedia forms, such as images, video clips, or voice clips. The reference format must adapt itself to the type of the work. For text a filename, and page and character range are fine. It is more awkward to refer to part of an image. In video or voice clips a temporal range must be indicated. Indexes need not be stored with the documents they refer to, and their maintenance differs considerably from document maintenance. When documents in an electronic library combine text and other material, the text is often stored in a more accessible location, and other material, such as images, is retrieved by a reference from the text.
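One way to picture reference formats that adapt to the type of the work is a small set of record types, one per medium. The sketch below is an assumption made for illustration: the class names, their fields, the choice of a rectangle for image regions, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class TextRef:
    """A passage of text: file, page, and character range."""
    filename: str
    page: int
    first_char: int
    last_char: int


@dataclass(frozen=True)
class ImageRef:
    """Part of an image, here assumed to be a rectangle in pixels."""
    filename: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class ClipRef:
    """Part of a video or voice clip: a temporal range in seconds."""
    filename: str
    start_s: float
    end_s: float


Reference = Union[TextRef, ImageRef, ClipRef]

# The index is kept apart from the documents it refers to.
index: Dict[str, List[Reference]] = {}


def add_entry(term: str, ref: Reference) -> None:
    index.setdefault(term.lower(), []).append(ref)


def lookup(term: str) -> List[Reference]:
    return index.get(term.lower(), [])


add_entry("miter", TextRef("woodworking.txt", page=12, first_char=340, last_char=512))
add_entry("miter", ImageRef("joints.img", x=100, y=40, width=320, height=200))
add_entry("miter", ClipRef("workshop.vid", start_s=75.0, end_s=92.5))
for ref in lookup("Miter"):
    print(ref)
```

Because the index holds only references, it can be stored and maintained separately from the documents, and a single index term can lead to text, image regions, and clips alike.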

LIBRARY.Functions.access

Public access to digital libraries requires widespread availability of the networks, rapid identification of useful material, and rapid transmission of desired documents. Low cost is important to achieve broad public benefits. While early libraries maintained their own networks, general transmission technology is making rapid strides, and digital libraries are expected to become a major motivation for support of the * Internet and its successors. Today most of the information being transmitted is textual, so the load on the network is manageable. Data in image form is more voluminous, as described in Section LIBRARY.Functions.storage. * Indexes are compact, so that selection of works should not be overly problematic. Supporting a * user-friendly interface can require much interaction, but an effective * software architecture mitigates the transmission load. Getting the documents will be the major factor, especially if images, video, and speech are included. Compression, as presented in Chapter ENTEDU.Technologies.compression, can reduce the requirement probably by a factor of 100. Note that no single transmission path will ever need that capacity, since the sources and readers are distributed and will use separate facilities. However, perfect balance will not be achieved, and the average demand per second estimated above is also affected by imbalance over the day, leading to rush hours on the information highways. Additional technical factors to be considered are the increments needed to control transmission, avoid collisions, switch among paths, and manage traffic at intersections. These technical factors can easily double the requirements. Over time, demands on responsiveness and image quality will also increase. High quality images, as described in Section LIBRARY.Functions.acquisition, can require 1000 times the bandwidth.

In Chapter ENTEDU we presented video for entertainment and education. In that setting transmission is bundled by broadcasting the same material to many people at the same time and rate. In information retrieval bundling is very undesirable, since every reader has to be able to proceed at their own speed.

LIBRARY.Functions.querying

When browsing through the resources of the ubiquitous information world we encountered the need for formulation of effective queries.


A simple standard is the Common Command Language (CCL), Z39.58.

FIND AU  Find a document by name of author
FIND TI  Find a document by title
FIND SU  Find a document by subject
FIND TO  Find a document by topic, presupposes a thesaurus 
DISPLAY what's found
HELP
Figure: Ccl: Main commands of the Common Command Language.
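
The commands in the figure can be illustrated with a small interpreter. What follows is only a sketch under stated assumptions, not an implementation of Z39.58: the names `Record`, `CclSession`, and `FIELDS` are hypothetical, matching is a plain case-insensitive substring test, and FIND TO is left out because it presupposes a thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A hypothetical catalog entry; real records carry many more fields."""
    author: str
    title: str
    subjects: list = field(default_factory=list)


# Index codes from the figure, mapped to record fields (assumed mapping).
FIELDS = {"AU": "author", "TI": "title", "SU": "subjects"}


class CclSession:
    """Keeps the most recent result set so that DISPLAY can show it."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.found = []

    def run(self, command):
        words = command.split(None, 2)
        verb = words[0].upper() if words else ""
        if verb == "FIND" and len(words) == 3 and words[1].upper() in FIELDS:
            attr = FIELDS[words[1].upper()]
            term = words[2].lower()
            self.found = [r for r in self.catalog
                          if self._matches(getattr(r, attr), term)]
            return f"{len(self.found)} found"
        if verb == "DISPLAY":
            return "\n".join(f"{r.author}: {r.title}" for r in self.found)
        if verb == "HELP":
            return "Commands: FIND AU|TI|SU <term>, DISPLAY, HELP"
        return "Unrecognized command; try HELP"

    @staticmethod
    def _matches(value, term):
        # A field may hold one string or a list of strings (subjects).
        values = value if isinstance(value, list) else [value]
        return any(term in v.lower() for v in values)
```

A session would issue `run("FIND AU salton")` followed by `run("DISPLAY")`; keeping the result set inside the session is what lets DISPLAY work without repeating the query.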

LIBRARY.Functions.vector-search

Once an electronic library is well organized, better ways of browsing than those shown earlier can be provided. A prime candidate is navigation in a *vector-space [Gelernter:94]. To direct the search the navigator uses as vectors conceptual categories that are organized in some natural order, such as chronologically (past to present), by space (here to yonder), by physiological development (embryo to senility), by phylogeny (one-celled animals to human beings), by size (insects to elephants, for instance), etc. The system provides a means to move up and down along those vectors, and brings the most relevant material at the intersection of all the current vectors into view. New vectors can be added at any time and the range of interest along any vector can be varied. The generation of vectors or information spaces has been an active research direction for some time !![Salton.xx]> , but their use has been limited, partially because of a lack of access to powerful and interactive tools.
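
The navigation idea can be sketched in a few lines. This is a simplified illustration and not the method of [Gelernter:94]: it assumes each work has already been given a numeric coordinate along every vector, and the names `Work` and `Navigator`, as well as the ranking by distance from the centre of the current ranges, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Work:
    title: str
    position: dict  # coordinate along each vector, e.g. {"year": 1950}


class Navigator:
    def __init__(self, works):
        self.works = works
        self.ranges = {}  # vector name -> (low, high) range of interest

    def set_range(self, vector, low, high):
        """Add a new vector, or vary the range of interest along one."""
        self.ranges[vector] = (low, high)

    def move(self, vector, delta):
        """Move up or down along one vector, keeping the range width."""
        low, high = self.ranges[vector]
        self.ranges[vector] = (low + delta, high + delta)

    def in_view(self):
        """Works at the intersection of all current ranges, most central first."""
        def inside(work):
            return all(v in work.position and lo <= work.position[v] <= hi
                       for v, (lo, hi) in self.ranges.items())

        def off_centre(work):
            return sum(abs(work.position[v] - (lo + hi) / 2) / ((hi - lo) or 1)
                       for v, (lo, hi) in self.ranges.items())

        return sorted((w for w in self.works if inside(w)), key=off_centre)
```

Adding a second call to `set_range` narrows the view to the intersection of both vectors, which is the behaviour the paragraph above describes.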


Figure Vectorspaces !!from Salton if possible!!

LIBRARY.Functions.dynamic-book

!check for replicated info
Many books are prepared by multiple authors, and composed for presentation. Textbooks, even those written by a single author, consist of distinct chapters, and are organized so that a reader or teacher can skip those chapters that are not relevant at some point or for some curriculum. A paper book that is composed to contain much possibly relevant material gets heavy, costly, and awkward. Some publishers (McGraw-Hill) now allow such books, specifically those destined for classroom use, to be composed to order from selected chapters.

It is a small step to move to *dynamic books, where a skeleton designating the desired chapters is predefined, but the actual chapters are obtained from digital resources when needed. The chapters would be the latest versions, and convey the most up-to-date knowledge. Some smarts will be required to assure consistency. A cross-reference to a chapter that is not included can be satisfied later, unless that chapter is subsequently deleted. If the referenced chapter has been revised, the intended reference may be confusing. Including *timestamps with such cross-references can help, and permit going back to past versions of the chapter, which should have been archived. Avoiding this problem by inhibiting updates of referenced material would disable the up-to-dateness that dynamic books contribute over paper books. The table-of-contents and the index can include all candidate chapters. The bibliography may only collect entries of chapters that have actually been obtained.
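
A minimal sketch of this scheme follows, assuming a repository that archives every released version of a chapter. The names `ChapterVersion`, `Repository`, `compose`, and `resolve_reference` are hypothetical; no existing system is implied.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChapterVersion:
    chapter_id: str
    released: date
    text: str


class Repository:
    """Hypothetical archive that keeps every released version of a chapter."""

    def __init__(self):
        self._versions = {}

    def publish(self, version):
        history = self._versions.setdefault(version.chapter_id, [])
        history.append(version)
        history.sort(key=lambda v: v.released)

    def latest(self, chapter_id):
        return self._versions[chapter_id][-1]

    def as_of(self, chapter_id, when):
        """Most recent version released on or before `when`, or None."""
        older = [v for v in self._versions.get(chapter_id, [])
                 if v.released <= when]
        return older[-1] if older else None


def compose(skeleton, repo):
    """Build the book: fetch the latest version of each chapter in the skeleton."""
    return [repo.latest(chapter_id) for chapter_id in skeleton]


def resolve_reference(chapter_id, stamped, repo):
    """Follow a timestamped cross-reference back to the version then current."""
    return repo.as_of(chapter_id, stamped)
```

The design choice is that updates are never inhibited: `compose` always delivers the newest text, while a cross-reference carries its own date and can still reach the archived version its author had in mind.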

Unless the reader controls the entire composition of a dynamic book, there are now two levels of authors:

The editing author and the base author both contribute, and deserve intellectual and financial rewards for their efforts. The traditional copyright mechanisms need to be adapted to digital libraries and the new dissemination methods that are enabled by the digital highways.

LIBRARY.Functions.copyright

As alluded to above, the need to reimburse authors and publishers for their efforts in bringing information to their audience is the major bottleneck to successful deployment of digital libraries. This barrier is not universal. Material produced by the government should be freely available to all citizens. Many scientists are supported by government grants, and do not expect to derive income from their publications. Some journals, in fact, assess page charges to be paid by the author. But in any case the publisher incurs costs. Reviewers, editors, translators, typesetters, binders, marketing, and postal charges must be paid. Reviewers for many scientific publications are actually unpaid; they are motivated to review manuscripts of others by a desire to stay professionally up-to-date, to keep the science in their field at a high level, and as a quid-pro-quo for getting their manuscripts reviewed in turn. Still, scientific journals tend to be costly, because the care going into their production is high and the number of copies sold is small.

Non-academic authors do depend on their income from selling their works, directly or indirectly. If their work is part of their employment, their employers may have the right to it. Owners can sell their work for one-time use, for unlimited use, or for exclusive use. Publishers who obtain such works can similarly resell rights obtained from authors. Readers can freely use the information conveyed in literary works, but not redistribute the material in its fixed form.

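In the U.S., protection for intellectual property (IP), as defined in copyright and patent law, derives from the Constitution:

"The Congress shall have Power ... To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries."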

Copyright conceptually provides the legal basis for protecting literary works, but has been extended to cover all written works, music, and images. Copyright is being adapted for digital works as well, for instance documents on computer storage, music on tapes or CD-ROMs, and digital images. Extrapolations to the Internet are not yet well understood. In software the scope of copyright and patents overlaps. For library material we focus on copyright.

Since 1978 copyright for an author in the U.S. is established as soon as the work is fixed, that is, committed to paper or electronic media. To simplify protection of literary works, copyrights can be registered with the U.S. Copyright Office, by supplying a simple claim form and two copies of the work [1976 Copyright Act]. For complex works it is adequate if only part of the work is submitted; the submission should be sufficient to disambiguate eventual conflicts. Violations of copyright require that the offended party bring their claims forward as a legal suit. Having the work registered helps in bringing a claim, but is not essential.

The copyright laws spell out conditions for reuse of literary works. Recognized are rights for cases of

  1. fair use, covering individual use, use in critical review, and limited, non-profit redistribution, say, to a class,
  2. public performance,
  3. redistribution for profit,
  4. exclusive reuse, and
  5. making derived works.
The establishment of digital libraries requires a reevaluation of all these cases. In the simple cases, the reader, having acquired a copy, has the right to the information and may, with attribution, make small quotations from the work. For making copies, a copyright fee is due, to be paid to the owner of the copyright. The ubiquity of paper copiers causes many violations of this obligation to pay.

To enforce copyright while transporting or storing electronic documents, encryption may be used. Decryption keys must be distributed with care, as described in the Chapter on Security.

We see older material being scanned and placed on-line, such as the 1911 Webster's dictionary and the 1911 Love-To-Know Encyclopedia.

Add: new millennium - Sonny Bono - copyright act

Digital Libraries

Since copying of entire works is simplified by digital technology, the transfer of works in digital form to libraries has been hampered by security concerns. For instance it took two years of negotiation before publishers of works on CD-ROM would release their copies to be deposited in the Library of Congress (LoC). The LoC must ensure physical security; any copying of data from these works requires a signed agreement certifying that the purpose is only for research and study; and network access is to be limited to a five-client LAN.

For software it is now acceptable to only deposit the initial (5?) pages and (?) of the work, that amount being adequate to establish a definite fingerprint.

The access limitations to digital works, established for the LoC by the Information Industries Association (IIA), the National Federation of Abstracting and Information Services (NFAIS), and the Association of American Publishers (AAP) with the LoC !!disable implementation of the concept of digital libraries.

The public performance of a work, traditionally applied to music and plays, has broad implications in digital libraries. It not only covers broadcast of music, but also the execution of a copyrighted computer program. A person who obtains a program legitimately obtains a copy of the work fixed in some digital storage medium for personal use, but not necessarily the right to its use as a performance, say, as part of a for-profit service. Useful services along the digital highways require obtaining multiple rights. An information service emulating a library will require having rights to the documents it can retrieve, or a method for reimbursing the owner of their copyright, as well as the rights to the performance of programs which provide the retrieval service. If the programs use some domain or expert knowledge, as is often provided by librarians or other experts, then that knowledge may be protected by its own copyright.

Much early work in digital library technology was carried out in scientific or governmental laboratories where copyright issues were easily ignored. Many such systems exist, and will grow and continue to provide important services. At the same time, future digital libraries will need to incorporate technologies that support reimbursement of producers of information. Without such mechanisms much useful information will be withheld from public use, and the creation of high-quality, new information will be stunted. We present a rights system which is testing concepts that are emerging to deal with digital recordation and reimbursement for use of copyrighted material.

[[Material to be edited ]]
[Kahn at HPC: If getting copyright registration is easy, will there be overload? No central deposit is needed. Distributed but secure. Today the deliverer has liability - shipping a book by Federal Express. Also cable TV programs incorporating. Digital object is equivalent to package and program. Most of our use is copies of performance = display. Putting on the net is performance?? ] TimeOnLine: supply taste and put it in paper for convenience [digital objects identified by RSA fingerprint]

Bob Kahn 6Apr95
Intellectual property is crucial. Role of patent and copyright. Interfaces can be both copyrighted and patented - does one supersede the other? "The Congress shall have Power ... To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries." Works are copyrighted, fixed. Can flow around. Duality in Kahn's view: waves vs particles. Protected in tangible and intangible form

  1. paper, CD-ROM, mag tape etc
  2. communication, public performance, displays (An intermediate, common carrier, is explicitly exempted from liability.)
  3. derivative work
Five protected Rights
  1. Reproduction, i.e., publication
  2. Distribution
  3. Public Performance (like fair use doctrine)
    Big cultural differences in countries (case Supreme court video tape, ok for copying performance at home. No rights in sound recording. Network distribution looks like performance)
  4. Public display (even of pictures you own)
  5. Derivative Works.
Legal note: You don't know what information etc. See results of the Uruguay round of GATT: talks about information moving across national boundaries. Advertising on the net:

Copyright protection architecture. IETF sponsor. Allow owners to negotiate terms and conditions. "Money is not made from information, but from validating ownership". Architecture is minimal, see cnri web server. Has also software for local naming authority. http://www.cnri.reston.va.us Global services identified now by IETF. Naming authority is local: global/local //global/local or >>global>naming. Get digital objects (handle): give handle, get metadata. Handle is close, but not an URN. When somebody left"re copyright ... [[end of material to be edited - into technology section?]]

LIBRARY.Technologies

Technologies for digital libraries have been encountered throughout the preceding chapters. Bulletin boards provide access to information that bypasses any editing, publishing, and selling intermediaries. Uncontrolled access to copyrighted works is frequent, and typically unknown to the reader. Browsing and querying of library documents are strongly related. With the concept of libraries comes an increasing level of professionalism and formalisms, and technological support for these aspects will be stressed in this section.

LIBRARY.Technologies.digitizing

!here or earlier!

LIBRARY.Technologies.hypertext

!here or earlier!

LIBRARY.Technologies.macros

TEXMACROS


to be embedded
Figure: A TeX macro definition for a Section heading, with comments . (not yet in)


LIBRARY.Technologies.SGML

SGML !! ref to HGML>

LIBRARY.Technologies.indexing

[[relate to compression ]]
!controlled vocabulary

LIBRARY.Technologies.image-analysis

Images collected in digital libraries will have some external parameters; for instance, satellite images can be located in NASA's files by specifying the spacecraft, the observing instrument, and the time or the orbit and longitude. This information says little about the image content itself. When images have been related to textual descriptions we can use the indexes created for the descriptive text to locate appropriate images. To search for images by their content the image itself has to be processed so that the contained objects can be identified. There are two obvious choices for automation, but both approaches are still in the research stage:

  1. Provide for image-model-based search: the requestor shows or sketches an example of the object. The corresponding index contains image features to be matched to the features of the example. For instance houses, seen from space, are mainly composed of rectangles that have areas of about 10 to 1000 square meters and edge ratios from 1 to 5.
  2. Analyze the images in the library to recognize objects and create index terms: say, 'many houses'. Ancillary data is needed to label individual houses, as 'Clinton's house'.
The first choice provides more flexibility; the second choice presupposes that the indexing process has a viable model of what the searchers will be looking for. If the searcher is looking specifically for greenhouses, many candidates will be missed or lost in a region of houses.
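
The house example can be written down as a simple feature test. The sketch below assumes that an earlier step has already extracted rectangles with their dimensions in meters; the `Rectangle` class and the function names are hypothetical, and only the area and edge-ratio bounds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Rectangle:
    """A rectangular image feature, already scaled to ground meters (assumed)."""
    width_m: float
    height_m: float

    @property
    def area(self):
        return self.width_m * self.height_m

    @property
    def edge_ratio(self):
        long_side = max(self.width_m, self.height_m)
        short_side = min(self.width_m, self.height_m)
        # A degenerate rectangle has no meaningful ratio; treat it as unbounded.
        return long_side / short_side if short_side > 0 else float("inf")


def looks_like_house(rect, min_area=10.0, max_area=1000.0, max_ratio=5.0):
    """Apply the bounds quoted in the text: 10 to 1000 square meters, ratio 1 to 5."""
    return min_area <= rect.area <= max_area and rect.edge_ratio <= max_ratio


def candidate_houses(rectangles):
    return [r for r in rectangles if looks_like_house(r)]
```

A filter of this kind accepts greenhouses as readily as houses, which is exactly the weakness noted above for searchers with a more specific target.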

Processing of images, or *image understanding (IU), is a crucial topic for dealing with the volume of images that is arriving in the warehouses along the digital highways. Since no complete paths exist today, we only summarize the phases, so that readers who encounter writings on the topic can recognize where that work falls.

  1. Feature extraction translates the *pixels of an image to features, such as areas of similar brightness with their position, size and shape, and lines with their end-points, curvatures and crossings. An alternative to searching for such semantically meaningful features is to apply automatic filtering; good candidates are *wavelets. This phase produces the input for approach 1 above.
  2. Model matching compares features of the example with the features in the image. It is unlikely that feature sets will match precisely, so that non-perfect matches are candidates as well. Now many more candidates will not be relevant, and a further selection phase is needed.
  3. A candidate image can be further analyzed. There will be expectations on how features are arranged, and those can be modeled as well. For instance isolated small rectangles will not belong to a house, and may be greenhouses or cars. The remaining images are likely to be more relevant.
  4. The remaining images are presented to the searchers for evaluation. When the number is still large, only a random subset may be shown, to enable phase 5.
  5. Feedback about relevance is solicited and used to improve the system. For instance, the features of images that were evaluated to be relevant can be used to redo phase 2.
A system of the type summarized above needs an overall planning component. The selection of features is crucially affected by the types of objects searched.
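
The matching, sampling, and feedback phases can be sketched as three small functions. This is an illustration under strong assumptions: feature extraction is taken as done, each image is reduced to a tuple of numbers, closeness is plain Euclidean distance, and feedback simply moves the example to the mean of the images judged relevant. The arrangement analysis of phase 3 is not modeled, and all function names are hypothetical.

```python
import random


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def match(example, library, tolerance):
    """Phase 2: keep images whose features lie within `tolerance` of the example."""
    return [name for name, features in library.items()
            if distance(example, features) <= tolerance]


def sample_for_review(candidates, limit, seed=0):
    """Phase 4: show only a random subset when there are too many candidates."""
    if len(candidates) <= limit:
        return list(candidates)
    return random.Random(seed).sample(candidates, limit)


def refine(example, library, relevant):
    """Phase 5: move the example to the mean of the images judged relevant."""
    if not relevant:
        return example
    vectors = [library[name] for name in relevant]
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))
```

Calling `match` again with the refined example closes the loop back to phase 2; the tolerance is the knob that trades missed images against irrelevant candidates.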

The issue of *recall, i.e., whether relevant images are missed altogether, is not easy to assess. The effort to check an entire large image library is beyond reason today, so that recall cannot be evaluated outside of modest samples. The quality of image processing software can be assessed by placing artificial, known-relevant images in the library and checking that these are retrieved.
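
The planting check amounts to a one-line measurement: the fraction of the planted images that the search brings back, which the retrieval literature calls recall. The function name below is hypothetical, and the estimate is only as good as the resemblance between planted and genuine images.

```python
def estimate_recall(planted, retrieved):
    """Fraction of the artificially planted relevant images that were retrieved."""
    planted = set(planted)
    if not planted:
        raise ValueError("at least one planted image is needed")
    return len(planted & set(retrieved)) / len(planted)
```

For example, if four images are planted and a search returns two of them among its results, the estimate is 0.5, whatever else the result list contains.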

!!!!expand> IU: spec: planning (feature selection (easy->finer) ((features (brightness, clustering into regions (size shape) mutual arrangements) testing searcher evaluation refinement>

LIBRARY.Technologies.personal-versions

!Not yet

LIBRARY.Technologies.abstracts

Abstracts provide the most useful compression of documents. There is success with specialized documents, using metamodels: military messages; 10-K reports from businesses, converted to tabular form (Price Waterhouse, using SOAR); problem summaries from medical records? Morning report [Zdonik]

Also work on financial articles from the Wall Street Journal using templates !!Check dup>

Work at Xerox PARC

LIBRARY.Technologies.rights

The management of copyrights and the collection of copyright fees form an enterprise that is hidden from most readers, but provides the economic underpinning for a large industry. This backroom business is also under increasing stress, as the volume and complexity of copying material increases. The publishers' own production is covered by agreements with the author. Payment for copying of journal articles is handled by the *Copyright Clearance Center. Payments for the performance of works in the theater, on TV or on radio are managed by *ASCAP. The remainder of rights, say for use of an existing figure in a new book, an anthology of papers to be used in a course, or for inclusion of an entire work in an electronic library, is handled by the copyright owner. No standards exist here, although the individual publishers do have their conventions for what rights they cede freely, for what they require payment, and what they will not release. Since most payments are modest, the handling of rights is an unprofitable operation for most publishers. Response times are often slow, causing requestors to accept the risk of violating copyrights. Increasing the fees to pay for more processing staff will create new reasons for ignoring copyright laws.

The rights and payment conventions are being stressed by electronic acquisition and dissemination. An ongoing effort by the Corporation for National Research Initiatives (CNRI) in cooperation with the LoC is defining the architecture for an Electronic Copyrights and Permissions Management System (ECMS). The participants in such a system are the rights holders, the works being protected, and the requestors. Works will need unique identifiers, and standard notations for various types of uses will need to evolve. By formulating rules and fees the rightsholders will be able to assign their processing to an ECMS. Secure transmission of rights will be needed to protect the works and the reimbursements. The actual works are stored outside of the ECMS, on any repository that affords sufficient protection. Figure Architecture for Rights sketches the linkages among the participants and system components.
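
The roles described above can be made concrete with a toy registry. This is a hypothetical sketch and not the CNRI design: the class names, the use-type strings, and the convention that a fee of `None` means "not released automatically" are all assumptions for illustration. The works themselves are not stored here, only their identifiers and the rules set by the rightsholder.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    use: str              # type of use, e.g. "course-anthology" (assumed notation)
    fee: Optional[float]  # None: the rightsholder does not release this use


@dataclass
class RegisteredWork:
    identifier: str       # unique identifier of the work
    rightsholder: str
    rules: dict = field(default_factory=dict)


@dataclass
class Permission:
    work_id: str
    requestor: str
    use: str
    fee: float


class RightsRegistry:
    def __init__(self):
        self._works = {}
        self.ledger = []  # permissions granted, the basis for reimbursement

    def register(self, work):
        if work.identifier in self._works:
            raise ValueError(f"identifier already registered: {work.identifier}")
        self._works[work.identifier] = work

    def request(self, work_id, requestor, use):
        """Grant a permission under the holder's rules, or return None."""
        work = self._works.get(work_id)
        rule = work.rules.get(use) if work else None
        if rule is None or rule.fee is None:
            return None  # must be negotiated with the rightsholder directly
        permission = Permission(work_id, requestor, use, rule.fee)
        self.ledger.append(permission)
        return permission

    def owed_to(self, rightsholder):
        return sum(p.fee for p in self.ledger
                   if self._works[p.work_id].rightsholder == rightsholder)
```

Because the rules are formulated once by the rightsholder, a routine request is answered immediately, which is the reduction in response time that automation is meant to deliver.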


Figure: Model for a candidate Electronic Copyrights and Permissions Management System (ECMS) [Inf. courtesy of CNRI]

Automation of rights management should greatly reduce the time needed to get permissions, and mitigate a major motivation for infringing on copyright protection. Automation can also reduce the costs, and bring them in line with fees that encourage compliance with copyright laws. Smaller rightsholders will be able to participate in a shared, automated ECMS, removing yet another incentive for authors to deal with traditional publishers.

LIBRARY.Technologies.presentation

If many documents rank, visualization [Hirshman@MITRE] allocates real estate on result page proportional to hits in that DB.
TIPSTER extraction according to template. Example: joint venture reports {date, source, partners, p's industry and industry type (SIC code), p's location, p's investment, objective, means ... } Can generate sentences by TEXPLAN

Refining queries !!\X?>

LIBRARY.Alternatives

LIBRARY.Alternatives.bboards

!here or as internet service?>

LIBRARY.Alternatives.databases

The term 'database' is often applied to digital libraries, but it is useful to distinguish databases from this topic. To make the distinction crisp we will focus on the extremes, although many intermediate forms exist.

Aspect          Digital Library                Database
-------------   ----------------------------   --------------------------------
Content         Processed *Information         Observed Facts, i.e., Data
Source          Authors, editors, publishers   Clerks, Data acquisition devices
User            People (students, ...)         Analysis Programs
Validation      Expert editors                 Validation Programs
Dissemination   By assignment of rights        Via proprietary services
-------------   ----------------------------   --------------------------------

LIBRARY.Alternatives.self-publishing

LIBRARY.Alternatives.on-demand printing

"We are pleased to announce the availability of an electronic primer on geometric constraint solving developed for the ONR research community. The primer is an electronic book available on world-wide-web through XMosaic. It can be read following four predefined "tours", or following hyperlinks. The primer also contains instructions on downloading the constraint solver to be run locally, on Sun workstations, as well as the capability to run the constraint solver locally at Purdue. The URL access is http://www.cs.purdue.edu/homes/pjv/book/intro.html.

LIBRARY.Alternatives.derived-catalogs

!Grateful MED

LIBRARY.Alternatives.IP

Writings and pictures, as protected by copyright, are just one form of Intellectual Property (IP). Other forms are Patents, Trademarks, and business Trade Secrets.

Trademarks

Trademarks can be filed with State !or Federal! authorities, and when used consistently with the trademark symbol (TM) become the property of the filer. Their use is not specific to the Internet, although there is a useful service to compare a proposed trademark with those already filed.

LIBRARY.Bio

Don Lindberg?

If the digital highways had existed at that time <1978> he might not have wanted to leave Columbia, MO.

Biography

Fred Kilgour worked at Harvard, and founded the Online OCLC in 1968 to deal with the cataloging problem and exploit the benefits of MARC. (see OCLC above)

LIBRARY.Conclusion

When we summarize the activities grouped under digital libraries we see that the changes are many, and are likely to be wrenching. The concept of the 'free public library', maintained by a local community for the benefit of its population, is hard to model in the digital world. A model I foresee personally, although hazily, is that communities of authors and readers will regroup based on domains of interest, rather than geographic domains. Within domains *ontologies will become consistent. Being freed from constraints of physical proximity, most readers will be members of several such communities. Within each such community will be a variety of services, some for payment and profit, others made available cheaply or for free, as now recognized by the fair use provisions in the copyright world. Within those communities taxes may be collected to enable the provision of some free services, and volunteer labor will help to provide access. Even local philanthropy may have a role, especially in those domains of interest where funds are scarce, as in the arts or religion. Participants will have to agree on certain rules of behavior, mainly dealing with staying within the boundaries of fair use; just as today's library patron agrees to return a book or tape on time, and not to copy it for resale.


There will still be the underserved, for whom the free libraries provided relief in the past. People who live far from the main information highways now pay high access charges, in the form of rural telephone tolls, which can be more to reach the nearest town than the price of an interstate connection. In the 1930's the Rural Electrification Administration (REA) helped bring light and appliances to every farm. Its bureaucracy still exists in the Department of Agriculture and spends over $140M per year. It might find a justification for its existence in repeating its success in the next century by disseminating information rather than power to rural communities.

The publishing enterprise will experience the greatest disruption. Its members will only be able to survive if they learn to focus on the value they are providing, and those values are likely to be domain specific as well [Lederberg:92]. Publishers have responded to calls for more scientific information by increasing the number of publications, and then raised the prices of the journals because of low subscription rates for these publications. The higher prices cause libraries financial distress, which in turn forces many libraries to cancel subscriptions. This has initiated a vicious spiral, where in the end only a few paper copies will be sold, and the remaining distribution will be by copying and interlibrary loan. Interlibrary loan is also simplified through the use of networks [Dutcher:89]. In this swamp independent digital library services may flourish.

Commercial publications, supported mainly by advertising revenue, will feel the pinch from other directions. Some advertisement revenue is already flowing today to the few cable shopping channels. When these services become interactive, allowing customers to browse through domains of interest, ask specific questions on, say, quality, from an independent source, see demonstrations on-line and the like, as presented in Chap.\F., then the revenues of commercial publications will be severely impacted. The only advertisers that may remain are perfumes, with their printed odor-strips, for which I have not yet seen a digital replacement.

LIBRARY.Lists

LIBRARY.Lists.resources

We can only display a sampler of collections made available by digital libraries today, and even if we were complete at one point, we'd be out-of-date a week later. Any comprehensive list should be kept on-line, as a high-level library service itself. Maintaining such a listing will be another value-added service, and deserving of financial reimbursement. A current, free resource reference is the sourcebook on digital libraries, compiled by Edward Fox [Fox93].

Update with web pages

To get information about NASA data, you can use a browser: From the NASA home page -> GSFC page -> Space Data and Computing Division ->

research on Digital Library Technology, with examples for Browsing


Fin



NOTES

------- ARPS Meeting 25Nov 94
Needed services
Discovery searching
data interchange and format conversion
Authentication and security
electronic payments
Linkage
Repository 500 years
Registration and publishing
----------
Missing abstraction again
-
Digital Library and electronic commerce mutually support each other.

micro payments
Barry diller quote Nov 24th "we have overhyped the Information Highways"
John Young "Let's do experiments"

Commercenet has secure Mosaic


Digital Libraries
name owner / address content charging size [ref] |
AVHRR NASA Goddard / cjtucker@gfscmazil.nasa.gov Land and ocean emissions obtained by the advanced very high resolution radiometer free |
Ball Aerospace Earth images / 1 meter resolution commercial planned for 19xx [Bill mark]|
Compustat company business, stock data subscription [Goh:94]|
CNRI Corporation for National Research Initiatives Internet conventions, rights model private non-profit, grants [www.cnri.reston.va.us]|
CRSP company business, stock data subscription [Goh:94]|
DataStar Knight-Ridder / Europe, Miami FL bibliographic super service |
Datastream company business, stock data subscription [Goh:94]|
defense conversion Pasha Publications / Arlington VA 800 952-0122 , 800 227 8431 |
Dialog Knight-Ridder / Palo Alto CA, Miami FL bibliographic super service / 450 databases subscription +/ fees |
Disclosure company business, stock data subscription [Goh:94]|
DoD TRP Defense Conversion Clearinghouse technology transition projects 1 800 352-2949 or internet|
DVMS Ziff Information Services / Medford MA |
Edgar Securities and Exchange Commission company business, stock data subscription [Goh:94]|
Einstein NASA Space and planetary observations free |
Electronic Book Technology Providence RI SGML viewers |
Envirotext DoE Un.Illinois, Champaign Library of environmental information |
Equinet Knight-Ridder / Australia, Miami FL financial info.service subscription |
ExPASy Hospital Cantonal / Geneva Switzerland Molecular Protein Biology / electrophoresis images with links from spots to data WWW |
Eyeglass Earth images and processing / commercial planned |
Fortune 2000 U.S. company business, stock data subscription [Goh:94]|
GDE systems, Orbital Sciences, Litton Earth images / 1 meter resolution commercial |
GOES NASA Images from Geostationary Earth Observations Satellite |
IRAS Caltech /wheelock@ipac.caltech.edu Infrared Sky Survey atlas on-line |
Landsat EROS / NASA Earth observations cost/free subsets |
!lockheed! Lockheed Space Imaging Earth images and processing / 1 meter resolution commercial licensed, planned for 1997? [Bill mark]|
LEXIS Mead Data Central / Columbus OH? legal reference service |
NEXIS Mead Data Central / Columbus OH? bibliographic super service |
OASIIS Western Atlas Software / Houston TX geographic data |
PDS Jet Propulsion lab / pds_operator@jplpds.nasa.gov Mars digital image map free |
Regnet NPR planned governmental regulations |
Reuters Great Britain company business, stock data subscription [Goh:94]|
SMMR NASA Cloud, polar ice, snow etc.radiation data from scanning multichannel microwave radiometer |
SPOT Spot / France Earth images / 10 meter resolution * 54km commercial operational|
SSM/I NASA hydrology data from special sensor microwave imager |
TIROS NOAA Television infrared observing satellite |
TOVS NOAA-NASA Sea and earth altitudes from Operational vertical sounder |
TOMS NOAA Total Ozone Mapping spectrometer |
Visible Human NLM Atlas providing 3-D cross-sections of a human being [Ackerman:94] there should be a better reference |
Worldscope company business, stock data subscription [Goh:94]|
Worldview WorldView Corp Earth images / 3 meter resolution commercial licensed 1993 / planned for 19xx |
Yellow Pages Prodigy business phone numbers, NY, New England free |
1911 Encyclopedia http://www.1911encyclopedia.org/