During the course of the Datacomputer Project, many people have
contributed to the development of datalanguage.
The suggestions and criticisms of Dr. Gordon Everest (University of
Minnesota), Dr. Robert Taylor (University of Massachusetts), Professor
Thomas Cheatham (Harvard University) and Professor George Mealy (Harvard
University) have been particularly useful.
Within CCA, several people in addition to the authors have participated
in the language design at various stages of the project. Hal Murray,
Bill Bush, David Shipman and Dale Stern have been especially helpful.
1.1 The Datacomputer System
The datacomputer is a large-scale data utility system, offering data
storage and data management services to other computers.
The datacomputer differs from traditional data management systems in
First, it is implemented on dedicated hardware, and comprises a separate
computing system specialized for data management.
Second, the system is implemented on a large scale. Data is intended to
be stored on mass storage devices, with capacities in the range of a
trillion bits. Files on the order of one hundred billion bits are to be
Third, it is intended to support sharing of data among processes
operating in diverse environments. That is, the programs which share a
given data base may be written in different languages, execute on
different hardware under different operating systems, and support end
users with radically different requirements. To enable such shared use
of a data base, transformations between various hardware representations
and data structuring concepts must be achieved.
Finally, the datacomputer is designed to function smoothly as a
component of a much larger system: a computer network. In a computer
network, the datacomputer is a node specialized for data management, and
acting as a data utility for the other nodes. The Arpanet, for which
the datacomputer is being developed, is an international network which
has over 60 nodes. Of these, some are presently specialized for
terminal handling, others are specialized for computation (e.g., the
ILLIAC IV), some are general purpose service nodes (e.g., MULTICS) and
one (CCA) is specialized for data management.
Datalanguage is the language in which all requests to the datacomputer
are stated. It includes facilities for data description and creation,
for retrieval of or changes to stored data, and for access to a variety
of auxiliary facilities and services. In datalanguage it is possible to
specify any operation the datacomputer is capable of performing.
Datalanguage is the only language accepted by the datacomputer and is
the exclusive means of access to data and services.
1.3 Present Design Effort
We are now engaged in developing complete specifications for
datalanguage; this is the second iteration in the language design
A smaller, initial design effort developed some concepts and principles
which are described in the third working paper in this series. These
have been used as the basis of software implementations resulting in an
initial network service capability. A user manual for this system was
published as working paper number 7.
As a result of experience gained in implementation and service, through
further study of user requirements and work with potential users, and
through investigation of other work in the data management field, quite
a few ideas have been developed for the improvement of datalanguage.
These are being assimilated into the language design in the iteration
now in progress.
When the language design is complete, it will be incorporated into the
existing software (requiring changes to the language compiler, but
having little impact on the rest of the system).
Datacomputer users will first have access to the new language during
1.4 Purpose of this Paper
This paper presents concepts and preliminary results, rather than a
completed design. There are two reasons for publishing now.
The first is to provide information to those planning to use the
datacomputer. They may benefit from knowledge of our intentions for
The second is to enable system and language designers to comment on our
work before the design is frozen.
1.5 Organization of the Paper
The remainder of the paper is divided into four sections.
Section 2 discusses the most global considerations for language design.
This comprises our view of the problem; it has influenced our work to
date and will determine most of our actions in completion of the design.
This section provides background for section 3, and reviews some
material that will be familiar to those who have been following our work
Section 3 discusses some of the specific issues we have worked on. The
emphasis is on solutions and options for solution.
In sections 2 and 3 we are presenting our "top-down" work: this is the
thinking we have done based on known requirements and our conception of
the desirable properties of datalanguage.
We have also been working from the opposite end, developing the
primitives from which to construct the language. Section 4 presents our
work in this area: a model datacomputer which will ultimately provide a
precise semantic definition of datalanguage. Section 4 explains that
part of the model which is complete, and relates this to our other work.
Section 5 discusses work that remains, both on the model and in our
2. Considerations for Language Design
Data management is the task of managing data as a resource, independent
of hardware and applications programs. It can be divided into five
(1) _creating_ databases in storage,
(2) making the data _available_ (e.g., satisfying queries),
(3) _maintaining_ the data as information is added, deleted and
(4) assuring the _integrity_ of the data (e.g., through backup and
recovery systems, through internal consistency checks),
(5) _regulating_access_, to protect the databases, the system, and
the privacy of users.
These are the major data-related functions of the datacomputer; while
the system will ultimately provide other services (such as accounting
for use, monitoring performance) these are really auxiliary and common
to all service facilities.
This section presents global considerations for the design of
datalanguage, based on our observations about the problem and the
environment in which it is to be solved. The central problem is data
management, and the datacomputer shares the same goals as many currently
available data management systems. Several aspects of the datacomputer
create a unique set of problems to be solved.
2.2 Hardware Considerations
2.2.1 Separate Box
The datacomputer is a complete data management utility in a separate,
closed box. That is, the hardware, the data and the data management
software are segregated from any general-purpose processing facilities.
There is a separate installation dedicated to data management.
Datalanguage is the only means users have for communicating with the
datacomputer and the sole activity of the datacomputer is to process
Dedicating hardware provides an obvious advantage: one can specialize it
for data management. The processor(s) can be modified to have data
management "instructions"; common low-level software functions can be
built into the hardware.
A less obvious, but possibly more significant, advantage is gained from
the separateness itself. The system can be more easily protected. A
fully-developed datacomputer on which there is only maintenance activity
can provide a very carefully controlled environment. First, it can be
made as physically secure as required. Second, it needs to execute only
system software developed at CCA; all user programs are in a high-level
language (datalanguage) which is effectively interpreted by the system.
Hence, only datacomputer system software processes the data, and the
system is not very vulnerable to capture by a hostile program. Thus
there is the potential to develop data privacy and integrity services
that are not available on general-purpose systems, and one can expect
less difficulty in developing privacy controls (including physical
ones) for the datacomputer than for the systems it serves.
2.2.2 Mass Storage Hardware
The datacomputer will store most of its data on mass storage devices,
which have distinctive access characteristics. Two examples of such
hardware are Precision Instruments' Unicon 690 and Ampex Corporation's
TBM system. They are quite different from disks, and differ
significantly from one another.
However, almost all users will be ignorant of the characteristics of
these devices; many will not even know that the data they use is at the
datacomputer. Finally, as the development of the system progresses,
data may be invisibly shunted from one datacomputer to another, and as a
result be stored in a physical format quite different from that
In such an environment, it is clear that requests for data should be
stated in logical, not physical terms.
2.3 Network Environment
The network environment provides additional requirements for
2.3.1 Remote Use
Since the datacomputer is to be accessed remotely, the requirement for
effective data selection techniques and good mechanisms for the
expression of selection criteria is amplified. This is because of the
narrow path through which network users communicate with the
datacomputer. Presently, a typical process-to-process transfer rate
over the Arpanet is 30 kilobits per second. While this can be increased
through optimization of software and protocols, and through additional
expenditure for hardware and communications lines, it seems safe to
assume that it will not soon approach local transfer rates (measured in
the megabits per second).
A typical request calls for either transfer of part of a file to a
remote site, or for selective update to a file already stored at the
datacomputer. In both of these situations, good mechanisms for
specifying the parts of the data to be transmitted or changed will
reduce the amount of data ordinarily transferred. This is extremely
important because with the low per bit cost of storing data at the
datacomputer, transmission costs will be a significant part of the total
cost of datacomputer usage.
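
The effect of selection on transmission can be estimated with simple arithmetic. The sketch below uses the 30 kilobit per second rate and the hundred-billion-bit file size quoted in this paper; the `selectivity` parameter (the fraction of the file that satisfies a request) and all function names are hypothetical and serve only as illustration.

```python
# Figures quoted in the text: about 30 kilobits per second over the Arpanet,
# and files on the order of one hundred billion bits.
ARPANET_BITS_PER_SECOND = 30_000
LARGE_FILE_BITS = 100_000_000_000


def transfer_seconds(bits, rate=ARPANET_BITS_PER_SECOND):
    """Time needed to move `bits` over a path carrying `rate` bits per second."""
    return bits / rate


def selected_bits(file_bits, selectivity):
    """Bits transmitted when only a fraction of the file qualifies.

    `selectivity` is a hypothetical parameter: the fraction of the file
    that satisfies the selection criteria stated in the request.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return file_bits * selectivity


# Moving the whole file versus moving one part in 100,000 of it.
whole_file_days = transfer_seconds(LARGE_FILE_BITS) / 86_400
subset_seconds = transfer_seconds(selected_bits(LARGE_FILE_BITS, 1e-5))
```

Under these assumptions the whole file occupies the path for more than a month, while the selected subset takes well under a minute, which is why selection criteria should be evaluated at the datacomputer rather than at the remote site.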
2.3.2 Interprocess Use of the Datacomputer System
Effective use of the network requires that groups of processes, remote
from one another, be capable of cooperating to accomplish a given task
or provide a given service. For example, to solve a given problem which
involves array manipulation, data retrieval, interaction with a user at
a terminal, and the generalized services of a language like PL/I, it may
be most economical to have four cooperating processes. One of these
could execute at the ILLIAC IV, one at the datacomputer, one at MULTICS,
and one at a TIP. While there is overhead in setting up these four
processes and in having them communicate, each is doing its job on a
system specialized for that job. In many cases, the result of using the
specialized system is a gain of several orders of magnitude in economy
or efficiency (for example, online storage at the datacomputer has a
capital cost two orders of magnitude lower than online costs on
conventional systems). As a result, there is considerable incentive to
consider solutions involving cooperating processes on specialized
To summarize: the datacomputer must be prepared to function as a
component of small networks of specialized processes, in order that it
can be used effectively in a network in which there are many specialized
2.3.3 Common Network Data Handling
A large network can support enough data management hardware to construct
more than one datacomputer. While this hardware can be combined into
one even larger datacomputer, there are advantages to configuring it as
two (or possibly more) systems. Each system should be large enough to
obtain economies of scale in data storage and to support the data
management software. Important data bases can be duplicated, with a
copy at each datacomputer; if one datacomputer fails, or is cut off by
network failure, the data is still available. Even if duplicating the
file is not warranted, the description can be kept at the different
datacomputers so that applications which need to store data constantly
can be guaranteed that at least one datacomputer is available to receive
These kinds of failure protection involve cooperation between a pair of
datacomputers; in some sense, they require that the two datacomputers
function as a single system. Given a system of datacomputers (which one
can think of as a small network of datacomputers), it is obviously
possible to experiment with providing additional services on the
datacomputer-network level. For example, all requests could be
addressed simply to the datacomputer-network; the datacomputer-network
could then determine where each referenced file was stored (i.e., which
datacomputer), and how best to satisfy the request.
Here, two kinds of cooperation in the network environment have been
mentioned: cooperation among processes to solve a given problem, and
cooperation among datacomputers to provide global optimizations in the
network-level data handling problem. These are only two examples,
especially interesting because they can be implemented in the near term.
In the network, much more general kinds of cooperation are possible, if
a little farther in the future. For example, eventually, one might want
the datacomputer(s) to be part of a network-wide data management system,
in which data, directories, services, and hardware were generally
distributed about the network. The entire system could function as a
whole under the right circumstances. Most requests would use the data
and services of only a few nodes. Within this network-wide system,
there would be more than one data management system, but all systems
would be interfaced through a common language. Because the
datacomputers represent the largest data management resource in the
network, they would certainly play an important role in any network-wide
system. The language of the datacomputer (datalanguage) is certainly a
convenient choice for the common language of such a system.
Thus a final, albeit futuristic, requirement imposed by the network on
the design of the datacomputer system, is that it be a suitable major
component for network-wide data management systems. If feasible, one
would like datalanguage to be a suitable candidate for the common
language of a network-wide group of cooperating data management systems.
2.4 Different Modes of Datacomputer Usage
Within this network environment, the datacomputer will play several
roles. In this section four such roles are described. Each of them
imposes constraints on the design of datalanguage. We can analyze them
in terms of four overlapping advantages which the datacomputer provides:
1. Generalized data management services
2. Large file handling
3. Shared access
4. Economic volume storage
Of course, the primary reason for using the datacomputer will be the
data management services which it provides. However, for some
applications size will be the dominating factor in that the datacomputer
will provide for online access to files which are so large that
previously only offline storage and processing were possible. The
ability to share data between different network sites with widely
different hardware is another feature provided only by the datacomputer.
Economies of scale make the datacomputer a viable substitute for tapes
in such applications as operating system backup.
Naturally, a combination of the above factors will be at work in most
datacomputer applications. The following subsections describe some
possible modes of interaction with the datacomputer.
2.4.1 Support of Large Shared Databases
This is the most significant application of the datacomputer, in nearly
Projects are already underway which will put databases of over one
hundred billion bits online on the Arpanet datacomputer. Among these
are a database which will ultimately include 10 years of weather
observations from 5000 weather stations located all over the world. As
online databases, these are unprecedented in size. They will be of
international interest and be shared by users operating on a wide
variety of hardware and in a wide variety of languages.
Because these databases are online in an international network, and
because they are expected to be of considerable interest to researchers
in the related fields, it seems obvious that there will be extremely
broad patterns of use. A strong requirement, then, is a flexible and
general approach to handling them. This requirement of providing
different users of a database with different views of the data is an
overriding concern of the datalanguage design effort. It is discussed
separately in Section 2.5.
2.4.2 Extensions of Local Data Management Systems
We imagine local data handling systems (data management systems,
applications-oriented packages, text-handling systems, etc.) wanting to
take advantage of the datacomputer. They may do so because of the
economics of storage, because of the data management services, or
because they want to take advantage of data already stored at the
datacomputer. In any case, such systems have some distinctive
properties as datacomputer users: (1) most would use local data as well
as datacomputer data, (2) many would be concerned with the translation
of local requests into datalanguage.
For example, a system which does simple data retrieval and statistical
analysis for non-programming social scientists might want to use a
census database stored at the datacomputer. Such a system may perform a
range of data retrieval functions, and may need sophisticated
interaction with the datacomputer. Its usage patterns would make quite
a contrast with those of a single application program whose sole use of
the datacomputer involves printing a specific report based on a single
This social-science system would also use some local databases, which it
keeps at its own site because they are small and more efficiently
accessed locally. One would like it to be convenient to think of data
the same way, whether it is stored locally or at the datacomputer.
Certainly at the lower levels of the local software, there will have to
be differences in interfacing; it would be nice, however, if local
concepts and operations could easily be translated into datalanguage.
2.4.3 File Level Use of the Datacomputer
In this mode of use, other computer systems take advantage of the online
storage capacity of the datacomputer. To these systems, datacomputer
storage represents a new class of storage: cheaper and safer than tape,
nearly as accessible as local disk. Perhaps they even automatically
move files between local online storage and the datacomputer, giving
users the impression that everything is stored locally online.
The distinctive feature of this mode of use is that the operations are
on whole files.
A system operating in this mode uses only the ability to store,
retrieve, append, rename, do directory listings and the like. An
obvious way to make such file level handling easily available to the
network community is to make use of the File Transfer Protocol (see
Network Information Center document #17759 -- File Transfer Protocol)
already in use for host to host file transfer.
Although such "whole file" usage of the datacomputer would be motivated
primarily by economic advantages of scale, data sharing at the file
level could also be a concern. For example, the source files of common
network software might reside at the datacomputer. These files have
little or no structure, but their common use dictates that they be
available in a common, always accessible place. Such use takes advantage
of the economics of the datacomputer, more than anything else, since
most of these services are available on any file system.
This mode of use is mentioned here because it may account for a large
percentage of datalanguage requests. It requires only capabilities
which would be present in datalanguage in any case; the only special
requirement is to make sure it is easy and simple to accomplish these
2.4.4 Use of Datacomputer for File Archiving
This is another economics-oriented application. The basic idea is to
store on the datacomputer everything that you intend to read rarely, if
ever. This could include backup files, audit trails, and the like.
An interesting idea related to archiving is incremental archiving. A
typical practice, with regard to backing up data stored online in a
time-sharing system, is to write out all the pages which differ from
their state at the last dump. It is then possible to recover by
restoring the last full dump, and then restoring all incremental dumps
up to the version desired. This system offers a lower cost for dumping
and storage, and a higher cost for recovery; it is appropriate when the
probability of needing a recovery is low. Datalanguage, then, should be
designed to permit convenient incremental archiving.
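
The dump and recovery procedure described above can be sketched as follows. This is a minimal illustration, not a proposed datalanguage facility: pages are modeled as a dictionary from page number to contents, the function names are hypothetical, and pages deleted between dumps are not handled.

```python
def full_dump(pages):
    """Copy every page; `pages` maps a page number to its contents."""
    return dict(pages)


def incremental_dump(pages, last_dumped):
    """Return only the pages that differ from their state at the last dump."""
    return {n: data for n, data in pages.items() if last_dumped.get(n) != data}


def recover(full, increments):
    """Restore the last full dump, then apply incremental dumps oldest first."""
    state = dict(full)
    for increment in increments:
        state.update(increment)
    return state


# A small online store that changes twice after the full dump.
disk = {0: "a", 1: "b", 2: "c"}
base = full_dump(disk)

disk[1] = "B"
inc1 = incremental_dump(disk, base)

disk[3] = "d"
inc2 = incremental_dump(disk, recover(base, [inc1]))
```

Each incremental dump holds only the changed pages, so dumping and storage are cheap, while `recover` must replay every increment up to the version desired, which is the higher recovery cost noted above.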
As in the case of the previous application (file system), archiving is
important as a design consideration because of its expected frequency
and economics, not because it necessarily requires any extra generality
at the language level. It may dictate that specialized mechanisms for
archiving be built into the system.
2.5 Data Sharing
Controlled sharing of data is a central concern of the project. Three
major sub-problems in data sharing are: (1) concurrent use, (2)
independent concepts of the same database, and (3) varying
representations of the same database.
Concurrent use of a resource by multiple independent processes is
commonly implemented for data on the file level in systems in which
files are regarded as disjoint, unrelated objects. It is sometimes
implemented on the page level.
Considerable work on this problem has already been done within the
datacomputer project. When this work is complete, it will have some
impact on the language design; by and large however, we do not consider
this aspect of concurrent use to be a language problem.
Other aspects of the concurrent use problem, however, may require more
conscious participation by the user. They relate to the semantics of
collections of data objects, when such collections span the boundaries
of files known to the internal operating system. Here the question of
what constitutes an update conflict is more complex. Related questions
arise in backup and recovery. If two files are related, then perhaps it
is meaningless to recover an earlier state of one without recovering the
corresponding state of the other. These problems are yet to be
Another problem in data sharing is that not all users of a database
should have the same concept of that database. Examples: (1) for
privacy reasons, some users should be aware of only part of the database
(e.g., scientists doing statistical studies on medical files do not need
access to name and address), (2) for program-data independence, payroll
programs should access only data of concern in writing paychecks, even
though skill inventories may be stored in the same database, (3) for
global control of efficiency, simplicity in application programming, and
program-data independence each application program should "see" a data
organization that is best for its job.
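
Example (1) can be illustrated with a small sketch in which each class of user reaches the stored record only through its own description. The `Description` class, the field names, and the sample record are all hypothetical; nothing here is datalanguage syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """A hypothetical description: the fields one class of user may see."""
    name: str
    visible_fields: tuple


def view(record, description):
    """Project a stored record onto the fields its description exposes."""
    return {field: record[field] for field in description.visible_fields}


# One stored medical record, and the description given to statisticians.
stored = {"name": "J. Doe", "address": "12 Elm St", "age": 47, "diagnosis": "flu"}
statistical = Description("statistical", ("age", "diagnosis"))

seen_by_statistician = view(stored, statistical)
```

The statistician's program never receives the name or address fields, because they are absent from the only description through which it can reach the data.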
To further analyze example (3), consider a database which contains
information about students, teachers, subjects and also indicates which
students have which teachers for which subjects. Depending on the
problem to be solved, an application program may have a strong
requirement for one of the following organizations:
(1) entries of the form (student,teacher,subject) with no concern about
redundancy. In this organization an object of any of the three
types may occur many times.
(2) entries of the form
(3) entries of the form
and other organizations are certainly possible.
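
To make the idea of alternative organizations concrete, the sketch below regroups flat (student,teacher,subject) entries under one of their components. The sample data and the `group_by` helper are hypothetical illustrations; the groupings shown are not meant to reproduce the specific organizations listed above.

```python
from collections import defaultdict

# Organization (1): flat entries, with redundancy allowed.
triples = [
    ("Ann", "Mr. Hill", "algebra"),
    ("Ann", "Ms. Reed", "history"),
    ("Bob", "Mr. Hill", "algebra"),
]


def group_by(entries, key_index):
    """Regroup flat entries under the component at position `key_index`."""
    grouped = defaultdict(list)
    for entry in entries:
        rest = tuple(v for i, v in enumerate(entry) if i != key_index)
        grouped[entry[key_index]].append(rest)
    return dict(grouped)


by_teacher = group_by(triples, 1)   # each teacher with (student, subject) pairs
by_student = group_by(triples, 0)   # each student with (teacher, subject) pairs
```

The same stored information supports both views; which one an application prefers depends only on the problem it is solving.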
One approach to this problem is to choose an organization for stored
data, and then have application programs write requests which organize
output in the form they want. The application programmer applies his
ingenuity in stating the request so that the process of reorganization
is combined with the process of retrieval, and the result is relatively
efficient. There are important, practical situations in which this
approach is adequate; in fact there are situations in which it is
desirable. In particular, if efficiency or cost is an overriding
consideration, it may be necessary for every application programmer to
be aware of all the data access and organization factors. This may be
the case for a massive file, in which each retrieval must be tuned to
the access strategy and organization; any other mode of operation would
result in unacceptable costs or response times.
However, dependence between application programs and data organization
or access strategy is not a good policy in general. In a widely-shared
database, it can mean enormous cost in the event of database
reorganization, changes to access software, or even changes in the
storage medium. Such a change may require reprogramming in hundreds of
application programs distributed throughout the network.
As a result, we see a need for a language which supports a spectrum of
operating modes, including: (1) application program is completely
independent of storage structure, access technique, and reorganization
strategy, (2) application program parametrically controls these, (3)
application program entirely controls them. For a widely-shared
database, mode (1) would be the preferred policy, except when (a) the
application programmer could do a better job than the system in making
decisions, and (b) the need for this increment of efficiency outweighed
the benefits of program-data independence.
In evaluating this question for a particular application, it is
important to realize the role of global efficiency analysis. When there
are many users of a database, in some sense the best mode of operation
is that which minimizes the total cost of processing all requests and
the total cost of storing the data. When applications come and go, as
real-world needs change, then the advantages of centralized control are
more likely to outweigh the advantages of optimization for a particular
The third major sub-problem arises in connection with item level
representations. Because of the environment in which it executes, each
application program has a preferred set of formatting concepts, length
indicators, padding and alignment conventions, word sizes, character
representations, and so on. Once again it is better policy for the
application program to be concerned only with the representations it
wants and not with the stored data representation. However, there will
be cases in which efficiency for a given request overrides all other
At this level of representation, there is at least one additional
consideration: potential loss of information when conversion takes
place. Whoever initiates a type conversion (and this will sometimes be
the datacomputer and sometimes the application program) must also be
responsible for seeing that the intent of the request is preserved.
Since the datacomputer must always be responsible for the consistency
and the meaning of a shared database, there are some conflicts to be
To summarize, it seems that the result of wide sharing of databases is
that a larger system must be considered in choosing a data management
policy for a particular database. This larger system, in the case of
the datacomputer, consists of a network of geographically distributed
applications programs, a centralized database, and a centralized data
management system. The requirement for datalanguage is to provide
flexibility in the management of this larger system. In particular, it
must be possible to control when and where conversions, data
reorganizations, and access strategies are made.
2.6 Need for High Level Communication
All of the above considerations point to the need for high level
communication between the datacomputer and its users. The complex and
distinct nature of datacomputer hardware makes it imperative that
requests be put to the datacomputer so that it can make major decisions
regarding the access strategies to be used. At the same time, the large
amounts of data stored and the demand of some users for extremely high
transmission bandwidths make it necessary to provide for user control of
some storage and transmission schemes. The fact that databases will be
used by applications which desire different views of the same data and
with different constraints means that the datacomputer must be capable
of mapping one user's request onto another user's data. Interprocess use
of the datacomputer means that datasharing must be completely
controllable to avoid the need for human intervention. Extensive
facilities for ensuring data integrity and controlling access must be
2.6.1 Data Description
Basic to all these needs is the requirement that the data stored at the
datacomputer be completely described in both functional and physical
parameters. A high level description of the data is especially
important for the sharing and control of data. The datacomputer
must be able to map between different hardware and different
applications. In its most trivial form this means being able to convert
between floating point number representations on different machines. At
the other extreme it means being able to provide matrix data for the
ILLIAC IV as well as being able to provide answers to queries from a
natural language program, both addressed to the same weather data base.
Data descriptions must provide the ability to specify the bit level
representations and the logical properties and relationships of data.
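
The floating point case can be sketched with two representations of different precision. As an assumption made purely for illustration, the two "machines" are modeled by IEEE 754 double and single precision, which are not formats named in this paper; the function names are likewise hypothetical.

```python
import struct


def to_single(value):
    """Convert a Python float (a double) to single precision and back."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


def converts_exactly(value):
    """True when narrowing to single precision preserves the value."""
    return to_single(value) == value


exact = converts_exactly(0.5)    # 0.5 is representable in both formats
lossy = converts_exactly(0.1)    # 0.1 is rounded differently in each format
```

A check of this kind is what allows whoever initiates a conversion to detect that information would be lost before the converted value is stored or transmitted.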
2.6.2 Data Integrity and Access Control
In the environment we have been describing, the problems of maintaining
data integrity and controlling use of data assume extreme importance.
Shared use of datacomputer files depends on the ability of the
datacomputer to guarantee that the restrictions on data-access are
strictly enforced. Since different users will have different
descriptions, the access control mechanism must be associated with the
descriptions themselves. One can control access to data by controlling
access to its various descriptors. A user can be constrained to access
a given data base only through one specific description which limits the
data he can access. In a system where the updaters of a database may be
unknown to each other, and possibly have different views of the data,
only the datacomputer can assure data integrity. For this reason, all
restrictions on possible values of data objects, and on possible or
necessary relationships between objects must be stated in the data
The decisions regarding data access strategy must ordinarily be made at
the datacomputer, where knowledge of the physical considerations is
available. These decisions cannot be made intelligently unless the
requests for data access are made at a high level.
For example, compare the following two situations: (1) a request calls
for output of _all_ weather observations made in California exhibiting
certain wind and pressure conditions, (2) a series of requests is sent,
each one retrieving California weather observations; when a request
finds an observation with the required wind and pressure conditions, it
transmits this observation to a remote system. Both sessions achieve
the same result: the transmission of a certain set of observations to a
remote site for processing. In the first session, however, the
datacomputer receives, at the outset, a description of the data that is
needed; in the second, it processes a series of requests, each one of
which is a surprise.
In the first case, a smart datacomputer has the option of retrieving all
of the needed data in one access to the mass storage device. It can
then buffer this data on disk until the user is ready to accept it. In
the second case, the datacomputer lacks the information it needs to make
such an optimization.
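
The difference between the two sessions can be sketched with a toy store that counts accesses to the mass storage device. The `MassStore` class and its methods are hypothetical, and charging exactly one device access per request is a simplifying assumption, not a measured property of any device.

```python
class MassStore:
    """Toy store that counts accesses to the (slow) mass storage device."""

    def __init__(self, observations):
        self._observations = list(observations)
        self.accesses = 0

    def select(self, predicate):
        """One high-level request: a single access retrieves every match."""
        self.accesses += 1
        return [o for o in self._observations if predicate(o)]

    def fetch(self, index):
        """One low-level request: each observation costs its own access."""
        self.accesses += 1
        return self._observations[index]

    def __len__(self):
        return len(self._observations)


def windy(observation):
    """Selection criteria: a California observation with high wind."""
    return observation["state"] == "CA" and observation["wind"] > 30


observations = [{"state": "CA", "wind": w} for w in (5, 40, 12, 55)]

# Session (1): the criteria are stated once, at the outset.
first = MassStore(observations)
batch = first.select(windy)

# Session (2): a series of requests, each one a surprise to the store.
second = MassStore(observations)
piecemeal = [o for o in (second.fetch(i) for i in range(len(second))) if windy(o)]
```

Both sessions deliver the same observations, but the first costs one device access and the second costs one per observation examined.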
The language should permit and encourage users to provide the
information needed to do optimization. The cost of not doing it is much
higher with mass storage devices and large files than it is in
2.7 Application Oriented Concerns
In the above sections we have described a number of features which the
datacomputer system must provide. In this section we focus on what is
necessary to make these features readily available to users of the
2.7.1 Datacomputer-user Interaction
An application interacts with the datacomputer in a _session_. A
session consists of a series of requests. Each session involves
connecting to the datacomputer via the network, establishing identities,
and setting up transmission paths for both data and datalanguage.
Datalanguage is transmitted in character mode (using network standard
ASCII) over the datalanguage connection. Error and status messages are
sent over this connection to the application program.
The data connection (called a PORT) is viewed as a bit stream and is
given its own description. These descriptions are similar to those given
for stored data. At a minimum this description must contain enough
information for the datacomputer to parse the incoming bit stream. It
may also contain data validation information. To store data at
the datacomputer, the stored data must also have a description. The
user supplies the mapping between the descriptions of the stored and
[Figure omitted: boxes labeled STORED DATA, DATA DESCRIPTION, PORT
DESCRIPTION and USER REQUEST inside the datacomputer, joined to the
APPLICATION PROGRAM by the DATALANGUAGE PATH and the DATA PATH.]
A Model of Datacomputer/User Interaction
2.7.2 Application Features for Data Sharing
In using data stored at the datacomputer, users may supply a description
of the data which is customized to the application. This description is
mapped onto the description of the stored data. These descriptions may
be at different levels. That is, one may merely rearrange the order of
certain items, while another could call for a total restructuring of the
stored representation. So that each user may be able to build upon the
descriptions of another, data entities should be given named types.
These type definitions are of course to be stored along with the data
they describe. In addition, certain functions are so closely tied to
the data (in fact may be the data in the virtual description case -- see
section 3), that they must also reside in the datacomputer and their tie
with the data items should be maintained by the datacomputer. For
example, one user can describe a data base as made up of structures
containing data of the types _latitude_ and _longitude_. He could also
describe functions for comparing data of this type. Other users, not
concerned with the structure of the _latitude_ component itself, but
interested in using this information simply to extract other fields of
interest can then use the commonly provided definitions and functions.
Furthermore, by adopting this strategy, as many users as possible can be
made insensitive to changes in the file which are tangential to their
main interests. For example, _latitudes_ could be changed from binary
representation to a character form and if use of that field were
restricted to its definitions and associated functions, existing
application systems would be unaffected. Conversion functions could be
defined to eliminate the impact on currently operating programs. The
availability of such definitional facilities means that groups of users
can
develop common functions and descriptions for dealing with shared data
and that conventions for use of shared data can be enforced by the
datacomputer. These facilities are discussed under _extensibility_ in
[Figure omitted: boxes labeled STORED DATA, STORED DATA DESCRIPTION,
DATA FUNCTIONS and APPLICATION DATA DESCRIPTIONS inside the
datacomputer, connected to APPLICATION PROGRAM boxes on HOST 1 and
HOST 2.]
Multiple User Interaction with the Datacomputer
2.7.3 Communication Model
We intend that datalanguage, while at a high level conceptually, will be
at a low level syntactically. Datalanguage provides a set of primitive
functions, and a set of commonly used higher level functions (see
section 4 on the datalanguage model). In addition, users can define
their own functions so that they can communicate with the datacomputer
at a level as conceptually close to the application as possible.
There are two reasons for datalanguage being at a low level
syntactically. First, it is undesirable to have programs composing
requests into an elaborate format only to be decomposed by the
datacomputer. Second, by choosing a specific high level syntax, the
datacomputer would be imposing a set of conventions and terminology
which would not necessarily correspond to those of most users.
[Figure omitted: in the DATACOMPUTER ENVIRONMENT, boxes labeled
PRIMITIVE LANGUAGE, HIGHER LEVEL LANGUAGE and LOW-LEVEL SYNTAX in
sequence; in the OUTSIDE ENVIRONMENT, a DMS, a COBOL SERVER with its
COBOL PROGRAM, and QUERY USERS, connected to the low-level syntax.]
Datacomputer/User Working Environment
In this section we have presented the major considerations which have
influenced the current datalanguage design effort. The datacomputer has
much in common with most large-scale shared data management systems, but
also has a number of overriding concerns unique to the datacomputer
concept. The most important of these are the existence of a separate
box containing both hardware and software, the control of an extremely
large storage device, and embedding in a computer network environment.
Data sharing in such an environment is a central concern of the design.
Both extensive data description facilities and high level communication
between user and datacomputer are necessary for data integrity and for
datacomputer optimization of user requests. In addition, the expected
use of the datacomputer involves satisfying several conflicting
constraints for different modes of operation. One way of satisfying
various user needs is to provide datalanguage features so that users may
develop their own application packages within datalanguage.
3. Principal Language Concepts
This section discusses the principal facilities of datalanguage.
Specific details of the language are not presented; however, the
discussion includes the motivation behind the inclusion of the various
language features and also defines, in an informal way, the terms we
3.1 Basic Data Items
Basic data are the atomic level of all data constructions; they cannot
be decomposed. All higher level data structures are fundamentally
composed of basic data items. Many types of basic data items will be
provided. The type of an item determines what operations can be
performed on the item and the meaning of those operations. Datalanguage
will provide those primitive types of data items which are commonly used
in computing systems to model the real world.
The following basic types of data will be available in datalanguage:
_fixed_point_numbers_, _floating_point_numbers_, _characters_,
_booleans_, and _bits_. These types of items are "understood" by the
datacomputer system to the extent that operations are based on the type
of an item. Datalanguage will also include an _uninterpreted_ type of
item, for data which will only be moved (including transmitted) from one
place to another. This type of data will only be understood in the
trivial sense that the datacomputer can determine if two items of the
uninterpreted type are identical. Standard operations on the basic
types of items will be available. Operations will be included so that
the datacomputer user can describe a wide range of data management
functions. They are not included with the intent of encouraging use of
the datacomputer for the solving of highly computational problems.
3.2 Data Aggregates
Data aggregates are compositions of basic data items and possibly other
data aggregates. The types of data aggregates which are provided allow
for the construction of hierarchical relationships of data. The
aggregates which will definitely be available are classified as
_structs_, _arrays_, _strings_, _lists_, and _directories_.
A struct is a static aggregate of data items (called _components_). A
struct is static in the sense that components cannot be added to or
deleted from the struct; they are inextricably bound to the
struct. Associated with each component of the struct is a name by which
that component may be referenced relative to the struct. The struct
aggregate may be used to model what is often thought of as a record,
with each component being a field of that record. A struct can also be
used to group components of a record which are more strongly related,
conceptually, than other components and may be operated on together.
Arrays allow for repetition in data structures. An array, like a
struct, is a static aggregate of data items (called _members_). Each
member of an array is of the same type. Associated with each member is
an index by which that member can be referenced relative to the array.
Arrays can be used to model repeating data in a record (repeating
The concept of string is actually a hybrid of basic data and data
aggregates. Strings are aggregates in that they are compositions
(similar to arrays) of more primitive data (e.g., characters). They are,
however, generally conceived of as basic in that they are mostly viewed
as a unit rather than as a collection of items, where each item has
individual importance. Also the meaning of a string is highly dependent
on the order of the individual components. In more concrete terms,
there are operations which are defined on specific types of strings.
For example, the logical operators (_and_, _or_, etc.) are defined to
operate on strings of bits. However, there are no operations which are
defined on arrays of bits, although there are operations defined on both
arrays, in general, and on bits. Strings of characters, bits, and
uninterpreted data will be available in datalanguage.
Lists are like arrays in that they are collections of similar members.
However, lists are dynamic rather than static. Members can be added to
and deleted from the list. Although the members of a list are
ordered (in fact more than one ordering can be defined on a list), the
list is not intended to be referenced via an index, as is the case with
an array. Members of a list can be referenced via some method of
sequencing through the list. A list member, or set (see discussion
under virtual data) of members, can also be referenced by some method
of identification by content. The list structure can be used to model
the common notion of a file. Also restrictive use of lists as
components of structs provides power with respect to the construction of
dynamic hierarchical data relationships below the file level. For
example, the members of a list may themselves be, in part, composed of
lists, as in a list of families, where each family contains a list of
children as well as other information.
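The list of families can be modeled as follows. This is a sketch in
Python rather than datalanguage; the class and field names (Family,
Child, surname, age) are assumptions chosen for illustration. Structs
appear as dataclasses with fixed, named components, and lists as Python
lists whose membership can change.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Child:
    # Struct: a fixed set of named components.
    name: str
    age: int


@dataclass
class Family:
    surname: str
    # A list used as a component of a struct: membership is dynamic.
    children: List[Child] = field(default_factory=list)


families: List[Family] = []
families.append(Family("Smith"))
families[0].children.append(Child("Ann", 7))
families[0].children.append(Child("Bob", 4))
families.append(Family("Jones", [Child("Cy", 9)]))

# Reference by content rather than by index:
# families having at least one child under five.
with_young = [f.surname for f in families
              if any(c.age < 5 for c in f.children)]

# Delete a member from the nested list.
families[0].children = [c for c in families[0].children
                        if c.name != "Ann"]
```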
Directories are dynamic data aggregates which may contain any type of
data item. Data items contained in a directory are called _nodes_.
Associated with each node of a directory is a name by which that data
item can be referenced relative to the directory. As with lists, items
may be dynamically added to and deleted from a directory. The primary
motivation behind providing the directory capability is to allow the
user to group conceptually related data together. Since directories
need not contain only file type information, "auxiliary" data can be
kept as part of the directory. For example, "constant" information,
like salary range tables for a corporation data base; or user defined
operations and data types (see below) can be maintained in a directory
along with the data which may use this information. Also directories
may themselves be part of a directory, allowing for a hierarchy of data
Directories will also be defined so that system controlled information
can be maintained with some of the subordinate items (e.g. time of
creation, time of update, privacy locks, etc.). It may also be possible
to allow the data user to define and control his own information which
would be maintained with the data. At the least, the design of
datalanguage will allow for parametric control over the information
managed by the system.
Directories are the most general and dynamic type of aggregate data.
Both the name and description (see below) of directory nodes exist with
the nodes themselves, rather than as part of the description of the
directory. Also the level of nesting of a directory is dynamic since
directories can be dynamically added to directories. Directories are
the only aggregate for which this is true.
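A rough model of this behavior, under the assumption that a directory
can be pictured as a mapping from node names to (description, item)
pairs. The Directory class, its method names, and the dotted path
syntax are hypothetical and are not part of datalanguage.

```python
class Directory:
    """Hypothetical directory: each node carries its own name and description."""

    def __init__(self):
        self._nodes = {}   # name -> (description, item)

    def add(self, name, description, item):
        if name in self._nodes:
            raise KeyError(f"node already exists: {name}")
        self._nodes[name] = (description, item)

    def delete(self, name):
        del self._nodes[name]

    def names(self):
        return sorted(self._nodes)

    def lookup(self, path):
        # Follow a dotted path (an assumed syntax) through nested directories.
        item = self
        for part in path.split("."):
            _description, item = item._nodes[part]
        return item


root = Directory()
corp = Directory()
root.add("corp", "directory", corp)          # nesting is added dynamically
corp.add("salary_ranges", "table", {"clerk": (100, 200)})
corp.add("employees", "list", [])

ranges = root.lookup("corp.salary_ranges")
names_before = corp.names()
corp.delete("employees")
names_after = corp.names()
```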
Datalanguage will also provide some specific and useful variations of
the above data aggregates. Structs will be available which allow for
optional components. In this case the existence of a component would be
based on the contents of other components. It may also be possible to
allow for the existence to be based on information found at a higher
level of data hierarchy. Similarly, components with _unresolved_ type
will be provided. That is, the component may be one of a fixed number of
types. The type of the component would be based on the contents of
other components of the struct. It is also desirable to allow the type
or existence of a component to be based on information other than the
contents of other components. For instance, the type of one component
might be based on the type of another component. In general, we would
like for datalanguage to allow for the attributes (see below) of one
item to be a function of the attributes of other items.
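One way to picture optional components and components of unresolved
type is sketched below. The Employee struct, its component names, and
the rule table PAY_TYPES are assumptions made for this example. The
type of one component and the existence of another are both decided by
the contents of sibling components.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical rule: the 'pay_kind' component resolves the type of 'pay'.
PAY_TYPES = {"salaried": int, "hourly": float}


@dataclass
class Employee:
    name: str
    married: bool
    pay_kind: str
    pay: Union[int, float]               # unresolved: one of a fixed set of types
    spouse_name: Optional[str] = None    # optional: exists only when married

    def __post_init__(self):
        expected = PAY_TYPES.get(self.pay_kind)
        if expected is None:
            raise ValueError(f"unknown pay_kind: {self.pay_kind!r}")
        if type(self.pay) is not expected:
            raise TypeError(f"pay must be {expected.__name__}")
        if self.married != (self.spouse_name is not None):
            raise ValueError("spouse_name must exist exactly when married")


def is_valid(**components) -> bool:
    """Return True when the components form a consistent Employee struct."""
    try:
        Employee(**components)
    except (TypeError, ValueError):
        return False
    return True
```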
We would also like to provide mixed lists. Mixed lists are lists which
contain more than one type of member. In this case the members would
have to be self-defining. That is, the types of all members would have
to be "alike" to the degree that information which defines the type of
each member could be found.
Similar to components whose type is unresolved are arrays with
unresolved length. In this case, information defining the length of the
array must be carried with the array or perhaps with other components of
an aggregate which encompasses the array.
In all of the above cases the type of an item is unresolved to some
degree and information which totally resolves the type is carried with
the item. It is possible that in some or perhaps all of these cases the
datacomputer system could be responsible for the maintenance of this
information, making it invisible to the data user.
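An array that carries its own length can be sketched with a count
prefix. The layout used here (a 16-bit count followed by 32-bit signed
integers, big-endian) is an assumption for the example, not a
datacomputer format. The functions hide the count from the caller,
much as the system might maintain it invisibly.

```python
import struct


def pack_array(values):
    """Store a 16-bit count ahead of a run of 32-bit signed integers."""
    return struct.pack(f">H{len(values)}i", len(values), *values)


def unpack_array(buf, offset=0):
    """Read one length-prefixed array; return (values, next_offset)."""
    (count,) = struct.unpack_from(">H", buf, offset)
    values = struct.unpack_from(f">{count}i", buf, offset + 2)
    return list(values), offset + 2 + 4 * count


# Two arrays of different lengths, stored back to back.
stream = pack_array([1, -2, 3]) + pack_array([7])
first, after_first = unpack_array(stream)
second, after_second = unpack_array(stream, after_first)
```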
3.3 General Relational Capabilities
The data aggregates described above allow for the modeling of various
relationships among data. All relationships which can be constructed
Two approaches can be taken to provide the capability of modeling
non-hierarchical relationships. New types of data aggregates can be
introduced which will broaden the range of data relationships
expressible in datalanguage. Or, a basic data type of "pointer" can be
introduced which will serve as a primitive out of which relations can be
represented. Pointer would be a data type which establishes some kind
of correspondence from one item to another. That is, it would be a
method of finding one item, given another. Providing the ability to
have items of type pointer does not necessitate the introduction of the
concept of address which we deem to be a dangerous step. For example,
an item defined to point to a record in a personnel file could contain a
social security number which is contained in each record of the file and
uniquely identifies that record. In general a pointer is an item of
information which can be used to uniquely identify another item.
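The personnel example can be sketched as follows. The record layout,
the field names, and the identifier values (deliberately fictitious)
are assumptions. The point is only that the pointer item holds
identifying content, not a storage address, so it stays valid when the
file is physically reorganized.

```python
# Personnel file: each record contains an identifier unique to that record.
personnel = [
    {"ssn": "000-00-0001", "name": "A. Jones", "dept": "Sales"},
    {"ssn": "000-00-0002", "name": "B. Smith", "dept": "Audit"},
]

# An item of type pointer: it stores the identifying value, not an address.
project = {"title": "Inventory", "leader": "000-00-0002"}


def dereference(pointer, records):
    """Find the record identified by the pointer's content."""
    for record in records:
        if record["ssn"] == pointer:
            return record
    raise LookupError(f"no record identified by {pointer!r}")


leader_before = dereference(project["leader"], personnel)["name"]

# Physically reorder the file; the pointer is unaffected.
personnel.reverse()
leader_after = dereference(project["leader"], personnel)["name"]
```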
While the pointer approach provides the greater degree of flexibility,
it does this at the price of relegating much of the work to the user as
well as severely limiting the amount of control the datacomputer system
has over the data. A hybrid solution is possible, where some new
aggregate data types are provided as well as a restricted form of
pointer data type. While the approach to be taken is still being
studied, the datalanguage design will include some method of expressing
non-hierarchical data structures.
3.4 Ordering of Data
Lists are generally viewed as ordered. It is possible, however, that a
list can be used to model a dynamic collection of similar items which
are not seen as ordered. The unordered case is important in that,
given this information, the datacomputer can be more efficient since new
members can be added wherever it is convenient.
There are a number of ways a list can be ordered. For instance, the
ordering of a list can be based on the contents of its members. In the
simplest case this involves the contents of a basic data item. For
example, a list of structs containing information on employees of a
company may be ordered on the component which contains the employee's
social security number. More complex ordering criteria are possible.
For example, the same list could be ordered alphabetically with respect
to the employee's name. In this case the ordering relation is a
function of two items, the last and first names. The user might also
want to define his own ordering scheme, even for orderings based on
basic data items. An ordering could be based on an employee's job title
which might even utilize auxiliary data (i.e. data external to the
list). It is also possible to maintain a list in order of insertion.
In the most general case, the user could dynamically define his ordering
by specification of where an item is to be placed as part of his
insertion requests. In all of the above cases, data could be maintained
in ascending or descending order.
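The ordering criteria above can be illustrated with ordinary sort keys.
The employee records and the title_rank table are assumptions for the
example; title_rank stands in for the auxiliary data external to the
list.

```python
from operator import itemgetter

employees = [
    {"ssn": "000-00-0003", "last": "Smith", "first": "Bob", "title": "clerk"},
    {"ssn": "000-00-0001", "last": "Smith", "first": "Ann", "title": "manager"},
    {"ssn": "000-00-0002", "last": "Adams", "first": "Cy", "title": "engineer"},
]

# Ordering on the contents of one basic data item.
by_ssn = sorted(employees, key=itemgetter("ssn"))

# Ordering as a function of two items: last name, then first name.
by_name = sorted(employees, key=itemgetter("last", "first"))

# User-defined ordering that consults auxiliary data outside the list.
title_rank = {"manager": 0, "engineer": 1, "clerk": 2}
by_title = sorted(employees, key=lambda e: title_rank[e["title"]])

# Any of these can be kept in descending order instead.
by_ssn_descending = sorted(employees, key=itemgetter("ssn"), reverse=True)
```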
In addition to maintenance of a list in some order, it is possible to
define one or more orderings "imposed" on a list. These orderings must
be based on the contents of a list's members. This situation is similar
to the concept of virtual data (see below) in that the list is not
physically maintained in a given order, but retrieved as if it were.
Orderings of this type can be dynamically formed (see discussion of set
under virtual data). Imposed orderings can be accomplished via the
maintenance of auxiliary structures (see discussion under internal
representation) or by utilization of a sorting strategy on retrievals.
Much work has been done with regard to effective implementation of the
maintenance and imposition of orderings on lists. This work is
described in working paper number 2.
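An imposed ordering maintained through an auxiliary structure might
look like the sketch below. The IndexedList class and its method names
are hypothetical. The members stay in order of insertion, and a
separate sorted index lets them be retrieved as if the list were kept
in key order.

```python
import bisect


class IndexedList:
    """Hypothetical list with one imposed ordering kept in an auxiliary index."""

    def __init__(self, key):
        self._key = key
        self.members = []    # physical order: order of insertion
        self._index = []     # auxiliary structure: sorted (key, position) pairs

    def insert(self, member):
        position = len(self.members)
        self.members.append(member)
        # Positions are unique, so the members themselves are never compared.
        bisect.insort(self._index, (self._key(member), position))

    def in_imposed_order(self):
        return [self.members[position] for _key, position in self._index]


staff = IndexedList(key=lambda e: e["last"])
for last in ["Smith", "Adams", "Jones"]:
    staff.insert({"last": last})

physical = [e["last"] for e in staff.members]
imposed = [e["last"] for e in staff.in_imposed_order()]
```

The alternative mentioned in the text, sorting on retrieval, would drop
the index and call sorted() inside in_imposed_order() instead.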
3.5 Data Integrity
An important feature of any data management system is the ability to
have the system insure the integrity of the data. Data needs to be
protected against erroneous manipulation by people and against system
Datalanguage will provide automatic validity checks. Many flavors need
to be provided so that appropriate trade-offs can be made between the
degree of insurance and the cost of validation. The datalanguage user
will be able to request constant validation: where validity checks are
made whenever the data is updated; validation on access: where validity
checks are performed when data is referenced but before it is retrieved;
regularly scheduled validation: where the data is checked at regular
intervals; background validation: where the system will run checks in
its spare time; and validation on demand. Constant validation and
validation on access are actually special cases of the more general
concept of event triggered validation. In this case the user specifies
an event which will cause data validation procedures to be invoked. This
feature can be used to accomplish such things as validation following a
"batch" of updates. Also, some mechanism for specifying combinations of
these types would be useful.
In order for some of the data validation techniques to be effective, it
may be necessary to keep some data validation "bookkeeping" information
with the data. For example, information which can be used to determine
whether an item has been checked since it was last updated might be used
to cause validation on access if there has not been a recent background
validation. The datacomputer may provide for optional automatic
maintenance of such special kinds of information.
In order for the datacomputer system to insure data validity, the user
must define what valid is. Two types of validation can be requested. In
the first case the user can tell the datacomputer that a specific data
item may only assume one of a specific set of values. For example, the
color component of a struct may only assume the values 'red', 'green',
or 'blue'. The other case is where some relation must hold between
members of an aggregate. For example, if the sex component of a struct
is 'male' then the number of pregnancies component must be 0.
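The two kinds of validity definition can be written as predicates over
a record. The component names and the rule list are assumptions for
illustration. One rule restricts a single item to a set of values; the
other states a relation that must hold between components.

```python
ALLOWED_COLORS = {"red", "green", "blue"}


def color_in_allowed_set(record):
    # A single item may assume only one of a specific set of values.
    return record["color"] in ALLOWED_COLORS


def pregnancies_consistent_with_sex(record):
    # A relation that must hold between members of the aggregate.
    return record["sex"] != "male" or record["pregnancies"] == 0


RULES = [color_in_allowed_set, pregnancies_consistent_with_sex]


def violations(record):
    """Return the names of the rules the record fails."""
    return [rule.__name__ for rule in RULES if not rule(record)]


good = {"color": "red", "sex": "female", "pregnancies": 2}
bad = {"color": "mauve", "sex": "male", "pregnancies": 1}
```

The same violations() check could be invoked on update, on access, on a
schedule, or on demand, depending on the validation mode requested.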
Data validation is only half of the data integrity picture. Data
integrity also involves methods of restoring damaged data. This requires
maintenance of redundant information. Features will be provided which
will make the datacomputer system responsible for the maintenance of
redundant data and possibly even automatic restoration of damaged data.
In section 2 we discussed possible uses of the datacomputer for file
backup. All features which are provided for this purpose will also be
available as methods of maintaining backup information for restoration
of files residing at the datacomputer.
Datalanguage will have to provide extensive privacy and protection
capabilities. In its simplest form a privacy lock is provided at the
file level. The lock is opened with a password key. Associated with
this key is a set of privileges (reading, updating, etc.). Two degrees
of generality are sought. Privacy should be available at all levels of
data. Therefore, groups of related data, including groups of files
could be made private by creating private directories. Also, specific
fields of records could be made private by having private components of
a struct where other components of the struct are visible to a wider (or
different) class of users. We would also like the user to be able to
define his own mechanism. In this way, very personalized, complex, and
hence secure mechanisms can be defined. Also features such as 'everyone
can see his own salary' might be possible.
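The simplest form described above, a file-level lock opened by a
password key with an associated set of privileges, can be sketched as
follows. The PrivateFile class, the privilege names, and the passwords
are assumptions. A real system would not hold passwords in plain text
as this toy model does.

```python
class PrivateFile:
    """Hypothetical file whose lock is opened by password keys."""

    def __init__(self, records):
        self._records = list(records)
        self._keys = {}   # password key -> set of privileges

    def add_key(self, password, privileges):
        self._keys[password] = frozenset(privileges)

    def _require(self, password, privilege):
        if privilege not in self._keys.get(password, frozenset()):
            raise PermissionError(f"key does not grant {privilege!r}")

    def read(self, password):
        self._require(password, "read")
        return list(self._records)

    def append(self, password, record):
        self._require(password, "update")
        self._records.append(record)


payroll = PrivateFile([{"name": "Ann", "salary": 100}])
payroll.add_key("reader-key", {"read"})
payroll.add_key("clerk-key", {"read", "update"})

payroll.append("clerk-key", {"name": "Bob", "salary": 90})
visible = payroll.read("reader-key")

try:
    payroll.append("reader-key", {"name": "Eve", "salary": 1})
    reader_could_update = True
except PermissionError:
    reader_could_update = False
```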
Many types of data are related in that some or all of the possible
values of one type of data have an "obvious" translation to the values
of another. For example, the character '6' has a natural translation to
the integer 6, or the six character string 'abc   ' (three trailing
blanks) has a natural translation to the four character string 'abc '
(one trailing blank). Datalanguage will provide conversion capabilities
for the standard, commonly called for, translations. These conversions
can be explicitly invoked by the user or implicitly invoked when data of
one type is needed for an operation but data of another type is
provided. In the case of implicit invocation of conversion of data the
user will have control over whether conversion takes place for a given
data item. More generally we would like to provide a facility whereby
the user could specify conditions which determine when an item is to be
converted. Also, the user should be able to define his own conversion
operations, either for a conversion between types which is not provided
by the datacomputer system or to override the standard conversion
operation for some or all items of a given type.
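The two "obvious" translations, and a user override of a standard
conversion, can be sketched as below. The type labels ("char",
"integer"), the STANDARD table, and the convert() signature are
assumptions made for the example.

```python
def char_to_integer(ch):
    """Standard conversion: the character '6' becomes the integer 6."""
    if len(ch) != 1 or ch not in "0123456789":
        raise ValueError(f"no natural integer for {ch!r}")
    return int(ch)


def resize(text, width):
    """Translate between fixed-length strings by adjusting trailing blanks."""
    core = text.rstrip(" ")
    if len(core) > width:
        raise ValueError("conversion would lose non-blank characters")
    return core.ljust(width)


STANDARD = {("char", "integer"): char_to_integer}


def convert(value, source, target, overrides=None):
    # A user-supplied conversion takes precedence over the standard one.
    table = {**STANDARD, **(overrides or {})}
    return table[(source, target)](value)


six = convert("6", "char", "integer")
four_chars = resize("abc   ", 4)

# User-defined override: read the character as a hexadecimal digit.
hex_rule = {("char", "integer"): lambda ch: int(ch, 16)}
ten = convert("a", "char", "integer", overrides=hex_rule)
```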
3.8 Virtual and Derived Data
Often, information important to users of data is embedded in that data
rather than explicitly maintained. For example, consider the dollar
value of an individual's interest in a company, in a file of
stockholders. Since the value of the company changes frequently, it is
not feasible to
maintain this information with each record. It is useful to be able to
use the file as if information of this type were part of each record.
When referencing the dollar value field of a record, the datacomputer
system would automatically use information in the record, such as
percentage of ownership in the company, possibly in conjunction with
information which is not part of the record but is maintained elsewhere,
such as company assets, to compute the dollar value. In this way the
data user need not be concerned with the fact that this information is
not actually maintained in the record.
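The stockholder example maps naturally onto a computed attribute. The
class and attribute names below are assumptions. The dollar value is
never stored: it is calculated on reference from the record's
percentage of ownership and from company assets held elsewhere.

```python
class Company:
    def __init__(self, assets):
        self.assets = assets   # maintained outside the stockholder records


class StockholderRecord:
    def __init__(self, name, percent_owned, company):
        self.name = name
        self.percent_owned = percent_owned
        self.company = company

    @property
    def dollar_value(self):
        # Virtual field: computed whenever it is referenced.
        return self.company.assets * self.percent_owned / 100


company = Company(assets=1_000_000)
holder = StockholderRecord("A. Jones", 2.5, company)

value_before = holder.dollar_value
company.assets = 2_000_000      # the company's value changes
value_after = holder.dollar_value
```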
The _set_, which is a specific type of virtual container in
datalanguage, deserves special mention. A set is a virtual list. For
example, suppose there is a real list of people representing some
population sample. By real (or actual) data we mean data which is
physically stored at the datacomputer. A set could be defined to
contain all members of this list who are automobile owners. The set
concept provides a powerful feature for viewing data as belonging to
more than one collection without physical duplication. Sets are also
useful in that they can be dynamically formed. Given an actual list,
sets based on that list can be created without having been previously
As mentioned above, virtual data can be very economical. These
economies may become most important with respect to the use of sets.
Savings are found not only in regard to storage requirements, but also
in regard to processing efficiency. Processing time can be reduced as a
result of calculations being performed only when the data is accessed.
The ability to obtain efficient operation by optimization becomes
greater when virtual data is defined in terms of other virtual data.
For sets, large savings may be realized by straightforward
"optimization" of the nested calculations.
The above ideas are made more clear by example. Having created a set of
automobile owners, A, a set of home owners, HA, can be defined based on
A. The members of HA can be produced very efficiently, in one step, by
retrieving people who are both automobile owners and home owners. This
is more efficient than actually producing the set A and then using it
to create HA. This is true when one or both pieces of information
(automobile ownership and home ownership) are indexed (see discussion
under internal representation) as well as when neither is indexed.
The same gains are achieved when operations on virtual data are
requested. For example, if a set, H, had been defined as the set of
homeowners based on the original list of people, the set, HA, could have
been defined as the intersection (see discussion on operators) of A and
H. In this case too, HA can be calculated in one step. Use of sets
allows the user to request data manipulations in a form close to his
conceptual view, leaving the problem of effective processing of his
request to the datacomputer.
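The sets A, H and HA can be modeled as virtual lists that hold only a
reference to the real list plus the conditions for membership. The
VirtualSet class and its method names are hypothetical. Nesting or
intersecting sets merely combines conditions, so the members of HA are
produced in a single pass over the real list and A is never
materialized.

```python
class VirtualSet:
    """Hypothetical virtual list: a real list plus membership conditions."""

    def __init__(self, base, predicates=()):
        self._base = base
        self._predicates = tuple(predicates)

    def restrict(self, predicate):
        # A set defined in terms of another set just adds a condition.
        return VirtualSet(self._base, self._predicates + (predicate,))

    def intersect(self, other):
        if other._base is not self._base:
            raise ValueError("sets are based on different lists")
        return VirtualSet(self._base, self._predicates + other._predicates)

    def __iter__(self):
        # One pass over the real list, testing all conditions together.
        for member in self._base:
            if all(p(member) for p in self._predicates):
                yield member


people = [
    {"name": "Ann", "car": True, "home": True},
    {"name": "Bob", "car": True, "home": False},
    {"name": "Cy", "car": False, "home": True},
]

A = VirtualSet(people).restrict(lambda p: p["car"])
H = VirtualSet(people).restrict(lambda p: p["home"])
HA_nested = A.restrict(lambda p: p["home"])
HA_intersection = A.intersect(H)

names_nested = [p["name"] for p in HA_nested]
names_intersection = [p["name"] for p in HA_intersection]

# The sets are virtual, so a later insertion into the real list shows up.
people.append({"name": "Di", "car": True, "home": True})
names_later = [p["name"] for p in HA_nested]
```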
Another use of virtual data is to accomplish data sharing. An item
could be defined, virtually, as the contents of another item. If no
restriction is placed on what this item can be, we have the ability to
define two paths of access to the same data. Hence, data can be made
subordinate to two or more aggregate structures. Stated another way,
there are two or more paths of access to the data. This capability can
be used to model data which is part of more than one data relationship.
For example, two files could have the same records without maintaining
It will also be possible, via data sharing to look at data in different
ways. Shared data might behave differently depending on how (and
ultimately by whom) it is accessed. Although the ability to have
multiple paths to the same data and the ability to have data which is
calculated on access are both part of the general virtual data
capability, datalanguage will probably provide these as separate
features, since they have different usage characteristics.
Derived data is similar to virtual data in that it is redundant data
which can be calculated from other information. Unlike virtual data it