The INTERNET -
Indigestible data flow or all-embracing library
1. Introduction
2. History of the Internet
   2.1 The idea
   2.2 The ARPANET
   2.3 The Internet grows
   2.4 The present situation and a further outlook
3. The technical structure of the Internet
   3.1 The physical structure
   3.2 The logical structure
   3.3 The Domain Name System (DNS)
4. The Internet - Indigestible data flow or all-embracing library
   4.1 Too much data but no information
   4.2 Simple search strategy, based on search services
   4.3 Case study: The status of the Queen in the Canadian constitution
   4.4 Case study: Output of renewable energy sources in the USA
   4.5 Conclusion
   4.6 Advanced search strategy, without search services
5. An outlook into the future
Appendix
A. Used literature and Internet sites
B. Increase of Internet hosts
1. Introduction
Every important new technology has ushered in a new era: fire, the alphabet, the discovery of gunpowder, the steam engine, electricity. With every new era came a change in society. The last great change was brought about by electricity and the steam engine, which led humanity into the industrial revolution. People moved into the cities to find jobs in the new, growing industries. Their social behaviour changed, since they now lived in a huge, anonymous city and no longer in a small town.
But the eighties and nineties of our century show that the industrial era is coming to an end. A new technology, comparable in significance to industrialisation, is taking control of mankind's destiny. The information age has just begun.
This change goes hand in hand with the digitisation of information. Digitisation means that every kind of information (text, audio or video signals) can be expressed in bits and bytes, so that a computer is able to display, store or manipulate it. Nowadays, computers are already capable of replacing TV, radio, video, newspapers or books.
Because of their arithmetical skill, computers are employed in nearly every office to process the enormous amount of incoming data. Just imagine a stock of thousands of tools and a craftsman who is looking for one particular tool: how could he find it without a computer? For exactly the same reason, computers are used in science. How many people would be necessary to calculate the exact trajectory of a spaceship? People work with computers even in their leisure time, however. Games, formerly played only as board versions, are more and more being converted into interactive multimedia spectacles. The number of computer players grows constantly and rapidly.
Scientists argue that the reason for human predominance in nature lies in communication. Communication means the exchange of information. Communication enables the creation of a social order, which is a condition for the peaceful coexistence of all individuals. Since technical progress has made computer communication possible, the usefulness of computers seems to have multiplied. LANs (Local Area Networks) link many PCs (Personal Computers), so employees of a company can share documents, information sources and different programs. WANs (Wide Area Networks) connect branch offices on a continent with each other. But there is one network which is the most important of all and which ties millions of computers anywhere in the world together: the INTERNET.
The Internet is a GAN (Global Area Network), which means that a computer anywhere in the world can go online (become part of the Internet). This computer only needs a telephone connection. Entering the Internet opens a door to a world of its own: cyberspace.
The Internet offers a lot of possibilities:
· Searching for and downloading information
· Sending and receiving e-mails (electronic mail)
· Participating in discussion groups
· Remote computing (remote control of distant computers)
· Chatting with other Cyberians (citizens of cyberspace)
· etc.
In this document, I will have a closer look at the first point: searching for and downloading information.
First, I'll sum up the history of the Internet. Then I'll describe how the Internet works. After that, I'll try to develop a search strategy, test its success with the help of two case studies and draw a conclusion. At the end, I'll dare to give a brief outlook into the future.
2. History of the Internet
2.1 The idea
At the end of the fifties, the Department of Defence sought a way to ensure communication between military bases and cities after a nuclear attack. But neither any cable nor any computer would be able to withstand the power of nuclear bombs. And if there was a central authority controlling the network, this authority would probably be one of the first targets to be bombed. The RAND Corporation, a research institution working for the Department of Defence, published a solution in 1964: a decentralised network. In such a network, information is not transmitted directly from sender to recipient (as in a telephone connection). Since a network consists of many computers, the whole network can be subdivided into many nodes; each computer participating in the network forms such a node. In a decentralised network, information is passed from node to node until it reaches its destination. If one node is destroyed, other nodes still remain to transfer the information. This scheme is called "dynamic rerouting".
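To illustrate the principle, the following minimal Python sketch (my own illustration; the node names and the ring-shaped layout are assumptions, not the historical topology) searches for a route among whichever nodes are still alive, so the loss of a single node does not cut the connection:

    from collections import deque

    def find_route(network, source, destination):
        """Breadth-first search for any chain of nodes leading from source
        to destination; dynamic rerouting amounts to repeating this search
        over whatever nodes are still alive."""
        queue = deque([[source]])
        seen = {source}
        while queue:
            path = queue.popleft()
            if path[-1] == destination:
                return path
            for neighbour in network.get(path[-1], []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(path + [neighbour])
        return None  # destination unreachable

    # Four nodes in a ring, each linked to its two neighbours (assumed layout).
    network = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
    print(find_route(network, "A", "C"))    # ['A', 'B', 'C']

    # Node B is destroyed: a route still exists over the surviving nodes.
    survivors = {n: [m for m in ns if m != "B"]
                 for n, ns in network.items() if n != "B"}
    print(find_route(survivors, "A", "C"))  # ['A', 'D', 'C']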
2.2 The ARPANET
In the sixties, this draft of a network was tested by several American universities (namely the Massachusetts Institute of Technology (MIT) and the University of California, Los Angeles (UCLA)). In 1968, another agency of the Department of Defence, the Advanced Research Projects Agency (ARPA), developed the first decentralised network and was in charge of it. High-speed computers formed the nodes of that network. In the autumn of 1969, a node was installed at UCLA. By the end of that year, a network called the ARPANET had come into existence, consisting of four nodes. One of these nodes could be operated by another node via remote control, meaning that a user at any node was able to control a computer which could be at the other end of the continent. This was of high value, since computer time was quite precious and expensive in those days. In 1971, the network was made up of 15 nodes; in 1972, 37 nodes already formed the ARPANET. Soon the system was extended to transmit files and news via e-mail (electronic mail). At first, only military personnel and military scientists had access to the network, but this restriction was soon given up. The first two years had shown that the ARPANET was used not mainly for remote control but for information exchange.
2.3 The Internet grows
The ARPANET grew very fast because of its decentralised architecture. Computers running any OS (operating system, e.g. MacOS, MS-DOS, Windows or UNIX) were able to join the network. A computer only had to use the "Network Control Protocol" (NCP), which was replaced in 1982 by the current standard, the "Transmission Control Protocol" (TCP/IP, where IP is the abbreviation for Internet Protocol). In 1973, the first international ARPANET connections were made, to Great Britain and Norway. In 1983, the Milnet (Military Network) split off from the ARPANET, but communication between the two networks remained possible, since the connection between them was kept. This connection was called the DARPA Internet, or more simply: the Internet. One year later, the ARPANET consisted of 1,000 nodes (from then on called hosts, because these computers host the information). In 1986, the National Science Foundation Network (NSFNET) was founded, which connected the different networks in the USA via five high-performance computers (backbones). This new network connected the ARPANET and several other networks (CSNET (Computer and Science Network), Usenet (Unix network), NASA, Milnet).
2.4 The present situation and a further outlook
The Internet is spreading faster than the telephone or the fax machine did. It is the best example of a functioning anarchy, since there is no central authority in control. Users gain access to many offerings concerning business, recreation, entertainment or hobbies.
Right now, about 60,000,000 people are using the Internet, and their number grows every day. According to an estimate by the "Network Wizards" [3], there will be 500,000,000 people online in the year 2000. That would represent 8.37 per cent of the world's population. (For recent Internet growth, see appendix B.)
But recent development has shown that the Internet is becoming more and more commercialised. In 1993, 1.5 per cent of all web sites (or pages) served commercial purposes. Three years later, 50 per cent of all web sites were commercial ones. This development reached its peak in June 1996, when 68 per cent of all web sites were offered by companies. The latest figures, from January 1997, show that the share of .COM sites (whose names end in ".COM") has fallen again, to 62.6 per cent [4].
With regard to scientific and commercial use, governments and local Internet providers (central nodes to which many computers are connected, like "RZ-ONLINE") have realised the importance of the Internet. Therefore the Internet infrastructure is being improved, and will continue to be improved with great effort in the next years. The US government has set itself the goal of an "Information Superhighway": every user should be able to look something up in any American library or have access to all public information. In Europe, the ISDN (Integrated Services Digital Network) standard was introduced in 1994. ISDN offers two telephone channels for data or voice and is nowadays cheaply available almost everywhere.
Besides, the software has improved. In the beginning, the browsers (software to navigate through the Internet) only showed columns of characters and weren't really comfortable to use. These days, browsers offer services to display text, music and video animation. Now even a novice can "surf online" using the easy interfaces.
3. The technical structure of the Internet
3.1 The physical structure
A complex hierarchical structure is inherent in the Internet, since this network connects millions of computers. On the highest level, computer centres are linked via satellites, leased telephone lines or (more modern) fibre-optic cables. These nodal points are called "backbones". They can exchange information at high speed (from 64 kbit/s to 622 Mbit/s). This high speed is necessary with regard to the amount of information a single backbone has to transmit. The second level contains the Internet providers. Internet providers are companies which offer their online services to clients while handling the clients' transmissions over the facilities of the level above. Clients of these providers can be other providers, companies or consumers. Internet providers are connected to their own providers (backbones or bigger providers) via leased telephone lines (at speeds of 128 kbit/s to 2 Mbit/s). On the lowest level, there is the end consumer, who pays his provider for being online. These users of the Internet use ISDN or normal telephone lines and modems (64 kbit/s) to establish a connection to their provider.
3.2 The logical structure
The logical structure of the Internet has to perform several tasks:
· Enable communication and transmission
· Connect numerous networks (even with different operating systems)
· Assign an individual address to every computer that is online
Many rules regulate these tasks. These rules are called "protocols".
The basic protocol is the Internet Protocol (IP). It exchanges information via a packet-orientated, indirect and unguaranteed transmission. Packet-orientated means that the whole information is divided into several pieces (packets). Each of these packets gets a number according to its position in the sent message, so the receiving computer can reassemble the separate packets into the original information. Indirect means that there is no direct connection between sender and recipient (as in a telephone connection). First, every packet is sent to the highest physical level. There it is transmitted from backbone to backbone until it reaches the backbone of the recipient. This backbone sends the information to the provider, and the provider forwards it to the recipient. This principle of searching for a transmission path is called "routing". Since every packet is routed anew, packets of the same transmission can take different routes. For this reason, packets can be received in a different order from the one in which they were sent. Unguaranteed means that there is no check supervising the correctness of the received information.
Another very important protocol is the Transmission Control Protocol (TCP). The only difference to the IP is inherent in the name: the transmissions of this protocol are controlled (guaranteed). TCP offers a verification of whether the received information is correct or not. If some packets contain errors, the recipient demands these packets again until a correct copy is transmitted.
The combination of these two protocols represents the standard protocol of the Internet (called TCP/IP) and works on every operating system (MacOS, Windows, UNIX, ...).
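The following minimal Python sketch illustrates both ideas at the level of a single message (an illustration only: the function names, the CRC32 checksum and the packet layout are my own assumptions, not the real protocols' formats). The message is split into numbered packets, the "network" shuffles and corrupts them, and the receiver demands damaged packets again before reassembling the original text:

    import random
    import zlib

    def make_packets(message, size=8):
        """Split a message into numbered packets (IP is packet-orientated);
        each packet carries a CRC32 checksum, standing in for the
        verification that TCP adds on top of IP."""
        packets = []
        for seq, start in enumerate(range(0, len(message), size)):
            data = message[start:start + size]
            packets.append((seq, data, zlib.crc32(data.encode())))
        return packets

    def deliver(packets):
        """The network routes every packet anew, so packets may arrive in a
        different order; here one packet is also corrupted in transit."""
        received = [list(p) for p in packets]
        random.shuffle(received)
        received[0][1] = "#garbled#"        # simulate a transmission error
        return received

    def reassemble(received, retransmit):
        """TCP-style receiver: packets failing the checksum are demanded
        once more; the sequence numbers restore the original order."""
        good = {}
        for seq, data, checksum in received:
            if zlib.crc32(data.encode()) != checksum:
                seq, data, checksum = retransmit(seq)  # demand a fresh copy
            good[seq] = data
        return "".join(good[s] for s in sorted(good))

    message = "The Internet ties millions of computers together."
    packets = make_packets(message)
    restored = reassemble(deliver(packets), retransmit=lambda s: packets[s])
    assert restored == message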
3.3 The Domain Name System (DNS)
But one problem still remains: every computer on the Internet needs an individual name so that it can be the unique destination of a message. In the beginning of the Internet, "hosts [..] were assigned names in a flat or global name space of character strings" [7] (like 'USC-ISIF'). These names were stored in a central list. By 1986, however, this central list contained about 3,600 entries and could not be extended any further. So a new hierarchical naming structure was introduced: the Domain Name System. A domain is a network which consists of subordinate computers or networks. If a domain is subordinate to another domain, the first one is called a "sub-domain". According to the DNS, a name or address consists of a user-id and the domain. The domain of the highest level (which is the last component of the address) is called the top-level domain. If there are sub-domains, they are listed according to their level between the user-id and the top-level domain. A name for a user-id or a sub-domain must be unique within the higher domain.
Example: johannes@abo.rhein-zeitung.de (our school's e-mail address)
johannes - identification of the user
de - top-level domain (de for Germany)
rhein-zeitung - first sub-domain
abo - second sub-domain
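To make the hierarchy concrete, here is a tiny Python sketch (an illustration of the naming scheme only, not part of the DNS itself) that splits the address above into its components:

    def parse_address(address):
        """Split an Internet address into user-id, top-level domain
        and sub-domains according to the hierarchical DNS scheme."""
        user_id, domain = address.split("@")
        labels = domain.split(".")
        return {
            "user-id": user_id,
            "top-level domain": labels[-1],
            "sub-domains": list(reversed(labels[:-1])),  # highest level first
        }

    print(parse_address("johannes@abo.rhein-zeitung.de"))
    # {'user-id': 'johannes', 'top-level domain': 'de',
    #  'sub-domains': ['rhein-zeitung', 'abo']}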
Addresses of homepages don't contain a user-id, since many users may be responsible for a web site. The top-level domain is an abbreviation which indicates a certain kind of organisation (three letters):
com - commercial organisation
org - homepage of an organisation
edu - educational institution
mil - military organisation
net - network organisation
gov - government
The top-level domain can also provide geographical information (two letters):
us - USA
de - Germany
uk - United Kingdom
fr - France
jp - Japan
ca - Canada
4. The Internet - Indigestible data flow or all-embracing library [8]
4.1 Too much data but no information
With regard to the incomparable growth of the Internet, it is becoming more and more difficult to find the answer to a given question. While searching for relevant information, one experiences the difference between the present situation of the Internet and the envisioned information highway. Even recognised search strategies often fail to find an accurate answer in an acceptable time. Every search strategy at first employs a search service. These free services are programs which look for the sought word in their databases; these databases contain millions of homepages and are permanently updated. There are two types of search services: search engines and directories. While employees add homepages to a directory's list (like YAHOO [9] - Yet Another Hierarchical Officious Oracle) and sort them according to their content, computer programs (called "spiders") maintain the database of a search engine (like Alta Vista). Most people try to get their information by entering just one expression into such a service. But as an inaccurate question leads to an inaccurate answer, they will get thousands or even millions of "relevant" sites. After having checked the first pages unsuccessfully, they will soon give up. A saying underlines this frequent experience: "The Internet provides data, but no information."
Is there any way to find the information one is looking for?
4.2 Simple search strategy, based on search services
The introductory example has shown the main problem with finding information on the Internet: the formulation of the question. To find suitable sites, the search service must be fed with striking, unambiguous keywords. Neither a search engine nor a directory is intelligent enough to understand the exact meaning of an imprecise keyword; for such a keyword it will return thousands of useless pages and only a few relevant ones, which will not be found in the data waste. First, the sought information must be characterised with a couple of associated expressions. The chosen keywords should be as clear as possible to avoid differing connotations.
Most services offer operators to link several keywords, so that the sought information can be described more precisely:
· AND, +
The resulting sites must contain all keywords linked with "AND".
· OR
The resulting sites must contain at least one of the keywords linked with "OR".
· NEAR
The resulting sites must contain the keywords linked with "NEAR". A maximum of eight words may separate the two keywords.
· NOT, -
Irrelevant subjects can be excluded by negation ("NOT").
Apart from these operators, wildcards (such as "?" or "*") are allowed. To find only the exactly matching word and no compound (e.g. only "net" and not "Internet"), keywords can be entered in quotation marks. Another help might be to search only within structured categories, e.g. a question about the Internet in "Computer and Internet: Internet" at Yahoo.
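As an illustration of how such operators could be evaluated, here is a toy Python sketch (the function names and the simple word index are my own simplifications; real search services maintain databases of millions of pages):

    import re

    def index(page_text):
        """Turn a page into a list of lower-case words."""
        return [w.lower() for w in re.findall(r"\w+", page_text)]

    def contains(words, keyword):
        """Exact-word match, as with a keyword in quotation marks:
        'net' matches 'net' but not 'Internet'."""
        return keyword.lower() in words

    def near(words, a, b, max_words=8):
        """NEAR: at most eight words may separate the two keywords."""
        pos_a = [i for i, w in enumerate(words) if w == a.lower()]
        pos_b = [i for i, w in enumerate(words) if w == b.lower()]
        return any(abs(i - j) - 1 <= max_words for i in pos_a for j in pos_b)

    page = index("The status of the Queen in the Canadian constitution")
    hit = (contains(page, "queen") and contains(page, "constitution")  # AND
           and not contains(page, "band"))                             # NOT
    print(hit, near(page, "queen", "constitution"))  # True True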
4.3 Case study: The status of the Queen in the Canadian constitution
To test this search strategy, I wanted to find out what the status of the Queen is in the Canadian constitution.
At first, I searched for 'Queen' at Yahoo. I found 9 sites, but none of them provided information on the question. So I searched more precisely for 'Queen +constitution' and found 19,936 sites. Since it would take many hours to check this enormous number of pages, I decided to narrow the keywords down to 'Queen +Canadian +constitution', which led to 5,459 hits. Having added '+status' to the search string, Yahoo still found 2,787 matching sites. So I had to choose another way to find information about this topic. Another possibility was to find the relevant passage in the Canadian constitution itself. So I searched for '"Canadian constitution"' (written in quotation marks) and found two sites. I visited the first one, an index of constitutions from all around the world, and found a link to the Canadian one [10]. In its table of contents, I searched for the keyword 'Queen' and found section 9 of the third chapter of the "Canadian Constitution Act" of 1867, saying: "The Executive Government and Authority of and over Canada is hereby declared to continue and be vested in the Queen."
The search took me 12 minutes.
4.4 Case study: Output of renewable energy sources in the USA
Since one test isn't enough, I ran another experiment. I wanted to know the total production of renewable energy sources in the USA in 1997.
I started by searching for '"electricity production"' at Yahoo. I got three sites, which didn't supply any relevant data. I thought there must be some statistics on this topic, so I searched for 'electricity +statistic' and found a category called "Government: U.S. Government: Statistics", containing all statistics drawn up for or by the U.S. Government or other administrations. I then visited the "Center for Environmental Statistics", where I found no survey on the topic. But the "Energy Information Administration" had a quite interesting statistic on the "U.S. Energy Flow" [11], according to which renewable sources produced 7.06 quadrillion Btu in 1996 (quadrillion in the U.S. sense, i.e. 7,060,000,000,000,000 Btu; Btu is the abbreviation for "British thermal unit"). The figure for 1996 was the latest one available.
This search took me 27 minutes.
4.5 Conclusion
The case studies (4.3 and 4.4) have shown that even complicated questions can be answered with the help of the Internet. This is remarkable considering the few requirements one needs to access the "World Wide Web" (the totality of all sites, also WWW): a computer, a modem, a telephone connection and an Internet provider are enough to get information at any time. A lot of answers and solutions are waiting to be found on the Internet; the only problem is to locate them. Searching the WWW with an imprecise, ambiguous keyword is like searching for a certain plant knowing only that it grows in the jungle. In order to find relevant information, one must know exactly what to look for. Besides that, a definite search strategy should be used to find the answers quickly.
To sum up in a few words: with the help of an efficient search strategy, the Internet provides information on nearly every topic. Without a search strategy, one can spend one's whole life searching for an answer without finding it.
The presented search strategy might not work for every question, but it works for many. To improve it, I have added an advanced search strategy (4.6). The two case studies illustrate the problem, but they are not representative of all searches. I think the Internet also leaves a lot of questions open. In my opinion, the Internet is more a fast, large library than an all-embracing one or an indigestible data flow.
4.6 Advanced search strategy, without search services
Nearly all search engines look for relevant information on the homepages of the Internet. But there are also other sources full of answers:
· Newsgroups
Newsgroups are forums where people e-mail their opinions on the newsgroup's topic. Their mails can be read by every participant in that forum. There are more than ten thousand newsgroups discussing many different topics ("Liszt - the mailing list directory" [12] helps to find a newsgroup on a certain topic). There are also services which search through all newsgroups for a keyword (like "Dejanews" [13]).
· FAQ archives
FAQ archives contain answers to Frequently Asked Questions (FAQ) on a certain topic. One can search for a particular archive at the Usenet Hypertext FAQ Archive Search [14].
· FTP directories
Files comprising information can be downloaded from FTP servers. To locate relevant files, Archie systems (search services for FTP directories) can be used (like ArchiePlexForm [15]).
· Online databases
Internet users also have access to huge databases on nearly every topic. To find these databases, search services should be used.
If a sufficient answer has still not been found, the question should be asked in a related newsgroup. The last possible step is to create a homepage of one's own, stating the question clearly, and to hope for an answer.
5. An outlook into the future
The future development of the Internet must be seen under three aspects. On the one hand, the WWW grows permanently, as more and more companies and people go online; but as the Internet contains more information, searching for a certain piece of information becomes more difficult. On the other hand, the technical infrastructure is being improved to cope with the growth of the WWW (e.g. ATM - Asynchronous Transfer Mode), so information is transmitted faster.
I'm convinced that the third aspect is the most important one. Software companies and other programmers have realised that the search services must be improved to allow fast access to the Internet's contents. There are already projects developing search services with "artificial intelligence", which would simplify the dialogue between man and computer. This dialogue is the greatest problem of today's search services.