HTML From the Ground Up

by L. Downs

More About URLs

Now we're going to take a closer look at some of the nuts and bolts that hold all this together. Part of this lesson is going to be about terminology--not necessarily because it will help you write better Web pages, but because you're going to hear these terms bandied about and if you have no clue what they mean you'll be out of the loop. And that's bad. So let's take a deep breath and revisit how all this works.

First of all, you need to understand that beneath the human-friendly icing of URLs and links, computers actually communicate with each other through numbers. Text is exchanged in the form of numbers. So are graphics. Even the actual "address" of each computer is a number, usually written in the form [number].[number].[number].[number] where each [number] is a whole number between 0 and 255. This numeric address is called an IP address. In some cases where a network is having problems and you can't access a Web site by its domain name, you can still get in by entering its IP address instead (if you know what it is). You may even encounter some sites which do not have a domain name, just an IP address.
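To make the "four numbers between 0 and 255" rule concrete, here is a minimal Python sketch that checks whether a string has that shape and shows the single number hiding behind the dotted form. The function name is_dotted_quad and the sample address 192.0.2.1 are illustrative assumptions, not part of this lesson.

```python
import ipaddress


def is_dotted_quad(text):
    """Return True if text is four whole numbers from 0 to 255 separated by periods."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each part must consist only of plain digits...
        if not (part.isascii() and part.isdigit()):
            return False
        # ...and must not be larger than 255.
        if int(part) > 255:
            return False
    return True


print(is_dotted_quad("192.0.2.1"))   # True
print(is_dotted_quad("256.1.1.1"))   # False: 256 is out of range

# The dotted form is just a readable way of writing one large number.
as_number = int(ipaddress.IPv4Address("192.0.2.1"))
print(as_number)
```

The standard ipaddress module does the real work in practice; the hand-written check is only there to mirror the rule as stated above.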

In order to make things a little more friendly, however, a domain name can be assigned to one or more IP addresses. For example, in the URL http://www.twinplanets.com "twinplanets.com" is the domain name. There are many specialized computers scattered all over the Internet whose only job is to look up domain names and convert them to IP addresses; these are called "domain name servers," or "DNS servers" (DNS stands for Domain Name System). If the domain name server for an institution crashes, the only way to access any Internet site from that institution is via its IP address until the domain name server is fixed.

Think of the domain name as being like the root directory (for example, C:\ on a PC) for a site. As on a PC, you can have subdirectories beneath the root directory, and subdirectories beneath those, and so on. Eventually you end with an actual file name, such as index.html. We call the entire string of subdirectories followed by the file name (everything after the domain name itself) the path. For example, in the URL http://www.twinplanets.com/tutorial/examples/index.html we would call twinplanets.com the domain name and /tutorial/examples/index.html the path. Remember that you can also have a fragment (such as #top) tacked on at the very end, enabling you to jump to an anchor somewhere in the document.
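The pieces described above can be pulled apart with a few lines of Python. This is only a sketch using the standard urllib.parse module; the #top fragment is added to the lesson's example URL for illustration. Note that the host name reported by the library includes the "www." prefix, whereas this lesson calls just "twinplanets.com" the domain name.

```python
from urllib.parse import urlsplit

url = "http://www.twinplanets.com/tutorial/examples/index.html#top"
parts = urlsplit(url)

print(parts.hostname)  # www.twinplanets.com
print(parts.path)      # /tutorial/examples/index.html
print(parts.fragment)  # top
```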

Sometimes you'll see a URL with no filename at the end. In that case, the Web server will usually look for a default file (commonly index.html) at the location specified, and send it to the browser for display. If the server can't find such a file, it may instead send a listing of the contents of the directory in question, and this is considered bad (you may not want the world to be able to see--and perhaps download--everything in the directory). A good rule of thumb is to make sure that every directory on your site has an index.html file present, even if it's just a file to redirect the user to another URL.

Now you may find that some of what you learned about relative and absolute addressing makes a little more sense. In particular, if you're accessing another file on your site (in other words, with the same domain name) you only need to include the path of the file, rather than the entire URL. Note that in many cases "www" is added to the front of domain names to indicate that it's a Web address (as opposed to an email address, for example). In most cases the root URL of your site (the part that you can leave out) will consist of your domain name plus "www." (But not always; for example the URL for Google's experimental "Froogle" service is http://froogle.google.com/.)
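To see how a browser fills in the parts you leave out, here is a small Python sketch using urljoin from the standard library. The file names lesson2.html and page2.html are made up for the example; only the base URL comes from this lesson.

```python
from urllib.parse import urljoin

# The page the link appears on.
base = "http://www.twinplanets.com/tutorial/examples/index.html"

# A path beginning with "/" is resolved from the root of the same site.
from_root = urljoin(base, "/tutorial/lesson2.html")

# A bare file name is resolved relative to the current page's directory.
same_dir = urljoin(base, "page2.html")

# A complete URL is used as-is; nothing is borrowed from the base.
elsewhere = urljoin(base, "http://froogle.google.com/")

print(from_root)  # http://www.twinplanets.com/tutorial/lesson2.html
print(same_dir)   # http://www.twinplanets.com/tutorial/examples/page2.html
print(elsewhere)  # http://froogle.google.com/
```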

An Introduction to URL Character Encoding

Because many characters have a specific meaning or function within a URL, or are not handled well by Web servers when included as part of a URL, they are referred to as "unsafe or reserved characters." An example is the forward slash, which separates parts of the path. If you wish to use a forward slash for any other purpose in a URL (such as part of a filename, for example), then you have two options:

Get over it; or
Encode the character using a special sequence of alphanumeric characters designed to "stand in" for the unsafe character.

If you do not do one of these things, the Web server will interpret the slash in your filename as a directory/filename separator, and you will get a 404 Page not found error.

Far and away the simplest thing to do is not to use unsafe characters in your filenames or directory names. Common unsafe and reserved characters include:

Colon and semicolon (:;)
Space
Backward and forward slashes (/\)
Pound sign (#)
Question mark (?)
Caret (^)
Tilde (~)
Bar (|)
Brackets ([]{})
Accent mark (')
Equals sign (=)
Ampersand (&)
At sign (@)
Plus sign (+)
Quotation marks ("")
Percent sign (%)
Less than/greater than (<>)

If you absolutely must use one of these characters in a file or directory name, then you will have to encode it. You do this by replacing the unsafe or reserved character in the URL with a percent sign followed by the two-digit hexadecimal code that identifies the character. This "stands in" for the unsafe character.
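The rule is simple enough to work out by hand. The following Python sketch builds the stand-in sequence for a single character from its numeric code; the function name percent_code is an assumption for illustration, and the sketch deliberately handles only plain ASCII characters.

```python
def percent_code(char):
    """Return the percent-encoded stand-in for one ASCII character."""
    code = ord(char)  # the character's numeric code
    if code > 127:
        raise ValueError("this sketch only handles ASCII characters")
    # A percent sign followed by the code as two uppercase hexadecimal digits.
    return "%{:02X}".format(code)


for char in [" ", "#", "?", "&", "%"]:
    print(repr(char), "->", percent_code(char))
# ' ' -> %20
# '#' -> %23
# '?' -> %3F
# '&' -> %26
# '%' -> %25
```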

For example, if you had an HTML file called my file.html and just had to keep the space in the filename, then every time you constructed a link to this file, you would write the filename in the URL like this: my%20file.html.

If you don't do this, you will likely get a 404 error every time you try to link to the file.
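If you would rather not look up the codes yourself, most programming languages can do the encoding for you. As one possible approach, this Python sketch uses quote and unquote from the standard urllib.parse module on the filename from the example above; the name a/b.html is hypothetical and is there only to show a slash being encoded.

```python
from urllib.parse import quote, unquote

filename = "my file.html"

encoded = quote(filename)
print(encoded)           # my%20file.html

# unquote() reverses the process and recovers the original name.
decoded = unquote(encoded)
print(decoded)           # my file.html

# By default quote() leaves "/" alone, since it normally separates
# directories. Passing safe="" tells it to encode the slash as well.
slash_encoded = quote("a/b.html", safe="")
print(slash_encoded)     # a%2Fb.html
```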

Some browsers encode these characters automatically (Internet Explorer) and some don't (Netscape). So the safe thing to do is to encode them yourself every time.

Terms to know from this lesson

IP address: A unique numeric identifier given to each and every computer on the Internet. IP addresses consist of four numbers ranging from 0 to 255 separated by periods. "0.255.7.89" is an example of a correctly formed IP address.

Domain name: A name that identifies one or more IP addresses. Domain names are used in Web URLs to identify a particular server that a Web page resides on. In the address http://www.twinplanets.com, "twinplanets.com" is the domain name.

Path: The part of a URL consisting of the file name preceded by the hierarchy of directory names in which the file is stored. The path tells the server what file the browser wants and where to find it. In the address http://www.twinplanets.com/tutorial/examples/index.html, "/tutorial/examples/index.html" is the path.

Unsafe or reserved characters: Characters which have a specific meaning or function within a URL, or which are not handled well by Web servers when included as part of a URL.

Character encoding: Encoding a character using a percent sign and a numeric value (actually the hexadecimal code for the character) instead of the character itself.

Portions of this tutorial originally appeared in Technotes, a publication of the UNLV Libraries, and are copyright by the University of Nevada, Las Vegas; used by permission. All remaining material © 2003 Lamont Downs.