Sir Tim Berners-Lee on the Semantic Web

(Via Technology Review)

Creating the World Wide Web didn’t make Tim Berners-Lee instantly rich or famous. In part, that’s because the Web sprang from relatively humble technologies. Berners-Lee’s invention was based on an information retrieval program called Enquire (named after a Victorian book, Enquire Within upon Everything), which he wrote in 1980 as a contract programmer at the European Organization for Nuclear Research (CERN) in Geneva, Switzerland. In part, it’s because Berners-Lee did the unthinkable when, more than a decade later, he finished writing the tools that defined the Web’s basic structure: he gave them away, with CERN’s blessing, no strings attached. While others made millions off his invention, the soft-spoken programmer went on to found the World Wide Web Consortium (W3C) at MIT, which he still directs, to promote global Web standards and development.

Berners-Lee is finally getting his reward: in July he was knighted by Queen Elizabeth II, and the previous month he received Finland’s million-euro Millennium Technology Prize, awarded “for outstanding technological achievements that directly promote people’s quality of life, are based on humane values, and encourage sustainable economic development.”

Now in new offices in MIT’s Frank Gehry–designed Ray and Maria Stata Center, the 49-year-old native of England is busy overseeing hundreds of projects at the W3C. He is also personally engaged in developing his second big idea: the Semantic Web, which adds definition tags to information in Web pages and links them in such a way that computers can discover data more efficiently and form new associations between pieces of information, in effect creating a globally distributed database. Though part of Berners-Lee’s original intention for his invention, the Semantic Web has been 15 years in the making and has met its share of skepticism. But Berners-Lee believes it will soon win acceptance, enabling computers to extract meaning from far-flung information as easily as today’s Internet simply links individual documents.

The Semantic Web, coupled with other specifications and tools being developed at W3C, including accessibility standards for disabled people and software for mobile devices, is part of Berners-Lee’s grand vision of “a single Web of meaning, about everything and for everyone.” But is it a tangled web we weave? Despite his excitement about the future, Berners-Lee worries that poorly conceived changes to the Web’s organization and governance could compromise its inherent functionality and “universality.” The father of the World Wide Web shared his concerns—and dreams—the day before flying to Helsinki to accept his Millennium prize.

TECHNOLOGY REVIEW: For several years, you’ve been promoting something you call the Semantic Web, but people don’t seem too excited. Why not?
TIM BERNERS-LEE: It’s not the first time I’ve had this paradigm-shift problem. Early on, people really didn’t understand why the Web was interesting. They saw it in the smaller scale, and it’s not interesting in the smaller scale. Same thing with the Semantic Web.

TR: How do you get past that?
B-L: Right now we are just starting by putting applications onto the Semantic Web one by one and linking them up where it seems useful. But what’s exciting is the network effect. The vision is that we will get to a critical mass, where everything starts getting linked into an unimaginably large whole. Then, the incentive to add more to it rises exponentially as the value of what is out there also does.

Because few people initially get this great “aha!” of connecting to a huge mass of Semantic Web data, it all has to be done by people who are convinced—who understand that it’s worth putting the effort into getting the thing off the ground.

TR: Then please explain: Why is it worth all this up-front effort?
B-L: The common thread to the Semantic Web is that there’s lots of information out there—financial information, weather information, corporate information—on databases, spreadsheets, and websites that you can read but you can’t manipulate. The key thing is that this data exists, but the computers don’t know what it is and how it interrelates. You can’t write programs to use it.

But when there’s a web of interesting global semantic data, then you’ll be able to combine the data you know about with other data that you don’t know about. Our lives will be enriched by this data, which we didn’t have access to before, and we’ll be able to write programs that will actually help because they’ll be able to understand the data out there rather than just presenting it to us on the screen.

TR: How does the Semantic Web understand data?
B-L: Suppose you’re browsing the Web and you find a seminar advertised, and you decide to go. Now, there is all sorts of information on that page, which is accessible to you as a human being, but your computer doesn’t know what it means. So you must open a new calendar entry and paste the information in there. Then get your address book and add new entries for the people involved in the seminar. And then, if you wanted to be complete, find the latitude and the longitude of the seminar, and program that into your GPS [Global Positioning System] device so you could find it.

It’s very laborious to do all this by hand. What you would like to be able to do is just tell the computer, “I’m going to this seminar.” If there were a Semantic Web version of the page, it would have labeled information on it that would tell the computer “this is an event,” and what time and date it is. And it would automatically add your travel to your event book. It would add the people to your address book, and it would program your GPS to give you directions. It would have the relationships between the event and the various people chairing it. And those people would have Semantic Web personal pages, which contained information about how you could contact them.

Your address book can now grow from a closed repository of private data to a view on the people-related data in the world.
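
The seminar scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real Semantic Web format: the `Event` and `Person` classes, the `attend` function, and all of the sample values are hypothetical stand-ins for the labeled information such a page might expose.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Person:
    name: str
    email: str


@dataclass
class Event:
    # Labeled fields: the program knows which value is the date,
    # which is the location, and who is chairing.
    title: str
    start: datetime
    latitude: float
    longitude: float
    chairs: list = field(default_factory=list)


def attend(event, calendar, address_book, gps_waypoints):
    """Fan one labeled event out into calendar, address book, and GPS."""
    calendar.append((event.start, event.title))
    for person in event.chairs:
        address_book[person.name] = person.email
    gps_waypoints.append((event.latitude, event.longitude))


# Hypothetical data, invented for this sketch.
seminar = Event(
    title="Example seminar",
    start=datetime(2004, 6, 14, 10, 0),
    latitude=42.3616,
    longitude=-71.0906,
    chairs=[Person("A. Chair", "chair@example.org")],
)

calendar, address_book, gps_waypoints = [], {}, []
attend(seminar, calendar, address_book, gps_waypoints)
```

Because every value arrives already labeled, the single call to `attend` replaces the copying and pasting described in the answer above.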

TR: Does the Semantic Web, then, merely automate many of the things that a human assistant would do?
B-L: No. A human assistant uses a form of intelligence that we are not mimicking here. The human assistant will have the human mind’s ability to suddenly think of correlates across the whole spectrum of his or her experience. “I’ve booked you through Tiawicha because they have the flower festival that weekend, I think, and…well, maybe you’ll like it” is a human thought process.

This is more like giving you a program which can do all the things which your MIS department could write programs to do but doesn’t have time to. But it is still a program. Just as the World Wide Web is still a document.

In the future, the Semantic Web will be a great place to develop artificial intelligence, AI, in the strong sense. But right now we are making something quite mechanical—even if we are using bits and pieces of the machinery developed by the AI community over the years.

TR: It would seem an impossibly huge task. How does the technology work?
B-L: The Semantic Web technology tackles the problem in two stages. The more mundane is a common data format. You can take a database or a calendar or an address book or a bank statement or a weather reading—basically anything with hard data in it—and make the machine write it in the basic Semantic Web language, instead of some proprietary or application-specific format. This solves the “syntactic” problem.

It still doesn’t solve the “semantic” one, though. For that, the Semantic Web first gives names to the basic concepts involved in the data: date and time, an event, a check, a transaction, temperature and pressure, and location. These are all defined just to mean whatever they mean in the system which produces the data—for example, “Transaction date as I get on a bank statement,” and so on. This set of concepts is called an ontology. Then, where there are connections between ontologies, such as when the date and time on a photograph is the same concept as the time on a weather report, we write rules to take advantage of these connections. This allows one to query the Semantic Web agent for photos taken on sunny days, for example. Bit by bit, link by link, the data becomes connected, interwoven. The exciting thing is serendipitous reuse of data: one person puts data up there for one thing, and another person uses it another way.
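
A minimal Python sketch of the two stages, assuming a toy data model: every fact is written as a (subject, predicate, object) triple, and one rule states that the date on a photograph and the date on a weather report are the same concept. The predicate names, the `SAME_CONCEPT` table, and the sample data are hypothetical and do not come from any W3C vocabulary.

```python
# Stage 1: a common format. Each source writes its facts as triples.
photo_data = [
    ("photo1", "type", "Photo"),
    ("photo1", "photo:takenOn", "2004-06-12"),
    ("photo2", "type", "Photo"),
    ("photo2", "photo:takenOn", "2004-06-13"),
]
weather_data = [
    ("report1", "weather:date", "2004-06-12"),
    ("report1", "weather:condition", "sunny"),
    ("report2", "weather:date", "2004-06-13"),
    ("report2", "weather:condition", "rain"),
]

# Stage 2: a rule connecting the two ontologies.
# Both predicates are declared to name the same concept, "date".
SAME_CONCEPT = {"photo:takenOn": "date", "weather:date": "date"}


def normalize(triples):
    """Rewrite predicates that the rule says mean the same thing."""
    return [(s, SAME_CONCEPT.get(p, p), o) for s, p, o in triples]


def photos_on_sunny_days(photos, weather):
    graph = normalize(photos) + normalize(weather)
    sunny_reports = {s for s, p, o in graph
                     if p == "weather:condition" and o == "sunny"}
    sunny_dates = {o for s, p, o in graph
                   if p == "date" and s in sunny_reports}
    photo_ids = {s for s, p, o in graph if p == "type" and o == "Photo"}
    return sorted(s for s, p, o in graph
                  if p == "date" and o in sunny_dates and s in photo_ids)


result = photos_on_sunny_days(photo_data, weather_data)
```

Neither data set was published with the other in mind; the single mapping rule is what lets the query join them.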

TR: You’ve said that “phase one” of the Semantic Web is finished. Can you explain?
B-L: The way the Semantic Web works is by defining new languages for computers to exchange information. Phase one was getting those first languages, for both syntax and semantics, to the state where they became standards supported by W3C’s members. Because interoperability is the key: you can’t call it a Semantic Web application if the program just sits there doing things with its own data format without being able to exchange data with other programs. Now there is this foundation, and anybody who wants to make a new application and publish data can do that, and everybody else’s program will be able to read the data.

TR: What kinds of Semantic Web applications are people making for the next phase?
B-L: Exciting things are happening in the life sciences. The big challenges such as cancer, AIDS, and drug discovery for new viruses require the interplay of vast amounts of data from many fields that overlap: genomics, proteomics, epidemiology, and so on. Some of this data is public, some very proprietary to drug companies, and some very private to a patient. The Semantic Web challenge of getting interoperability across these fields is great but has huge potential benefits.

TR: But it’s not just a matter of exchanging data from a multitude of fields?
B-L: No. There are also challenges around maintaining privacy and intellectual property while making effective use of the information. For example, when searching for a new drug, one might want to join epidemiological data with external factors such as weather and travel and demographics to find out how a disease is transmitted and what sorts of people are predisposed to it. One may then seek to connect it to a genetic trait and start asking what proteins are associated with that, and what they enable and block in the biology of the human cell. Subsequently, one may want to connect the chemicals involved in those pathways to symptoms of diseases, and also to possible chemicals that could be used as a drug. There’s a great deal to gain, which is why a lot of people are getting very fired up about working on the life sciences with Semantic Web applications.

TR: Is there an existing application that shows how the Semantic Web can form such connections?
B-L: If you want to play with the Semantic Web, you can make a friend-of-a-friend file. In a FOAF file [the data component of a personal home page, formatted in a standardized way], you can publish stuff about yourself, your organization, your publication, places, or photographs. You can have a pointer that says “this is a photograph about me” and other data about the photograph, such as who else is in it.

To create a FOAF file, you must fill out a form, such as the one at http://www.ldodds.com/foaf/foaf-a-matic.html. From this information, a Semantic Web–readable text file is generated that you can add to your personal website. There are semantic websites that will pull that data up and give you things like a list of photographs linking you to somebody else. I’m three photographs from Frank Sinatra because I’m photographed with Bill Clinton who’s been photographed with one of the Kennedys who’s been photographed with Frank Sinatra. That’s a silly application, but it really shows the power of the reuse of information.
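
The "three photographs from" idea amounts to a shortest-path search over who appears in which photograph. The Python sketch below assumes the co-depiction data has already been pulled out of FOAF files into a plain dictionary; the photo identifiers and the `photo_chain` function are hypothetical, and the sample data merely mirrors the chain described above.

```python
from collections import deque

# Hypothetical co-depiction data: photo id -> people shown in it.
photos = {
    "photo_a": {"Tim Berners-Lee", "Bill Clinton"},
    "photo_b": {"Bill Clinton", "A Kennedy"},
    "photo_c": {"A Kennedy", "Frank Sinatra"},
}


def photo_chain(photos, start, goal):
    """Breadth-first search for the shortest chain of photos
    linking start to goal. Returns a list of photo ids, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        person, chain = queue.popleft()
        if person == goal:
            return chain
        for photo_id, people in sorted(photos.items()):
            if person in people:
                for other in sorted(people - seen):
                    seen.add(other)
                    queue.append((other, chain + [photo_id]))
    return None


chain = photo_chain(photos, "Tim Berners-Lee", "Frank Sinatra")
```

Here `chain` holds three photo ids, one per link in the chain, and a person who appears in no photograph yields `None`.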

TR: Can you describe a more serious example?
B-L: It’s exciting to see industry focused on implementing these standards. Tool kits from HP and IBM, authoring applications from Adobe, smart content management solutions from Profium and Brandsoft, and search engines from Network Inference are all working to create a Semantic Web at various scales. These and other technologies are being adopted by communities that in turn revolutionize how these groups collaborate and communicate. This is what’s happening in life sciences, which we spoke about earlier.

In the U.K., the Semantic Web Environmental Directory is a prototype of a new kind of directory of environmental organizations and projects. Rather than centralizing the storage, management, and ownership of the information, SWED simply harvests data and uses it to create the directory. From a social perspective, there’s an application nicknamed Fatcats from FoafCorp [a Semantic Web project that extends the friend-of-a-friend format to corporate entities] that allows you to pick a company, and it shows you who’s on its board by displaying a graph of connected people. When you click on one of the people, it shows you all the boards they’re a member of. You can start exploring the spheres of influence in American corporate culture.

The exciting thing is when you find that one of these people has a FOAF file, and you start going from corporate culture into personal culture, and then into photographs, and then into weather information, and then booking flights, and then into booking restaurants, and then into figuring out what wine to have for a meal.

TR: You often talk about the importance of “Web universality.” What do you mean?
B-L: One of the fundamental properties of the Web is the fact that it is just one space, and it’s a consensual space. It should be independent of the hardware you use. It should be independent of the software you use or the operating system it’s running on. It should also be independent of what culture you’re in, or whether you’re writing a wonderful, carefully edited document, or whether you’re scribbling something on the back of the proverbial envelope. And it should be independent of what language you’re using, what character set, whether your letters go up and down, left to right, or right to left. Also, people should be able to access that information even if they have disabilities. At W3C we call this concept “one Web—for anyone, everywhere, on anything.”

TR: And there’s a threat to this universality?
B-L: There was a proposal to make a special top-level domain called “.mobi.” All the websites that would work with mobile phones would be put in that area; it would be the place for Web content for mobile devices. But there should be just one URL, or Web address, for something. To segregate content into a .mobi corral is the wrong way to do it. We’ve got lots of standards at W3C for allowing a website to perform optimally whether you’re looking at it from a cell phone or from a huge screen. But obviously, if you put a “.mobi” at the end of a domain name, then you’re saying, “That’s a special place for stuff you can see on your cell phone.”

TR: What about other top-level domains—.biz, .info, et cetera—that have been proposed to relieve the name crunch in the .com domain?
B-L: Adding new top-level domains won’t help that. What people remember is the string between “www” and “.com.” So if there is a .info or a .biz after it, that would just confuse them. It means that they have to remember the whole thing instead of just the brand between the “www” and the dot.

Also, of course, you have a registration fee system for financial transactions. Small companies or individuals who have a domain may feel that, in order to avoid confusion, they have to keep buying these other ones. Just the yearly rental for a family adds quite a lot to their Internet bill.

TR: There is a power struggle between the United Nations and ICANN, the Internet Corporation for Assigned Names and Numbers, which manages how domain names and Internet addresses are issued. What’s your opinion?
B-L: Some countries are concerned, rightfully, that ICANN runs under a contract with the U.S. Department of Commerce. The Internet is an international thing, and even if it may be carefully run by ICANN in the best interests of the whole world, there is a strong feeling in some countries that the fact that ICANN is funded by the U.S. government means that it is U.S. controlled, and that that is unfair.

My feeling is that this asymmetry should be removed carefully. It’s important that it’s seen to be fair. However, the fact is that ICANN has been set up and is running, and it should not be suddenly thrown away. Making something that represents the stakeholders in a balanced way really takes a lot of experience and constant reappraisal. Maybe ICANN should have more U.N. funding, but I don’t think it should have to become overnight more like a U.N. organization.

A lot of confusion in this area is caused when people use the term “Internet governance.” They start off talking about domain names, which is really a very specific area, and then end up talking about privacy, copyright, confidentiality, commercial terms, and all sorts of parts of the normal legal system. People shouldn’t think ICANN runs everything that happens on the Internet. ICANN just plays a very specific role.

TR: Do you believe that the World Wide Web will be your most important contribution?
B-L: My role necessarily had to morph from lone designer through community agitator to lead architect and facilitator of consensus at W3C. But I suspect the Web will be my most important contribution—although it required being in the right place at the right time. The mistake, though, is to think that it is finished. The Semantic Web is just the application of weblike design to data; it will be many more decades before we will be able to say we have really implemented the Web idea in the full, if ever we can.

TR: Besides the Semantic Web, do you have any other dreams or wishes for the future of the Web?
B-L: Oh, lots and lots! I have always wanted the Web to be a more creative, flexible medium, with annotation systems and group editors and so on. I’m excited about the new portable devices we can use for the Web, about speech-based technology, and a lot of other things. Once you start with the basic Web idea, so much stuff becomes possible.
