The future of search lies in finding ways to bypass search engines altogether. Where information exists locally to make sense of what you’re after, it will be used to create better searches. Data mining will be used to form a picture of which results go together, and what meanings a user attaches to a word. Unambiguous entry points to the web, like spatial search, will provide alternative, quicker routes to data for compatible hand-held devices.
When the mass of data finally overwhelms the conventional search engines, something will have to give. If we want our data out there being used, we will have to transition our content over to semantic web technologies. These technologies provide solutions to developers seeking smarter software, and provide search engines with a way to give better results. New user interface idioms will have to be invented, but in the end the web will be a better place to live.
Crisis? What Crisis?
Search engine technology faces a pressing challenge. The challenge is to quickly yield relevant results from the avalanche of data we are producing. It’s not enough to find the data eventually – search engines must find the right data immediately. They must deduce what ‘right’ means for a query, and that requires a level of understanding they don’t currently have. Every three years the data we produce doubles – in the next three years we will have three times as much data as we created during the whole of history up to 2007. By 2010 search engines will have to work three times harder than today, or be three times less discriminating. Surely, incremental fixes to search algorithms can’t keep up?
Perhaps ‘avalanche’ is not the right term – ‘lahar’ is more appropriate. A lahar is a colossal down-pouring of mud and water. Along with the water comes a variety of unwelcome dross. How do you distinguish worthwhile content from dross? Sometimes it is not easy. The continued success of Nigerian con-artists seems to show that many people naively believe everything they read. Deducing authority, accuracy and relevance is hard to automate. Search engines assume that if many others link to a page then it must be valuable and so rank it higher. That approach cannot hold up under the relentless lahar of dubious data.
When working on the web you have to filter data manually. As the web grows, your role in the search filtration process will grow all-consuming, unless a revolutionary change is brought about. Search engines that rely on reputation face the dilution of the very data they work with. Web users can’t keep up with (or find) the flow of worthy data – so the worthy data isn’t linked to and ranking algorithms the search engines use become useless. They are useless as a judge of truth. Tim Berners-Lee has identified ‘trust’ assessment as a major component of his vision for Web 3.0. It’s hard to see how current search engines can contribute to his vision.
The difficulty with performing searches based on natural language search terms is that natural language allows many meanings for words, homonyms, synonyms and a wealth of nuance for each individual word in a search. There are twelve synonyms of the word ‘search’. Can they all be used interchangeably? The answer is no, they all mean something slightly different, and so probably should return a different result set. The space of all possible meanings grows enormously as the complexity of the search increases. The average relevance of each search result goes down relative to the complexity of a query. Automatically clarifying a search query is a challenge that will not be solved by 2010, but just stating the problem points to some interesting solutions based on present day technologies.
Why is search important?
Consider the trend – we are conducting more and more of our professional and private lives online. We have the communications protocols to allow any connected individual to converse with any other. We have the means to share our thoughts with everyone else in the world who cares to know them. As a computer programmer, I spend a significant portion of my working life searching for information relevant to the problem I am trying to solve. Things that prevent me from achieving my goals are therefore a direct cost to my clients and to me. They reduce my productivity and increase the cost to my clients of getting new software developed. It is hard to overestimate the cost to an economy of search engine inefficiencies.
Inefficient search technologies engender cognitive dissonance – just at the point that we need information we are forced to context switch into a battle of wits with a search engine. It prevents us from entering the flow. The knock-on effect of stopping us from being able to smoothly transition between the data we have and the data we need is another economic cost. It reduces our job satisfaction, forces us to perform below peak and frustrates us. Our morale is weakened by the poor search technologies we have to work with!
So much for the problems. What of the solutions? How can search be improved to lessen or eliminate the costs to our productivity and fun? Well, the solution comes in three parts – preparing a query, performing it and presenting the results. In addition we need to solve the problem of how to prepare (or reveal) data so that search engines can query it. The issue of data and query formats falls broadly within the category of “semantic web” technologies. Next generation data formats are undoubtedly a necessary component of future search, but they are not sufficient – semantic web query languages are beyond the reach of everyday users.
I have endeavored to bring semantic search into the domain of the commercial programmer with my LinqToRdf project. What I have done, though, is to substitute one formal language for another. For semantic search to become a commercial reality on the desktop, we must devise user interfaces that hide or simplify the searching experience. Google already provides ways to build complex queries that filter my results better. Yet, for most searches (of which I must perform dozens every day) I stick to keyword searches. Most users will never use the advanced features of an engine like Google. To reach them, we need a way to produce semantically meaningful queries to match the semantically meaningful data that will be out there on the web. How can we do that? I think the answer lies in context.
Assuming that the search engine doesn’t understand a word that you say when you type in a sequence of search keywords, it makes sense that the relevance of the results will decline as the number of possible misinterpretations increases. The English language, like any other natural language, is full of so many nuances and alternate meanings for each significant word that the space of possible interpretations grows dramatically with every word added to a query. As a consequence, the chances of a search engine being able to deliver relevant results are bound to decline at a similar rate. In fact, the problem is worse than that. The chance of getting the whole query right is the product of the probabilities of getting the sense right for each word in the query. If there are five equally likely ‘senses’ to each of two words in a search string, then the chances of you getting the right result are going to be one in twenty-five. A long shot. There are various heuristics that can be used to shorten the odds. But in the end, you’re never going to consistently get decent results out of a machine that doesn’t understand you or the documents that it has indexed. Enter the semantic web.
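The arithmetic behind that long shot can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are mine rather than a property of any real engine: every sense of a word is taken to be equally likely, and the words are interpreted independently of one another. The function name is made up for the example.

```python
from fractions import Fraction
from math import prod


def chance_all_senses_right(senses_per_word):
    """Chance that every word in a query is read in the intended sense.

    Assumes (for illustration only) that each sense of a word is equally
    likely and that the words are interpreted independently.
    """
    return prod(Fraction(1, n) for n in senses_per_word)


# Two words with five senses each, as in the text above.
two_words = chance_all_senses_right([5, 5])
# Doubling the length of the query makes the odds far worse.
four_words = chance_all_senses_right([5, 5, 5, 5])

print(f"two words: 1 in {two_words.denominator}")
print(f"four words: 1 in {four_words.denominator}")
```

Each extra ambiguous word multiplies the odds against you, which is why longer free-text queries do not automatically produce better results.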
The semantic web is all about providing a means to be completely unambiguous about what you are talking about. The markup languages devised by the W3C and others are aimed at allowing content writers to provide clear and unambiguous meanings to terms. Specifically, they allow one to distinguish between the uses of a word by attaching a URL to each sense; because the URL is unique, it gives a unique sense to the search term. Obviously, if terms are unique then the queries against them can also be unique by referencing the term of interest. The adoption rate for semantic web technologies has not so far been meteoric, and I think that has as much to do with tool support as with any innate pessimism about reaching a consensus on how to model your data.
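To make the idea of URL-identified senses concrete, here is a minimal sketch in Python. It is not RDF or SPARQL, just a toy list of subject, predicate, object triples; every URL in it is a made-up example.org placeholder, and the helper names are hypothetical.

```python
# Hypothetical URLs: each one names exactly one sense of the word "jaguar".
JAGUAR_CAT = "http://example.org/animals#Jaguar"
JAGUAR_CAR = "http://example.org/cars#Jaguar"
ABOUT = "http://example.org/vocab#about"

# A toy store of (subject, predicate, object) statements about documents.
TRIPLES = [
    ("http://example.org/docs/1", ABOUT, JAGUAR_CAT),
    ("http://example.org/docs/2", ABOUT, JAGUAR_CAR),
    ("http://example.org/docs/3", ABOUT, JAGUAR_CAT),
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in TRIPLES
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


def documents_about(sense_url):
    """Documents tagged with one specific sense, not merely the same word."""
    return [subject for subject, _, _ in match(predicate=ABOUT, obj=sense_url)]


print(documents_about(JAGUAR_CAT))
```

A keyword search for "jaguar" would return all three documents. The query by sense returns only the two about the animal, because the thing being matched is the URL rather than the spelling.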
I think that HTML would have taken much longer to reach saturation point if there had been no WYSIWYG editors available. What is needed to make data on the web (and the queries about them) more meaningful is a revolutionary new interface to the semantic web schemas that are available for querying.
Perhaps what is required is a predictive lookup extension to the good old textbox, offering an intellisense-style lookup so that as you type a term into the textbox it looks up the available senses of the term from the search engine. Examples of this are on display in the text entry boxes of Freebase. These senses can be derived from ontologies known to the search engine. If the context of what you’re working on allows it, the right ontology type can be preselected as you perform a search. Just as context provides the way to isolate the specific meaning of a word in a sentence or a query, so too does it provide a way for future search tools to generate intelligible queries.
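A sense-aware textbox of this kind could be backed by something like the following Python sketch. The sense index, its URLs and the domain labels are all invented for illustration; a real engine would draw them from its ontologies, and this is not how Freebase implements its lookup.

```python
# Hypothetical sense index: term -> list of (sense URL, domain, label).
SENSES = {
    "python": [
        ("http://example.org/sense/python-language", "computing",
         "Python (programming language)"),
        ("http://example.org/sense/python-snake", "zoology",
         "Python (snake)"),
    ],
    "pascal": [
        ("http://example.org/sense/pascal-language", "computing",
         "Pascal (programming language)"),
        ("http://example.org/sense/pascal-unit", "physics",
         "Pascal (unit of pressure)"),
    ],
}


def suggest_senses(prefix, context_domain=None):
    """Suggest senses for terms starting with `prefix`.

    Senses whose domain matches the caller's current context are listed
    first; the sort is stable, so the rest keep their original order.
    """
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    matches = [
        sense
        for term, senses in SENSES.items()
        if term.startswith(prefix)
        for sense in senses
    ]
    return sorted(matches, key=lambda sense: sense[1] != context_domain)


# Typing "pyth" inside a zoology application puts the snake first.
for url, domain, label in suggest_senses("pyth", context_domain="zoology"):
    print(label, "->", url)
```

The user still types a plain word, but what is sent to the search engine is the URL of the sense they picked, or the one their context picked for them.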
The price of being able to make meaningful queries and get meaningful answers is that someone somewhere has to compose their data in a meaningful way. Meaning is a tricky word – here I mean complex structured data designed to adequately describe some domain. In other words someone has to write and populate an ontology for each domain that the users want to ask questions about. It’s painstaking, specialized work that not just anyone can do. Not even a computer scientist – whilst they may have the required analysis and design skills, they don’t have the domain knowledge or the data. Hence the pace of forward progress has been slow as those with the knowledge are unaware of the value of an ontology or the methods to produce and exploit it.
Compare this to the modus operandi of the big search companies. Without fail they all use some variant on full-text indexing. It should be fairly clear why as well – they require no understanding of the domain of a document, nor do their users get any guarantees of relevance in the result sets. Users aren’t even particularly surprised when they get spurious results. It just goes with the territory.
Companies that hope or expect to maintain a monopoly in the search space have to use a mechanism that provides broad coverage across any domain, even if that breadth is at the expense of accuracy or meaningfulness. Clearly, the semantic web and monolithic search engines are incompatible. Not surprising then that for the likes of Microsoft and Google the semantic web is not on their radar. They can’t do it. They haven’t got the skills, time, money or incentive to do it.
If the semantic web is to get much of a toehold in the world of search engines it is going to have to be as a confederation of small search engines produced by specialized groups that are formed and run by domain experts. In a few short years Wikipedia has come to rival the likes of Encyclopedia Britannica. The value of its crowd-sourced content is obvious. This amazing resource came about through the distributed efforts of thousands across the web, with no thought of profit. Likewise, it will be a democratized, decentralized, grass-roots movement that will yield up the meaningful information we all need to get a better web experience.
One promising direction in structured data is being taken by Metaweb (founded by Danny Hillis, of Thinking Machines fame) whose Freebase collaborative database is just becoming available to the general public. It uses types in a way similar to OWL, and has a very simple and intuitive search interface. It provides a straightforward means for creating new types and adding data based on them – critical for its data to flourish. My own experience in creating types to describe Algorithms and Data Structures was encouraging. I seeded the database with a few entries for the most common examples and was able to link those entries to textual descriptions found in Wikipedia easily. Within a few weeks the entries had been fleshed out and many more obscure algorithms had been added by others. If this group of Freebase types continues to gain attention, it may become a definitive list of algorithms on the web acting as a main entry point for those wishing to research potential solutions to their programming problems.
Perhaps islands of high quality structured information may become the favored web entry points. Users may trace links out of an entry in Freebase to areas where HTML anchors are all that they need to follow. Perhaps structured data can be used to disambiguate queries on conventional search engines – there may be many homonymous entries in Freebase, and reference to the specific Freebase type may help the search engine to filter its own results more effectively.
This crowd-sourced approach to creating structured data on the web is obviously going to help in the creation of isolated areas of excellence like Freebase and Wikipedia. Can it help to displace the mass of unstructured data on the web? I don’t think so. Our exponential outpour of data will mostly be in the form of video and audio. Without sophisticated image and voice recognition systems, search engines of any type will not be able to index them. Metadata in RDF or OWL format is being used to describe the binary data out there on the web, but that is equivalent to only indexing the titles of documents while ignoring their content.
Microsoft and Google both seem ambivalent towards the Semantic Web. Unofficially, they claim that they don’t believe in the semantic web, but in recent months Microsoft has released media management products that are based on the W3C’s RDF and OWL standards. They claim that semantic web technologies overcome problems they could not solve using conventional techniques.
Embedded Search Tools
At the moment search engines are stand-alone. They are cut off from the tools we are using them to gather data for. In my case, I spend most of my day inside a programmer’s IDE (integrated development environment) – like a glorified text editor for source code. When I’m in there I’m constantly faced with little puzzles that crop up to do with the program I’m writing. When I need to get a piece of reference information I have to come out of the IDE, call up a browser and perform the search there before switching back to what I was doing. That context switching is distracting and I’d prefer the whole experience to be smoother. In addition to that, there is a whole bunch of contextual information inside the application that is (potentially) germane to the search I’m doing.
Embedding the internet search facilities inside of general purpose applications such as my IDE provides a wealth of information to the search engine with which to automatically filter the results or to target the search. This extra context enhances the accuracy of results but in the final accounting it is just extending the lifespan of the free text search paradigm, without properly bridging the gap between meaningful and meaningless queries and data. To go beyond the limiting technologies in use at the moment we will have to search for new representations of data and use those as the context in which we ask our questions.
For example, the current documentation that we write for our source code is derived from HTML and contains links to other classes and systems that we use. If the documentation that came with code was derived from an ontology, and it linked to other ontologies describing the systems our code uses and interacts with, then the documentation could be mined as a source of context information with which to produce a semantic search query.
Imagine that you are an engineer and are producing a quote for one of your clients. Imagine that you needed to put the price for a gross of sprockets into the invoice. To do that you need the price from one of your supplier’s online catalogues. The fact that your suppliers are referenced in previous invoices should allow an embedded search engine to query the supplier’s catalogue (an ontology) using semantic web technologies. The context is derived from a mix of internal knowledge of what task you are performing, and what pattern of relationships you have with suppliers that helps you to fulfil orders.
Embedded search engines will rely on the pervasiveness of structured data both inside and outside of the enterprise. Without that data, they will have just as hard a time making sense of a query as any general purpose search engine.
Statistical Models and Search
A viable short term solution to the problem of injecting meaning into a search query and matching it with the meaning in a document is to exploit data mining. Data mining can be used to help make predictions of the meaning that is likely to be attached to a term by a user. In just the same way that Amazon looks at what books a user has, and what books they click on, and how they purchase books, so too can a modern search engine gather data about what you search for, and what results you clicked on from what was offered. In essence, there is very little difference between the task performed by the Amazon web site and that performed by an Internet search engine.
A store like Amazon is a vast collection of data items that can be searched against. Each user that searches has a particular ‘best result’ in mind, and when they find it they select it (by buying it, viewing it or adding it to a wish list or shopping cart). A search engine is used in the same way – in this case the results are given away, but the dynamics are the same – users determine from the metadata provided whether a given result is a close enough match for what they were after. If it is, then they will click on it. Each time a user clicks on a link, they are providing some information back to the search engine that can be used to tune the user’s search in future. Search engines can build models from these clicks to prioritize the results offered the next time that user (or one like them) searches.
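As a rough sketch of that feedback loop, the Python below records which results were shown and which were clicked, then reorders results by a smoothed click rate. The class and its smoothing constants are my own illustrative choices, not a description of how any real search engine or store ranks its results.

```python
from collections import defaultdict


class ClickModel:
    """Toy click-feedback model: re-rank results by smoothed click rate."""

    def __init__(self):
        self.shown = defaultdict(int)    # (query, url) -> times displayed
        self.clicked = defaultdict(int)  # (query, url) -> times clicked

    def record(self, query, shown_urls, clicked_url=None):
        """Log one search: the results displayed and the one chosen, if any."""
        for url in shown_urls:
            self.shown[(query, url)] += 1
        if clicked_url is not None:
            self.clicked[(query, clicked_url)] += 1

    def click_rate(self, query, url):
        # Add-one smoothing, so a result never seen before scores a neutral 0.5.
        clicks = self.clicked.get((query, url), 0)
        views = self.shown.get((query, url), 0)
        return (clicks + 1) / (views + 2)

    def rerank(self, query, urls):
        """Most-clicked results first; ties keep the engine's original order."""
        return sorted(urls, key=lambda u: self.click_rate(query, u), reverse=True)


model = ClickModel()
results = ["car-dealer.example", "zoo.example", "wildlife.example"]
# Two users searching for "jaguar" both pick the wildlife page.
model.record("jaguar", results, clicked_url="wildlife.example")
model.record("jaguar", results, clicked_url="wildlife.example")
print(model.rerank("jaguar", results))
```

The same bookkeeping could be kept per user, or per group of similar users, to approximate the meaning a particular person attaches to a term.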
Needless to say, the data generated in this kind of system would be truly gigantic. As the internet grows, this metadata will grow at an exponential rate. The only way to avoid this kind of exponential curve is to increase the information density of the data on the net, or to increase the hit rate and the efficiency of the process of finding results. Again, the practicalities of dealing with data sets as big as the web point to the need for some fundamental revolution in data storage and search.
When you perform a search on the internet you are diving headfirst into a wide sea of data. The data you are after is like an island in that sea, but you might dive in a long way from where it is. The place where you dive in is critical to your success, and geospatial search may provide the solution by providing a natural and unambiguous point of entry into the data space. Much of the data that we place on the web has a natural home location, and that can be used to bring relevant data to the fore.
Rolf Landauer once said ‘information is physical’. Information cannot be had about anything other than physical things, and it cannot be represented on anything other than physical objects. Information itself is the structure of the universe. Although we seldom think of it that way – search technology is really an extension of the senses. It allows us to peer into other places and times. All information is tied to specific geospatial and temporal locations. Mostly, we are not in the habit of taking much notice of these ambient contexts that are often the only concrete links between facts in the world.
Some years ago, I was sat daydreaming in a coffee shop near where I lived. While looking out of the window at some old shops across the road I wondered what would happen if I was able to call up the information related to those shops just because I was sat near them. Each shop was inhabited by a business, each business was but one of a long line of inhabitants, and each inhabitant would have a huge network of public and private relationships that in some sense defines them as an entity.
I imagined that I had a GPS enabled device that made use of a protocol similar to the Domain Name System (DNS). DNS divides web sites into a partly spatial and partly logical or commercial hierarchy. The process of searching for the web site of one of those businesses would involve sending a query up the line from the domain I was on, till the query got to a DNS server that knew how to route the query to the DNS servers for the host of the website I was after. A spatial query engine would add another level of searching over the top of DNS, mapping spatio-temporal coordinates onto the conventional DNS naming system.
I wondered whether a means of search based on geographical location might not be more sensible. Imagine that the search results were prioritized according to physical and temporal distance from where and when I was. This would be an alternative name resolution strategy that would give me the locations on record for the shops across the road. From there I could navigate the local patterns of the publicly accessible relations for the entities that I had found.
If I requested the spatial search, the first result on the list should be the coffee shop I was in (plus any other companies or individuals that were on other floors of the building). Following that would come any entities nearby, in order of distance. Perhaps I could point my handheld device at the shop (say it’s an art gallery) and a digital compass might deduce the direction I was pointing in and filter the results for me to just the sector I was pointing at, up to a certain distance. The gallery is a business, and as such it is (in the UK at least) obliged by law to file its accounts with the government organization called Companies House. These links should be accessible, since the gallery would (by virtue of renting space in the real world) have a claim on the virtual space of the search engine. All that is required to get the company accounts would be for the company accounts to be online and indexed to a specific company that has a registry entry in the spatial search engine.
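The ordering and the compass filter described above can be sketched in Python using the standard haversine distance and initial-bearing formulas. The place names, coordinates, sector width and distance limit are all invented for the example, and the function names are hypothetical.

```python
from math import asin, atan2, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula)."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat, dlon = p2 - p1, radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2 (0 = north, 90 = east)."""
    p1, p2 = radians(lat1), radians(lat2)
    dlon = radians(lon2 - lon1)
    y = sin(dlon) * cos(p2)
    x = cos(p1) * sin(p2) - sin(p1) * cos(p2) * cos(dlon)
    return degrees(atan2(y, x)) % 360


def spatial_search(here, places, heading=None, half_angle=20, max_distance_m=200):
    """Names of nearby places, nearest first.

    If `heading` (degrees) is given, keep only places within `half_angle`
    degrees of it. A place at the user's own position is always kept.
    """
    found = []
    for name, lat, lon in places:
        d = distance_m(here[0], here[1], lat, lon)
        if d > max_distance_m:
            continue
        if heading is not None and d > 0:
            bearing = bearing_deg(here[0], here[1], lat, lon)
            off_axis = abs((bearing - heading + 180) % 360 - 180)
            if off_axis > half_angle:
                continue
        found.append((d, name))
    return [name for _, name in sorted(found)]


here = (51.5000, -0.1000)  # made-up position of the coffee shop
places = [
    ("Bakery", 51.5000, -0.0990),        # roughly 70 m to the east
    ("Art gallery", 51.50045, -0.1000),  # roughly 50 m to the north
    ("Coffee shop", 51.5000, -0.1000),   # where I am sitting
    ("Museum", 51.5100, -0.1000),        # over a kilometre away
]
print(spatial_search(here, places))             # everything nearby
print(spatial_search(here, places, heading=0))  # pointing the device north
```

The names that come back would then serve as keys into the registry entries, such as the Companies House filings mentioned above.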
I then imagined that getting the data for the gallery opened up a path to a whole bunch of other information that flowed out into other sites. Perhaps images were available for all the works of art on sale. Perhaps from there I could navigate to the sites of the artists that I liked. Perhaps I liked one of the pictures the gallery had on display. Perhaps I finished my coffee, and crossed the road and took a look? This kind of entry point might make a difference to the number of passers-by that went into the gallery. The spatial search entry point to the web gives way to conventional web browsing, without any of the intermediate head scratching.
This spatial search can of course tie in with the new generation of GPS enabled multi-purpose mobile devices that are now coming onto the market. It could also dovetail nicely with existing spatial search engines such as Google Maps or Live Earth. Spatial search creates a direct and meaningful structure in the virtual space that easily correlates to our real space. Spatial addresses may well provide a more powerful entry-point into the mass of data that we are trying to negotiate than a mere keyword search. The problem with text indexing and keyword searches is that they carry too many possible meanings for what you were after, and the relevance of the results decreases in proportion to the number of possible interpretations. GPS at least is precise and unambiguous, and soon to be available in every phone…