Community of Science: Critical Information to Fuel Your Research
Query weighting is done using the Vector Space model. Each term in the query is associated with a query term weight; currently this weight is always 1. The terms in each document, in turn, get a document term weight. This weight is the product of a document-specific weight and the inverse document frequency. The latter is defined as `idf = log(N/n)', where N is the number of documents in the database and n is the number of documents in which the term occurs.
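
The idf definition can be sketched in a few lines of Python. This is only an illustration: the function name `idf` and its parameter names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math


def idf(num_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency, idf = log(N/n).

    Assumption: natural logarithm; the base is not stated in the text.
    """
    if docs_with_term <= 0 or docs_with_term > num_docs:
        raise ValueError("need 0 < n <= N")
    return math.log(num_docs / docs_with_term)
```

Under this definition a term that occurs in every document gets an idf of 0, and the idf grows as the term becomes rarer.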

The other part of the document term weight, the document-specific weight, is computed as follows. Let tf be the number of occurrences of the term in a document and maxtf the maximum frequency of any term in that document. A preliminary weight is computed as `w = (0.5 * tf)/(1 + maxtf)'. These weights are then normalized by dividing each of them by the square root of the sum of the squares of all preliminary weights for terms in this document, so that the document-specific weights make up a vector of length 1. The final document term weight is obtained by multiplying this normalized weight by the idf.
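
A minimal Python sketch of this step follows. It assumes that the normalization divides by the Euclidean norm (the square root of the sum of squares), since that is what produces a vector of length 1. The function names and the `idf_by_term` mapping are hypothetical, and a document is represented simply as a list of its terms.

```python
import math
from collections import Counter


def document_weights(terms):
    """Return {term: normalized document-specific weight} for one document."""
    tf = Counter(terms)
    if not tf:
        return {}
    maxtf = max(tf.values())
    # Preliminary weight: w = (0.5 * tf) / (1 + maxtf)
    prelim = {t: (0.5 * f) / (1 + maxtf) for t, f in tf.items()}
    # Assumption: divide by the Euclidean norm so the vector has length 1.
    norm = math.sqrt(sum(w * w for w in prelim.values()))
    return {t: w / norm for t, w in prelim.items()}


def document_term_weights(terms, idf_by_term):
    """Final document term weight: normalized weight multiplied by the idf."""
    weights = document_weights(terms)
    # Terms missing from idf_by_term are treated as having idf 0 (assumption).
    return {t: w * idf_by_term.get(t, 0.0) for t, w in weights.items()}
```

For example, `document_weights(["a", "a", "b"])` gives preliminary weights of 1/3 for `a` and 1/6 for `b`, which normalize to a vector whose squared entries sum to 1.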

For simple queries (no booleans), the weight of a document is computed by multiplying the query term weight by the document term weight for each term in the query and summing the results. This is often referred to as the vector product (hence the name of the model) or scalar product.
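
The scoring of a simple query can be sketched as follows, assuming each query term weight is multiplied by the matching document term weight (the reading that yields a scalar product). The function name `score` and the dictionary `doc_term_weights` (term to final document term weight) are hypothetical; terms absent from the document are assumed to contribute 0.

```python
def score(query_terms, doc_term_weights, query_weight=1.0):
    """Scalar product of the query vector and one document vector.

    query_weight is the query term weight, which is currently always 1.
    """
    return sum(query_weight * doc_term_weights.get(term, 0.0)
               for term in query_terms)
```

With a constant query term weight of 1, the score reduces to the sum of the document term weights of the query terms that occur in the document.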

The calculation is slightly different when some of the booleans are used. The `or' operator is simply dropped, so `information or retrieval' yields exactly the same weight as `information retrieval'. Put another way, in terms of the vector product, the `or' operator just sums the weights of its arguments. For the `and' operator, the weights of both arguments are computed and the final weight is the minimum of the two. Similarly, the binary `not' operator returns the minimum of the weight of the left argument and 1 minus the weight of the right argument.
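
The operator rules above can be sketched as a small recursive evaluator. The tuple encoding of queries, the function name `weight`, and the `doc_term_weights` dictionary are all hypothetical; how the actual system parses and represents queries is not described here.

```python
def weight(expr, doc_term_weights):
    """Evaluate a query expression against one document.

    expr is either a term (str) or a tuple (op, left, right) with
    op in {"or", "and", "not"}. This encoding is a hypothetical one.
    """
    if isinstance(expr, str):
        # The query term weight is 1, so a term contributes its document weight.
        return doc_term_weights.get(expr, 0.0)
    op, left, right = expr
    lw = weight(left, doc_term_weights)
    rw = weight(right, doc_term_weights)
    if op == "or":
        return lw + rw            # same as listing both terms
    if op == "and":
        return min(lw, rw)        # minimum of both arguments
    if op == "not":
        return min(lw, 1.0 - rw)  # left argument vs. 1 minus right argument
    raise ValueError(f"unknown operator: {op!r}")
```

A query such as `information and retrieval' would be written `("and", "information", "retrieval")` in this encoding, and operators can be nested by using a tuple as an argument.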



©2008, ProQuest LLC All rights reserved