Data Science for Search?

While I’m still early in my journey with e-commerce search systems, this note is a repository of the search functionality I could comprehend; functionality which, for most of us, is beautifully abstracted away as a black box.
Search and reco go hand in hand, so to draw a line we’ll treat recommendations as the items that pop up separately, either on the home page or under a listing currently being viewed by the user.
Search is user-initiated, while recommendations are system-initiated. Reco has to be Near Real Time (NRT), while search can be a bit relaxed (it’s okay if we don’t show items listed just moments ago).


We can make ‘search smarter’ by targeting multiple avenues, discussed below:

  1. Indexing Indexing items/documents (the entire corpus of items) is commonly achieved with the help of an inverted index, which stores, for each term, the docIDs in which that term appears.

  2. Understanding the Query terms Extract the important parts from the query and pass them to ElasticSearch (StructuredSearch). It’s highly unlikely that the userbase will be mono-lingual, which necessitates an understanding of synonym pairs between languages.

  3. Understanding Query Intent Entity Recognition: recognize famous names/places. I’m not sure how this works; maintaining a pre-defined set is mundane and would need to be updated frequently.
    Username search?
    Some platforms allow users to search for other users based on a unique userID/username, so a small % of queries might not be marketplace searches.
    Users might want to browse a popular store’s page to scout for new items.
    Browsing Intent:
    Category browse or general marketplace browse? If we can associate the query with a particular category (or a bunch of them), items attached to the identified categories are boosted (ranked higher).

  4. Retrieval and Ranking Based on the query, get the most suitable items (retrieval) and rank these items based on relevancy and business needs (a soft exponential decay based on time gives a lower score to older items).

  5. Advanced Search The query is broken down into chunks, which are connected by boolean operators (a sketch of how these could map onto a query DSL follows after this list).
    Exact match: "mango raw" only returns items containing both words; we do not consider partial matches (mango juice, sweet mango are discarded).
    Partial match: "rolex" green watch should return all items having the term ‘rolex’ in the title/description, with green watch being a preference, but not a requirement.
    Negative match: "amul" -milk should return items that match “amul” but not “milk”.
    Wildcard match: run* shoes evaluates to all variants of the word ‘run’ followed by any characters (running, runic, runner, runway).

  6. Identify profanities / problematic words We can keep a dictionary of such words and display zero results to avoid backlash / legal issues.
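To make the advanced-search operators concrete, here’s a minimal sketch of how pre-parsed chunks could map onto an Elasticsearch bool query. The title field and the pre-parsed chunk arguments are my own assumptions for illustration, not a real query parser:

```python
# Toy mapping from advanced-search chunks to an Elasticsearch bool query.
# The "title" field and the pre-parsed arguments are illustrative assumptions.

def build_bool_query(exact=None, preferred=None, negative=None, wildcard=None):
    query = {"bool": {"must": [], "should": [], "must_not": []}}
    if exact:                        # "mango raw": the phrase is mandatory
        query["bool"]["must"].append({"match_phrase": {"title": exact}})
    for term in preferred or []:     # green watch: preferred, not required
        query["bool"]["should"].append({"match": {"title": term}})
    for term in negative or []:      # -milk: hard exclusion
        query["bool"]["must_not"].append({"match": {"title": term}})
    if wildcard:                     # run*: any continuation of the prefix
        query["bool"]["must"].append({"wildcard": {"title": wildcard}})
    return query

# '"rolex" green watch' from the example above:
print(build_bool_query(exact="rolex", preferred=["green", "watch"]))
```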

+ve signals for marketplaces

Weights can be assigned to these signals to calculate a relevancy score for each (query, item) pair.

Click Through Rates of all forms are referred to as Search Success Metrics; these can be tracked via the total number of views/likes/add-to-carts/transactions.
Alternatively, to measure how useful the ranked results are, one can monitor the avg/p50/p90 of the listing index (position) interacted with. This helps us keep track of how good the results are: if users consistently click on items with a high index, it signals poor ranking relevance.

Out of all results shown, how many are actually relevant ==> Precision
Out of all relevant content, how much did we retrieve ==> Recall

Precision: out of all items shown, how many are relevant to the query; P@4 or P@8 (precision over the top 4 or 8 results) is a typical metric.
Recall: out of all relevant items, how many could my algorithm retrieve? This is a bit hard to compute, as getting the raw count of all relevant items is tough.
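In set notation, writing rel for the set of relevant items and ret for the set of retrieved items:

\[Precision = {|rel \cap ret| \over |ret|}, \qquad Recall = {|rel \cap ret| \over |rel|}\]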

Indexing

Pretty simple:
For each word in our corpus, store the docIDs the term shows up in, along with the term frequency in each doc.
This creates a postings list, i.e. the inverted index mentioned earlier.
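A toy sketch of building one; the three listings here are made up:

```python
from collections import Counter, defaultdict

# Toy corpus: docID -> listing text (made up for illustration).
docs = {
    1: "raw mango fresh from the farm",
    2: "mango juice one litre pack",
    3: "sweet mango pickle raw cut",
}

# Postings list: term -> {docID: term frequency in that doc}.
inverted_index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        inverted_index[term][doc_id] = tf

print(inverted_index["mango"])  # {1: 1, 2: 1, 3: 1}
print(inverted_index["raw"])    # {1: 1, 3: 1}
```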

Understanding the Query

Before we try to analyze the query, we need to normalize it: convert everything to the same case (usually lowercase) and then stem/lemmatize to get the root word.

| Stemming | Lemmatization |
| --- | --- |
| Crude process that chops off suffixes | Gets to the morphological root of a word |
| Easy, quick, based on rules | Based on a vocab, computationally heavy |
| personalization ⇒ personalize | personalization ⇒ person |

We split our phrase into tokens and reduce each token to its root word (a minimal sketch below).
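A minimal sketch of this normalization using NLTK’s Porter stemmer (assuming nltk is installed; a lemmatizer such as WordNetLemmatizer would slot in the same way, at a higher cost):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def normalize(query: str) -> list[str]:
    """Lowercase the query, split into tokens, and stem each token."""
    return [stemmer.stem(tok) for tok in query.lower().split()]

print(normalize("Running Shoes for Men"))  # ['run', 'shoe', 'for', 'men']
```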

Query Expansion

Have a learned synonym map across languages and append synonyms to the search query before passing it to ElasticSearch, which will simply perform a union of the posting lists of all the query terms to obtain all items in our index that have any of these terms.
This increases the number of results.
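A toy sketch of the expansion step; the synonym map here is hand-written purely for illustration, whereas a real one would be learned (and include cross-language pairs):

```python
# Hand-written synonym map for illustration only; a production map
# would be learned from logs and translation data.
SYNONYMS = {
    "sofa": ["couch", "settee"],
    "mobile": ["phone", "smartphone"],
}

def expand(tokens: list[str]) -> list[str]:
    """Append known synonyms so the union of posting lists grows."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(SYNONYMS.get(tok, []))
    return expanded

print(expand(["sofa", "cover"]))  # ['sofa', 'cover', 'couch', 'settee']
```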

This also allows the search bar to work with natural language queries, which is seamless compared to a standard keyword search function wherein the user has to fill in values for the attributes he’s looking for.
This is critical for Fashion / Electronics, wherein attributes are specified with the query itself.

Attribute keys are not mentioned, but the values are passed in the query itself.
This increases the number of search results, and therefore the avg impressions per query, as we can further include relevant items that match on attributes.
Alongside, as the retrieved results are more relevant, we should observe an increase in num_items viewed/clicked through.
We pass the query to a QueryUnderstanding service, which performs Entity Recognition and extracts attribute-value pairs for us, which we then pass on to ElasticSearch.
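A crude, dictionary-based sketch of what such a service might return; the attribute vocabularies below are invented, and a real QueryUnderstanding service would use a trained entity-recognition model instead:

```python
# Invented attribute vocabularies, purely for illustration.
ATTRIBUTE_VALUES = {
    "color": {"green", "black", "red"},
    "brand": {"rolex", "amul"},
}

def extract_attributes(tokens: list[str]) -> tuple[dict, list[str]]:
    """Split query tokens into attribute filters and free-text terms."""
    filters, free_text = {}, []
    for tok in tokens:
        for attr, values in ATTRIBUTE_VALUES.items():
            if tok in values:
                filters[attr] = tok
                break
        else:
            free_text.append(tok)
    return filters, free_text

# ({'brand': 'rolex', 'color': 'green'}, ['watch'])
print(extract_attributes(["rolex", "green", "watch"]))
```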

Retrieval and Ranking

We want to get the matching results from our database to present to the user.

Retrieval

Collect (mostly) relevant documents/items; this can be a massive collection.
Boolean retrieval is fast: just perform unions/intersections over the query terms’ posting lists. Unions are usually preferred, to improve the number of results (see the sketch below).
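With the posting lists from the Indexing section, boolean retrieval is plain set algebra; a toy sketch:

```python
# Postings from the toy index earlier: term -> set of docIDs.
postings = {
    "mango": {1, 2, 3},
    "raw": {1, 3},
    "juice": {2},
}

query_terms = ["mango", "raw"]
union = set().union(*(postings.get(t, set()) for t in query_terms))
intersection = set.intersection(*(postings.get(t, set()) for t in query_terms))

print(union)         # {1, 2, 3} -- OR semantics: more results, usually preferred
print(intersection)  # {1, 3}    -- AND semantics: stricter matching
```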

Ranking (can skip this section)

From the huge set of documents collected during the retrieval phase, we want to sort items based on a relevancy / recency score.
These methods ignore the order of words in a document, and are hence referred to as ‘bag of words’ methods.


Term Frequency

The number of occurrences of a term in the document.
Well, not all words in a document are equally important; there are some statistically boring words as well. How do we identify these?

Inverse Document Frequency

Certain words have no discriminating power, they exist in a lot of documents and do not help us in determining relevance.
Here, N is the total number of documents, and df is the number of documents containing the term.

\[IDF= {N \over df }\]

For a large corpus, we usually take the log of this.

tf-idf score

We would like to score documents higher that have the query terms appear more times, while those terms should not appear a lot in the corpus overall.
Words like and, the would appear in all documents and provide us with no relevance signal.

\[\text{tf-idf} = TF \times \log{N \over df}\]
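Putting it together, a toy tf-idf scorer over the postings list from earlier (N = 3 for the made-up three-listing corpus):

```python
import math

N = 3  # total documents in the toy corpus from the Indexing section
# Postings list: term -> {docID: term frequency}.
index = {
    "mango": {1: 1, 2: 1, 3: 1},
    "raw": {1: 1, 3: 1},
}

def tf_idf(term: str, doc_id: int) -> float:
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    idf = math.log(N / len(postings))  # len(postings) is the term's df
    return postings[doc_id] * idf      # TF * log(N / df)

print(tf_idf("mango", 1))  # 0.0   -- appears in every doc, no signal
print(tf_idf("raw", 1))    # ~0.41 -- rarer term, more discriminating
```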

Things I don’t understand well enough

Acknowledgements

  1. Talk by Scott Hajek, PhD guy from New York

  2. Amazing book, have read the first 8 Chapters
    Introduction to Information Retrieval by Manning

  3. Lots of documentation as well, haha

  4. Zhihu, which is the Chinese version of Quora

Last updated: July 10, 2025