How to Build a Fully Automated QA System for a Large-Scale Search Engine

Abstract: “If you cannot measure it, you cannot improve it.” The relevancy of the documents retrieved by a search engine is crucial to end users. Exposing end users to irrelevant documents is very expensive, since those users will turn away; therefore, companies that rely on search services strive to improve their search algorithms. Whenever an existing algorithm is tweaked or a new algorithm is implemented, an assessment is required. Most existing techniques rely on running an A/B test: a portion of the end users is exposed to the new search algorithm, and the Click Through Rate (CTR) of the existing algorithm is compared with that of the new one to measure the quality of each. In this talk I will walk through a process for building a fully automated QA system for search engines, which leverages implicit user feedback. The proposed system has been used successfully to assess CareerBuilder’s search engine. CareerBuilder operates the largest job board in the U.S. and has an extensive and growing global presence, with millions of job postings, more than 60 million actively searchable resumes, over one billion searchable documents, and more than a million searches per hour. We implemented this system using Apache Spark. Spark enables us to derive implicit user feedback from about 19M search logs, then calculate the nDCG for different algorithms in less than 2 hours. We can report the estimated impact of a proposed change within a few hours, instead of running an A/B test and waiting days to learn the impact.
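As a rough illustration of the metric named above, the following is a minimal sketch of nDCG in plain Python (not Spark, and not CareerBuilder's implementation). The function names and the relevance grades are hypothetical; in particular, how implicit feedback is turned into a grade is not described in the abstract, so the lists below are made-up values. The sketch also assumes the ideal ordering is taken from the same result list, whereas a production system would typically use all judged documents for the query.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances: Sequence[float]) -> float:
    """DCG divided by the DCG of the same grades sorted best-first.

    Assumption: the ideal ranking is built from the retrieved list itself.
    Returns 0.0 when no result has a positive grade.
    """
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical grades for the top 4 results of one query under two rankers
# (1.0 = users engaged with the result, 0.0 = they did not).
current_ranker = [0.0, 1.0, 0.0, 1.0]
candidate_ranker = [1.0, 1.0, 0.0, 0.0]

print(f"current:   {ndcg(current_ranker):.3f}")
print(f"candidate: {ndcg(candidate_ranker):.3f}")
```

Averaging such per-query scores over a log of past searches gives one number per algorithm, which is the kind of offline comparison the abstract contrasts with a live A/B test.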

Bio: Khalifeh AlJadda holds a Ph.D. in computer science from the University of Georgia (UGA), with a specialization in machine learning. He has experience implementing large-scale, distributed machine learning algorithms to solve challenging problems in domains ranging from bioinformatics to search and recommendation engines. He is the lead data scientist on the search data science team at CareerBuilder, which is one of the largest job boards in the world. He leads the data science effort to design and implement the backend of CareerBuilder’s language-agnostic semantic search engine, leveraging Apache Spark and the Hadoop ecosystem. Khalifeh is a frequent public speaker on topics related to data science, machine learning, semantic search, and big data analytics.