Text Retrieval and Search Engines

Start Date: 07/05/2020

Course Type: Common Course

Course Link: https://www.coursera.org/learn/text-retrieval

About Course

Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. Text data are unique in that they are usually generated directly by humans rather than by a computer system or sensors, and are thus especially valuable for discovering knowledge about people’s opinions and preferences, in addition to many other kinds of knowledge that we encode in text. This course will cover search engine technologies, which play an important role in any data mining application involving text data for two reasons. First, while the raw data may be large for any particular problem, it is often a relatively small subset of the data that is relevant, and a search engine is an essential tool for quickly discovering a small subset of relevant text data in a large text collection. Second, search engines are needed to help analysts interpret any patterns discovered in the data by allowing them to examine the relevant original text data to make sense of any discovered pattern. You will learn the basic concepts, principles, and major techniques in text retrieval, which is the underlying science of search engines.

Course Syllabus

In this week's lessons, you will learn how the vector space model works in detail, the major heuristics used in designing a retrieval function for ranking documents with respect to a query, and how to implement an information retrieval system (i.e., a search engine), including how to build an inverted index and how to score documents quickly for a query.
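The inverted index and query scoring mentioned above can be illustrated with a small sketch. It is not the course's implementation: the whitespace tokenizer, the function names (`tokenize`, `build_inverted_index`, `score`), the sample documents, and the particular TF-IDF weighting are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(tokenize(text)).items():
            index[term].append((doc_id, tf))
    return index


def score(query, index, num_docs):
    """Term-at-a-time scoring: only postings of the query terms are visited."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, [])
        if not postings:
            continue  # term does not occur in the collection
        # Assumed IDF variant: rarer terms receive a larger weight.
        idf = math.log((num_docs + 1) / len(postings))
        for doc_id, tf in postings:
            scores[doc_id] += tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-document collection.
docs = [
    "text retrieval is the science behind search engines",
    "an inverted index maps each term to the documents containing it",
    "search engines rank documents with a retrieval function",
]
index = build_inverted_index(docs)
ranking = score("retrieval function", index, len(docs))
print(ranking)
```

Because scoring walks only the postings lists of the query terms, documents that share no term with the query are never touched, which is what makes an inverted index fast on large collections. Here the third document ranks first because it matches both query terms, including the rarer one.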


Course Introduction

In this course you will learn the advanced techniques for locating text in a wide variety of formats. Topics include halving-line feed translations, point and click text extraction, color pickers, and automatic highlighting. You will gain hands-on experience with the intricacies of human-written text, including the challenges of deciphering grammatical structures, the subset of standard languages, and the use of multiple files for efficient search. We’ll meet for a few hours each week on a variety of topics. We’ll discuss the important components of good text processing, including search engines and human-readable documents, as well as modern file systems and centralized search engines for access control, efficient search, and document formats. You’ll also learn how to find common problems that surface frequently in the search process, and to take advantage of search engines to find optimal solutions. “Get ready to work in a complex and fast-paced environment.” -John Hart, Founder & Director of Search Engines, Inc. “Know what your target market is, and how to position your product for them.” -Mark Suster, CEO, Recruiting & Select

Course Tag

Information Retrieval (IR), Document Retrieval, Machine Learning, Recommender Systems

Related Wiki Topic

Article Example
Document retrieval Document retrieval is sometimes referred to as, or as a branch of, text retrieval. Text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text databases became decentralized thanks to the personal computer and the CD-ROM. Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines.
Concept search Formalized search engine evaluation has been ongoing for many years. For example, the Text REtrieval Conference (TREC) was started in 1992 to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. Most of today's commercial search engines include technology first developed in TREC.
Full-text search The deficiencies of free text searching have been addressed in two ways: By providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.
Text Retrieval Conference NIST claims that within the first six years of the workshops, the effectiveness of retrieval systems approximately doubled. The conference was also the first to hold large-scale evaluations of non-English documents, speech, video and retrieval across languages. Additionally, the challenges have inspired a large body of publications. Technology first developed in TREC is now included in many of the world's commercial search engines. An independent report by RTI International found that "about one-third of the improvement in web search engines from 1999 to 2009 is attributable to TREC. Those enhancements likely saved up to 3 billion hours of time using web search engines. ... Additionally, the report showed that for every $1 that NIST and its partners invested in TREC, at least $3.35 to $5.07 in benefits were accrued to U.S. information retrieval researchers in both the private sector and academia."
Information retrieval In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. Its aim was to support research in the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.
Full-text search In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).
List of search engines This is a list of search engines, including web search engines, selection-based search engines, metasearch engines, desktop search tools, and web portals and vertical market websites that have a search facility for online databases. For a list of search engine software, see List of enterprise search vendors.
Full-text search In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques became common in online bibliographic databases in the 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems.
Text Retrieval Conference The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or "tracks." It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale "evaluation" of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology.
Document retrieval Internet search engines are classical applications of document retrieval. The vast majority of retrieval systems currently in use range from simple Boolean systems through to systems using statistical or natural language processing techniques.
Text Retrieval Conference The test collections developed at TREC are useful not just for (potentially) helping researchers advance the state of the art, but also for allowing developers of new (commercial) retrieval products to evaluate their effectiveness on standard tests. In the past decade, TREC has created new tests for enterprise e-mail search, genomics search, spam filtering, e-Discovery, and several other retrieval domains.
Federated search Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user.
Information retrieval Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.
Self-defining Text Archive and Retrieval The Self-Defining Text Archive and Retrieval (STAR) File, or simply the STAR File, is a text-based file format for storing structured data.
BRS/Search BRS/Search is a full-text database and information retrieval system. BRS/Search uses a fully inverted indexing system to store, locate, and retrieve unstructured data. It was the search engine that in 1977 powered Bibliographic Retrieval Services (BRS) commercial operations with 20 databases (including the first national commercial availability of MEDLINE); it has changed ownership several times during its development and is currently sold as Livelink ECM Discovery Server by Open Text Corporation.
Information retrieval An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
List of academic databases and search engines As the distinction between a database and a search engine is unclear for these complex document retrieval systems, see:
Outline of search engines Search engine – information retrieval system designed to help find information stored on a computer system. The search results are usually presented as a list, and are commonly called "hits".
Outline of search engines The following outline is provided as an overview of and topical guide to search engines.
Proximity search (text) Web search engines which support proximity search via an explicit proximity operator in their query language include Walhello, Exalead, Yandex, Yahoo!, AltaVista, and Bing.
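
The Federated search entry above describes a single query being distributed to several engines and the returned results being aggregated. A minimal sketch of that flow follows, assuming hypothetical in-memory engines created by an illustrative `make_engine` helper and a simple term-overlap score; real federated systems call remote services and need more careful result merging.

```python
from concurrent.futures import ThreadPoolExecutor


def make_engine(name, collection):
    """Return a hypothetical search function over a small in-memory collection."""
    def search(query):
        terms = query.lower().split()
        hits = []
        for doc in collection:
            words = doc.lower().split()
            matched = sum(1 for term in terms if term in words)
            if matched:
                # Score is the fraction of query terms found in the document.
                hits.append({"source": name, "doc": doc, "score": matched / len(terms)})
        return hits
    return search


def federated_search(query, engines):
    """Send one query to every engine in parallel and merge the result lists."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged = [hit for hits in result_lists for hit in hits]
    # Aggregate into a single list, best score first.
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)


# Two illustrative participants in the federation.
engines = [
    make_engine("library", ["text retrieval handbook", "cooking basics"]),
    make_engine("news", ["search engines and text retrieval", "weather report"]),
]
results = federated_search("text retrieval", engines)
for hit in results:
    print(hit["source"], hit["score"], hit["doc"])
```

Sorting by raw score works here only because both engines use the same scoring function. Scores from independent engines are generally not comparable, so a real aggregator would have to normalize or re-rank the results before presenting them.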