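About the Book
Search Engines: Information Retrieval in Practice (English edition) introduces the key issues in information retrieval (IR) and shows how those issues affect the design and implementation of search engines, reinforcing important concepts with mathematical models. On the important topic of web search engines, the book mainly covers the search techniques that are widely used on the web.
The book is suited to undergraduate and graduate students in computer science or computer engineering, and it also serves professionals as a good introductory text.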
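About the Authors
W. Bruce Croft is a Distinguished Professor of Computer Science at the University of Massachusetts Amherst and an ACM Fellow. He founded the Center for Intelligent Information Retrieval, has published more than 200 papers, and has received numerous awards, including the Gerard Salton Award from ACM SIGIR in 2003.
Donald Metzler holds a PhD from the University of Massachusetts Amherst and is a research scientist in the Search and Computational Advertising group at Yahoo Research in Santa Clara, California.
Trevor Strohman holds a PhD from the University of Massachusetts Amherst and is a software engineer in the search quality group at Google. He developed the Galago search engine and is also a lead developer of the Indri search engine.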
Table of Contents
1 Search Engines and Information Retrieval
1.1 What Is Information Retrieval?
1.2 The Big Issues
1.3 Search Engines
1.4 Search Engineers
2 Architecture of a Search Engine
2.1 What Is an Architecture?
2.2 Basic Building Blocks
2.3 Breaking It Down
2.3.1 Text Acquisition
2.3.2 Text Transformation
2.3.3 Index Creation
2.3.4 User Interaction
2.3.5 Ranking
2.3.6 Evaluation
2.4 How Does It Really Work?
3 Crawls and Feeds
3.1 Deciding What to Search
3.2 Crawling the Web
3.2.1 Retrieving Web Pages
3.2.2 The Web Crawler
3.2.3 Freshness
3.2.4 Focused Crawling
3.2.5 Deep Web
3.2.6 Sitemaps
3.2.7 Distributed Crawling
3.3 Crawling Documents and Email
3.4 Document Feeds
3.5 The Conversion Problem
3.5.1 Character Encodings
3.6 Storing the Documents
3.6.1 Using a Database System
3.6.2 Random Access
3.6.3 Compression and Large Files
3.6.4 Update
3.6.5 BigTable
3.7 Detecting Duplicates
3.8 Removing Noise
4 Processing Text
4.1 From Words to Terms
4.2 Text Statistics
4.2.1 Vocabulary Growth
4.2.2 Estimating Collection and Result Set Sizes
4.3 Document Parsing
4.3.1 Overview
4.3.2 Tokenizing
4.3.3 Stopping
4.3.4 Stemming
4.3.5 Phrases and N-grams
4.4 Document Structure and Markup
4.5 Link Analysis
4.5.1 Anchor Text
4.5.2 PageRank
4.5.3 Link Quality
4.6 Information Extraction
4.6.1 Hidden Markov Models for Extraction
4.7 Internationalization
5 Ranking with Indexes
5.1 Overview
5.2 Abstract Model of Ranking
5.3 Inverted Indexes
5.3.1 Documents
5.3.2 Counts
5.3.3 Positions
5.3.4 Fields and Extents
5.3.5 Scores
5.3.6 Ordering
5.4 Compression
5.4.1 Entropy and Ambiguity
5.4.2 Delta Encoding
5.4.3 Bit-Aligned Codes
5.4.4 Byte-Aligned Codes
5.4.5 Compression in Practice
5.4.6 Looking Ahead
5.4.7 Skipping and Skip Pointers
5.5 Auxiliary Structures
5.6 Index Construction
5.6.1 Simple Construction
5.6.2 Merging
5.6.3 Parallelism and Distribution
5.6.4 Update
5.7 Query Processing
5.7.1 Document-at-a-time Evaluation
5.7.2 Term-at-a-time Evaluation
5.7.3 Optimization Techniques
5.7.4 Structured Queries
5.7.5 Distributed Evaluation
5.7.6 Caching
6 Queries and Interfaces
6.1 Information Needs and Queries
6.2 Query Transformation and Refinement
6.2.1 Stopping and Stemming Revisited
6.2.2 Spell Checking and Suggestions
6.2.3 Query Expansion
6.2.4 Relevance Feedback
6.2.5 Context and Personalization
6.3 Showing the Results
6.3.1 Result Pages and Snippets
6.3.2 Advertising and Search
6.3.3 Clustering the Results
6.4 Cross-Language Search
7 Retrieval Models
7.1 Overview of Retrieval Models
7.1.1 Boolean Retrieval
7.1.2 The Vector Space Model
7.2 Probabilistic Models
7.2.1 Information Retrieval as Classification
7.2.2 The BM25 Ranking Algorithm
7.3 Ranking Based on Language Models
7.3.1 Query Likelihood Ranking
7.3.2 Relevance Models and Pseudo-Relevance Feedback
7.4 Complex Queries and Combining Evidence
7.4.1 The Inference Network Model
7.4.2 The Galago Query Language
7.5 Web Search
7.6 Machine Learning and Information Retrieval
7.6.1 Learning to Rank
7.6.2 Topic Models and Vocabulary Mismatch
7.7 Application-Based Models
8 Evaluating Search Engines
8.1 Why Evaluate?
8.2 The Evaluation Corpus
8.3 Logging
8.4 Effectiveness Metrics
8.4.1 Recall and Precision
8.4.2 Averaging and Interpolation
8.4.3 Focusing on the Top Documents
8.4.4 Using Preferences
……
9 Classification and Clustering
10 Social Search
11 Beyond Bag of Words
References
Index
Excerpt
After documents have been converted to some common format, they need to be stored in preparation for indexing. The simplest document storage is no document storage, and for some applications this is preferable. In desktop search, for example, the documents are already stored in the file system and do not need to be copied elsewhere. As the crawling process runs, it can send converted documents immediately to an indexing process. By not storing the intermediate converted documents, desktop search systems can save disk space and improve indexing latency.
Most other kinds of search engines need to store documents somewhere. Fast access to the document text is required in order to build document snippets for each search result. These snippets of text give the user an idea of what is inside the retrieved document without actually needing to click on a link.
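As a rough illustration of why snippet building needs the stored text, the sketch below cuts a fixed-width window around the first query term found in a document. The function name make_snippet, the width parameter, and the first-match strategy are assumptions made for this example; the excerpt does not say how a real engine selects snippet text.

```python
import re


def make_snippet(text, query, width=60):
    """Return a window of `text` around the first query-term match.

    Hypothetical helper: a first-match, fixed-width window is only one
    simple strategy, not the method described in the book.
    """
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return text[:width]
    pattern = r"\b(?:" + "|".join(terms) + r")\b"
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        # No query term found: fall back to the start of the document.
        return text[:width]
    start = max(0, match.start() - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

The function reads the full document text on every call, which is the reason the store has to offer fast access to it at query time.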
Even if snippets are not necessary, there are other reasons to keep a copy of each document. Crawling for documents can be expensive in terms of both CPU and network load. It makes sense to keep copies of the documents around instead of trying to fetch them again the next time you want to build an index. Keeping old documents allows you to use HEAD requests in your crawler to save on bandwidth, or to crawl only a subset of the pages in your index.
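One way to picture the HEAD-request saving is the sketch below: the crawler asks only for response headers and compares the Last-Modified value with the one recorded when the stored copy was fetched. The function names head_last_modified and needs_refetch are hypothetical, and relying on Last-Modified alone is an assumption of this example; servers may omit the header, in which case the sketch falls back to refetching.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def head_last_modified(url, timeout=10):
    """Send a HEAD request and return the Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def needs_refetch(stored_last_modified, current_last_modified):
    """Return True when the stored copy should be fetched again.

    Both arguments are HTTP date strings (or None). Missing or
    unparseable dates are treated conservatively as "refetch".
    """
    if stored_last_modified is None or current_last_modified is None:
        return True
    try:
        stored = parsedate_to_datetime(stored_last_modified)
        current = parsedate_to_datetime(current_last_modified)
        return current > stored
    except (TypeError, ValueError):
        return True
```

A crawler would call head_last_modified(url) for a page it already holds and issue a full GET only when needs_refetch returns True, so unchanged pages cost a header exchange rather than a full download.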
Finally, document storage systems can be a starting point for information extraction (described in Chapter 4). The most pervasive kind of information extraction happens in web search engines, which extract anchor text from links to store with target web documents. Other kinds of extraction are possible, such as identifying names of people or places in documents. Notice that if information extraction is used in the search application, the document storage system should support modification of the document data.
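The anchor text extraction mentioned above can be sketched with Python's standard html.parser module: collect the text inside each link and file it under the URL the link points to, so that it can be stored with the target document. The class AnchorTextParser, the function collect_anchor_text, and the in-memory dictionary standing in for a document store are all assumptions of this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorTextParser(HTMLParser):
    """Collect (target_url, anchor_text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None   # target of the link currently open, if any
        self._text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            if text:
                self.anchors.append((self._href, text))
            self._href = None


def collect_anchor_text(pages):
    """Map each link target to the anchor texts that point at it.

    `pages` is a dict of page URL -> HTML string (a stand-in for a
    document store).
    """
    by_target = defaultdict(list)
    for url, html in pages.items():
        parser = AnchorTextParser(url)
        parser.feed(html)
        parser.close()
        for target, text in parser.anchors:
            by_target[target].append(text)
    return dict(by_target)
```

Because the extracted text is attached to the target page rather than the page it was found on, the store must allow a document's data to be modified after it is first written, which is the requirement the excerpt closes with.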
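Preface
To further implement the State Council's "Decision on Vigorously Promoting the Reform and Development of Vocational Education", to strengthen the development of vocational education textbooks, and to meet the demands that deepening teaching reform in vocational colleges currently places on textbooks, China Machine Press convened the "Seminar on Teaching and Textbook Development for Metallic Materials Testing Programs in Vocational Education" in Beijing in August 2008. Vocational colleges currently lack a suitable textbook series for this program, and most teach from self-compiled material or industry certification training texts, which fit the realities of vocational education poorly. At the seminar, leading teachers, experts, and industry representatives from across the country discussed the program's curriculum under the new conditions for vocational education. This book was written to the syllabus requirements set at that meeting and to the training objectives of higher vocational education.
Following the national occupational skill standards, this book distills the core operating skills for each grade of nondestructive testing and presents them through typical, representative examples that are explained step by step. Its novel layout lets readers see the whole procedure of each case at a glance. The book aims to help readers master the core operating skills of each grade quickly and to help them pass the vocational qualification examination. Readers can also apply the examples in actual production work.
Through dozens of hands-on training examples, the book gives a fairly comprehensive account of the procedures and methods of radiographic, ultrasonic, magnetic particle, and penetrant testing. It emphasizes practical testing procedures and adds typical testing procedure cards and application examples, aiming to give nondestructive testing practitioners guidance and help in applying the techniques.
The book has four units. Deng Hongjun wrote Units 1 and 2, and Lu Baoxue wrote Units 3 and 4. Deng Hongjun compiled and edited the whole manuscript, and Yang Jiawu, research-fellow-level senior engineer at Bohai Shipbuilding Heavy Industry Co., Ltd., served as chief reviewer.
In writing the book, the authors consulted relevant textbooks and materials published in China and abroad and received helpful guidance from colleagues at Beijing Puhui Hengda Materials Testing Co., Ltd., Hebei Petroleum Vocational and Technical College, Shaanxi Industrial Vocational and Technical College, Sichuan Engineering Vocational and Technical College, and Baotou Vocational and Technical College, for which we express our sincere thanks.
Because the book was written in haste and the authors' knowledge is limited, errors are inevitable, and we welcome readers' criticism and corrections.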