ESP Advanced Linguistics Guide
FAST Enterprise Search Platform
Version: 5.1
Advanced Linguistics Guide
Document Number: ESP1036, Document Revision: A, May 22nd, 2007
Copyright
Copyright © 1997-2007 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted
by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway,
the United States, and other countries and international treaties. No copyright notices may be removed
from the documentation. No part of this document may be reproduced, modified, copied, stored in a
retrieval system, or transmitted in any form or by any means, electronic or mechanical, including
photocopying and recording, for any purpose other than the purchaser’s use, without the written
permission of FAST. Information in this documentation is subject to change without notice. The software
described in this document is furnished under a license agreement and may be used only in accordance
with the terms of the agreement.
Trademarks
FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor,
FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST
Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective,
GetSmart, NXT, LivePublish, Folio, FAST Unity, and other FAST product names contained herein are
either registered trademarks or trademarks of Fast Search & Transfer ASA in Norway, the United States
and/or other countries. All rights reserved. This documentation is published in the United States and/or
other countries.
Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or
registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
Netscape is a registered trademark of Netscape Communications Corporation in the United States and
other countries.
Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.
Red Hat is a registered trademark of Red Hat, Inc.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business
Machines Corporation in the United States, other countries, or both.
HP and the names of HP products referenced herein are either registered trademarks or service marks,
or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries.
Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States
and/or other countries.
XML Parser is a trademark of The Apache Software Foundation.
All other company, product, and service names are the property of their respective holders and may be
registered trademarks or trademarks in the United States and/or other countries.
Restricted Rights Legend
The documentation and accompanying software are provided to the U.S. government in a transaction
subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of
the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19
Commercial Computer Software-Restricted Rights (June 1987).
Contact Us
Web Site
Please visit us at: http://www.fastsearch.com/
Contacting FAST Corporate Offices
US Headquarters, Boston, MA
FAST
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410
Corporate Headquarters, Oslo, Norway
FAST
Torggata 2-4-6
N-0181 Oslo, Norway
Tel: +47 2301 1200
Fax: +47 2301 1201
Technical Product Support
Technical Product Support is offered to FAST subscribers with active FAST Maintenance and Support
agreements. Please submit tickets and requests by using the Web-based ticketing system at
https://ticket.fast.no or email tech-support@fastsearch.com.
If you do not have access to the FAST ticket system, please contact Customer Relations at
Email: customerservice@fastsearch.com
For additional information such as phone numbers and times, please refer to your FAST Customer
agreement.
Product Software and License
To request FAST licenses or software for customers, contact your FAST Account Manager or email
customerservice@fastsearch.com
To request FAST licenses or software for partners, contact your Channel Sales Representative or email
partners@fastsearch.com
For product evaluations, contact your FAST Sales Representative, FAST Sales Engineer, or Channel
Sales Representative.
Product Training
E-mail: fastuniversity@fastsearch.com
Sales Inquiries
E-mail: sales@fastsearch.com or contact your FAST account manager.
Partner Inquiries
E-mail: partners@fastsearch.com or contact your Channel Sales Representative.
Obtaining Updates in Product Documentation for FAST ESP
Customer and Partner Extranet
You can check the customer and partner extranet for updated versions of product documentation. To
obtain access to the extranet:
Send an email to register-extranet@fastsearch.com and include the following information:
• company
• email
• first and last name
Allow 48 hours during normal business hours for processing.
Note: Only companies with an active Maintenance and Support (M&S) or active Partner agreement
and signed Extranet Access forms are eligible. Standard M&S entitles a company to 3 users, and
Premium M&S to 10 users. Partners: refer to your agreement for the number of eligible users.
Contents
Preface
  Copyright
  Contact Us
  Obtaining Updates in Product Documentation for FAST ESP
Chapter 1: Linguistics in FAST ESP
  Linguistics and Relevancy
  Linguistic Concepts
  General Linguistics Settings in ESP
    Installation Options
    Changing the Default Query String Language
    Linguistics at Runtime
    Handling Language Codes
  Language Specific Solutions
  Language and Encoding Detection
    Supported Languages for Automatic Language Detection
    Language and Encoding Detection and Encoding Conversion
    Language and Encoding Document Processors
    Configuring Automatic Language Detection
Chapter 2: Tokenization and Normalization
  Customizing the Tokenization Process
  Deploying a Changed Tokenization Configuration
  Tokenizers
    Generic Tokenizer
    Add a Generic Tokenizer
    The Default Tokenizer
    Tokenizer Plug-Ins
    Language Specific Tokenizer
    Exceptions
    Tokenization Modes
  Input Normalization
  Character Normalization
    Normalization of Accents and Special Characters
    Variants
  Character Normalization Independently from Tokenization
    CharacterNormalizer Document Processor
    Character Normalizer Query Transformer