1. Lucene Source Code Analysis, Part 1

First, a note on pronunciation: Lucene is pronounced "Loo-seen", as mentioned in Lucene in Action. I should also stress that I am using version 1.9. Don't be surprised that I picked such an old release; I did it to get a better grip on Lucene's core. Anyone who has looked at the latest version knows that, for a beginner, it is not that simple: it involves multithreading and design patterns, and I find it challenging enough as it is. Starting from an old version was inspired by Zhao Jiong, author of A Complete Annotation of the Linux Kernel, who analyzed not the newest Linux kernel but version 0.11. As before, I will explain things by stepping through the code in a debugger. Like me, you would probably grow impatient after staring at the analyzer for half a day, so let me start with a completely meaningless example (examples with a little more meaning are all over the web):

```java
package forfun;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class Test {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("E:\\a", new SimpleAnalyzer(), true);
    }
}
```

IndexWriter is the most central class. Most bloggers analyze every other package first, and by the time only this core package is left, they run out of steam. Look at its parameters: the first is the path where the index is stored; the second is an Analyzer object, which filters and tokenizes the input; if the third is true, the writer deletes everything in the target directory and rebuilds the index from scratch, and if false it appends to the existing index. Run it once and you will find a segments file in the directory you specified. While debugging, ignore the SimpleAnalyzer class for now. Here is IndexWriter's constructor:

```java
public IndexWriter(String path, Analyzer a, boolean create)
        throws IOException {
    this(FSDirectory.getDirectory(path, create), a, create, true);
}
```

Here we meet a new class, FSDirectory:

```java
public static FSDirectory getDirectory(String path, boolean create)
        throws IOException {
    return getDirectory(new File(path), create);
}
```

And the getDirectory overload it calls:

```java
public static FSDirectory getDirectory(File file, boolean create)
        throws IOException {
    file = new File(file.getCanonicalPath());
    FSDirectory dir;
    synchronized (DIRECTORIES) {
        dir = (FSDirectory) DIRECTORIES.get(file);
        if (dir == null) {
            try {
                dir = (FSDirectory) IMPL.newInstance();
            } catch (Exception e) {
                throw new RuntimeException("cannot load FSDirectory class: " + e.toString());
            }
            dir.init(file, create);
            DIRECTORIES.put(file, dir);
        } else if (create) {
            dir.create();
        }
    }
    synchronized (dir) {
        dir.refCount++;
    }
    return dir;
}
```

DIRECTORIES is a Hashtable. As its comment explains, it is a cache of directories that guarantees a one-to-one mapping between a path and a Directory, so that synchronizing on the Directory synchronizes access between readers and writers ("This cache of directories ensures that there is a unique Directory instance per path, so that synchronization on the Directory can be used to synchronize access between readers and writers.").
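The caching idiom in getDirectory — canonicalize the key, look up, create on miss, bump a reference count — is worth isolating. Below is my own generic sketch of that pattern, not Lucene's class; the lowercase key stands in for getCanonicalPath():

```java
import java.util.HashMap;
import java.util.Map;

public class DirectoryCache {
    // One shared instance per canonical key, plus a reference count,
    // mirroring FSDirectory's DIRECTORIES table and refCount field.
    static class Entry {
        final String key;
        int refCount;
        Entry(String key) { this.key = key; }
    }

    private static final Map<String, Entry> CACHE = new HashMap<>();

    public static synchronized Entry get(String rawPath) {
        String key = rawPath.toLowerCase();  // stand-in for getCanonicalPath()
        Entry e = CACHE.get(key);
        if (e == null) {                     // miss: create and cache
            e = new Entry(key);
            CACHE.put(key, e);
        }
        e.refCount++;                        // every caller holds a reference
        return e;
    }

    public static void main(String[] args) {
        Entry a = DirectoryCache.get("E:/A");
        Entry b = DirectoryCache.get("e:/a");  // same canonical key
        System.out.println((a == b) + " refCount=" + a.refCount);  // true refCount=2
    }
}
```

Because the cache hands out one instance per canonical path, every reader and writer that opens the same path ends up synchronizing on the same object.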
Nothing much to explain there: it creates the directory if needed and finally increments refCount. Now back to IndexWriter's private constructor:

```java
private IndexWriter(Directory d, Analyzer a, final boolean create, boolean closeDir)
        throws IOException {
    this.closeDir = closeDir;
    directory = d;
    analyzer = a;

    Lock writeLock = directory.makeLock(IndexWriter.WRITE_LOCK_NAME);
    if (!writeLock.obtain(WRITE_LOCK_TIMEOUT))      // obtain write lock
        throw new IOException("Index locked for write: " + writeLock);
    this.writeLock = writeLock;                     // save it

    synchronized (directory) {                      // in- & inter-process sync
        new Lock.With(directory.makeLock(IndexWriter.COMMIT_LOCK_NAME),
                      COMMIT_LOCK_TIMEOUT) {
            public Object doBody() throws IOException {
                if (create)
                    segmentInfos.write(directory);
                else
                    segmentInfos.read(directory);
                return null;
            }
        }.run();
    }
}
```
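The Lock.With construct above is a template: run() acquires the lock, executes doBody(), and releases the lock no matter what. Here is my own simplified sketch of that idiom, using a plain ReentrantLock instead of Lucene's file-based locks (real Lock.With also times out rather than blocking forever):

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal execute-under-lock template, not Lucene's actual code.
abstract class With<T> {
    private final ReentrantLock lock;
    With(ReentrantLock lock) { this.lock = lock; }

    abstract T doBody();             // work performed while the lock is held

    T run() {
        lock.lock();                 // acquire (Lucene would time out instead)
        try { return doBody(); }
        finally { lock.unlock(); }   // always release, even on exception
    }
}

public class LockWithDemo {
    static final ReentrantLock COMMIT_LOCK = new ReentrantLock();

    public static String demo() {
        // Same shape as the anonymous Lock.With subclass in the constructor.
        return new With<String>(COMMIT_LOCK) {
            String doBody() { return "ran under lock"; }
        }.run();
    }

    public static void main(String[] args) {
        System.out.println(demo());  // ran under lock
    }
}
```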
What interests me here is the segmentInfos.write call inside doBody. Let's step into that function:

```java
public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
        output.writeInt(FORMAT);        // write FORMAT
        output.writeLong(++version);    // every write changes the index
        output.writeInt(counter);       // write counter
        output.writeInt(size());        // write infos
        for (int i = 0; i < size(); i++) {
            SegmentInfo si = info(i);
            output.writeString(si.name);
            output.writeInt(si.docCount);
        }
    } finally {
        output.close();
    }
    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
}
```

The first call creates a file named segments.new; if you are debugging, you can watch it appear. It returns an IndexOutput object, which is used to write the file. Never mind yet what each value is for: the first, FORMAT, is -1; the second, version, is derived from System.currentTimeMillis() so that each index gets a unique version number; counter is 0. SegmentInfos extends Vector, so size() is its element count, but since we have not indexed a single document yet it is empty. The last line renames segments.new to segments. You can open the segments file with UltraEdit or WinHex and look at its contents.
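Before looking at the raw bytes, a quick way to convince yourself of the layout is to decode the header by hand. The sketch below is my own code, not Lucene's: it reads the four fields with a big-endian DataInputStream, which matches how writeInt and writeLong lay out bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class SegmentsHeader {
    // Decode FORMAT (int), version (long), counter (int), size (int)
    // from the first 20 bytes of a "segments" file.
    public static long[] decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            long format  = in.readInt();   // 4 bytes, big-endian
            long version = in.readLong();  // 8 bytes
            long counter = in.readInt();
            long size    = in.readInt();
            return new long[] { format, version, counter, size };
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen on an in-memory array
        }
    }

    public static void main(String[] args) {
        byte[] dump = {
            (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, // FORMAT = -1
            0x00, 0x00, 0x01, 0x22, 0x15, 0x02, 0x07, 0x2A,     // version
            0x00, 0x00, 0x00, 0x00,                             // counter = 0
            0x00, 0x00, 0x00, 0x00                              // size = 0
        };
        long[] fields = decode(dump);
        System.out.println("FORMAT=" + fields[0]
                + " counter=" + fields[2] + " size=" + fields[3]);
        // FORMAT=-1 counter=0 size=0
    }
}
```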
Here is what I see:

FF FF FF FF 00 00 01 22 15 02 07 2A 00 00 00 00 00 00 00 00

writeInt writes four bytes and writeLong writes eight, so you can now match the four values we wrote against the bytes: FF FF FF FF is FORMAT (-1), the next eight bytes are the version, then four zero bytes for counter and four more for size.

2. Lucene Source Code Analysis, Part 2

Last time I mentioned the Analyzer class and said it filters and tokenizes the input. Now let's look at it in detail. In Lucene, an Analyzer is usually composed of a Tokenizer and TokenFilters. First, Tokenizer:

```java
public abstract class Tokenizer extends TokenStream {
    /** The text source for this Tokenizer. */
    protected Reader input;

    /** Construct a tokenizer with null input. */
    protected Tokenizer() {}

    /** Construct a token stream processing the given input. */
    protected Tokenizer(Reader input) {
        this.input = input;
    }

    /** By default, closes the input Reader. */
    public void close() throws IOException {
        input.close();
    }
}
```

It is only an abstract class, and none of its functions deserve much attention, so let's look at its parent, TokenStream:

```java
public abstract class TokenStream {
    /** Returns the next token in the stream, or null at EOS. */
    public abstract Token next() throws IOException;

    /** Releases resources associated with this stream. */
    public void close() throws IOException {}
}
```

So the function worth watching lives in the parent after all: next(), which returns the next token in the stream. The other class I just mentioned, TokenFilter, also extends TokenStream:

```java
public abstract class TokenFilter extends TokenStream {
    /** The source of tokens for this filter. */
    protected TokenStream input;

    /** Call TokenFilter(TokenStream) instead.
     * @deprecated */
    protected TokenFilter() {}

    /** Construct a token stream filtering the given input. */
    protected TokenFilter(TokenStream input) {
        this.input = input;
    }

    /** Close the input TokenStream. */
    public void close() throws IOException {
        input.close();
    }
}
```

Let's write another test class that still does almost nothing meaningful:
```java
package forfun;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.LetterTokenizer;

public class TokenTest {
    public static void main(String[] args) throws Exception {
        File f = new File("E:\\source.txt");
        BufferedReader reader = new BufferedReader(new FileReader(f));
        LetterTokenizer lt = new LetterTokenizer(reader);
        System.out.println(lt.next());
    }
}
```

In source.txt I wrote hello world! — you can of course write something else. I use LetterTokenizer to tokenize it and print the first resulting token. Let's first see how it tokenizes, that is, what next() is actually doing:

```java
public class LetterTokenizer extends CharTokenizer {
    /** Construct a new LetterTokenizer. */
    public LetterTokenizer(Reader in) {
        super(in);
    }

    /** Collects only characters which satisfy
     * {@link Character#isLetter(char)}. */
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}
```

isTokenChar decides whether c is a letter, but the class does not implement next() itself. We find next() in its parent, CharTokenizer:
```java
/** Returns the next token in the stream, or null at EOS. */
public final Token next() throws IOException {
    int length = 0;
    int start = offset;
    while (true) {
        final char c;

        offset++;
        if (bufferIndex >= dataLen) {
            dataLen = input.read(ioBuffer);
            bufferIndex = 0;
        }
        if (dataLen == -1) {
            if (length > 0)
                break;
            else
                return null;
        } else
            c = ioBuffer[bufferIndex++];

        if (isTokenChar(c)) {               // if it's a token char
            if (length == 0)                // start of token
                start = offset - 1;
            buffer[length++] = normalize(c); // buffer it, normalized
            if (length == MAX_WORD_LEN)     // buffer overflow!
                break;
        } else if (length > 0)              // at non-Letter w/ chars
            break;                          // return 'em
    }
    return new Token(new String(buffer, 0, length), start, start + length);
}
```

It looks long, but it is simple — at least simple to read. isTokenChar is the method we just saw in LetterTokenizer; start records where a token begins and length how long it is; on a non-letter character with buffered characters, the loop breaks and the token is returned.
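To see the effect of that loop without any Lucene classes, here is my own stripped-down re-implementation of the same scan, keeping only the isTokenChar/buffer logic and collecting every token at once instead of one per call:

```java
import java.util.ArrayList;
import java.util.List;

public class LetterScan {
    // Scan like CharTokenizer.next(): letters are buffered,
    // any non-letter (or end of input) ends the current token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                buffer.append(c);               // buffer a token char
            } else if (buffer.length() > 0) {   // at non-letter w/ chars: emit
                tokens.add(buffer.toString());
                buffer.setLength(0);
            }
        }
        if (buffer.length() > 0)                // EOS with a pending token
            tokens.add(buffer.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world!"));  // [hello, world]
    }
}
```

Run it on the same input as the test class above and the first element matches what lt.next() returns.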
The new class here is Token, whose constructor takes the token string, a start position and an end position. A look at Token's source:

```java
String termText;      // the text of the term
int startOffset;      // start in source text
int endOffset;        // end in source text
String type = "word"; // lexical type

private int positionIncrement = 1;

/** Constructs a Token with the given term text, and start & end offsets.
    The type defaults to "word." */
public Token(String text, int start, int end) {
    termText = text;
    startOffset = start;
    endOffset = end;
}

/** Constructs a Token with the given text, start and end offsets, & type. */
public Token(String text, int start, int end, String typ) {
    termText = text;
    startOffset = start;
    endOffset = end;
    type = typ;
}
```

Match these constructors against the one we just called and the meaning of the three member variables is clear. For type and positionIncrement, let me quote someone else: type mainly marks the encoding and language of the text — "single" for a single ASCII character, "double" for a non-ASCII character, and "word" as the default, undifferentiated type — while positionIncrement is the position increment, used for cases such as pinyin annotations, which sit directly above the word they annotate and so share its position.

3. Lucene Source Code Analysis, Part 3

Among the TokenFilters, let's look first at the simplest, LowerCaseFilter, whose next function is:
```java
public final Token next() throws IOException {
    Token t = input.next();

    if (t == null)
        return null;

    t.termText = t.termText.toLowerCase();

    return t;
}
```

Nothing exciting: it lowercases the string inside the Token. If you want something more interesting, look at PorterStemFilter; the stemming approach is also covered in the Cambridge book Introduction to Information Retrieval, page 34. Next, a slightly more meaningful TokenFilter, StopFilter. Look at these functions:

```java
public static final Set makeStopSet(String[] stopWords) {
    return makeStopSet(stopWords, false);
}

public static final Set makeStopSet(String[] stopWords, boolean ignoreCase) {
    HashSet stopTable = new HashSet(stopWords.length);
    for (int i = 0; i < stopWords.length; i++)
        stopTable.add(ignoreCase ? stopWords[i].toLowerCase() : stopWords[i]);
    return stopTable;
}

public final Token next() throws IOException {
    // return the first non-stop word found
    for (Token token = input.next(); token != null; token = input.next()) {
        String termText = ignoreCase ? token.termText.toLowerCase() : token.termText;
        if (!stopWords.contains(termText))
            return token;
    }
    // reached EOS -- return null
    return null;
}
```

makeStopSet puts every word to be filtered into stopTable (though I am not sure why this particular set type was chosen), and in next() every string found in the stop table is skipped. Now for a simple Analyzer — StopAnalyzer's tokenStream function:

```java
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new StopFilter(new LowerCaseTokenizer(reader), stopWords);
}
```

Remember the sentence "in Lucene, an Analyzer is usually composed of a Tokenizer and TokenFilters"? Here is the proof: the characters coming in from reader are first tokenized, then filtered.
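The same tokenize-then-filter composition can be sketched without Lucene at all. Everything below, names included, is my own toy version of the pattern, with each pipeline stage standing in for one analysis class:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ToyAnalyzer {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // "Tokenizer": split on non-letter runs.
    // "TokenFilters": lowercase, then drop stop words.
    public static List<String> analyze(String text) {
        return Arrays.stream(text.split("[^A-Za-z]+"))
                .filter(t -> !t.isEmpty())
                .map(String::toLowerCase)             // LowerCaseFilter stage
                .filter(t -> !STOP_WORDS.contains(t)) // StopFilter stage
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Computer Programming"));
        // [art, computer, programming]
    }
}
```

Swapping a stage in the stream is exactly what swapping a TokenFilter in the chain does: the tokenizer produces a stream, and each filter wraps the stream before it.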
And tokenStream is of course the function that gets called when analysis happens.

4. Lucene Source Code Analysis, Part 4

Let's write an example with slightly more meaning and add "hello world" to the index:

```java
package forfun;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FieldTest {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("E:\\a", new SimpleAnalyzer(), true);
        writer.setUseCompoundFile(false);

        Document doc = new Document();
        Field name = new Field("TheField", "hello world",
                Field.Store.YES, Field.Index.TOKENIZED);
        doc.add(name);
        writer.addDocument(doc);
        writer.close();
    }
}
```

I don't feel like dwelling on Document and Field — the meaning of the parameters is easy to look up — so let's go straight to the most important function, addDocument in IndexWriter:
```java
public void addDocument(Document doc, Analyzer analyzer) throws IOException {
    DocumentWriter dw = new DocumentWriter(ramDirectory, analyzer, this);
    dw.setInfoStream(infoStream);
    String segmentName = newSegmentName();
    dw.addDocument(segmentName, doc);
    synchronized (this) {
        segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
        maybeMergeSegments();
    }
}
```

We see that it delegates the actual work to DocumentWriter via dw.addDocument; here the new segment name is "_0". Let's look at that addDocument:

```java
final void addDocument(String segment, Document doc) throws IOException {
    // write field names
    fieldInfos = new FieldInfos();
    fieldInfos.add(doc);
    fieldInfos.write(directory, segment + ".fnm");

    // write field values
    FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
    try {
        fieldsWriter.addDocument(doc);
    } finally {
        fieldsWriter.close();
    }

    // invert doc into postingTable
    postingTable.clear();                         // clear postingTable
    fieldLengths = new int[fieldInfos.size()];    // init fieldLengths
    fieldPositions = new int[fieldInfos.size()];  // init fieldPositions
    fieldOffsets = new int[fieldInfos.size()];    // init fieldOffsets

    fieldBoosts = new float[fieldInfos.size()];   // init fieldBoosts
    Arrays.fill(fieldBoosts, doc.getBoost());

    invertDocument(doc);

    // sort postingTable into an array
    Posting[] postings = sortPostingTable();

    // write postings
    writePostings(postings, segment);

    // write norms of indexed fields
    writeNorms(segment);
}
```
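Before reading invertDocument, it helps to picture what a posting table is. The sketch below is my own illustration, not Lucene's data structure: it maps each term to the positions where it occurs, and a TreeMap keeps the terms sorted, which is essentially what sortPostingTable prepares for writePostings:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PostingSketch {
    // term -> sorted-by-term list of positions at which the term occurs.
    public static TreeMap<String, List<Integer>> invert(String[] tokens) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    public static void main(String[] args) {
        System.out.println(invert(new String[] {"hello", "world", "hello"}));
        // {hello=[0, 2], world=[1]}
    }
}
```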
A new class appears: FieldInfos. Its name suggests it stores information about Fields. Here is its add function:

```java
public void add(Document doc) {
    Enumeration fields = doc.fields();
    while (fields.hasMoreElements()) {
        Field field = (Field) fields.nextElement();
        add(field.name(), field.isIndexed(), field.isTermVectorStored(),
            field.isStorePositionWithTermVector(),
            field.isStoreOffsetWithTermVector(), field.getOmitNorms());
    }
}
```

It does indeed record Field information. Dig deeper:

```java
public void add(String name, boolean isIndexed, boolean storeTermVector,
                boolean storePositionWithTermVector,
                boolean storeOffsetWithTermVector, boolean omitNorms) {
    FieldInfo fi = fieldInfo(name);
    if (fi == null) {
        addInternal(name, isIndexed, storeTermVector,
                    storePositionWithTermVector, storeOffsetWithTermVector,
                    omitNorms);
    } else {
        if (fi.isIndexed != isIndexed) {
            fi.isIndexed = true;            // once indexed, always index
        }
        if (fi.storeTermVector != storeTermVector) {
            fi.storeTermVector = true;      // once vector, always vector
        }
        if (fi.storePositionWithTermVector != storePositionWithTermVector) {
            fi.storePositionWithTermVector = true;
        }
        if (fi.storeOffsetWithTermVector != storeOffsetWithTermVector) {
            fi.storeOffsetWithTermVector = true;
        }
        if (fi.omitNorms != omitNorms) {
            fi.omitNorms = false;           // once norms are stored, always store
        }
    }
}
```

And into addInternal:

```java
private void addInternal(String name, boolean isIndexed,
                         boolean storeTermVector, boolean storePositionWithTermVector,
                         boolean storeOffsetWithTermVector, boolean omitNorms) {
    FieldInfo fi = new FieldInfo(name, isIndexed, byNumber.size(),
                                 storeTermVector, storePositionWithTermVector,
                                 storeOffsetWithTermVector, omitNorms);
    byNumber.add(fi);
    byName.put(name, fi);
}
```
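Note that the merge rules above are "sticky" in one direction: isIndexed, storeTermVector and the two with-term-vector flags can only flip to true, while omitNorms can only flip to false. A tiny sketch of the same semantics, using my own arrays rather than Lucene's types:

```java
public class StickyFlags {
    // Merge a field's flags across documents the way FieldInfos.add does:
    // boolean OR for the "once on, always on" flags, and
    // boolean AND for omitNorms ("once norms are stored, always store").
    public static boolean[] merge(boolean[] existing, boolean[] incoming) {
        return new boolean[] {
            existing[0] | incoming[0],  // isIndexed
            existing[1] | incoming[1],  // storeTermVector
            existing[2] & incoming[2]   // omitNorms
        };
    }

    public static void main(String[] args) {
        boolean[] merged = merge(new boolean[] {false, true, true},
                                 new boolean[] {true, false, false});
        System.out.println(merged[0] + " " + merged[1] + " " + merged[2]);
        // true true false
    }
}
```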
Back to the line after fieldInfos.add(doc):

```java
public void write(Directory d, String name) throws IOException {
    IndexOutput output = d.createOutput(name);
    try {
        write(output);
    } finally {
        output.close();
    }
}
```

We have seen this shape before, but because the Directory here is a RAMDirectory, nothing shows up as a file on disk.

```java
public void write(IndexOutput output) throws IOException {
    output.writeVInt(size());
    for (int i = 0; i < size(); i++) {
        FieldInfo fi = fieldInfo(i);
        byte bits = 0x0;
        if (fi.isIndexed) bits |= IS_INDEXED;
        if (fi.storeTermVector) bits |= STORE_TERMVECTOR;
        if (fi.storePositionWithTermVector) bits |= STORE_POSITIONS_WITH_TERMVECTOR;
        if (fi.storeOffsetWithTermVector) bits |= STORE_OFFSET_WITH_TERMVECTOR;
        if (fi.omitNorms) bits |= OMIT_NORMS;
        output.writeString(fi.name);
        output.writeByte(bits);
    }
}
```

So "_0.fnm" stores each Field's name together with its settings: first the number of Fields, then, for each Field, its name and a flag byte. Here is the content of "_0.fnm":

01 08 54 68 65 46 69 65 6C 64 01

It may seem odd that size(), an int, is written in a single byte here. Unlike the writeInt we saw last time, this uses writeVInt:

```java
public void writeInt(int i) throws IOException {
    writeByte((byte) (i >> 24));
    writeByte((byte) (i >> 16));
    writeByte((byte) (i >> 8));
    writeByte((byte) i);
}

/** Writes an int in a variable-length format. Writes between one and
 * five bytes. Smaller values take fewer bytes. Negative numbers are not
 * supported. */
public void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
        writeByte((byte) ((i & 0x7f) | 0x80));
        i >>>= 7;
    }
    writeByte((byte) i);
}
```