Tuesday, September 8, 2009

[fw-mvc] Zend_Search_Lucene and PDF files

Hello,

I am implementing Lucene and need to index my PDF files. 

I have found several solutions, but they all require a non-PHP component such as XPDF.  I need this to be cross-platform, so those are generally out.

I also started looking for ways to get inside Zend_Pdf to reach the elements of each page, so far without success.  I was hoping that I could iterate the pages in a PDF (done), get a list of the elements on each page (?) and then grab the text, perhaps from the Zend_Pdf_Element_String objects I was able to find in there.  Since I am not going to display the matching context in my search results, the location of the text on the page does not matter much to me.
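To make the idea concrete, here is a minimal sketch of the text-grabbing step in plain PHP. It is not a Zend_Pdf API: `extractTextFromContentStream()`, `decodePdfEscape()` and `PDF_LITERAL` are hypothetical names, and the code assumes you already have a page's content stream as an uncompressed string with a simple single-byte encoding. In PDF syntax, text is drawn by the `Tj` and `TJ` operators, whose operands are literal strings in parentheses, so the sketch simply collects those strings in order.

```php
<?php
// Hypothetical helpers, not part of Zend_Pdf. They only handle uncompressed
// content streams whose strings use a simple single-byte encoding.

// Regex fragment for a PDF literal string: "(...)" with backslash escapes
// and no nested, unescaped parentheses.
define('PDF_LITERAL', '\((?:\\\\.|[^\\\\()])*\)');

// Turn one backslash escape inside a literal string into its character.
// Simplified: line continuations (backslash + newline) are not handled.
function decodePdfEscape($m)
{
    $map = array('n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C");
    if (preg_match('/^[0-7]+$/', $m[1])) {
        return chr(octdec($m[1]) & 0xFF);            // octal code, e.g. \101
    }
    return isset($map[$m[1]]) ? $map[$m[1]] : $m[1]; // \( \) \\ fall through
}

// Collect the strings shown by the Tj and TJ operators, in stream order.
function extractTextFromContentStream($stream)
{
    $lit = PDF_LITERAL;
    // Either "(string) Tj" or "[(str) -120 (ing)] TJ".
    $showText = '/' . $lit . '\s*Tj|\[(?:' . $lit . '|[^\[\]()])*\]\s*TJ/s';
    preg_match_all($showText, $stream, $ops);

    $pieces = array();
    foreach ($ops[0] as $op) {
        // One operator may carry several literal strings (the TJ array form).
        preg_match_all('/' . $lit . '/s', $op, $strings);
        $piece = '';
        foreach ($strings[0] as $s) {
            $inner = substr($s, 1, -1);              // drop the outer parentheses
            $piece .= preg_replace_callback(
                '/\\\\([0-7]{1,3}|.)/s', 'decodePdfEscape', $inner
            );
        }
        $pieces[] = $piece;
    }
    return implode(' ', $pieces);
}

// Made-up sample stream, for illustration only.
$sample = "BT /F1 12 Tf 72 712 Td (Hello) Tj [(Wor) -20 (ld)] TJ ET";
echo extractTextFromContentStream($sample), "\n";
```

Real-world PDFs usually store content streams Flate-compressed and may use hex strings or custom font encodings, none of which this sketch covers. Since the matching context is not displayed, the extracted text could then go into an indexed but not stored field, for example via `Zend_Search_Lucene_Field::unStored()`.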

I am getting totally bogged down in the source code for the pages and the parsers, at least partly because I am not familiar with the nomenclature of PDF internals  :(

Does anyone have any pointers on how to approach this?  Ideally I'd like to keep it Zend, but I can use other PDF libraries if I need to.

Thanks

Bill
