Hello,
I am implementing Lucene and need to index my PDF files.
I have found several solutions, but they all require some non-PHP component such as XPDF. I need this to be cross-platform, so those are generally out.
I also started digging into Zend_Pdf to get at the elements of each page, with no success yet. I was hoping to iterate over the pages of a PDF (done), get a list of the elements on each page (?), and then grab the text from, perhaps, the Zend_Pdf_Element_String class I found in there. Since I am not going to display the context in my search results, the location of the text on the page does not matter much to me.
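To make the idea concrete, here is a rough sketch of the kind of pure-PHP extraction I have in mind. It does not use Zend_Pdf at all; it just scans the raw file bytes. The function name extractPdfText is made up for illustration, and the sketch assumes that content streams are either uncompressed or FlateDecode-compressed and that text is written as literal strings inside BT ... ET blocks. It will miss hex strings, custom font encodings, and object streams, so treat it as a starting point only.

```php
<?php
// Hypothetical helper (not part of Zend_Pdf): crude text extraction from raw PDF bytes.
// Assumptions: streams are uncompressed or FlateDecode (zlib) compressed, and text
// is shown with literal strings, e.g. "(Hello) Tj" or "[(Hel) -20 (lo)] TJ".
function extractPdfText($pdfBytes)
{
    $text = '';

    // Grab the body of every stream object in the file.
    if (!preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdfBytes, $streams)) {
        return $text;
    }

    foreach ($streams[1] as $raw) {
        // Try to inflate the stream; fall back to the raw bytes if it is not zlib data.
        $data = @gzuncompress($raw);
        if ($data === false) {
            $data = $raw;
        }

        // Only look inside text objects (BT ... ET).
        if (!preg_match_all('/BT(.*?)ET/s', $data, $blocks)) {
            continue;
        }

        foreach ($blocks[1] as $block) {
            // Literal strings: "(...)" with backslash escapes allowed inside.
            preg_match_all('/\((?:\\\\.|[^\\\\()])*\)/s', $block, $strings);
            foreach ($strings[0] as $s) {
                // Drop the surrounding parentheses and resolve the escapes.
                $text .= stripcslashes(substr($s, 1, -1));
            }
            // One line of output per text object.
            $text .= "\n";
        }
    }

    return $text;
}
```

Calling extractPdfText(file_get_contents($path)) would give one blob of text per file, which is all I need since I only want to index the contents, not display them.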
I am getting totally bogged down in the source code for the pages and the parsers, at least partly because I am not familiar with the nomenclature of PDF internals :(
Does anyone have any pointers on how to approach this? Ideally I'd like to keep it Zend, but I can use other PDF libraries if I need to.
Thanks
Bill
Tuesday, September 8, 2009