On Wed, Sep 9, 2009 at 8:00 AM, Matthias W. <Matthias.Wangler@e-projecta.com> wrote:
Hi,
Some time ago I had the same problem, but I also needed support for other
document types (Excel, PowerPoint, ...).
Because of this I built my index with Apache's Java projects: Lucene, PDFBox
(PDF parser/writer) and POI (Office document parser/writer).
I think it wouldn't be much work to parse your PDF docs on the Java side...
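
For a rough idea of what a parser such as PDFBox has to do when it pulls text out of a page, here is a minimal sketch in plain Java (standard library only). It is not PDFBox's implementation and not a usable extractor: it assumes the page's content stream has already been decompressed into a string, treats every literal string "(...)" as an operand of a text-showing operator such as Tj or TJ, and ignores font encodings, hex strings and octal escapes. The names PdfTextSketch and extractShownText are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox, Lucene or Zend_Pdf.
final class PdfTextSketch {

    /**
     * Collects the literal strings "(...)" found in an already-decoded PDF
     * content stream, in the order they appear.
     * Assumption: every literal string is text passed to Tj, TJ, ' or ".
     */
    static List<String> extractShownText(String contentStream) {
        List<String> fragments = new ArrayList<>();
        int n = contentStream.length();
        int i = 0;
        while (i < n) {
            if (contentStream.charAt(i) != '(') {
                i++;                      // skip operators, numbers, names, brackets
                continue;
            }
            i++;                          // step past the opening parenthesis
            int depth = 1;                // balanced inner parentheses are allowed unescaped
            StringBuilder text = new StringBuilder();
            while (i < n && depth > 0) {
                char ch = contentStream.charAt(i);
                if (ch == '\\' && i + 1 < n) {
                    char next = contentStream.charAt(i + 1);
                    switch (next) {
                        case 'n': text.append('\n'); break;
                        case 'r': text.append('\r'); break;
                        case 't': text.append('\t'); break;
                        // Covers \( \) and \\ ; octal escapes are NOT decoded here.
                        default:  text.append(next);
                    }
                    i += 2;
                } else if (ch == '(') {
                    depth++;
                    text.append(ch);
                    i++;
                } else if (ch == ')') {
                    depth--;
                    if (depth > 0) {
                        text.append(ch);  // closing an inner parenthesis, keep it
                    }
                    i++;
                } else {
                    text.append(ch);
                    i++;
                }
            }
            fragments.add(text.toString());
        }
        return fragments;
    }
}
```

Calling PdfTextSketch.extractShownText on a decoded stream returns the text fragments as a list, which could be joined and handed to an indexer. In real files the streams are usually compressed (FlateDecode is common) and fonts can remap character codes, which is the main reason to lean on a full library rather than a hand-rolled scanner like this one.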
--
Bill Chmura-2 wrote:
>
> Hello,
>
> I am implementing Lucene and need to index my PDF files.
>
> I have found several solutions, but they all require some non-PHP
> component such as XPDF, etc... I need this to be cross platform, so
> those are generally out.
>
> I also started looking for ways to get inside Zend_PDF to get at the
> elements of each page with no success yet. I was hoping that I could
> iterate the pages in a PDF (done), get a list of the elements on that
> page (?) and then grab the text from perhaps the Zend_Pdf_Element_String
> I was able to find in there. Since I am not going to be displaying the
> context in my search, the location of the text does not matter to me so
> much.
>
> I am getting totally bogged down in the source code for the pages and
> the parsers, partially at least because I am not familiar with the
> nomenclature of PDF internals :(
>
> Does anyone have any pointers on how to approach this? Ideally I'd like
> to keep it Zend, but I can use other PDF libraries if I need to.
>
> Thanks
>
> Bill
>
>
>
View this message in context: http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
Sent from the Zend MVC mailing list archive at Nabble.com.
--
Shaun J. Farrell
Washington, DC
(202) 713-5241
www.farrelley.com