Zend FrameWork: Re: [fw-mvc] Zend_Search

Wednesday, September 9, 2009

Re: [fw-mvc] Zend_Search_Lucene and PDF files

About a year and a half ago I wrote a two-part post on how to index PDFs with Zend (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/). The framework has come a long way since then, so the post is probably out of date, and I have been thinking about updating it. The current implementation uses XPDF, which at the time was the best tool for converting PDFs to text. I have been looking for other libraries but have had no luck. I'm still looking, so I'll let you know if I find anything.
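For readers who want a picture of the XPDF approach, here is a minimal sketch of such a pipeline. It is not the code from the linked post. It assumes Zend Framework 1.x is on the include_path and that the `pdftotext` binary from XPDF is on the PATH; the function names (`buildPdfToTextCommand`, `extractPdfText`, `indexPdfDirectory`) and all paths are hypothetical.

```php
<?php
/**
 * Build the pdftotext command line for one file.
 * The trailing "-" asks pdftotext to write the text to stdout.
 */
function buildPdfToTextCommand($pdfPath)
{
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

/**
 * Convert a PDF to plain text by shelling out to pdftotext.
 */
function extractPdfText($pdfPath)
{
    $lines  = array();
    $status = 0;
    exec(buildPdfToTextCommand($pdfPath), $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdftotext failed for $pdfPath (exit $status)");
    }
    return implode("\n", $lines);
}

/**
 * Create a fresh Lucene index and add every *.pdf in a directory to it.
 */
function indexPdfDirectory($pdfDir, $indexDir)
{
    // Assumption: Zend Framework 1.x is on the include_path.
    require_once 'Zend/Search/Lucene.php';

    $index = Zend_Search_Lucene::create($indexDir);

    $files = glob(rtrim($pdfDir, '/') . '/*.pdf');
    if ($files === false) {
        $files = array();
    }

    foreach ($files as $pdfPath) {
        $doc = new Zend_Search_Lucene_Document();
        // Keyword: stored and indexed as one term, so a hit can point back to the file.
        $doc->addField(Zend_Search_Lucene_Field::keyword('path', $pdfPath));
        // UnStored: tokenized and indexed, but the text is not kept in the index.
        $doc->addField(
            Zend_Search_Lucene_Field::unStored('contents', extractPdfText($pdfPath), 'UTF-8')
        );
        $index->addDocument($doc);
    }

    $index->commit();
    return $index;
}

// Usage (hypothetical paths):
// $index = indexPdfDirectory('/path/to/pdfs', '/tmp/pdf-index');
// foreach ($index->find('invoice') as $hit) {
//     echo $hit->path, "\n";
// }
```

Storing the text as an UnStored field keeps the index small, which fits the case where search results do not need to display the matching context. The dependency on an external binary is the trade-off: it is what makes this approach awkward for a strictly PHP-only, cross-platform setup.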

On Wed, Sep 9, 2009 at 8:00 AM, Matthias W. <Matthias.Wangler@e-projecta.com> wrote:

Hi,
Some time ago I had the same problem, but I also needed support for other
document types (Excel, PowerPoint, ...).
Because of this I built my index with Apache's Java projects: Lucene, PDFBox
(PDF parser/writer) and POI (Office document parser/writer).

I think it wouldn't be much work to parse your PDF docs Java-side...

Bill Chmura-2 wrote:
>
> Hello,
>
> I am implementing Lucene and need to index my PDF files.
>
> I have found several solutions, but they all require some non-PHP
> component such as XPDF. I need this to be cross-platform, so
> those are generally out.
>
> I also started looking for ways to get inside Zend_PDF to get at the
> elements of each page with no success yet. I was hoping that I could
> iterate the pages in a PDF (done), get a list of the elements on that
> page (?) and then grab the text from perhaps the Zend_Pdf_Element_String
> I was able to find in there. Since I am not going to be displaying the
> context in my search, the location of the text does not matter to me so
> much.
>
> I am getting totally bogged down in the source code for the pages and
> the parsers, partially at least because I am not familiar with the
> nomenclature of PDF internals :(
>
> Does anyone have any pointers on how to approach this? Ideally I'd like
> to keep it Zend, but I can use other PDF libraries if I need to.
>
> Thanks
>
> Bill
>
>
>

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
Sent from the Zend MVC mailing list archive at Nabble.com.

--
Shaun J. Farrell
Washington, DC
(202) 713-5241
www.farrelley.com
