On Wed, Sep 9, 2009 at 8:39 AM, Bill Chmura <Bill@explosivo.com> wrote:
Thanks Shaun and Matthias,
Shaun: I had actually already found your post, and so far it is the most likely route if I cannot get a pure PHP solution working. The server runs OpenBSD, but development is done on OS X, Linux, and Windows, which makes depending on XPDF a problem. But if push comes to shove, that is where I will be heading.
Matthias: It needs to be able to update on the fly, and running Java up there may be a bit dicey... There is also a db component, so some of the metadata comes from my model, and, well, it's looking painful either way as I move ahead. Thanks for the suggestion though!
I was really hoping someone with Zend_Pdf knowledge would see this and yell, "Hey, just grab this array from the PDF object, it's got your strings" :)
Thanks guys!
Shaun Farrell wrote:
About a year and a half ago I wrote a two-part post on how to index PDFs with Zend (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/). The framework has come a long way since then, so it's probably out of date, and I have been thinking about updating the topic. The current implementation uses XPDF, which at the time was the best way to convert PDFs to text. I have been looking for other libraries but have had no luck. I'm still looking, so I'll let you know if I find anything.
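For readers following along, the XPDF route Shaun describes amounts to running an external converter and indexing its output. Below is a minimal sketch of that idea, written in Python purely for illustration (the project in this thread is PHP). It assumes the `pdftotext` tool that ships with XPDF is installed and on the PATH; the function names are hypothetical and are not taken from Shaun's post.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the argument list for converting one PDF to UTF-8 text.

    The trailing "-" asks pdftotext to write the text to standard output
    instead of a file next to the PDF.
    """
    return [binary, "-enc", "UTF-8", str(pdf_path), "-"]


def pdf_to_text(pdf_path):
    """Return the text of a PDF by shelling out to pdftotext.

    Assumption: pdftotext is installed. This is exactly the external,
    non-PHP dependency the original question wants to avoid.
    """
    binary = shutil.which("pdftotext")
    if binary is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be added as an unstored text field of a search document. The catch, as noted above, is that the binary has to be available on every platform involved.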
On Wed, Sep 9, 2009 at 8:00 AM, Matthias W. <Matthias.Wangler@e-projecta.com> wrote:
Hi,
Some time ago I had the same problem, but I needed support for other
documents too (Excel, PowerPoint, ...).
Because of this I created my index with the Apache Java projects Lucene, PDFBox
(PDF parser/writer), and POI (Office document parser/writer).
I think it wouldn't be much work to parse your PDF docs on the Java side...
--
Bill Chmura-2 wrote:
>
> Hello,
>
> I am implementing Lucene and need to index my PDF files.
>
> I have found several solutions, but they all require some non-PHP
> component such as XPDF, etc... I need this to be cross platform, so
> those are generally out.
>
> I also started looking for ways to get inside Zend_Pdf to get at the
> elements of each page with no success yet. I was hoping that I could
> iterate the pages in a PDF (done), get a list of the elements on that
> page (?) and then grab the text from perhaps the Zend_Pdf_Element_String
> I was able to find in there. Since I am not going to be displaying the
> context in my search, the location of the text does not matter to me so
> much.
>
> I am getting totally bogged down in the source code for the pages and
> the parsers, partially at least because I am not familiar with the
> nomenclature of PDF internals :(
>
> Does anyone have any pointers on how to approach this? Ideally I'd like
> to keep it Zend, but I can use other PDF libraries if I need to.
>
> Thanks
>
> Bill
>
>
>
View this message in context: http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
Sent from the Zend MVC mailing list archive at Nabble.com.
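As background to the original question about grabbing the strings from each page: in a PDF, page text lives in content streams, where strings are shown with operators such as `(text) Tj` and `[(te) -20 (xt)] TJ`, and those streams are usually Flate-compressed. The sketch below illustrates that idea in Python (chosen only for illustration; it does not use Zend_Pdf, and all names in it are hypothetical). It assumes simple PDFs with literal strings and standard single-byte encodings. It ignores fonts with custom encodings, hex strings, object streams, and encryption, which is a large part of why a general pure-script solution gets painful.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A literal string operand "( ... )", allowing backslash escapes inside.
LITERAL_RE = re.compile(rb"\((?:\\.|[^\\()])*\)", re.DOTALL)

# Text-showing operators: "(text) Tj" or "[(te) -20 (xt)] TJ".
SHOW_RE = re.compile(
    rb"(\((?:\\.|[^\\()])*\))\s*Tj"
    rb"|\[((?:\\.|[^\]\\])*)\]\s*TJ",
    re.DOTALL,
)


def _unescape(raw):
    """Strip the outer parentheses and undo the \\(, \\) and \\\\ escapes."""
    body = re.sub(rb"\\([()\\])", rb"\1", raw[1:-1])
    return body.decode("latin-1")


def extract_strings(pdf_bytes):
    """Return the strings shown by Tj/TJ operators, in file order."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # Most content streams use FlateDecode (zlib).
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # Assume the stream was stored uncompressed.
        for show in SHOW_RE.finditer(data):
            if show.group(1) is not None:
                found.append(_unescape(show.group(1)))
            else:
                # A TJ array mixes string pieces with kerning numbers.
                parts = LITERAL_RE.findall(show.group(2))
                found.append("".join(_unescape(p) for p in parts))
    return found
```

Since the location of the text does not matter for indexing, joining the returned strings with spaces would give one blob of text per file to feed to the indexer. Streams that are images or fonts simply yield no matches in most cases, though false positives are possible with this kind of pattern matching.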
--
Shaun J. Farrell
Washington, DC
(202) 713-5241
www.farrelley.com