Saturday, December 19, 2009

Re: [fw-mvc] Zend_Search_Lucene and PDF files



Guys,

Better late than never... The PHP PDF solution for indexing (or at least a tool to do it) is up here:

https://launchpad.net/pdf2text

Mind you, it has not gotten a lot of mileage yet, so if you see something busted, let me or the maintainer (Tom) know.

It's basically for getting the data out; it does not do much in the way of formatting.








On 9/11/09 2:01 PM, Shaun Farrell wrote:
Bill,

Are you going to open source that code?


On Fri, Sep 11, 2009 at 1:58 PM, Bill Chmura <Bill@explosivo.com> wrote:

Just to bring closure to this... basically what we ended up doing was writing the PDF code ourselves to grab only the text out of the PDF. The specs for the PDF format are available from Adobe, so it was not that bad in the end. At least it is still all PHP.

Thanks to everyone for the suggestions on this




Bill Chmura wrote:

Thanks Shaun and Matthias,

Shaun: I actually already found your post, and so far it is the most likely scenario if I cannot get a pure PHP solution working. The server is OpenBSD, but development is done on OSX, Linux, and Windows, so XPDF presents a problem. But if push comes to shove, that's where I will be heading.

Matthias: It needs to be able to update on the fly, and running Java up there may be a bit dicey... There is also a db component, so some of the metadata comes from my model, and, well, it's starting to look painful either way as I move ahead. Thanks for the suggestion though!

I was really hoping someone with Zend_PDF knowledge would see this and yell, hey - just grab this array from the PDF object, it's got your strings :)

Thanks guys!






Shaun Farrell wrote:
About a year and a half ago I wrote a 2-part post on how to index PDFs with Zend (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/). The Framework has come a long way since then, so it's probably out of date. I have been thinking about updating the topic. The current implementation uses XPDF, which at the time was the best way to convert PDFs to text. I have been looking for some other libraries but have had no luck. I'm still looking, so I'll let you know if I find anything.
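
The XPDF route mentioned above usually comes down to shelling out to its `pdftotext` command-line tool and indexing whatever comes back. A rough sketch under that assumption, in Python for brevity (the helper names are hypothetical, and `pdftotext` must be installed and on the PATH):

```python
import subprocess


def build_pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the argument list for one pdftotext run."""
    # "-enc UTF-8" sets the output encoding; the trailing "-" asks
    # pdftotext to write the extracted text to stdout instead of a file.
    return [binary, "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path, binary="pdftotext"):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The dependency on an external binary is the cross-platform concern raised elsewhere in this thread: the tool has to be available on every machine that builds the index.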

On Wed, Sep 9, 2009 at 8:00 AM, Matthias W. <Matthias.Wangler@e-projecta.com> wrote:

Hi,
Some time ago I had the same problem, but I needed support for other
documents too (Excel, PowerPoint, ...).
Because of this I created my index with the Apache Java projects Lucene, PDFBox
(PDF parser/writer), and POI (Office document parser/writer).

I think it wouldn't be much work to parse your PDF docs Java-side...


Bill Chmura-2 wrote:
>
> Hello,
>
> I am implementing Lucene and need to index my PDF files.
>
> I have found several solutions, but they all require some non-PHP
> component such as XPDF, etc...  I need this to be cross platform, so
> those are generally out.
>
> I also started looking for ways to get inside Zend_PDF to get at the
> elements of each page with no success yet.  I was hoping that I could
> iterate the pages in a PDF (done), get a list of the elements on that
> page (?) and then grab the text from perhaps the Zend_Pdf_Element_String
> I was able to find in there.  Since I am not going to be displaying the
> context in my search, the location of the text does not matter to me so
> much.
>
> I am getting totally bogged down in the source code for the pages and
> the parsers, partially at least because I am not familiar with the
> nomenclature of PDF internals  :(
>
> Does anyone have any pointers on how to approach this?  Ideally I'd like
> to keep it Zend, but I can use other PDF libraries if I need to.
>
> Thanks
>
> Bill
>
>
>

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
Sent from the Zend MVC mailing list archive at Nabble.com.




--
Shaun J. Farrell
Washington, DC
(202) 713-5241
www.farrelley.com





