Zend FrameWork: Re: [fw-mvc] Zend_Search

Wednesday, September 9, 2009

Re: [fw-mvc] Zend_Search_Lucene and PDF files

What about writing a Java web service?
With Apache XML-RPC it's really easy to set up a web service.

The web service could expose PDFBox functionality to your PHP application...
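
As a rough illustration of that suggestion, here is a minimal sketch of a Java service that a PHP application could call to get text out of a PDF. It is not from the original thread and makes several assumptions: it uses plain HTTP via the JDK's built-in com.sun.net.httpserver instead of Apache XML-RPC, so that it runs without external libraries; the TextExtractor interface, the PdfTextService class, and the /extract path are hypothetical names; and the extractor is a stub that only reports the byte count, standing in for where a PDFBox-based implementation would go.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical interface: a real implementation would delegate to PDFBox.
interface TextExtractor {
    String extract(byte[] pdfBytes) throws IOException;
}

class PdfTextService {
    // Starts a local HTTP endpoint at /extract that answers a POSTed PDF with plain text.
    static HttpServer start(int port, TextExtractor extractor) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/extract", exchange -> {
            try {
                if (!"POST".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                    return;
                }
                byte[] pdf = exchange.getRequestBody().readAllBytes();
                byte[] text = extractor.extract(pdf).getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
                exchange.sendResponseHeaders(200, text.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(text);
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    // Client side of the round trip; the PHP application would make the same POST.
    static String post(String url, byte[] body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stub extractor (assumption): reports the byte count instead of parsing the PDF.
        TextExtractor stub = bytes -> "received " + bytes.length + " bytes";
        HttpServer server = start(0, stub); // port 0 lets the OS pick a free port
        try {
            int port = server.getAddress().getPort();
            System.out.println(post("http://127.0.0.1:" + port + "/extract", new byte[] {1, 2, 3}));
        } finally {
            server.stop(0);
        }
    }
}
```

The PHP side would POST the raw PDF bytes to the same URL and pass the returned text to the indexer. Keeping extraction behind an interface means the stub can be swapped for a real parser without touching the HTTP plumbing.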

Bill Chmura-2 wrote:
>
>
> Thanks Shaun and Matthias,
>
> Shaun: I actually already found your post, and so far it is the most
> likely scenario if I cannot get a pure PHP solution working - The server
> is OpenBSD, but development is done on OSX, Linux, and Windows so it
> presents a problem with the XPDF. But if push comes to shove it's where
> I will be heading.
>
> Matthias: It needs to be able to update on the fly, and running Java up
> there may be a bit dicey... There is also a db component, so some of
> the metadata comes from my model, and, well, it's looking painful
> as I move ahead either way. Thanks for the suggestion though!
>
> I was really hoping someone with Zend_PDF knowledge would see this and
> yell, "hey, just grab this array from the PDF object, it's got your
> strings" :)
>
> Thanks guys!
>
> Shaun Farrell wrote:
>> About a year and a half ago I wrote a two-part post on how to index
>> PDFs with Zend.
>> (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/)
>> The Framework has come a long way since then, so it's probably out of
>> date. I have been thinking about updating the topic. The current
>> implementation uses XPDF, which at the time was the best tool for
>> converting PDFs to text. I have been looking for other libraries but
>> have had no luck. I'm still looking, so I'll let you know if I find
>> anything.
>>
>> On Wed, Sep 9, 2009 at 8:00 AM, Matthias W.
>> <Matthias.Wangler@e-projecta.com> wrote:
>>
>>
>> Hi,
>> some time ago I had the same problem, but I also needed support for
>> other documents (Excel, PowerPoint, ...).
>> Because of this I created my index with Java Apache projects:
>> Lucene, PDFBox (PDF parser/writer) and POI (Office document
>> parser/writer).
>>
>> I think it wouldn't be much work to parse your PDF docs Java-side...
>>
>>
>> Bill Chmura-2 wrote:
>> >
>> > Hello,
>> >
>> > I am implementing Lucene and need to index my PDF files.
>> >
>> > I have found several solutions, but they all require some non-PHP
>> > component such as XPDF, etc... I need this to be cross-platform, so
>> > those are generally out.
>> >
>> > I also started looking for ways to get inside Zend_PDF to get at the
>> > elements of each page, with no success yet. I was hoping that I could
>> > iterate the pages in a PDF (done), get a list of the elements on that
>> > page (?) and then grab the text from perhaps the Zend_Pdf_Element_String
>> > I was able to find in there. Since I am not going to be displaying the
>> > context in my search, the location of the text does not matter to me so
>> > much.
>> >
>> > I am getting totally bogged down in the source code for the pages and
>> > the parsers, partially at least because I am not familiar with the
>> > nomenclature of PDF internals :(
>> >
>> > Does anyone have any pointers on how to approach this? Ideally I'd like
>> > to keep it Zend, but I can use other PDF libraries if I need to.
>> >
>> > Thanks
>> >
>> > Bill
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>>
>> http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
>> Sent from the Zend MVC mailing list archive at Nabble.com.
>>
>>
>>
>>
>> --
>> Shaun J. Farrell
>> Washington, DC
>> (202) 713-5241
>> www.farrelley.com <http://www.farrelley.com>
>
>
>

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25364343.html
Sent from the Zend MVC mailing list archive at Nabble.com.
