Zend FrameWork: Re: [fw-mvc] Zend_Search

Friday, October 30, 2009

Re: [fw-mvc] Zend_Search_Lucene and PDF files

Not yet, I got reassigned to something else. A few more days maybe,
hopefully Monday.

sebdev wrote:
> Hi there,
>
> we are also looking for a PHP-only solution.
>
> Have you put the classes up for download yet?
>
> Thanks,
> Seb.
>
>
> Bill Chmura-2 wrote:
>
>> Hey, I spoke with the guy who wrote it and he is cool with putting it
>> out. He wanted a day or two to include some brief docs.
>>
>> I'll post it then.
>>
>> Following that, we are going to read keywords and titles as well, which it
>> doesn't do now, wrap it as a lucence_PDF class, and give that one out too.
>>
>>
>> Bill Chmura wrote:
>>
>>> I don't see why we wouldn't. Let me clean it up a bit, and I will
>>> post it.
>>>
>>> Nothing terribly complicated, but it could save some time for other
>>> people.
>>>
>>>
>>> Shaun Farrell wrote:
>>>
>>>> Bill,
>>>>
>>>> Are you going to open source that code?
>>>>
>>>>
>>>> On Fri, Sep 11, 2009 at 1:58 PM, Bill Chmura <Bill@explosivo.com
>>>> <mailto:Bill@explosivo.com>> wrote:
>>>>
>>>>
>>>> Just to bring closure to this: what we ended up doing was writing
>>>> the PDF code ourselves to grab only the text out of the PDF. The
>>>> specs for the PDF format are available from Adobe, so it was not
>>>> that bad in the end. At least it is still all PHP.
>>>>
>>>> Thanks to everyone for the suggestions on this.
>>>>
>>>>
>>>>
>>>>
>>>> Bill Chmura wrote:
>>>>
>>>>> Thanks Shaun and Matthias,
>>>>>
>>>>> Shaun: I actually already found your post, and so far it is the
>>>>> most likely scenario if I cannot get a pure PHP solution working.
>>>>> The server is OpenBSD, but development is done on OS X, Linux,
>>>>> and Windows, so XPDF presents a problem. But if push comes to
>>>>> shove, that's where I will be heading.
>>>>>
>>>>> Matthias: It needs to be able to update on the fly, and running
>>>>> Java up there may be a bit dicey. There is also a db component,
>>>>> so some of the metadata comes from my model, and it is looking
>>>>> painful either way as I move ahead. Thanks for the suggestion
>>>>> though!
>>>>>
>>>>> I was really hoping someone with Zend_PDF knowledge would see
>>>>> this and yell, "hey, just grab this array from the PDF object,
>>>>> it's got your strings" :)
>>>>>
>>>>> Thanks guys!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Shaun Farrell wrote:
>>>>>
>>>>>> About a year and a half ago I wrote a two-part post on how to
>>>>>> index PDFs with Zend
>>>>>> (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/).
>>>>>> The Framework has come a long way since then, so it's probably
>>>>>> out of date. I have been thinking about updating the topic.
>>>>>> The current implementation uses XPDF, which at the time was the
>>>>>> best tool for converting PDFs to text. I have been looking for
>>>>>> other libraries but have had no luck. I'm still looking, so I'll
>>>>>> let you know if I find anything.
>>>>>>
>>>>>> On Wed, Sep 9, 2009 at 8:00 AM, Matthias W.
>>>>>> <Matthias.Wangler@e-projecta.com
>>>>>> <mailto:Matthias.Wangler@e-projecta.com>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>> some time ago I had the same problem, but I also needed support
>>>>>> for other documents (Excel, PowerPoint, ...).
>>>>>> Because of this I created my index with Java-based Apache
>>>>>> projects: Lucene, PDFBox (PDF parser/writer) and POI (Office
>>>>>> document parser/writer).
>>>>>>
>>>>>> I think it wouldn't be much work to parse your PDF docs
>>>>>> Java-side...
>>>>>>
>>>>>>
>>>>>> Bill Chmura-2 wrote:
>>>>>> >
>>>>>> > Hello,
>>>>>> >
>>>>>> > I am implementing Lucene and need to index my PDF files.
>>>>>> >
>>>>>> > I have found several solutions, but they all require some non-PHP
>>>>>> > component such as XPDF, etc. I need this to be cross-platform, so
>>>>>> > those are generally out.
>>>>>> >
>>>>>> > I also started looking for ways to get inside Zend_PDF to get at the
>>>>>> > elements of each page, with no success yet. I was hoping that I could
>>>>>> > iterate the pages in a PDF (done), get a list of the elements on that
>>>>>> > page (?) and then grab the text from perhaps the Zend_Pdf_Element_String
>>>>>> > I was able to find in there. Since I am not going to be displaying the
>>>>>> > context in my search, the location of the text does not matter to me so
>>>>>> > much.
>>>>>> >
>>>>>> > I am getting totally bogged down in the source code for the pages and
>>>>>> > the parsers, partially at least because I am not familiar with the
>>>>>> > nomenclature of PDF internals :(
>>>>>> >
>>>>>> > Does anyone have any pointers on how to approach this? Ideally I'd like
>>>>>> > to keep it Zend, but I can use other PDF libraries if I need to.
>>>>>> >
>>>>>> > Thanks
>>>>>> >
>>>>>> > Bill
>>>>>> >
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>>
>>>>>> http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
>>>>>> Sent from the Zend MVC mailing list archive at Nabble.com.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Shaun J. Farrell
>>>>>> Washington, DC
>>>>>> (202) 713-5241
>>>>>> www.farrelley.com <http://www.farrelley.com>
>>>>>>
>>>>
>>>>
>>>>
>>
>>
>
>
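
The PHP classes mentioned in the thread were not posted here, so the following is only a minimal sketch of the general idea Bill describes: pulling the text out of a PDF in pure PHP. It assumes simple single-byte font encodings and literal "(...)" strings shown with the standard Tj, TJ, ' and " operators inside BT/ET blocks, and it handles only uncompressed or FlateDecode streams. The function names pdfExtractText and pdfUnescape are hypothetical and come from neither Zend Framework nor the thread.

```php
<?php
// Hypothetical helpers for illustration only; not the code discussed in the thread.

/**
 * Undo the backslash escapes of a PDF literal string.
 */
function pdfUnescape($s)
{
    $map = array('n' => "\n", 'r' => "\r", 't' => "\t",
                 'b' => "\x08", 'f' => "\x0C", "\n" => '');
    return preg_replace_callback(
        '/\\\\(?:([0-7]{1,3})|(.))/s',
        function ($m) use ($map) {
            if ($m[1] !== '') {
                // Octal character code such as \101
                return chr(octdec($m[1]) & 0xFF);
            }
            // Named escape, line continuation, or an escaped literal character
            return isset($map[$m[2]]) ? $map[$m[2]] : $m[2];
        },
        $s
    );
}

/**
 * Collect the text shown by Tj, TJ, ' and " operators in every stream.
 * Assumption: no hex strings, no nested unescaped parentheses, no
 * encrypted files, and no filters other than FlateDecode.
 */
function pdfExtractText($pdfBytes)
{
    $literal = '\((?:[^\\\\()]|\\\\.)*\)'; // regex source for one "(...)" string
    $text = array();

    // Stream bodies sit between the "stream" and "endstream" keywords.
    preg_match_all('/stream\r?\n(.*?)\r?\n?endstream/s', $pdfBytes, $streams);
    foreach ($streams[1] as $raw) {
        // FlateDecode data is zlib-compressed; otherwise keep the raw bytes.
        $data = @gzuncompress($raw);
        if ($data === false) {
            $data = $raw;
        }
        // Text-showing operators only appear between BT and ET.
        preg_match_all('/\bBT\b(.*?)\bET\b/s', $data, $blocks);
        foreach ($blocks[1] as $block) {
            preg_match_all(
                '/\[([^\]]*)\]\s*TJ|(' . $literal . ')\s*(?:Tj|\'|")/s',
                $block,
                $ops,
                PREG_SET_ORDER
            );
            foreach ($ops as $op) {
                // $op[2] is set for a single string, $op[1] holds a TJ array body.
                $source = isset($op[2]) ? $op[2] : $op[1];
                preg_match_all('/\(((?:[^\\\\()]|\\\\.)*)\)/s', $source, $parts);
                // Pieces of one TJ array belong together, so join without spaces.
                $text[] = implode('', array_map('pdfUnescape', $parts[1]));
            }
        }
    }
    return implode(' ', $text);
}
```

For indexing, the returned string could be passed to something like Zend_Search_Lucene_Field::UnStored('contents', $text) on a Zend_Search_Lucene_Document. Since the location of the text does not matter for search, the sketch ignores positioning operators entirely.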
