We are also looking for a PHP-only solution.
Have you put the classes up for download yet?
Thanks,
Seb.
Bill Chmura-2 wrote:
>
>
> Hey, I spoke with the guy who wrote it and he is cool with putting it
> out - he wanted a day or two to include some brief docs
>
> I'll post it then
>
> Following that we are going to read keywords and titles also, which it
> doesn't do now, and wrap it as a lucence_PDF class and give that one out also
>
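A minimal sketch of what reading titles and keywords could look like in plain PHP. It is not the class discussed above. The function name `readPdfInfo` is hypothetical, and the sketch assumes the trailer holds an `/Info N G R` reference, that the Info dictionary is stored uncompressed, and that its values are simple literal strings (no hex strings, no UTF-16, no XMP metadata).

```php
<?php
// Hypothetical helper: reads selected entries from a PDF's Info dictionary.
// Assumes an uncompressed Info object with literal (parenthesised) strings.
function readPdfInfo(string $pdf, array $keys = ['Title', 'Keywords']): array
{
    // The trailer points at the Info dictionary, e.g. "/Info 5 0 R".
    if (!preg_match('/\/Info\s+(\d+)\s+(\d+)\s+R/', $pdf, $ref)) {
        return [];
    }
    // Locate the referenced object: "5 0 obj ... endobj".
    $objPattern = '/\b' . $ref[1] . '\s+' . $ref[2] . '\s+obj\b(.*?)\bendobj\b/s';
    if (!preg_match($objPattern, $pdf, $obj)) {
        return [];
    }
    $info = [];
    foreach ($keys as $key) {
        // Match "/Key (literal string)", allowing backslash escapes inside.
        $pattern = '/\/' . preg_quote($key, '/') . '\s*\(((?:\\\\.|[^\\\\()])*)\)/s';
        if (preg_match($pattern, $obj[1], $m)) {
            // Drop the backslash from escaped characters such as \( and \\.
            $info[$key] = preg_replace('/\\\\(.)/s', '$1', $m[1]);
        }
    }
    return $info;
}
```

Looking the object up through the trailer, rather than grabbing the first `/Title` in the file, matters because outline (bookmark) entries also carry a `/Title` key.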
>
> Bill Chmura wrote:
>>
>>
>> I don't see why we wouldn't. Let me clean it up a bit, and I will
>> post it.
>>
>> Nothing terribly complicated, but it could save some time for other
>> people.
>>
>>
>> Shaun Farrell wrote:
>>> Bill,
>>>
>>> Are you going to open source that code?
>>>
>>>
>>> On Fri, Sep 11, 2009 at 1:58 PM, Bill Chmura <Bill@explosivo.com
>>> <mailto:Bill@explosivo.com>> wrote:
>>>
>>>
>>> Just to bring closure to this... basically what we ended up doing
>>> was writing the PDF code ourselves to grab only the text out of
>>> the PDF. The specs are available from Adobe for the PDF
>>> format, so it was not that bad in the end. At least it is still
>>> all PHP.
>>>
>>> Thanks to everyone for the suggestions on this
>>>
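The code mentioned above was not posted in this thread, so the following is only a rough sketch of the general idea (pulling the text out of a PDF's content streams in pure PHP), not Bill's implementation. The names `extractPdfText` and `decodePdfLiteral` are hypothetical. The sketch assumes streams are either uncompressed or FlateDecode-compressed, and that text appears as literal strings passed to the `Tj`, `TJ`, `'` or `"` operators. Hex strings, octal escapes, nested parentheses and fonts with custom encodings are not handled.

```php
<?php
// Hypothetical helper: decodes one PDF literal string such as "(Hello)".
function decodePdfLiteral(string $literal): string
{
    $inner = substr($literal, 1, -1); // drop the surrounding parentheses
    return preg_replace_callback('/\\\\(.)/s', function (array $m): string {
        $map = ['n' => "\n", 'r' => "\r", 't' => "\t"];
        return $map[$m[1]] ?? $m[1]; // other escapes: keep the character itself
    }, $inner);
}

// Hypothetical helper: collects the strings shown by text operators.
function extractPdfText(string $pdf): string
{
    $chunks = [];
    // Stream bodies sit between the "stream" and "endstream" keywords.
    // A real parser would use the /Length entry instead; this regex would
    // mishandle compressed data that happens to end in a newline byte.
    preg_match_all('/(?<!end)stream\r?\n(.*?)\r?\n?endstream/s', $pdf, $streams);
    foreach ($streams[1] as $raw) {
        // Try FlateDecode (zlib); fall back to the raw bytes if that fails.
        $data = @gzuncompress($raw);
        if ($data === false) {
            $data = $raw;
        }
        // Text objects are delimited by BT ... ET.
        preg_match_all('/\bBT\b(.*?)\bET\b/s', $data, $blocks);
        foreach ($blocks[1] as $block) {
            // Operand of a text-showing operator: one string or an array.
            preg_match_all(
                '/(\[.*?\]|\((?:\\\\.|[^\\\\()])*\))\s*(?:TJ|Tj|\'|")/s',
                $block,
                $ops
            );
            foreach ($ops[1] as $operand) {
                preg_match_all('/\((?:\\\\.|[^\\\\()])*\)/s', $operand, $literals);
                // Pieces of one TJ array belong together, so join without spaces.
                $chunks[] = implode('', array_map('decodePdfLiteral', $literals[0]));
            }
        }
    }
    return implode(' ', $chunks);
}
```

Something like `extractPdfText(file_get_contents($path))` would then give a string that can be fed to the indexer. Since the position of the text does not matter for indexing, the sketch ignores all layout operators.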
>>>
>>>
>>>
>>> Bill Chmura wrote:
>>>>
>>>> Thanks Shaun and Matthias,
>>>>
>>>> Shaun: I actually already found your post, and so far it is the
>>>> most likely scenario if I cannot get a pure PHP solution working
>>>> - The server is OpenBSD, but development is done on OSX, Linux,
>>>> and Windows so it presents a problem with the XPDF. But if push
>>>> comes to shove it's where I will be heading.
>>>>
>>>> Matthias: It needs to be able to update on the fly, and running
>>>> Java up there may be a bit dicey... There is also a db
>>>> component, so some of the metadata comes from my model, and
>>>> well - it's starting to look painful as I move ahead either way -
>>>> thanks for the suggestion though!
>>>>
>>>> I was really hoping someone with Zend_PDF knowledge would see
>>>> this and yell, hey - just grab this array from the PDF object,
>>>> it's got your strings :)
>>>>
>>>> Thanks guys!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Shaun Farrell wrote:
>>>>> About a year and a half ago I wrote a two-part post on how to index
>>>>> PDFs with Zend.
>>>>>
>>>>> (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/)
>>>>> The Framework has come a long way since then, so it's probably
>>>>> out of date. I have been thinking about updating the topic.
>>>>> The current implementation uses XPDF, which at the time was the
>>>>> best way to convert PDFs to text. I have been looking for some
>>>>> other libraries but have had no luck. I'm still looking, so I'll let
>>>>> you know if I find anything.
>>>>>
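For comparison, the XPDF route boils down to shelling out to the `pdftotext` binary. The sketch below is not taken from the linked post. The function names are hypothetical, and it assumes `pdftotext` is installed and on the PATH, which is exactly the cross-platform dependency the original question wanted to avoid.

```php
<?php
// Hypothetical helper: builds the pdftotext command line.
// A trailing "-" as the output file tells pdftotext to write to stdout.
function buildPdfToTextCommand(string $pdfPath): string
{
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: returns the extracted text, or null if the
// command failed or produced no output.
function pdfToText(string $pdfPath): ?string
{
    $output = shell_exec(buildPdfToTextCommand($pdfPath));
    return is_string($output) ? $output : null;
}
```

The returned text can be added to the index as an unstored field; `escapeshellarg` keeps file names with spaces or quotes from breaking the command.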
>>>>> On Wed, Sep 9, 2009 at 8:00 AM, Matthias W.
>>>>> <Matthias.Wangler@e-projecta.com
>>>>> <mailto:Matthias.Wangler@e-projecta.com>> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>> Some time ago I had the same problem. But I needed support for other
>>>>> documents, too (Excel, PowerPoint, ...).
>>>>> Because of this I created my index with the Java Apache projects: Lucene,
>>>>> PDFBox (PDF parser/writer) and POI (Office document parser/writer).
>>>>>
>>>>> I think it wouldn't be much work to parse your PDF docs Java-side...
>>>>>
>>>>>
>>>>> Bill Chmura-2 wrote:
>>>>> >
>>>>> > Hello,
>>>>> >
>>>>> > I am implementing Lucene and need to index my PDF files.
>>>>> >
>>>>> > I have found several solutions, but they all require some non PHP
>>>>> > component such as XPDF, etc... I need this to be cross platform, so
>>>>> > those are generally out.
>>>>> >
>>>>> > I also started looking for ways to get inside Zend_PDF to get at the
>>>>> > elements of each page with no success yet. I was hoping that I could
>>>>> > iterate the pages in a PDF (done), get a list of the elements on that
>>>>> > page (?) and then grab the text from perhaps the Zend_Pdf_Element_String
>>>>> > I was able to find in there. Since I am not going to be displaying the
>>>>> > context in my search, the location of the text does not matter to me so
>>>>> > much.
>>>>> >
>>>>> > I am getting totally bogged down in the source code for the pages and
>>>>> > the parsers, partially at least because I am not familiar with the
>>>>> > nomenclature of PDF internals :(
>>>>> >
>>>>> > Does anyone have any pointers on how to approach this? Ideally I'd like
>>>>> > to keep it Zend, but I can use other PDF libraries if I need to.
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> > Bill
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>>
>>>>> http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
>>>>> Sent from the Zend MVC mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Shaun J. Farrell
>>>>> Washington, DC
>>>>> (202) 713-5241
>>>>> www.farrelley.com <http://www.farrelley.com>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
--
View this message in context: http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25991729.html
Sent from the Zend MVC mailing list archive at Nabble.com.