java - Does iText support OCR?


I have a question about iText. I'm facing a problem when searching for text in a PDF file.

I can create a plain text file using the getTextFromPage() method, as described in the following code sample:

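    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintWriter;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "d:/b.pdf";
    /** The resulting text file. */
    public static final String RESULT = "d:/result.txt";

    public void parsePdf(String from, String destination) throws IOException {
        PdfReader reader = new PdfReader(PREFACE);
        PrintWriter out = new PrintWriter(new FileOutputStream(RESULT));
        // iText page numbers start at 1
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            out.println(PdfTextExtractor.getTextFromPage(reader, i));
        }
        out.flush();
        out.close();
        reader.close();
    }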

I'm then trying to find a specific string in the resulting text like this:

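    import java.io.FileReader;
    import java.io.IOException;
    import java.io.LineNumberReader;
    import java.util.List;

    public void findWords(String from) {
        try {
            String ligneLue;
            LineNumberReader lnr = new LineNumberReader(new FileReader(RESULT));
            try {
                // Check every extracted line against the list of search terms
                while ((ligneLue = lnr.readLine()) != null) {
                    searchForSVHC(ligneLue, svhcList);
                }
            } finally {
                lnr.close();
            }
        } catch (IOException e) {
            System.out.println(e);
        }
    }

    public void searchForSVHC(String ligne, List<String> list) {
        for (String cas : list) {
            if (ligne.contains(cas)) {
                System.out.print("yes  " + cas);
                break; // stop at the first match on this line
            }
        }
    }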

My problem is that the PDFs I'm parsing consist of scanned images, which means there is no real text, only pixels.

Does iText support optical character recognition (OCR)? And as a follow-up question: is there a way to determine whether a PDF consists of scanned images?

I've done a thorough edit of the question before answering it.

When a PDF consists of scanned images, there is no real text to parse; there are only images with pixels that look like text. You'd need to do OCR to know what is written on such a scanned page, and iText doesn't support OCR.

Regarding the follow-up question: it's hard to find out whether a PDF contains scanned images. The first give-away would be: there's only an image in the page, and there's no text.

However, since you don't know the nature of the images (maybe you have a PDF containing nothing but holiday photos), it's hard to find out whether the PDF is a document full of scanned pages of text (that is: rasterized text).
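The "image but no text" give-away could be sketched roughly as follows. This is only a minimal illustration under assumptions, not something iText provides: the class ScanHeuristic and its methods are hypothetical names, the page text is assumed to come from getTextFromPage() as in the question, and the number of images per page is assumed to have been counted separately. As noted above, a positive result only means "images and no text"; it cannot tell scanned text from holiday photos.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of iText. */
final class ScanHeuristic {

    enum PageKind { TEXT, LIKELY_SCANNED, EMPTY }

    /**
     * Classifies a single page from two facts gathered elsewhere:
     * the text extracted from the page and the number of images on it.
     */
    static PageKind classifyPage(String extractedText, int imageCount) {
        boolean hasText = extractedText != null && !extractedText.trim().isEmpty();
        if (hasText) {
            return PageKind.TEXT;
        }
        // No extractable text: an image-only page is the give-away.
        return imageCount > 0 ? PageKind.LIKELY_SCANNED : PageKind.EMPTY;
    }

    /**
     * Returns true only when the document has at least one page and
     * every page is image-only. Mismatched list sizes yield false.
     */
    static boolean looksLikeScannedDocument(List<String> pageTexts, List<Integer> imageCounts) {
        if (pageTexts.isEmpty() || pageTexts.size() != imageCounts.size()) {
            return false;
        }
        for (int i = 0; i < pageTexts.size(); i++) {
            if (classifyPage(pageTexts.get(i), imageCounts.get(i)) != PageKind.LIKELY_SCANNED) {
                return false;
            }
        }
        return true;
    }
}
```

Requiring every page to be image-only is a deliberately strict choice for this sketch; a document that mixes scanned pages with real text pages would need a per-page decision instead.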

