
subject: PDF- Imaging Errors in OCR


PDF- Imaging Errors in OCR

Many well-known Optical Character Recognition (OCR) systems use a lexicon, a list containing thousands of words, to improve accuracy. It is vital for resolving ambiguities. For instance, when a word comes out as tart instead of start, the system can check its candidate readings against the lexicon to find which is a valid word and which is not, so that the right one is used. The chances that the system then opts for the wrong term are slim.

The scanning procedure itself also introduces defects. The portion of the page under the scanner head is illuminated by a fluorescent light. Each detector collects the light reflected from a tiny area of the page that is focused onto it. The point-spread function of the scanner describes how a sensor element's light sensitivity varies with distance from the center of this spot. If the spot is large, more of the digital output is stimulated by light originating nearer to the centers of the areas seen by adjacent sensors. The small amount of light collected in a short interval is transformed into an electric signal.

In bi-level scanning, these signals are thresholded, which converts the page into a pattern of 0s and 1s known as a bi-level image. A signal larger than the threshold is converted to a 0, which is white, and a smaller signal is converted to a 1, which is black. At a higher threshold, less of the reflected light is strong enough to yield a 0; at a lower threshold, more of it is. The point-spread function is a property of the scanner optics that the operator cannot change.
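The thresholding rule described above (a signal larger than the threshold becomes 0 for white, a smaller one becomes 1 for black) can be sketched in a few lines of Python. This is only an illustration: the function name to_bilevel and the sample readings are made up for the example and are not taken from any scanner software.

```python
def to_bilevel(signals, threshold):
    """Map sensor signals to bi-level pixels: 0 = white, 1 = black."""
    # A signal larger than the threshold counts as white (0);
    # anything else counts as black (1).
    return [0 if s > threshold else 1 for s in signals]


# Hypothetical reflected-light readings along one scan line (arbitrary units):
# bright paper, a dark stroke, then bright paper again.
scan_line = [0.95, 0.90, 0.40, 0.10, 0.12, 0.45, 0.92, 0.96]

pixels = to_bilevel(scan_line, threshold=0.5)
print(pixels)  # [0, 0, 1, 1, 1, 1, 0, 0]
```

Raising or lowering the threshold argument changes how many of the in-between readings (0.40 and 0.45 here) end up black.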

Users can, however, adjust the threshold through the 'brightness control' function in scanning software. The choice of threshold affects OCR accuracy: a lower threshold gives rise to broken characters and a higher threshold creates touching characters. The sensor signal can also fluctuate because of thermal noise, and sensitivity can differ among sensor elements because of imperfections in manufacturing. As a result, identical characters appearing on different parts of a page can produce different bi-level images. Paper is also not a high-contrast medium: the light reflected from white paper is about twenty times the light reflected from a solid dark area. A converter such as all to pdf, which handles scanned images and scanned PDFs, is meant to avoid such imaging errors.
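To make the effect of the threshold choice concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the reflectance values (a 20:1 ratio), the 3-tap blur kernel standing in for the point-spread function, the gain factor standing in for sensitivity differences between sensor elements, and the names scan and render. It is not a model of any real scanner.

```python
PAPER, INK = 1.0, 0.05  # assumed reflectances, roughly a 20:1 ratio


def scan(profile, gain=1.0):
    """Blur an ideal reflectance profile with an assumed 3-tap point-spread
    kernel, then scale by a per-sensor gain (a stand-in for sensitivity
    differences between sensor elements)."""
    kernel = (0.25, 0.5, 0.25)
    last = len(profile) - 1
    blurred = []
    for i in range(len(profile)):
        left = profile[max(i - 1, 0)]      # replicate the edge sample
        right = profile[min(i + 1, last)]
        value = kernel[0] * left + kernel[1] * profile[i] + kernel[2] * right
        blurred.append(gain * value)
    return blurred


def render(signals, threshold):
    """Threshold to bi-level and draw black pixels as '#', white as '.'."""
    return "".join("." if s > threshold else "#" for s in signals)


# Two thick strokes with a narrow gap between them, then one thin stroke.
P, K = PAPER, INK
ideal = [P, P, K, K, P, P, K, K, P, P, P, K, P, P]

signals = scan(ideal)
for t in (0.4, 0.6, 0.8):
    print(t, render(signals, t))
print("gain 1.2", render(scan(ideal, gain=1.2), 0.6))
```

With these made-up numbers, the low threshold (0.4) loses the thin stroke, giving ..##..##......, the middle threshold (0.6) keeps all three strokes separate, giving ..##..##...#.., and the high threshold (0.8) fills the gap so the thick strokes touch, giving .########.###. on the same scan line. The last line shows the same strokes read by a slightly more sensitive sensor (gain 1.2) at threshold 0.6: the thin stroke disappears again, which is how identical characters can end up as different bi-level images.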



