The Art of Crafting a Computer Vision Model

Last Updated on April 24, 2021 by Team Wobot

In 1950, Alan Turing published “Computing Machinery and Intelligence,” and Artificial Intelligence was born. All the jargon set aside, the bottom line is that we want machines to think, learn, and be smarter. Now that we want machines to be smarter, do we want them to think and act like humans, or do we want them to be far more efficient at decision-making and task handling? Either way, we first have to try to replicate how humans receive sensory data and how the human brain processes it to make the most viable decisions.

However, the development of Artificial Intelligence didn’t start with taking in data; rather, it started with making decisions. First, frameworks were built to test the decision-making abilities of machines, then their ability to learn from data, followed by their ability to learn in real time like humans. While all of this falls under the banner of Artificial Intelligence, Machine Learning, and Deep Learning, in this blog let’s look at building a Computer Vision model.

What is Human Vision? Or, for that matter, Computer Vision? It’s the ability to capture and recognize objects, just as humans do. Computer Vision sits in the Deep Learning domain and involves the automated extraction of information from captured images and visual data. This information can be used to recognize, differentiate, or search for content. At this stage, it becomes essential to understand how a machine would recognize objects or elements in an image. We know that a computer can process numbers and perform various calculations. Using this as a foundation, we develop algorithms to convert images into numbers, or matrices of numbers, which are then processed to extract information.

To start with, consider a binary image, which represents a picture with only two colors, black and white. A step up is a greyscale image, which covers a range of 256 shades of grey from black to white. Every time we capture and store images, the graphics are converted to binary data based on these variations. For example, the minimum grey value of “0” gives black, the absence of luminance, while the brightest white is “255”.
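
To make this concrete, here is a minimal sketch of a tiny greyscale image stored as a matrix of numbers. It assumes NumPy, a common choice for such matrices, and the pixel values are invented purely for illustration:

```python
import numpy as np

# A hypothetical 4x4 greyscale "image": each entry is one pixel's
# brightness, from 0 (black) through 255 (white).
image = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [  0,   0, 255, 255],
    [128, 128, 128, 128],
], dtype=np.uint8)

print(image.shape)               # (4, 4) -- height x width
print(image[0, 3])               # 255 -> the brightest white pixel
print(image.min(), image.max())  # 0 255 -> full black-to-white range
```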

Hypothetically, let’s say you can break the image into small chunks and identify the intensity of color in each by a number in the range [0, 255]. If the chunks are so small that there is no variation within any of them, then each chunk can be represented by a number instead of a graphic color. This data can then be processed by a computing machine to identify the various entities and features of the image. Taking it to the next level, a color or RGB image, instead of holding just shades of grey, combines variants of three colors: Red, Green, and Blue. If all three channels are coded “0”, the result is black; if each is at the maximum of “255”, the result is white. The entire spectrum of colors fits into this matrix.

Source: Computer Vision Primer: How Artificial Intelligence sees an Image
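
Extending the NumPy sketch above, an RGB image simply adds a third axis for the three color channels; again, the pixel values here are made up for illustration:

```python
import numpy as np

# A hypothetical 2x2 RGB image: each pixel is a (red, green, blue)
# triple, so the whole image is a height x width x 3 array.
image = np.array([
    [[  0,   0,   0], [255, 255, 255]],  # black pixel, white pixel
    [[255,   0,   0], [  0, 255,   0]],  # pure red,    pure green
], dtype=np.uint8)

print(image.shape)   # (2, 2, 3)
print(image[0, 0])   # [0 0 0]       -> all channels 0: black
print(image[0, 1])   # [255 255 255] -> all channels 255: white
```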

Now, if the machine can read an image in its native language, processing it and extracting information from it is a doddle with accurate algorithm design and efficient coding. Evidently, if the image size increases or the size of the square blocks decreases, recognition accuracy improves, but the complexity of processing and the need for computational power increase as well. Since the field's inception, a large number of techniques and frameworks have been built to enhance the entire process, from capturing, processing, and sorting images to extracting information from them.

Source: Pretrained Deep Learning Models | Computer Vision
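
One way to see this trade-off is to downsample an image and compare how much data is left to process. Below is a minimal sketch, assuming NumPy and a random array standing in for a real photograph:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for a high-resolution greyscale image (1024x1024 pixels).
high_res = rng.integers(0, 256, size=(1024, 1024), dtype=np.uint8)

# Downsample by averaging non-overlapping 4x4 blocks: fewer numbers
# to process, at the cost of fine detail.
block = 4
low_res = high_res.reshape(1024 // block, block,
                           1024 // block, block).mean(axis=(1, 3))

print(high_res.size)  # 1048576 values to process
print(low_res.size)   # 65536 values -- 16x less data, less detail
```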

Now that we understand how a machine can read and process an image, we can proceed to craft a Computer Vision model, which we will build, feed, train, test, and deploy. Since Machine Learning algorithms form the base on which Deep Learning and Computer Vision models are built, the first step is to have a framework in place for the Computer Vision model.

Now that we have a machine model, we treat it like a baby and train it to recognize and understand objects from the data we feed it.

The entire process of crafting a Computer Vision model can be summed up by the acronym CATE (Create, Annotate, Train, Execute):

  • CREATE a sufficiently large database suited to the task at hand.
  • ANNOTATE each entity in your database (see the sketch after this list).
  • TRAIN your model with extensive learning models, analogous to the way a toddler would learn, based on the features and the task at hand.
  • EXECUTE and evaluate by feeding in data that was not used during training.
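
As a deliberately tiny illustration of the Create and Annotate steps, here is one common way labels are recorded: a CSV file mapping each image file to its class. The file names, labels, and the labels.csv name are all hypothetical:

```python
import csv

# Hypothetical annotations: one row per image in the database,
# mapping a file name to its label ("apple" or "orange").
annotations = [
    ("img_000001.jpg", "apple"),
    ("img_000002.jpg", "orange"),
    ("img_000003.jpg", "apple"),
]

with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])  # header row
    writer.writerows(annotations)
```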

“Treat your machine like a baby!” If we want a baby to learn, understand, and differentiate between an orange and an apple, we start by making an example set for the baby: show an orange, an apple, variants of each, and so on. Similarly, for the machine, it starts with curating a huge dataset of such images, which is fed to the machine as its example set. Then, for each element in the database, we annotate or label it accordingly; for example, creating a database of a million images of oranges and apples combined and then annotating each of them. This is the most crucial stage, as it decides the accuracy of the model when deployed. Then you train the model and check whether the machine can identify simple to mid-complexity objects in the images. This step is iterative: once the model can handle a dataset with a set accuracy and precision, you increase the complexity of the data and iterate the training. When we reach the target accuracy, we execute or deploy the model in real-time situations, testing it with images and data that weren’t used in the training modules.
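
To make the Train and Execute steps concrete, here is a minimal sketch using scikit-learn. The data is synthetic (random vectors standing in for flattened images, with made-up apple/orange labels), and a production model would more likely be a convolutional network trained on real annotated images; the point is only the discipline of holding out data the model never sees during training:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: 200 tiny 8x8 greyscale "images" flattened into
# 64-value vectors, labelled 0 (apple) or 1 (orange). In a real
# pipeline these would come from the annotated database.
X = rng.integers(0, 256, size=(200, 64)).astype(np.float32) / 255.0
y = rng.integers(0, 2, size=200)

# Hold out a quarter of the data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train step: fit a simple classifier on the training set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Execute/evaluate step: score on the held-out images only.
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```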

In the age of AI, robots, and smart everything, we want machines to see the world with our eyes, and Computer Vision models are as fundamental to that as transistors are to silicon chips. Researchers in the mid-20th century devised Computer Vision models to distinguish between typed and handwritten text. Today, Computer Vision models are deployed almost everywhere around us; a few examples that stand out are Google Lens recognizing text through OCR, Instagram and Snapchat filters built on face recognition, and QR code scanners for payments and geotagging. If an image is processed on a smart device using a camera to extract information, you can be sure an apt Computer Vision model runs in the background.

Thanks to advancements in Deep Learning, Machine Learning, and Neural Networks, machines today can train themselves without everything being hardwired. Computer Vision models have been deployed in almost all parts of our lives, from detecting cancer cells and driving cars autonomously to forecasting hurricanes and everything in between. However little of it we have imagined so far, the possibilities are endless.


Want to know how Wobot.ai uses Computer Vision for creating its models? Visit our website.