The Role of Memory Storage in Computing
In recent years, computing has focused increasingly on expanding Random Access Memory (RAM) capacity, an effort that sees milestones met every quarter. Memory is important because it determines how fast processed data is copied to and from the CPU registers. The efficiency of a computing device depends on various operating components. For example, the bus speed determines how quickly data is transferred between the RAM and the processor, and also how quickly data from a storage device cycles to and from the RAM. The higher the bus speed, the less the lag in processing data and retrieving information. Most applications developed in the nineteen-eighties relied heavily on the computing power of the processor, and a CPU with the highest clock speed was highly sought after; a 3.4 gigahertz processor, for example, performs computations at that rate. The new era has changed our view of the perfect CPU: rather than pushing clock speeds far beyond 3.4 GHz, manufacturers have moved from peak single-core processing power to increasing the number of processing cores. This speeds up applications that utilize multiprocessing.
Machine learning relies heavily on computing power and large RAM capacities, which are used to train neural networks to gain experience in understanding and predicting data patterns on test data. CPU processing power is still proving insufficient, because much of the training is done on images. This creates the need for a hardware device dedicated to image processing.
Historically, most graphics processing units were designed specifically to render user interfaces. The growth of GPUs was driven by 3D gaming and Computer Aided Design (CAD) software. This surge in 3D usage, mostly in gaming, led manufacturers to enhance their GPUs by increasing both their processing power and their memory.
What has made GPUs crucial in machine learning is that they perform specialized tasks and can therefore dedicate all their computing power and memory to a single task: image processing. Training time is shortened since the strain on the CPU is alleviated, leaving the CPU to compute only on the matrix objects that the GPU extracts from the image.
Two of the most prominent terms in machine learning and deep learning are object detection and object recognition. The former relies on the latter; it can be likened to the slogan 'see so that you can understand'.
As a result, we can use this example to distinguish between these three computer vision tasks:
Classification of images is accomplished by making a prediction about what kind of object the image contains.
Photographic images containing a single object are acceptable as inputs for our model.
Why Processing Power in CPUs is Inefficient for Machine Learning
Labels used to identify the classes are assigned according to the classes that are to be recognized.
Software tools such as labelImg facilitate the annotation of images. Annotation involves manually drawing bounding boxes that are later used as references. These references represent the x_min, y_min, x_max and y_max coordinates that provide the actual location of the object we want to incorporate into our training model.
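As an illustration, labelImg can export each annotation as a Pascal VOC XML file. The following is a minimal sketch, assuming a hypothetical annotation file named fruit_001.xml, of how the stored coordinates can be read back for training:

import xml.etree.ElementTree as ET

# Parse a Pascal VOC annotation file produced by labelImg (file name is hypothetical)
tree = ET.parse('fruit_001.xml')
for obj in tree.getroot().iter('object'):
    label = obj.find('name').text            # class label, e.g. 'orange'
    box = obj.find('bndbox')
    x_min = int(box.find('xmin').text)
    y_min = int(box.find('ymin').text)
    x_max = int(box.find('xmax').text)
    y_max = int(box.find('ymax').text)
    print(label, (x_min, y_min, x_max, y_max))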
An example of an input is a picture that includes one or more items. Bounding boxes can be output in various ways: the model captures the four coordinates from the annotated image and returns the x and y minima and maxima, which are then connected by straight lines to form a square or rectangle around the detected object. For example, the existence of objects may be reported as a bounding box together with the categories or classes of items discovered in the picture. One or more photographs can be used as input, and each detection requires a class label and a bounding box (e.g. a point, width, and height). In addition to this split of computer vision tasks, the term 'object segmentation', also known as 'object instance segmentation' or 'semantic segmentation', describes accentuating the particular pixels of an object rather than drawing a broad bounding box. Based on this split, we may conclude that object recognition is a collection of difficult computer vision problems. However, object detection and object localisation can be difficult to differentiate, especially when all three tasks are referred to as object recognition.
Humans are capable of detecting and identifying objects in images. The human visual system is a marvel: it is quick, precise, and capable of difficult tasks such as distinguishing numerous objects and detecting impediments. Thanks to massive data sets, faster GPUs, and improved algorithms, computers can now be taught to accurately detect and categorize several items in a single image. At this point, we'll look at 'You Only Look Once' (YOLO), a technique for detecting objects, as well as related terminology such as 'object detection', 'object localisation', and the 'loss function'.
A class label must be assigned to an image in order to categorize it, whereas a bounding box must be drawn around an item in order to locate it. When these two activities are combined, drawing a box around each object of interest and labeling it becomes more difficult; all of these concerns fall under object recognition. Several associated steps must be completed in order to identify objects in digital photographs. R-CNNs, or region-based Convolutional Neural Networks, can be used to solve object recognition and localization problems. YOLO, or You Only Look Once, is a family of object detection methods characterized as efficient and accurate in classification and mainly used in real-time systems.
This project was to fulfil the following:
- To develop a system that is easy to navigate and user friendly.
- To develop a system that makes use of machine learning and deep learning in training a model that can be ported to mobile devices. The model must be able to classify fruits and vegetables and assign each identified food item the proper calorie label.
- To successfully detect objects, properly classify the items, and finally append the caloric content label within the bounding box on the live video capture. Furthermore, the system must be able to detect and classify multiple objects.
How GPUs Provide Dedicated Computing Power for Machine Learning
Artificial intelligence is an emerging technology that is gaining ground very fast. The gap between hearing the name of an item and picturing how it looks in your mind is closing rapidly. At the current pace, machine learning has surpassed the human mind at certain tasks, discerning complex images such as QR codes and solving mazes within a fraction of a second. Advances in computing speed and faster graphical processing units have made this possible.
It is disheartening that, despite the current advancement in technology, there is still a lag in the development of applications that utilise the potential of artificial intelligence, machine learning and deep learning.
It is for this reason that this project aims to create awareness and help users identify more with artificial intelligence in their everyday lives. A sample application is one that identifies fruits and vegetables and provides the user with information pertaining to each item's caloric content.
This will benefit the system users since they will be able to make more informed decisions concerning how they plan their meals. As a benefit, people will be able to track their daily caloric intake that they record in their meal registers.
The robustness of such a system lies in how simple it is to operate. With a single click to launch the application, you are ready to capture the caloric content of any fruit item within the application's trained dataset.
This project aims to provide fast lookup and computation of the calories contained in fruits and vegetables. It offers a robust way for people to check the calorie content of a fruit merely by scanning an image or viewing a live-streamed image. This enables individuals to take note of the calorie content of food items and makes it easier to record the content in their food diaries. Furthermore, this ease of use will raise awareness of the existence of artificial intelligence and herald more liking and support for the field. As people become able to track their caloric intake, they will be in a better position to manage their weight as they see fit.
The general objective of this project was to enlighten people on the existence of Artificial intelligence.
The main objectives of this project are the following:
- Train a model to detect and recognize fruits and vegetables from images and videos.
- Develop an Application Programming Interface (API) that enables the reverse lookup of the caloric content of fruits and vegetables and then display them alongside the identified food items.
Detecting the target object and tracking it successfully while dealing with occlusions and other complications is required in a variety of applications. Many researchers (Zou et al., 2019) experimented with various object tracking approaches. The nature of the techniques is mostly determined by the application domain. The following are some of the research projects that evolved into suggested work in the field of object tracking.
The ability to locate objects in the visual field is a challenging but crucial capability, distinct from image search, picture annotation, scene interpretation and object tracking, and it is used in a wide range of other applications. Tracking moving objects in video sequences has been a prominent topic in computer vision, with earlier applications in medicine and biology, as well as smart video monitoring and artificial intelligence in the military (Papageorgiou, 2000, p. 17).
The Advancement and Importance of Object Detection and Recognition in Machine Learning
Some single-object tracking systems have been developed recently, but they have limitations when dealing with multiple objects. Objects that are partially or completely obscured from view are harder to recognize, further complicating the problem, as do lowered illumination and shifts in the camera's acquisition angle. MLP-based object tracking models can be classified as robust and efficient, owing to their optimal selection of unique features and their implementation of the AdaBoost strong classification approach (Szegedy, 2013, p. 26).
Local lighting fluctuations, such as shadows and highlights, and global illumination shifts may all be accommodated using the background subtraction method (Horprasert et al., 1999). This method uses a background model that is statistically modeled on individual pixels. When working in RGB color mode, there is distortion in both the brightness and the chromatic property of the color; these phenomena are used to distinguish a shaded background from a conventional one. This in turn enables detection of a change in the background color when a moving foreground item casts a shadow on the background canvas; detection that relies on this 'shadow effect' is what is termed background subtraction. Background subtraction was accomplished as follows. Each ith pixel was represented by a four-tuple consisting of Ei (the expected color value), si (the standard deviation of the color value), and the brightness and chromaticity distortion variations. Background images were then compared to current images, and each pixel was classified as ordinary background, shaded background or shadow, highlighted background, or moving foreground. Using a technique developed by Li et al. (2003), it is possible to recognize foreground items in non-stationary complex settings with moving background objects. Background and foreground alterations in a scene were analyzed using inter-frame color co-occurrence statistics, and a technique for storing and retrieving those statistics was developed. Using this method, the foreground elements were identified in two phases.
The often-changing background was distinguished using the Bayes decision rule after learning the color co-occurrence data; short-term and long-term procedures were used to learn the changing backdrop. For applications that include the identification of abandoned/stolen objects and parked automobiles, Bayona et al. (2010) found that an approach centered on gathering immobile foreground areas worked well. This algorithm involved just two stages. A background-removal-based sub-sampling algorithm was first created to generate immobile foreground zones by identifying foreground changes at the same pixel locations at various times, using a Gaussian distribution function (Goodman, 1963). The basic method was also tweaked, with the previously computed subtraction retained in the threshold variable, for example. This approach was used to reduce the number of immobile foreground objects found.
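As a minimal sketch of the background subtraction idea, OpenCV ships a Gaussian-mixture subtractor (MOG2) that also flags shadows; this is a stand-in for illustration, not the exact statistical model of Horprasert et al. The video file name is hypothetical:

import cv2

cap = cv2.VideoCapture('surveillance.mp4')                       # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)   # 255 = foreground, 127 = shadow, 0 = background
    cv2.imshow('foreground mask', mask)
    if cv2.waitKey(30) == 27:        # press Esc to stop
        break
cap.release()
cv2.destroyAllWindows()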
Understanding the Different Computer Vision Tasks Involved in Machine Learning
This is a technique for identifying pixelated parts of a picture that are similar to a reference image. The template is slid from the image's upper left to its lower right in search of the best match, and it is recommended that the dimensions shown in the reference image be used as a guide when designing the template. The task is to determine whether a subset of the search image S has a pattern similar to that of the template image T and, if so, to locate it in S. A method that uses upper and lower bounds to find the best 'k' matches was devised by Schweitzer et al. (2011).
Using Euclidean distance and Walsh transform kernels, the match measure is determined. Priority setting was one of the technique's stronger suits: when excellent matches were found, intrinsic costs took precedence, performance improved, and decisions about which bounds to use improved. When there were not enough good matches, queuing and arithmetic operations became more expensive, so template matching was proposed as an alternative to using a queue. As described by Ito and Sakane (2001), the two main types of visual tracking systems are 'feature-based' and 'region-based'. In the feature-based method, an object's 3D pose is estimated using the image characteristics around its borders; with this method, the data is processed slowly. Region-based techniques include the parametric approach and the view-based approach. Using a model of the target image, the parametric technique calculates the best fit to pixel data in a specific region, while view-based methods start from a reference template and find the best match within the search area. The advantage of this approach is that it is less computationally intensive than the feature-based approach.
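A minimal sketch of the sliding template search described above, using OpenCV's matchTemplate (the image and template file names are hypothetical):

import cv2

image = cv2.imread('scene.png')        # search image S (hypothetical file)
template = cv2.imread('template.png')  # reference image T (hypothetical file)
h, w = template.shape[:2]

# Slide the template across the image and score every position
scores = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, _, _, top_left = cv2.minMaxLoc(scores)  # maxLoc is the best match for this measure
bottom_right = (top_left[0] + w, top_left[1] + h)
cv2.rectangle(image, top_left, bottom_right, (0, 255, 0), 2)
cv2.imwrite('matched.png', image)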
Using the DSSD strategy for network model training yields considerable advancements (compared to the VGG network, the performance of this technique is significantly superior). To improve accuracy, DSSD replaces the VGG network (Simonyan et al., 2014) used by the original SSD with ResNet (Targ et al., 2016; Hager et al., 2004). Convolution feature layers are added at the end of the underlying network; gradually shrinking the size of these feature layers enables detection results to be forecast on a variety of scales.
Girshick et al. (2015) devised a technique that uses a selective search to extract just 2000 regions from a picture, referred to as 'region proposals', in order to overcome the problem of picking a huge number of areas. Selective search proceeds as follows (a code sketch follows the list):
- Generate the initial sub-segmentation, producing a large number of potential regions.
- Use the greedy strategy to recursively merge similar parts into bigger ones.
- Produce the final candidate region proposals from the newly constituted regions.
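A minimal sketch of selective search using OpenCV's contrib module (this assumes the opencv-contrib-python package and a hypothetical input image):

import cv2

img = cv2.imread('fruit.jpg')  # hypothetical input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()      # greedy merging of similar sub-segments
rects = ss.process()                  # candidate regions as (x, y, w, h)
proposals = rects[:2000]              # R-CNN keeps roughly 2000 proposals
print(len(rects), 'regions proposed')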
R-CNN nonetheless has several drawbacks:
- Training the network is problematic because each image yields 2000 region proposals that must be classified.
- It cannot be implemented on live video capture, since it takes an estimated 47 seconds to process each test image (Girshick et al., 2015).
- The selective search is a fixed, predefined algorithm, so no learning happens at that stage. As a result, inefficient candidate regions may be generated.
Fast R-CNN, a faster object detection algorithm developed by the same author as R-CNN, solves some of the latter's shortcomings. The algorithm uses a similar procedure, but the input picture is fed to the CNN instead of the region proposals, producing a convolutional feature map. There are several ways in which proposals can be discovered on the convolutional feature map and transformed into squares; using a region-of-interest (ROI) pooling layer, we can reshape them into a fixed size for input into a fully connected layer. From each feature vector, a softmax layer predicts both the proposed region's class and the bounding-box offset values (Hou et al., 2017).
'Fast R-CNN' is quicker than R-CNN because the convolutional neural network does not need to be fed 2000 region proposals every time. Instead, the convolution operation is performed only once per image, and a feature map is constructed from it.
The network's performance still suffers as a result of the selective search, which is CPU-intensive and generally sluggish.
Object Segmentation and its Importance in Machine Learning
Like Fast R-CNN, Faster R-CNN processes an image with a convolutional network to produce a convolutional feature map. However, instead of applying a selective search to the feature map to discover region proposals, a separate network predicts them. The proposed regions are then reshaped using an ROI pooling layer, which classifies the image within each region and predicts the bounding-box offset values (Ren et al., 2015).
As shown above, Faster R-CNN is significantly faster, and it is therefore the best candidate for real-time object detection applications.
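As a minimal sketch of running a pretrained Faster R-CNN, torchvision bundles an implementation; the image file name and the 0.5 score threshold here are illustrative assumptions:

import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()
img = convert_image_dtype(read_image('fruit.jpg'), torch.float)  # hypothetical image
with torch.no_grad():
    out = model([img])[0]            # dict with 'boxes', 'labels', 'scores'
keep = out['scores'] > 0.5           # drop low-confidence detections
print(out['boxes'][keep], out['labels'][keep])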
Earlier object detection algorithms determined the position of an item within an image by using regions: the network does not consider the whole image, but looks for regions with a high likelihood of containing the target. YOLO is a method of object recognition that differs significantly from these region-based methods. YOLO forecasts the bounding boxes, and the class probabilities for those boxes, using a single neural network (Huang et al., 2017).
The network generates each bounding box's class probability and offset values. To locate the item in the picture, the bounding boxes with a class probability larger than a preset threshold value are chosen.
In comparison to other object identification methods, YOLO is quicker, performing object detection at frame rates topping 45 FPS (Huang et al., 2017). The algorithm has certain limitations, such as an inability to recognize tiny objects in the image, for example a flock of birds; this is due to the algorithm's spatial constraints.
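A minimal sketch of the thresholding step described above, combined with non-maximum suppression via OpenCV (the box and confidence values are illustrative, not real network output):

import numpy as np
import cv2

# Hypothetical raw detections: boxes as (x, y, w, h) plus their class probabilities
boxes = [[50, 40, 120, 110], [55, 45, 118, 112], [300, 200, 80, 90]]
confidences = [0.92, 0.88, 0.30]

conf_threshold = 0.5   # discard boxes whose class probability is below the preset value
nms_threshold = 0.4    # suppress boxes that heavily overlap a stronger detection

keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
for i in np.array(keep).flatten():
    print('kept box', boxes[i], 'confidence', confidences[i])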
Convolution filters are used to recognize objects after feature maps have been extracted in the SSD object identification process.
The Conv4_3 layer is utilized in the detection of objects. As an example, Conv4_3 is shown spatially as 8×8 (in reality it is 38×38). Four objects (i.e. locations) are predicted for each cell in the picture.
There are 21 classes plus one additional class for unknown objects to be grouped within during predictions, and we use the highest-scoring class for each prediction as the bounded item's class. The total number of predictions generated by Conv4_3 is 38×38×4: four predictions per cell, regardless of feature-map level (Liu et al., 2016, p. 25). As one would expect, a substantial proportion of predictions contain no object; SSD sets aside the class '0' to denote these background regions of the feature map.
Region proposals are not used by SSD on the network. Rather, it comes down to a simple procedure: for the class scores as well as the location coordinates, it employs convolutional filters with very small dimensions. After the feature maps are extracted, each cell is predicted using small convolution filters. There are no significant differences between these filters and typical CNN filters; each filter generates one boundary box and twenty-one class scores.
For starters, let's consider what a single layer of the SSD can recognize. Numerous layers are utilised during the detection of objects in this system's implementation. The resolution of CNN feature maps decreases as the network's spatial dimension decreases, so low-resolution SSD layers are used to identify objects that span a large area.
Emerging Technologies in Object Recognition for Machine Learning
In addition to VGG16, SSD adds six auxiliary convolution layers, five of which are used for object detection. The first and last prediction layers make four predictions per location, while the intermediate layers make six. In total, SSD generates 8732 predictions using its six convolutional prediction layers.
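The figure of 8732 can be verified from the feature-map side lengths and the per-cell box counts of the standard SSD300 configuration:

# SSD300: (feature map side, boxes predicted per cell) for each of the six layers
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(side * side * boxes for side, boxes in layers)
print(total)  # 5776 + 2166 + 600 + 150 + 36 + 4 = 8732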
It can be deduced that a model that makes use of multiple feature maps attains a higher accuracy score than one that uses fewer feature maps. This higher accuracy score translates into better object detection and better recognition results.
Table 1 above shows that with an increase in layer clustering of varied dimensions, the system makes predictions with a very high accuracy score. It is also fair to deduce that the system makes more accurate object detections, since more boundary boxes are being captured.
Because of its complexity and the heated discussion around it, target identification has remained one of the most difficult and contentious challenges in computer vision since its inception. In target detection, the purpose is to assess whether or not a picture contains any instances of a certain item type. When an object is detected in a given picture, target detection provides the spatial locations and extents of the item's occurrences, as well as the number of occurrences (based on a bounding box, for example). To go further in picture understanding and computer vision, target and object identification must be learned as a fundamental building block before moving on to more complex visual tasks such as tracking objects in images and segmenting instances. Target detection is used in a wide range of artificial intelligence and information technology applications, including machine vision, self-driving automobiles, and human–computer interfaces. Automatic feature learning from data using deep learning has produced a significant improvement in target recognition in recent years. Deep learning is built on top of neural networks, which serve as its building blocks, and building more effective neural networks has been critical to enhancing target identification algorithms and performance. Recent developments in object detection technology have produced two types of convolutional neural network (CNN)-based detectors: one-stage detectors and two-stage detectors.
According to Liu et al., the SSD was developed to retain real-time speed while maintaining the precision of the various object detectors discussed above. Although the SSD is quicker than YOLO, its accuracy is comparable to that of the most recent region-based target detectors. SSD combines YOLO's regression with the Faster R-CNN anchor-box mechanism, predicting the object region using feature mappings from various convolution layers with discretized multiscale and proportional default box coordinates. When presented with a sequence of candidate frames, the convolution kernel predicts the compensation of frame coordinates for each frame as well as the confidence level for each category. To establish the findings for each point in the picture, local feature maps of the multiscale area are utilized. This preserves the speed of the YOLO approach while producing a frame placement effect comparable to that of the Faster R-CNN method. SSD does not include any significant contextual relationships; instead, it constructs a feature pyramid directly from the backbone VGG16 plus four extra layers formed via convolution with stride 2.
These issues were addressed by the development of MANet, a single-stage detection architecture that accumulates feature information at diverse scales. MANet achieved a high level of performance, with an 82.7 percent mean average precision (mAP) on the PASCAL VOC 2007 test (Xie et al., 2021).
Fast R-CNN is more accurate and quicker than the standard R-CNN, which we covered in depth in the prior section.
How does Fast R-CNN compare to the more sophisticated algorithms discussed above?
The performance of Fast R-CNN, Faster R-CNN, and SSD on the PASCAL VOC 2007 test set was evaluated by Liu et al. (2016) (see section 4.5 for discussion of the standard benchmarks).
A mean average precision (mAP) of 66.9 percent was attained by Fast R-CNN when trained on the PASCAL VOC 2007 training dataset, much higher than the baseline of 64.9 percent (see section 4.6 regarding explanation of evaluation techniques). Faster R-CNN outperformed it by a small margin, with an mAP of 69.9 percent. SSD achieved mAPs of 68.0 and 71.6 percent with input sizes of 300×300 and 512×512 pixels respectively. SSD thus appears to outperform the Fast R-CNN and Faster R-CNN implementations, which use 600 pixels as the length of the shorter dimension of the input picture, when dealing with images of comparable size. SSD, on the other hand, requires a large amount of data augmentation to reach this level. Fast R-CNN and Faster R-CNN employ no augmentation other than horizontal flipping, and it is unclear whether they would benefit from additional augmentation.
The key advantage of the enhanced techniques, beyond being more exact than Fast R-CNN, is the increased speed at which they can be applied. SSD512 can operate at 19 frames per second on a Titan X GPU when the great majority of low-probability detections are removed by thresholding and non-maximum suppression. Meanwhile, a Faster R-CNN architecture based on VGG-16 achieves a frame rate of 7 frames per second; its developers state that Faster R-CNN operates at 5 frames per second, or 0.2 seconds per picture. While Fast R-CNN is comparable to R-CNN in evaluation speed, it requires additional time to compute region proposals: Selective Search takes two seconds per image on a CPU, while Edge Boxes takes only one-tenth of a second per image on a GPU, depending on the technique used.
Artificial intelligence has undergone three evolutions, from the basic intelligence of copying binary actions such as pick-and-place, to the more evolved machine learning that enables identifying picked items and placing them in specific containers. What is puzzling about deep learning is how the accuracy of detecting and identifying objects keeps improving as it is provided with more training data.
Machine learning includes a variety of subcategories, including deep learning. Predictability and classification of data can be improved by training a computer to filter inputs through layers; the observations may be visual, written, or aural. The way the human brain filters information is the inspiration for deep learning, whose goal is to emulate the brain's functioning. An estimated 100 billion neurons are found in the human brain, and each individual neuron connects to around 100,000 others. Dendrites, axons, and the cell body are all components of the brain's neuron: a signal sent down the axon of one neuron reaches the dendrites of the following cell, flowing across the synapse that links them. Nerve cells do little on their own, but in sufficient numbers, working in concert, they achieve remarkable results. That is the underlying principle of a deep learning algorithm: observations are gathered and consolidated into one layer, which creates an output that is then fed into the next layer, and so on, repeating until the final output signal is produced. A neuron (node) receives one or more signals (input values) and is in charge of generating the signal that it sends onward.
You may think of the input layer as being made up of all of your senses. These are independent variables, each a separate observation, expressed as numbers and bits of binary data a computer can use. The range consistency of these variables is maintained by standardizing or normalizing them. Multiple nonlinear processing layers are employed to extract and transform data, with each layer's output serving as the input to the next. As a result of what the layers learn, a hierarchy of concepts is established: a more abstract and composite representation of incoming data is created at each rung of this structure. For a picture, for example, a matrix of pixels can be used as the input; the first layer may encode and reassemble the edges of the pixels, the next layer may arrange those edges, a further layer may compose facial parts such as a nose and eyes, and the next layer may then tell whether the image features a face.
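A minimal sketch of such a stack of layers in Keras, with hypothetical layer sizes; each dense layer transforms the previous layer's output into a more abstract representation:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),            # input layer: normalized pixel values
    tf.keras.layers.Dense(128, activation='relu'),  # first hidden layer: low-level features
    tf.keras.layers.Dense(64, activation='relu'),   # second hidden layer: compositions of features
    tf.keras.layers.Dense(10, activation='softmax') # output layer: class probabilities
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()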
As we established earlier, the goal of this project was to create a system that could reliably track three-dimensional objects. Specifically, the goal was to develop a software solution that would allow real-time tracking of the object, laying the groundwork for a completely automated control system. An extensive design effort was needed in order to capture all of these elements in real time and to accurately represent the flow of data through the system. The system was designed as a single-process, multi-threaded application; as a result, the graphical user interface ran on its own thread, separate from the rest of the system.
Agile software development is a methodology that concentrates more on the development part than on the documentation part. Agile methodologies place a premium on real-time collaboration, preferably in person, over written documents. This gives developers livelier specifications than methodologies like waterfall (Abrahamsson et al., 2017): Agile specifications arrive in real time, through face-to-face communication, which gives first-hand, more accurate specifics. Such public systems are prone to change as they gain new users from different demographics now and then.
What takes places at each segment of the process is as follows:
Requirements – Define the iteration’s requirements based on the product backlog, sprint backlog, and feedback from customers and stakeholders.
I will acquire the requirements needed for the development of the system. That is, I will need a laptop for coding and further research, internet access to help me in researching and learning, and Adobe Illustrator and Adobe XD for designing the user interfaces.
Naturally, revision control was a critical component of the project's development. Backups were created throughout development using the Git revision control system (see A.3). Git was chosen partly for its familiarity, as the author of this project has used it previously, and for its simplicity. Git is an extremely powerful tool, and a complete project repository can be built in a matter of minutes with just a few instructions. Code backups were pushed to the remote repository, providing a secure environment in which to write programs. Additionally, the revision control system's branching capability enabled the creation of a new branch for developing a specific component of the system without affecting the existing stable master build. When such a branch is suitable, it can be merged into the master build and the procedure repeated. This technique is far more streamlined and error-free than the traditional backup process of building archives of files and numbering them for logging purposes.
Additionally, test-driven development was used during the project's development. Test code was built to help validate application code using the Google test framework gtest (Oxberry et al., 2016). While this principle was not rigorously followed, it was adhered to as far as practicable, with certain exceptions made owing to the difficulty of building tests for message-passing code. Additionally, because this was a research study rather than a software development project, the emphasis was on testing and assessing modern computer vision technology rather than on developing industry-standard application code.
Delivery – Integrate and deliver the working iteration into production. After testing the functionality, I will make it available for anybody to utilize.
Feedback – Accept customer and stakeholder input and incorporate it into the next iteration's requirements. I will need to review feedback from consumers or anyone who has used the application.
Sample images that make up the training data set were to be acquired from online sources. Multiple images of fruits, vegetables and common foods for the training data set were scraped from the Google search engine using Firefox and Chrome extensions that enable scraping of images from an image search. Collected images were then resized to a uniform standard dimension of 300 pixels by 300 pixels, as sketched below.
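A minimal sketch of the resizing step, assuming the Pillow library and hypothetical folder names:

from pathlib import Path
from PIL import Image

src, dst = Path('scraped_images'), Path('resized_images')  # hypothetical folders
dst.mkdir(exist_ok=True)
for path in src.glob('*.jpg'):
    # Resize every scraped image to the uniform 300 x 300 training dimension
    Image.open(path).convert('RGB').resize((300, 300)).save(dst / path.name)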
The processing stage will make use of feature extraction during the recognition task. After training, the information of trained images is stored in a tflite (TensorFlow Lite) cache file (Louis et al., 2019, p. 6). When a probe image is to be recognized, a matching training image is indexed from the saved cache file. The research design employed is descriptive and qualitative. The descriptive approach helps abstract system-level and technical information, presenting how the system operates in a simple and clear way; this enables the sample population to grasp the context and contribute to developing a better system that meets their needs and validates client expectations. The qualitative approach helps bring out the appropriateness and the general security implications addressed by the proposed system implementation.
The target population of the research was a group of 20 individuals from the locality. This sample population consists of both technically savvy and novice users, selected with the following qualification criteria in mind:
The 20 selected respondents consist of 10 men and 10 women. The sampling methods used are snowball sampling and simple random probability sampling: random individuals were selected from the general public and, on completing the study, were encouraged to refer interested and like-minded individuals to partake in the study.
The research work makes use of questionnaires, interviews and focus groups. Questionnaires in this study featured sections concerning system functionality, system response time and ease of system usability, and consisted of both open-ended and closed-ended questions. The open-ended questions enabled respondents to offer personal views pertaining to the system. Interviews offered a more personalized approach, rehearsing the questions already asked in the questionnaires but in a more in-depth manner than what was presented in writing. To gain more insight and new ideas, the use of focus groups proved advantageous in that people contributed fascinating ideas, such as text messaging a user once authenticated to an account with the transaction details; for example, an account holder would receive the amount deposited to or withdrawn from their account.
Participants were assured that the information collected would be treated with confidentiality and that the data collected would only be used for research purposes. Furthermore, participation was voluntary, and participants could withdraw at any time.
The use of common fruits and vegetables in the dataset ensured that no one would feel left out in the participation.
Data collection was done through questionnaires, with key informants consisting of 50 individuals. Each questionnaire item was designed to target atomic pieces of information that could be used in solving the research problem of providing an optimal, usable and acceptable system. Both open-ended and closed-ended questions were used. Based on the accumulated data, the following approximations and deductions were made:
There were 50 individuals who were given the questionnaires to answer. Out of the whole population, 28 were satisfied with the new system, 10 were not satisfied, and the remaining 12 did not fill in the questionnaires as required and gave no response or feedback.
In essence, an interview is a planned dialogue in which one party asks questions and the other responds. The term “interview” is commonly used to refer to a one-on-one talk between an interviewer and an interviewee. The pie chart below was created as a result of the aforementioned interviews.
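The chart can be reproduced from the questionnaire counts reported above; a sketch using matplotlib:

import matplotlib.pyplot as plt

labels = ['Satisfied', 'Not satisfied', 'No response']
counts = [28, 10, 12]   # from the 50 questionnaires described above
plt.pie(counts, labels=labels, autopct='%1.1f%%')
plt.title('Questionnaire responses')
plt.show()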
In requirement analysis, two distinct categories of requirements are considered: functional and non-functional. The primary distinction is that functional requirements specify what the system should do, while non-functional requirements specify how the system should operate and the qualities it should exhibit. Product requirements are focused on the problems that the software should answer.
Functional requirements refer to the capabilities and specifications that must be met to provide for user interaction with the system. The proposed system has satisfied the following functionalities:
- Efficient and robust object identification and fast object recognition, utilizing the least of the device's processing power and memory.
- A simple user interface that facilitates interaction with the camera through buttons, for example for taking photos and switching between the back and front cameras.
Non-functional requirements refer to standards used to make judgements concerning the system's operation. These standards affect the system's performance, security and reliability.
The proposed system will enforce the following non-functional requirements:
Robust, quick and efficient object identification/recognition
The system was to train a model and save it to a file. This file would then be loaded by the application and processed to identify objects in video streams. Therefore, the model has to be saved in a format that requires the least processing time in order to present the user with the recognized objects.
User interface employs bright colors and visible large text for readability and makes use of command buttons that drive the program from one page to the next.
The system aimed to be able to support the identification of multiple items on a video or static image.
The system was to be readily available, and for this the program was to be ported as a mobile application, enabling quick access and prompt usage.
This is the analysis and evaluation of the proposed solution in accomplishing the requirements to determine whether it is practical and workable. The proposed system is tested in the light of its workability, meeting user requirements, effective use of resources and cost effectiveness.
The researcher has the technical equipment needed, which includes a computer and internet access. In addition, the researcher has the technical skills and capabilities required to develop the fruit and vegetable calorie content recognition system with languages such as Python 3 and C++ as the back end.
The researcher has the financial capabilities to ensure the completion of this research project. The financial resources will cover the costs of data collection, internet usage, printing and presentation of the final documentation. The proposed system was to be developed to completion within the specified budget.
The system being proposed has the capability of alleviating the burden of memorizing combinations of foods and their calorie content.
The researcher has the required resources to ensure the processes and procedures conducive for project success are efficiently and effectively conducted.
The system made use of Google's TensorFlow, and the model was saved as a .tflite file that can be easily loaded and processed by micro computing devices that have low computing power, for example Internet of Things (IoT) devices, mobile phones and tablets.
Below is a snippet of how a model architecture is loaded:
from tflite_model_maker import model_spec, object_detector
spec = model_spec.get('efficientdet_lite2')
Loading the data into our model:
train_data, validation_data, test_data = object_detector.DataLoader.from_csv('fruits_vegetables_ml_learn.csv')
The aim is to reduce the val_loss to the lowest possible value.
At the end of the last epoch the value was 0.8881, down from 1.7439 at the beginning of training.
In the same directory, a file called model.tflite will be created.
This is the file that will be loaded offline on the mobile device for object detection.
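The intervening training and export steps can be sketched as follows, using the tflite_model_maker object-detection API; the epoch and batch-size values are illustrative assumptions:

# Continuing from the spec and data loaded above
model = object_detector.create(train_data, model_spec=spec, epochs=50,
                               batch_size=8, validation_data=validation_data)
model.evaluate(test_data)
model.export(export_dir='.')   # writes model.tflite to the current directory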
There are many apps that help people follow their diets, but they are mostly tedious or not very helpful. Some apps lack artificial intelligence altogether, and for the applications that do have it, the user still needs to click through too many options to get a result.
The current system aims to alleviate this complexity by presenting a straightforward interface and applying machine learning to detect fruits and vegetables from a livestream video. The application further displays the calorie content of the scanned and identified objects.
The proposed system makes use of deep learning in its implementation of object detection and identification. The system uses TensorFlow with the base model architecture 'efficientdet_lite2' during training. Owing to the base model used, the system's performance will be resilient on computing devices that have little memory and comparably low computing power. The system presents a minimal user interface with a large, viewable object detection and processing window and action buttons that use universal icons, picked because they are familiar to most users.
The proposed system has some constraints that include the following:
- Increasing the scope of the project to include more real-world fruits and vegetables will require a lot of time for collecting images and for training the model to identify the classified fruits and vegetables. The use of third-party application programming interfaces (APIs) in training the model with the help of graphical processing units (GPUs) will prove expensive, aside from the time taken in acquiring affiliation with the concerned finance authorities. Budgeting constraints can be both discouraging and damaging, as it is difficult while planning to ensure the necessary budgeting and proper flow and use of funds for the completion of the whole project.
- As with every trained model, the accuracy of detection is not perfect. This imperfection is what is termed false positives and false negatives. A false positive happens when an item is misclassified, for example a lemon identified as an orange; a false negative is when a small or misshapen orange is not identified by the model at all.
The system consists of three structural components: the user interface, the object detection middleware and the presentation layer. The user interface allows the user to interact with the system, for example by changing how many CPU core-threads can be used; more threads enable the model to compute and classify more objects quickly and efficiently. The middleware consists of a lookup library saved as a TensorFlow Lite file. This file contains the saved context of the model during training and therefore provides for the dynamic loading of tensors that can be used to detect and track objects. When accessed, the tensors return the coordinates of where to place the bounding boxes, as well as the class names.
The application layer performs the processing of the bounding box and draws boxes on the canvas where the live video captures are being shown. This layer further attaches the class names and performs a lookup of the calorie content of the identified class. This layer finally attaches the following to the canvas:
- The class name
- The calorie content of the class
- The bounding boxes for all objects identified and reported by the middleware layer.
This is the means through which the user interacts with the system. It is much like the application's homepage, where images and live video streams are rendered and processed. This view contains two action buttons that enable the user to take a photo to be saved, as well as to switch back and forth between the front and back cameras.
The Program Link and Database Middle Ware
During processing, the snapshots are passed through our model detector middleware, and the model returns a list of the identified objects. The results from the model detector provide a data structure list that contains the x and y coordinates of each identified object as well as its class name. This data is then passed to an object classifier function that adds bounding boxes to the identified objects in real time.
In summary, this section includes the subroutines that handle querying of information from the saved model file (model.tflite) (Ignatov et al., 2021) and the cross-checking, identification and retrieval of the detected object.
This module takes the feedback from the object detection middleware; once an object is identified, it attaches the labels to the livestream snapshot images. Furthermore, it performs a cross-lookup for the calorie content of identified foods and attaches a label within the bounding box of the detected object.
First the user will capture a live video feed. The video is then split into time slots and the time slots into snapshots. The snapshots are fed into the trained model for feature extraction and data processing. A graph is then generated that maps the features that were extracted. The processed model produces bounding-box regions as well as the highest-scoring class label, and the two are returned as output. Finally, the bounding boxes are drawn on the canvas and the class labels are attached within the bounding boxes of the detected and classified objects. A lookup table is then checked for the corresponding calorie value of the identified class, which is then attached on the canvas.
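A minimal sketch of this pipeline on a single snapshot, using the TensorFlow Lite Interpreter and OpenCV; the calorie table, class list, score threshold and output-tensor ordering are illustrative assumptions, since the exact ordering varies between exported models:

import cv2
import numpy as np
import tensorflow as tf

CALORIES = {'apple': 52, 'banana': 89, 'orange': 47}   # kcal per 100 g (illustrative values)
CLASSES = ['apple', 'banana', 'orange']                # hypothetical class list

interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

frame = cv2.imread('snapshot.jpg')                     # one snapshot from the video feed
h, w = frame.shape[:2]
resized = cv2.resize(frame, (inp['shape'][2], inp['shape'][1]))
interpreter.set_tensor(inp['index'], np.expand_dims(resized, 0).astype(inp['dtype']))
interpreter.invoke()

# Assumed output order: boxes, classes, scores, count (varies between exported models)
boxes, classes, scores, _ = [interpreter.get_tensor(d['index'])[0]
                             for d in interpreter.get_output_details()]
for box, cls, score in zip(boxes, classes, scores):
    if score < 0.5:
        continue
    y1, x1, y2, x2 = (box * [h, w, h, w]).astype(int)  # boxes are normalized ymin,xmin,ymax,xmax
    label = CLASSES[int(cls)]
    text = label + ': ' + str(CALORIES[label]) + ' kcal'
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(frame, text, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cv2.imwrite('annotated.jpg', frame)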
Maintenance is a term that refers to changes made to a software system after it has been released. System maintenance is a continuous process that encompasses a number of tasks, including the correction of program and design flaws, the updating of documentation and test data, and the updating of user assistance. Maintenance can be broadly grouped into the three categories below
Corrective maintenance is used to remove errors in the program, which occur both when the product is delivered and during maintenance. In corrective maintenance, the product is modified to resolve the errors discovered after the software product has been delivered to the customer.
Adaptive maintenance is generally not requested by the client but is imposed by the outside environment. It may include the following changes:
- Change in the object
- Change in algorithms for faster performance
- Change in frames, for example using video frames instead of live detection
- Change in system controls and security needs
Perfective maintenance entails altering the software to enhance certain characteristics, such as adding new functionality, increasing computing efficiency, and simplifying use. This form of maintenance addresses the extra needs of users that come up due to unavoidable environmental and business-induced changes, among them the following:
- Changes in software
- Economic and competitive conditions
- Changes in models
System evaluation is the process of checking the performance of a complete system to determine how it is likely to perform in live market conditions. It measures whether the system's performance can compete.
This chapter discussed how the system is implemented and how the user interacts with it through actions, and how the system responds with visible outputs. It discussed the system's components and their function during the system's execution. Details on the implementation of external libraries were provided, along with references to their specific usage inside the system. The system's operation was demonstrated through UML diagrams and class-interaction demonstrations. All code snippets were included, particularly the critical portions of training and importing the trained model file into an application. The system's functioning, as well as its operation and function calls, were well specified.
Moreover, in terms of model detection, it can be deduced that a trained model is not perfect, as it will produce both false positives and false negatives while classifying trained data objects.
Conclusion
This chapter outlines the research's limitations, findings, and recommendations. Additionally, it brings to a close the various concerns raised in the preceding chapters and summarizes the knowledge acquired in terms of concept, theory, technology, and application.
The main objective was to develop a fruit and vegetable object detection system that remedies the issue of identifying the caloric content of food. The system was meant to offer convenience in querying and getting the caloric content of a food item just by taking a photo of it or by livestreaming through the camera. The system met the objectives through successful object detection and proper labelling of the caloric content of the given fruit or vegetable identified. The caloric labelling of food items follows object detection and object classification, after which a label stating the caloric content is added within the object's bounding box.
The specific objectives were met as follows:
- The system is easy to navigate and is user friendly.
- The system has employed machine learning as well as deep learning in training a model that can be ported to work on mobile devices.
- A successful classification and caloric content labeling of food items. Furthermore, the system is able to perform multiple detection and classification. This was made possible through multithreading.
The system would work best if training were done using 3D models, which would provide more accuracy and reduce false negatives. The source of training data plays the most crucial role: it should be unbiased and captured under different lighting conditions, solely to ensure accurate classification.
The project has a broad scope across various domains and may simply be expanded by adding more efficient algorithms. Several of the areas include the following:
Medical diagnosis: using object detection and identification to analyse X-ray reports and detect brain tumours.
Recognize shapes from entire regions in photos.
Cartography is defined as the field of study concerned with the creation, production, dissemination, and study of maps.
Robotics: Object detection is used in robotics to detect the movement of body components and motion sensing.
Conclusion
To date, a lot of attention has been directed toward the evolution of object detection through utilisation of the deep learning methodology. The growth of deep learning has led to the creation of models such as the object detection pipeline utilised by this project, which, as a first step, served as the basis for the rest of the completion of this project. Object detection, face detection, and pedestrian detection may all be accomplished with this technology. To do this, the researchers used two methods: deep learning with OpenCV for object identification, and OpenCV for fast, threaded video streaming. The outcome can be affected by camera sensor noise and lighting conditions, which can make it difficult to detect the object. The final product is a 26-frame-per-second object detector based on deep learning.
References
Abrahamsson, P., Salo, O., Ronkainen, J. and Warsta, J., 2017. Agile software development methods: Review and analysis. arXiv preprint arXiv:1709.08439.
Bayona, Á., SanMiguel, J.C. and Martínez, J.M., 2010, September. Stationary foreground detection using background subtraction and temporal difference in video surveillance. In 2010 IEEE International Conference on Image Processing (pp. 4657-4660). IEEE.
Cherkassky, V. and Ma, Y., 2004. Practical selection of SVM parameters and noise estimation for SVM regression. Neural networks, 17(1), pp.113-126.
Girshick, R., 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440-1448).
Ding, C., Wang, S., Liu, N., Xu, K., Wang, Y. and Liang, Y., 2019, February. REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 33-42).
Goodman, N.R., 1963. Statistical analysis based on a certain multivariate complex Gaussian distribution (an introduction). The Annals of mathematical statistics, 34(1), pp.152-177.
Hager, G.D., Dewan, M. and Stewart, C.V., 2004, June. Multiple kernel tracking with SSD. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. (Vol. 1, pp. I-I). IEEE.
Horprasert, T., Harwood, D. and Davis, L.S., 1999, September. A statistical approach for real-time robust background subtraction and shadow detection. In Ieee iccv (Vol. 99, No. 1999, pp. 1-19). Citeseer.
Hou, Y., Zhang, H., Zhou, S. and Zou, H., 2017. Efficient ConvNet feature extraction with multiple RoI pooling for landmark-based visual localization of autonomous vehicles. Mobile information systems, 2017.
Huang, R., Pedoeem, J. and Chen, C., 2018, December. YOLO-LITE: a real-time object detection algorithm optimized for non-GPU computers. In 2018 IEEE International Conference on Big Data (Big Data) (pp. 2503-2510). IEEE.
Ignatov, A., Romero, A., Kim, H. and Timofte, R., 2021. Real-time video super-resolution on smartphones with deep learning, mobile ai 2021 challenge: Report. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2535-2544).
Ito, K. and Sakane, S., 2001, May. Robust view-based visual tracking with detection of occlusions. In Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164) (Vol. 2, pp. 1207-1213). IEEE.
Li, L., Huang, W., Gu, I.Y. and Tian, Q., 2003, November. Foreground object detection from videos containing complex background. In Proceedings of the eleventh ACM international conference on Multimedia (pp. 2-10).
Liu, R.Y., Parelius, J.M. and Singh, K., 1999. Multivariate analysis by data depth: descriptive statistics, graphics and inference,(with discussion and a rejoinder by liu and singh). The annals of statistics, 27(3), pp.783-858.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016, October. SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37). Springer, Cham.
Louis, M.S., Azad, Z., Delshadtehrani, L., Gupta, S., Warden, P., Reddi, V.J. and Joshi, A., 2019, June. Towards deep learning using TensorFlow Lite on RISC-V. In Third Workshop on Computer Architecture Research with RISC-V (CARRV) (Vol. 1, p. 6).
Oxberry, G., 2016. Google Test MPI Listener (No. gtest-mpi-listener; 005466MLTPL00). Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States).
Papageorgiou, C. and Poggio, T., 2000. A trainable system for object detection. International journal of computer vision, 38(1), pp.15-33.
Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
SAI, D.N.R., 2020. A Study on Object Tracking and Identifying Name of Object Using Open CV. International Journal for Innovative Engineering and Management Research, 9(09).
Shervashidze, N., Schweitzer, P., Van Leeuwen, E.J., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(9).
Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Szegedy, C., Toshev, A. and Erhan, D., 2013. Deep neural networks for object detection. Advances in neural information processing systems, 26.
Targ, S., Almeida, D. and Lyman, K., 2016. Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029.
Warden, P. and Situnayake, D., 2019. Tinyml: Machine learning with tensorflow lite on arduino and ultra-low-power microcontrollers. O’Reilly Media.
Xie, Y., Zheng, J., Hou, X., Naqvi, I.R., Xi, Y. and Kuang, N., 2021. Multi-dimensional weighted cross-attention network in crowded scenes. IET Image Processing, 15(14), pp.3585-3598.
Zou, Z., Shi, Z., Guo, Y. and Ye, J., 2019. Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055.