Why a Self-Driving Car Might Run a “STOB” Sign: To ViT or Not To ViT

September 19, 2024

The all-too-familiar octagonal red STOP sign with white trim around the border (at least in the United States) instructs the driver to come to a complete stop, look in both directions, and yield to traffic as required before proceeding. If the stop sign has had the misfortune of being vandalized, perhaps a casualty of graffiti, the human driver would most likely be able to look past the graffiti, recognize that it is a STOP sign, and take the appropriate action.

It could be said that this behavior is analogous to “inductive biases” (sometimes called inductive priors), a term from the science of neural networks describing the built-in assumptions a network makes about the structure of its input data – assumptions that let it generalize even when some of that input is corrupted or occluded. In this case, the driver has seen enough stop signs, and this sign offers enough cues, to recognize it as a stop sign despite the fact that it has been defaced.

Convolutional Neural Networks (CNNs) build in strong inductive biases of exactly this kind. This class of neural network assumes that nearby pixels in a camera image are related (typically a reasonable assumption), and it applies the same filter weights across the entire image, so a feature is treated the same way no matter where in the frame it appears. The result is that a relatively limited amount of training is required before a CNN starts to demonstrate significant levels of accuracy. However, even when fully trained, the inherent accuracy of the CNN is limited compared to other classes of neural networks: its built-in assumptions mean it can, in effect, “jump to conclusions too quickly” and make mistakes.
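
For readers who like to see these assumptions in code, the toy model below is a minimal sketch (written in PyTorch for illustration; the layer sizes and the ten-class output are hypothetical, not drawn from any production ADAS network) of how small convolution kernels encode locality and how the same weights are reused at every position in the frame.

```python
# A toy CNN: each 3x3 convolution looks only at a small neighborhood of pixels
# (locality), and the same filter weights are reused at every position in the
# frame (translation invariance / weight sharing).
import torch
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),  # hypothetical 10-way classifier (e.g., sign types)
)

frame = torch.randn(1, 3, 224, 224)  # one RGB camera frame (illustrative size)
logits = tiny_cnn(frame)
print(logits.shape)                  # torch.Size([1, 10])
```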

Slightly off topic for a second: the intelligence of different dog breeds is often gauged by how many repetitions of a command a typical dog of that breed needs before it reliably responds – in other words, how quickly it can be trained. I am going to draw the analogy that different classes of neural networks are like different breeds of dogs, both in how long they take to train and in how reliably they can recognize a command from different voices once trained.

While CNNs have been the mainstay deep neural network (DNN) for perception in ADAS applications, Vision Transformers (ViTs) have lately been gaining significant interest within the ADAS design community, primarily due to their improved accuracy in object classification over CNNs. With the ViT, however, the increased accuracy comes with the need for a very extensive set of training data. The ViT also demands more compute performance than the CNN during deployment.

ViTs have very weak inductive biases. While this is one of the factors leading to improved accuracy (this form of neural network is less prone to “jump to conclusions”), a significantly larger training set is needed before a ViT can reach the level of accuracy a CNN achieves. Also, because the ViT makes no assumption that only adjacent pixels are related – its attention mechanism relates every patch of the image to every other patch – the full frame, with all pixels present, must be available before the ViT's calculations can be performed. This, in turn, is one of several factors that drive up the requisite compute performance during deployment/inference.
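
As a concrete illustration, the sketch below (a minimal, hypothetical example using PyTorch's built-in multi-head attention, with made-up patch and embedding sizes) shows the ViT's core step: the whole frame is cut into patches, each patch becomes a token, and self-attention relates every token to every other token, which is why the full image must be available before the computation can start.

```python
# The ViT's core step: split the full frame into patches, embed each patch as a
# token, and let self-attention relate every token to every other token.
import torch
import torch.nn as nn

frame = torch.randn(1, 3, 224, 224)          # full camera frame (illustrative)
patch = 16                                    # 16x16 patches -> 14x14 = 196 tokens

# Patch embedding: a strided convolution turns each 16x16 patch into one token.
to_tokens = nn.Conv2d(3, 192, kernel_size=patch, stride=patch)
tokens = to_tokens(frame).flatten(2).transpose(1, 2)       # (1, 196, 192)

# Global self-attention: all 196 tokens attend to all 196 tokens, so the whole
# frame has to be present before this step can run.
attn = nn.MultiheadAttention(embed_dim=192, num_heads=3, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape, weights.shape)               # (1, 196, 192) and (1, 196, 196)
```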

Increasingly, higher-resolution image sensors are being employed in the automobile because they provide more information about the vehicle's surroundings than their lower-resolution counterparts. Sensors that provide more information enable objects to be detected more readily and from greater distances, which in turn enables the vehicle to operate autonomously at higher speeds. As an example, a lower-resolution camera may place only 5 pixels on an object detected off in the distance – too little information to accurately recognize the object and plan the appropriate action. A higher-resolution camera, on the other hand, may place 100 pixels on that same object, enabling the ADAS perception engine to readily recognize it as a pedestrian, cyclist, or stationary object, giving the autonomous vehicle more time to take the right action.
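
The rough, back-of-the-envelope sketch below (a simple pinhole-camera approximation with assumed numbers: a 0.5 m-wide pedestrian, a 60-degree horizontal field of view, and 1,280- versus 3,840-pixel-wide sensors) illustrates how sharply the pixel count on a distant object grows with sensor resolution.

```python
import math

def pixels_on_target(object_width_m, distance_m, hfov_deg, horizontal_pixels):
    """Approximate horizontal pixels an object subtends (pinhole-camera model)."""
    pixels_per_degree = horizontal_pixels / hfov_deg
    angle_deg = math.degrees(2 * math.atan((object_width_m / 2) / distance_m))
    return angle_deg * pixels_per_degree

# Hypothetical 0.5 m-wide pedestrian at 150 m with a 60-degree horizontal FOV:
print(pixels_on_target(0.5, 150, 60, 1280))   # ~4 px on a ~1 Mpixel-class sensor
print(pixels_on_target(0.5, 150, 60, 3840))   # ~12 px on an ~8 Mpixel-class sensor
```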

While the CNN doesn’t require the full frame of the image to begin detecting objects, the ViT does. This further exacerbates the ViT's computational requirements: the greater amount of data in a higher-resolution camera frame must all be processed within the same time period.
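
To put a rough number on that, the sketch below (assuming 16x16-pixel patches, a common but here purely illustrative choice) counts how many tokens a ViT would produce at a few camera resolutions and how large the per-layer attention matrix becomes; because the matrix is tokens-by-tokens, the cost grows roughly with the square of the patch count.

```python
# Count ViT tokens and per-layer attention-matrix entries at a few camera
# resolutions (assuming illustrative 16x16-pixel patches). The attention matrix
# is tokens x tokens, so its size grows roughly quadratically with patch count.
def attention_entries(width_px, height_px, patch=16):
    tokens = (width_px // patch) * (height_px // patch)
    return tokens, tokens * tokens

for res in [(1280, 960), (1920, 1080), (3840, 2160)]:
    tokens, entries = attention_entries(*res)
    print(f"{res}: {tokens} tokens, {entries:,} attention entries per layer")
```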

Just as some breeds of dogs have different propensities towards being trained, neural network architectures can be viewed as being on a spectrum of inductive biases from strong to weak. CNNs are on one end of the spectrum while ViTs are on the other end. 

As typically seems to be the case, there is no free lunch. The ViT's increased power consumption, driven by its inherently higher computation, along with its larger training datasets and longer training times, represents meaningful recurring and non-recurring costs. Hybrid architectures are emerging that combine CNNs and ViTs into one architecture, leveraging the inherent strengths of each neural network. These architectures strike a middle ground between the strong inductive biases of the CNN and the weak inductive biases of the ViT, retaining much of the ViT's learning flexibility and accuracy while reducing the amount of training required.
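
The sketch below shows the general shape of that hybrid pattern (an illustrative toy in PyTorch, not any specific published architecture): a small CNN stem contributes locality and downsampling, and a transformer encoder then applies global attention over a much smaller set of tokens.

```python
# A hybrid pattern: a CNN stem (strong inductive bias, aggressive downsampling)
# feeds a transformer encoder (global attention over far fewer tokens).
import torch
import torch.nn as nn

class HybridBackbone(nn.Module):
    def __init__(self, dim=192, num_heads=3, num_layers=2, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(                     # convolutional front end
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        feats = self.stem(x)                           # (B, dim, H/8, W/8)
        tokens = feats.flatten(2).transpose(1, 2)      # (B, H/8 * W/8, dim)
        tokens = self.encoder(tokens)                  # global attention
        return self.head(tokens.mean(dim=1))           # pooled class logits

logits = HybridBackbone()(torch.randn(1, 3, 224, 224))
print(logits.shape)                                    # torch.Size([1, 10])
```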

While the focus here has been on comparing CNNs vs. ViTs, similar debates and tradeoffs are playing out wherever AI is being employed. That is why it is paramount that upgradable AI architectures are deployed in the field, so that the optimal neural network can be rolled out as it emerges – and the wrong response to a STOB sign can be avoided.

Robert Bielby

Sr. Director, System Architecture and Product Planning for Automotive
