MIT research develops new machine learning model that understands underlying relationships between objects in a scene
Deep learning models don’t see the world the same way we humans do. Humans have the ability to see objects and interpret their relationships. However, DL models do not understand the complex interactions between individual objects.
To solve this problem, researchers at MIT have developed a machine learning approach which understands the underlying relationships between objects in a scene. First, this model illustrates individual relationships one at a time. Then he combines them to describe the full picture.
The new frame creates an image of a scene based on a textual description of the objects and their connections, such as “A wooden table to the left of a blue stool”.
To do this, he first divides those words into two smaller parts, one for each specific relation (eg, “a wooden table to the left of a blue stool”), then models each part independently. After that, an optimization method is used to put the components together, resulting in a scene image. Even when the scene contains multiple elements in varying relationships, the model can create more precise images from text descriptions.
Most DL systems would consider all relationships holistically and generate the image from the description in one step. However, these models fail when they deal with non-distribution descriptions, such as those that include more relationships. This is because these models cannot actually adopt a single shot to generate images with more relationships. According to the researchers, their model can simulate a larger number of links and adapt to unexpected combinations as it combines these different, smaller models.
To capture the associations of individual objects in a scene description, the researchers used a machine learning technique called energy-based models. Using this model, the method encodes each relational description and then compiles them together, inferring all objects and relationships.
The system can also recombine the lines in various ways by breaking them into shorter chunks for each relation, making it more apt to fit into scene descriptions that it has never seen before.
What’s more interesting is that the method can also work in reverse, finding textual descriptions that correspond to relationships between elements of a scene from an image. Moreover, their model can also modify an image by rearranging the elements of the scene to meet a new description.
When evaluated with other DL models while they were given text descriptions and tasked with creating visuals describing objects and their interactions, the proposed model surpassed baselines in each case.
Humans were also asked to assess whether the photos generated matched the scene description. About 91 percent of participants concluded that the new model worked best in the most difficult situations, where descriptions had three links.
When the system receives two related scene descriptions that represent the same image in different ways, the model is able to recognize that they were equivalent. The researchers state that even as they increase the number of relationship descriptions in the sentence from one to two, three, or even four, their approach continues to generate images that are appropriately described by those descriptions, while others methods fail.
This algorithm has the ability to learn from fewer data points while generalizing to more complex scenarios or imagery. It finds application in real world situations where robots have to perform complex multi-step manipulation tasks. Above all, this research holds great promise as it brings the field closer to the possibility for machines to learn and interact with their environment in the same way humans do.
In the future, the researchers plan to examine their model’s performance on more complex real-world photos, such as those with noisy backgrounds and objects blocking each other. They also hope to incorporate their model into robotic systems in the future, allowing a robot to infer element relationships from videos and then use that information to manipulate objects in the real world.