"Multi-modal" data refers to multiple distinct types of data that a user would like to relate to one another within a single application. For example:
A model uses a user's text query (mode 1) to find an image (mode 2) and a song (mode 3) that correspond to the prompt.
A model is given images (mode 1), written testimonials (mode 2), and black-box electrical data from a car (mode 3) to create a simulation (mode 4) and a written description (also mode 2) that summarizes an officer's findings at an accident scene.
A model takes X-ray scans (mode 1), surface images of the injury (mode 2), and a description of symptoms (mode 3) to produce a recommended recovery plan for a physician to review.
and you could likely think of (and at this point may even use) many more.
Implementing multi-modal models gets complicated in the real world, so let's dig briefly into the theory behind multi-modal data, which is fairly simple.
All data can be sent through an "encoder" to be turned into a long list of high-precision numbers called an "embedding".
This list of numbers can represent text, an image, a waveform, lidar, spectroscopy data, etc. The encoder itself performs weighted matrix math to reduce the data to a correctly sized list of numbers (the embedding).
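The "weighted matrix math" above can be sketched as a toy example. Everything here is made up for illustration: real encoders have millions of learned weights and many layers, but at the core is the same operation, multiplying the input by a weight matrix to get a fixed-size embedding.

```python
def encode(data, weights):
    """Multiply the input vector by a weight matrix: each row of the
    matrix produces one number of the fixed-size embedding."""
    return [sum(w * x for w, x in zip(row, data)) for row in weights]

# Hypothetical 2x4 weight matrix: projects any 4-number input
# down to a 2-number embedding.
W = [[0.2, -0.1, 0.5, 0.3],
     [0.7,  0.4, -0.2, 0.1]]

pixels = [0.9, 0.1, 0.4, 0.6]   # e.g. four normalized pixel values
embedding = encode(pixels, W)
print(embedding)                 # a 2-number embedding
```

The same `encode` shape works for any mode: swap the pixel values for audio samples or token counts and a (differently trained) weight matrix, and you still get a fixed-size list of numbers out.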
Those embeddings numerically represent things that we as humans both can and can't observe about a certain piece of data.
The embeddings are also biased to represent certain aspects of the data based on how the encoder was trained. Training is the process of giving an algorithm a task by making it look at lots of data and telling it how to classify the data. The algorithm's "weights" adjust with each piece of data to better accomplish the task.
For example, suppose I train an image encoder to find the "blue" images among many others. The embeddings that result from that encoder will then, in their constituent numbers, over-represent color, because the training process mandates that the encoder know what is and isn't blue. Something like the shapes in the image will likely be weighted less heavily in the embedding.
There are numerous well-documented cases of biased encoders in the machine learning literature. The severity of this issue depends on context, but it should be taken seriously and accounted for in all cases.
Embeddings from any "mode" (data type) can be compared.
Once text becomes numerical, along with pictures, electrical waveforms, formatting decisions, etc., what was formerly apples to oranges becomes apples to apples. Embeddings can be grouped (clustered) to see which are most related. Most simply, two embeddings can be subtracted from one another to calculate the distance between them.
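The subtract-to-compare idea looks like this in practice: subtract the embeddings element-wise, then take the length of the difference vector (the Euclidean distance). The embeddings below are made-up three-number stand-ins; real ones are hundreds or thousands of numbers long, but the comparison is identical.

```python
import math

def distance(a, b):
    """Euclidean distance: subtract the embeddings element-wise,
    then take the length of the resulting difference vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings: imagine one came from a text query, one from
# an image, one from a song. Same length, so all are directly comparable.
text_emb  = [0.9, 0.1, 0.3]
image_emb = [0.8, 0.2, 0.3]
song_emb  = [0.1, 0.9, 0.7]

print(distance(text_emb, image_emb))  # small distance: closely related
print(distance(text_emb, song_emb))   # larger distance: less related
```

Clustering builds on the same primitive: group together whichever embeddings sit closest to each other by this distance, regardless of which mode each one came from.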
Those multi-modal embedding comparisons are already, and will increasingly be, the basis for many of the tools that become prevalent. They make natural language a new interface for what have become truly unsearchable vats of data, even just on your personal computer.