A powerful AI model for image search, previously limited to English, has gone global thanks to researchers at RISE. Their method of connecting different languages to the pre-trained image model has already been downloaded nearly two million times worldwide.
The neural network CLIP, from the research company OpenAI, has changed the playing field for linking text and images. Fredrik Carlsson, researcher in AI and deep neural networks at RISE, explains that the model has been trained on 400 million image–caption pairs:
“CLIP connects textual and visual information in a single shared space. It is useful for searching images, but it also works the other way around: what text matches this image?”
One potential use is in law enforcement investigations, such as reviewing days' worth of surveillance camera footage.
“A search query could be ‘white van, sticker with logo on the side’. The system could then return the frames that best match the query.”
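In practice, a query like that maps to a nearest-neighbour search in CLIP's shared embedding space: the text and every video frame are embedded, and the frames are ranked by similarity. Below is a minimal sketch of that pattern using the public openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library; the frame file names are placeholders.

```python
# Rank video frames against a free-text query in CLIP's shared embedding space.
# Requires: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "white van, sticker with logo on the side"
frame_paths = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]  # placeholder frames
images = [Image.open(path) for path in frame_paths]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between the query and every frame, best match first.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(1)
for path, score in sorted(zip(frame_paths, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{path}: {score:.3f}")
```

Automatic tagging, which comes up next, is the same computation run in the other direction: one image is scored against a list of candidate tag texts, and the best-matching tags are kept.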
Automatic image tagging
An obvious application is tagging images automatically, or running text searches over images that have never been tagged. Carlsson also sketches out how experiments with an AI-enhanced Photoshop application could work: through text input alone, you can generate a portrait image, add makeup, and adjust the hairstyle or skin tone.
“Or give the person an Asian appearance. Or make them more like Emma Watson, or even Hillary Clinton.”
Carlsson explains, however, that it was not obvious that CLIP could be used outside the English-speaking world:
“There is far less data, which is the major problem with smaller languages such as Swedish, Catalan, Finnish, and so on. That, and the cost of training these models, which require enormous computational power.”
A solution with little need for new data
According to Carlsson, although RISE trains its own language models and has leading experts, greater resources can be found elsewhere. He mentions major AI groups such as Google, Facebook, Nvidia, and Microsoft:
“For CLIP we found a useful shortcut. We understood that the world looks fairly similar in all languages. There may be more Dala horses in Sweden, but we are not interested in relearning the visual world.”
The solution is to keep the existing model but swap the pre-trained English text encoder for one that is pre-trained in Swedish (or another language); the image encoder is left untouched. The required training time can be as short as 24 hours.
“It became an incredibly computationally efficient and cost-effective method. We hardly needed new data.”
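Carlsson does not spell out the training recipe here, but one way to realise the shortcut is a teacher-student setup: the original CLIP model is kept frozen, and a text encoder pre-trained in the new language is taught to place translated captions at the same points in CLIP's embedding space as their English originals. The sketch below illustrates that idea only; it is not RISE's actual code, and the Swedish checkpoint name and the toy caption pair are stand-ins.

```python
# Sketch: teach a pre-trained Swedish text encoder to land in CLIP's existing
# embedding space, so the frozen CLIP image encoder can be reused unchanged.
# Illustration of the general idea only, not RISE's training code.
# Requires: pip install torch transformers
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()  # the original CLIP model acts as a frozen teacher

student_name = "KB/bert-base-swedish-cased"  # any pre-trained Swedish encoder would do
student = AutoModel.from_pretrained(student_name)
student_tokenizer = AutoTokenizer.from_pretrained(student_name)

# Small head mapping the Swedish encoder's hidden size to CLIP's embedding size.
head = nn.Linear(student.config.hidden_size, clip.config.projection_dim)
optimizer = torch.optim.AdamW(list(student.parameters()) + list(head.parameters()), lr=1e-5)

# Training pairs: an English caption and its Swedish translation (placeholders).
pairs = [("a white van with a logo on the side", "en vit skåpbil med en logotyp på sidan")]

for english, swedish in pairs:
    with torch.no_grad():
        teacher_inputs = clip_processor(text=[english], return_tensors="pt", padding=True)
        target = clip.get_text_features(**teacher_inputs)  # where CLIP puts the English caption

    student_inputs = student_tokenizer(swedish, return_tensors="pt")
    pooled = student(**student_inputs).last_hidden_state.mean(dim=1)  # mean-pool the tokens
    prediction = head(pooled)

    loss = nn.functional.mse_loss(prediction, target)  # pull the Swedish text to the same point
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Only the comparatively small text encoder is trained and no images pass through the network at all, which is what makes training times of around a day plausible.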
Carlsson says that the method, named Multilingual-CLIP, has spread widely and has been downloaded around two million times, mainly in East Asia, China, and India.
“I don’t know what exactly it is being used for, but I would naturally assume image searches. Most operators with large image databases can use CLIP along with the multilingual package,” concludes Fredrik Carlsson.
Multilingual-CLIP on GitHub: https://github.com/FreddeFrallan/Multilingual-CLIP
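For anyone who wants to try the released models, the repository documents a usage pattern along the lines sketched below. The package name, model identifier, and method call are taken from that documentation and may differ between versions, so check the current README before relying on them.

```python
# Embed non-English text into CLIP's space with a released Multilingual-CLIP model.
# Requires: pip install multilingual-clip torch transformers
# NOTE: the model name and call pattern follow the repository's documented example;
# verify them against the current README.
import transformers
from multilingual_clip import pt_multilingual_clip

texts = [
    "en vit skåpbil med en logotyp på sidan",  # Swedish
    "a white van with a logo on the side",     # English
]
model_name = "M-CLIP/XLM-Roberta-Large-Vit-B-32"

model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Text embeddings that live in the same space as the matching CLIP image embeddings.
embeddings = model.forward(texts, tokenizer)
print(embeddings.shape)
```

The package only covers the text side; images are still embedded with the matching original CLIP vision encoder, which is what allows a Swedish, German, or Russian query to be matched against an image index built with unchanged CLIP image embeddings.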