Google is launching Gemini, the model that now powers Bard. It will also run on the Google Pixel 8 Pro, with an Ultra version slated to arrive in 2024.
You’re likely familiar with ChatGPT. The ChatGPT app, the user interface for interaction, is often confused with GPT-4 and GPT-3.5, the models working behind the scenes to process user queries.
Google’s AI, Bard, is in the same situation: we now know the model behind it is Gemini, which Google calls “the most ambitious and successful of its AI models.”
Gemini, a Multimodal Model
In fact, Gemini is more than a text model; it’s a multimodal model. This means it doesn’t just react to a written user request but responds to various types of input. In Google’s words, it’s “able to generalize, understand fluidly, process, and combine different information media, including text, code, audio, image, and video.”
To see what that means in practice, watch the demonstration the Mountain View giant published on YouTube. Google’s AI describes a drawing as it is made in real time, invents a game on the spot, immediately understands and solves a riddle, draws a link between two objects, and proposes logical interpretations. In short, the demonstration speaks for itself.
In a press release, Google details how it achieved this result and why Gemini is more capable than previous multimodal models. “To date, the usual approach to creating multimodal models has been to train separate components for each use and then assemble them, reconstructing an integrated functionality as best they can.” This approach achieves commendable results but struggles with more complex tasks.
“Gemini has been designed to be natively multimodal,” the statement adds. It was pre-trained from the outset on various modalities, then refined with additional multimodal data. This approach gives Gemini the ability to understand and reason about all types of inputs. That’s why, according to Google, its performance far exceeds that of existing models, and “its capabilities push the boundaries of the state of the art in almost every field.”
Not One Gemini, Not Two, but Three Geminis
There are actually three Geminis: Gemini Pro, Gemini Nano, and Gemini Ultra.
Gemini Pro is being integrated into Google Bard as of today (in English only, unfortunately). The goal is to make Google’s generative AI “more proficient at understanding, summarizing, reasoning, suggesting ideas, writing, or planning.”
Gemini Nano, for its part, is being integrated into the Pixel 8 Pro. The goal is to give the smartphone new capabilities, such as the ‘summarize’ function in the Recorder app and the automatic replies generated in Gboard — first in WhatsApp, then in other messaging apps from next year.
Finally, Gemini Ultra will first be tested by customers, developers, partners, and cybersecurity experts before powering an improved version of Bard “in early 2024,” aptly named Bard Advanced.
Update: Google Admits to Tampering with Gemini AI Demo
Google has gone out of its way to showcase Gemini, its new AI, in the best light. The Mountain View firm even went so far as to doctor one of the demonstration videos of the generative AI model. The liberties taken by Google, which is keen to compete with rival OpenAI, were called out by Bloomberg.
Google Explains How the Gemini Demo Was Modified
To present Gemini in the best light, Google did not hesitate to edit one of the AI demonstration videos. According to Bloomberg, the American company has admitted that changes were made to the Gemini hands-on video. The footage, which can be seen below, shows several interactions between a user and the AI. Among other things, the user asks the multimodal model to guess what a drawing will depict as it takes shape, and to track a coin moved from one hand to the other.
In the video’s YouTube description, Google indicates that this six-minute demonstration was not carried out in real conditions: “latency has been reduced, and Gemini’s responses have been shortened for brevity,” the company admits. In other words, the AI did not respond and react instantly to the images it was shown, contrary to what the video suggests. Rather than filming an unedited, real-time exchange, Google stitched together separate pieces of footage.
Nor did Google really let anyone interact with Gemini by voice. According to a spokesperson interviewed by Bloomberg, “still images of the footage” and text queries were used; the spoken prompts were recorded afterward and added to the video. Once edited, the video gives the impression that a user verbally communicated with the AI, which is not the case. In short, Gemini’s actual performance, whether its speed of reaction or its ability to converse with a human, remains undemonstrated.
Worse still, Google shortened the queries sent to Gemini. To get a complete and relevant answer, you actually have to ask the AI long, detailed questions. In the video, the voice-over added in editing makes do with short, rather vague questions, which made it seem surprising that the model could so easily grasp what its interlocutor was trying to achieve. In practice, Gemini needs a comprehensive query to be effective, just like its rival GPT-4.
According to Bloomberg, Google’s liberties don’t stop there. The demonstration was carried out using Gemini Ultra, the most powerful and sophisticated version of the AI model, yet Google is careful not to specify which version is behind the video, even though the Ultra version is not yet available. The company deliberately leaves this ambiguous, suggesting that version 1.0 of Gemini is behind the sequence’s feats.
A “Proof of Concept”
On his X account (formerly Twitter), Oriol Vinyals, vice president of research and head of deep learning at DeepMind, the Google subsidiary behind Gemini, firmly defends the changes made by the group’s teams. The executive stresses that the queries and answers seen in the video are very real:
“All user questions and answers in the video are real, shortened for brevity. The video illustrates what multimodal user experiences built with Gemini could look like. We created it to inspire developers.”
The video is therefore better understood as a “proof of concept,” i.e. a practical demonstration meant to illustrate the feasibility and viability of a technology, rather than a faithful product demo. In other words, real interactions with Gemini may differ greatly from what Google showed.