OpenAI has unveiled voice and image functionalities for ChatGPT. We at PL Talents obviously couldn’t just walk by and not cover it.

A backend model named GPT-4V will handle image inputs, while an enhanced DALL-E model will generate images from text prompts. Excitingly, ChatGPT’s mobile app users will soon be able to have voice conversations with the chatbot.

About OpenAI’s innovative steps

ChatGPT Plus and Enterprise users can anticipate the arrival of DALL-E 3, the latest iteration of OpenAI’s image generation model, currently in “research preview”. The integration means users can draft and refine image prompts with the chatbot’s assistance. Image understanding is built on a multimodal variant of the GPT model known as GPT-4 Vision (GPT-4V). For the voice component, OpenAI’s Whisper automatic speech recognition (ASR) model transcribes user voice input, and a new text-to-speech (TTS) model renders ChatGPT’s text replies in one of five voice options. The features are being rolled out in phases for safety reasons, following beta testing and “red teaming” to identify and mitigate potential risks.
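To make the voice flow concrete, here is a minimal sketch of the same transcribe-reply-speak pipeline using OpenAI’s public Python SDK. The article doesn’t give API details, so the model names (whisper-1, gpt-4, tts-1), the voice name alloy, and the file paths below are assumptions drawn from the public API rather than from ChatGPT’s internal implementation:

```python
# Minimal sketch of a voice round trip: Whisper ASR -> chat model -> TTS.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the user's spoken question with Whisper.
with open("question.mp3", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_in,
    )

# 2. Get a text reply from the chat model.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. Render the reply as speech. "alloy" is one of the public API voices,
#    not necessarily one of the five voices shipped in the ChatGPT app.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```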

OpenAI elaborates:

Large multimodal models introduce different limitations and expand the risk surface compared to text-based language models. GPT-4V possesses the limitations and capabilities of each modality (text and vision), while at the same time presenting novel capabilities emerging from the intersection of the two modalities and from the intelligence afforded by large-scale models.

GPT-4V testing insights

OpenAI shared its GPT-4V testing methodology in a publication. The model was deployed in a tool called ‘Be My AI’, designed to assist visually impaired users by describing the content of images. A pilot with 200 beta testers ran from March to August 2023, then expanded to 16,000 users in September 2023. A developer alpha programme also gave over 1,000 developers access to the model for three months, to gather feedback on how people interact with GPT-4V.
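For illustration, here is a minimal sketch of the kind of request a description tool like Be My AI could make through OpenAI’s vision-capable chat API. The article doesn’t describe the actual integration, so the model name (gpt-4-vision-preview) and the message format are assumptions based on the public API:

```python
# Minimal sketch: ask a vision-capable GPT-4 model to describe an image,
# similar in spirit to what an accessibility tool might do.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment.
import base64

from openai import OpenAI

client = OpenAI()

# Encode a local photo as a base64 data URL so it can be sent inline.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed public vision model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image for a visually impaired user."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```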

OpenAI’s paper provided comprehensive evaluations of the model’s behaviour, including how reliably it declines to generate harmful content, declines to identify individuals from images, declines to solve CAPTCHAs, and resists image-based “jailbreaks”. OpenAI also engaged “red teams” to probe the model’s proficiency in scientific domains, for example interpreting images from publications and reading medical images such as CT scans. Importantly, the current GPT-4V iteration isn’t deemed suitable for medical purposes.

User and partner reactions

Some users have remarked on a noticeable delay between questions and answers, much as with other voice assistants. To improve this, they stressed the need for a dataset and model dedicated to conversational turn-taking.

OpenAI’s partners have also been leveraging the new features. Spotify introduced a Voice Translation feature for select podcasts using “OpenAI’s latest voice generation technology”, which emulates the original speaker’s voice. Meanwhile, Mikhail Parakhin, Microsoft’s CEO of Advertising and Web Services, announced DALL-E 3’s arrival in Bing’s image generation tool on X (formerly Twitter). OpenAI also confirmed that ChatGPT’s “Browse with Bing” feature would soon be available to all users.

Lastly, Microsoft has embedded OpenAI’s DALL-E 3 into Bing Image Creator and Bing Chat. This brings higher image quality, better rendering of human features, and closer alignment with user intent, while the system is designed to avoid harmful or inappropriate imagery. Every generated image will now carry a watermark identifying its AI origin and creation time.
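For readers who would rather script image generation than use Bing’s tools, here is a minimal sketch assuming DALL-E 3 is available through OpenAI’s public images API (the “dall-e-3” model name and parameters are drawn from that API, not from this article):

```python
# Minimal sketch: generate an image with DALL-E 3 via OpenAI's images API.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolour painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,  # the dall-e-3 model generates one image per request
)

# The API returns a URL pointing to the generated image.
print(result.data[0].url)
```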

Is your company currently looking to expand and build a team of superstars? We can help with that. Speak to a PL Talents recruitment expert today.