Abstract
As the world increasingly becomes a global village, reliable, smooth, and easy-to-use applications that facilitate communication between people speaking different languages become essential, especially in the tourism industry. While numerous online and mobile applications attempt to bridge the linguistic gap using text-to-text, text-to-voice, or voice-to-text-to-voice translators, they often fall short due to constraints such as the need for a single shared device, manual setup of the speaker’s gender and preferred language, and an inability to communicate from a distance. These applications struggle to mimic the practical nature of real-time multilingual conversations, where immediate and clear communication is paramount. This paper introduces an intelligent peer-to-peer polyglot voice-to-voice mobile application that facilitates communication between people speaking different languages by transparently mimicking a live conversation, whether the involved parties are close to each other or at a nearby distance. People can interact with others using their preferred language, irrespective of others’ languages, while the application automatically recognizes the language, the gender of the speaker, and the spoken words with very high accuracy. Five languages were implemented in the developed application as a proof of concept, and it is designed to accommodate more in future updates.
Introduction
In an era of rapid globalization, the world is metaphorically shrinking each day. With a large increase in cross-cultural interactions and a surge in multilingual communication, there is a growing need for reliable, seamless, and user-friendly applications that can bridge the linguistic divide and facilitate effortless communication among people who speak different languages. These tools are expected to break down language barriers, foster inclusivity, and promote global understanding. Furthermore, such tools can boost the travel and tourism industry, which has become an integral part of human life and contributes significantly to the global economy. According to the World Travel and Tourism Council, “the travel and tourism sector in 2019 was one of the world’s largest sectors, accounting for 10.4% of global GDP (USD 9.2 trillion), 10.6% of all jobs (334 million), and was responsible for creating 1 in 4 of all new jobs across the world. Moreover, international visitor spending amounted to USD 1.7 trillion in 2019 (6.8% of total exports, 27.4% of global services exports)” [1].
Language barriers in tourism can lead to misunderstandings, dissatisfaction, and loss of business opportunities. Tourists often face difficulties in understanding local languages, which can lead to a lack of comprehension and appreciation of the local culture and heritage. This can significantly affect their overall travel experience. For example, language barriers negatively affect the satisfaction and revisit intention of international tourists [2].
Moreover, language barriers can also lead to negative experiences for local residents and service providers in the tourism industry. They often struggle to communicate effectively with tourists, leading to misunderstandings and conflicts. This can result in a negative perception of tourists and tourism, affecting the sustainability of the tourism industry [3].
In addition to affecting the experiences of tourists and service providers, language barriers also have economic implications for the tourism industry. They can lead to a decrease in tourist expenditure, affecting the profitability of tourism businesses and the economic development of tourist destinations.
Several online and mobile applications currently in use aim to simplify translation when people want to communicate directly with each other in real time, through text-to-text, text-to-voice, and voice-to-text-to-voice translators. They represent technological advancements that have significantly contributed to overcoming linguistic barriers. However, a critical analysis reveals that most of these applications have serious limitations. First and foremost, they are unable to facilitate real-time, user-friendly, conversational communication. They are not designed to mimic live conversations, which creates a somewhat artificial and disjointed communication experience and degrades the overall user experience. Furthermore, users typically interact using the same device, or several devices in a non-conversational manner, which in many scenarios requires sharing devices with strangers; this is generally considered undesirable due to privacy, hygiene, and/or safety concerns. Moreover, the involved parties are unable to converse from a distance. As a result, despite their initial promise, these applications have not fully realized the potential to enable smooth and authentic multilingual conversations.
Recognizing the indispensable role of technology in facilitating cross-linguistic communication, this paper introduces a complete solution that overcomes the above-mentioned shortcomings and provides additional features. We design and implement an intelligent polyglot voice-to-voice mobile application that facilitates communication among people speaking different languages, whether the parties involved are in close proximity or at a nearby distance. The key feature of this application is that each user uses his or her own smart device and transparently speaks and listens in his or her preferred language, without paying attention to the languages used by others, in a procedure that effectively mimics a real-time conversation. The application automatically recognizes the language and gender of a speaker, and the spoken words, with high accuracy. Importantly, the application is adaptable and scalable: the initial version supports five languages (English, Arabic, Hindi, Spanish, and Japanese) and is designed to accommodate more in future updates.
This paper introduces the conceptualization, development, and evaluation of this transformative intelligent mobile application, poised to enhance the landscape of multilingual communication in a continuously interactive world in the era of digital ubiquity.
The rest of the paper is organized as follows. In section 2, we provide a brief review of the literature addressing this topic. We then clarify the functionality of the developed application and explain the design and development of its main components in section 3. We then demonstrate the implementation of the application in section 4. For evaluation purposes, performance analysis is conducted and the obtained results are illustrated in section 5. Finally, the paper is concluded in section 6.
Related work
In previous eras, engaging in cross-cultural communication with individuals of diverse nationalities and linguistic backgrounds necessitated the mastery of the interlocutors’ native languages or the adoption of a lingua franca, such as English. The endeavor to acquire fluency in foreign languages often involved relying on translation resources. Traditional print dictionaries, exemplified by esteemed publications such as The Oxford English Dictionary [4] and The Merriam-Webster Dictionary [5], were ubiquitous in households across the globe, serving as indispensable tools for facilitating translation efforts.
In contemporary times, technological advancements have significantly streamlined and simplified translation and communication processes. It is now a rare sight to witness individuals utilizing traditional hard-copy dictionaries for manual, word-by-word translations between languages. Instead, digital dictionaries have largely supplanted these conventional resources. Initially, these dictionaries appeared in digital forms operating on specialized pocket devices [6]. Over time, they have evolved and now primarily function via two mediums: online platforms [7–10] and mobile applications [11–13]. The proliferation of the internet has led to a surge in the use of online dictionaries for word-by-word translations. Services such as Google Translate [14] and DeepL [15] facilitate the translation of entire texts, further expediting written communication processes. Various mobile applications have also been developed to facilitate the communication process by allowing users to comprehend and converse in foreign languages without the necessity of rote memorization [16–18]. These applications enable users to effortlessly translate between a vast selection of languages. Presently, individuals can access their smartphones, browse through numerous translation applications, configure their preferred language pairs, and decide whether to translate text or audio content [19]. However, certain limitations persist, such as accurately pronouncing the phonemes of unfamiliar languages. Some special devices, shaped mostly as pens, have also appeared that read a text by passing the device over a text written in a language and then translating it to any other preferred language; the translated text is then converted to speech [20]. A more recent innovation, predominantly available through mobile applications, encompasses voice-to-text-to-voice translators to mimic multilingual conversations. 
In this context, the interlocutors can either input text or speak in their native language, which is subsequently translated on the device and displayed as text or spoken in the counterpart’s language, streamlining multilingual interactions. Despite the ingenuity of this method, it presents several limitations. First, communication between parties must occur on a single device, which is impractical and raises privacy, hygiene, and/or safety concerns. Second, conversations from a nearby distance cannot be facilitated, which limits the usefulness of these applications. Third, languages must be set manually within the mobile application at the onset of the communication process. Additionally, voice input does not account for the speaker’s gender, resulting in potential discrepancies between the speaker’s gender and the translated output voice. In other words, current mobile applications utilizing this approach lack an automatic gender detection feature. Some of these applications are Google Translate, Translate App, iTranslate Translator, Talking Translator, Translate – Language Translator, Translate All Languages App, Naver Papago, VoiceTra, and SayHi. These applications are available at the Google Play Store [21]. Microsoft also introduced a multilingual conversation feature in its Microsoft Translator application [22], which shares the aforementioned shortcomings except that it allows people to use their own devices and communicate from a distance. However, it has two further shortcomings. First, the involved devices communicate over the internet, so the communication process is bound by internet traffic conditions. Second, the translation appears on the other party’s device only as text and is not played as speech.
Our proposed application overcomes all the above-mentioned shortcomings and effectively mimics a real-time polyglot conversation, either when the involved parties are close to each other or at a nearby distance and without sharing devices.
In the scientific research community, many works have addressed speech translation technology from different aspects [23–30]. However, the works that introduced solutions have either been implemented on a limited scale, restricting their global accessibility and practical utility, or they have remained theoretical without practical implementation. In this work, our objective is to develop a practical and implementable solution suitable for smart devices, such as cell phones, to provide widespread utility, with a particular focus on the tourism sector.
Intelligent mobile application design
In this section, we discuss the development of the mobile application on the Android operating system [31]. We first explain the functionality of the application. Each person uses his or her own smart device and speaks his or her mother tongue, or any preferred language, without paying attention to or even knowing the language of the other side. The application automatically detects the spoken language, the gender of the speaker, and the spoken words; converts the spoken words into text; sends the text from the speaker’s smart device to the smart device of the other party over a secured communication link using Wi-Fi Direct or Bluetooth; and, on the receiving device, translates the text and converts it into speech in a voice of the same gender as the speaker. The other party hears the speech in his or her preferred language, based on the default language already set in the mobile application. All of these steps happen transparently and in less than one second, hence mimicking a real-time conversation between the involved parties. This process is illustrated in Fig. 1.

Polyglot voice-to-voice communication illustration.
The application is based on five main parts: a wireless network through which the involved parties communicate; gender and language detection systems, the former so that the listener hears a voice of the same gender as the speaker and the latter so that any party in the conversation can talk in any preferred language without prior settings; encryption algorithms to secure the connection against eavesdroppers; a speech recognition system to convert speech to text; and a translation system to translate the speech from the speaker’s language to the other party’s language.
The development of each part of the system is discussed separately in the following.
The application can run over two different wireless networks: Bluetooth and Wi-Fi Direct. A person can choose which technology to use when the application runs based on his preferences and/or the availability of the network at the time of running the application. For example, in one scenario, the Wi-Fi module of the smart device may be used in an internet connection; hence, the Bluetooth module can be used by the proposed mobile application. In another scenario, the user connects a headphone or a smartwatch through the smart device’s Bluetooth module; then the Wi-Fi module is used by the application. Please note that in most recent scenarios, mobile internet users connect to the internet using a 3G, 4G, or 5G cellular technology instead of stationary Wi-Fi networks. This makes the Wi-Fi module of a smart device available very frequently when using the application.
We discuss the development of each network in the following.
Bluetooth networking
The Android development platform supports the Bluetooth network stack, allowing devices to connect wirelessly and exchange data. Bluetooth was first introduced in Android 2.0, a.k.a. Eclair, and has been supported in all subsequent versions of Android. The Android platform provides a set of Application Programming Interfaces (APIs) for Bluetooth development, which allows developers to create applications that allow devices to communicate using Bluetooth modules. Through these APIs, a connection can be established and utilized. In order for Bluetooth devices to communicate and transfer data, a connection should be established for the first time using a pairing process. Both devices remain bonded for future communication sessions as long as they are in range and neither device removed the bond.
Wi-Fi Direct networking
The Android development platform also supports Wi-Fi Direct, a.k.a. Wi-Fi P2P, network stack. Wi-Fi Direct was introduced in Android 4.0, a.k.a. Ice Cream Sandwich, and has been supported in all subsequent versions of Android. The Android platform provides APIs that allow developers to integrate Wi-Fi Direct functionality into their mobile applications. The Wi-Fi Direct APIs in Android allow developers to scan for nearby Wi-Fi Direct devices, create Wi-Fi Direct groups, and send and receive data directly over a Wi-Fi Direct connection without the need for a wireless access point or a router.
Language detection and gender recognition
Part of the major functionalities of the proposed application is its ability to recognize the gender of the speaker and detect the spoken language. To reduce the computational burden and save storage and battery resources of smart devices, which are realistic constraints for such devices, we implemented the gender and language recognition functionality in a separate server, outside the application. This server is accessed by the application to pass parameters and also to get results transparently in the background. We will discuss the implementation of these functionalities in the following.
We utilize a framework called Flask [32] to implement the server since Python scripts can run on such servers. We developed Python scripts, residing on the server, that are accessed by the application and executed on the server whenever a request from the application is received. POST and GET requests are used whenever language detection and gender recognition are processed. The steps that the application follows to interact with the server to perform language detection and gender recognition are illustrated in Fig. 2.
For language detection, we develop a comparing script within which English is set as the default language. At the beginning of the communication process between the involved parties, each one has to say the word “hello” in his preferred language. The word is saved as an audio file on the smart device and also transcribed in its English letters form and passed to the server. The transcribed word is then compared with entries of an array containing the word “hello” in all supported languages. Whenever a match occurs, the comparing script returns the language name of the matched word to the application. In spite of the simplicity of the adopted approach, we show in Section 5 the high accuracy achieved when the performance of the application is analyzed.
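The comparing step described above can be sketched as a simple lookup. This is a minimal illustration, not the authors' script: the romanized greetings, the dictionary (used here in place of the array mentioned in the text), the function name `detect_language`, and the choice to fall back to English when nothing matches are all assumptions made for the example.

```python
# Hypothetical romanized forms of "hello"; the paper does not list the exact strings.
HELLO_BY_LANGUAGE = {
    "hello": "English",
    "marhaba": "Arabic",
    "namaste": "Hindi",
    "hola": "Spanish",
    "konnichiwa": "Japanese",
}

# The text states that English is the default language; using it as the
# fallback for unmatched input is an assumption of this sketch.
DEFAULT_LANGUAGE = "English"


def detect_language(transcribed_word: str) -> str:
    """Return the language whose greeting matches the transcribed word."""
    key = transcribed_word.strip().lower()
    return HELLO_BY_LANGUAGE.get(key, DEFAULT_LANGUAGE)


if __name__ == "__main__":
    for word in ("Hola", " namaste ", "bonjour"):
        print(word.strip(), "->", detect_language(word))
```

Normalizing case and whitespace before the lookup keeps the comparison robust to minor differences in how the speech engine formats the transcription.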

Language detection and gender recognition.
For gender recognition, we adopted a machine learning approach using TensorFlow [33], a widely used machine learning platform with Python support that is well suited to audio processing. We developed two scripts to implement this functionality. The first script loads samples from a data file into an array to train the model. Whenever a person uses the application for the first time, the application stores his or her audio sample in the data file, which allows continuous training of the model to increase its accuracy. The second script processes an audio file, which contains the word “hello” mentioned earlier, to detect the speaker’s gender. When called by the application, the script processes both the audio file and a trained model weight file obtained from running the first script. The script extracts features from the audio file and inserts them into a NumPy array. Features are extracted using librosa [34], a Python library for analyzing and processing audio signals. It provides a wide range of functionality for working with audio data, including loading audio files, extracting features, and performing signal processing operations. It supports feature extraction using different methods such as mel-frequency cepstral coefficients (MFCCs), chroma, spectral contrast, tonnetz, and the mel-scaled spectrogram.
After reviewing the literature, we found that the mel-scaled spectrogram suits our purpose of extracting features for gender recognition. A mel-scaled spectrogram is a representation of the power spectrum of an audio signal, with the frequency axis scaled to mimic the non-linear nature of human hearing. This results in a more compact representation of the signal that is more closely aligned with human auditory perception and that differs between male and female voices. After the features are extracted and saved into a NumPy array, the array is passed, along with the trained model weight file, to the predict() function in TensorFlow to predict the speaker’s gender. The predict() function returns a value between zero and one, which represents a probability. If the value is higher than 0.5, the predicted gender is male; otherwise, the predicted gender is female.
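The decision rule applied to the model output can be sketched as follows. The function name `classify_gender` is hypothetical, and reporting a confidence as the probability of the chosen class is an assumption of this sketch; the paper does not state how its confidence level is computed.

```python
from typing import Tuple


def classify_gender(probability: float, threshold: float = 0.5) -> Tuple[str, float]:
    """Map a model output in [0, 1] to a gender label and a confidence.

    Values strictly above the threshold are labeled "male", all others
    "female", following the rule described in the text. The confidence
    (probability of the chosen class) is an assumption of this sketch.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability > threshold:
        return "male", probability
    return "female", 1.0 - probability


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.25):
        print(p, classify_gender(p))
```

Note that a value of exactly 0.5 falls on the "female" side, because the rule in the text only assigns "male" to values higher than 0.5.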
The Rivest–Shamir–Adleman (RSA) asymmetric encryption algorithm is implemented to secure the connections in our App. RSA is widely used for secure data transmission and digital signatures in modern cryptography. The algorithm is based on the mathematical properties of prime numbers and modular arithmetic. It derives its security from the difficulty of factoring large composite numbers into their prime factors, which is considered computationally infeasible for sufficiently large numbers. Asymmetric encryption, also known as public key cryptography, uses two different but mathematically related keys for encryption and decryption, known as the public key and the private key. Each device generates an RSA key pair, consisting of a public key and a private key, at the beginning of a communication process. The two devices exchange public keys immediately after the connection is established and prior to any further communication. Any data sent over the connection, including gender and language information, is encrypted using the other device’s public key before transmission. The other device decrypts the received data using its private key. The procedure is illustrated in Fig. 3.

Encryption and decryption process.
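The key exchange and encryption flow can be illustrated with a textbook RSA sketch. This is for illustration only and is not the implementation used in the App: the primes are tiny, no padding is applied, and each byte is encrypted separately, so it offers no real security. The function names and the sample message are hypothetical. The modular inverse via `pow(e, -1, phi)` requires Python 3.8 or later.

```python
from typing import List, Tuple

Key = Tuple[int, int]  # (exponent, modulus)


def generate_key_pair(p: int, q: int, e: int = 17) -> Tuple[Key, Key]:
    """Build a toy RSA key pair from two primes (far too small for real use)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # private exponent: modular inverse of e mod phi
    return (e, n), (d, n)


def encrypt(message: str, public_key: Key) -> List[int]:
    """Encrypt each UTF-8 byte with the receiver's public key (no padding)."""
    e, n = public_key
    return [pow(byte, e, n) for byte in message.encode("utf-8")]


def decrypt(ciphertext: List[int], private_key: Key) -> str:
    """Recover the text with the receiver's own private key."""
    d, n = private_key
    return bytes(pow(c, d, n) for c in ciphertext).decode("utf-8")


if __name__ == "__main__":
    # Each device generates its own pair, then the public keys are exchanged.
    pub_a, priv_a = generate_key_pair(61, 53)
    pub_b, priv_b = generate_key_pair(67, 71)

    # Device A encrypts with B's public key; only B's private key decrypts it.
    sent = encrypt("hola", pub_b)
    print(decrypt(sent, priv_b))
```

A production system would rely on a vetted cryptographic library with large keys and a padding scheme such as OAEP rather than hand-written arithmetic.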
For speech recognition, Android’s built-in speech recognition feature is modified to listen continuously for a user’s speech instead of its default operation of running only once. It is also forced to use Google’s Speech-to-Text Engine, which provides several notable features and benefits: high recognition accuracy, owing to its extensive training on a diverse range of data sources and languages; multilingual support, covering over 125 languages and dialects; real-time processing, as it is designed to support real-time speech recognition; and noise robustness, as it can handle noisy environments and different audio qualities, providing reliable recognition even in suboptimal conditions. For these reasons, we found it to be a good fit for our application and its intended environment. Most Android-based devices have this engine installed with the operating system by default.
While the App is running, after a user initiates a connection with another user and the language and gender are detected as described in the previous sections, the Speech-to-Text Engine listens for speech in the detected language. Whenever the user finishes saying a sentence, the transcribed text is encrypted and then sent to the other device. The Speech-to-Text Engine then automatically runs again to listen for new spoken sentences.
Translation and text to speech
In our App, Google Translate is used to translate from the sender’s language to the receiver’s language. Google Translate is one of the most popular and widely-used translation services in the world. It uses neural machine translation models to provide more accurate and natural-sounding translations and supports translation between more than 100 languages. However, Google Translate Library is not supported in Android Java by default. Fortunately, there are many open-source projects posted on GitHub that implemented it for Android. Mannar Mannan’s project [35] is used for this purpose due to its simplicity.
After integrating the code with our developed App, the translation functionality is enabled. The Google Text-to-Speech Engine is used to convert the translated text into audio heard by the listener. It is configured to check and adopt the default language set on the user’s device so that the speech is produced in the appropriate dialect.
In this stage, after a speech is transcribed and sent over the connection as text to the other device, the text is translated from the sender’s language to the receiver’s language on the receiving device. The translated text is then passed to the Text to Speech Engine to convert the translated text into speech.
Implementation and demonstration
The application begins by prompting the user to vocalize their name, initiating user identification. Following this, the user is asked to select a network type, either Bluetooth or Wi-Fi Direct, to facilitate communication. The device then commences the discovery process, scanning available communication channels and displaying a list of discovered devices. The user is then prompted to select a device to connect with. Once a device has been selected, a connection request is sent. If the recipient accepts the connection, an encryption key exchange occurs between the two devices to ensure secure communication. If the connection request is declined, the application returns to the discovery process, prompting the user to select a different device.
Upon successful connection completion, the application proceeds to detect the user’s preferred language and gender. The user is prompted to say “Hello” in their preferred language, and the recorded audio signal is sent to a server for analysis. The detected language and gender are then displayed to the user.
Next, the application initiates the exchange of information between the connected parties. The user’s name, detected language, and gender are sent to the connected device, and the application listens for the corresponding information from the other user. Once this exchange is completed, the application enables real-time translation and conversation. The name, language, and gender of the other user are displayed, along with the time of connection and icons for talking, ending the connection, and clearing the chat.
As long as the connection remains active, the users can engage in a conversation. When the talk icon is pressed, the application listens to the user’s speech, converts it into text, and displays the text in the chat area. The text is then encrypted and sent to the other user’s device. Simultaneously, the application listens to the connected device for incoming data. Upon receiving encrypted text, it decrypts and translates the text into the user’s language. The translated text is then converted into audible speech and displayed in the chat area under the other user’s name.
In case the chat needs to be cleared, pressing the clear chat icon will remove the current chat display. If the end connection icon is pressed, the application will return to the network selection stage, allowing the user to initiate a new conversation with a different device. When the application is closed, all processes cease, and the application is terminated.
The overall application functionality is shown in Fig. 4, and the mobile application is illustrated in Fig. 5.

Overall application functionality.

Application illustration.
In this section, we analyze the performance of the application from many aspects as follows:
Overall delay time
Since the main functionality of our proposed application is to provide polyglot conversations between humans, the time difference between the moment a person finishes speaking at the transmitter side and the moment the translated speech is heard by the other person at the receiver side should be small enough to mimic real-time conversations. We define this time difference as the overall delay time. Several tests were performed at different distances between the involved parties, up to a maximum of 30 meters, as shown in Table 1. The results show that the delay increases slightly with the distance between the devices when using Wi-Fi Direct, while it increases significantly when using Bluetooth. This is due to the longer range, higher bandwidth, and higher transmission speeds that Wi-Fi Direct provides over Bluetooth. In general, average overall delay times of 0.712 seconds and 0.411 seconds are obtained when using Bluetooth and Wi-Fi Direct, respectively. It is worth mentioning that the bottleneck of the delays is the medium access mechanisms of the technologies, which retransmit collided or lost frames.
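The definition of the overall delay time can be expressed as a short calculation. The sketch below is only illustrative: the function names are hypothetical, and the timestamps in the usage example are made-up values in milliseconds, not measurements from the paper.

```python
from statistics import mean
from typing import Iterable, Tuple


def overall_delay_ms(speech_end_ms: int, playback_start_ms: int) -> int:
    """Delay between the speaker finishing and the listener hearing the translation."""
    if playback_start_ms < speech_end_ms:
        raise ValueError("playback cannot start before the speech ends")
    return playback_start_ms - speech_end_ms


def average_delay_ms(trials: Iterable[Tuple[int, int]]) -> float:
    """Average the overall delay over (speech_end_ms, playback_start_ms) pairs."""
    return mean(overall_delay_ms(end, start) for end, start in trials)


if __name__ == "__main__":
    # Hypothetical timestamps for three trials at one distance.
    trials = [(1000, 1400), (5000, 5450), (9000, 9350)]
    print(average_delay_ms(trials), "ms")
```

Working in integer milliseconds avoids floating-point rounding when subtracting timestamps; both timestamps would need to come from synchronized clocks, or from a single observer, for the difference to be meaningful.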
Overall delay time
Therefore, from the obtained results, it can be concluded that running a conversation over a Wi-Fi Direct network mimics a real-time conversation better than running it over a Bluetooth network. Still, an overall delay time of less than a second is very acceptable for human interactions. Supporting Bluetooth in the application remains recommended for several reasons: the Wi-Fi module may be unavailable at the time of using the application, as explained in Section 3; Bluetooth modules are more widely implemented in smart mobile devices than Wi-Fi Direct modules; and Bluetooth modules consume less power than Wi-Fi Direct modules. The last point is very important, as power consumption is one of the most important realistic constraints for smart mobile devices.
It is worth mentioning that Bluetooth Ver. 5 was used in our implementation, so older versions of Bluetooth may provide different results.
The speech recognition system of Android Red Velvet Cake (Ver. 11), as used in the application, was evaluated on three languages: English, Spanish, and Arabic. A sentence was recited in each language 30 times, and the average number of correctly recognized words out of the total was recorded for each language. The sentences recited in English, Arabic, and Spanish contain 15, 8, and 12 words, respectively. As shown in Table 2, the average accuracies for the sentences recited in English, Arabic, and Spanish are 100%, 87.5%, and 91.7%, respectively.
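The reported percentages are consistent with a simple word-level accuracy, that is, correctly recognized words divided by the total number of words in the sentence. In the sketch below, the function name is hypothetical, and the counts of correct words for Arabic and Spanish (7 and 11) are inferred from the reported percentages rather than stated in the text.

```python
def word_accuracy(correct_words: float, total_words: int) -> float:
    """Percentage of correctly recognized words, rounded to one decimal place."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= correct_words <= total_words:
        raise ValueError("correct_words must lie between 0 and total_words")
    return round(100.0 * correct_words / total_words, 1)


if __name__ == "__main__":
    # Sentence lengths come from the text; the average correct counts
    # for Arabic and Spanish are inferred from the reported accuracies.
    for language, correct, total in (("English", 15, 15), ("Arabic", 7, 8), ("Spanish", 11, 12)):
        print(language, word_accuracy(correct, total))
```

Under this reading, the Arabic and Spanish results each correspond to about one misrecognized word per sentence on average.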
Speech recognition – evaluation results
Since we use Google Translate for the translation part, the translation accuracy of our application is identical to that of Google Translate.
Google Translate was first introduced in 2006, employing a statistical Phrase-Based Machine Translation (PBMT) model. In 2016, Google announced a switch to the Google Neural Machine Translation (GNMT) model for its translation service. The introduction of GNMT significantly improved the quality and accuracy of Google Translate. In [36], Google researchers analyzed translation performance on four languages: English, Spanish, French, and Chinese. They translated random samples from Wikipedia and news websites. Comparing the older PBMT model with the newer GNMT model, they reported the following improvements: 87%, 64%, 58%, 63%, 83%, and 60% for English to Spanish, English to French, English to Chinese, Spanish to English, French to English, and Chinese to English, respectively. They concluded that the GNMT model reduced translation errors by more than 60% compared to the PBMT model on these major language pairs. The authors of a more recent study [37] conducted a comprehensive evaluation of the new model using 51 languages. Their results were consistent with those obtained in [36]. Many recent studies also evaluated Google Translate on several languages and reported accuracies of more than 70%, as discussed in [38].
Language detection and gender recognition accuracies
To evaluate the performance of the language detection and gender recognition implementations, fifteen males and fifteen females were chosen randomly and asked to send five samples of the word ‘Hello’ in five different languages: English, Arabic, Hindi, Spanish, and Japanese.
The results obtained for gender recognition are shown in Table 3. As illustrated in the table, the application recognized the gender with a score of 100% and an average confidence level of 94.2%. The machine learning model used in the application therefore performed very well. The results obtained for language detection are also given in Table 3. The accuracy was 100% for all languages except Arabic, for which an accuracy of 93.3% was obtained. It is worth mentioning that, although the people who provided the samples are all Arabs and cannot be expected to pronounce the other languages as native speakers do, the system was still able to detect every language except Arabic perfectly. As illustrated in the last column of the table, the language detection and gender recognition implementations also performed very well on samples containing background noise and on low-quality audio samples.
Language detection and gender recognition – evaluation results
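The conclusion of this paper describes language detection as comparing a transcribed “hello” against an array of “hello” in different languages. A minimal Python sketch of such a lookup is given here for illustration. It assumes the transcription is already available as a text string. The greeting strings, the language codes, and the names `GREETINGS`, `normalize`, and `detect_language` are hypothetical, since the exact strings matched by the server are not listed in the paper.

```python
import unicodedata

# HYPOTHETICAL greeting table: one common greeting per supported language.
# The actual strings used by the application's server are not specified.
GREETINGS = {
    "hello": "en",
    "hola": "es",
    "مرحبا": "ar",
    "नमस्ते": "hi",
    "こんにちは": "ja",
}


def normalize(text):
    """Normalize Unicode form and case, then trim spaces and common punctuation."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return text.strip(" .,!?¡¿")


# Normalize the keys once so that lookups compare like with like.
_LOOKUP = {normalize(word): code for word, code in GREETINGS.items()}


def detect_language(transcript):
    """Return the language code of a transcribed greeting, or None if unknown."""
    return _LOOKUP.get(normalize(transcript))


print(detect_language("Hello"))
print(detect_language("¡Hola!"))
print(detect_language("bonjour"))
```

An exact-match lookup of this kind is cheap and easy to extend with new languages, but it returns no result when the transcription deviates from the stored greeting, which is one way a misdetection can arise.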
We conducted a survey in which twenty-five people from different ethnic backgrounds were asked whether, when communicating with people who do not speak their language, they would prefer to use a multilingual application on a single shared device or on separate devices. Twenty-two participants preferred to use their own devices without sharing them, owing to privacy, hygiene, and/or safety concerns, while three participants did not mind sharing devices with others. To the best of our knowledge, Microsoft Translator is the only application that provides a multilingual conversation feature and allows people to use their own devices and communicate from a distance; we therefore demonstrated both SayHello and Microsoft Translator to the participants. They were asked about the following: the graphical user interface (GUI) of both applications, the ease of use of each, the translation accuracy, the connection type (whether a direct connection or a connection through the internet is preferred), the number of provided languages, and the gender detection. The results are as follows. 76% of the participants preferred the GUI of Microsoft Translator over that of SayHello, while the rest had no preference. All of them agreed that both applications are simple, easy to use, and accurate enough for multilingual conversations. When traveling, all of the participants preferred the Bluetooth or Wi-Fi Direct connection that SayHello provides over consuming their 4G/5G data plans to connect to the internet, as Microsoft Translator requires. 56% of them had no preference regarding the connection type when using the applications in their hometowns. All of them also had no preference regarding the connection type when free Wi-Fi networks are available.
It is important to emphasize that our application mainly targets the tourism sector, where tourists communicate with strangers in the streets and public places of the countries they visit, where free Wi-Fi networks are mostly unavailable. Moreover, cellular companies around the world usually offer tourists 4G/5G data plans with expensive rates and limited data caps. All of the participants agreed that Microsoft Translator outperforms SayHello in the number of languages it supports. The participants praised the gender detection feature that SayHello provides and said they were comfortable hearing a voice that matches the gender of the speaker.
Conclusion and future work
In this paper, we presented SayHello, the mobile application we designed and built to facilitate real-time translation in polyglot conversations. The application was designed to recognize the user’s language and gender, thereby tailoring the conversation to the specific user’s preferences. Using Wi-Fi Direct and Bluetooth technologies, the application establishes a connection between two users, enabling them to communicate seamlessly in their preferred languages whether they are in close proximity to each other or at a distance. The application’s core functionalities include language detection and gender recognition. To optimize computational resources and preserve the battery life of smart devices, these functionalities were implemented on a separate server that the application accesses. The server analyzes the user’s recorded voice and returns the results to the application. Language detection is performed by comparing a transcribed “hello” against an array of “hello” in different languages. Despite its simplicity, this approach proved highly accurate: in our tests, the language detection accuracy was 100% for all languages except Arabic, for which an accuracy of 93.3% was obtained. For gender recognition, we employed a machine learning approach using TensorFlow. The application was able to recognize the gender with an accuracy of 100% and an average confidence level of 94.2%. We modified Android’s built-in speech recognition feature to listen continuously for user input. Google’s Speech-to-Text Engine is leveraged for its high recognition accuracy, multilingual support, real-time processing, and noise robustness. On average, the accuracies for the sentences recited in English, Arabic, and Spanish in our evaluation were 100%, 87.5%, and 91.7%, respectively.
We utilized Google Translate to convert the sender’s language to the receiver’s language and the Google Text to Speech Engine to convert the translated text into audio waves, providing an integrated real-time translation experience. Because the application uses the same service, the accuracy of its translation feature is equivalent to that of Google Translate. We also secured conversations by implementing the RSA asymmetric encryption algorithm in our application to encrypt data sent over the wireless channels.
In summary, by leveraging contemporary technologies, SayHello provides a user-centric solution that simplifies cross-cultural communication. The application’s performance metrics demonstrated its effectiveness, revealing high accuracies. Future work will investigate expanding the application’s language support, as the first implementation supports only five languages. In principle, the application can accommodate all other languages that are supported by Google. In addition to Google Translate, we are currently considering the use of large language models such as GPT-4 for translation, owing to their significant capabilities in such tasks. We are also considering incorporating additional user customization features and implementing the application on iOS. Despite these potential avenues for future development, SayHello, in its current version, stands as a robust, innovative tool for facilitating polyglot conversations, thereby bridging linguistic barriers and fostering global understanding.
During the preparation of this paper, the authors used OpenAI’s ChatGPT [39] and Grammarly [40] in order to improve readability and language. The authors take full responsibility for the content of the publication.
