Abstract
This paper briefly presents the linguistic theoretical principles in Portuguese-speaking countries that do not enhance the representation of most local languages in digital space, in national space, and particularly in the formal public space of teaching and learning. It proposes that theoretical linguistic thought in Brazil works preferentially with monolingual situations, erasing multilingual reality, and mainly with written works, erasing oral history and oracy, as a result of theoretical choices. It is this understanding of language as an object recognized in a monolingual environment that leads to a lack of models for dealing with multilingual environments, especially in the digital world. The paper presents data on the languages living in the world today and on those that are part of the world wide web. From these data, it considers the notion of big data in relation to the linguistic reality of multilingual countries that have Portuguese as their official language, particularly considering the activities of study and production of knowledge. We reflect on our Portuguese-speaking official and written reality, where the big data is located, and on the possibilities of working with local languages inspired by the idea of polyphony, in order to promote the representation of local languages in the digital world.
Presentation
As we learnt in Linguistics studies in Brazil, at the beginning of modern Linguistics, in 1915, Ferdinand de Saussure presented two possibilities to approach language: 1) an abstract system (langue), and 2) a process of speech, the talking-listening process, the manifestation of language as a psycho-social practice (parole).
“En séparant la langue de la parole, on sépare du même coup: 1
Traditional language technologies in general are derived from the intuitive perception of language structured as written language, and they are part of our daily life as objects nowadays considered natural in the Western cultivated pattern. Among linguistic tools in particular, we find grammars, encyclopedias, dictionaries, etc. The whole digital world can be considered a wide linguistic tool, with special emphasis on automatic text analysis and automatic translation technologies, all of them based on the technology of writing.
Written technologies permit many communities to register their cultural treasures in literature. Projects promoting literacy are found nowadays as basic needs to access information and to enhance citizenship in every nation. Also there is a considerable quantity of languages which do not have a written form, and this fact is usually underestimated in the world of linguistic studies (colonized by Europeans) geared to value written works, and to define intellectual capabilities based on the skills of reading and writing ideas.
The languages that do not have a written form have developed different ways to express, understand and publish their own ideas. In this case their cultural treasures are inherited through what Pio Zirimu called “orature” (Zirimu, 1973). Orature is the twin of literature: as oral cultures develop their orature, written-based cultures develop their literature. In the same manner that the promotion of literacy is important for written-based cultures, the promotion of oracy is a basic need in a community that does not share an interest in written technologies. Promoting oracy is also important for gaining access to languages, cultures and knowledge which are in the oral tradition. A big part of the languages of the globe (nearly half of them) belong to this group, and the skills necessary to inherit and promote oracy are usually forgotten in national public policies and in the development of language technology industries. This is yet to be developed at the international level, if we really want to propose solutions for the two thirds of the languages of the world at risk of disappearance within the next two generations. How can digital culture embrace oral cultures, considering the big data paradigm and the wrong assumption that all information is online? Will information and communication technologies contribute to the disappearance of all languages which do not have the conditions to integrate into the digital world? Are the Internet and information and communication technologies just another layer of exclusion? Are the new technologies in fact a new lack of conditions for traditional languages to exist?
If we consider access to information, books, publishing houses, the press, libraries, museums, telecommunications and the world wide web are means of disseminating knowledge and culture, and they embody ideas of what can be developed to keep languages, cultures and knowledge alive and freely accessible to the public. The appearance of digital tools in our life, and for some of our languages, permits us to re-create and to re-think the history of our cultures and of our countries. The perspective proposed here is that language technologies might be an interesting reference for debating the processes which involve the local creation and use of those “linguistic tools”. In other words, it could help us to comprehend how our national linguistic policies respond to them in the national environment, in order to understand where current linguistic policies are leading us.
When we consider multilingualism in post-colonial context, and multilingualism in digital space, meaning digital inclusion of languages (and local communities) in information and communication technologies, in Post-colonial Portuguese speaking countries the idea of multilingual big data1
In Portuguese there are very few automatic analyzers for texts with morphological, syntactic and semantic layers, because there is practically no investment in automatic text analysis and in big data analysis (cf. the important work of the Inter-institutional Center for Computational Linguistics (NILC) at the University of São Paulo (USP)). But if the text cannot be represented in Unicode, which is the case for many traditional peoples' languages, that is, if the language is not registered with the Unicode Consortium, even a statistical analysis of occurrences is not viable.
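To make concrete what "a statistical analysis of occurrences" presupposes, the following is a minimal sketch in Python that counts word occurrences in a Unicode-encoded string. The function name `token_frequencies` and the sample sentence are hypothetical and are not taken from any corpus discussed here; the sketch only assumes that the text is already representable in Unicode, which is precisely the condition that many traditional languages do not meet.

```python
import unicodedata
from collections import Counter


def token_frequencies(text: str) -> Counter:
    """Count word occurrences in a Unicode string (illustrative sketch only)."""
    # NFC normalization makes "í" typed as one code point and "i" followed by a
    # combining acute accent compare as the same character.
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = []
    for raw in normalized.split():
        # Keep letters (L*), combining marks (M*) and digits (N*); drop punctuation.
        token = "".join(
            ch for ch in raw if unicodedata.category(ch)[0] in ("L", "M", "N")
        )
        if token:
            tokens.append(token)
    return Counter(tokens)


# Hypothetical sample: the second "língua" is written with a combining accent.
sample = "A língua é viva. A li\u0301ngua muda."
freqs = token_frequencies(sample)
print(freqs.most_common())
```

Even this simple count depends on every character of the writing system having a code point; when a character is missing from the encoding, the text cannot be stored faithfully and nothing can be counted.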
As we learn in Linguistic studies, Ferdinand de Saussure (Saussure, 1915), the creator of modern linguistics, has two strong metaphors to express the abstract object “language”. One of them is a chess game, a closed system that is related to the idea that “In language there are only differences, and no positive terms.” In the original French text, Saussure writes:
“Tout ce qui précède revient à dire que dans la langue il n’y a que des différences. Bien plus: une différence suppose en général des termes positifs entre lesquels elle s’établit; mais dans la langue il n’y a que des différences sans termes positifs. Qu’on prenne le signifié ou le signifiant, la langue ne comporte ni des idées ni des sons qui préexisteraient au système linguistique, mais seulement des différences conceptuelles et des différences phoniques issues de ce système.” (Saussure, 128:2015) “Everything that has been said up to this point boils down to this: in language there are only differences. Even more important: a difference generally implies positive terms between which the difference is set up; but in language there are only differences without positive terms. Whether we take the signified or the signifier, language has neither ideas nor sounds that existed before the linguistic system, but only conceptual and phonic differences that have issued from the system.” (Saussure, 120:2015)
This enables us to show the relationship between signifier and signified which, through an arbitrary relation, constitutes the sign. These are abstract notions that permit the systematization and serialization of a language as a closed system, like a chess game.
This definition of language, with elements structured as differences, is related to a specific understanding of a system, namely a closed system. In a closed system there is no place of contact with another system, in this case no contact with another language. The system is studied from within, and there is nothing besides that. This understanding of a closed system as the way to approach the object “language” is currently applied in our country (Brazil), and it promotes a complete isolation of each linguistic system in linguistic studies. This was a viable model in 1915. Nowadays, in 2018, in complex system studies, it is very rare to propose a model of a system which is not open to the environment, or let us say a system which is not in contact with other systems. In other words, linguistic studies are, in general, developed within a monolingual closed system, because they are bound by a theoretical model of a system which is closed. The difficulty here is that the more these interesting concepts that permit abstraction are at work in linguistic studies, the more the unity of each language and the frontiers among languages are asserted, because they are established on the metaphor of a unique and closed system. Theoretical linguists, specialists and the schooled population in general work with this understanding of language. It is time to open the model to new possibilities. In short, we could say that all the technology that we use has the monolingual reality embedded in its creation. It does not carry strategies to deal with a multilingual environment. For us, interested in the participation of all existing languages in the world wide web as well as in our social life, this is a problem.
So how can we deal with multilingualism? How can we work with different linguistic systems in the era of big data, if the object language is currently shaped as an abstract system or, to state it clearly, if there is an implicit assumption that language is a unified system living in its own monolingual environment? The truth is that the real world of languages is quite different from this theoretical endogenous abstraction.
Another way to envisage this is to think from the ground up, empirically, in a direction different from Saussure’s proposal. Our goal is to propose a way to look at a linguistic system where polyglossia is viable offline and online, and to allow us to think of language as the abstraction and systematization of the act of speech. Such a perspective opens the debate about linguistic unity, linguistic plurality and the actual frontiers between one language and another.
We understand that the idea of democratic access to information and knowledge implies social and digital inclusion for all, freedom of expression, and it should include – in the name of the democratic access to information – all living languages in all current technologies. For someone from the linguistic field, the idea of the Information Society concerns freedom of expression, freedom of publication, freedom to circulate one’s ideas. This means to give access to people that usually do not participate in the mainstream cultural life of a country, possibly occupying subaltern roles in society. The availability of means of intellectual production, such as publishing houses, or media labs to produce radio, TV and cinema programs are also related to the idea of speech. A consistent training in new technologies, digital literacy would be a strong key, as well as consistent training in literacy and/or oracy skills.
So the speech act in digital media should be a possibility for virtually anyone. It is a very simple affirmation. That is where our problems start, because we are talking about voices of different cultures being considered as equivalent speeches on the same national ground, and about different languages being present in the media and especially on the Internet. Generally, in the history of linguistic ideas as taught in the Brazilian academy, the privileged language was (and still is) the language of the ex-colonizer and the language of the neo-colonizer; in the ex-colonies of Portugal, for instance, that is the current situation. Portuguese is, in general, the only national language, and the traditional languages and the languages of immigrants do not participate in public life. Brazilians, for instance, are not allowed to be schooled in any language other than Portuguese. TV, radio and the Internet are bound by law to broadcast programs in Portuguese, and all government public services are only in Portuguese. It is a linguistic policy prevailing in Brazil which consolidates the erasure of around 300 traditional languages still alive today in the country.
The language that received the human and administrative resources, and in which the educational material was designed to be produced, is the national language, which is also the language of the national elite. The educational system in our country is particularly impacted by this political relation, that is, by national linguistic policies.
In this sense it is important to determine what the real possibilities are for the inclusion of the local languages in information and communication technologies, and our share of responsibilities in this situation. Will big data be available for the so called minor languages? Or is it big data for dominant speeches only?
Local content, linguistics as a science for mutual well-being
Producing local content implies local language, local culture and knowledge availability for the community, and the possibility of circulation of this knowledge. Otherwise we would be creating a simulacrum of a pattern of knowledge production and knowledge diffusion that is not necessarily able to manifest the best contributions of any specific local language, society and/or culture.
I have envisaged a free and common digital library in multilingual basis, based on the studies of linguistic change of Bakhtin (1984), and on the notion of polyphony.2
I would like to thank Fernando Rosa Ribeiro for so many dialogical debates and his endless energy to formulate with us the multilingual digital library proposal.
In this sense the idea of local content production proposed here is on the brink of novelty, and databases as well as big data should, or could, follow these principles in order to agree better with actual linguistic reality and with the idea of an open complex system. This means taking much more into account, in particular the long-term effect that an apparently neutral theoretical choice has in hastening the disappearance of traditional languages and cultures in our society. The ethical aspects of a theoretical choice should also be taken into account when working within the life of a traditional community. As researchers in higher education institutions we legitimate approaches to theoretical objects such as language. These understandings are part of the root of the difficulty of considering the polyphonic nature of language in our region.
On the other hand, a real work of this nature cannot be proposed without considering the debate with traditional societies and indigenous peoples in our work, as well as our permission for them to shake our traditional colonial principles and to bring to academic ground the true nature of the linguistic interaction that we live in. This linguistic permeability in Brazil as Ilari (2005) shows us, and the available linguistic diversity are better represented based on an epistemological option of a partner community participating in the academic debate.
There is another emerging society, which is formed by other social subjects and by operating systems (computational systems). Following Simondon (2005), we consider operating systems and computers as social subjects, with ‘whom’ we interact very frequently through communication and information processes. This social group proposes and permits different subjective interactions, with them and through them. The main idea is to work with this group as well, creating a digital library, a database, with speaking subjects (of different natures) that experience their speaking capabilities through a variety of phonies.
These phonies are here comprehended as repertoires of linguistic character that exist simultaneously and in a gradual manner, without necessarily the presence of well delineated and well defined linguistic frontiers. The objective of the presentation of a polyphonic database here is to create the possibility of a polyphonic knowledge database that permits the comprehension and an integrated digital experience of this myriad of repertoires, and to facilitate the navigation among them. Most of all, an open system which permits the idea of a multilingual societal activity in digital world, and a broader understanding of what big data could be, considering mother tongues in our region, and in the planet.
From Ethnologue, accepted number of languages alive today.
According to Ethnologue (2018), we have a world population of 7,349,570,760 people today and 7,097 living languages. “Five hundred and eighty languages are classified as ‘institutional’, 1,590 are considered as being ‘in development’, 2,446 are classified as ‘vigorous’, 1,559 are considered as being ‘at risk of disappearance’ while 922 are considered as ‘dying’.” From there we can ask ourselves whether all these languages, or even some information generated in these languages, are available on the world wide web or present in digital space.
Let us recall that we access the digital space through an electronic or digital device, by means of a keyboard. Using a keyboard necessarily implies that the language has a written form, and therefore is not a language exclusively of oral transmission. Approximately 3,188 languages have no writing attributed or recognized as such; therefore they are still unable to participate in a worldwide network of computers that is based on the exchange of information as combinations of written characters. In short, amid the mass consumption of written technology, we are in a situation where we can easily forget the importance of the oral transmission of knowledge. In a way, by assuming that knowledge and language are basically transmitted in written form, we become co-responsible for the collective erasure of orally transmitted knowledge in the digital space. With the illusion that all information is online, we are preparing a generation of people unable to deal with, or even to recognize, oracy skills, as well as enhancing the invisibility of nearly half of the linguistic diversity of the planet.
Considering that, of 7,097 languages, 3,188 have no written form, only 3,909 can be online. So, even on Ethnologue’s data, we could potentially have online content in at most 3,909 languages, those that have characters and may be encoded for use in the digital space.
In the Unicode Consortium there is a set of approximately 600 languages that are covered in terms of character set registration. The alphabets of the writings of natural languages, and their diacritical signs, are understood by electronic and digital apparatus as characters. So the alphabets that are supported as possible keyboards in the digital space number around 600. This is our real horizon of possibilities today for the online dissemination of information and communication. Wikipedia is officially open for editing in 288 languages but, as we could expect, the 287 other languages do not have the same level of editing activity as English.
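The figures quoted above can be checked with a short calculation. The following is a minimal sketch in Python that uses only the numbers given in the text (Ethnologue 2018, the approximate Unicode count and the number of Wikipedia editions); the variable names and the `share` helper are illustrative, and the figure of 600 is an approximation taken from the text.

```python
# Figures as quoted in the text; the Unicode count is approximate.
ETHNOLOGUE_STATUS = {
    "institutional": 580,
    "in development": 1590,
    "vigorous": 2446,
    "at risk of disappearance": 1559,
    "dying": 922,
}

living_languages = sum(ETHNOLOGUE_STATUS.values())  # total of the five classes
unwritten = 3188                                    # no recognized written form
written = living_languages - unwritten              # could in principle be online
unicode_supported = 600                             # approximate, from the text
wikipedia_editions = 288


def share(count: int, total: int = living_languages) -> float:
    """Fraction of all living languages represented by `count`."""
    return count / total


funnel = {
    "living languages": living_languages,
    "with a written form": written,
    "supported in Unicode (approx.)": unicode_supported,
    "open for editing on Wikipedia": wikipedia_editions,
}

for label, count in funnel.items():
    print(f"{label:<34}{count:>6}  {share(count):6.1%}")
```

Read as a funnel, the calculation shows each technical layer narrowing the set of languages that can take part in the digital space, from roughly half of the living languages at the level of writing to well under a tenth at the level of character encoding.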
Furthermore, according to the open standards of software, the set of languages that software packages support openly, in a way that can be implemented by the user at the time of installation, varies from 24 to 100 languages. This means that ideally the users can extend the use of open software to up to one hundred languages. This type of installation, however, relies on the user’s knowledge of the default language of installation (usually English), and the user must also know enough about the installation process to be aware that the language of interaction can be chosen and customized. These many elements of knowledge connected to the installation and use of the functionalities of a piece of software are not always present in the knowledge framework of the users, nor is the bilingualism needed to switch from the language of the manual and of the installation to one of these 100 available languages.
When it comes to browsing online… it is useful to find the content in your native language:
“The famous engine [Google] that recognizes 30 European languages, recognizes only one African language and no indigenous American or Pacific languages.” (Prado, 42: 2012)
Within these frameworks for the use of languages in digital space, when thinking about knowledge production, one must consider in which languages the study materials and the online content are available. Also, the method used to count the languages of online users records the language in which the user is accessing the Internet, which is not necessarily the language in which the user is fluent. The registered languages are therefore not necessarily the mother tongues or the preferred languages of the speakers, but rather the languages that they use to navigate online, or the content that they can access as a second or third language, in many cases for lack of choice. Table 1 below, from Internetworldstats (2018), gives the top ten languages used on the web.
From Internetworldstats (2018) the most used languages to access internet.
Map of Portuguese speaking countries and the administrative region of Macau in China.
The Portuguese-speaking world is formally composed of the countries that have Portuguese as their national language, i.e. Angola, Guinea-Bissau, Brazil, Cape Verde, East Timor, Equatorial Guinea, Mozambique, Portugal. We can also add Macau in China and Goa in India, which still cultivate the presence of the Portuguese language in their public policies. In the world map below, we show the production of content in the national language, which is very modest in quantity considering the size of the population, and the absolute lack of content production in local languages in the digital world in this group of countries and regions.
“It is in my political interest to join forces with those Marxists who would rescue Marxism from its European provenance” (Spivak, 1998: 216)
In terms of the availability of online content in any language the framework is slightly different, because either there is a national policy of content digitization or there is not. In the case of Brazil, every public university should make available all its production of master’s dissertations and doctoral theses online, free of charge. Thanks to the efforts of the Library of Theses and Dissertations (BDTD) of the Brazilian Institute of Information in Science and Technology (IBICT) <
In order to enter this debate seriously, we need to consider the literacy level of the population in general, the degree of media and information literacy, the economic value of an internet access line, the economic value of the device that will allow access to the internet and the degree of importance that this culture gives to people that write, produce and publish content. Considering at least these variables, we will have a notion of the possibility of local authors producing online content in a given language. This is the possible system today for the production and circulation of knowledge in a language other than the national language, or the language of a country’s education. From pre-school education to higher education all intellectual work is in Portuguese language, and the authors who are the product of this educational process are writing and debating in Portuguese language.
Traditionally the university is an institution that produces books and scientific articles, which can potentially go online. As academic authors are in practice interested in facilitating the exchange of knowledge and the production of new knowledge, and local citizens are rarely involved in this dynamic, it is important to develop different circuits for local content production in order to develop a multilingual environment.
Many languages, many cultures, the real big data
There is an immense amount of information present in our territory that is outside the official intellectual life of the country and consequently outside the digital space in the official language, not only for a lack of technological conditions but also for a lack of critical postcolonial reflection.
“languages maladapted to the web, or a web maladapted to languages? In terms of tools, we know that a language’s online representation is not simply cultural or quantitative. It is most crucially a question of technology. The Internet, recalls Paolillo [2005], is an instrument originally conceived primarily for the English language. By extension, languages sharing the Latin alphabet and Western cultures were able to find a comfortable place for expression more quickly than others. However we should not forget that European-specific diacritics still do not have a place everywhere online, despite advances that are sometimes given an excessively high profile, as with actions advocating the acceptance of domain names in different alphabets and diacritics, English remains the language of programming, markup, coding, communication between servers and most importantly, the bases of computer languages. Computer languages are based on English, and computer scientists are professionally required to know it. But how many languages encounter more significant constraints related both to the technical problems of representation and to cyberspace-specific cultural media use [Diki-kidiri 2007]? The explanations provided by the site do not explain the method for deducing what language is used by a given Internet user, and the number of potential users of a number of languages (English, Arabic, Chinese, French, Portuguese, etc.) seems wrong”. (Prado, 2012)
The European Federation of Institutes of National Languages (EFNIL <
In the case of the Community of Portuguese Speaking Countries (Comunidade dos Países de Língua Portuguesa – CPLP) there is no such effort, and there is no clear interest in building a knowledge base in Portuguese online. Portugal, as far as we know, is resistant to making its scientific production available online, since in 2018 we are still unable to find most of its literature and scientific work online. The Portuguese-speaking African countries (Países Africanos de Língua Oficial Portuguesa – PALOP) are working towards building their postgraduate courses and setting up their national research and development priorities. Thus, as far as we know, there is no such space for the production of knowledge in Portuguese online.
I have been working on the issue of Multilingualism in the Digital World since 2004, and I first came into contact with the subject at UNESCO. I understood that it would be important to develop work reflecting on multilingualism in the digital world among Portuguese-speaking countries. The first framework that emerged from the dialogue is that local languages are behind the barrier of the official language. In the Information for All Program, at UNESCO, forty-six countries have structured some ways of preserving local cultures and languages, and are interested in exchanging good practices. The linguistic situation in Portuguese-speaking countries is shown in Table 1.
Languages in Portuguese Speaking countries
Access to languages in the digital world would be the first point of debate when we discuss big data. After the language is available online, it is important to assure that there is the possibility of online language learning. In this sense, the online language representation is very important for the children to enter the digital world in their mother tongue and to feel that their culture is present. The capacity to represent “minor” linguistic communities might be enabled by the appropriation of free technologies, the appropriation of an identity autonomy and the understanding that we all have more rights than we exercise.
Linguistic families alive in Brazil composing around 300 living languages.
In this perspective, it was only after some years of research that it finally became clear to me that the expression “Portuguese speaking countries” is in reality nonsense. It is very easy to adhere to the official narrative of countries in a post-colonial situation when we have not strengthened the intellectual core needed to seriously debate the inheritance of the colonial system on our ground. It is a comfort to still live among generations of academics who remain attached to an old colonial comprehension of our post-colonial reality, alongside a new generation that understands what is at stake but feels that there are not many opportunities to live in the country and carry out research. Brazil is currently cutting its investment and creating a structured deficit in (already low) national and regional research funding. In addition, for at least a decade Brazil has been under a national policy of research funding by thematic induction, meaning that it is the older generation that dictates research funding policies and chooses what kind of research should be funded. This promotes a profile of young researchers who, in order to adapt and to have some resources for research, will have to bend to the thematic choices of the agencies. This would not be a problem if in Brazil we had scientific planning, a common horizon for developing together something like a regional research plan for at least a decade. In other words, if we had a project for the development of science and for the improvement of citizenship in our country, it would not be difficult to cope with this process of thematic induction. As we say among scholars, today we are much more under the influence of a tower of thematic control in research than under the scrutiny of a cultivated ivory tower. Hence, a change in our way of doing science depends on a generation of researchers who do not have the financial autonomy to develop their projects.
Nevertheless there are ways in which countries can preserve their local cultures and languages.
In the case of Brazil, there is a large number of languages, cultures and peoples to be preserved, cherished and honored. There are many ethnic groups that can teach us how to better deal with the environment and with ourselves. There is a richness of possibilities to understand the fauna, the flora, the human beings, the world as we do not know it. Also, there is the possibility to learn to work, to feel and to think with someone who is different. There is the chance to learn how to be inclusive. There is the possibility to broaden the understanding of knowledge by exchanging respectfully with Amerindian and Afro-Brazilian sages and learning to work in parallel with different epistemologies. There is a huge potential to expand our understanding of the academic process, the colonial program, the arrival of the Europeans in our lands, and the horizon of industrialization and globalization that they brought. This is possible by getting in contact with original peoples’ ways of thinking, by analyzing the choices of our elite, and mostly by permitting ourselves to develop self-criticism concerning our micro-history of arrival in the academic sphere, the traditional role of academics in our territory, and what we do with it.
The many linguistic families that resist in Brazilian territory, and the around 300 living languages (Aryon Rodrigues, 2002) that we can find in (usually) small indigenous communities, are the big data this work is concerned with.
To exercise this possibility, it is important that we have a recording capacity, or that it is viable to create digital collections, so that people can register what they feel is most important for them. These recordings should have a place to be hosted that can be administered by the community, without a control point. It is important to have a capacity for interlocution. Similar networks tend to help one another in a way that very different networks (such as the academy and a traditional community) cannot. Traditional peoples and higher education professionals are two groups that have been practically torn apart since the beginning of the institution of the university in Brazil. Hence, the more communities we connect, the better the dialog.
Also, the possibility of publicizing their community initiatives through events is a very good start. The universities tend, in order to satisfy productivity controls, to take the lead in many events that would be much more interesting if created collectively, with academics in a backstage role. To the best of my understanding, the recognition among academic peers that such work is more than necessary in times of the disappearance of languages, peoples and water is something to be discussed amongst us.
We are in times of many environmental issues due to the infinite need for finite matter for production of goods, but there is a reasonable understanding among scholars that we have surpassed the level of destruction that the planet can handle. Traditional people’s capacity to resist and their understanding of resilience might expand our understanding of knowledge in a way that our interest in infrastructure and technology might meet an ethical comprehension and compassion. Such an approach is viable and feasible when we aim for the mutual well being of all peoples.
The possibility of disseminating reflections through publications is another very important point. Many countries, like Brazil, have a chart that gives scores to publications in order to evaluate the impact of a researcher’s work. Contrary to what we could expect from the perspective of a researcher trained and funded by national resources, the more our intellectual work is framed within the European and United States debate, the higher the score of the academic publication in Brazil. The more the academic work in Brazil is related to national issues or to research outside the European-American circuit, the less it counts in the score. As far as I understand, it is from the periphery that we might find ways to tackle the crisis that knowledge and technological development have brought us. Working with different worldviews, different logics, different understandings of human nature and different relations to living beings might teach us something valuable.
Big data in the terms of this work, considering Brazilian local languages, demands an old piece of equipment: the body in movement. It is necessary to get out of the university campus and to go in person to meet the speakers in order to understand what is happening with local languages in our territory, in local communities and in the periphery of the big cities. For this kind of research it is also necessary to be in contact with our neighbors, to be in contact with local peoples, and to develop, or to turn on, a double-bind software present in human beings: empathy with self-criticism. The sequence of these encounters among academics and local peoples, promoting mutual well-being through science, is to be orally shared and written.
Footnotes
Acknowledgments
Prof. Maria Eunice Quilice Gonzalez for the invitation to present this discussion at UNESP Marilia. Prof. Daniel Martinez-Avila and Prof. Fidelia Ibekwe-SanJuan for the proofreading and editing of the English as well as for their suggestions. Prof. Fernando Rosa-Ribeiro for introducing me to the writings of Johann Gottfried Herder. All the colleagues who have been kind enough to maintain the dialog with me about Multilingualism in the Digital World for the last decade.
