The words of Bambara headlines: A quick look at an open dataset

 

Finding Bambara or other Manding varieties written on the internet is not always straightforward. (In Latin script, at least; N’ko websites are a different story.)

That said, just as you run into the language written in the streets of West Africa, you also find it used for basic headlines on the internet. Today, this is in large part because of the foreign media operations run by the United States’ Voice of America (VOA) and France’s Radio France Internationale (RFI). (See this outside post of mine for more context.)

Interested in these headlines as a potential future archive, a data source for my own research, and a resource for people interested in learning Manding, I recently collected VOA Bambara’s headlines every day for one month.

Below are some very basic insights from my initial look at the data set, which I’ve uploaded open-access for anyone to use here, but first a bit more context…

Context

Voice of America has a Bambara language service that publishes audio and video media episodes.

Each episode has its own title (which often reads like a headline) and description, both written in Bambara using an ad hoc orthography that combines French, English and Bambara spelling conventions. (The fact that VOA Bambara does not respect standard Bambara orthography was recently a subject of debate and even a petition.)

Every day the VOA Bambara homepage links to the individual web pages for the most recent episodes that come from a variety of programs (longer regular shows of VOA Bambara) and segments (short episodes produced under one name, but not officially a show):

Programs

  • Mali Kura: a daily 30-minute radio show.
  • An ba fo: a 60-minute call-in radio show that comes out on Saturdays.
  • Farafina Foli: a 60-minute Bambara version of Music Time in Africa, a music-focused radio show.

Segments

  • VOA 60 Bambara: video clips from VOA's regular sixty-minute show. The clips come out daily during the week.
  • Sport: short audio clips focused on sports that are irregularly labelled with names like faricolo gnanadje or Sport:

Number of words

If we eliminate punctuation tokens and anything else non-alphabetic, and then convert all of the remaining word tokens to lower case, we get the following:

  • 5,710 words (counted in the same way that Word counts words)
  • 1,326 word types (if we only count words a single time no matter how many times they appear)

The average episode has a written Bambara headline that is roughly 21 words long. (Each "headline" is in fact a combination of a title and description in VOA Bambara's episode system.)
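For anyone who wants to reproduce counts like these, here is a minimal sketch of the counting described above. It is not necessarily the exact procedure I used: the `tokenize` and `headline_stats` names are hypothetical, the sample strings are made up from program names mentioned in this post rather than taken from the data set, and the regular expression is just one reasonable way to keep alphabetic material only.

```python
import re

# Hypothetical sample built from program names mentioned in this post;
# these are NOT actual headlines from the data set.
SAMPLE_HEADLINES = [
    "Mali Kura: Kassim Traoré",
    "An ba fo, Farafina Foli!",
    "VOA 60 Bambara",
    "Mali Kura: Sport",
]


def tokenize(text):
    """Lowercase the text and return its runs of alphabetic characters."""
    # [^\W\d_]+ matches Unicode letters only, so digits, punctuation and
    # underscores are dropped while letters such as é or ɛ are kept.
    return re.findall(r"[^\W\d_]+", text.lower())


def headline_stats(headlines):
    """Count word tokens, word types and the mean number of words per headline."""
    per_headline = [tokenize(h) for h in headlines]
    tokens = [tok for toks in per_headline for tok in toks]
    return {
        "tokens": len(tokens),          # every word, repeats included
        "types": len(set(tokens)),      # each distinct word counted once
        "mean_length": len(tokens) / len(headlines) if headlines else 0.0,
    }


print(headline_stats(SAMPLE_HEADLINES))
```

Note that this sketch splits on non-letters, so a token like "60" simply disappears; a tokenizer that instead discards whole tokens containing any non-alphabetic character would give slightly different totals.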

Top words

The top 25 words can be seen here:

[Figure: The top 25 words counted cumulatively]

Note that this graph means that the top 15 words account for 2,000 of the 5,710 total words that appeared in roughly a month of headlines! Two of them actually make up the name of one of VOA Bambara's journalists: Kassim Traoré. Sadly but appropriately, the coronavirus figures prominently, since I started collecting the headlines in mid-March of 2020.
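To put that in proportion, 2,000 of 5,710 is roughly 35% of all the words. A cumulative count like the one in the graph can be sketched as follows; the `cumulative_top_counts` name and the toy token list are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import accumulate


def cumulative_top_counts(tokens, k):
    """Return (word, count, running_total) for the k most frequent tokens."""
    top = Counter(tokens).most_common(k)
    running = accumulate(count for _, count in top)
    return [(word, count, total) for (word, count), total in zip(top, running)]


# Hypothetical toy token list, not the real data.
tokens = ["ka", "mali", "ka", "kura", "ka", "mali", "sport"]

for word, count, total in cumulative_top_counts(tokens, 3):
    # The last column is the share of all tokens covered so far.
    print(word, count, total, f"{total / len(tokens):.0%}")
```

The running total of the last row divided by the total number of tokens gives the share of the text covered by the top k words.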

If you are curious to see how frequent the words are individually, though, look at the graph below:

[Figure: A plot of the frequency of the top 25 words]

Notice how prominent ka is? This can be explained in part because of homonymy: ka is, in fact, a few different words (as in, lexemes or headwords) that are listed as distinct words in the dictionary:

 
[Figure: Returned words when searching for “ka” in the dictionary]

 

Other things

A seemingly large number of words occur only one time (so-called hapaxes): 621, which is more than 10% of the 5,710 total words and nearly half of the 1,326 word types. Some are clear typos, but others stand on their own. For instance:

  • damina 'start'
  • baloko 'sustenance'
  • dugukolo 'earth'
  • etc.
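Finding hapaxes is simple once the words are counted. Here is a minimal sketch; the `hapaxes` function name and the toy token list (which reuses the three examples above) are hypothetical.

```python
from collections import Counter


def hapaxes(tokens):
    """Return the sorted list of words that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)


# Hypothetical toy token list, not the real data.
tokens = ["ka", "damina", "ka", "baloko", "mali", "mali", "dugukolo"]

once = hapaxes(tokens)
print(once)
# Share of hapaxes among all tokens and among distinct word types.
print(len(once) / len(tokens), len(once) / len(set(tokens)))
```

Comparing the count against both the tokens and the types is useful because the two percentages can be very different, as they are in this data set.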

There are 8 words that are over 15 letters long. Some of them seem intended to be read as internet hashtags without spaces and not to stand as their own words. For instance:

  • ogossagouvoabambaradiallassagouvoabambaramopti
  • bambarawashington
  • konovoabambaracoronaviruskasarawashingtondc
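Pulling out such long forms only takes a length filter. In the sketch below, the `long_words` name and the toy token list are hypothetical; the threshold follows the "over 15 letters" cut-off used above.

```python
def long_words(tokens, min_letters=16):
    """Return distinct tokens with at least min_letters letters, longest first."""
    distinct = {tok for tok in tokens if len(tok) >= min_letters}
    # Sort by length (descending), then alphabetically for a stable order.
    return sorted(distinct, key=lambda w: (-len(w), w))


# Hypothetical toy token list mixing ordinary words with two of the
# hashtag-like forms listed above.
tokens = [
    "ka",
    "dugukolo",
    "bambarawashington",
    "konovoabambaracoronaviruskasarawashingtondc",
    "bambarawashington",
]

print(long_words(tokens))
```

A filter like this only flags candidates; deciding whether a long form is a run-together hashtag or a genuine long word still takes a human reader.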

Future research and projects?

Alright, that's a very broad and basic look at the VOA Bambara headlines data that I've collected. There are a number of smaller features that stuck out to me while reading, but I'll leave them for future writings.

To analyze the headlines as I would like down the road, I will need to interpret and translate all of the headlines into standard Bambara orthography. (This need to manually re-write every single headline reveals another reason that VOA Bambara's policy regarding written Bambara is a pain; it means even less data available to develop resources and tools like Google Translate, etc., that people so often expect now for major languages.) I have done it for a random assortment of the headlines, but it'll likely be a while before I get to or through the whole data set.

My other wish is that I can someday turn the headlines (and future ones) into a tool or database that researchers and students could search and filter. But that's likely even further off...