news

2021 / 01 / 23

10 Years of Phonemica

Today marks the 10th anniversary of the launch of Phonemica. On the 23rd of January, 2011, we officially went live with 12 recordings. This came after 2 years of discussion on how best to approach the project and what we wanted the long-term result to be.

Today also marks a major update to the site. About 5 years ago, with life getting in the way, we refreshed the site, changed the back-end to be more stable and self-sufficient, and really haven't touched it since. We're remedying that now with a total rewrite of the back-end.

What's new

code & data storage

We've taken major steps toward portability and ease of making backups. In a previous iteration of the site, user details were stored in one database, and transcripts in another, and neither of these made it terribly easy to make backups. We now have proper git integration for site code, and consistent backups of media files and transcripts.

media files

When we started the project, in order to be able to play audio files with the HTML5 audio tag, we were forced to host two versions of each recording: an .ogg for Firefox and an .mp3 for WebKit users. On top of that came a .wav from which waveforms are created, and then whatever format the original recording was in. We've consolidated that considerably. Ogg Vorbis is no longer needed now that mp3 has more complete support, and we're also no longer keeping other formats unless a wav file was the original upload.

This gives us a much smaller footprint in terms of files to keep track of, while also allowing quicker and smaller automated backups.

user interface

The site has been completely redesigned with an update on the familiar branding. In addition to a slight visual refresh, dark mode is now available. It will default to whatever your operating system settings are, but can be changed with the sun/moon icon near the right side of the navigation bar.

less bloat

In addition to cleaning up the site, we've also cleaned up some of the contributions. Of the 1277 speakers that had been listed on the site, only 783 had stories uploaded. In order to clean things up a bit, we have removed all entries that did not have an accompanying story. We hope that those who might have been planning to record and never got around to it still do, but in the meantime the empty entries were clogging up some other functions.

database changes

In past versions, the locations represented by different entries were not well standardised. Taiwan might be 台湾 or 台灣 or 臺灣. Locations were generally stored in CJK characters in the database, but because of this lack of consistency the system wouldn't know that all three were the same thing.

The same problem existed for languages. For the language side of things, this is relatively easy to fix by switching all in-database encodings to their corresponding Glottolog glottocodes. For example, 汉语系 and 漢語系 both become sini1245. This is only relatively easy instead of completely easy because there are cases where Glottolog either is missing a code, or has a different tree structure than the sources which were being followed for classification within Phonemica. One such case is that of 饒平 Ráopíng dialect, one of the five groups found in Taiwan, which has no glottocode at all. For these, for the time being, pseudocodes such as raop0000 are created. These glottocodes and pseudocodes are then used for i18n-like localisation, allowing more flexibility in searching, grouping and analysing data, and of course in visually representing the language names in a visitor-appropriate way.
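To make the i18n-like lookup a bit more concrete, here is a minimal Python sketch. It is not the site's actual code: the table name LANGUAGE_LABELS, the function localise_variety and the locale keys are all made up for illustration. Only the codes and the two spellings of 汉语系 come from the text above.

```python
# Hypothetical label table keyed by glottocode or pseudocode.
# The locale keys ("zh-Hans", "zh-Hant") are an assumption for this sketch.
LANGUAGE_LABELS = {
    "sini1245": {"zh-Hans": "汉语系", "zh-Hant": "漢語系"},
    "raop0000": {"zh-Hans": "饶平", "zh-Hant": "饒平"},
}


def localise_variety(code: str, locale: str) -> str:
    """Return a display name for a code, falling back to the code itself."""
    return LANGUAGE_LABELS.get(code, {}).get(locale, code)


print(localise_variety("sini1245", "zh-Hant"))
print(localise_variety("hakk1236", "zh-Hans"))  # no label in this table, so the code is returned
```

Falling back to the bare code keeps an entry displayable even when a translation has not been added yet.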

Previously, this was (inconsistently) represented in the database like this:

"language": {
  "0": "汉语系",
  "1": "客语",
  "2": "漳潮片",
  "3": "饶平小片",
  "4": null,
  "other": ""
},

Instead, it is shifting to a format like this:

"variety": ["sini1245", "hakk1236", "raop0000"],

Note 漳潮片 Zhāngcháo piàn also does not exist in Glottolog.
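A conversion from the old shape to the new one could be sketched roughly as follows. This is only an illustration under assumptions: the lookup table LEGACY_TO_CODE and the function migrate_language are hypothetical names, and the pairing of 客语 with hakk1236 and 饶平小片 with raop0000 is inferred from the two examples above rather than taken from the real migration.

```python
# Hypothetical lookup from legacy labels to glottocodes or pseudocodes.
LEGACY_TO_CODE = {
    "汉语系": "sini1245",
    "漢語系": "sini1245",
    "客语": "hakk1236",
    "饶平小片": "raop0000",
}


def migrate_language(old: dict) -> list[str]:
    """Convert a legacy "language" object into a "variety" list."""
    # The numeric keys give the depth in the classification tree.
    levels = sorted((key for key in old if key.isdigit()), key=int)
    variety = []
    for key in levels:
        label = old[key]
        code = LEGACY_TO_CODE.get(label) if label else None
        if code is not None:
            variety.append(code)
        # Labels with no code in the table (such as 漳潮片) are skipped here.
    return variety


old_entry = {
    "0": "汉语系",
    "1": "客语",
    "2": "漳潮片",
    "3": "饶平小片",
    "4": None,
    "other": "",
}
print(migrate_language(old_entry))
```

Skipping unmapped levels is just one possible choice; creating a pseudocode for them would be another.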

For locations a similar approach is being used. For 台湾 and 台灣 and 臺灣, an internal identifier tai2wan1 is used. While this is less cross-linguistically consistent, since locations within China use Hànyǔ Pīnyīn and locations within Korea use Revised Romanisation, it at least offers some consistency and a way to translate placenames.
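The mapping from variant spellings to a single identifier can be pictured with a small Python sketch. The names PLACE_VARIANTS and to_identifier are hypothetical and only meant to illustrate the idea. The three spellings and the identifier tai2wan1 are the ones mentioned above.

```python
from typing import Optional

# Hypothetical table of variant spellings seen in legacy data,
# each pointing at the one internal identifier.
PLACE_VARIANTS = {
    "台湾": "tai2wan1",
    "台灣": "tai2wan1",
    "臺灣": "tai2wan1",
}


def to_identifier(name: str) -> Optional[str]:
    """Return the internal identifier for a place name, or None if unknown."""
    return PLACE_VARIANTS.get(name.strip())


# All three spellings collapse to the same identifier.
print({to_identifier(name) for name in ["台湾", "台灣", "臺灣"]})
```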

The old way (which itself wasn't even consistent as sometimes the address was broken up and sometimes not):

"address": {
  "address": "臺灣新竹縣竹北市",
  "latitude": "24.8332427",
  "longitude": "121.0127707"
}

The new way:

"address": {
  "latitude": "24.8332427",
  "longitude": "121.0127707",
  "location": ["tai2wan1", "xin1zhu2xian4", "zhu2bei3shi4"]
}

The new system is indifferent to the number of items in the array, so that additional narrowing of locations can be provided if desired, but won't be necessary. Regardless of the length of the array, latitude and longitude will always be at the end.

By using an array, we can now easily filter recordings by their location, and not have the system be confused about whether 台湾 is the same thing as 臺灣, since the character set is handled by the localisation functions, and both refer to a single identifier tai2wan1. A big issue in the past was that some entries were submitted in simplified characters and some in traditional. Conversion scripts are unreliable, so this meant either no linking between the two, or a lot of manual work to create an object in the code listing all the options for all entries.
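Filtering on the array could work as a simple prefix match, as in the Python sketch below. This is an assumption about how such a filter might be written, not the site's implementation. The function in_location, the recording ids and the identifier guang3dong1 are invented for the example.

```python
def in_location(recording: dict, prefix: list[str]) -> bool:
    """True if the recording's location array starts with the given prefix."""
    location = recording.get("address", {}).get("location", [])
    return location[: len(prefix)] == prefix


# Hypothetical records; only the first location array appears in the post.
recordings = [
    {"id": "a", "address": {"location": ["tai2wan1", "xin1zhu2xian4", "zhu2bei3shi4"]}},
    {"id": "b", "address": {"location": ["tai2wan1"]}},
    {"id": "c", "address": {"location": ["guang3dong1"]}},
]

in_taiwan = [r["id"] for r in recordings if in_location(r, ["tai2wan1"])]
in_xinzhu = [r["id"] for r in recordings if in_location(r, ["tai2wan1", "xin1zhu2xian4"])]
print(in_taiwan, in_xinzhu)
```

Because the match only looks at the leading items, entries with longer or shorter arrays can sit side by side without special handling.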

This system is also not ideal, but in the absence of a Glottolog-like system for locations, this may be the best option at the moment. We are looking into something a bit cleaner.

Finally, names of speakers will exist in more than one form. We will use automatic romanisation for Chinese names and allow these to be searchable, so that if you find a story by 饒平妹 that you want to find again but can't remember the characters, you can also search for ráopíngmèi or 요평매 or ราวผิงเม่ย, depending on the language of your user interface.
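A search across several forms of a name might look something like the Python sketch below. It does not do any automatic romanisation itself: the forms are simply stored alongside the name, and the matching ignores case and tone diacritics. That folding step, and the names SPEAKERS, fold and find_speakers, are assumptions made for the example.

```python
import unicodedata

# Hypothetical speaker record with the alternative forms already generated.
SPEAKERS = [
    {"name": "饒平妹", "forms": ["饒平妹", "ráopíngmèi", "요평매", "ราวผิงเม่ย"]},
]


def fold(text: str) -> str:
    """Decompose, drop combining marks such as tone diacritics, and casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def find_speakers(query: str) -> list[str]:
    """Return names of speakers with any stored form containing the query."""
    folded = fold(query.strip())
    if not folded:
        return []
    return [
        speaker["name"]
        for speaker in SPEAKERS
        if any(folded in fold(form) for form in speaker["forms"])
    ]


print(find_speakers("raopingmei"))
print(find_speakers("요평매"))
```

The same folding is applied to both the query and the stored forms, so a search typed without tone marks can still find ráopíngmèi.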

Roadmap

The following is a rough outline of our plans. Subject to change, of course, but probably not by much.

23 January 2021

Site launches as version 3.0, with a new back-end and improvements to many features. At this time, not all features are available, and the site is primarily read only.

Late Q1 2021

Re-enable transcript editing, story submission, other previously locked features.

Early Q2 2021

Revive Korean localisation for site interface

Early Q3 2021

Initial forms of Burmese and Assamese site localisation