The study of human accented speech is of theoretical and practical interest to linguists, language teachers, actors, speech recognition engineers, and computational linguists. It is also part of the research program behind the speech accent archive (http://accent.gmu.edu). The archive is a growing annotated corpus of English speech varieties that contains more than 2,355 samples of native and non-native speakers reading the same English paragraph. The non-native speakers come from more than 365 language backgrounds and represent a range of English proficiency levels. The native samples demonstrate the various dialects of English from around the world. Every sample includes a complete digital audio recording and a narrow phonetic transcription. Each speaker is located geographically, and crucial demographic parameters are supplied. For comparison purposes, the archive also includes phonetic sound inventories from more than 200 world languages, so that researchers can perform contrastive analyses and accented speech studies.
This paper discusses the architecture and the collaborative methodology behind the speech accent archive. We evaluate our practices and use them to formulate a set of best practices for online speech databases. We also address ongoing modifications to the archive, particularly our new computational tools, the enhanced Unicode search capabilities, and the new smartphone recording procedures. Finally, we describe how the archive is used as a research and teaching tool, along with ways of sharing the data.