Language Identification Between Similar Languages – British, American and Australian written English
The aim of this project is to find a way to distinguish between natural language variants, in particular British, American and Australian written English. This is the same problem tackled by Tim Chater in 2001. Because Chater’s attempt to use the n-gram recognition method had not yielded satisfactory results, this project instead used an algorithm that identifies the most distinguishing words from frequency word lists using the chi-square statistical test.
The first task was to conduct a literature review in the field of Natural Language Identification. The related work examined included that of Souter et al., Chater and Kilgarriff.
It was found that the chi-square test selected too many common words as distinguishing words, so the chi-square value divided by the word frequency was used instead. Using this measure, a set of identification programs was developed. For both two and three language variants, they correctly identified almost all test files of varying lengths. Under scrutiny, however, it was found that the distinguishing words extracted by the software may be more representative of the corpora used to train it than of the language variants themselves.
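The scoring idea described above can be illustrated with a minimal Python sketch. It is not the project's actual software: the function names, the 2x2 contingency table layout, and the choice to divide by the word's combined frequency across both corpora are all assumptions made for illustration, since the exact normalisation is not specified here.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def distinguishing_words(tokens_a, tokens_b, top_n=10):
    """Rank words by chi-square divided by word frequency (assumed form).

    tokens_a and tokens_b are hypothetical token lists, one per variant.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {}
    for word in set(freq_a) | set(freq_b):
        a, b = freq_a[word], freq_b[word]
        # Table rows: this word / all other words; columns: corpus A / corpus B.
        chi2 = chi_square_2x2(a, b, total_a - a, total_b - b)
        # Assumption: normalise by the word's combined frequency so that
        # very common words are not favoured merely for being common.
        scores[word] = chi2 / (a + b)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy corpora, invented purely for this example.
british = "the colour of the harbour".split() * 10
american = "the color of the harbor".split() * 10
top_words = distinguishing_words(british, american, top_n=4)
```

On these toy corpora, words that occur equally often in both variants, such as "the", receive a score of zero, while the variant-specific spellings rank highest. The same caveat raised above applies: with real data, the highest-scoring words may reflect the topics of the training corpora as much as the language variants themselves.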