
Language Identification Between Similar Languages





Summary
The aim of this project is to find a way to distinguish between variants of a natural language, in
particular British, American and Australian written English. This is the same problem tackled by
Tim Chater in 2001. Because Chater’s attempt to use the n-gram recognition method had not yielded
satisfactory results, this project instead used an algorithm that identifies the most distinguishing
words from frequency word lists using the statistical chi-square test.
The first task was to conduct a literature review in the field of Natural Language Identification.
Related work was examined, including work by Souter et al, Chater and Kilgarriff.
It was found that the chi-square test selected too many common words as distinguishing words, so
chi-square divided by the word frequency was used instead. Using this test, a set of identification
programs was developed. For two and three language variants they successfully identified almost all
test files of varying lengths. Under scrutiny, however, it was found that the distinguishing words
extracted by the software may be more representative of the corpora used to train the software
than of the language variants themselves.
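
The chi-square divided by frequency idea can be illustrated with a minimal Python sketch. It is not the project's own code: the summary does not give the exact formula, so the sketch assumes the standard 2x2 Pearson chi-square (a word versus all other words, in corpus A versus corpus B) and divides it by the word's combined frequency. The function names, the whitespace tokenizer, the labels and the threshold parameter are all hypothetical.

```python
from collections import Counter


def word_frequencies(text):
    """Count lowercase, whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def chi_square_over_f(freq_a, freq_b):
    """Score every word by chi-square divided by its combined frequency f.

    Assumption: chi-square is the 2x2 Pearson statistic summed over four
    cells (this word / all other words, in corpus A / corpus B).
    """
    n_a = sum(freq_a.values())
    n_b = sum(freq_b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both corpora must contain at least one word")
    n = n_a + n_b
    scores = {}
    for word in set(freq_a) | set(freq_b):
        o_a = freq_a.get(word, 0)
        o_b = freq_b.get(word, 0)
        f = o_a + o_b
        if f == n:
            # Only one word type exists overall; the "other words" cells
            # would have an expected count of zero, so skip it.
            continue
        chi2 = 0.0
        for observed, corpus_size in ((o_a, n_a), (o_b, n_b)):
            expected_word = f * corpus_size / n
            expected_other = (n - f) * corpus_size / n
            chi2 += (observed - expected_word) ** 2 / expected_word
            chi2 += ((corpus_size - observed) - expected_other) ** 2 / expected_other
        scores[word] = chi2 / f
    return scores


def distinguishing_words(freq_a, freq_b, threshold, label_a="A", label_b="B"):
    """Keep words scoring at or above the threshold, labelled with the
    variant in which they are relatively more frequent."""
    n_a = sum(freq_a.values())
    n_b = sum(freq_b.values())
    model = {}
    for word, score in chi_square_over_f(freq_a, freq_b).items():
        if score >= threshold:
            rate_a = freq_a.get(word, 0) / n_a
            rate_b = freq_b.get(word, 0) / n_b
            model[word] = label_a if rate_a > rate_b else label_b
    return model


# Toy usage with invented sentences, not the project's corpora.
british = word_frequencies("colour colour colour the the the centre the")
american = word_frequencies("color color color the the the center the")
print(distinguishing_words(british, american, threshold=1.0,
                           label_a="British", label_b="American"))
```

In this toy data a word shared equally by both texts, such as "the", scores zero, while spelling variants score above it. With real corpora the choice of threshold controls how many words enter the model, and, as the summary cautions, high-scoring words may reflect the training corpora rather than the language variants.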

Contents
Summary
Acknowledgements
Contents
Chapter 1 – Introduction
1.1 – Background
1.2 – Aims and Objectives
1.3 – Scope
1.4 – Why I Chose this Project?
1.5 – Structure
Chapter 2 – Background
2.1 – NLP in general
2.2 – Language Identification
2.21 – N-gram Recognition
2.22 – Language Identification between similar languages
2.23 – Unique Character Recognition and Frequent Word Recognition
2.3 – Genre Identification
2.4 – Differences between British and American English
2.5 – Chi-square and Mann-Whitney Ranks test
2.6 – Mutual Information and Log-likelihood (G²)
2.7 – Definitions
Chapter 3 – Design
3.1 – Comparison of frequency word lists
3.2 – Identification of the unknown text
3.3 – Statistical Tests
3.4 – Preliminary Experiment
3.41 – Findings from Preliminary experiment
3.5 – Relevant Variables
Chapter 4 – Implementation
4.1 – Test Data
4.2 – Software Development
4.3 – Programming Language
4.4 – Frequency word list generator
4.5 – Distinguishing word model generator
4.6 – Distinguishing between two language variants
4.7 – Distinguishing between many language variants
Chapter 5 – Testing and Results
5.1 – Testing Methodology
5.11 – Testing the Frequency Wordlist Generator
5.12 – Training the Models
5.13 – Testing the Identification software with two language variants
5.14 – Testing the identification software with three language variants
5.2 – Evaluation Methodology
5.3 – Results
5.31 – Effects of varying the threshold X²/f on the model length
5.32 – Accuracy and Reliability for varying test file lengths
5.33 – Accuracy and Reliability for varying model lengths
5.34 – Categorisation of distinguishing words
5.35 – Distinguishing between three language variants
Chapter 6 – Conclusions and Future Enhancements
6.1 – Conclusions
6.2 – Limitations
6.3 – Future Enhancements
6.4 – Completion of Project Aims
Appendixes
Appendix A – Personal Reflection
Appendix B – User Manual
Appendix C – Program Code
Appendix D – Distinguishing Word Models
Appendix E – Wordlist Test Data and results
Appendix F – Sample Test Data
Appendix G – Sample Results
Appendix H – Categorisation of Distinguishing words
Appendix I – Frequency Word Lists
Appendix J – Justification of X²/f threshold
