To produce rule induction models for a set of data


Section Number Section Header
Acknowledgments
1 Introduction
1.1 Project Title
1.2 Project Aim
1.3 Project Motivation
2 Relevant Information and Background Research
2.1 The Inland Revenue
2.2 What is Data Mining?
2.3 My Previous Experience of Data Mining
2.3.1 Example Cluster Map-Where does the Beer Go?!?
2.3.2 How does Cluster analysis work?
2.4 Methodology in a Data Mining Project
2.4.1 Exploring the Problem
2.4.2 Exploring the Solution
2.4.3 Implementation specification
2.4.4 Data Preparation
2.4.5 Surveying the Data
2.4.6 Data Modelling
2.5 Rule Induction Modelling
2.6 Association rules (Using the Apriori Algorithm)
3 Software
3.1 Weka
3.1.1 ARFF file
3.1.2 How to use Weka
4 The Data
4.1 Original Data
4.1.1 SA Random Enquiry Program Data
4.1.2 Risk Rule firings data
4.2 Data Preparation
4.2.1 Combining the Two years worth of data
4.2.2 Creating one final table
4.3 Relevant Clusters
4.4 Recommendations
4.5 Duplicate Id’s
5 Data Familiarisation
5.1 Analysis of the SA data
5.1.1 Whole Population
5.1.2 Cluster 1
5.1.3 Cluster 2
5.1.4 Cluster 3
5.2 Analysis of the Risk Rules data
6 Analysis of the data
6.1 Running the Data through C4.5 in Weka
6.2 Analysis of the Output
Table 1
Table 2
Table 3
Table 4
7 Conclusion
7.1 Effectiveness of the Methodology?
7.2 Presenting the Results to a manager of the Inland Revenue.
8 Appendices
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I
Appendix J
Appendix K
Appendix L
Appendix M
Appendix N
Appendix O
Appendix P
Appendix Q
9 Bibliography
1 – Introduction
1.1 – Project Title – To produce rule induction models for a set of data.
1.2 – Project Aim – The Inland Revenue proposed this project as one that they need doing
and that I should be able to complete using the software available to me. The aim of this
project is to determine the success of automated risk assessment for SA (Self Assessment)
Businesses. The automated system works by running risk rules (which are coded in a
programming language) against the SA Business records. Each rule that fires produces a
score, and if the total score for a record is above a certain threshold then that record may be
deemed risky and would be looked into manually.
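The scoring mechanism described above can be illustrated with a minimal Python sketch. The rule names, the conditions, the scores and the threshold of 50 are all hypothetical and chosen only for illustration; the real risk rules used by the Inland Revenue are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass
class RiskRule:
    name: str                         # hypothetical rule name
    fires: Callable[[Record], bool]   # condition tested against one record
    score: int                        # score contributed when the rule fires


def assess(record: Record, rules: List[RiskRule], threshold: int) -> dict:
    """Run every rule against a record and compare the total score to the threshold."""
    fired = [rule for rule in rules if rule.fires(record)]
    total = sum(rule.score for rule in fired)
    return {
        "fired": [rule.name for rule in fired],
        "total": total,
        "risky": total > threshold,   # "above a certain threshold"
    }


# Illustrative rules only: these are assumptions, not the real risk rules.
RULES = [
    RiskRule("low_profit_margin", lambda r: r["profit"] < 0.05 * r["turnover"], 30),
    RiskRule("high_expenses", lambda r: r["expenses"] > 0.8 * r["turnover"], 25),
    RiskRule("round_turnover", lambda r: r["turnover"] % 1000 == 0, 10),
]

example = {"turnover": 50000.0, "expenses": 45000.0, "profit": 2000.0}
result = assess(example, RULES, threshold=50)
print(result)
```

In this made-up example all three rules fire, giving a total score of 65, so the record would be passed on for manual inspection.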
The task is to determine which risk rules are the better indicators for predicting non-compliant
businesses. This will be done by building rule induction models and analysing them for:
– all businesses
– certain specific clusters of businesses (previously identified by cluster analysis).
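One way to picture what "better indicator" means is to measure how much knowing whether a rule fired reduces the uncertainty about compliance. The Python sketch below computes the information gain of each rule; this is the quantity that decision tree learners of the C4.5 family build on (C4.5 itself uses a normalised form, the gain ratio). The rule names and the four records are invented purely for illustration.

```python
from math import log2
from typing import List


def entropy(labels: List[int]) -> float:
    """Entropy in bits of a list of 0/1 labels (1 = non-compliant)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(fired: List[int], labels: List[int]) -> float:
    """Reduction in label entropy obtained by splitting on whether the rule fired."""
    n = len(labels)
    yes = [y for f, y in zip(fired, labels) if f]
    no = [y for f, y in zip(fired, labels) if not f]
    return (entropy(labels)
            - (len(yes) / n) * entropy(yes)
            - (len(no) / n) * entropy(no))


# Hypothetical data: four records, 1 = found non-compliant on enquiry.
labels = [1, 1, 0, 0]
rule_firings = {
    "rule_A": [1, 1, 0, 0],   # fires exactly on the non-compliant records
    "rule_B": [1, 0, 1, 0],   # fires independently of compliance
}

gains = {name: information_gain(f, labels) for name, f in rule_firings.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```

Here the hypothetical rule_A separates the two groups perfectly and so has the highest gain, while rule_B tells us nothing about compliance. The same calculation could be repeated on the records of a single cluster to compare rules within that cluster.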
It is also possible to look into which risk rules fire together and how successful they are at
identifying risky taxpayers. This can be done using association algorithms (such as Apriori),
and the associations can be found automatically using a web node. All the algorithms that are
required are contained within a package called Clementine, which is the software that the
Inland Revenue would use to carry out this project.
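To show what "rules firing together" means in terms of numbers, the following Python sketch counts how often pairs of rules fire on the same record and reports the support and confidence of each pairing. It is not a full Apriori implementation: it covers only single rules and pairs, using the Apriori idea that a pair can only be frequent if both of its members are frequent. The rule identifiers R1 to R3 and the five records are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple


def pair_rules(firings: List[Set[str]], min_support: float) -> List[Tuple[str, str, float, float]]:
    """Return (antecedent, consequent, support, confidence) for frequent rule pairs."""
    n = len(firings)
    item_counts = Counter(rule for record in firings for rule in record)
    # Apriori pruning step: only rules that are frequent on their own can form frequent pairs.
    frequent = {rule for rule, count in item_counts.items() if count / n >= min_support}

    pair_counts = Counter()
    for record in firings:
        pair_counts.update(combinations(sorted(record & frequent), 2))

    found = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        found.append((a, b, support, count / item_counts[a]))  # a fired => b fired
        found.append((b, a, support, count / item_counts[b]))  # b fired => a fired
    return sorted(found)


# Hypothetical firings: each set holds the rules that fired on one record.
firings = [{"R1", "R2"}, {"R1", "R2", "R3"}, {"R2"}, {"R1", "R2"}, {"R3"}]
associations = pair_rules(firings, min_support=0.4)
for antecedent, consequent, support, conf in associations:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={conf:.2f}")
```

Support is the fraction of all records on which both rules fired, and confidence is the fraction of records on which the second rule fired given that the first one did.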
1.3 – Motivation for the Project – My degree course is Mathematics with Artificial
Intelligence, and I wanted to do a project that combined both areas of my degree. Whilst on
my placement at the Inland Revenue I used techniques that incorporated both of these. I
took part in a data mining (an Artificial Intelligence technique) project which looked into
fraud in Repayments data. This was done using cluster analysis with a piece of software
(called acustar) which had been specially designed for the Inland Revenue.

2 – Relevant Information and Background Research
2.1 – The Inland Revenue
Established in 1694, the Inland Revenue is one of the oldest Government departments. Its aim
and objectives are set out in the annual report [1
annual_reps.htm] and are as follows:
“The Aim is:
To provide the best possible tax and valuation services. Our Objectives are to meet this
aim by providing fair, efficient and effective tax and valuation services through:
· Bringing into the Exchequer the taxes, National Insurance Contributions and other
receipts for which we are responsible
· Providing Ministers with high quality analysis and advice on direct tax and national
insurance contribution policy, reflecting the government’s objectives…”
The Inland Revenue home page [2] says: “The
Inland Revenue is responsible, under the overall direction of Treasury Ministers, for the
efficient administration of income tax, tax credits, corporation tax, capital gains tax,
petroleum revenue tax, inheritance tax, national insurance contributions and stamp duties. The
Department’s job is to provide an effective and fair tax service to the country and Government.”
This project is concerned with the Self Assessment of businesses. The Inland Revenue
annual report [1] also says that:
“Self Assessment is for people with more complex tax affairs – including the self-employed,
business partners, company directors and those paying tax at higher rates. It is not a new tax,
just a simplified method for people who may have had several tax forms to fill in previously
and who may have been paying tax on different types of incomes at different times – even in
different years. It gives them the option of calculating their own tax payments, if they wish.
Nine million people are affected.”
Enquiry programmes were introduced in 1996/97 to measure the levels of non-compliance in
the Self Assessment population. The programme is based on a random sample of the
self-assessment taxpayer population and is designed to provide simple measures of
compliance in terms of the percentage of compliant taxpayers. Part of the information that I
have is taken from the results of these random enquiries, mainly the yield obtained (i.e. the amount of
money that was found to be owed by the,……