Frequently Asked Questions

If I have a noisy dataset consisting of 27 attributes and 597 instances, how can I improve the results of a data mining classifier without overfitting?

Some research has shown that when no noise is intentionally added, the combination of the noise filters Edited Nearest Neighbor (ENN) and Repeated ENN (RENN), excluding Reduced Error Pruning, gave the best classification accuracy. The resulting combination outperformed the corresponding classifier trained on the full dataset in nine out of thirteen datasets. Noise filters gave the best results in combination with DROP3 and DROP5 on datasets without added noise. At 5%, 10%, and 20% noise ratios, the two noise filters mentioned above showed significantly better results than configurations that included Reduced Error Pruning.

A key concept for answering your question is instance reduction algorithms. They are widely used in Instance-Based Learning (IBL, also known as memory-based learning), where the learning algorithm compares new problem instances with instances already seen and stored in memory, instead of performing explicit generalization. Their main task is to identify which valuable, most relevant instances to retain and which to remove from the dataset, so that classification accuracy is maintained while memory requirements are reduced. Instance reduction can yield a model with improved properties such as a shorter learning process and the ability to scale to larger data sources.

As an example, consider noise filters: they strive to remove instances close to decision boundaries, because those tend to be the noisiest, while retaining internal instances; the amount of reduction they achieve is therefore usually quite limited.

ENN – The algorithm removes an instance if its class differs from the majority class of its k nearest neighbors.
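The ENN rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes numeric features, Euclidean distance, and a simple majority vote (function name `enn_filter` and the default k=3 are my own choices, not from any specific library).

```python
import numpy as np
from collections import Counter

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbor: drop an instance whose class disagrees
    with the majority class of its k nearest neighbors."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        # Euclidean distances from instance i to all instances;
        # exclude the instance itself from its own neighborhood.
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        neighbors = np.argsort(d)[:k]
        majority = Counter(y[neighbors]).most_common(1)[0][0]
        keep[i] = (y[i] == majority)
    return X[keep], y[keep]
```

On a dataset with a mislabeled point inside an otherwise clean cluster, the mislabeled point is removed because all of its neighbors carry the other class.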

RENN – The algorithm repeats ENN until no more instances are removed. This means every remaining instance has the same class as the majority of its k nearest neighbors.
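The repetition step can be sketched as a loop around a single ENN pass until the filter converges. As with the ENN sketch, the names (`enn_pass`, `renn_filter`) and the Euclidean-distance/majority-vote choices are illustrative assumptions; the stopping guard `len(X) > k` is simply there so the neighborhood stays well-defined on very small remainders.

```python
import numpy as np
from collections import Counter

def enn_pass(X, y, k=3):
    # One ENN pass: return a boolean mask of instances to keep.
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        neighbors = np.argsort(d)[:k]
        majority = Counter(y[neighbors]).most_common(1)[0][0]
        keep[i] = (y[i] == majority)
    return keep

def renn_filter(X, y, k=3):
    """Repeated ENN: apply ENN passes until a full pass removes
    nothing, i.e. every surviving instance agrees with the majority
    class of its k nearest neighbors."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    while len(X) > k:
        keep = enn_pass(X, y, k)
        if keep.all():
            break  # converged: no instance was flagged this pass
        X, y = X[keep], y[keep]
    return X, y
```

Because each pass can expose new disagreements once earlier noise is gone, RENN typically removes at least as many instances as a single ENN pass.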

Related Questions