## Data Mining

The following are my thoughts, paper reviews, and research work for my course CSI 777: Principles of Knowledge Mining.

**From Data to Wisdom** by **Russell Ackoff**

I completely agree with Ackoff’s assessment that the modern school system focuses on transmitting knowledge rather than on analytical thinking. This is a mechanistic form of thinking dating from Descartes’ time, if the film *Mindwalk* (adapted from Fritjof Capra’s *The Turning Point*) is to be believed. I also wonder how understanding a concept, as Ackoff describes it, counts as synthetic thinking. The word “synthetic” tripped me up, since nowadays we more often use it to mean “artificial”. Perhaps in 1988, when this paper was published, it carried the meaning a quick Google search gave me for the adjective: “having truth or falsity determinable by recourse to experience”. The paper already seems to distinguish the stages of the DIKW flow. It also argues for an additional step, *Understanding*, which I perceive as important given the arguments made in the first few lines. Understanding helps us increase efficiency, while effectiveness evaluates efficiency. It is a neat little distinction: as I remember from Computer Science and Physical Engineering, efficiency was always capped at 100%, whereas effectiveness was a term from the validation process. However, I disagree with Ackoff that computer systems are unable to “generate wisdom”. Like all ongoing Machine Learning, Natural Language Processing, and Artificial Intelligence efforts, the idea is to imitate human-like behavior in decision-making. If a computer system can mimic wisdom and infer insights from the knowledge it possesses, will it not be purported to have wisdom itself? Lastly, Ackoff treats data generation as a schooling issue, which I believe is short-sighted in the current world of Big Data. Data generation is not *just* a schooling issue but a societal effort. Structured, processed, and clean data, on the other hand, could possibly be created at the educational level, but how would we ever keep up?!
The paper starts strong and is convincing, but for me its conclusion is underwhelming.

**The wisdom hierarchy: Representations of the DIKW hierarchy** by **Jennifer Rowley**

We went into the knowledge hierarchy’s reverse-waterfall-like structure in class, but Rowley takes down every iteration of the DIKW model in its various forms and summarizes all the different takes and their own definitions of the different parts of the model. I enjoyed learning that the origins of this hierarchy lie in a 1934 poem by T. S. Eliot, *The Rock*. An entire scientific and computational discipline can be traced to the poet who came up with Macavity the Mystery Cat and Mr. Mistoffelees the Practical Cat.

*Where is the wisdom that we have lost in knowledge?*

*Where is the knowledge that we have lost in information?*

I am glad I read Ackoff’s paper before reading this one, so that the order of reference is maintained. I would someday like to read about the other models as well, but I am unsure if there is a single authority on the One True Model yet. Of all the visualizations of the model, the ones that stood out to me are Choo’s data, information and knowledge diagram, which looks like a balanced, cognition-informed graph, and the philosophical approach, specifically the Western form, via Plato’s definition of Knowledge as “Justified True Belief”.

Finally, the paper also encapsulates what attracts me to the course and the subject matter itself. It seems to be a philosophical, informed approach to a real-world necessity with an extremely multi-disciplinary application, including philosophy, cognitive science, social science, management science, information science, knowledge engineering, artificial intelligence, and economics. Not only does the DIKW model, and by extension Knowledge Mining, apply to these fields, but it also borrows many application concepts from and between them. I am curious about the omission of “insight” as a step in the hierarchy. Assuming that it would come in after knowledge extraction, would it be above or below Wisdom? We would have to define yet another term then, especially since the terms already included (Wisdom, for example) are not studied enough, so maybe we are better off without it!

**On the Theory of Scales of Measurement** by **S. S. Stevens**

The paper has a distinct storytelling quality to it, which I enjoyed. I can imagine the committee of the British Association for the Advancement of Science sitting around in a stuffy room, in snobby clothes, asking each other, “What is measuring something? How do we measure emotions?” What came out of this debate is a classification of the scales of measurement.

- Nominal
- Ordinal
- Interval
- Ratio
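As a rough illustration of why the ordering of these scales matters, here is a minimal Python sketch in which each scale inherits the statistics that are meaningful on the scales below it. The names `SCALES`, `NEW_STATISTICS`, and `permissible_statistics` are my own, and the list of statistics is a simplified subset rather than a full reproduction of the table in Stevens’s paper.

```python
# Scales in order, from least to most structure.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that first become meaningful at each scale (simplified subset).
NEW_STATISTICS = {
    "nominal": ["mode"],
    "ordinal": ["median"],
    "interval": ["mean", "standard deviation"],
    "ratio": ["coefficient of variation"],
}


def permissible_statistics(scale):
    """Return the statistics allowed on a scale, including inherited ones."""
    allowed = []
    for name in SCALES[: SCALES.index(scale) + 1]:
        allowed.extend(NEW_STATISTICS[name])
    return allowed


ordinal_stats = permissible_statistics("ordinal")
ratio_stats = permissible_statistics("ratio")
```

Under this sketch, a survey answer on an ordinal scale supports a median but not a mean, which is the kind of restriction the classification is meant to capture.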

The paper goes in depth about the classification system, but I am unsure if it covers all forms of measurement. It is a paper that could possibly be updated from its 1946 form. I would be interested to know how these scales have developed to suit the requirements of Big Data or Quantum studies.

**Science and Statistics** by **George E. P. Box**

I am really glad to have learnt about Dr. R. A. Fisher. I intend to read Joan Fisher’s book on her father later, and perhaps something on her husband, Dr. Box, as well. I found the tidbit about their personal life fascinating, along with Dr. Fisher measuring his children (in the name of Science!). What strikes me most is how prolific and multi-disciplinary his work seems to have been. This, I believe, is what the future of Data Science could be: ranging from the application of statistical and computational theory and practice across all fields to the ubiquity of data literacy itself. The study of Science is presented as a constant back and forth between theory and practice, a feedback loop where one informs the other. I really liked the caution that a scientist should not be a Pygmalion to their work’s Galatea, alongside the advice to be parsimonious and to avoid a narrowed view of their work. This paper works as guidance by Dr. Box on how to be a statistician, mathematician, and scientist. I did not completely understand the case studies mentioned in the paper, and will require further reading on them, but at a high level they all seem extremely captivating.

**Visual Data Mining** by **Wegman**

Visual Data Mining seems to be a methodology for working on large datasets with the usual or at least expected machine learning but in a more intuitive format. At first glance this seems counter-intuitive to the claim that this method would reduce the computational resources required since visual interpretations add that much of an extra layer of dimensionality and resource consumption, cognitive load, and screen resolution, but this paper argues for the opposite in the case of data mining. In the data mining literature, these have been extensively investigated in an attempt to conduct exploratory data analysis for knowledge discovery in databases, quicker, more efficiently, and competently.

Several methods for visual data mining have been reported in the literature, and here the focus is on their applicability to statistical tasks rather than their methodologies. Despite the success of data mining in certain aspects, it still suffers from being a sort of unapproachable or intimidating black box for many, and I believe visual data mining, much like visual software, tools, and UI-based programming languages, would benefit from greater accessibility features.

Several new strategies are proposed to improve the performance of rapid data editing, density estimation, inverse regression, the formulation of tree-structured decision rules and their application to risk assessment, dimension reduction and variable selection, classification, clustering, discriminant analysis, multivariate outlier detection, and unique event detection. These strategies include parallel coordinates, d-dimensional grand tours, and saturation brushing techniques. The BRUSH-TOUR and TOUR-PRUNE strategies are among the visualization techniques with $O(n)$ computational complexity, versus alternatives of order $O(n^2)$ or exponential complexity.

The paper demonstrates that Parallel Coordinates represent d-dimensional points in a 2-dimensional planar diagram, where the mapping uses mathematical structures to recreate a scatter plot “data cloud”. To look at the data cloud with different parameters, such as space-filling and continuous algorithms, the d-dimensional grand tour allows for an animated visual. I am unsure of many of the algorithm names mentioned in this section, like the Asimov-Buja winding algorithm or the Solka-Wegman fractal algorithm.
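To make the parallel coordinates idea concrete for myself, here is a minimal Python sketch of the mapping: each d-dimensional point becomes a polyline with one vertex on each of d parallel vertical axes. The function name, the rescaling of every attribute to [0, 1], and the sample points are my own assumptions for illustration, not details taken from the paper or from CrystalVision.

```python
def parallel_coordinates(points):
    """Map d-dimensional points to polylines on d parallel vertical axes.

    Axis j sits at x = j; each attribute is rescaled to [0, 1] along its axis.
    """
    dims = len(points[0])
    lows = [min(p[j] for p in points) for j in range(dims)]
    highs = [max(p[j] for p in points) for j in range(dims)]
    polylines = []
    for p in points:
        line = []
        for j in range(dims):
            span = highs[j] - lows[j]
            # A constant attribute is drawn at the middle of its axis.
            y = (p[j] - lows[j]) / span if span else 0.5
            line.append((j, y))
        polylines.append(line)
    return polylines


# Hypothetical 4-dimensional points, made up for illustration.
data = [(5.1, 3.5, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.3, 3.3, 6.0, 2.5)]
lines = parallel_coordinates(data)
```

Each entry of `lines` is a list of `(axis, height)` vertices that a plotting library could join with line segments; clusters then show up as bundles of similar polylines.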

Saturation Brushing is a form of ordinary brushing which seems like an appropriate term for visual data mining by representing the data clouds in a word cloud format. The BRUSH-TOUR strategy is an iterative form of narrowing down these clusters of data clouds. The color design involved in this follows cognitive theories and visualization strategies that I came across in the CSI 703 course: Scientific and Statistical Visualization. The software suggested for visual data mining here is CrystalVision which can handle mapping out high-dimensional data along with scatterplot, density plot, linked views, and visual data manipulations. While the scatterplots were easy to read, I wonder how useful the parallel coordinate display of Pollen data really is, even with a desaturated coloring feature.

The TOUR-PRUNE technique, similar to CART (Classification and Regression Trees), is a decision tree pruning analysis strategy, but with a visual aid for determining the outcome, or at least the point at which the pruning process can be adjusted for obtaining the results.

The challenges of this process seem to be dealing with outliers rather than missing values, which I imagine are more visual. The Bureau of Labor Statistics project for cereal analysis described in the paper is interesting to note for its large dataset with certain unique elements or unexpected results which were more visually graphed for considering their sales.

### Concepts, Instances, Attributes

This chapter elucidates the components of Knowledge Mining: Concepts, Instances, and Attributes. Machine learning is now a mature field, with a wealth of well-understood methods and algorithms, and it is being spun out into various commercial applications with different types of methodology, as mentioned in the first chapter of the book: The Weather Problem, Irises, Marketing & Sales, etc. A challenging problem in this domain is data gathering and collection, which takes up most of the time in the entire Knowledge Representation process.

**Components**

I found the definitions of the terminology of the components in Knowledge Mining to be restrictive. Unsure as I was if it made a difference between the hierarchy of concepts, their instances, and in turn, their attributes, I was also confused about how much of a variation they are within various algorithms.

**Data**

Data is the most important feature of the data gathering step. This leads to myriad problems in the collection step itself, since data might be sparse, missing, inaccurate, or unbalanced. Personally, I ran into all of these issues while working on my Spring 2020 project, an Ancient Silk Route visualization and simulation. Data for the western side of the world was more complete and accurate than data for the eastern trade network nodes and routes, which unfortunately left the dataset unbalanced.

Similarly, the National Gallery of Art Tableau data project for Fiscal Year 2019 had many instances where data, even if public and open as was the first requirement, could not be visualized. This was due to departmental cross-referencing and connectivity rather than data incompleteness, inaccuracy, or imbalance.

Understanding the depths of the data and wading through it, as is mentioned in the text, was extremely important for both of my projects. If it were not for the pre-compiled glossary and notations list on OWTRAD (Ciolek, 1999), it would have been difficult to understand the 6000+ trade routes in the dataset. Continuous, ongoing touch-bases with the 10 different departments and external entities were required to visualize the museum data effectively. Out of the nine weeks I spent working on the dashboard, I used five weeks for understanding the Gallery, its structure, the departmental agencies, their performance metrics, requirements, and the overall strategy plan, beyond even the specific data in the datasets. For example, geographical data would be better placed in a map visual. The OMB would be interested in the rolling averages for the visitor metrics, and that is why the numeric data had to be collected quarterly.

**Software**

ARFF is the standard dataset file format in WEKA (Waikato Environment for Knowledge Analysis). I have personally not come across this software before, although I had the impression that Microsoft Azure was the more common software for the Knowledge Mining process.
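To get a feel for the format, here is a small Python sketch that writes out ARFF text: a `@relation` line, one `@attribute` line per column (either `numeric` or a list of nominal values in braces), and then the rows under `@data`. The helper name `to_arff` is my own, and the three rows are only an illustrative excerpt in the spirit of the book’s weather problem.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, kind), where kind is "numeric" or a list of
    nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = "{" + ", ".join(kind) + "}" if isinstance(kind, list) else kind
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines)


weather_attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("humidity", "numeric"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]
# Illustrative rows only.
weather_rows = [
    ("sunny", 85, 85, "FALSE", "no"),
    ("sunny", 80, 90, "TRUE", "no"),
    ("overcast", 83, 86, "FALSE", "yes"),
]
arff_text = to_arff("weather", weather_attributes, weather_rows)
```

Saving `arff_text` to a file with an `.arff` extension should give something WEKA can open, though I have not checked every corner of the format (quoting, missing values marked with `?`, dates).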

### Output

Decision tables, decision trees, decision boundaries, rules for classification, association, with exceptions, instance-based learning, rectangular generalizations

**Output formats**

I am confused about the justification for the usage of the word “knowledge” in the beginning of this chapter. Why not use a new word for decision tree and rules structures? Computer Science borrows and creates terminologies and nomenclature as and when possible, anyway.

I have always wondered why tables were fixed as the input format for most data science operations. It makes sense, since a table is a visual representation of the required output itself. I did not know that the numeric form of such tables is called a regression table. I’m unsure if I have used linear models, which are a weighted sum of attribute values, or whether they were in a list format. Again there are homonyms particular to Statistics and Knowledge Mining: “regression” means reverting, but also numeric prediction. I have studied the theory of decision boundaries and hyperplanes for Dr. Blaisten’s course CSI 690 Numerical Methods, but their applications within inputs for a model are covered in more depth here. Trees seem to be a mechanistic approach to problem-solving, a bottom-up approach as it is considered in Computer Science. I personally would really like to use the interactive tree and data visualizer, but that is because I prefer software tools to programming models from scratch, since they are less time-consuming, although at the cost of customization and functionality.

**Instance-based Representation**

I always found the distinction of learning mechanisms fascinating. Humans consider learning, and the way that the education system perceives learning to be rote-learning, while computers and machines can store information immediately and their “learning” is higher level complex tasks that seem to come naturally to humans. Sometimes, people on the neurodivergence spectrum are compared to machines even though in the timescale of evolution, humans of such nature have always existed even if not all have been called “geniuses”. It seems to point to an innate consideration of machines’ methodology of learning to be superior. As with many terms in machine learning, in the case of classification models, the training and testing algorithms help the system “learn” in this way rather than really mimic human behavior. Clustering seems more about grouping in its nomenclature and definition, especially with the venn diagram representation.

**Rectangular generalizations**

It seems to be a combination of logic gates with boundary rules, where the rectangles can possibly sit within each other and belong to different classes. I do not think I picture it fully yet.

### Credibility

Reliable evaluation of the data predictions quality. Predictive accuracy methods like Train-Test setup, cross-validation, and bootstrap and comparing them all using statistical significance tests. Probability estimates, misclassification costs, evaluating numeric prediction schemes. Determining model complexity using compression-based minimum description length principle and validation set.

The part of this text that I am most fascinated or struck by is about the philosophy of theories. I am interested in understanding how machine learning concepts can really help solve the question of evaluating scientific theories. From reading the text, it seems not to have “solved” the question so much as narrowed down what is required to evaluate theories to some specific requirements. That is, the data can be divided and used to validate itself (Train-Test). This can be done over and over again on different splits (Cross-validation). Or it can be done with “sets” of data drawn at random with replacement (Bootstrap).
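The three ways of reusing the data can be sketched in plain Python. This is a minimal illustration under my own assumptions (function names, the one-third holdout, integer stand-ins for instances), not code from the textbook or WEKA.

```python
import random


def holdout_split(data, test_fraction=1 / 3, seed=0):
    """Train-Test: shuffle once and set aside a fraction for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]  # train, test


def k_fold_splits(data, k=10, seed=0):
    """Cross-validation: every fold takes one turn as the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def bootstrap_split(data, seed=0):
    """Bootstrap: sample n instances with replacement; unpicked ones test."""
    rng = random.Random(seed)
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]
    chosen = set(picked)
    train = [data[i] for i in picked]
    test = [data[i] for i in range(n) if i not in chosen]
    return train, test


instances = list(range(30))  # stand-ins for real instances
holdout_train, holdout_test = holdout_split(instances)
cv_splits = list(k_fold_splits(instances, k=10))
boot_train, boot_test = bootstrap_split(instances)
```

In all three cases the test instances are never part of the training set for that round, which is the point of the exercise.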

**Issues**

The odds are really stacked against evaluation, but they seem to be a means to an end in a best-case scenario way. Quality labeled data in the training data is scarce. Comparing performance by ML algorithms also means that other factors and chance effects require statistical tests.

**Training and Testing**

Based on the classifier’s performance, an error rate can be measured on data it has already seen, which describes past performance but does not reliably predict future performance. Specifically, the error rate on the training data is called the resubstitution error, because it is calculated by “re-substituting” the training instances into a classifier that was built from them. Each prediction is either a success or an error, so the results form a Bernoulli sequence of binary outcomes.
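Treating each test prediction as a Bernoulli trial means the observed success rate comes with a confidence interval for the true rate. The sketch below uses the Wilson score interval, which I am assuming is an appropriate interval for this setting; the function name and the example of 750 successes out of 1000 test instances are my own choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def success_rate_interval(successes, n, confidence=0.80):
    """Wilson score interval for the true success rate of a Bernoulli process."""
    f = successes / n  # observed success rate
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


low, high = success_rate_interval(750, 1000, confidence=0.80)
small_low, small_high = success_rate_interval(75, 100, confidence=0.80)
```

The same 75% observed rate gives a much wider interval on 100 test instances than on 1000, which is one way to see why scarce test data makes evaluation so uncertain.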

**Cross Validation**

The error rate of a classifier on old data is not indicative of its error rate on new data. Cross-validation is a technique for when data is especially scarce. The number of “folds” is the number of parts the data is split into. Why is 10-fold cross-validation the standard way of predicting the error rate? Is that not too many divisions of the data? When the sample has multiple classes, each fold should contain them in roughly the same proportions as the full dataset to remove bias, which is called stratification. The error rate can also be estimated via the repeated holdout method.
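Stratification can be sketched by dealing the instances of each class round-robin across the folds. This is a simplified illustration (no shuffling, my own function name and toy labels), not the procedure any particular tool uses.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Return k lists of instance indices with class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    # Deal each class out in turn so every fold gets its share.
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


labels = ["yes"] * 9 + ["no"] * 6  # toy class labels
folds = stratified_folds(labels, k=3)
fold_counts = [
    (sum(labels[i] == "yes" for i in fold), sum(labels[i] == "no" for i in fold))
    for fold in folds
]
```

With 9 “yes” and 6 “no” instances and three folds, each fold ends up with the same 3-to-2 mix as the whole dataset.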

**Hyperparameter Selection**

This seems to be another layer of optimization within the learning algorithms themselves; these settings are called “hyperparameters” to set them apart from regular parameters. To choose them, the training set is divided into a smaller training set and a validation set.

**Comparing Data Mining Schemes**

The paired t-test compares schemes on the same training and testing sets; it is a more sensitive version of the t-test because the samples form matched pairs. Forcing a binary choice of one class or another loses information when the classifier actually outputs a probability distribution, which the quadratic loss function takes into account. The informational loss function scores the probability assigned to the class that actually occurred in each instance, rather than the accuracy level.
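The two loss functions are easy to compute for a single instance. In this sketch, quadratic loss is the sum over classes of the squared difference between the predicted probability and the 0/1 indicator of the actual class, and informational loss is the negative base-2 logarithm of the probability given to the actual class. The function names and the example prediction are my own.

```python
from math import log2


def quadratic_loss(probabilities, actual):
    """Sum of squared gaps between predicted probabilities and the 0/1 truth."""
    return sum(
        (p - (1.0 if cls == actual else 0.0)) ** 2
        for cls, p in probabilities.items()
    )


def informational_loss(probabilities, actual):
    """Bits needed to encode the actual class under the predicted distribution."""
    return -log2(probabilities[actual])


prediction = {"yes": 0.75, "no": 0.25}  # hypothetical classifier output
loss_when_right = quadratic_loss(prediction, "yes")
loss_when_wrong = quadratic_loss(prediction, "no")
bits_when_wrong = informational_loss(prediction, "no")
```

Note that `informational_loss` would fail with a math domain error if the classifier gave probability zero to the class that actually occurred, which corresponds to an infinite loss.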

**Cost**

Extremely unpredictable events like Black Swan spiral off to infinite information events. Like Covid-19 and its related events in 2020!

**Using a Validation Set for Model Selection**

The performance is affected by both overfitting (linear regression) and underfitting (data scarcity).

### Trees and Rules

Decision trees are popular for their speed and intelligible output, which also happens to be quite accurate. What is the measure of this accuracy, given the various ways of defining it? Decision Trees are more popular than Rules in AI because generating accurate rules is harder and more heuristic than building a tree.

The C4.5 algorithm is a Machine Learning scheme that builds decision trees by divide and conquer. It deals with numeric attributes and missing values, and it counters overfitting by “pruning” the trees. Overfitting shows up as an issue on independent test cases. Classification and regression trees are also produced by the CART scheme.

**Numeric Attributes**

Handling numeric, categorical, noisy, or missing values each requires its own process. When attributes are numeric rather than nominal, the tree considers binary splits at candidate threshold values. As in the simplest form of instance-based learning, the dividing line is put halfway between the neighboring values.
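A binary split on a numeric attribute can be sketched as follows: sort the values, try a threshold halfway between each pair of neighboring distinct values, and keep the threshold with the highest information gain. The function names and the tiny temperature example are hypothetical, and real implementations such as C4.5 add refinements that are not shown here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_numeric_split(values, labels):
    """Return (threshold, information gain) of the best binary split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (
            len(left) * entropy(left) + len(right) * entropy(right)
        ) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain


# Hypothetical data, made up so that one clean split exists.
temperatures = [64, 65, 68, 69, 70, 71]
play = ["yes", "yes", "yes", "no", "no", "no"]
threshold, gain = best_numeric_split(temperatures, play)
```

Here the best threshold falls between 68 and 69, where the two classes separate completely.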

**Pruning**

Pruning is the removal of a branch from the decision tree. Pre-pruning decides during tree building whether a branch is “good” enough, that is, valid enough, to keep developing. Post-pruning builds the full tree first and then applies subtree replacement or subtree raising, and it is the more popular of the two. Incrementally pruning and optimizing the tree in this way enhances the decision tree’s accuracy.

**Missing Values**

An enhancement to decision trees is to split instances with missing values into pieces using a weighting scheme and send a piece down each branch. In simpler approaches, the rows or columns that contain the missing value are either ignored or considered insignificant.

**Error rates**

C4.5 estimates error based on the training data. Although trees are induced top-down, pruning works bottom-up: each subtree is assessed against the level above it, so the decision tree improves incrementally at each level.

**Decision Tree Induction Complexity**

For $n$ training instances and $m$ attributes:

Building = $O(mn \log n)$

Complexity of subtree replacement = $O(n)$

Complexity of total subtree lifting = $O(n (\log n)^2)$

Therefore,

Complexity of a Decision Tree Induction = $O(mn \log n) + O(n (\log n)^2)$

**Classification Rules**

Rules for Separate and Conquer technique (How is that nomenclature different from Divide and Conquer?) and Global optimization steps are classification-based rule sets. Good rules avoid overfitting and underfitting. Global optimization pruning requires repeated pruning at each level.

**Association Rules**

*Apriori* algorithms generate association rules by following a generate-and-test method: candidate item sets are generated and kept only if they exceed a minimum support threshold, and the surviving sets are used to build longer ones, even though shorter item sets will always occur at least as often. This is computationally costly with a large number of item sets.
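The generate-and-test loop for finding frequent item sets can be sketched in a few lines of Python. This is a minimal, unoptimized version under my own assumptions (function name, support counted as a number of transactions, a made-up set of shopping baskets); it stops at frequent item sets and does not go on to derive the rules themselves.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for item sets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Test: count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        size += 1
        # Generate: join frequent sets into candidates one item larger.
        joined = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: every subset one item smaller must itself be frequent.
        candidates = [
            c
            for c in joined
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]
frequent = apriori(baskets, min_support=3)
```

In this example the three-item set is never even counted, because one of its two-item subsets (milk and butter) already fell below the support threshold.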

FP trees (Extended prefix tree = Frequent Pattern trees) are indexed by recursive depth. I take this to mean, bottom-up approach with respect to Decision Tree. They can be compressed for better optimization.

### Extending Instance-based and Linear Models

This chapter is about the robust and storage efficient Instance-based learning method of nearest-neighbor classification. Generalizing Linear Models by Kernel trick with Support vector machines for classification and regression, kernel ridge regression, and kernel perceptron and on applying simple linear models in a network structure that includes nonlinear transform with classical multi-layer perceptron. Building local linear models, model trees, that is, decision trees with linear regression models at leaf nodes can be used for learning problems. Instance-based learning and linear regression together for locally weighted linear regression.

Of all the topics covered in this chapter, I have previously come across SVM, Linear Models, Gradient Descent, and Back Propagation, but the notes from the online class are much more understandable than the textbook, with its denseness. I did find some of the basics discussed in an accessible and informative manner in the 3Blue1Brown series (Season 3, Episode 1), but have not yet found these specific topics explained in the exact Data Mining terms.

**Instance-based Learning**

K Nearest Neighbour is not ideal because it slows down with large training sets, performs badly with noisy data, and is thrown off when different attributes influence the outcome to different extents.
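For reference, the basic nearest-neighbor classifier is only a few lines: measure the distance from the query to every stored instance, take the k closest, and let them vote. The function name and the toy training points are my own, and the sketch uses plain unweighted Euclidean distance, which is exactly the version that suffers from the weaknesses listed above.

```python
from collections import Counter
from math import dist


def knn_classify(training, query, k=3):
    """training: list of (feature_vector, label) pairs."""
    # Sorting every stored instance is what makes large training sets slow.
    nearest = sorted(training, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: two well-separated groups.
training = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
near_a = knn_classify(training, (1.1, 1.0))
near_b = knn_classify(training, (5.0, 4.8), k=1)
```

Every query scans the whole training set, so the cost grows with the number of stored instances unless the instances are pruned or indexed.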

**Exemplar**

The number of exemplars (often occurring instances in classification) can be reduced later in the process to help with predictive accuracy. Noisy exemplars can be pruned with cross validation tests for different values or performance checking. The instances for classification are classified by the confidence in their performance. In each of these methods there seems to be a large computation cost and time cost attached in varying degrees.

Weighting attributes with the Euclidean distance function is used when attributes are relevant. Hyperrectangles are generalized exemplars that are nested. Distance functions simplify the measure to calculate the distance to the closest rectangle for grouping similar classes.

**SVM**

Support Vector Machine algorithms handle both classification and numeric prediction. SVMs use linear models to implement non-linear class boundaries, that is, to draw the line with attributes and weights in a transformed space. For numeric prediction, they calculate a function that approximates the training points, where all deviations up to a user-specified parameter ε are discarded. Extending them with higher-order equations turns them into black boxes, which are not as intuitively adjusted.
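The role of the parameter ε in support vector regression can be shown with the ε-insensitive loss: errors inside a tube of width ε around the prediction cost nothing, and only the part of an error beyond the tube counts. The function name and the numbers are my own illustration.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the excess outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


inside_tube = epsilon_insensitive_loss(10.0, 10.25, epsilon=0.5)
outside_tube = epsilon_insensitive_loss(10.0, 11.0, epsilon=0.5)
```

A larger ε ignores more of the small deviations, which generally gives a flatter function at the cost of a looser fit.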

**Stochastic Gradient Descent**

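Learning models use stochastic gradient descent to optimize their weights, including those of nonlinear neural networks. For linear models with a convex loss, the error function has a single global minimum rather than many local minima; multilayer networks do not have that guarantee.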
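Here is a minimal sketch of stochastic gradient descent on the simplest case, a one-attribute linear regression with squared error, where the error surface has a single minimum. The function name, learning rate, epoch count, and noise-free data are my own choices for illustration.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=1000, seed=0):
    """Fit y = w * x + b by updating after every single instance."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit instances in a fresh random order
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = sgd_linear_regression(xs, ys)
```

The “stochastic” part is that each update looks at one instance rather than the whole training set, so the path to the minimum is noisy but each step is cheap.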

**Multi-layer Perceptrons**

A perceptron or neuron represents a hyperplane in instance space.

Classifiers based on perceptrons can be single-layer or multilayer. The attributes, together with thresholds and biases, sit at different layers (input and hidden layers) of binary classifiers. A single perceptron makes a simple decision and gives a simple output, like a logic gate. For a function such as XOR, multiple layers are needed, and they are trained with the Backpropagation method.
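The XOR case can be worked through by hand. In this sketch each perceptron is a step-function unit, two hidden units act as OR and AND gates, and the output unit fires when OR is on but AND is off. The weights are set by hand purely for illustration; a real network would learn them with backpropagation, which needs a differentiable activation instead of the step function used here.

```python
def perceptron(weights, bias, inputs):
    """Fire (1) if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def xor_network(a, b):
    """Two-layer network with hand-set weights that computes XOR."""
    h_or = perceptron([1, 1], -0.5, [a, b])   # hidden unit acting as OR
    h_and = perceptron([1, 1], -1.5, [a, b])  # hidden unit acting as AND
    return perceptron([1, -1], -0.5, [h_or, h_and])  # OR and not AND


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

No single perceptron can produce this truth table, because one hyperplane cannot separate the two XOR classes; the hidden layer is what makes it possible.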

**Numeric prediction With Local Linear Models**

In a model tree, each leaf holds a linear regression model that predicts the class value of the instances that reach it. To predict a value for a test instance, the tree routes the instance to a leaf and applies that leaf’s model. Model trees are essentially decision trees with linear models at the leaves.
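A model tree prediction can be sketched with a nested dictionary: internal nodes hold a split on one attribute, and leaves hold an intercept and weights for a linear model. The tree layout, attribute names, and numbers below are hypothetical, and the sketch only covers prediction, not how the tree is grown or smoothed.

```python
def model_tree_predict(tree, instance):
    """Walk down to a leaf, then apply that leaf's linear model."""
    node = tree
    while "split" in node:
        attribute, threshold = node["split"]
        node = node["left"] if instance[attribute] <= threshold else node["right"]
    intercept, weights = node["intercept"], node["weights"]
    return intercept + sum(w * instance[name] for name, w in weights.items())


# Hypothetical tree: one split on "size", a different linear model per leaf.
tree = {
    "split": ("size", 100.0),
    "left": {"intercept": 10.0, "weights": {"size": 0.5}},
    "right": {"intercept": 40.0, "weights": {"size": 0.25, "age": -1.0}},
}
small = model_tree_predict(tree, {"size": 80.0, "age": 5.0})
large = model_tree_predict(tree, {"size": 120.0, "age": 4.0})
```

Each leaf only has to fit the instances in its own region, so the overall prediction is piecewise linear.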

## References

Ackoff, R. L. (1999). *Ackoff’s Best*. New York: John Wiley & Sons, pp. 170–172.

Rowley, J. (2007). The wisdom hierarchy: Representations of the DIKW hierarchy. *Journal of Information Science*, *33*(2), 163–180. https://doi.org/10.1177/0165551506070706

Stevens, S. S. (1946). On the Theory of Scales of Measurement. *Science*, *103*(2684), 677–680.

Wegman, E. (2003). Visual data mining. *Statistics in Medicine*, *22*, 1383–1397. https://doi.org/10.1002/sim.1502

Witten, I. H., Frank, E., & Hall, M. A. (2011). *Data mining: Practical machine learning tools and techniques*. Burlington, MA: Morgan Kaufmann.
