Page 3 of 3
R-GB: Today we use these bounds and design algorithms to minimize these bounds. But the bounds are loose. Are we doing the right thing?
V-V: I don't think you can get a bound which is not loose, because technically it is very difficult. Not everything can be described in a simple mathematical expressions. Now, people are working on transductive inference. What is the difference between transductive inference and inductive inference? In transductive inference, you have a set of equivalent decision rules: rules which classify the training and test data in the same way are equivalent. So instead of an infinite number of hypotheses, you have a finite number of equivalence classes. For a set of equivalence classes the problem becomes simpler; it is combinatorial. You can measure size of equivalence classes using different measures and this leads to different algorithms. You can restructure risk minimization on equivalence classes. It describes the situation better. But in this case, you will not have a general decision rule, you just have answers for your test points. You give up generality but you gain in accuracy. I am happy that people have started to develop transductive inference which was introduced back in the 1970s.
R-GB: Do you have some pointers to interesting work that you ran across recently?
V-V: There are a lot of very good works. Recently, the book "Semi-Supervised Learning" was published. Along with semi-supervised learning it contains chapters about transductive learning. I think that semi-supervised learning is still an attempt to do induction. But I believe that in order to get accuracy in inference we should give up the inductive approach. It should be something else like transductive inference or selective inference. For example, consider the following problem. You are given pairs, (xi,yi), i=1,...,l for training and simultaneously you are given m testing examples x1,…,xm. The problem is, given these two sets, can you find k examples from the test set which most probably belong to the first class? This is a decision making problem. It is easier than the inductive pattern recognition because you don't need to classify everything. Classification is difficult for the border elements which are close to the decision boundary. It is also not a ranking problem, because you are not interested in the ranking of chosen vectors. It is a simpler problem, and therefore it can be solved more accurately.
R-GB: Your SVM work gained a lot of popularity, what do you think is the reason that made it so popular?
V-V: First of all, SVM was developed over 30 years. The first publication we did jointly with Alexey Chervonenkis in 1963 and it was about optimal separating hyper-planes. It is actually linear SVM. In 1992, jointly with Bernhard Boser and Isabelle Guyon, we introduced the kernel trick and in 1995, jointly with Corinna Cortes, we introduced slack variables and it became SVM. Why is it so popular? I believe there are several reasons. First of all, it is effective. It gives very stable and good results. From a theoretical point of view, it is very clear, very simple, and allows many different generalizations, called kernel machines. It also introduced pattern recognition to optimization scientists and this includes new researchers in the field. There exist several very good libraries for SVM algorithms. All together these make SVM popular.
R-GB: It seems now days that almost all machine learning is about margins and maximizing margins. Is it the right thing to look at?
V-V: In my research, I am trying to invent something better than margin, because margin is not the best characteristic of capacity. Through margin one can bound the VC dimension. But VC dimension is a loose capacity concept. Now I am trying to introduce something instead of the margin. I am checking different ideas such as using universums and making inference by choosing the equivalence class which has the maximal number of contradiction on universum. Margin is good to prove a general point, but to find advanced techniques maybe we should think about what could be better than margin - to move closer to entropy and not to VC dimension.
R-GB: Switching to a different topic: you have been active in machine learning for more than 4 decades and keep being very innovative. How do you do that?
V-V: The problem of machine learning is very wide. It includes technical aspects as well as philosophical aspects. It is also a unification of humanities and technicalities. In different periods of my life I tried to understand different aspects of this problem. Through 1960's-1970's, I was interested in the mathematical aspects of this problem. In 1980's, I understood the relation of the pattern recognition problem to problems of classical philosophy of science. In 1990's, developing SVM I was happy to introduce a theoretical line in creating algorithms instead of heuristics. Now I see in this problem a central point for changing a general paradigm that has existed for many hundreds of years which separates inductive (or scientific) inference developed for dealing with a simple world from direct (non-inductive or non-scientific) inference which is a tool for dealing with a complex world. It is a very rich problem that has many aspects and one can find appropriate aspects of this problem in different periods of ones life.
R-GB: When you started in the 60's, did you consider your work as studying learning?
V-V: Yes, I considered my research to be research in learning theory from the very beginning. The uniform convergence theory was constructed to justify learning algorithms developed in the 1960s and, in particular, the optimal separating hyperplane. For me PAC theory that started in mid 1980's was a backward step to prior development presented in our joint with Alexey Chervonenkis book "Theory of pattern recognition" (1974) and in my book "Estimation of Dependencies Based on Empirical Data" (1979 Russian version and 1982 English translation).
R-GB: Can you share with us some recommendations for papers or books in machine learning or related areas that you find interesting?
V-V: There are many excellent articles and books on machine learning. Now is the time when many different ideas are implemented into algorithms for solving large-scale technical, biological, and linguistic problems. Now is a time when a lot of new facts and observations have appeared. It is very interesting to try to follow them in order to understand how they can fit in a general model.
Recently, the first book in the philosophy of science appeared, Reliable Reasoning - Induction and Statistical Learning Theory" by G. Harman and S. Kulkarni, which is trying to explain a philosophical component of what was done in machine learning, what is the VC dimension, what was wrong in Popper's non-falsifiability concept in contrast to VC non-falsifiability concept and so on. From my point of view this book is still conservative: it is primarely about induction but also mentions the transductive inference. Nevertheless, people in philosophy are trying to take into account our science. I believe that we should also try to advance general philosophical problems of intelligence.