Jekyll2018-12-03T00:09:36+00:00http://www.ml-hack.com/feed.xmlML HackMachine learning, deep learning, statistics, programming and finance.Pierre ForetPierre_foret@berkeley.eduWinning A Dataopen2018-10-19T00:00:00+00:002018-10-19T00:00:00+00:00http://www.ml-hack.com/winning-a-dataopen<h1 id="how-to-win-a-citadel-dataopen-semi-final">How to win a Citadel DataOpen semi-final</h1>
<p>Last September, we had the opportunity to compete in a data science competition organized by Citadel LLC. With an interesting problem, a thrilling competitive environment and the possibility to represent UC Berkeley during the national final in New York, we were highly motivated and ready to give our best. Before giving some tips and advice for future competitors, I would like to thank the organizers for making this event possible. It was well organized, thoroughly thought out, and flawlessly executed. I also want to give credit to my teammates: Teddy Legros, Hosang Yoon and Li Cao, you guys rock.</p>
<h3 id="about-the-competition">About the competition</h3>
<p>For those who are not familiar with the DataOpen, it’s a competition organized by the hedge fund Citadel LLC and the recruitment firm Correlation One, taking place in some of the most prestigious universities in the world. In less than 24 hours, competitors must analyze several datasets, draw insights from them, and conduct research on a question of their own.
This is not a Kaggle competition: the goal is not to come up with the best possible model to optimize a pre-defined metric. You have to understand the data, separate the noise from the information, come up with a relevant story and justify it rigorously. Real-life stuff for a data scientist.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/-qGgcfYh_rA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="before-the-competition">Before the competition</h2>
<h3 id="the-online-exam">The online exam</h3>
<p>Among the 400 students that registered for the competition, only about 100 made it to the event. Competitors are tested online to earn their place in the DataOpen. I can’t discuss the content of the test in detail, but it was fairly standard, the same kind that you would usually get when job hunting. You have plenty of time to complete it (about one hour), so make sure you do your best on each question before submitting your answers. Keep your statistics and basic calculus sharp, and you should be fine!</p>
<h3 id="get-good-team-mates">Get good team-mates</h3>
<p>This part is critical. It’s a fast-paced competition, so make sure your team-mates are the right ones. As much as I like to learn from others and share what I know, a high-pressure 24-hour challenge is not the ideal place for that. Pick people you trust. Equally important, make sure to show them that they can trust you.</p>
<h3 id="do-your-homework">Do your homework</h3>
<p>During the competition, you must be focused on the data and the story you tell. You won’t have the time to learn new things, and you must be confident in your ability to execute technically demanding analyses flawlessly, without having to look up the documentation for new tools, or the assumptions of a particular model. Before the competition, we wrote a lot of glue code. We agreed on how our datasets would be represented, and made sure that every bit of code would take them as input without having to change anything. Once the dataset was released, we wrote almost no new code: we were able to focus only on finding insights from the data. We could test a lot of hypotheses in a very short period of time, just because every statistical test was already scripted, tested, and understood. This was especially helpful for me, as I often find myself needing some old-school, uncommon statistical tests that are not implemented in open source Python packages.</p>
<h2 id="the-previous-day">The previous day</h2>
<p>One particularity of the competition is that even if the datasets are released only in the morning of the last day, some descriptions of the data (source, variable names and types) are provided the evening of the previous day. That leaves you a whole night to plan which hypotheses you want to test and what kind of question would be interesting to answer. When we were given the datasets, we already had a very clear idea of what to do.</p>
<p>Incidentally, that also means that you have to balance your sleeping time and your preparation for this last night before the big day. Some three hours of sleep were enough to make sure I would still be sharp during the day, while leaving me enough time to digest the numerous dataset features and review my code one last time. Congratulations to Hosang for pulling the all-nighter, and still managing to be more productive and focused than me!</p>
<p>All jokes aside, this preparation was really necessary, and probably one of the most critical parts of the competition. Other teams didn’t stay on campus too late, but I believe these hours of preliminary work really gave us a competitive edge. You can still do a lot of things without the complete datasets!</p>
<h2 id="the-final-line">The final line</h2>
<p>Early in the morning of the last day, we arrived at an open space in a building near the campus. We were served coffee, some sandwiches and a USB drive containing several datasets. It was time to see how our careful planning would hold up in front of the real thing.</p>
<h3 id="dont-step-on-each-others-toes">Don’t step on each other’s toes</h3>
<p>You are a team of four. This means you should be able to do nearly four times as much as you would alone. For that to be true, your workflow must be parallelized efficiently. Make sure you are communicating with your team-mates on what you are doing, and make sure your careful planning allows four people to work independently at all times. Don’t invade other people’s work, and be sure to conduct your own without distracting them. Working on different and independent sub-problems allowed us to cover a lot of ground. When we uploaded our final report, I discovered about two thirds of it for the first time. Trust yourself and trust your teammates!</p>
<h3 id="a-negative-result-is-still-a-result">A negative result is still a result</h3>
<p>Sometimes, things don’t go as planned. You don’t have enough datapoints to separate the noise from the information correctly. Your statistical test fails, some variables that you thought were critical aren’t actually significant. That’s totally fine. Report it, say that you would need more data to conclude, and move on.</p>
<p><img src="https://imgs.xkcd.com/comics/linear_regression.png" alt="xkcd linear reg" class="align-center" /></p>
<figure>
<figcaption class="align-center">Credits to xkcd.com</figcaption>
</figure>
<p>Cherish the negative result, make sure that it can happen and that you can recognize it: it’s the proof that you are doing real scientific work. For instance, you could try to do feature selection by fitting some tree-based classifier and printing some fancy feature importance chart. But do you know what those importances really mean? Are you sure you would have noticed that the features were actually not relevant by doing that? What would a negative result look like? Well, it’s not clear, at least for me. That’s why we still use the good old linear models and t-stats.</p>
<h3 id="know-your-stuff">Know your stuff</h3>
<p>That’s probably the most obvious advice I could give, but it’s nonetheless true. In order to win, you have to know what you are doing. Recognizing which statistical model is best suited to answer a specific question is not an easy task, and there is no way to hack it. Practice, make mistakes, understand them, but do that before the competition! Kaggle kernels, for instance, are a good way to get feedback on your analysis.
For example, we wanted to perform a regression for which the target variable was a mortality rate. Would it make sense to do a standard linear regression in this case? Are we sure that the assumptions behind a linear regression make it suitable for modelling a rate bounded between 0 and 1? The answer is no, and that’s why we used beta regression instead.</p>
<h3 id="add-value">Add value</h3>
<p>Another obvious piece of advice, but nonetheless true: make sure your analysis actually adds value to the report. Correlation maps or grid plots are nice to get a quick overview of the data, but they don’t add any real value to the analysis. Anyone can do them, show us what you got! There are a lot of tools for exploratory data analysis, probably more than one can master in a lifetime. For instance, we used:</p>
<ul>
<li>Principal component analysis, for continuous variables.</li>
<li>(Multiple) Correspondence Analysis for nominal categorical data.</li>
<li>Good old <script type="math/tex">\chi^2</script> tests on pivot tables.</li>
<li>Logistic regression, beta regression and the associated significance tests.</li>
<li>Hierarchical clustering.</li>
</ul>
<p>One of my favorites is correspondence analysis. Not only does it offer a very nice way of visualizing the interactions between several variables in a plane, but it makes it meaningful to use the Euclidean distance as a measure between your distributions: nice for clustering!</p>
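<p>As an illustration of the "good old" chi-square test on a pivot table mentioned above, here is a minimal NumPy sketch. The table values and the names <code class="highlighter-rouge">chi2_statistic</code> and <code class="highlighter-rouge">toy_table</code> are made up for the example; this is not the code we used during the competition.</p>

```python
import numpy as np

def chi2_statistic(table):
    """Pearson chi-square statistic and degrees of freedom of a contingency table."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Counts expected if the row and column variables were independent
    expected = row_totals @ col_totals / observed.sum()
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof

# Hypothetical pivot table: rows are two groups, columns are two outcomes
toy_table = [[10, 20],
             [20, 10]]
chi2_stat, chi2_dof = chi2_statistic(toy_table)
print(chi2_stat, chi2_dof)
```

<p>The statistic is then compared with the quantile of a chi-square distribution with the same number of degrees of freedom (about 3.84 at the 5% level for one degree of freedom).</p>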
<h3 id="weather-the-storm">Weather the storm</h3>
<p>Plans are just plans, and something will invariably go wrong at some point. Some results that you were expecting will turn out to be negative or some bit of code will raise an error. That’s fine. Quickly find another hypothesis to test instead, or trace your bug. When deep into the competition, you will doubt your abilities. The other teams will seem more efficient or smarter than you. Because not everything went as planned, you will think that winning is no longer possible. It might be the case, but the truth is that for now, you can’t know for sure. Don’t get distracted, move a step at a time, and give your very best at each moment without focusing on the outcome.</p>
<p><img src="https://media.giphy.com/media/5Zesu5VPNGJlm/giphy.gif" alt="monkey" class="align-center" /></p>
<figure>
<figcaption class="align-center">When your precious script breaks</figcaption>
</figure>
<h3 id="write-a-nice-report">Write a nice report</h3>
<p>At the end of the day, your report is the only link between your hard work and the judges, so make sure it’s good. We spent about two thirds of our time writing the report during the final day. That can seem like a lot, but remember that most of our code was already prepared and that we had studied the dataset’s features the whole night, so about three hours were enough to wrap up the analysis.</p>
<h2 id="a-last-word">A last word</h2>
<p>I hope that these pieces of advice will help future competitors give their best! One last word though: you need skills but also a little bit of luck. The other teams were amazing, and some did truly great work. Winning is a combination of inspiration, teamwork efficiency and some luck. Don’t take the result of the competition too personally. Not getting a place on the podium in a competition doesn’t mean that you’re not a good data scientist. Take a shot, give your best, but don’t be too harsh with yourself.</p>Pierre ForetPierre_foret@berkeley.eduAdvices for future competitorsBayesian Features Machines2017-03-25T00:00:00+00:002017-03-25T00:00:00+00:00http://www.ml-hack.com/bayesian-features-machines<h1 id="imdb-sentiment-analysis-with-bayesian-svc">IMDB sentiment analysis with Bayesian SVC.</h1>
<p>Movies are great! Sometimes… But what if we want to find out if one is worth watching? A good start would be to look at its rating on the biggest reviewing platform, IMDB. We could also do without the rating, just by reading the reviews of other film enthusiasts, but this takes some time… so what about making our computer read the reviews and assess if they are rather positive or negative?</p>
<p>Thanks to the size of this database, this toy problem has been studied a lot, with different algorithms. <a href="https://cs224d.stanford.edu/reports/TimmarajuAditya.pdf">Aditya Timmaraju and Vikesh Khanna</a> from Stanford University give a really nice overview of the various methods that can be used to tackle this problem, achieving a maximum accuracy of 86.5% with support vector machines. <a href="https://cs224d.stanford.edu/reports/HongJames.pdf">James Hong and Michael Fang</a> used paragraph vectors and recurrent neural networks to correctly classify 94.5% of the reviews. Today, we explore a much simpler, yet very effective, algorithm proposed by <a href="https://www.aclweb.org/anthology/P12-2018">Sida Wang and Christopher D. Manning</a>: the Naive Bayes Support Vector Machine (NBSVM). We will propose a geometric interpretation of this method, in addition to a Python implementation that reaches <strong>91.6%</strong> accuracy on the IMDB dataset in only a few lines of code.</p>
<h2 id="multinomial-naive-bayes-classifier">Multinomial Naive Bayes classifier</h2>
<p>Bayesian classifiers are a very popular and efficient way to tackle text classification problems. With this method, we represent a text by a vector <script type="math/tex">f</script> of occurrences, for which each element denotes the number of times a certain word appears in this text. The order of the words in the sentence doesn’t matter, only the number of times each word appears. The Bayes formula gives us the probability that a certain text is a positive review (label <script type="math/tex">Y=1</script>):</p>
<script type="math/tex; mode=display">P(Y=1|f) = \frac{P(f|Y=1)P(Y=1)}{P(f)}</script>
<p>We want to find the probability that a given text <script type="math/tex">f</script> is a positive review (<script type="math/tex">Y=1</script>). Thanks to this formula, we only need to know the probability that this review would be written given that it is positive (<script type="math/tex">P(f|Y=1)</script>), and the overall probability that a review is positive, <script type="math/tex">P(Y=1)</script>.
Although <script type="math/tex">P(f)</script> appears in the formula, it does not really matter for our classification, as we will see.</p>
<p><script type="math/tex">P(Y=1)</script> can be easily estimated: it is the frequency of positive reviews in our corpus (noted <script type="math/tex">\frac{N^+}{N}</script>).
However, <script type="math/tex">P(f|Y=1)</script> is more difficult to estimate, and we need to make some very strong assumptions about it.
In fact, we will consider that the appearance of each word of the text is independent of the appearance of the other words. This assumption is very <em>naive</em>, hence the name of the method.</p>
<p>We now consider that
<script type="math/tex">f|Y</script>
follows a multinomial distribution: for a review of <script type="math/tex">n</script> words,
what is the probability that these words are distributed as in <script type="math/tex">f</script>?
If we denote
<script type="math/tex">p_i</script>
the probability that a given word
<script type="math/tex">i</script>
appears in a positive review (and <script type="math/tex">q_i</script> that it appears in a negative review), the multinomial distribution assumes that <script type="math/tex">f|Y</script> is distributed as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{lll}
P(f|Y=1) = \frac{(\sum_{i=1}^n f_i)!}{\prod_{i=1}^n f_i!}\prod_{i=1}^n p_i^{f_i} & \mbox{ and } &
P(f|Y=0) = \frac{(\sum_{i=1}^n f_i)!}{\prod_{i=1}^n f_i!}\prod_{i=1}^n q_i^{f_i}
\end{array} %]]></script>
<p>Thus, we can predict that the review is positive if
<script type="math/tex">P(Y=1|f) \geq P(Y=0|f)</script>
, that is if the likelihood ratio
<script type="math/tex">L</script>
is greater than one:</p>
<script type="math/tex; mode=display">L = \frac{P(Y=1|f)}{P(Y=0|f)} = \frac{P(f|Y=1)P(Y=1)}{P(f|Y=0)P(Y=0)} = \frac{\prod_{i=1}^n p_i^{f_i} \times \frac{N^+}{N}}{\prod_{i=1}^n q_i^{f_i} \times \frac{N^-}{N}}</script>
<p>Or, equivalently, if its logarithm is greater than zero:</p>
<script type="math/tex; mode=display">\ln(L) = \sum_{i=1}^n f_i \ln\left(\frac{p_i}{q_i}\right) + \ln\left(\frac{N^+}{N^-}\right)</script>
<p>Which can be written as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{llll} w^T . f + b > 0 & \mbox{ with } & w = \ln\left(\frac{P}{Q}\right) & b = \ln\left(\frac{N^+}{N^-}\right) \end{array} %]]></script>
<p>We see that our decision boundary is linear in the features, with weights given by the log-ratios of the word probabilities. However, I like to write this formula differently:</p>
<script type="math/tex; mode=display">1^T . (w\circ f) + b > 0</script>
<p>where <script type="math/tex">\circ</script> stands for the element-wise product and <script type="math/tex">1</script> for the all-ones vector <script type="math/tex">(1,1,\dots,1)</script>.
Now our Bayesian feature vector is <script type="math/tex">(w\circ f)</script> and our hyperplane is orthogonal to <script type="math/tex">1</script>. However, we can wonder if this particular hyperplane is the most efficient for classifying the reviews… and the answer is no! Here is our free lunch: we will use support vector machines to find a better separating hyperplane for these Bayesian features.</p>
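<p>To make the two ways of writing the decision rule concrete, here is a small NumPy sketch on a three-word vocabulary. All the numbers (the probabilities <code class="highlighter-rouge">p</code> and <code class="highlighter-rouge">q</code>, the review counts and the review vector) are invented for the illustration, and <code class="highlighter-rouge">nb_log_ratio</code> is just a helper name for this example.</p>

```python
import numpy as np

# Made-up word probabilities for a 3-word vocabulary
p = np.array([0.5, 0.3, 0.2])   # P(word | positive review)
q = np.array([0.2, 0.3, 0.5])   # P(word | negative review)
n_pos, n_neg = 60, 40           # made-up numbers of positive / negative reviews

w = np.log(p / q)               # weights of the linear decision rule
b = np.log(n_pos / n_neg)       # bias coming from the class frequencies

def nb_log_ratio(f):
    """Log-likelihood ratio ln(L), written as 1^T (w o f) + b."""
    f = np.asarray(f, dtype=float)
    bayes_features = w * f      # element-wise product: the Bayesian features
    return bayes_features.sum() + b

f_review = np.array([3, 1, 0])  # occurrences of each word in a toy review
score = nb_log_ratio(f_review)
is_positive = score > 0
print(score, is_positive)
```

<p>Summing the Bayesian features gives exactly the same score as the dot product <script type="math/tex">w^T . f + b</script>: the Naive Bayes classifier simply weights every Bayesian feature equally.</p>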
<h2 id="from-reviews-to-vectors">From reviews to vectors</h2>
<p>The original dataset can be found <a href="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz">here</a>. However, this script named <a href="https://github.com/PForet/ML-hack/blob/master/bayesian-features/IMDB.py">IMDB.py</a> loads the reviews as a list of strings for both the train and the test sets:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">IMDB</span> <span class="kn">import</span> <span class="n">load_reviews</span>
<span class="c"># Load the training and testing sets</span>
<span class="n">train_set</span><span class="p">,</span> <span class="n">y_train</span> <span class="o">=</span> <span class="n">load_reviews</span><span class="p">(</span><span class="s">"train"</span><span class="p">)</span>
<span class="n">test_set</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">load_reviews</span><span class="p">(</span><span class="s">"test"</span><span class="p">)</span></code></pre></figure>
<p>Feel free to use it: it downloads and unzips the database automatically if needed. We will use scikit-learn’s <code class="highlighter-rouge">TfidfVectorizer</code> to transform our texts into vectors. Instead of only counting the words, it will return their frequency and apply some very useful transformations, such as giving more weight to uncommon words. The vectorizer I used is a slightly modified version of <code class="highlighter-rouge">TfidfVectorizer</code> with a custom pre-processor and tokenizer (which keeps exclamation marks, useful for sentiment analysis). By default, it doesn’t only count words but also bi-grams (pairs of consecutive words), as this gives better results at the cost of a larger feature space. You can find the code <a href="https://github.com/PForet/ML-hack/blob/master/bayesian-features/text_processing.py">here</a>, and use it to run your own tests:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">text_processing</span> <span class="kn">import</span> <span class="n">string_to_vec</span>
<span class="c"># Returns a vector that counts the occurrences of each n-gram</span>
<span class="n">my_vectorizer</span> <span class="o">=</span> <span class="n">string_to_vec</span><span class="p">(</span><span class="n">train_set</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"Count"</span><span class="p">)</span>
<span class="c"># Returns a vector of the frequency of each n-gram</span>
<span class="n">my_vectorizer</span> <span class="o">=</span> <span class="n">string_to_vec</span><span class="p">(</span><span class="n">train_set</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"TF"</span><span class="p">)</span>
<span class="c"># Same but applies an inverse document frequency transformation</span>
<span class="n">my_vectorizer</span> <span class="o">=</span> <span class="n">string_to_vec</span><span class="p">(</span><span class="n">train_set</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"TFIDF"</span><span class="p">)</span></code></pre></figure>
<p>You can tune every parameter of it, just as with a standard <code class="highlighter-rouge">TfidfVectorizer</code>. For instance, if you want to keep only individual words and not bi-grams:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Returns a vector that counts the occurrences of each word</span>
<span class="n">my_vectorizer</span> <span class="o">=</span> <span class="n">string_to_vec</span><span class="p">(</span><span class="n">train_set</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"Count"</span><span class="p">,</span> <span class="n">ngram_range</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span></code></pre></figure>
<p>From now on, we will only use:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">myvectorizer</span> <span class="o">=</span> <span class="n">string_to_vec</span><span class="p">(</span><span class="n">train_set</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"TFIDF"</span><span class="p">)</span></code></pre></figure>
<p>This will keep all words and bi-grams that appear more than 5 times in our corpus. This is a lot of words: our feature space has 133572 dimensions, for 25000 training points!
Now that we know how to transform our reviews to vectors, we need to choose a machine learning algorithm. We talked about support vector machines. However, they scale very poorly and are too slow to be trained on 25000 points with more than 100000 features. We will thus use a close relative, the dual formulation of an <em>l2-penalized</em> logistic regression. We will now explain why this is very similar to a support vector classifier.</p>
<h2 id="support-vector-machine-and-logistic-regression">Support vector machine and logistic regression</h2>
<h3 id="cost-function-of-a-support-vector-machine">Cost function of a support vector machine</h3>
<p>A support vector machine tries to find a separating plane <script type="math/tex">w^T.f + b = 0</script> that maximises the distance between the plane and the closest points. This distance, called the <em>margin</em>, can be expressed in terms of <script type="math/tex">w</script>:</p>
<p style="text-align: center;"><img src="http://www.ml-hack.com/assets/images/stat/SVM_margin.png" alt="support vector machine margin" height="50%" width="50%" /></p>
<p>A point is correctly classified if it is on the right side of the plane, and outside of the margin. On this image, we see that a sample is correctly classified if <script type="math/tex">w^T.f + b \geq 1</script> and <script type="math/tex">Y=1</script> or <script type="math/tex">w^T.f + b \leq -1</script> and <script type="math/tex">Y=0</script>. This can be summarised as <script type="math/tex">(2y_i - 1)(w^T.f+b) \geq 1</script>. We want to maximise the margin <script type="math/tex">\frac{2}{||w||}</script>, which is equivalent to minimising <script type="math/tex">\frac{1}{2}||w||^2</script>, thus the optimisation problem of a support vector classifier is:</p>
<script type="math/tex; mode=display">\left\{ \begin{array}{ll}
\min_{w,b} \frac{1}{2}||w||^2 \\
s.t. (2y_i - 1)(w^T.f+b) \geq 1
\end{array} \right.</script>
<p>However, if our observations are not linearly separable, such a solution doesn’t exist. Therefore we introduce <em>slack variables</em> that allow our model to incorrectly classify some points at some cost <script type="math/tex">C</script>:</p>
<script type="math/tex; mode=display">\left\{ \begin{array}{ll}
\min_{w,b} \frac{1}{2}||w||^2 + C \sum_{i=1}^n \epsilon_i \\
s.t. (2y_i - 1)(w^T.f+b) \geq 1 - \epsilon_i \mbox{ with }
\epsilon_i \geq 0
\end{array} \right.</script>
<h3 id="logistic-regression">Logistic regression</h3>
<p>In logistic regression, the probability of a label to be <script type="math/tex">1</script> given a vector <script type="math/tex">f</script> is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{lll}
P(Y=1 | \, f) = \sigma(w^T.f+b) & \mbox{ where } & \sigma(x) = \frac{1}{1+e^{-x}}
\end{array} %]]></script>
<p>If we add an <em>l2-regularisation</em> penalty to our regression, the objective function becomes:</p>
<script type="math/tex; mode=display">\min_{w,b} \frac{1}{2}||w||^2 + C \sum_{i=1}^n \ln\left(1+e^{-(2y_i - 1)(w^T.f+b)}\right)</script>
<p>Where <script type="math/tex">\sum_{i=1}^n \ln\left(1+e^{-(2y_i - 1)(w^T.f+b)}\right)</script> is the negative log-likelihood of our observations (as for the support vector machine, <script type="math/tex">2y_i - 1</script> maps the labels <script type="math/tex">\{0, 1\}</script> to <script type="math/tex">\{-1, +1\}</script>).
If you like statistics, it is worth noting that adding the <em>l2-penalty</em> is the same as maximising the posterior with a Gaussian prior on the weights (or a Laplace prior for an <em>l1-penalty</em>).</p>
<h3 id="why-are-they-similar">Why are they similar?</h3>
<p>We define the likelihood ratio as</p>
<script type="math/tex; mode=display">r = \frac{P(Y=1|\, f)}{P(Y=0| \,f)} = e^{w^Tf+b}</script>
<p>the cost of a positive example for the support vector machine is:</p>
<script type="math/tex; mode=display">cost_{Y=1} = C \times \max(0, 1-(w^T.f+b)) = C \times \max(0, 1-\ln(r))</script>
<p>and for the logistic regression with a <em>l2-regularisation</em> penalty:</p>
<script type="math/tex; mode=display">cost_{Y=1} = C \times \ln\left(1+e^{-(w^T.f+b)}\right) = C \times \ln\left(1+\frac{1}{r}\right)</script>
<p>If we plot the cost of a positive example for the two models, we see that we have very similar losses:</p>
<p style="text-align: center;"><img src="http://www.ml-hack.com/assets/images/stat/svc_vs_lr.png" alt="support vector machine against logit loss" /></p>
<p>This is why an SVC with a linear kernel will give results similar to an <em>l2-penalized</em> logistic regression.</p>
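<p>The comparison can also be checked numerically. The sketch below evaluates both costs for a positive example over a few values of the score <script type="math/tex">z = w^T.f+b</script>; the function names and the grid of scores are choices made for this illustration only.</p>

```python
import numpy as np

def hinge_cost(z, C=1.0):
    """SVM cost of a positive example with score z = w^T.f + b."""
    return C * np.maximum(0.0, 1.0 - z)

def logistic_cost(z, C=1.0):
    """Logistic cost of a positive example: C * ln(1 + exp(-z)), computed stably."""
    return C * np.logaddexp(0.0, -z)

scores = np.linspace(-3.0, 3.0, 7)
for z, h, l in zip(scores, hinge_cost(scores), logistic_cost(scores)):
    print(f"z = {z:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```

<p>Both costs grow roughly linearly for badly misclassified points and vanish (exactly for the hinge, asymptotically for the logistic loss) for points far on the correct side.</p>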
<p>In our classification problem, we have 25000 training examples, and more than 130000 features, so an SVC will be extremely slow to train.
However, a logistic regression with an l2 penalty is much faster to train than an SVC when the number of samples grows, and gives very similar results, as we just saw.</p>
<h3 id="dual-formulation-of-the-logistic-regression">Dual formulation of the logistic regression</h3>
<p>When the number of samples is smaller than the number of features, as it is here, one might consider solving the dual formulation of the logistic regression.
If you are interested in finding out about this formulation, I recommend the paper by <a href="https://link.springer.com/content/pdf/10.1007%2Fs10994-010-5221-8.pdf">Hsiang-Fu Yu, Fang-Lan Huang, and Chih-Jen Lin</a>, which makes a nice comparison between the linear SVC and the dual formulation of the logistic regression, uncovering more similarities between these techniques.</p>
<h2 id="implementation-of-the-model">Implementation of the model</h2>
<p>As seen before, we define</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{lll} P = \alpha + \sum_{i, y_i=1} f_i & \mbox{ and } &
Q = \alpha + \sum_{i, y_i=0} f_i \end{array} %]]></script>
<p>For some smoothing parameter <script type="math/tex">\alpha</script>. The log-ratio <script type="math/tex">R</script> is defined as:</p>
<script type="math/tex; mode=display">R = \ln\left(\frac{P/||P||_1}{Q/||Q||_1}\right)</script>
<p>Where
<script type="math/tex">||.||_1</script>
stands for the <script type="math/tex">L^1</script> norm.</p>
<p>Finally, the Bayesian features used to fit our SVC will be</p>
<script type="math/tex; mode=display">R \circ X_{train}</script>
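<p>To make these definitions concrete, here is a small dense NumPy sketch on a toy matrix of four "reviews" and three "words". The values of <code class="highlighter-rouge">X_toy</code> and <code class="highlighter-rouge">y_toy</code> are invented for the example; the real implementation works on sparse matrices.</p>

```python
import numpy as np

# Toy count matrix: 4 reviews (rows) x 3 words (columns), made-up values
X_toy = np.array([[2, 0, 1],
                  [1, 1, 0],
                  [0, 3, 1],
                  [0, 1, 2]], dtype=float)
y_toy = np.array([1, 1, 0, 0])   # 1 = positive review, 0 = negative review
alpha = 1.0                      # smoothing parameter

# Smoothed word counts in the positive and negative reviews
P = alpha + X_toy[y_toy == 1].sum(axis=0)
Q = alpha + X_toy[y_toy == 0].sum(axis=0)

# Log-ratio of the L1-normalised count vectors
R = np.log((P / P.sum()) / (Q / Q.sum()))

# Bayesian features: each column of X is scaled by the matching entry of R
bayes_features = X_toy * R
print(R)
print(bayes_features)
```

<p>Words that are more frequent in positive reviews get a positive entry in <script type="math/tex">R</script>, words that are more frequent in negative reviews get a negative one, and a word absent from a review keeps a zero feature.</p>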
<p>Of course, we will use a sparse matrix to save memory (our vectors are mostly zeros).
Wrapped in some Python code, this gives:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">division</span>
<span class="kn">from</span> <span class="nn">scipy.sparse</span> <span class="kn">import</span> <span class="n">csr_matrix</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">class</span> <span class="nc">NBSVM</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="n">alpha</span>
        <span class="c"># Keep additional keyword arguments to pass to the classifier</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">kwargs</span> <span class="o">=</span> <span class="n">kwargs</span>
    <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="n">f_1</span> <span class="o">=</span> <span class="n">csr_matrix</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span>
        <span class="n">f_0</span> <span class="o">=</span> <span class="n">csr_matrix</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">subtract</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">y</span><span class="p">))</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span> <span class="c">#Invert labels</span>
        <span class="c"># Compute the probability vectors P and Q</span>
        <span class="n">p_</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">alpha</span><span class="p">,</span> <span class="n">X</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">f_1</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
        <span class="n">q_</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">alpha</span><span class="p">,</span> <span class="n">X</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">f_0</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
        <span class="c"># Normalize the vectors</span>
        <span class="n">p_normed</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">p_</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">p_</span><span class="p">)))</span>
        <span class="n">q_normed</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">q_</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">q_</span><span class="p">)))</span>
        <span class="c"># Compute the log-ratio vector R and keep for future uses</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">r_</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">p_normed</span><span class="p">,</span> <span class="n">q_normed</span><span class="p">))</span>
        <span class="c"># Compute bayesian features for the train set</span>
        <span class="n">f_bar</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">r_</span><span class="p">)</span>
        <span class="c"># Fit the classifier (the dual formulation requires the liblinear solver)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">lr_</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">dual</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">"liblinear"</span><span class="p">,</span> <span class="o">**</span><span class="bp">self</span><span class="o">.</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">lr_</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">f_bar</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">lr_</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">r_</span><span class="p">))</span></code></pre></figure>
<p>And finally (I chose the parameters <script type="math/tex">\alpha = 0.1</script> and <script type="math/tex">C=12</script> by cross-validation):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Transform the training and testing sets</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">myvectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">train_set</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">myvectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test_set</span><span class="p">)</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">NBSVM</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span><span class="n">C</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)))</span></code></pre></figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Accuracy: 0.91648
</code></pre></div></div>
<p>That was a pretty painless way of achieving 91.6% accuracy!</p>
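<p>As a sanity check on the log-count ratio computed in <code class="highlighter-rouge">fit</code>, here is a minimal sketch in plain Python on a made-up two-word vocabulary. The toy counts and the helper name <code class="highlighter-rouge">log_count_ratio</code> are assumptions for illustration and are not part of the implementation above.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import math

def log_count_ratio(pos_counts, neg_counts, alpha=0.1):
    """Smoothed log-ratio r = log((p / |p|) / (q / |q|)) for each feature."""
    p = [alpha + c for c in pos_counts]
    q = [alpha + c for c in neg_counts]
    p_sum, q_sum = sum(p), sum(q)
    return [math.log((pi / p_sum) / (qi / q_sum)) for pi, qi in zip(p, q)]

# Hypothetical counts: word 0 appears mostly in positive texts, word 1 mostly in negative ones
r = log_count_ratio([9, 1], [1, 9])
print(r)  # first entry is positive, second is negative
</code></pre></figure>
<p>A word that is more frequent in positive texts gets a positive weight, and a word that is more frequent in negative texts gets a negative one. These weights are what rescale the features before the logistic regression is fitted.</p>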
<p>Thanks a lot for reading, and don’t hesitate to leave a comment if you have any questions or suggestions ;)</p>Pierre ForetPierre_foret@berkeley.eduHow to classify texts using Bayesian features and support vector machines.Neural Networks Specialization2017-03-21T00:00:00+00:002017-03-21T00:00:00+00:00http://www.ml-hack.com/neural-networks-specialization<h1 id="specializing-a-neural-network-with-svc">Specializing a neural network with SVC</h1>
<p><em>This article is a follow up to <a href="/characters-recognition-with-keras/">this post</a>, where we trained a CNN to recognise Devanagari characters.</em></p>
<p>Transfer learning is the practice of using knowledge already acquired to perform new tasks, and it’s awesome. After all, why start from scratch when the problem is already almost solved?</p>
<p>In the case of neural networks, one way to perform transfer learning is to re-train the last layers of a network. I’m not fond of this method, as it can feel unnatural to implement. I think a more explicit way to benefit from a trained network is to use it as a feature extractor by chopping off the last layers. Once you see your network as just a feature extractor, retraining the last layers simply means stacking a new network on top. But why should we limit ourselves to this possibility? If we have few training samples, why not use a more suitable algorithm, such as a support vector classifier?</p>
<p>This is exactly what we are going to do in this final chapter of our series on handwritten character recognition. Today, we will greatly improve our accuracy by training a CNN on the whole database (numerals, consonants, and vowels), before replacing the last layers with support vector machines.</p>
<p><em>This article follows <a href="/characters-recognition-with-keras/">this one</a> and assumes the same data structures are loaded in the workspace.</em></p>
<h3 id="merging-the-datasets">Merging the datasets</h3>
<p><strong>We start by merging the three sets</strong>. To do so, we increment the labels of the vowels by 10 (the number of numerals) and the labels of the consonants by 22 (number of numerals + vowels) to resolve conflicts between labels.</p>
<p><strong>We then split the dataset into a training, a validation and a testing set</strong>, using the same method as before (stratifying, and using the same proportions).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">all_tensors</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">tensors_numerals</span><span class="p">,</span> <span class="n">tensors_vowels</span><span class="p">,</span> <span class="n">tensors_consonants</span><span class="p">))</span>
<span class="n">all_labels_int</span> <span class="o">=</span> <span class="n">numerals_labels</span> <span class="o">+</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="mi">10</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">vowels_labels</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="mi">22</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">consonants_labels</span><span class="p">]</span>
<span class="n">nb_labels</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">all_labels_int</span><span class="p">))</span>
<span class="n">all_labels</span> <span class="o">=</span> <span class="n">np_utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">([</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">all_labels_int</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)],</span> <span class="n">nb_labels</span><span class="p">)</span>
<span class="n">X_model</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_model</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">all_tensors</span><span class="p">,</span> <span class="n">all_labels</span><span class="p">,</span> <span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.15</span><span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">all_labels</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_val</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_val</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_model</span><span class="p">,</span> <span class="n">y_model</span><span class="p">,</span> <span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.176</span><span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">y_model</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span></code></pre></figure>
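<p>The label offsets and the split sizes above can be checked with a few lines of arithmetic. This sketch assumes the raw labels in each set are 1-based (which is what the <code class="highlighter-rouge">i-1</code> passed to <code class="highlighter-rouge">to_categorical</code> suggests) and that there are 10 numerals, 12 vowels and 36 consonants, as implied by the offsets and by the 58 classes used in this post.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Assumed number of classes per set (10 numerals, 12 vowels, 36 consonants)
n_numerals, n_vowels, n_consonants = 10, 12, 36

# Assumption: raw labels are 1-based inside each set
numerals = list(range(1, n_numerals + 1))
vowels = [i + 10 for i in range(1, n_vowels + 1)]
consonants = [i + 22 for i in range(1, n_consonants + 1)]

# Shift to 0-based indices, as done before calling to_categorical
merged = [i - 1 for i in numerals + vowels + consonants]
print(min(merged), max(merged), len(set(merged)))  # 0 57 58

# Two-step split: 15% for testing, then 17.6% of the remaining 85% for validation
test_share = 0.15
val_share = (1 - test_share) * 0.176
train_share = 1 - test_share - val_share
print(round(test_share, 3), round(val_share, 3), round(train_share, 3))  # 0.15 0.15 0.7
</code></pre></figure>
<p>So the second <code class="highlighter-rouge">test_size</code> of 0.176 makes the validation set roughly 15% of the whole dataset, the same size as the testing set, leaving about 70% for training.</p>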
<h3 id="building-a-new-model">Building a new model</h3>
<p><strong>We then define a model that will be trained on the whole training dataset</strong> (numerals, consonants, and vowels together). We now have a larger dataset (over 9000 images for training), and we will use data-augmentation. Because of that, <strong>we can afford a more complex model to better fit the new diversity of our dataset</strong>. The new model is constructed as follows:</p>
<ul>
<li>We start with a convolutional layer with <strong>more filters (128)</strong>.</li>
<li>We put <strong>two dense layers of 512 nodes</strong> before the last layer, to construct a better representation of the features uncovered by the convolutional layers. We will keep these layers when specialising the model to one of the three datasets.</li>
<li>Because the model is still quite simple (fewer than 4 million parameters), we can afford to train for <strong>numerous epochs on a GPU</strong>. Training for many epochs is also a good way to benefit fully from data-augmentation, as the model sees newly transformed images at each iteration. However, to prevent overfitting, <strong>we put a dropout layer before each dense layer</strong>, and also one after the first convolutional layer.</li>
</ul>
<p>The <code class="highlighter-rouge">keras</code> implementation of the model is:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model_for_all</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span>
<span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">36</span><span class="p">,</span><span class="mi">36</span><span class="p">,</span><span class="mi">1</span><span class="p">)))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.25</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">MaxPooling2D</span><span class="p">(</span><span class="n">pool_size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.25</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Flatten</span><span class="p">())</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.7</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.7</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">58</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">))</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
<span class="n">opt</span> <span class="o">=</span> <span class="n">RMSprop</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">rho</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">epsilon</span><span class="o">=</span><span class="mf">1e-08</span><span class="p">,</span> <span class="n">decay</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">opt</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span></code></pre></figure>
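<p>To check that this architecture stays under 4 million parameters, the count can be reproduced by hand. The sketch below assumes the Keras defaults used above (unpadded ‘valid’ convolutions with stride 1, one bias per filter or unit). The helper names are made up for this example.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def conv_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block plus one bias per filter
    return filters * (kernel * kernel * in_channels + 1)

def dense_params(inputs, units):
    # One weight per input plus one bias per unit
    return units * (inputs + 1)

size, channels = 36, 1           # 36x36 grayscale input
total = 0
for filters, pool_after in [(128, False), (64, True), (32, False)]:
    total += conv_params(3, channels, filters)
    size -= 2                    # a 3x3 'valid' convolution trims 2 pixels per side length
    if pool_after:
        size //= 2               # 2x2 max pooling halves each side
    channels = filters

features = size * size * channels  # length of the flattened vector
for units in (512, 512, 58):
    total += dense_params(features, units)
    features = units

print(total)  # 3597722
</code></pre></figure>
<p>Almost all of these weights (about 3.2 million) sit in the first dense layer, which connects the 6272 flattened features to 512 nodes.</p>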
<h3 id="fitting-the-model-with-data-augmentation">Fitting the model with data-augmentation</h3>
<p><strong>We will now fit the same model using data-augmentation</strong>. We use Keras’ <code class="highlighter-rouge">ImageDataGenerator</code> to dynamically generate new batches of images. We specify the transformations we want on augmented images:</p>
<ul>
<li>A <strong>small random rotation</strong> of the characters (maximum 15 degrees)</li>
<li>A <strong>small random zoom</strong> (in or out), up to a maximum of 20% of the image size.</li>
<li>We could add random translations, but they are pretty useless if the first layers are convolutional.</li>
</ul>
<p>When using data-augmentation, we need to fit the model using a special function, <code class="highlighter-rouge">fit_generator</code>. We specify that <strong>we want to monitor the training with a non-augmented validation set</strong>, by passing <code class="highlighter-rouge">validation_data=(X_val, y_val)</code>. Finally, we save the weights only when the validation loss improves, and we compute the accuracy on the testing set.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">keras.preprocessing.image</span> <span class="kn">import</span> <span class="n">ImageDataGenerator</span>
<span class="n">datagen</span> <span class="o">=</span> <span class="n">ImageDataGenerator</span><span class="p">(</span>
<span class="n">rotation_range</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
<span class="n">zoom_range</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
<span class="c"># The checkpointer allows us to monitor the validation loss and to save weights</span>
<span class="n">checkpointer</span> <span class="o">=</span> <span class="n">ModelCheckpoint</span><span class="p">(</span><span class="n">filepath</span><span class="o">=</span><span class="s">'saved_models/weights.best.for_all'</span><span class="p">,</span>
<span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">save_best_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">fit_generator</span><span class="p">(</span><span class="n">datagen</span><span class="o">.</span><span class="n">flow</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">400</span><span class="p">),</span>
<span class="n">steps_per_epoch</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span> <span class="o">/</span> <span class="mi">400</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span>
<span class="n">validation_data</span><span class="o">=</span><span class="p">(</span><span class="n">X_val</span><span class="p">,</span> <span class="n">y_val</span><span class="p">),</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">checkpointer</span><span class="p">])</span>
<span class="c">#The last weights are not the ones we want to keep, so we must reload the best weights found</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">load_weights</span><span class="p">(</span><span class="s">'saved_models/weights.best.for_all'</span><span class="p">)</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">model_for_all</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy on test set: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
<span class="n">accuracy_score</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y_pred</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))))</span></code></pre></figure>
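<p>The <code class="highlighter-rouge">save_best_only=True</code> behaviour can be pictured with a small stand-alone sketch: weights are written only when the monitored validation loss improves on the best value seen so far. The loss values below are invented, and the function illustrates the idea rather than Keras’ actual implementation.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def epochs_that_save(val_losses):
    """Return the (1-based) epochs at which a best-only checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if best > loss:          # improvement: overwrite the checkpoint
            best = loss
            saved.append(epoch)
    return saved

# Made-up validation losses for six epochs
print(epochs_that_save([0.90, 0.60, 0.65, 0.50, 0.55, 0.52]))  # [1, 2, 4]
</code></pre></figure>
<p>This is why the best weights are reloaded after training: the final epoch is not necessarily the best one.</p>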
<p><strong>We now remove the last two layers of our model</strong> (the last dense layer and the dropout layer before it). We then freeze the remaining layers to make them non-trainable, and we save our base model for future use.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Removes the two last layers (dense and dropout)</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">pop</span><span class="p">();</span> <span class="n">model_for_all</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
<span class="c"># Makes the layers non trainable</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">model_for_all</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
    <span class="n">l</span><span class="o">.</span><span class="n">trainable</span> <span class="o">=</span> <span class="bp">False</span>
<span class="c"># Saves the model for easy loading</span>
<span class="n">model_for_all</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">"Models/model_for_all"</span><span class="p">)</span></code></pre></figure>
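<p>The effect of the two <code class="highlighter-rouge">pop()</code> calls can be followed on a plain list of layer names, copied from the model definition above. This is only a bookkeeping sketch and does not touch Keras.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">layers = ["Conv2D(128)", "Dropout(0.25)", "Conv2D(64)", "MaxPooling2D",
          "Conv2D(32)", "Dropout(0.25)", "Flatten",
          "Dense(512)", "Dropout(0.7)", "Dense(512)", "Dropout(0.7)", "Dense(58)"]

# Drop the softmax layer and the dropout layer just before it
layers.pop()
layers.pop()

print(len(layers), layers[-1])  # 10 Dense(512)
</code></pre></figure>
<p>The truncated network therefore ends with a dense layer of 512 nodes, so each image is mapped to a vector of 512 activations.</p>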
<p>We should now split our whole dataset into its numerals, vowels and consonants components. This part is a little tedious but is necessary to ensure that we test and validate our specialized model on samples that were not seen during the previous training.</p>
<p>To do that, we define an <code class="highlighter-rouge">extract_subset</code> function that extracts the samples whose label falls in a given range. For instance, to extract only the numerals, we keep all the samples with a label between 0 and 9.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">extract_subset</span><span class="p">(</span><span class="n">X_set</span><span class="p">,</span> <span class="n">y_set</span><span class="p">,</span> <span class="n">start_index</span><span class="p">,</span> <span class="n">end_index</span><span class="p">):</span>
    <span class="n">X</span> <span class="o">=</span> <span class="p">[[</span><span class="n">x</span><span class="p">]</span> <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">i</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X_set</span><span class="p">,</span><span class="n">y_set</span><span class="p">)</span> <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">>=</span> <span class="n">start_index</span> <span class="ow">and</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o"><</span> <span class="n">end_index</span><span class="p">]</span>
    <span class="n">y</span> <span class="o">=</span> <span class="p">[[</span><span class="n">i</span><span class="p">[</span><span class="n">start_index</span><span class="p">:</span><span class="n">end_index</span><span class="p">]]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">y_set</span> <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">>=</span> <span class="n">start_index</span> <span class="ow">and</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o"><</span> <span class="n">end_index</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c"># numerals : labels between 0 and 9</span>
<span class="n">X_test_numerals</span><span class="p">,</span> <span class="n">y_test_numerals</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="c"># vowels : labels between 10 and 21</span>
<span class="n">X_test_vowels</span><span class="p">,</span> <span class="n">y_test_vowels</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">22</span><span class="p">)</span>
<span class="c"># Consonants : labels between 22 and 57</span>
<span class="n">X_test_consonants</span><span class="p">,</span> <span class="n">y_test_consonants</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="mi">58</span><span class="p">)</span>
<span class="n">X_val_numerals</span><span class="p">,</span> <span class="n">y_val_numerals</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_val</span><span class="p">,</span> <span class="n">y_val</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">X_val_vowels</span><span class="p">,</span> <span class="n">y_val_vowels</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_val</span><span class="p">,</span> <span class="n">y_val</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">22</span><span class="p">)</span>
<span class="n">X_val_consonants</span><span class="p">,</span> <span class="n">y_val_consonants</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_val</span><span class="p">,</span> <span class="n">y_val</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="mi">58</span><span class="p">)</span>
<span class="n">X_train_numerals</span><span class="p">,</span> <span class="n">y_train_numerals</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">X_train_vowels</span><span class="p">,</span> <span class="n">y_train_vowels</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">22</span><span class="p">)</span>
<span class="n">X_train_consonants</span><span class="p">,</span> <span class="n">y_train_consonants</span> <span class="o">=</span> <span class="n">extract_subset</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="mi">58</span><span class="p">)</span></code></pre></figure>
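<p>To see what the extraction does to the labels, here is a toy version in plain Python with a four-class one-hot encoding. The data is invented, and the helper <code class="highlighter-rouge">subset_by_label_range</code> is a simplified stand-in for <code class="highlighter-rouge">extract_subset</code>, not a replacement for it.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def subset_by_label_range(samples, one_hot_labels, start, end):
    """Keep samples whose class index is in [start, end) and crop the labels to that range."""
    kept_x, kept_y = [], []
    for x, row in zip(samples, one_hot_labels):
        label = row.index(max(row))      # same role as np.argmax
        if label in range(start, end):
            kept_x.append(x)
            kept_y.append(row[start:end])
    return kept_x, kept_y

samples = ["a", "b", "c", "d"]
labels = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

# Keep only classes 2 and 3; the cropped labels are one-hot over two classes
print(subset_by_label_range(samples, labels, 2, 4))
# (['b', 'd'], [[1, 0], [0, 1]])
</code></pre></figure>
<p>Note that the labels are cropped as well as filtered, so each subset ends up with its own one-hot encoding starting at index 0.</p>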
<p>We have extracted the training, validation and testing sets for the three datasets. We can now load the pre-trained model, using keras’ <code class="highlighter-rouge">load_model</code> function, and extract the activation of the last layers. These activations can be seen as high-level features of our images.</p>
<p>One way to specialize our model would be to add another dense layer (with a softmax activation) on top. This is, in fact, equivalent to <strong>performing a logistic regression over the activations of the last layer</strong> produced by each image (these activations will be called ‘bottleneck_features’). We suggest here using a more powerful classifier instead of a final dense layer: we will train an SVC to predict the class of the character, given the bottleneck features as input.</p>
<p>Our feature extractor will be the model trained on the whole dataset (using data-augmentation) with the last dense layer removed. It transforms an image into a vector of 512 high-level features that we can feed to our SVC.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">load_model</span>
<span class="n">features_extractor</span> <span class="o">=</span> <span class="n">load_model</span><span class="p">(</span><span class="s">"Models/model_for_all"</span><span class="p">)</span></code></pre></figure>
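<p>As a side note on the equivalence mentioned earlier: a dense softmax layer on top of frozen features computes the same class probabilities as a multinomial logistic regression on those features. A minimal sketch in plain Python, with made-up weights (three features, two classes), shows the computation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import math

def softmax_layer(features, weights, biases):
    """Dense layer with softmax: probabilities proportional to exp(w . x + b)."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical bottleneck features and layer parameters
probs = softmax_layer([1.0, 0.0, 2.0],
                      weights=[[0.5, -1.0, 0.25], [-0.5, 1.0, 0.0]],
                      biases=[0.0, 0.5])
print(probs)
</code></pre></figure>
<p>With two classes this reduces to the sigmoid of the score difference, i.e. ordinary logistic regression.</p>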
<p>We then merge the training and validation sets (to perform K-fold cross-validation instead), and run a grid search to find the best parameters for our SVC.
The following function does so and returns the best SVC found during the grid search.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>

<span class="k">def</span> <span class="nf">extract_bottleneck_features</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="c"># Features are the activations of the last layer</span>
    <span class="k">return</span> <span class="n">features_extractor</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">SVC_Top</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="c"># Scikit SVC takes an array of integers as labels</span>
    <span class="n">y_flat</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">y</span><span class="p">]</span>
    <span class="c"># We train the SVC on the extracted features</span>
    <span class="n">X_flat</span> <span class="o">=</span> <span class="n">extract_bottleneck_features</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    <span class="n">High_level_classifier</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">()</span>
    <span class="c"># The parameters to try with grid_search</span>
    <span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span><span class="s">'C'</span><span class="p">:[</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">1000</span><span class="p">],</span>
                  <span class="s">'gamma'</span><span class="p">:[</span><span class="mf">0.001</span><span class="p">,</span><span class="mf">0.01</span><span class="p">,</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">100</span><span class="p">]}</span>
    <span class="n">grid_search</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">High_level_classifier</span><span class="p">,</span> <span class="n">param_grid</span><span class="p">)</span>
    <span class="n">grid_search</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_flat</span><span class="p">,</span> <span class="n">y_flat</span><span class="p">)</span>
    <span class="c"># We return the best SVC found</span>
    <span class="k">return</span> <span class="n">grid_search</span><span class="o">.</span><span class="n">best_estimator_</span>

<span class="c"># We get the best SVC for each type of characters</span>
<span class="n">X_consonants</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">X_train_consonants</span><span class="p">,</span> <span class="n">X_val_consonants</span><span class="p">))</span>
<span class="n">Best_top_classifier_consonants</span> <span class="o">=</span> <span class="n">SVC_Top</span><span class="p">(</span><span class="n">X_consonants</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">y_train_consonants</span><span class="p">,</span> <span class="n">y_val_consonants</span><span class="p">)))</span>
<span class="n">X_vowels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">X_train_vowels</span><span class="p">,</span> <span class="n">X_val_vowels</span><span class="p">))</span>
<span class="n">Best_top_classifier_vowels</span> <span class="o">=</span> <span class="n">SVC_Top</span><span class="p">(</span><span class="n">X_vowels</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">y_train_vowels</span><span class="p">,</span> <span class="n">y_val_vowels</span><span class="p">)))</span>
<span class="n">X_numerals</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">X_train_numerals</span><span class="p">,</span> <span class="n">X_val_numerals</span><span class="p">))</span>
<span class="n">Best_top_classifier_numerals</span> <span class="o">=</span> <span class="n">SVC_Top</span><span class="p">(</span><span class="n">X_numerals</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">y_train_numerals</span><span class="p">,</span> <span class="n">y_val_numerals</span><span class="p">)))</span></code></pre></figure>
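<p>To get a feel for the cost of this grid search, the sketch below expands the same <code class="highlighter-rouge">param_grid</code> into its individual candidates using only the standard library. The helper <code class="highlighter-rouge">expand_grid</code> and the value of <code class="highlighter-rouge">n_folds</code> are illustrative assumptions: the number of folds actually used by <code class="highlighter-rouge">GridSearchCV</code> depends on its <code class="highlighter-rouge">cv</code> argument and on the scikit-learn version.</p>

```python
from itertools import product

# Same grid as in SVC_Top above
param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}


def expand_grid(grid):
    """Return every parameter combination of the grid as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


candidates = expand_grid(param_grid)

# Assumption: 5 folds. Check the cv argument of your scikit-learn version.
n_folds = 5

print(len(candidates))            # number of (C, gamma) pairs to evaluate
print(len(candidates) * n_folds)  # number of SVC fits before the final refit
```

<p>Every candidate is fitted once per fold, which is why training on 512 bottleneck features rather than on raw pixels keeps the search affordable.</p>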
<h3 id="results">Results</h3>
<p>These results far exceed what we obtained with a CNN trained from scratch:</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center">SVC over bottleneck features</th>
<th style="text-align: center">CNN from scratch</th>
<th style="text-align: center">SVC with PCA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Numerals</td>
<td style="text-align: center">99.7%</td>
<td style="text-align: center">97.9%</td>
<td style="text-align: center">96.9%</td>
</tr>
<tr>
<td>Vowels</td>
<td style="text-align: center">99.5%</td>
<td style="text-align: center">93.5%</td>
<td style="text-align: center">87.9%</td>
</tr>
<tr>
<td>Consonants</td>
<td style="text-align: center">94.9%</td>
<td style="text-align: center">85.9%</td>
<td style="text-align: center">75.0%</td>
</tr>
</tbody>
</table>
<p>On the numerals and the vowels, we achieve an accuracy of 99.7% and 99.5%, improving on the CNN trained from scratch by 1.8 and 6.0 points respectively. The results are even more spectacular with the consonants, where we improve our accuracy from 85.9% up to 94.9% (+9.0 points).</p>
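<p>The gains can be recomputed directly from the table. The short sketch below does so; the dictionary keys are names chosen for this illustration, and the accuracies are simply copied from the table above.</p>

```python
# Test accuracies (in %) copied from the results table
results = {
    'Numerals':   {'svc_bottleneck': 99.7, 'cnn_scratch': 97.9, 'svc_pca': 96.9},
    'Vowels':     {'svc_bottleneck': 99.5, 'cnn_scratch': 93.5, 'svc_pca': 87.9},
    'Consonants': {'svc_bottleneck': 94.9, 'cnn_scratch': 85.9, 'svc_pca': 75.0},
}


def gain(row, baseline):
    """Improvement, in accuracy points, of the SVC over bottleneck features."""
    return round(row['svc_bottleneck'] - row[baseline], 1)


for name, row in results.items():
    print("{}: +{} vs CNN from scratch, +{} vs SVC with PCA".format(
        name, gain(row, 'cnn_scratch'), gain(row, 'svc_pca')))
```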
<h2 id="neural-networks-as-features-extractors">Neural networks as features extractors</h2>
<p>The fact that a trained neural network can be used as a features extractor is very useful. For image recognition, a popular technique consists of using a pre-trained CNN (such as <a href="https://towardsdatascience.com/neural-network-architectures-156e5bad51ba">Inception or VGG</a>) to extract high-level features that can be fed to other machine learning algorithms. By doing so, I save myself the struggle of training a CNN and improve my accuracy, because these CNNs were trained on a database far bigger than mine.</p>
<p>To visualize this phenomenon, I trained another CNN, with a dense layer containing only two neurons somewhere in the middle. If we remove all the layers after this one, the output of this CNN will be a vector of size two representing the input image in the plane. Please note that the final accuracy of this CNN is far inferior: it is generally a bad idea to put such a bottleneck on the information flowing in a neural network.</p>
<p>The features discovered by this CNN are displayed below (one colour per class, logarithmic transformations applied):</p>
<p><img src="http://www.ml-hack.com/assets/images/devanagari/Features2d_all.jpeg" alt="Features 2d" /></p>
<p>Or, for clarity, if we keep only the vowels:</p>
<p><img src="http://www.ml-hack.com/assets/images/devanagari/Features2d_vowels.jpeg" alt="Features 2d vowels" /></p>
<p>Here, we can see that the features extracted from the images are grouped by classes: this is why it is so efficient to use them as inputs to another classifier.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We tested several ways to classify Devanagari characters. The first one was a Support Vector Classifier trained over the first 24 axes of a PCA. We then improved our accuracy by switching to a Convolutional Neural Network, trained only on the relevant dataset (consonants, vowels or numerals). Finally, we trained another CNN on the whole dataset, using data-augmentation, to provide a powerful features extractor. We then trained one specialized SVC for each type of character over the high-level features provided by this CNN. <strong>With this technique, we achieved an accuracy far superior to the other methods (99.7% for the numerals, 99.5% for the vowels and 94.9% for the consonants).</strong></p>
<p>That’s the end of the series! Thanks for your attention, and I promise no more Devanagari characters here ;)</p>Pierre ForetPierre_foret@berkeley.eduHow to specialize a convolutional neural network, by replacing the last layers with a SVC.Characters Recognition With Keras2017-03-20T00:00:00+00:002017-03-20T00:00:00+00:00http://www.ml-hack.com/characters-recognition-with-keras<h1 id="classifying-hand-written-characters-with-keras">Classifying hand written characters with Keras</h1>
<p>In <a href="/PCA-for-image-classification/">this article</a>, we saw how to apply principal component analysis to image recognition. Our performance was quite good, but clearly not state-of-the-art. Today, we are going to see how we can improve our accuracy using a convolutional neural network (CNN). The best results will be obtained by combining CNNs and support vector machines. This article is only meant as an introduction to CNNs and <code class="highlighter-rouge">Keras</code>, so feel free to jump to the <a href="/neural-networks-specialization/">last article of the series</a> if you are already familiar with this framework.</p>
<h2 id="simples-cnn-beats-our-pca-svc-approach">A simple CNN beats our PCA-SVC approach</h2>
<p>As one could guess, a simple CNN is enough to improve the results obtained in the previous post. We explain here how to build a CNN using <code class="highlighter-rouge">Keras</code> (TensorFlow backend).</p>
<p>Several categories of neural networks are available on Keras, such as recurrent neural networks (RNN) or graph models. <strong>We will only use sequential models</strong>, which are constructed by stacking several neural layers.</p>
<h3 id="choosing-layers">Choosing layers</h3>
<p>We have several types of layers that we can stack in our model, including:</p>
<ul>
<li><strong>Dense Layers</strong>: The simplest layers, where all the weights are independent and the layer is fully connected to the previous and following ones. These layers work well at the top of the network, analysing the high-level features uncovered by the lower ones. However, they tend to add a lot of parameters to our model and make it longer to train.</li>
<li><strong>Convolutional layers</strong>: The layers from which the CNN takes its name. Convolutional layers work like small filters (often 3 or 4 pixels wide) that slide over the image (or the previous layer) and are activated when they find a particular pattern (such as straight lines or angles). Convolutional layers can be composed of numerous filters that will learn to uncover different patterns. They offer translation invariance to our model, which is very useful for image classification. In addition to this, they have a reasonable number of weights (usually much fewer than dense layers) and make the model faster to train compared to dense layers.</li>
<li><strong>Pooling layers</strong>: Pooling layers are useful when used with convolutional layers. They return the maximum activation of the neurons they take as input. Because of this, they allow us to easily reduce the output dimension of the convolutional layers.</li>
<li><strong>Dropout layers</strong>: These layers are very different from the previous ones, as they only serve for the training and not the final model. Dropout layers will randomly “disconnect” neurons from the previous layer during training. Doing so is a regularisation technique that efficiently reduces overfitting (more details below).</li>
</ul>
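<p>To make the convolutional and pooling layers more concrete, here is a minimal pure-Python sketch of what a single filter and a max-pooling step compute. The functions <code class="highlighter-rouge">conv2d_valid</code> and <code class="highlighter-rouge">max_pool</code> are written for this illustration only (they are not Keras functions), and the filter weights are chosen by hand, whereas a trained layer learns its own.</p>

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool(feature_map, size=2):
    """Keep the largest activation of each non-overlapping size x size window."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


# A 6x6 image containing a vertical line in column 2
image = [[1 if col == 2 else 0 for col in range(6)] for _ in range(6)]

# A hand-written 3x3 filter that responds to vertical lines
vertical = [[-1, 2, -1],
            [-1, 2, -1],
            [-1, 2, -1]]

features = conv2d_valid(image, vertical)  # 4x4 response map
pooled = max_pool(features)               # 2x2 map after 2x2 max-pooling

for row in features:
    print(row)
print(pooled)
```

<p>The filter responds most strongly where it is centred on the line, and the pooling step keeps that strong response while dividing each spatial dimension by two.</p>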
<h3 id="compiling-the-model">Compiling the model</h3>
<p>Once our model is built, we need to compile it before training. Compilation is done by specifying a loss, here the <strong>categorical cross-entropy</strong>, a metric (<strong>accuracy</strong> here) and an optimization method.
The loss is the objective function that the optimization method will minimize. Cross-entropy is a very popular choice for classification problems because it is differentiable, and reducing the cross-entropy generally leads to better accuracy. Choosing accuracy as our performance metric is fair only because our classes are well balanced in our datasets. I cannot emphasize enough how much accuracy would be a poor choice if our classes were imbalanced (more of some characters than others).</p>
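<p>As an illustration of why cross-entropy is a more informative training signal than accuracy, the sketch below computes it by hand for two predictions that pick the same class. The helper functions and the example scores are made up for this illustration; Keras computes the loss internally when we pass <code class="highlighter-rouge">'categorical_crossentropy'</code>.</p>

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """y_true is a one-hot vector, y_prob a vector of predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


y_true = [0, 1, 0]                    # the image belongs to the second class
confident = softmax([0.0, 4.0, 0.0])  # made-up scores: clearly class 2
hesitant = softmax([1.0, 1.2, 1.0])   # made-up scores: barely class 2

loss_confident = categorical_cross_entropy(y_true, confident)
loss_hesitant = categorical_cross_entropy(y_true, hesitant)
print(loss_confident, loss_hesitant)
```

<p>Both predictions are counted as correct by the accuracy metric, yet the hesitant one has a much higher loss, so the optimizer still has a gradient to follow.</p>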
<p>Finally, we use <strong>root mean square propagation (RMSprop)</strong> as an optimization method. This method is a variant of classic gradient descent that adapts the learning rate for each weight. It still exposes a global <strong>learning rate</strong> that we can tune: generally speaking, a smaller learning rate leads to better final results, even if the number of epochs needed for training increases. This optimizer generally works well, and changing it has very little effect on performance.</p>
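<p>For readers who want to see the per-weight adaptation, here is a minimal sketch of the usual textbook RMSprop update applied to a single weight. The function name and the toy objective are assumptions made for this example, and the exact placement of <code class="highlighter-rouge">epsilon</code> may differ between Keras versions.</p>

```python
def rmsprop_step(w, grad, cache, lr=0.0005, rho=0.9, eps=1e-8):
    """One RMSprop update for a single weight (textbook formulation)."""
    # Running average of the squared gradients seen so far
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # The step is divided by the root of that average: weights with large
    # gradients take smaller steps, weights with small gradients larger ones
    w = w - lr * grad / (cache ** 0.5 + eps)
    return w, cache


# Toy objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, cache = 0.0, 0.0
first_w, _cache = rmsprop_step(w, 2.0 * (w - 3.0), cache, lr=0.01)

for _ in range(2000):
    grad = 2.0 * (w - 3.0)
    w, cache = rmsprop_step(w, grad, cache, lr=0.01)

print(first_w)  # the very first step has size lr / sqrt(1 - rho)
print(w)        # ends up close to the minimum at 3
```

<p>Note that the first step does not depend on the scale of the gradient, which is what "adapting the learning rate for each weight" means in practice.</p>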
<p>With all these tools, we define a first model for the consonants dataset (just assume we do the same for the numerals and the vowels). This model is meant to be trained from scratch <strong>without transfer learning or data-augmentation</strong>, in order to allow us to quantify the improvements brought by these techniques in another article.</p>
<h3 id="a-model-to-train-from-scratch">A model to train from scratch</h3>
<p>Now the fun part: we stack layers like pancakes, hoping we don’t do something stupid. If you follow this basic reasoning, nothing should go wrong:</p>
<ul>
<li>We start with a <strong>two-dimensional convolutional layer</strong> (our images only have one channel, as we work with grayscale images). We specify the number of filters we want for this layer. 32 seems like a good compromise between complexity and performance. Putting <strong>32 filters</strong> in this layer means that this layer will be able to identify up to 32 different patterns. It is worth noting that raising this number to 64 doesn’t improve the overall performance, but also doesn’t make the model notably harder to train. We specify a kernel size: 3 pixels by 3 pixels seems like a suitable size, as it is enough to uncover simple patterns like straight lines or angles, but not too big given the size of our inputs (only 36x36 pixels, the input shape). Finally, we specify an activation function for this layer. We will use <strong>rectified linear units (ReLU)</strong>, as they efficiently tackle the issue of the <a href="https://machinelearningmastery.com/exploding-gradients-in-neural-networks/">vanishing gradient</a>.</li>
<li>We then add another <strong>convolutional layer</strong>, to uncover more complicated patterns, this time with <strong>64 filters</strong> (as we expect that more complicated patterns than simple patterns will emerge from our dataset). We keep the same kernel size and the same activation function.</li>
<li>After that, we add a <strong>max-pooling layer</strong> to reduce the dimensionality of our inputs. The pooling layer has no weights or activation function, and will output the biggest value found in its kernel. We choose a <strong>kernel size of 2 by 2</strong>, to lose as little information as possible while reducing the dimension.</li>
<li>After that pooling layer, we add a first <strong>dense layer with 256 nodes</strong> to analyze the patterns uncovered by the convolutional layers. Being fully connected to the previous layer and the following dense one, the size of this layer will have a huge impact on the total number of trainable parameters of our model. Because of that, we try to keep this layer reasonably small, while keeping it large enough to fit the complexity of our dataset. Because our images are not really complex, we choose a size of 256 nodes for this layer. We add a ReLU activation function, as we did in the previous layers.</li>
<li>Finally, we add the <strong>final dense layer</strong>, with <strong>one node for each class</strong> (36 for the consonant dataset). Each node of this layer should output a probability for our image to belong to one of the classes. Therefore, we want our activation function to return values between 0 and 1, and thus choose a <strong>softmax activation function</strong> instead of a ReLU as before.</li>
</ul>
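<p>To see why the first dense layer dominates the size of the model, we can count the trainable parameters by hand. The sketch below assumes the Keras defaults for <code class="highlighter-rouge">Conv2D</code> (no padding, stride 1) and one bias per filter or unit; the helper names are made up for this illustration.</p>

```python
def conv_output(size, kernel):
    """Side length after a convolution without padding and with stride 1."""
    return size - kernel + 1


def conv_params(kernel, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(n_inputs, n_units):
    """One weight per input and per unit, plus one bias per unit."""
    return (n_inputs + 1) * n_units


side, channels = 36, 1                     # 36 x 36 grayscale input
params = {}

params['conv_1'] = conv_params(3, channels, 32)
side, channels = conv_output(side, 3), 32  # 34 x 34 x 32

params['conv_2'] = conv_params(3, channels, 64)
side, channels = conv_output(side, 3), 64  # 32 x 32 x 64

side = side // 2                           # 2x2 max-pooling: 16 x 16 x 64
flat = side * side * channels              # size of the flattened vector

params['dense_1'] = dense_params(flat, 256)
params['dense_2'] = dense_params(256, 36)

for layer, count in params.items():
    print(layer, count)
print('total', sum(params.values()))
```

<p>Under these assumptions, the 256-node dense layer holds almost all of the weights, which is why its size has to be chosen carefully.</p>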
<h3 id="overfitting">Overfitting</h3>
<p>Because of their complexity and their large number of weights, <strong>neural networks are very prone</strong> to overfitting. Overfitting can be observed when the accuracy on the training set is really high, but the accuracy on the validation set is much poorer. This phenomenon occurs when the model has learnt “by heart” the training observations but is no longer capable of generalizing its predictions to new observations. As a result, we should stop the training of our model when the accuracy on the validation set is no longer improving. Keras allows us to do that easily by saving the weights after each epoch only if the validation loss has decreased.</p>
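<p>The "save only when it improves" logic can be sketched in a few lines of plain Python. The function below is a hypothetical stand-in for what a checkpoint callback does, and the validation losses are made-up numbers showing a typical overfitting curve.</p>

```python
def best_checkpoint(val_losses):
    """Return the best epoch, its loss, and the epochs at which weights would be saved."""
    best_epoch, best_loss = None, float('inf')
    saved_epochs = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:            # the validation loss improved
            best_epoch, best_loss = epoch, loss
            saved_epochs.append(epoch)  # overwrite the saved weights
    return best_epoch, best_loss, saved_epochs


# Made-up validation losses: improvement, then overfitting from epoch 5 onwards
val_losses = [1.90, 1.20, 0.80, 0.65, 0.70, 0.66, 0.72, 0.81]

best_epoch, best_loss, saved_epochs = best_checkpoint(val_losses)
print(best_epoch, best_loss)
print(saved_epochs)
```

<p>Training may continue after the best epoch, but the weights kept on disk are those of the epoch with the lowest validation loss.</p>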
<p>However, if our model overfits too quickly, this method will stop the training too soon and the model will yield very poor results on the validation and testing sets. To counter that, we will use a <strong>regularisation method</strong>, preventing overfitting while allowing our model to perform enough iterations during the learning phase to be efficient.</p>
<p>The method we will use relies on <strong>dropout layers</strong>. Dropout layers are layers that will randomly “disconnect” neurons from the previous layer, meaning their activation for this training iteration will be null. By disconnecting different neurons randomly, we prevent the neural network from building overly specific structures that are only useful for learning the training observations and not the “concept” behind them.</p>
<p>To apply this method, <strong>we insert two dropout layers in our model</strong>, before each dense layer. Dropout layers require only one parameter: the probability of a neuron being disconnected during a training iteration. These parameters should be adjusted by trial and error, by monitoring the accuracy on the training and validation sets during training. We found that <strong>25% for the first dropout layer and 80% for the second</strong> gives the best results.</p>
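<p>Here is a minimal sketch of what a dropout layer does to a vector of activations, using the two rates mentioned above. The function is written for this illustration and is not the Keras implementation; it follows the "inverted dropout" convention, in which the surviving activations are scaled up by 1 / (1 - rate) during training.</p>

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` during training."""
    if not training:
        return list(activations)  # the layer does nothing at prediction time
    keep = 1.0 - rate
    # Survivors are scaled by 1 / keep so the expected activation is unchanged
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
hidden = [1.0] * 1000

after_first = dropout(hidden, 0.25, rng)   # rate of the first dropout layer
after_second = dropout(hidden, 0.8, rng)   # rate of the second dropout layer

print(sum(v == 0.0 for v in after_first) / len(hidden))   # close to 0.25
print(sum(v == 0.0 for v in after_second) / len(hidden))  # close to 0.80
```

<p>With a rate of 80%, only about one neuron in five takes part in each training step, which is a very strong regularisation.</p>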
<h3 id="implementation">Implementation</h3>
<p>We use <code class="highlighter-rouge">Keras</code> with a TensorFlow backend to implement our model:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Conv2D</span><span class="p">,</span> <span class="n">MaxPooling2D</span>
<span class="kn">from</span> <span class="nn">keras.optimizers</span> <span class="kn">import</span> <span class="n">RMSprop</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span>
<span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">36</span><span class="p">,</span><span class="mi">36</span><span class="p">,</span><span class="mi">1</span><span class="p">)))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">MaxPooling2D</span><span class="p">(</span><span class="n">pool_size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.25</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Flatten</span><span class="p">())</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.8</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">36</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
<span class="n">opt</span> <span class="o">=</span> <span class="n">RMSprop</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">0.0005</span><span class="p">,</span> <span class="n">rho</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">epsilon</span><span class="o">=</span><span class="mf">1e-08</span><span class="p">,</span> <span class="n">decay</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">opt</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span></code></pre></figure>
<p>Also, we will implement a <code class="highlighter-rouge">get_score</code> function that will take as inputs the following:</p>
<ul>
<li><strong><em>tensors</em></strong>: A whole dataset as a tensor</li>
<li><strong><em>labels</em></strong>: The corresponding labels</li>
<li><strong><em>model</em></strong>: The untrained Keras model for which we want to compute the accuracy</li>
<li><strong><em>epoch</em></strong>: An integer specifying the number of epochs for training</li>
<li><strong><em>batch_size</em></strong>: An integer, the size of a batch for learning (the greater the better, if the memory allows it)</li>
<li><strong><em>name</em></strong>: The name of the model (to save the weights)</li>
<li><strong><em>verbose</em></strong>: An optional boolean (default is false) that determines if we should tell Keras to display information during the training (useful for experimentation).</li>
</ul>
<p>The function will:</p>
<ul>
<li><strong>Perform one-hot encoding</strong> on the labels, so they can be understood by the model.</li>
<li><strong>Split our dataset</strong> into training, validation and testing sets as detailed above.</li>
<li>Create a checkpointer which allows us to <strong>save the weights during training</strong> (only if the validation loss is still improving).</li>
<li><strong>Fit the model on the training set</strong> and <strong>monitor its performance on the validation set</strong> (to know when to save weights).</li>
<li>Compute and print the accuracy on the testing set.</li>
<li>Return the trained model with the best weights available.</li>
</ul>
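<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>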
<span class="kn">from</span> <span class="nn">keras.callbacks</span> <span class="kn">import</span> <span class="n">ModelCheckpoint</span>
<span class="kn">from</span> <span class="nn">keras.utils</span> <span class="kn">import</span> <span class="n">np_utils</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span>
<span class="k">def</span> <span class="nf">get_score</span><span class="p">(</span><span class="n">tensors</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">epoch</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
    <span class="n">nb_labels</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">labels</span><span class="p">))</span> <span class="c"># Get the number of distinct labels in the dataset</span>
    <span class="c"># Encode the labels (integers) into one-hot vectors</span>
    <span class="n">y_all</span> <span class="o">=</span> <span class="n">np_utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">([</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)],</span> <span class="n">nb_labels</span><span class="p">)</span>
    <span class="c"># Split the testing set from the whole set, with stratification</span>
    <span class="n">X_model</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_model</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">tensors</span><span class="p">,</span> <span class="n">y_all</span><span class="p">,</span>
        <span class="n">test_size</span><span class="o">=</span><span class="mf">0.15</span><span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">y_all</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="c"># Then split the remaining set into a training and a validation set</span>
    <span class="c"># We use a test size of 17.6% because our remaining set accounts for only 85% of the whole set</span>
    <span class="n">X_train</span><span class="p">,</span> <span class="n">X_val</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_val</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_model</span><span class="p">,</span> <span class="n">y_model</span><span class="p">,</span>
        <span class="n">test_size</span><span class="o">=</span><span class="mf">0.176</span><span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">y_model</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="c"># Display the sizes of the three sets</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Size of the training set: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X_train</span><span class="p">)))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Size of the validation set: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X_val</span><span class="p">)))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Size of the testing set: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X_test</span><span class="p">)))</span>
    <span class="c"># Create a checkpointer to save the weights when the validation loss decreases</span>
    <span class="n">checkpointer</span> <span class="o">=</span> <span class="n">ModelCheckpoint</span><span class="p">(</span><span class="n">filepath</span><span class="o">=</span><span class="s">'saved_models/weights.best.{}.hdf5'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">name</span><span class="p">),</span>
        <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">save_best_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="c"># Fit the model, using 'verbose'=1 if we specified 'verbose=True' when calling the function (0 otherwise)</span>
    <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">validation_data</span><span class="o">=</span><span class="p">(</span><span class="n">X_val</span><span class="p">,</span> <span class="n">y_val</span><span class="p">),</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">checkpointer</span><span class="p">],</span>
        <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">epoch</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span> <span class="k">if</span> <span class="n">verbose</span> <span class="k">else</span> <span class="mi">0</span><span class="p">))</span>
    <span class="c"># Reload best weights before prediction, and predict</span>
    <span class="n">model</span><span class="o">.</span><span class="n">load_weights</span><span class="p">(</span><span class="s">'saved_models/weights.best.{}.hdf5'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">name</span><span class="p">))</span>
    <span class="n">y_pred</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
    <span class="c"># Compute and print the accuracy</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Accuracy on test set: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
        <span class="n">accuracy_score</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y_pred</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))))</span>
    <span class="k">return</span> <span class="n">model</span> <span class="c"># And return the trained model</span></code></pre></figure>
<p>We can now train our model and get our score:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">get_score</span><span class="p">(</span><span class="n">tensors_consonants</span><span class="p">,</span> <span class="n">consonants_labels</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">epoch</span><span class="o">=</span><span class="mi">180</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">800</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s">'consonants_from_scratch'</span><span class="p">)</span></code></pre></figure>
<p>Reporting the results obtained for the three datasets, we see that the CNN improves on the SVC method on every one of them.</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center">CNN from scratch</th>
<th style="text-align: center">SVC with PCA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Numerals</td>
<td style="text-align: center">97.9%</td>
<td style="text-align: center">96.9%</td>
</tr>
<tr>
<td>Vowels</td>
<td style="text-align: center">93.5%</td>
<td style="text-align: center">87.9%</td>
</tr>
<tr>
<td>Consonants</td>
<td style="text-align: center">85.9%</td>
<td style="text-align: center">75.0%</td>
</tr>
</tbody>
</table>
<p>Ultimately, we can push this accuracy even higher by training our models on bigger datasets. But how can we get more training images when all we have is the dataset we already used? To answer this question, in the <a href="/neural-networks-specialization/">last article of the series</a> we will:</p>
<ul>
<li>Use data augmentation.</li>
<li>Train a generic model on the three datasets, before specializing it by replacing the last layers with support vector machines.</li>
</ul>Pierre ForetPierre_foret@berkeley.eduPerforming characters recognition with Keras and TensorFlowPca For Image Classification2017-01-04T00:00:00+00:002017-01-04T00:00:00+00:00http://www.ml-hack.com/PCA-for-image-classification<h1 id="principal-components-analysis-for-image-classification">Principal Components Analysis for image classification</h1>
<p>In image recognition, we generally overlook other techniques that were used before neural networks became standard. These techniques are still worth our time, as they present some advantages:</p>
<ul>
<li>They are usually simpler and faster to implement.</li>
<li>If the database is small, they can outperform deep-learning methods.</li>
<li>When your grandchildren ask how the job was done before quantum deep reinforcement learning was a thing, you will have a great story to tell.</li>
</ul>
<p>For these reasons, I often start addressing an image classification problem without neural networks if possible, in order to get an idea of the “minimum” performance I should get when switching to more powerful algorithms.
Always know the basics. You don’t want to be the guy who does sentiment analysis using a deep pyramid CNN and doesn’t realise a naive Bayes classifier gives better results on his 50MB dataset.</p>
<p>So today, we will see how to recognise hand-written characters using simple machine learning algorithms.</p>
<h2 id="classifying-some-devanagari-characters">Classifying some Devanagari characters</h2>
<p>Devanagari is an alphabet used in India and Nepal, composed of 36 consonants and 12 vowels. We will also add to this the 10 digits used to write numbers. For this classification problem, we will use a <a href="https://www.kaggle.com/ashokpant/devanagari-character-dataset">small database available on Kaggle</a>, composed of approximately 200 images of each class, handwritten by 40 distinct individuals.</p>
<p>Some characters from the dataset are displayed below. The ones on the upper row are all different types of characters, but some of them can be really similar to a novice eye (like mine). On the other hand, the lower row shows some ways to write the same consonant, <em>“chha”</em>.</p>
<p><img src="http://www.ml-hack.com/assets/images/devanagari/characters%20examples.png" alt="Some characters" /></p>
<p>With 58 classes of 200 images each and such intra-class diversity, this problem is non-trivial. Today, we will build a classifier for each dataset of characters (consonants, vowels or numerals) separately. We will see how to achieve an accuracy ranging from 75% (for the consonants) to 97% (for the numerals), using only scikit-learn’s algorithms. In another article, we will see how deep learning can push these results up to 99.7% for the numerals and 94.9% for the consonants.</p>
<h2 id="dealing-with-images">Dealing with images</h2>
<p>We start by loading the images we want to classify, using <code class="highlighter-rouge">PIL</code> (Python Imaging Library). Some demonstration code for that can be found <a href="https://github.com/PForet/Devanagari_recognition/blob/master/load_data.py">here</a> if needed, but let’s assume we already have a list of PIL images, and a list of integers representing their labels:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">consonants_img</span><span class="p">,</span> <span class="n">consonants_labels</span> <span class="o">=</span> <span class="n">load_data</span><span class="o">.</span><span class="n">PIL_list_data</span><span class="p">(</span><span class="s">'consonants'</span><span class="p">)</span></code></pre></figure>
<p>For the sake of exposition, we will display the code only for the consonants dataset. Just assume everything is the same for the two others.
At this point, you might want to rescale all your images to the same dimensions, if it is not already done. Luckily, images from this dataset are all 36 x 36 pixels (thanks to you, kind Kaggle stranger).</p>
<p>We convert our images to greyscale, and take their negative. PIL allows us to do that easily:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">ImageOps</span>

<span class="k">def</span> <span class="nf">pre_process</span><span class="p">(</span><span class="n">img_list</span><span class="p">):</span>
    <span class="c"># Convert to single-channel greyscale ('L'), a mode ImageOps.invert accepts</span>
    <span class="n">img_bw</span> <span class="o">=</span> <span class="p">[</span><span class="n">img</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s">'L'</span><span class="p">)</span> <span class="k">for</span> <span class="n">img</span> <span class="ow">in</span> <span class="n">img_list</span><span class="p">]</span>
    <span class="c"># Take the negative of the converted images (not of the originals)</span>
    <span class="k">return</span> <span class="p">[</span><span class="n">ImageOps</span><span class="o">.</span><span class="n">invert</span><span class="p">(</span><span class="n">img</span><span class="p">)</span> <span class="k">for</span> <span class="n">img</span> <span class="ow">in</span> <span class="n">img_bw</span><span class="p">]</span>

<span class="n">consonants_proc</span> <span class="o">=</span> <span class="n">pre_process</span><span class="p">(</span><span class="n">consonants_img</span><span class="p">)</span></code></pre></figure>
<p>Finally, we must transform our images into vectors. To do so, we turn each image into a matrix of pixel activations (a zero for a black pixel, and a 255 for a white one). We then rescale each element of the matrix by dividing it by the maximum possible value (255), before flattening the matrix:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">vectorize_one_img</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
    <span class="c"># Represent the image as a matrix of pixel weights, and flatten it</span>
    <span class="n">flattened_img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asmatrix</span><span class="p">(</span><span class="n">img</span><span class="p">)</span><span class="o">.</span><span class="n">flatten</span><span class="p">()</span>
    <span class="c"># Rescaling by dividing by the maximum possible value of a pixel</span>
    <span class="n">flattened_img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">flattened_img</span><span class="p">,</span><span class="mf">255.0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">flattened_img</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span></code></pre></figure>
<p>And we apply this transformation to all the images of our dataset:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">to_vectors</span><span class="p">(</span><span class="n">img_list</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">[</span><span class="n">vectorize_one_img</span><span class="p">(</span><span class="n">img</span><span class="p">)</span> <span class="k">for</span> <span class="n">img</span> <span class="ow">in</span> <span class="n">img_list</span><span class="p">]</span>

<span class="n">consonants_inputs</span> <span class="o">=</span> <span class="n">to_vectors</span><span class="p">(</span><span class="n">consonants_proc</span><span class="p">)</span>
</code></pre></div></div>
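<p>To check what these steps produce, here is a minimal sketch that runs them on a single synthetic image. The image itself (a black vertical stroke on a white 36 x 36 canvas) is a made-up stand-in for a dataset sample, and the sketch uses <code class="highlighter-rouge">np.asarray</code> directly rather than the helper functions above:</p>

```python
import numpy as np
from PIL import Image, ImageOps

# Hypothetical stand-in for one dataset image: a white 36 x 36 canvas
# with a black vertical stroke (dark ink on light paper).
canvas = np.full((36, 36), 255, dtype=np.uint8)
canvas[4:32, 17:19] = 0
img = Image.fromarray(canvas).convert('RGB')

# Single-channel greyscale, then negative: the ink becomes the "active" pixels
negative = ImageOps.invert(img.convert('L'))

# One value per pixel, rescaled to the [0, 1] range
vector = np.asarray(negative, dtype=np.float64).flatten() / 255.0

print(vector.shape)               # (1296,)
print(vector.min(), vector.max())
print(int(vector.sum()))          # number of ink pixels: 28 * 2 = 56
```

<p>After the negative, the background is 0 and the ink is 1, so the non-zero entries of the vector are exactly the pixels touched by the pen.</p>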
<p>Cheers! The tedious part of pre-processing the images is over now.</p>
<h2 id="import-sklearn">Import sklearn</h2>
<p>Or as I call it, the poor man’s <code class="highlighter-rouge">import keras</code>. Just a few lines of code and we will be done classifying our images. Once satisfied, we will try to understand what exactly happened.</p>
<h3 id="choosing-the-best-model">Choosing the best model</h3>
<p>Here, we choose to use a support vector machine classifier (SVC) on the reduced features returned by a principal component analysis (PCA, we will get back to that later). The SVC is well adapted when we have few samples (these things quickly become painfully slow as the number of samples grows).</p>
<p>These classifiers have a lot of hyperparameters, but here we will only tune C and gamma. We use a Gaussian (RBF) kernel, which is the default and usually works very well. We thus define a simple function that takes a vector of inputs and a vector of labels as arguments, tests several sets of parameters, and returns the best SVC found. To do that, we use scikit-learn’s <code class="highlighter-rouge">GridSearchCV</code>, which tests all the combinations of parameters from a dictionary, estimates the accuracy of each with K-fold cross-validation, and returns the best model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>

<span class="k">def</span> <span class="nf">best_SVC</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
    <span class="c"># Initiate a SVC classifier with default parameters</span>
    <span class="n">svc_model</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">()</span>
    <span class="c"># The values to test for the C and gamma parameters.</span>
    <span class="n">param_dic</span> <span class="o">=</span> <span class="p">{</span><span class="s">'C'</span><span class="p">:[</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">100</span><span class="p">],</span>
                 <span class="s">'gamma'</span><span class="p">:[</span><span class="mf">0.001</span><span class="p">,</span><span class="mf">0.005</span><span class="p">,</span><span class="mf">0.01</span><span class="p">]}</span>
    <span class="n">clf</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">svc_model</span><span class="p">,</span> <span class="n">param_dic</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="c"># Search for the best set of parameters for our dataset, using bruteforce</span>
    <span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Best parameters: "</span><span class="p">,</span> <span class="n">clf</span><span class="o">.</span><span class="n">best_params_</span><span class="p">)</span>
    <span class="c"># We return the best model found</span>
    <span class="k">return</span> <span class="n">clf</span><span class="o">.</span><span class="n">best_estimator_</span>
</code></pre></div></div>
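<p>If you want to see what the grid search actually evaluated, the fitted search object keeps the cross-validated scores. The sketch below is only an illustration: it uses scikit-learn’s bundled 8 x 8 digits as stand-in data (an assumption, not the Devanagari set) and sets the number of folds explicitly with <code class="highlighter-rouge">cv=5</code>:</p>

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data (an assumption): scikit-learn's bundled 8 x 8 digits,
# rescaled to [0, 1]; only the first 600 samples to keep the search fast.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo[:600] / 16.0
y_demo = y_demo[:600]

param_dic = {'C': [1, 10, 100], 'gamma': [0.001, 0.005, 0.01]}
search = GridSearchCV(SVC(kernel='rbf'), param_dic, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))        # mean accuracy over the 5 folds
print(len(search.cv_results_['params']))   # 3 x 3 = 9 combinations tested
```

<p>The score reported here is a cross-validation score computed on the data passed to <code class="highlighter-rouge">fit</code> only, which is why a separate test set is still needed for the final evaluation.</p>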
<h3 id="splitting-the-dataset">Splitting the dataset</h3>
<p>As usual, we split our dataset into a training set to train the model on, and a testing set to evaluate its results. We use scikit-learn’s straightforward <code class="highlighter-rouge">train_test_split</code> function, and keep its default training/testing ratio of 0.75/0.25. Don’t use the testing set to tune C and gamma, that’s cheating.</p>
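<p>As a small sketch of what the split does, the example below runs it on random stand-in data (the arrays are made up for illustration). It passes <code class="highlighter-rouge">test_size=0.25</code>, which is the value scikit-learn uses when neither size is given, and adds two optional arguments that are not used in this article: <code class="highlighter-rouge">stratify</code> to keep the class proportions and <code class="highlighter-rouge">random_state</code> to make the split reproducible:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 samples of 1296 features, 4 balanced classes
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 1296)
y_demo = np.repeat(np.arange(4), 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

print(X_train.shape, X_test.shape)   # (150, 1296) (50, 1296)
print(np.bincount(y_test))           # test samples per class
```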
<h3 id="using-a-pca">Using a PCA</h3>
<p>Our input space is large: with one dimension for each pixel of the picture, we have 36 x 36 = 1296 features per observation. We choose to use a PCA to reduce this number of dimensions to 24. That’s where the magic happens.</p>
<h3 id="computing-the-result">Computing the result</h3>
<p>Our pipeline is very simple: given a list of inputs and a list of labels:</p>
<ul>
<li>We split the lists to obtain a training set and a testing set.</li>
<li>We find the axes that maximise variance on the training set (using <code class="highlighter-rouge">pca.fit</code>).</li>
<li>We project the training and testing points on these axes (using <code class="highlighter-rouge">pca.transform</code>).</li>
<li>We find the SVC model that maximises the accuracy and fit it on the training set.</li>
<li>We compute the accuracy of this model on the test set and return it.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span>

<span class="k">def</span> <span class="nf">benchmark</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
    <span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
    <span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="n">n_components</span> <span class="o">=</span> <span class="mi">24</span><span class="p">)</span>
    <span class="n">pca</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
    <span class="n">reduced_X_train</span><span class="p">,</span> <span class="n">reduced_X_test</span> <span class="o">=</span> <span class="n">pca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">),</span> <span class="n">pca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
    <span class="n">best_model</span> <span class="o">=</span> <span class="n">best_SVC</span><span class="p">(</span><span class="n">reduced_X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
    <span class="n">predictions</span> <span class="o">=</span> <span class="n">best_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">reduced_X_test</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)</span>
</code></pre></div></div>
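<p>The same steps can also be chained with scikit-learn’s <code class="highlighter-rouge">Pipeline</code>. This is an alternative sketch, not the code used to produce the results of this article: it runs on the bundled 8 x 8 digits as stand-in data, and the step names <code class="highlighter-rouge">pca</code> and <code class="highlighter-rouge">svc</code> are arbitrary choices. One difference is that the PCA is then refitted inside each cross-validation fold, instead of once on the whole training set:</p>

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data (an assumption): bundled 8 x 8 digits, rescaled to [0, 1]
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=0)

# PCA followed by an SVC; grid parameters are addressed as step__parameter
pipe = Pipeline([('pca', PCA(n_components=24)), ('svc', SVC())])
param_dic = {'svc__C': [1, 10, 100], 'svc__gamma': [0.001, 0.005, 0.01]}

search = GridSearchCV(pipe, param_dic, cv=3)
search.fit(X_train, y_train)

# The best pipeline is refitted on the full training set before scoring
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```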
<p>And we run this function on our three sets:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">score_on_numerals</span> <span class="o">=</span> <span class="n">benchmark</span><span class="p">(</span><span class="n">numerals_inputs</span><span class="p">,</span> <span class="n">numerals_labels</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Best accuracy on numerals: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">score_on_numerals</span><span class="p">))</span>
<span class="n">score_on_vowels</span> <span class="o">=</span> <span class="n">benchmark</span><span class="p">(</span><span class="n">vowels_inputs</span><span class="p">,</span> <span class="n">vowels_labels</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Best accuracy on vowels: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">score_on_vowels</span><span class="p">))</span>
<span class="n">score_on_consonants</span> <span class="o">=</span> <span class="n">benchmark</span><span class="p">(</span><span class="n">consonants_inputs</span><span class="p">,</span> <span class="n">consonants_labels</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Best accuracy on consonants: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">score_on_consonants</span><span class="p">))</span>
</code></pre></div></div>
<p>Which should give something along these lines:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ('Best parameters: ', {'C': 10, 'gamma': 0.01})
Best accuracy on numerals: 0.972222222222
('Best parameters: ', {'C': 10, 'gamma': 0.01})
Best accuracy on vowels: 0.906485671192
('Best parameters: ', {'C': 10, 'gamma': 0.005})
Best accuracy on consonants: 0.745257452575
</code></pre></div></div>
<p>Here we got the promised 97% accuracy on the numerals. That was easy. Remember, good code is like a good dentist: quick, and without unnecessary agonising pain. But now that we made this work, maybe it’s time to understand what this PCA thing did to our images…</p>
<h2 id="pca-mon-amour">PCA, mon amour</h2>
<h3 id="the-basic-idea">The basic idea</h3>
<p>From here, we have two ways of explaining things:</p>
<ul>
<li>With linear algebra, spectral decomposition, singular values and covariance matrices.</li>
<li>With some pictures.</li>
</ul>
<p>If you want to explore the mathematical side of this (and you should, as it is not so difficult and PCA is fundamental in statistics), you will find plenty of good resources online. I like <a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf">this one</a> which is complete and introduces all the algebra tools needed.</p>
<p>However, if you can still find your inner child, you’ll follow me through the picture book explanation!</p>
<p>Let’s start with a small set of amazing pictures that could easily belong in a MoMA collection:</p>
<p><img src="http://www.ml-hack.com/assets/images/devanagari/demo_pca.png" alt="Some characters" /></p>
<p>Those are 5 by 5 unicolour pictures, so they could be represented by a vector of 25 dimensions. However, this is a bit too much, as these images seem to have some repeated patterns…</p>
<p><img src="http://www.ml-hack.com/assets/images/devanagari/pca_toy.png" alt="And their components" /></p>
<p>Indeed, all eight pictures can be represented by adding some of these patterns. That definitely gives an advantage for classifying them: we need only a vector of size 4 to encode these pictures, and not 25 as before. Of course, a vector of size 25 is easily dealt with by most machine learning algorithms, but remember we used the same technique to reduce the dimension of our Devanagari characters from 1296 to 24.
But the main advantage lies elsewhere. Remember that we flatten our image into a vector where each dimension represents a pixel. Considering each pixel as a dimension has obvious drawbacks: translating a character by one pixel leads to a very different point in the input vector space and, unlike CNNs, most general machine learning algorithms don’t take into account the relative positions of the pixels.</p>
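<p>The toy example can be reproduced in a few lines. In this sketch, the four building blocks and the eight combinations are invented for illustration (they are not the ones drawn in the figure). It checks that four numbers per picture are enough to rebuild all 25 pixels:</p>

```python
import numpy as np
from sklearn.decomposition import PCA

def blank():
    return np.zeros((5, 5))

# Four hypothetical 5 x 5 building blocks (not the ones drawn in the figure)
top, bottom, left, centre = blank(), blank(), blank(), blank()
top[0, :] = 1
bottom[4, :] = 1
left[1:4, 0] = 1
centre[2, 2] = 1
patterns = np.stack([p.flatten() for p in (top, bottom, left, centre)])

# Eight pictures, each one the sum of some of the building blocks
recipes = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
                    [1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 1, 1]])
pictures = recipes @ patterns            # shape (8, 25)

pca = PCA(n_components=4)
codes = pca.fit_transform(pictures)      # shape (8, 4)
rebuilt = pca.inverse_transform(codes)   # back to shape (8, 25)

print(codes.shape)
print(np.allclose(rebuilt, pictures))                  # True
print(round(pca.explained_variance_ratio_.sum(), 6))   # 1.0
```

<p>Note that the components found by the PCA are not the building blocks themselves: they are orthogonal mixtures of them, computed after subtracting the mean picture. They span the same four-dimensional space, which is why the reconstruction is exact.</p>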
<p>That’s how principal component analysis will help. It uncovers spatial patterns in our images. In fact, the PCA will “group” together the pixels which are activated simultaneously in our images. Pixels which are close to one another have a good chance of being activated simultaneously: the pen will leave a mark on both of them! But let’s see how it worked out for our handwritten characters.</p>
<h3 id="and-now-the-real-world-application">And now the real world application</h3>
<p>We start by displaying the patterns uncovered by the PCA, for the three datasets:
<img src="http://www.ml-hack.com/assets/images/devanagari/pca_main.png" alt="Nepali components" /></p>
<p>Those patterns are ordered according to the variance they explain. In other words, if a pattern is composed of a lot of pixels that are often activated simultaneously, we say it explains a lot of variance. On the other hand, a very small and uncommon pattern is most likely noise and isn’t really useful.</p>
<p>For instance, if we take a look at the first row, we will see that the most important pattern is an “O” shape, meaning this pattern is often repeated in our images. If we feed the vector returned by the PCA to a machine learning algorithm, it will have access to the information “there is a big ‘O’ shape on the image” just by looking at the first element of this vector. That will surely be useful to learn how to classify “zero”!</p>
<p>But how many patterns should we keep in our vectors? One way to decide is to visualise how many are needed to get a good reconstruction of our original images:</p>
<p><img src="http://www.ml-hack.com/assets/images/devanagari/pca_recomposition.png" alt="PCA recomposition" /></p>
<p>Here, we show the original numerals on the top row, and the images reconstructed using 1, 4, 8, 24 and 48 patterns. We observe that using 24 patterns, we get a pretty good reconstruction of the original. That’s the number we will put in <code class="highlighter-rouge">PCA(n_components = 24)</code>. Another way to find this number would be trial and error (there is a pretty wide range of values that work well), or looking at the proportion of explained variance, if you have a good grasp of how the PCA works.</p>
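<p>For the explained variance route, a minimal sketch could look like this. It again uses the bundled 8 x 8 digits as stand-in data, so the numbers it prints are not those of the Devanagari images, and the 95% threshold is an arbitrary choice for illustration:</p>

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (an assumption): bundled 8 x 8 digits, rescaled to [0, 1]
X_demo, _ = load_digits(return_X_y=True)
X_demo = X_demo / 16.0

# Keep every component (64 here) and accumulate the variance they explain
pca = PCA().fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k in (1, 4, 8, 24, 48):
    print(k, round(cumulative[k - 1], 3))

# Smallest number of components explaining at least 95% of the variance
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_for_95)
```

<p>Plotting <code class="highlighter-rouge">cumulative</code> against the number of components gives a curve that flattens quickly, and the point where it flattens is a reasonable candidate for <code class="highlighter-rouge">n_components</code>.</p>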
<h3 id="thats-pretty-much-it">That’s pretty much it</h3>
<p>I hope you now have some understanding of how the PCA can be applied to image classification. Please keep in mind that PCA is a really powerful tool that can tackle a lot of statistics problems. We have only scratched the surface, and glossed over some important points (the fact that the patterns are uncorrelated, for instance). So if you are not totally familiar with this tool, don’t hesitate to do some research and get some practice!</p>Pierre ForetPierre_foret@berkeley.eduA visual exploration on how to apply principal component analysis to image recognition.