Jekyll2020-06-09T18:19:21+00:00https://statchaitya.github.io/feed.xmlBlog & PortfolioChaitanya Gokhalecgokhale92@gmail.comTime Series - Initial Fundamental Choices2018-09-06T00:00:00+00:002018-09-06T00:00:00+00:00https://statchaitya.github.io/settingupatimeseriesproject<blockquote>
<p>The main point of this post is to highlight considerations we may have overlooked until now when starting a Time Series project, because we have been accustomed to using ready-made, clean TS datasets (as a recent college grad, at least I have been until now).</p>
</blockquote>
<h2 id="-difference-between-business-and-a-local-setup-"><center> Difference Between Business and a Local Setup </center></h2>
<p>When doing Time Series analysis just for fun, or for practice outside a work environment, we are lucky to be handed a ready-made, clean dataset of a univariate time series; the goal then is simply to get as good a prediction as possible. But when starting a Time Series project in a business environment, we have to think about a number of choices that can prove crucial to the eventual outcome of the project. Of the many (choice of metric or metrics, time level, handling irregularity, handling missing values, etc.), I try to cover the first two in this post.</p>
<p>In this post I focus on two fundamental choices we face when starting a time series project, and on their relation to the outcome/business decision, using the example of a retail store chain. Understanding this simple example will hopefully enable us to think about the metric and the time level before we jump into a TS project in a business environment.</p>
<h2 id="-metrics-time-granularity-and-decisions-"><center> Metrics, Time Granularity and Decisions </center></h2>
<blockquote>
<p>Choice of a metric is important and affects the decisions we want to take</p>
</blockquote>
<p>It is important to choose the right metric for a particular goal, because <u>the metric has a direct impact on the goal/business decisions</u>. To understand this, consider the example of a retail store chain. The metric <span style="color:red">number of units of item X sold</span> informs the <strong><span style="color:green">decision of restocking item X</span></strong>, while the metric <span style="color:red">number of visitors entering the store</span> informs the <strong><span style="color:green">decision of how many personnel to assign on a particular day</span></strong>. This is a very simple case, but in general <u>each strategic goal may have one or several metrics associated with it</u>, alongside many other metrics that are not associated with it at all. Hence it is important to keep the potential decisions in mind and start with the appropriate metric or metrics.</p>
<blockquote>
<p>An appropriate time level?</p>
</blockquote>
<p><strong>Short Term</strong></p>
<p>Let's consider the same example of a retail store chain. <u>Short-term forecasts</u> (daily in this case; for something like log-file data, a minute would be short term and an hour medium/long term) <u>are much more accurate and can be used to take decisions that make sense on a daily basis</u>, e.g. keeping an eye on the product shelves, or anticipating a high-sales day and allocating more employees on that day.</p>
<p><strong>Medium Term</strong></p>
<p>For medium-term forecasts (weeks in this case), we can plan to hire new people at stores where sales are forecast to increase and staffing is thin. <u>Recruiting takes weeks, so this is a good example of a decision tied to medium-term forecasts</u>. Spotting products with an increasing trend and planning how to make more room for them is another example of a medium-term decision.</p>
<p><strong>Long Term</strong></p>
<p>If the timeframe of a forecast is long, i.e. if we are forecasting monthly values two years into the future, then these forecasts might help with <u>decisions like increasing the number of stores, enlarging existing store space, etc.</u></p>
<p>Hence, <u>thinking carefully about the time level of a TS and the potential decisions it supports for our goal</u> (given that we have the flexibility to consider multiple time levels) <u>is also the right thing to do.</u></p>
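<p>To make the idea of a time level concrete, here is a small sketch in plain Python (the daily sales numbers and function names are made up for illustration) showing how the same daily series can be re-aggregated to the weekly or monthly level before any forecasting is done:</p>

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily unit sales for item X over four weeks (made-up numbers)
daily_sales = {date(2018, 9, 3) + timedelta(days=i): 100 + (i % 7) * 10
               for i in range(28)}

def aggregate(sales, level):
    """Re-aggregate a daily series to 'weekly' or 'monthly' totals."""
    buckets = defaultdict(int)
    for day, qty in sales.items():
        # ISO (year, week) buckets for weekly, (year, month) for monthly
        key = day.isocalendar()[:2] if level == "weekly" else (day.year, day.month)
        buckets[key] += qty
    return dict(buckets)

weekly = aggregate(daily_sales, "weekly")    # weekly totals: medium-term decisions
monthly = aggregate(daily_sales, "monthly")  # monthly totals: longer-term decisions
```

<p>Forecasting the weekly series then lines up naturally with decisions like recruiting, while the daily series supports day-to-day staffing.</p>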
<blockquote>
<p>One more thing to note: forecasts generally get less accurate the further ahead in time we go.</p>
</blockquote>
<p>So if we ask why we can't simply use daily (short-term) forecasts to take long-term decisions, the answer lies in this general property of forecasts.</p>
<p><strong>We will not always have to think about these things. Most of the time the choices will be trivial, but when you do have the flexibility to weigh different metrics, time levels and decisions in a business environment, this understanding of their inter-relation will hopefully be useful.</strong></p>
<blockquote>
<p><em>Please leave a comment about what you think (Then I will at least know somebody was here). Cheers! ;)</em></p>
</blockquote>Chaitanya Gokhalecgokhale92@gmail.comTime Series Analysis, ForecastingLog Loss Explained2018-04-05T00:00:00+00:002018-04-05T00:00:00+00:00https://statchaitya.github.io/abtesting<h1 id="-intuition-behind-log-loss-using-its-formula-"><center> Intuition behind log loss using its formula </center></h1>
<p><br />
<img src="/images/uncertainty_2.jpg" alt="Uncertainty" /></p>
<p><br />
Log loss is used when we have a <script type="math/tex">{0,1}</script> response. This is usually because with a <script type="math/tex">{0,1}</script> response, the best models give us values in terms of probabilities.
In simple words, log loss measures the <strong>uncertainty</strong> of your model's probabilities by comparing them to the true labels. Let us look closely at its formula and see how it measures this <strong>uncertainty</strong>.</p>
<p>Now the question is: your training labels are <strong>0</strong> and <strong>1</strong>, but your training predictions are values like <strong>0.4, 0.6, 0.89, 0.1122</strong>. So how do we calculate a measure of the error of our model? If we directly classify every observation with a value > 0.5 as 1, we run a high risk of increasing the misclassification, because many observations with probabilities <strong>0.4, 0.45, 0.49</strong> can have a true value of <strong>1</strong>.
This is where <strong>log loss</strong> comes into the picture.</p>
<p>Now let us closely follow the formula of <strong>logloss</strong>.</p>
<p>The formula for <strong>logloss</strong> is :
<script type="math/tex">logLoss = \frac{-1}{N} \sum_{i=1}^{N}(y_{i}(log{p_{i}})+(1- {y_{i}})log(1-p_{i}))</script></p>
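<p>The formula translates directly into code. Here is a minimal sketch in plain Python (the clipping constant <code>eps</code> is a common practical guard against <code>log(0)</code>, not part of the formula above):</p>

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all observations."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

confident_right = log_loss([1, 0], [0.9, 0.1])  # small loss
confident_wrong = log_loss([1, 0], [0.1, 0.9])  # large loss
```

<p>A model that is confident and correct gets a loss near zero; one that is confident and wrong is penalized heavily, which is exactly the case analysis that follows.</p>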
<p>There can be 4 major cases for the values of <script type="math/tex">y_{i}</script> and <script type="math/tex">p_{i}</script></p>
<p>Case 1 : <script type="math/tex">y_{i} = 1</script> , <script type="math/tex">p_{i}</script> = High , <script type="math/tex">1 - y_{i} = 0</script> , <script type="math/tex">1 - p_{i}</script> = Low</p>
<p>Case 2 : <script type="math/tex">y_{i} = 1</script> , <script type="math/tex">p_{i}</script> = Low , <script type="math/tex">1 - y_{i} = 0</script> , <script type="math/tex">1 - p_{i}</script> = High</p>
<p>Case 3 : <script type="math/tex">y_{i} = 0</script> , <script type="math/tex">p_{i}</script> = Low , <script type="math/tex">1 - y_{i} = 1</script> , <script type="math/tex">1 - p_{i}</script> = High</p>
<p>Case 4 : <script type="math/tex">y_{i} = 0</script> , <script type="math/tex">p_{i}</script> = High , <script type="math/tex">1 - y_{i} = 1</script> , <script type="math/tex">1 - p_{i}</script> = Low</p>
<p><em>Case 1</em>:
In this case y = 1 and p = high, which means we have got things right, because the true value of the response agrees with our high probability. Now look closely at the formula. Since <script type="math/tex">1 - y_{i} = 1 - 1 = 0</script>, the second term of the summand vanishes, leaving only <script type="math/tex">y_{i}\log(p_{i}) = \log(p_{i})</script>. Because log is an increasing function (if <script type="math/tex">p_{i} > p_{j}</script> then <script type="math/tex">\log(p_{i}) > \log(p_{j})</script>) and <script type="math/tex">p_{i}</script> is close to 1, <script type="math/tex">\log(p_{i})</script> is close to 0. So occurrences of <em>Case 1</em> keep the sum close to zero, its maximum possible value.</p>
<p><em>Case 2</em>:
In this case y = 1 and p = low. This is a totally undesirable case, because our predicted probability of Y being 1 is low and yet the true value of Y is 1. Looking at the formula again, the second term of the summand vanishes since <script type="math/tex">1 - y_{i} = 0</script>, but now <script type="math/tex">\log(p_{i})</script> is a large negative number. So occurrences of <em>Case 2</em> drag the sum far below zero.</p>
<p>Similarly, occurrences of <em>Case 3</em> contribute terms close to zero (<script type="math/tex">\log(1-p_{i}) \approx 0</script> when <script type="math/tex">p_{i}</script> is low), while occurrences of <em>Case 4</em> drag the sum down, just like <em>Case 2</em>.
Now, coming back to the main question: how does log loss measure the UNCERTAINTY of your model? The answer is simple. If the predictions are mostly <em>Case 1s</em> and <em>Case 3s</em>, the sum inside the log loss formula stays close to zero; if they are mostly <em>Case 2s</em> and <em>Case 4s</em>, it becomes a large negative number.</p>
<p>So this sum is largest (closest to zero) when we have <em>Case 1s</em> and <em>Case 3s</em>, which indicates a good prediction. Multiplying by <script type="math/tex">\frac{-1}{N}</script> turns it into a non-negative average that is smallest for good predictions. Intuitively: the smaller the log loss, the better the model, i.e. the smaller the UNCERTAINTY, the better the model.
This was as simple as I could get.</p>Chaitanya Gokhalecgokhale92@gmail.comMachine Learning, metrics, predictive analyticsAdaboost Explained2018-04-05T00:00:00+00:002018-04-05T00:00:00+00:00https://statchaitya.github.io/adaboostclassifier<h1 id="-detailed-explaination-of-adaboost-"><center> Detailed explanation of AdaBoost </center></h1>
<p><br /></p>
<p>Boosting is one of the most powerful ideas introduced in the last twenty years. It was originally designed for classification problems but can be extended to regression problems as well. It is a method in which multiple models are created according to a specific logic and then combined to get a prediction better than what we could have got from a single strong model.</p>
<p>In combining multiple models it is similar to bagging, but the way in which boosting comes up with its models is totally different from the way things are done in bagging.</p>
<p>To understand the concept of boosting <strong>intuitively</strong>, we will consider one of the earliest and most important boosting algorithms, called <strong>“AdaBoost.M1”</strong> or <strong>“Discrete AdaBoost”</strong>. To understand this algorithm we set a context where our response variable <script type="math/tex">Y</script> has two classes <script type="math/tex">{-1, 1}</script>. Note that with a few changes AdaBoost can be extended to a continuous response or a multiclass response as well, but for now we will stick to the <script type="math/tex">{-1, 1}</script> response.</p>
<p>Following is the pseudo code of the <strong>AdaBoost.M1</strong> algorithm. We will dissect it and look at each step in detail after that.</p>
<p><img src="/images/adaboostm1.png" alt="AdaBoost PseudoCode" /></p>
<h2 id="-to-simplify-things-a-bit-the-algorithm-"><center> To simplify things a bit, the algorithm: </center></h2>
<ol>
<li>Takes in the following parameters: <script type="math/tex">M</script>, the total number of iterations/runs of a classifier (let us stick to decision trees for now, so <script type="math/tex">M</script> decision trees), and training data of <script type="math/tex">N</script> samples.</li>
<li>Starts off by weighting each of the <script type="math/tex">N</script> training examples equally, with weight <script type="math/tex">\frac{1}{N}</script></li>
<li>Comes up with the first decision tree <script type="math/tex">G_{1}(x)</script> that minimizes a weighted error function given by <script type="math/tex">err_{1} = \frac{\sum_{i=1}^{N} (w_{i} \hspace{0.2cm} I(y_{i}\hspace{0.2cm}\neq\hspace{0.2cm} G_{1}(x_{i})))}{\sum_{i=1}^{N} w_{i}}</script></li>
<li>Calculates an update parameter <script type="math/tex">\alpha_{1}</script> for the first tree using <script type="math/tex">\log_{}[\frac{1-err_{1}}{err_{1}}]</script>; more generally, at each <script type="math/tex">m = 1,2,...,M</script>, <script type="math/tex">\alpha_{m}</script> is calculated as <script type="math/tex">\log_{}[\frac{1-err_{m}}{err_{m}}]</script> (here <script type="math/tex">\log</script> denotes the natural log)</li>
<li><script type="math/tex">\alpha_{1}</script> updates the weights (revising what we set up in Step 2), and the process continues by repeating Steps 3, 4 and 5.</li>
</ol>
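<p>The steps above can be sketched in plain Python. The sketch below assumes one-dimensional decision stumps as the base classifier <script type="math/tex">G_{m}(x)</script> (a simplification for brevity; the pseudo code allows any weak learner), and takes log to be the natural log; function names are illustrative:</p>

```python
import math

def fit_stump(X, y, w):
    """Fit a 1-D threshold stump minimizing the weighted error (a minimal G_m)."""
    best = None
    for thresh in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x >= thresh else -sign for x in X]
            err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi) / sum(w)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    return best

def adaboost_m1(X, y, M=5):
    N = len(X)
    w = [1.0 / N] * N                           # Step 1: equal initial weights
    classifiers = []
    for _ in range(M):
        err, thresh, sign = fit_stump(X, y, w)  # Steps 2-3: fit G_m, weighted error
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard the log against err = 0 or 1
        alpha = math.log((1 - err) / err)       # Step 4: alpha_m (natural log)
        classifiers.append((alpha, thresh, sign))
        pred = [sign if x >= thresh else -sign for x in X]
        # Step 5: multiply weights of mis-classified examples by exp(alpha_m)
        w = [wi * math.exp(alpha) if yi != pi else wi
             for wi, yi, pi in zip(w, y, pred)]
    return classifiers

def predict(classifiers, x):
    """Final output: sign of the alpha-weighted vote of all M stumps."""
    score = sum(a * (s if x >= t else -s) for a, t, s in classifiers)
    return 1 if score >= 0 else -1
```

<p>On an easy separable toy set the first stump already classifies everything; on harder data the re-weighting in Step 5 is what forces successive stumps to focus on earlier mistakes.</p>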
<p>The above mechanism raises two important questions. Why are weights assigned to the training examples? And what is the use of <script type="math/tex">\alpha_{m}</script>?</p>
<p>The answers to these two questions are the key to understanding how and why boosting works.</p>
<h2 id="1-why-assign-weights-to-training-examples">1. Why assign weights to training examples?</h2>
<p><br /></p>
<p>Let's look at the first question, about the weights. At first, the training samples are weighted equally, then a tree is fit using the weighted error function. After that, while fitting subsequent trees, <strong><script type="math/tex">\alpha_{m}</script> does the job of <em>reducing (or at least not increasing)</em> the weights of those training examples which were <em>correctly classified</em> by the current tree <script type="math/tex">G_{m}(x)</script> and <em>increasing</em> the weights of those which were <em>misclassified</em></strong>. (Do not worry if you don't yet see how <script type="math/tex">\alpha_{m}</script> updates the weights; we will look at that shortly. Just keep in mind, at a high level, what <script type="math/tex">\alpha_{m}</script> does to the weights depending on correct classification.)</p>
<p>To illustrate why this weight update works: once <script type="math/tex">G_{1}(x)</script> is fit, we will have <script type="math/tex">\alpha_{1}</script>, which updates the weights seen by <script type="math/tex">G_{2}(x)</script>. Now consider the fitting of the next tree <script type="math/tex">G_{2}(x)</script>. We know that decision tree algorithms choose the feature, and the value of that feature, which minimize the error at each split. Our error function is <script type="math/tex">err_{2} = \frac{\sum_{i=1}^{N} (w_{i} I(y_{i}\hspace{0.2cm}\neq\hspace{0.2cm} G_{2}(x_{i})))}{\sum_{i=1}^{N} w_{i}}</script>. Note the indicator variable <script type="math/tex">I(y_{i}\hspace{0.2cm}\neq\hspace{0.2cm} G_{2}(x_{i}))</script>. Because of this indicator, the decision tree will try to pick splits which <em>correctly classify the highly weighted observations (those misclassified by the previous tree)</em>: if those observations are misclassified again, the indicator takes the value <script type="math/tex">1</script> and the high weight <script type="math/tex">w_{i}</script> enters the evaluation of the error. So the decision tree is forced to prioritize the previously misclassified examples, because otherwise the weighted error would cease to come down and converge to a minimum. It is like steering an algorithm to focus on previously misclassified, highly weighted examples by including the weights in the loss function. This is why weights are assigned to each training example and accommodated in the error.</p>
<p><strong>So the higher the weight of a particular training sample, the greater the chance of the algorithm finding the right criteria to classify it correctly.</strong></p>
<h2 id="2-what-is-the-use-of-alpha_m">2. What is the use of <script type="math/tex">\alpha_{m}</script>?</h2>
<p><br />
The second important question is: how does <script type="math/tex">\alpha_{m}</script> increase the weights of training examples misclassified by a classifier <script type="math/tex">G_{m}(x)</script> and decrease the weights of correctly classified ones? Let us work through a few cases.</p>
<p><strong><em>Case 1</em></strong>
<br /></p>
<p><strong>Error of <script type="math/tex">G_{m}(x)</script> is high</strong></p>
<p>This means <script type="math/tex">err_{m}</script> is high. Let's assume the error is 0.9. Then <script type="math/tex">\alpha_{m} = \log_{}[\frac{1-err_{m}}{err_{m}}] = \log_{}[\frac{1-0.9}{0.9}] = \log_{}[\frac{1}{9}] \approx -2.1972</script> (taking the natural log).</p>
<p>As we can see, <script type="math/tex">\alpha_{m}</script> turned out to be negative. <strong><script type="math/tex">\alpha_{m}</script> decreases as <script type="math/tex">err_{m}</script> increases.</strong></p>
<p>Now let us consider the weight update for the sub-cases of correctly classified and mis-classified training examples. The formula for the weight update of <strong>any</strong> training example is <script type="math/tex">w^{\prime}_{i} = w_{i} \cdot \exp(\alpha_{m} \cdot I(y_{i}\hspace{0.2cm}\neq\hspace{0.2cm} G_{m}(x_{i})))</script>.</p>
<p><strong><em>Case 1.A</em></strong>
<br /></p>
<p><strong>Weights of Examples correctly-classified by <script type="math/tex">G_{m}(x)</script></strong></p>
<p>For rightly classified examples, the indicator variable is zero. This implies the exponent term in the weight updation becomes 1 (<script type="math/tex">\exp(0) = 1</script>). This implies updated weight <script type="math/tex">w^{\prime}_{i} = w_{i}</script> i.e. no change.</p>
<p><strong><em>Case 1.B</em></strong>
<br /></p>
<p><strong>Weights of Examples mis-classified by <script type="math/tex">G_{m}(x)</script></strong></p>
<p>For mis-classified examples, the indicator variable is 1. This implies the exponent term in the weight update becomes
<script type="math/tex">\exp(\alpha_{m} \cdot 1) = \exp(-2.1972) \approx 0.1111</script>. And therefore <script type="math/tex">w^{\prime}_{i} = w_{i} \cdot 0.1111</script>, which is <em>smaller</em> than <script type="math/tex">w_{i}</script>.</p>
<p>Hence, when the error is above 0.5, alpha is negative and the weights of mis-classified samples are actually <em>decreased</em>. This is exactly why AdaBoost.M1 assumes weak learners that do better than random guessing (<script type="math/tex">err_{m}</script> below 0.5): only then is <script type="math/tex">\alpha_{m}</script> positive, so that mis-classified samples are up-weighted and the next classifier is forced to prioritize them, as the next case shows.</p>
<p>This is how the combination of error, alpha and weights is used to create iteratively better learners.</p>
<p><strong><em>Case 2</em></strong>
<br /></p>
<p><strong>Error of <script type="math/tex">G_{m}(x)</script> is low</strong></p>
<p>This means <script type="math/tex">err_{m}</script> is low. Let's assume the error is 0.1. Then <script type="math/tex">\alpha_{m} = \log_{}[\frac{1-err_{m}}{err_{m}}] = \log_{}[\frac{1-0.1}{0.1}] = \log_{}[9] \approx +2.1972</script> (taking the natural log).</p>
<p>As we can see, <script type="math/tex">\alpha_{m}</script> turned out to be high. <strong><script type="math/tex">\alpha_{m}</script> decreases as <script type="math/tex">err_{m}</script> increases.</strong></p>
<p>Now let us consider the weight update for the sub-cases of correctly classified and mis-classified training examples. The formula for the weight update of <strong>any</strong> training example is, as before, <script type="math/tex">w^{\prime}_{i} = w_{i} \cdot \exp(\alpha_{m} \cdot I(y_{i}\hspace{0.2cm}\neq\hspace{0.2cm} G_{m}(x_{i})))</script>.</p>
<p><strong><em>Case 2.A</em></strong>
<br /></p>
<p><strong>Weights of Examples correctly-classified by <script type="math/tex">G_{m}(x)</script></strong></p>
<p>For rightly classified examples, the indicator variable is zero. This implies the exponent term in the weight updation becomes 1 (<script type="math/tex">\exp(0) = 1</script>). This implies updated weight <script type="math/tex">w^{\prime}_{i} = w_{i}</script> i.e. no change.</p>
<p><strong><em>Case 2.B</em></strong>
<br /></p>
<p><strong>Weights of Examples mis-classified by <script type="math/tex">G_{m}(x)</script></strong></p>
<p>For mis-classified examples, the indicator variable is 1. This implies the exponent term in the weight update becomes
<script type="math/tex">\exp(\alpha_{m} \cdot 1) = \exp(+2.1972) = 9</script>. And therefore <script type="math/tex">w^{\prime}_{i} = w_{i} \cdot 9 > w_{i}</script>: mis-classified examples carry nine times their previous weight into the next iteration.</p>
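<p>With the natural log taken throughout (an assumption; the pseudo code just writes log), the numbers in the two cases can be checked directly:</p>

```python
import math

def alpha(err):
    """alpha_m = log((1 - err_m) / err_m), with log the natural log."""
    return math.log((1 - err) / err)

# Case 2: err = 0.1, better than random -> alpha positive, and a
# mis-classified example's weight is multiplied by exp(alpha) = 9
factor_good_learner = math.exp(alpha(0.1))

# Case 1: err = 0.9, worse than random -> alpha negative,
# exp(alpha) = 1/9 < 1, so the same update would shrink those weights
factor_bad_learner = math.exp(alpha(0.9))
```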
<h2 id="summary">Summary</h2>
<p>So this is how boosting works in essence: examples are weighted according to whether they were classified correctly, the weights are updated at each iteration, and each new learner is forced to focus on the examples its predecessors got wrong.</p>Chaitanya Gokhalecgokhale92@gmail.comMachine Learning, predictive models, boosting, ensemble learning, classificationTensorflow: What, Why and Where?2018-04-05T00:00:00+00:002018-04-05T00:00:00+00:00https://statchaitya.github.io/tensorflowgettingstarted<h1 id="-tensorflow---what-why-and-where-"><center> TensorFlow - What, Why and Where? </center></h1>
<p>Most Data Scientists, if not all, hit a phase of stagnancy when it comes to writing Data Science code. We have repeatedly trained, tested and run inference with thousands of models locally, coding them in R and Python. It is inevitable that at some point or other, the need for prototyping with scalable technologies will grow. Enter the world of <strong>Spark</strong> and <strong>TensorFlow</strong>, two scalable technologies for creating machine learning models in the era of Big Data. While <strong>Spark</strong> has found popularity for developing traditional ML systems thanks to its intuitive syntax, <strong>TensorFlow</strong> is used to develop large-scale deep learning systems because it leverages array-like structures (tensors).</p>
<p>This post looks at what TensorFlow is, why it is effective and what some of its business use cases are.</p>
<h2 id="-what-is-tensorflow-"><center> What is TensorFlow? </center></h2>
<p>TensorFlow is an open-source software library which is used to run deep learning experiments. It was developed by Google Brain team for internal Google use.</p>
<h2 id="-why-tensorflow-"><center> Why TensorFlow? </center></h2>
<p>Because <em>it is opensource</em></p>
<p>Because <em>it is Scalable/Distributed and also runs on mobiles</em></p>
<ul>
<li>TensorFlow is a suite of software tools, including a very strong Python API, which enables data scientists to <strong>create models in a local Python environment and scale them using the same locally developed code. Cool, isn't it?</strong> TensorFlow can run on <em>clusters</em>, on <em>multiple CPUs/GPUs in the same machine</em>, and on <em>mobile devices</em>, which makes it flexible. The scalability is a result of <strong>TensorFlow's programming paradigm, in which problems are modelled as graphs</strong>, which enables massively parallel operations.</li>
</ul>
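<p>To make the graph paradigm concrete, here is a toy computational graph in plain Python (this is not TensorFlow's actual API, just an illustration of the build-then-run idea):</p>

```python
class Node:
    """A node in a toy computational graph: an operation plus its input nodes."""
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self):
        # Evaluate inputs first, then apply this node's op (like a session run)
        return self.op(*(node.run() for node in self.inputs))

def const(value):
    return Node(lambda: value)

# Build the graph first (no computation happens yet)...
a, b = const(3.0), const(4.0)
total = Node(lambda x, y: x + y, a, b)
doubled = Node(lambda x, y: x * y, total, const(2.0))

# ...then execute it; independent subgraphs could run in parallel
result = doubled.run()  # (3 + 4) * 2 = 14.0
```

<p>Separating graph construction from execution is what lets a framework distribute independent parts of the graph across CPUs, GPUs or machines.</p>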
<p>Because <em>it comes with additional support tools</em></p>
<ul>
<li>The TensorFlow library is part of a ‘suite of software’ which is made available to us. In TensorFlow, all elements of code are nodes and edges of a graph. When the project gets complicated, i.e. the graph gets complicated, we can use <strong>TensorBoard</strong> to visualize the graph and hence the workings of our code. It is a very useful debugging tool.</li>
<li>The main advantage of TensorFlow is that we can easily scale and productionize our locally developed code. Another tool in the suite, called <strong>TensorFlow Serving</strong>, lets us do this in a very straightforward way.</li>
<li>Combining these tools with the <strong>strong Python API</strong> makes TensorFlow a very powerful framework for building deep learning neural networks.</li>
</ul>
<h2 id="-where-is-tensorflow-used-"><center> Where is tensorflow used? </center></h2>
<p>TensorFlow has found, and will continue to find, use cases as we progress through the era of Deep Learning and Big Data. Following is a summary of some interesting areas where TensorFlow is put to use.</p>
<ol>
<li><strong>Language Understanding</strong>
<ul>
<li>Smart systems are only going to get smarter as the years go on. Every year new breakthroughs happen in fields like AI and IoT. It is not hard to believe that smart assistants may soon become as essential a part of our lives as cell phones.</li>
<li>TensorFlow finds use cases in voice recognition, speech-to-text and language modelling problems, which together contribute to the language understanding of smart systems.</li>
</ul>
</li>
<li><strong>Image Recognition</strong>
<ul>
<li>The most explored and most effective use case of Deep Learning is image recognition.</li>
<li>Applications like face recognition, image search, motion detection, machine vision and photo clustering can be developed using TensorFlow.</li>
</ul>
</li>
<li><strong>Text-based applications</strong>
<ul>
<li>Text-based applications are another extensive area where deep learning is used very effectively.</li>
<li>Applications such as sentiment analysis, threat detection from comments, fraud detection from comments, language detection and text summarization can be developed using TensorFlow.</li>
</ul>
</li>
</ol>
<p>In the next article we will explore the basic functionality of TensorFlow's Python API and its computational graph model, as well as strategies for thinking intuitively about that model and getting comfortable with it.</p>Chaitanya Gokhalecgokhale92@gmail.comNeural networks, AI, Scalability