Insert data and ETL
Stack Overflow data were used for this analysis. The dataset was downloaded from the Stack Exchange Data Explorer. The processed file is also downloadable here. It was used by David Robinson in a DataCamp project.
Each Stack Overflow question has a tag, which describes the question's topic or technology. This data has one observation for each pair of a tag and a year, showing the number of questions asked under that tag in that year and the total number of questions asked in that year. For instance, there were 54 questions asked about the .htaccess tag in 2008, out of a total of 58390 questions in that year.
Instead of using the raw count of questions for each tag in a year, we’ll calculate and use the fraction of that year's questions that belong to each tag. Because the total number of questions differs from year to year, fractions make the comparison across years more convenient.
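The normalization step can be sketched in a few lines. This is only an illustrative Python sketch, not the code used for the analysis: the field names (`tag`, `year`, `number`, `year_total`) and the helper `add_fraction` are assumptions made for the example. The single row uses the .htaccess figures quoted above.

```python
# Illustrative sketch: the field names and the helper are assumptions,
# not the dataset's confirmed column names.
rows = [
    {"tag": ".htaccess", "year": 2008, "number": 54, "year_total": 58390},
]


def add_fraction(records):
    """Return copies of the records with a 'fraction' field added.

    fraction = questions for the tag in a year / all questions in that year
    """
    return [dict(r, fraction=r["number"] / r["year_total"]) for r in records]


with_fraction = add_fraction(rows)
# Print the share of 2008 questions tagged .htaccess as a percentage.
print(f"{with_fraction[0]['fraction']:.4%}")
```

For the .htaccess row this works out to roughly 0.09% of the questions asked in 2008, a value that can be compared directly with later years even though the yearly totals differ.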
How have popular programming languages changed over time?
It would be interesting to look at the popularity of the top programming languages. In particular, the following programming languages are included:
The fraction of questions for each tag (out of all questions in that year) is used for this comparison.
Since 2013 the total number of questions has not been growing significantly. So we can say that the trends of the programming languages are meaningful, especially from 2013 onward.
Predicting the future popularity of programming languages
It would be interesting to predict the future popularity of the programming languages. I’ll use the forecast package to generate predictions, combining two of the main forecasting methodologies: ARIMA and exponential smoothing. For each time series (one per programming language), two separate models are created, one using ARIMA* and one using exponential smoothing methods**, and the better of the two is selected for prediction. MAPE (mean absolute percentage error) is chosen to evaluate the forecasting models.
* ARIMA (AutoRegressive Integrated Moving Average) is a very popular technique for time series modelling. It describes the correlation between data points and takes into account the differences between consecutive values (differencing).
** Exponential smoothing methods include simple exponential smoothing (larger weights are assigned to more recent observations than to observations from the distant past), double exponential smoothing or Holt's linear trend model (which also takes into account the trend of the series) and triple exponential smoothing or the Holt-Winters method (which takes into account both the trend and the seasonality of the time series).
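The idea of fitting competing models and keeping the one with the lower MAPE can be sketched as follows. This is a simplified Python illustration and not the forecast-package workflow described above: it compares two hand-written exponential smoothing variants (simple and Holt's linear trend) instead of ARIMA versus exponential smoothing, the smoothing parameters are fixed rather than estimated, MAPE is computed on a held-out tail of the series (the post does not say how the error was measured), and the series is made-up toy data. All function names are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def ses_forecast(series, horizon, alpha=0.5):
    """Simple exponential smoothing: flat forecast at the last smoothed level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


def holt_forecast(series, horizon, alpha=0.5, beta=0.5):
    """Holt's linear trend method: extrapolate the last level and trend."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]


def pick_best(series, holdout=2):
    """Fit each candidate on the head of the series, score it on the tail."""
    train, test = series[:-holdout], series[-holdout:]
    candidates = {"ses": ses_forecast, "holt": holt_forecast}
    scores = {name: mape(test, fit(train, holdout)) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Toy yearly fractions with a steady upward trend (made-up numbers).
toy_series = [0.010, 0.012, 0.014, 0.016, 0.018, 0.020, 0.022, 0.024]
best, scores = pick_best(toy_series)
print(best, scores)
```

On a steadily trending series like this one, the trend-aware model follows the held-out points closely while the flat forecast lags behind, so the selection step picks the Holt variant. The same selection logic applies whatever the two candidate models are.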
As you can see above, the predictions for R and C++ are better with exponential smoothing than with ARIMA. For the rest of the programming languages, ARIMA seems to be the best methodology.
Below is a table with the future predictions, using the best-performing model for each language.
Below is a plot of the future predictions.
To compare the popularity of programming languages, we used the fraction of total questions on Stack Overflow that concern each language over the last 10 years. Stack Overflow is, by far, the most popular platform for questions and answers on a wide range of topics in computer programming.
Of course, there are many other metrics that could be used to measure popularity. Furthermore, forecasting models, like all models, are not perfect, and actual values could deviate more than predicted. But given the relatively small MAPE (less than 9% for every time series), the predictions should be a good indication of the future popularity of programming languages.
Overall, the results are the following:
- Analytics programming languages (Python & R) will continue gaining popularity
- Java will gain a little and then keep a constant popularity
- PHP & Ruby could lose almost all their popularity and become obsolete in the next 5 years