Predicting the future popularity of programming languages

Sep 14, 2019 · Tags: Forecasting, ARIMA, Exponential Smoothing

Insert data and ETL

Stack Overflow data were used for this analysis. The dataset was downloaded from the Stack Exchange Data Explorer; a processed version of the file is also available for download. The same dataset was used by David Robinson in a DataCamp project.

Each Stack Overflow question carries tags that describe its topic or technology. The data has one observation for each tag-year pair, showing the number of questions asked with that tag in that year and the total number of questions asked in that year. For instance, in 2008 there were 54 questions tagged .htaccess, out of a total of 58390 questions that year.

year tag number year_total
2008 .htaccess 54 58390
2008 .net 5910 58390
2008 .net-2.0 289 58390
2008 .net-3.5 319 58390
2008 .net-4.0 6 58390
2008 .net-assembly 3 58390

Instead of the raw yearly count for each tag, we'll calculate and use the fraction of that year's questions that carry the tag (number divided by year_total). Because the total number of questions changes from year to year, fractions make the tags easier to compare.

year tag number year_total fraction
2008 .htaccess 54 58390 0.0009
2008 .net 5910 58390 0.1012
2008 .net-2.0 289 58390 0.0049
2008 .net-3.5 319 58390 0.0055
2008 .net-4.0 6 58390 0.0001
2008 .net-assembly 3 58390 0.0001
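
The fraction column is just number divided by year_total. As a minimal sketch in Python (the article's own code is in R; the helper name add_fraction and the in-memory rows are illustrative assumptions, not the author's implementation):

```python
# Rows copied from the table above: (year, tag, number, year_total).
rows = [
    (2008, ".htaccess", 54, 58390),
    (2008, ".net", 5910, 58390),
    (2008, ".net-2.0", 289, 58390),
]


def add_fraction(rows):
    """Append number / year_total to each row (hypothetical helper)."""
    return [
        (year, tag, number, total, number / total)
        for year, tag, number, total in rows
    ]


# Print each row with the fraction rounded to four decimals, as in the table.
for year, tag, number, total, fraction in add_fraction(rows):
    print(year, tag, number, total, round(fraction, 4))
```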

Predicting the future popularity of programming languages

It would be interesting to predict the future popularity of programming languages. I'll use the R forecast package to generate predictions, combining the two main forecasting methodologies, ARIMA* and exponential smoothing**. For each time series (one per programming language), two separate models are fitted, one with ARIMA and one with exponential smoothing, and the one with the lower error is selected for prediction. MAPE (mean absolute percentage error) is the metric used to evaluate the forecasting models.
* ARIMA (autoregressive integrated moving average) is a very popular technique for time series modelling. It describes the correlation between data points and works on differences of the values.
** Exponential smoothing methods include simple exponential smoothing (larger weights are assigned to more recent observations than to observations from the distant past), double exponential smoothing or Holt's linear trend model (which also accounts for the trend of the series), and triple exponential smoothing or the Holt-Winters method (which accounts for both the trend and the seasonality of the time series).
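
To make the two ingredients concrete, here is a rough Python sketch of simple exponential smoothing and of MAPE. It is not the forecast package's implementation: the smoothing weight alpha is fixed by hand here (fitting software normally estimates it from the data), the toy series is made up, and the article does not say whether its MAPE was computed in-sample or on held-out years.

```python
def simple_exp_smoothing(series, alpha):
    """One-step-ahead forecasts; forecasts[t] is the prediction for series[t].

    The level is initialised to the first observation (an assumption;
    other initialisations are possible).
    """
    level = series[0]
    forecasts = [level]
    for y in series[:-1]:
        # The newest observation gets weight alpha, the old level (1 - alpha),
        # so older observations receive exponentially decaying weights.
        level = alpha * y + (1 - alpha) * level
        forecasts.append(level)
    return forecasts


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


toy_series = [10.0, 12.0, 14.0]  # made-up values, for illustration only
toy_forecasts = simple_exp_smoothing(toy_series, alpha=0.5)
print(toy_forecasts, mape(toy_series, toy_forecasts))
```

Because a steadily rising series is always under-predicted by simple smoothing, the trend-aware variants (Holt, Holt-Winters) described above usually fit such data better.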

tag mape_arima mape_ets
c# 5.14 9.15
c++ 5.12 4.36
java 3.62 6.32
php 4.54 9.81
python 5.80 10.65
r 15.26 8.91
ruby 6.63 11.44

As the table shows, exponential smoothing gives a lower MAPE than ARIMA for R and C++. For the rest of the programming languages, ARIMA performs better.
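
The selection rule amounts to picking, per language, the model with the lower MAPE. A small Python sketch using the values from the table above (the dictionary layout and the name best_model are illustrative assumptions):

```python
# MAPE values copied from the table above.
mape_table = {
    "c#": {"arima": 5.14, "ets": 9.15},
    "c++": {"arima": 5.12, "ets": 4.36},
    "java": {"arima": 3.62, "ets": 6.32},
    "php": {"arima": 4.54, "ets": 9.81},
    "python": {"arima": 5.80, "ets": 10.65},
    "r": {"arima": 15.26, "ets": 8.91},
    "ruby": {"arima": 6.63, "ets": 11.44},
}

# For each tag, keep the name of the model with the smallest MAPE.
best_model = {
    tag: min(scores, key=scores.get) for tag, scores in mape_table.items()
}

for tag, model in best_model.items():
    print(tag, model, mape_table[tag][model])
```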

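The table below shows the forecasts from the best-performing model for each language. The lo.80/hi.80 and lo.95/hi.95 columns are the lower and upper bounds of the 80% and 95% prediction intervals.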

tag index key fraction lo.80 lo.95 hi.80 hi.95
c# 2019 forecast 0.0473700 0.0369888 0.0314934 0.0577512 0.0632466
c# 2020 forecast 0.0400400 0.0253588 0.0175870 0.0547212 0.0624930
c# 2021 forecast 0.0327100 0.0147293 0.0052109 0.0506907 0.0602091
c# 2022 forecast 0.0253800 0.0046176 -0.0063733 0.0461424 0.0571333
java 2019 forecast 0.0777976 0.0720138 0.0689520 0.0835814 0.0866432
java 2020 forecast 0.0815882 0.0719591 0.0668618 0.0912173 0.0963146
java 2021 forecast 0.0862540 0.0749955 0.0690356 0.0975125 0.1034724
java 2022 forecast 0.0895722 0.0781559 0.0721125 0.1009884 0.1070319
java 2023 forecast 0.0904288 0.0788499 0.0727203 0.1020077 0.1081372
python 2019 forecast 0.1095000 0.1035134 0.1003444 0.1154866 0.1186556
python 2020 forecast 0.1201000 0.1067137 0.0996274 0.1334863 0.1405726
python 2021 forecast 0.1307000 0.1083004 0.0964427 0.1530996 0.1649573
python 2022 forecast 0.1413000 0.1085103 0.0911525 0.1740897 0.1914475
python 2023 forecast 0.1519000 0.1075025 0.0839999 0.1962975 0.2198001
r 2023 forecast 0.0336761 0.0186408 0.0106816 0.0487115 0.0566707
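
The lo/hi columns are symmetric around the point forecast, which is consistent with normal-theory prediction intervals of the form forecast ± z·sigma. As a rough check (an assumption about how the intervals were built, not something the article states), this Python sketch recovers the implied standard deviation from the c# 2019 row; the helper name implied_sigma is hypothetical.

```python
from statistics import NormalDist  # standard library, Python 3.8+


def implied_sigma(point, upper, level):
    """Standard deviation implied by a symmetric normal interval.

    level is the coverage, e.g. 0.80 or 0.95 (hypothetical helper).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return (upper - point) / z


# Values copied from the c# 2019 row of the table above.
point, hi80, hi95 = 0.0473700, 0.0577512, 0.0632466

sigma_from_80 = implied_sigma(point, hi80, 0.80)
sigma_from_95 = implied_sigma(point, hi95, 0.95)
print(sigma_from_80, sigma_from_95)
```

If the two values agree, both intervals come from the same forecast standard deviation. A symmetric interval of this kind can also dip below zero, as the c# 2022 lo.95 value does, even though a fraction of questions cannot be negative.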

Below there is a plot with the future predictions

Summary

To compare the popularity of programming languages, we used the fraction of all Stack Overflow questions that concern each language over the last 10 years. Stack Overflow is, by far, the most popular platform for questions and answers on a wide range of topics in computer programming.

Of course, many other metrics could be used to measure popularity. Furthermore, forecasting models, like all models, are imperfect, and actual values could deviate more than predicted. Still, given the relatively small MAPE of the selected model for every time series (less than 9% in all cases), the predictions should be a good indication of the future popularity of programming languages.

Overall, the results are the following:
- Analytics programming languages (Python & R) will continue gaining popularity
- Java will gain a little and then keep a constant popularity
- JavaScript, C# & C++ will lose significant popularity
- PHP & Ruby could lose almost all their popularity and become obsolete in the next 5 years

Full R code