Friday, December 19, 2008

A community index of programming languages

You can find this index here. It is based on search hits from popular search engines, so it does not necessarily indicate whether a given language is the best or the most widely used. However, it gives some indication of the current trend, on the assumption that the higher the search hits, the higher the popularity. Like Google Flu Trends (although that is based on the frequency of search keywords), the index needs a clever way of handling false positives in order to give an objective confidence value for the stats. (As pointed out, how do you differentiate between "Basic Programming" and "Basic Programming with Java"?) They only seem to have a manual approach for this, which could be tedious, time-consuming and error-prone.

I think a key reason for the false positives is that current mainstream search engines are unable to perform context-aware search. For example, if major search engines could answer "current top five programming languages" accurately, the problem would be trivial. To my knowledge, context-aware search technologies have yet to come out of research labs.

Absent such capabilities, I am sure there must be research work in IR (Information Retrieval) and DM (Data Mining) that attacks such problems and automates the task. Off the top of my head, I think we may be able to adapt something similar to Association Rules mining: once we get the search hit list, we can find the support for each language under consideration and then calculate the confidence using those support values.
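To make the Association Rules idea a bit more concrete, here is a minimal sketch in Python. It assumes each search hit has already been reduced to the set of language names it mentions (that preprocessing step, the sample data, and the function names are my own assumptions for illustration, not something the index does). Support is the fraction of hits mentioning a set of languages, and the confidence of the rule "basic -> java" estimates how often a hit for "basic" is really also about Java.

```python
def support(hits, items):
    """Fraction of hits whose term set contains every term in `items`."""
    items = frozenset(items)
    if not hits:
        return 0.0
    return sum(1 for hit in hits if items <= hit) / len(hits)


def confidence(hits, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent together) / support(antecedent)."""
    base = support(hits, antecedent)
    if base == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(hits, both) / base


# Hypothetical hit list: each hit is the set of languages it mentions.
hits = [
    frozenset({"basic", "java"}),   # e.g. "Basic Programming with Java"
    frozenset({"basic"}),           # e.g. "Basic Programming"
    frozenset({"java"}),
    frozenset({"basic", "java"}),
]

basic_support = support(hits, {"basic"})
basic_to_java = confidence(hits, {"basic"}, {"java"})
```

A high confidence for "basic -> java" would suggest that many of the hits counted for Basic are likely false positives that belong to Java, so the raw hit count for Basic could be discounted accordingly.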

Even if the false positives are minimized, does this index really give a measure of popularity? I think the answer is quite subjective. There may be articles criticizing one language or the other. Some languages could be so easy to use that there is no need for extensive tutorials, references, etc. They may need to take some of the following metrics into account to devise a better index:

-what languages companies use
-tooling and other support (libraries, etc.) available for languages (both commercial and free - should there be a trend based on whether those are commercial or free, it needs to be considered as well)
-what current open/closed code bases use and their sizes, and better yet the growth factor
-giving different weights to blog entries, forum messages, normal articles, tutorials, etc.
-making the search outcome time sensitive
-and so on.
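
As a rough illustration of the last two measurable points above (weighting by source type and time sensitivity), here is a small Python sketch. The weights, the one-year half-life, and the shape of the hit records are all hypothetical values I picked for the example; a real index would have to calibrate them.

```python
# Hypothetical weights per source type; not taken from any real index.
SOURCE_WEIGHTS = {"tutorial": 1.0, "article": 0.8, "forum": 0.5, "blog": 0.3}


def decay(age_days, half_life_days=365.0):
    """Exponential decay: a hit loses half its weight every half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(hits, half_life_days=365.0):
    """Sum weighted, time-decayed hits per language.

    Each hit is a (language, source_type, age_in_days) tuple.
    Unknown source types contribute nothing.
    """
    totals = {}
    for language, source, age_days in hits:
        weight = SOURCE_WEIGHTS.get(source, 0.0) * decay(age_days, half_life_days)
        totals[language] = totals.get(language, 0.0) + weight
    return totals


# Hypothetical sample hits.
sample_hits = [
    ("java", "tutorial", 0),    # fresh tutorial, full weight
    ("java", "blog", 365),      # year-old blog entry, half of 0.3
    ("python", "forum", 0),
]
scores = score(sample_hits)
```

With this scheme an old blog entry counts for much less than a recent tutorial, which is the kind of distinction a plain hit count cannot make.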

Provided that we have an accurate index, how could this be useful? For different stakeholders, it may be a good decision point. For example, if you are a company developing software, you may want to make sure that you select mainstream languages, in the hope that you won't be short of human resources and that support and tooling will be available. If you're a tool developer, both upward and downward trends could create opportunities for you: with an upward trend, you'll have a larger market to target. If a language shows a downward trend, you may want to analyze whether it is due to some gap, such as a lack of tooling support, and fill that gap.
