George Kingsley Zipf (1902-1950) observed an interesting phenomenon in natural languages. This phenomenon is now known as Zipf’s law, which Wikipedia defines as follows: “Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.”
This law has since been tested on multiple languages, on music, and even on numbers found in commercial databases such as accounting data and financial reports.
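To make the rank-frequency relationship quoted above concrete, here is a minimal Python sketch that builds a frequency table and compares each observed count with the idealized prediction (top count divided by rank). The function names `rank_frequency` and `zipf_expected` and the toy sentence are assumptions made up for illustration; a sentence this short is not a real corpus, and a meaningful test would need a large body of text.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_expected(top_count, rank):
    """Count predicted by the idealized law: frequency proportional to 1/rank."""
    return top_count / rank


# Toy input, invented for illustration only.
sample = "the cat and the dog and the bird saw the fox"
table = rank_frequency(sample)
top_count = table[0][2]

for rank, word, count in table[:3]:
    print(rank, word, count, zipf_expected(top_count, rank))
```

On a real corpus, plotting the logarithm of the count against the logarithm of the rank should give roughly a straight line with slope close to -1 if the law holds.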
The chart below shows Zipf’s law revealed in musical masterpieces:
Although the data, especially for Schoenberg, follow the Zipf’s law curve on a log-log scale only in discrete steps, it is clear that even the creativity of maestros is broadly subject to mathematical laws.
This law in the field of Linguistics has a cousin in Number Theory called Benford’s Law, or the First Digit Law, discovered by Frank Benford, a physicist working with the General Electric Company, in 1938. The law says that a digit “n” from 1 to 9 (base 10) appears as the first digit of the numbers in many real-world databases with probability
log10(1 + 1/n), as shown in the chart below:
The application of this principle to fraud detection using machine learning should now be straightforward. Make these laws your feature sets, run your algorithm, and find out whether any or all of the reported financial statements look doctored. Presto!
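As one possible way to turn the first-digit law into features, the sketch below computes the expected probabilities log10(1 + 1/n) for n = 1 to 9 and then measures how far a set of reported amounts deviates from them. The helper names (`benford_expected`, `first_digit`, `benford_features`), the two deviation measures (mean absolute deviation and a chi-square statistic), and the sample amounts are all assumptions chosen for illustration, not a method prescribed here. A large deviation is only a reason to look closer, not proof of tampering.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities: log10(1 + 1/n) for n = 1..9."""
    return {n: math.log10(1 + 1 / n) for n in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a number, or None if there is none (e.g. 0)."""
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    return None


def benford_features(amounts):
    """Return deviation features comparing amounts with Benford's law."""
    digits = [d for d in map(first_digit, amounts) if d is not None]
    total = len(digits)
    if total == 0:
        raise ValueError("no non-zero amounts to analyse")
    counts = Counter(digits)
    expected = benford_expected()
    observed = {n: counts.get(n, 0) / total for n in range(1, 10)}
    # Mean absolute deviation between observed and expected proportions.
    mad = sum(abs(observed[n] - expected[n]) for n in range(1, 10)) / 9
    # Chi-square statistic on raw counts (8 degrees of freedom).
    chi2 = sum((counts.get(n, 0) - total * expected[n]) ** 2
               / (total * expected[n]) for n in range(1, 10))
    return {"mad": mad, "chi2": chi2, "observed": observed}


# Hypothetical amounts, invented for illustration only.
amounts = [1200.50, 187.00, 23.75, 1034.10, 310.00, 4.99, 960.00, 15.20]
features = benford_features(amounts)
print(features["mad"], features["chi2"])
```

Values such as `mad` and `chi2`, computed per statement or per account, could then be fed to whatever classifier or anomaly detector is in use. Small samples like the one above are far too short for the comparison to be meaningful.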