From the Nua-Chorpas - Michal Boleslav Měchura sent it to me years ago, but it is under the Open Database Licence.
I believe column 1 is the frequency rank (1 = most frequent), column 2 is the word (or lemma), column 3 is the frequency in the corpus (the number of times the word occurs in the whole corpus of 30m words), and column 4 is the window size (I think a value of 25 means the word occurs, on average, once every 25 words of Irish).
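As a rough check on that reading of column 4, here is a minimal Python sketch. It assumes the window size is simply the corpus size divided by the word's count, rounded down. The corpus size used is my own inference from one row quoted further down (101 occurrences, window 340180), which gives about 34.4 million words rather than exactly 30m; the names CORPUS_SIZE, window_size and per_million are made up for illustration and are not from the Nua-Chorpas itself.

```python
# Assumption: corpus size inferred from one row of the list
# (101 occurrences, window size 340180), not stated by the source.
CORPUS_SIZE = 101 * 340180  # 34,358,180 words


def window_size(count, corpus_size=CORPUS_SIZE):
    """Average gap between occurrences of a word, rounded down."""
    return corpus_size // count


def per_million(count, corpus_size=CORPUS_SIZE):
    """Occurrences of a word per million words of corpus."""
    return count * 1_000_000 / corpus_size


# Compare against column 4 of the list for a few rows.
for word, count in [("an", 1338874), ("bí", 1194301), ("comhairle", 26702)]:
    print(word, window_size(count), round(per_million(count), 1))
```

With a corpus size of exactly 30,000,000 the same calculation gives 22 for "an" rather than the listed 25, so the "30m" figure is probably a round number.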
I became interested in word frequency when studying Russian. Nick Brown's Learner's Frequency Dictionary of Russian arranges the 10,000 most frequent words by frequency (note: this always depends on what the corpus consists of, so the frequency is only approximate). He added that there were words like the Russian for "woodpecker" (дятел) that native speakers would know, but that were not very helpful to learners, as you could read Russian every day for 10 years and not come across it. In his view, only words occurring at least 10 times per 1m words were worth learning. There are 8,000 such words in Russian, but he rounded the book out to 10,000 words by including all words occurring at least 8 times per million.
Now Irish is periphrastic, so the raw number of words will be lower. E.g. "eviction" is a single word in English, but in Irish cur ó sheilbh uses three frequent words to make a new meaning. In fact, in the Nua-Chorpas, there are only 4,122 Irish words that occur at least 8 times per 1m words (i.e. a window size of 125,000 or less). There are many anomalies: the 6343rd word is polla, occurring 101 times, or once every 340180 words. Surely the word is more common than that? This may reflect the type of works fed into the corpus. Here are the 100 most common Irish words. If people are interested, we could gradually work through the list of 6,450 words that Michal sent me from the Nua-Chorpas.
1 an 1338874 25
2 bí 1194301 28
3 ar 898707 38
4 agus 856233 40
5 is 678055 50
6 ag 673684 51
7 le 663052 51
8 na 660024 52
9 do 526579 65
10 go 458180 74
11 de 304296 112
12 sé 295900 116
13 sin 243901 140
14 ó 240522 142
15 é 212565 161
16 seo 186776 183
17 cuir 181783 189
18 mar 181317 189
19 ach 174944 196
20 déan 169196 203 [=dein in Cork]
21 faoi 150173 228 [=fé in Cork]
22 nó 142220 241 [pronounced nú in Cork]
23 duine 139569 246
24 tabhair 123139 279
25 féin 114602 299
26 ní 104620 328
27 aon 100018 343
28 as 98622 348
29 chun 96082 357
30 eile 94831 362
31 abair 94140 364
32 mé 91318 376 [usually pronounce me as an object pronoun in Cork]
33 tar 91096 377 [=tair in Cork]
34 cuid 87857 391
35 maith 87286 393
36 faigh 86973 395
37 sí 81913 419
38 ná 79199 433
39 bliain 75787 453
40 siad 75348 455
41 téigh 74714 459
42 nuair 73679 466
43 iad 67270 510
44 amach 63887 537
45 mo 63778 538
46 cé 62903 546
47 nach 61995 554 [=nách in Cork]
48 bain 60640 566
49 ceann 58819 584
50 gach 55191 622
51 tú 54337 632 [usually tu or thu where an object pronoun in Cork]
52 rud 54179 634
53 í 53027 647
54 caith 52901 649
55 Gaeilge 52339 656 [=Gaelainn in Cork]
56 trí 52004 660
57 gan 51455 667
58 féidir 50408 681
59 lá 48892 702
60 chomh 47797 718
61 fear 45850 749
62 isteach 45573 753
63 fad 45242 759
64 áit 44613 770
65 beag 44314 775
66 am 43223 794
67 chuig 41165 834 [a variant of chun, so not used in Cork]
68 Éire 41141 835
69 obair 41108 835
70 céad 40394 850
71 amháin 40383 850
72 taobh 39944 860
73 anois 39654 866
74 céile 38960 881
75 mac 38875 883
76 feic 38852 884
77 níos 38529 891
78 má 37692 911
79 teach 37246 922 [=tigh in Cork]
80 ceart 36986 928
81 gur 36788 933
82 idir 36440 942
83 scéal 35691 962
84 tír 35130 978
85 saol 34478 996
86 bith 34266 1002 [only really in 'ar bith']
87 roimh 33297 1031 [usually roim in Cork]
88 féad 32801 1047
89 ceist 32045 1072
90 ansin 31686 1084 [=ansan in Cork]
91 deireadh 30577 1123
92 bean 29714 1156
93 dóigh 29194 1176 [pronounced dó in Cork]
94 dá 28842 1191
95 fios 28504 1205
96 uair 28084 1223
97 alt 27940 1229
98 te 27935 1229
99 pobal 27643 1242
100 comhairle 26702 1286
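For anyone who wants to work through the list programmatically, here is a small Python sketch of how an entry could be read and how a "times per million" cut-off translates into a window size. It assumes each entry has the shape "rank word count window", optionally followed by a note in square brackets; the names parse_row and max_window are hypothetical helpers of mine, not part of any Nua-Chorpas tooling.

```python
import re

# Assumed entry shape: rank, word, count, window, optional [note].
ROW = re.compile(r"^(\d+)\s+(\S+)\s+(\d+)\s+(\d+)(?:\s+\[(.*)\])?$")


def parse_row(line):
    """Split one entry of the list into its fields."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    rank, word, count, window, note = match.groups()
    return {
        "rank": int(rank),
        "word": word,
        "count": int(count),
        "window": int(window),
        "note": note,  # None when the entry has no bracketed note
    }


def max_window(times_per_million):
    """Largest window size that still meets a per-million cut-off."""
    return 1_000_000 // times_per_million


rows = [
    parse_row("20 déan 169196 203 [=dein in Cork]"),
    parse_row("100 comhairle 26702 1286"),
]
cutoff = max_window(8)  # 8 times per million -> window of 125000
frequent = [r["word"] for r in rows if r["window"] <= cutoff]
print(cutoff, frequent)
```

Because the window sizes in the list appear to be rounded down, a filter like this is only approximate for words sitting right at the cut-off.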