.. _lexicons:

Lexicons
========

SASTA uses several lexicons, each of which is briefly described below.

.. _Alpinolexicon:

Alpinolexicon
-------------
The Alpinolexicon is the lexicon used inside Alpino. It is not just a list of words with their properties: it has a more complex structure and also contains rules, programmed in Prolog, that deal with systematic cases. This makes it difficult to use outside of Alpino.

Whether a word ‘is contained in the Alpino lexicon’ is also not easy to determine. The only way to find out is to parse the word. However, Alpino analyses a word as a compound whenever it can, even if the word is not listed as such in the Alpino lexicon, and the parse does not show the difference. The properties that a word receives in a parse also depend in part on the context in which it occurs. For example, a verb may have many possible frames, but only one of them remains in a particular utterance.

Gertjan van Noord wrote to me about this (translated from Dutch)::

	if you have Alpino, it works like this
	
	Alpino
	p lex_all
	
	and now you can enter a sentence (or a word) on each line. Alpino then shows all the categories that the word has been assigned. Example:
	
	1 |: p lex_all
	1 |: de autootje
	[... debug info ...]
	TAG#0|1|de|determiner(de)|normal(normal)|de|0.0
	TAG#1|2|autootje|noun(het,count,sg)|diminutive|auto_DIM|0.6931471805599453
	TAG#1|2|autootje|noun(het,count,sg)|normal(normal)|auto_DIM|0.6931471805599453
	
	the lines that start with TAG are the relevant ones. They have fields, separated by |. Field 4 is the part of speech, field 5 is the name of the heuristic that was used. If that name starts with 'normal(', you could say that the word is simply in the dictionary.
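
The check described in this message can be sketched in Python. This is only an illustration and not part of SASTA: the function names ``tag_fields``, ``is_listed`` and ``listed_words`` are hypothetical, and the sketch assumes that the output of ``lex_all`` has already been captured as a string and that no field contains a ``|`` character itself.

.. code-block:: python

    SAMPLE_OUTPUT = """\
    [... debug info ...]
    TAG#0|1|de|determiner(de)|normal(normal)|de|0.0
    TAG#1|2|autootje|noun(het,count,sg)|diminutive|auto_DIM|0.6931471805599453
    TAG#1|2|autootje|noun(het,count,sg)|normal(normal)|auto_DIM|0.6931471805599453
    """


    def tag_fields(line):
        """Return the |-separated fields of a TAG line, or None for other lines."""
        if not line.startswith("TAG"):
            return None
        return line.strip().split("|")


    def is_listed(line):
        """True if field 5 (the heuristic) of a TAG line starts with 'normal('."""
        fields = tag_fields(line)
        return (
            fields is not None
            and len(fields) >= 5
            and fields[4].startswith("normal(")
        )


    def listed_words(output):
        """Collect the words (field 3) that have at least one 'normal(' analysis."""
        return {
            tag_fields(line)[2]
            for line in output.splitlines()
            if is_listed(line)
        }

Applied to the sample output, ``listed_words(SAMPLE_OUTPUT)`` contains both ``de`` and ``autootje``, because each has at least one analysis whose heuristic starts with ``normal(``. The ``diminutive`` analysis of ``autootje`` on its own would not count.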

.. _CELEX:

CELEX
-----
The lexicon that we use most is CELEX. The module lexicon.py provides the interface to the lexicon that is actually used:

.. automodule:: lexicon

Behind this interface, the lexicon actually used is the CELEX lexicon, which is handled by the celexlexicon module:

.. automodule:: celexlexicon

.. _top3000:

Top3000
-------

.. automodule:: top3000

.. _namelexicons:

Name lexicons
-------------

Names very often consist of multiple words.
For individual words it is therefore important to check whether they can be a part of a (possibly multiword) name.
The relevant module is the namepartlexicon module.

.. automodule:: namepartlexicon

The dictionary of name parts is derived by the SASTA script getnamepartslexicon:

.. automodule:: getnamepartslexicon

.. automodule:: namelexicons
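
The idea of a name part lexicon can be illustrated with a minimal sketch. It is not the implementation in namepartlexicon or getnamepartslexicon: the functions ``name_parts`` and ``can_be_name_part`` and the example names are hypothetical, and the sketch assumes that names are split on whitespace and matched case-sensitively.

.. code-block:: python

    def name_parts(names):
        """Collect every whitespace-separated part of the given names."""
        parts = set()
        for name in names:
            parts.update(name.split())
        return parts


    def can_be_name_part(word, parts):
        """True if the word occurs as a part of at least one known name."""
        return word in parts


    # Hypothetical example data; the real lexicon is built from other sources.
    EXAMPLE_NAMES = ["Den Haag", "Jan de Vries"]
    EXAMPLE_PARTS = name_parts(EXAMPLE_NAMES)

With this data, ``can_be_name_part("Haag", EXAMPLE_PARTS)`` is true even though ``Haag`` is not a name on its own in the list, while ``can_be_name_part("fiets", EXAMPLE_PARTS)`` is false.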

 
.. _filledpauseslexicon:

Filled pauses lexicon
---------------------

The filledpauseslexicon is a set created in the module dedup on the basis of the file filledpauseslexicon/filledpauseslexicon.txt in the code folder.

This file was created by searching the Dutch CHILDES corpora for strings marked with & (using the script getchildes.py) and then filtering the results manually.
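
The search step can be sketched as follows. This is not the code of getchildes.py: the function ``ampersand_tokens`` and the example utterances are hypothetical, and the sketch assumes that the utterances are available as plain strings in which marked tokens are separated by whitespace. The manual filtering is not modelled.

.. code-block:: python

    import re
    from collections import Counter

    # A token that starts with & and is preceded by whitespace or the line start.
    AMP_TOKEN = re.compile(r"(?<!\S)&\S+")


    def ampersand_tokens(utterances):
        """Count all &-marked tokens in an iterable of utterance strings."""
        counts = Counter()
        for utterance in utterances:
            counts.update(AMP_TOKEN.findall(utterance))
        return counts


    # Hypothetical utterances, for illustration only.
    EXAMPLE_UTTERANCES = ["&uh ik wil dat", "ja &uh &hm dat is mooi"]
    EXAMPLE_COUNTS = ampersand_tokens(EXAMPLE_UTTERANCES)

Counting the tokens instead of only collecting them makes the manual filtering easier, since frequent candidates can be inspected first.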


.. _compounds:

Compounds
---------

.. automodule:: compounds

.. _exceptionslists:

Exception lists
---------------

SASTA contains several lists of words, kept for a variety of reasons.
At the moment they are distributed over multiple files, but it would be better to collect them all in a single module.
We will call this module, which does not exist yet, exceptionlists.py.