'Generation'
+ Generate
-> Specify whether
you want to generate a non-word or select a word from the lexical database
->
WARNING: When 'word' is selected, WordGen will sometimes generate words that are
in fact not real words (especially in Dutch; English example: 'sango'). These
stimuli nevertheless have an entry in the CELEX lexical database, which means
that they appeared in one of the written texts compiled for CELEX. Such errors
(often typing errors) are inherent in CELEX and cannot be corrected by WordGen.
->
WARNING: When 'non-word' is selected, WordGen will sometimes generate an
existing word. This is because WordGen only uses lemma databases, which do not
contain inflected word forms.
+ Constrain number
of neighbours
->
Specify the number of orthographic neighbours that the word/non-word you are
looking for should have.
-> For words, WordGen counts and reports all lemma entries in the lexical
database which share all but one letter with the generated word.
-> For non-words, increasing the number of neighbours will result in more 'wordlike'
non-words.
-> When setting this constraint, it is important to realise that this variable
is highly correlated with (non)word length. Click
here to see
the distribution of neighbourhood size as a
function of word length for words in the four lexical corpora. This information
may help in selecting reasonable neighbour constraints. For example, these
figures illustrate that it does not make sense to probe WordGen for a Dutch
8-letter word with 11 neighbours.
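The neighbour definition given above (lemma entries that share all but one letter with the generated string) can be illustrated with a minimal Python sketch. This is not WordGen's own code: the function names and the toy lexicon are hypothetical, and neighbours are assumed to have the same length as the target string.

```python
def is_neighbour(a, b):
    """True if a and b have equal length and differ in exactly one letter position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def neighbours(string, lexicon):
    """Return all lexicon entries that are orthographic neighbours of `string`."""
    return sorted(entry for entry in lexicon if is_neighbour(string, entry))


# Hypothetical mini-lexicon; a real search would use the full lemma database.
toy_lexicon = {"book", "boot", "cook", "look", "bock", "bank"}

print(neighbours("book", toy_lexicon))  # neighbours of a word
print(neighbours("bonk", toy_lexicon))  # neighbours of a letter string not in the lexicon
```

The neighbourhood size used as a constraint would then simply be the length of the returned list.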
+ Constrain word
frequency (only available in word generation)
->
Specify minimum and maximum frequency that the word you are looking for
should have.
-> In order to increase comparability across languages/studies, WordGen only
includes frequency per million words.
->
WordGen only includes lemma frequencies. For example, the frequency of the Dutch
word 'bank' covers both of its meanings, i.e. the piece of furniture and the
financial institution.
-> In accordance with the word recognition literature, we prefer to use the
logarithm (base 10) of the frequency/million as a measure of word frequency.
This rescaling corrects for the fact that the difference between two words
occurring respectively 1 and 3 times per million is not the same as the
difference between two words occurring respectively 101 and 103 times per million.
->
IMPORTANT: In addition to the theoretical argument discussed above, we also
STRONGLY advise using log freq/million for computational reasons.
WordGen's source code primarily uses log freq/million values. Raw frequency is
computed as the inverse (base 10) logarithm of log freq/million, i.e. 10 raised
to that value; therefore, raw frequencies may contain (very) small rounding
errors. They are included as approximate equivalents of the log values, which
might be more difficult to interpret.
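The effect of the log rescaling, and the reason raw frequencies are only approximate, can be checked with a few lines of Python. This is an illustrative sketch only; the function names are hypothetical, and the two-decimal rounding is an assumption made here to show the effect, not WordGen's documented storage precision.

```python
import math


def log_freq(freq_per_million):
    """Base-10 logarithm of a frequency per million."""
    return math.log10(freq_per_million)


def raw_freq(log_value):
    """Inverse of log_freq: 10 raised to the log value."""
    return 10 ** log_value


# The same raw difference (2 per million) is large at the low end of the
# scale and almost negligible at the high end once expressed in log units.
low_gap = log_freq(3) - log_freq(1)        # roughly 0.477
high_gap = log_freq(103) - log_freq(101)   # roughly 0.0085

# Converting a rounded log value back gives only an approximate raw frequency.
# Rounding to two decimals is an assumption for this illustration.
recovered = raw_freq(round(log_freq(3), 2))  # close to, but not exactly, 3

print(low_gap, high_gap, recovered)
```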
+ Constrain
summated type bigram frequency
->
Specify minimum and maximum summated bigram frequency that the (non-)word you are looking for
should have.
->
For nonwords, increasing the bigram frequency boundaries will generally result
in more 'wordlike' non-words.
->
WordGen sums the position-nonspecific frequency of each bigram of the generated
letter string (word or non-word). The frequency of a bigram is the number of
times it occurs in the Celex or Lexique lemma databases, independent of its
position in the word.
For example, the Dutch word
'boek' has a summated bigram frequency of 19898, which is the
sum of the number of occurrences of each of its bigrams in Celex: 'bo' (4123),
'oe' (9120), and 'ek' (6655).
-> Because the four languages have a
different number of words in the database, bigram frequencies differ
considerably between languages. For instance, the Dutch and English
databases in CELEX contain 124,136 and 52,447 entries, respectively. This
means that on average Dutch summated bigram frequencies will be more than twice
as high as English summated bigram frequencies. Also, because the program
works with summated bigram frequencies, the bigram frequency of short words
will on average be lower than that of long words. To help the
user set the summated bigram frequency constraint, we included four figures with
the distribution of bigram frequencies as a function of word length,
plotted separately for each language. This is available
here.
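The summation described above can be sketched in Python as follows. The function names are hypothetical, and the counting rule (every occurrence of a bigram in every lemma counts once) is an assumption about how the type counts are built. The 'boek' figures are the ones quoted in the text.

```python
from collections import Counter


def bigrams(string):
    """All adjacent letter pairs of a string, in order."""
    return [string[i:i + 2] for i in range(len(string) - 1)]


def bigram_type_counts(lemmas):
    """Count bigram occurrences over a list of lemmas, ignoring position."""
    counts = Counter()
    for lemma in lemmas:
        counts.update(bigrams(lemma))
    return counts


def summated_bigram_frequency(string, counts):
    """Sum of the counts of every bigram in the string (0 for unseen bigrams)."""
    return sum(counts.get(bigram, 0) for bigram in bigrams(string))


# Dutch Celex counts quoted in the text for the bigrams of 'boek'.
quoted_counts = {"bo": 4123, "oe": 9120, "ek": 6655}
print(summated_bigram_frequency("boek", quoted_counts))  # 19898
```

Because the value is a sum over all bigrams, a longer string has more terms, which is why longer strings tend to receive higher summated values.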
+ Constrain
minimum legal bigram frequency (only
available in non-word generation)
-> Specify the
minimum bigram frequency that each of the bigrams of the generated non-word
should have.
-> In general,
increasing this number will result in more 'wordlike' non-words.
->
This parameter supplements the summated bigram frequency parameter discussed
above. It prevents WordGen from generating non-words which contain illegal
bigrams but still have a high summated bigram frequency because one of the
other bigrams is very frequent.
-> Because Celex also contains some typing errors, it is not advisable to use a
very low value such as '1' as the criterion for a 'legal' bigram. WordGen's
default value has been shown to be efficient in practice.
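The interplay between the summated and the minimum bigram constraint can be illustrated with a short Python sketch. The bigram counts, the example strings and the threshold of 30 are all hypothetical; WordGen's actual default value is not reproduced here.

```python
def bigrams(string):
    """All adjacent letter pairs of a string, in order."""
    return [string[i:i + 2] for i in range(len(string) - 1)]


def min_bigram_frequency(string, counts):
    """Frequency of the rarest bigram in the string (0 if a bigram is unseen)."""
    return min((counts.get(bigram, 0) for bigram in bigrams(string)), default=0)


def has_only_legal_bigrams(string, counts, threshold):
    """True if every bigram in the string reaches the minimum frequency."""
    return min_bigram_frequency(string, counts) >= threshold


# Hypothetical counts: 'er' is very frequent, 'ux' is nearly absent.
toy_counts = {"qu": 900, "ux": 2, "xe": 40, "er": 12000, "ue": 800}

# 'quxer' has a high summated frequency (12942) thanks to 'er',
# but is rejected because 'ux' falls below the threshold.
print(has_only_legal_bigrams("quxer", toy_counts, threshold=30))
print(has_only_legal_bigrams("quer", toy_counts, threshold=30))
```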
+ Constrain
minimum position-specific onset/suffix bigram frequency (only
available in non-word generation)
->
Specify the minimum position-specific bigram frequency that the first and last
bigram of the generated word/non-word should have.
->
WordGen counts how many words in the lexical database contain the first/last
bigram of the generated non-word as the first/last bigram of the word (position specific).
For example, the Dutch word
'boek' has a summated bigram frequency of 19898, which is the
sum of the number of occurrences of each of its bigrams in Celex: 'bo' (4123),
'oe' (9120), and 'ek' (6655). The position-specific onset bigram
frequency of the first bigram, 'bo', in Dutch is 1608. Hence, there are 1608
words in Celex that begin with the letters 'bo'. The suffix bigram frequency of
'ek' is 1815. Hence, there are 1815 words in Celex ending in 'ek'.
-> In general,
increasing this number will result in more 'wordlike' non-words.
->
This parameter supplements the bigram frequency parameters discussed above. Some
bigrams are quite frequent in a certain language, but almost never occur as the
first (or last) two letters of a word (e.g. 'rt' as an onset in English). This
option prevents WordGen from generating a non-word which is illegal because it
has such an onset (or suffix).
-> Because Celex also contains some typing errors, it is not advisable to use a
very low value such as '1' as the criterion for a 'legal' bigram. WordGen's
default value has been shown to be efficient in practice.
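Position-specific onset and suffix counts can be sketched in Python as below. The function name and the toy lemma list are hypothetical; with the full Dutch Celex lemma list, the counts for onset 'bo' and suffix 'ek' would correspond to the figures quoted above.

```python
from collections import Counter


def onset_suffix_counts(lemmas):
    """Count how many lemmas start and end with each bigram."""
    onsets, suffixes = Counter(), Counter()
    for lemma in lemmas:
        if len(lemma) >= 2:
            onsets[lemma[:2]] += 1     # first two letters
            suffixes[lemma[-2:]] += 1  # last two letters
    return onsets, suffixes


# Hypothetical mini-lexicon mixing Dutch and English entries.
toy_lemmas = ["boek", "boot", "bord", "koek", "part", "cart"]
onsets, suffixes = onset_suffix_counts(toy_lemmas)

print(onsets["bo"], suffixes["ek"])  # position-specific counts
print(onsets["rt"], suffixes["rt"])  # 'rt' occurs as a suffix but never as an onset
```

A candidate non-word would pass this constraint only if both its first and its last bigram reach the chosen minimum in the respective counter.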
+ Use heuristic (only
available in non-word generation)
->
By default, WordGen generates random letter strings, and then checks whether the
generated letter string meets all the specified constraints (about 1000 strings
per second). When using a difficult combination of strict constraints, this
process can take some time, especially when searching for long non-words. In
that case, it may be advisable to use the heuristic approach.
->
When selected, WordGen will create a non-word by randomly selecting an entry from
the lexical corpus and changing one letter of that lemma. It then checks
whether the created non-word meets all other constraints.
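The two search strategies described above (random generate-and-test versus the one-letter heuristic) can be contrasted in a minimal Python sketch. It is not WordGen's implementation: all names are hypothetical, the alphabet is assumed to be a-z, constraints are modelled as plain predicate functions, and restricting the heuristic to lemmas of the requested length is an assumption.

```python
import random
from string import ascii_lowercase


def random_string(length, rng):
    """Default strategy: a completely random letter string."""
    return "".join(rng.choice(ascii_lowercase) for _ in range(length))


def mutate_one_letter(lemma, rng):
    """Heuristic strategy: replace one randomly chosen letter of a lemma."""
    pos = rng.randrange(len(lemma))
    options = [c for c in ascii_lowercase if c != lemma[pos]]
    return lemma[:pos] + rng.choice(options) + lemma[pos + 1:]


def generate_nonword(lexicon, length, constraints, use_heuristic=False,
                     max_tries=10_000, seed=None):
    """Propose candidates until one is not in the lexicon and passes every check."""
    rng = random.Random(seed)
    known = set(lexicon)
    same_length = [w for w in lexicon if len(w) == length]
    for _ in range(max_tries):
        if use_heuristic and same_length:
            candidate = mutate_one_letter(rng.choice(same_length), rng)
        else:
            candidate = random_string(length, rng)
        if candidate not in known and all(check(candidate) for check in constraints):
            return candidate
    return None  # no candidate found within the try limit


toy_lexicon = ["boek", "boot", "koek", "bord"]  # hypothetical mini-lexicon
print(generate_nonword(toy_lexicon, 4, [], use_heuristic=True, seed=1))
```

Because every heuristic candidate differs from an existing lemma by one letter, it starts out with at least one neighbour and mostly familiar bigrams, which is why fewer candidates need to be rejected under strict constraints.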
+ Use wildcard
-> Specify any
letters that the generated word/non-word should have (position specific)
->
Use an asterisk as the wildcard. For example, if you are looking for a 6-letter
word/non-word starting with a 'b', and ending in a 'k', enter 'b****k' (without
the quotation marks).
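The wildcard pattern can be read as a position-by-position template. A minimal Python sketch of such a check, with a hypothetical function name and made-up example strings, might look like this:

```python
def matches_pattern(string, pattern, wildcard="*"):
    """True if the string has the pattern's length and agrees with every fixed letter."""
    return len(string) == len(pattern) and all(
        p == wildcard or p == c for p, c in zip(pattern, string)
    )


print(matches_pattern("basalk", "b****k"))  # 6 letters, starts with 'b', ends in 'k'
print(matches_pattern("bank", "b****k"))    # wrong length
```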
+
Forbidden letter list
->
Specify any letters that the generated word/non-word should not contain (not
position specific)
->
Type in all 'forbidden' letters next to each other, without spaces. For example,
if you do not want any words/non-words containing the letters 'x', 'y' or 'z',
simply enter 'xyz' (without the quotation marks).
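Since the forbidden letter list is not position specific, the check amounts to a set intersection. A short Python sketch (the function name is hypothetical):

```python
def contains_forbidden(string, forbidden):
    """True if the string contains at least one of the forbidden letters."""
    return not set(string).isdisjoint(forbidden)


print(contains_forbidden("boxer", "xyz"))  # contains 'x'
print(contains_forbidden("boek", "xyz"))   # contains none of the letters
```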
+ Load/save
parameters
->
Sometimes, it can take a while before the user finds an adequate set of
parameters for the specific stimuli that are needed. Also, when generating
non-words, it may be useful to be able to look up the criteria that were used
some time after the stimuli were created. In those cases, it may
be useful to save the parameter combination in a file on the hard disk. This
information is stored in a plain text file (.pmf extension), in any location
that the user wants. It can be loaded back into the GUI with the 'load
parameters' button.
->
An example parameter file can be found
here.
+
Generate
->
When clicked, WordGen generates a single word/non-word satisfying the
constraints that were set.
+
Generate List frame
->
Researchers often need several words/non-words satisfying
the same constraints. For example, somebody may want to see several non-words,
all satisfying the specified constraints, before manually selecting
one from that list. Or somebody may want a list of words in a certain frequency
category, before manually selecting all nouns from that list. In those cases,
WordGen can generate a list of words/non-words satisfying the same set of
parameters.
->
Enter the desired number of words/non-words. Note that WordGen still operates
within its 'search time limit' constraint (see
above). For each word/non-word, WordGen
will try to find a letter string within the specified time limit. If this fails
for a given word, WordGen will abort if 'break' is selected in the 'options'
pane (see
above). In that case, the output file will
contain fewer stimuli than requested. When generating long lists of words/non-words
with strict constraint combinations, it is advisable to disable the 'limit search
time' feature, or to select 'continue' in the 'options' pane (see above).
->
Always specify a location and file name where WordGen can store its search
results. By default, the file extension is .prn. IMPORTANT: With each new search,
WordGen overwrites this file unless a different file name is specified for the
new search!
->
An example output file can be seen here.
This file contains WordGen's output when it was probed for 100 five-letter
English non-words having 2-10 neighbours and satisfying the minimum legal bigram
and legal onset/suffix constraints. This parameter combination corresponds to
this example parameter file (also
mentioned above).
->
The output file contains the following fields (separated by spaces, exportable
to a spreadsheet program)
- (non)wordstring
- log frequency per million of wordstring (this field is set to 0 for non-words)
- number of neighbours (n)
- neighbourstrings [1...n]
- summated type bigram frequency
- number of bigrams (o)
- bigramstrings [1...o]
- type bigram frequencies [1...o]
->
If 'write detailed output to file' is not selected, WordGen will only write the
first field [(non)wordstring].
->
IMPORTANT: If WordGen is asked to generate a list of word stimuli, that list
will contain only one distinct word if 'linear' is selected on the 'options'
pane (see above)!
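For readers who prefer to process the detailed output programmatically rather than in a spreadsheet, the field order listed above can be read with a small Python sketch. The parser and the sample line are hypothetical: the line was constructed from the 'boek' bigram figures quoted earlier, with a made-up log frequency and made-up neighbours, and the exact layout of real .prn files has not been verified here.

```python
def parse_detailed_line(line):
    """Split one space-separated output line into the documented fields."""
    tokens = line.split()
    record = {"string": tokens[0], "log_freq_per_million": float(tokens[1])}
    n = int(tokens[2])                      # number of neighbours
    pos = 3
    record["neighbours"] = tokens[pos:pos + n]
    pos += n
    record["summated_bigram_frequency"] = int(tokens[pos])
    o = int(tokens[pos + 1])                # number of bigrams
    pos += 2
    record["bigrams"] = tokens[pos:pos + o]
    pos += o
    record["bigram_frequencies"] = [int(t) for t in tokens[pos:pos + o]]
    return record


# Hypothetical sample line following the documented field order.
sample = "boek 1.5 2 boel koek 19898 3 bo oe ek 4123 9120 6655"
record = parse_detailed_line(sample)
print(record["neighbours"], record["bigrams"], record["bigram_frequencies"])
```

Because the number of neighbours and the number of bigrams vary per line, the two count fields (n and o) are what make the remaining fields locatable.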