Before deciding whether we can add support for the Turkish language, we need to check the availability of basic training data. We went through the publicly available Turkish corpora and analysed them.
CoNLL-U is the format used by the Universal Dependencies initiative to annotate dependency treebanks. A large number of treebanks for many different languages are available in the CoNLL-U format.
Sentences consist of one or more word lines, and each word line contains the following ten tab-separated fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC.
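As a rough illustration of the format, the sketch below splits one word line into the ten tab-separated fields named in the CoNLL-U specification (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sample line and the helper name parse_word_line are made up for this example and are not copied from the treebank files.

```python
# Field names as defined by the CoNLL-U specification, in column order.
CONLLU_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_word_line(line):
    """Split one CoNLL-U word line into a dict keyed by field name.

    Hypothetical helper for illustration only. An underscore marks an
    unspecified value in CoNLL-U and is kept as-is here.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 tab-separated fields, got %d" % len(values))
    return dict(zip(CONLLU_FIELDS, values))


# Illustrative word line (assumed values, not taken from the treebank).
sample = "5\tzengin\tzengin\tADJ\tAdj\t_\t0\troot\t_\t_"
word = parse_word_line(sample)
print(word["FORM"], word["UPOS"], word["HEAD"], word["DEPREL"])
```

All values are kept as strings; a real reader would also need to handle comment lines starting with # and multiword token ranges, which this sketch ignores.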
The main problem in training spaCy models is obtaining treebank data in CoNLL-U format. Source treebanks can only be produced manually, through expert linguistic analysis. Small treebanks are available from Universal Dependencies.
Universal Dependencies (UD) is a framework for cross-linguistically consistent grammatical annotation and an open community effort with over 200 contributors producing more than 100 treebanks in over 60 languages. The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak et al., 2016), which is itself a reannotated version of the METU-Sabancı Turkish Treebank (Oflazer et al., 2003). All three of the treebanks share the same raw data, a set of 5635 sentences collected from daily news reports and novels.
The spaCy convert command converts the UD Turkish Treebank data from *.conllu files to the JSON format used for training.
python -m spacy convert tr-ud-dev.conllu ./ -c conllu
python -m spacy convert tr-ud-train.conllu ./ -c conllu
python -m spacy convert tr-ud-test.conllu ./ -c conllu
The conversion yields 975 trees from tr-ud-dev.conllu, 3685 trees from tr-ud-train.conllu, and 975 trees from tr-ud-test.conllu, which together account for all 5635 sentences of the treebank.
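The tree counts can be cross-checked independently of spaCy. The sketch below is a minimal counter that relies on the CoNLL-U convention that sentences are separated by blank lines and that comment lines start with #. The function name count_trees and the two-sentence sample are assumptions made for this example.

```python
import io


def count_trees(lines):
    """Count sentences in CoNLL-U text.

    A sentence is a run of non-comment lines; a blank line ends it.
    """
    count = 0
    in_sentence = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            in_sentence = False  # blank line closes the current sentence
        elif not stripped.startswith("#") and not in_sentence:
            in_sentence = True   # first word line of a new sentence
            count += 1
    return count


# Tiny made-up sample with two sentences (not real treebank content).
sample = (
    "# sent_id = 1\n"
    "1\tOrada\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tzengin\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "2\t.\t_\t_\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)
print(count_trees(io.StringIO(sample)))
```

For a real file, the same function can be given an open file object, for example open("tr-ud-dev.conllu", encoding="utf-8").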
An example of the resulting JSON file structure:
{
  "id":3,
  "paragraphs":[
    {
      "sentences":[
        {
          "tokens":[
            {
              "head":2,
              "tag":"Noun",
              "orth":"Orada",
              "dep":"obl"
            },
            {
              "head":-1,
              "tag":"Rel",
              "orth":"ki",
              "dep":"case"
            },
            {
              "head":2,
              "tag":"Verb",
              "orth":"tart\u0131\u015fma",
              "dep":"nsubj"
            },
            {
              "head":1,
              "tag":"Adverb",
              "orth":"hayli",
              "dep":"advmod"
            },
            {
              "head":0,
              "tag":"Adj",
              "orth":"zengin",
              "dep":"ROOT"
            },
            {
              "head":-1,
              "tag":"Punc",
              "orth":".",
              "dep":"punct"
            }
          ]
        }
      ]
    }
  ]
}
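The head values in this example look like offsets relative to each token's own position rather than absolute indices: the ROOT token has head 0, and the final punctuation mark has head -1, pointing one token back. Assuming that reading is correct, the sketch below resolves each offset to the word it points at. The token list repeats the example above with the tags omitted, and resolve_heads is a hypothetical helper name.

```python
# Tokens from the example sentence above (tags omitted for brevity).
tokens = [
    {"head": 2, "orth": "Orada", "dep": "obl"},
    {"head": -1, "orth": "ki", "dep": "case"},
    {"head": 2, "orth": "tart\u0131\u015fma", "dep": "nsubj"},
    {"head": 1, "orth": "hayli", "dep": "advmod"},
    {"head": 0, "orth": "zengin", "dep": "ROOT"},
    {"head": -1, "orth": ".", "dep": "punct"},
]


def resolve_heads(tokens):
    """Return (word, dep, head_word) triples.

    Assumes "head" is an offset relative to the token's own index,
    so an offset of 0 means the token is its own head (the root).
    """
    triples = []
    for i, tok in enumerate(tokens):
        head_index = i + tok["head"]
        triples.append((tok["orth"], tok["dep"], tokens[head_index]["orth"]))
    return triples


for word, dep, head in resolve_heads(tokens):
    print(word, dep, head, sep="\t")
```

Under this assumption every token except the first two attaches, directly or indirectly, to "zengin", which is consistent with it being labelled ROOT.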