feat: add Kaikki extraction and import scripts for stage 1

- Add stage-1-extract/scripts/extract.ts: streams Kaikki JSONL, filters to supported POS and languages, skips abbreviations and senses with no translations in supported languages
- Rewrite db/import.ts for Kaikki flat model: tracks sense_index offsets per headword+pos to handle duplicate JSONL entries
- Rewrite db/schema.sql for Kaikki model: entries, translations, LLM vote tables, resolved tables
- Add extract and db:import scripts to package.json
- Sample mode hardcoded to 500 entries for development
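
The extraction filter and the sense_index offset tracking described in the message could look roughly like the sketch below. It is not the code from this commit. The field names follow the general shape of Kaikki JSONL entries, but the supported POS and language sets, the helper names `extractEntry` and `SenseIndexTracker`, and the choice to detect abbreviations through a sense-level `abbreviation` tag are all assumptions made for illustration. File streaming is left out; the function takes one JSONL line at a time.

```typescript
// Hypothetical shapes: only the fields this sketch reads.
interface KaikkiTranslation { code?: string; word?: string; }
interface KaikkiSense {
  glosses?: string[];
  tags?: string[];
  translations?: KaikkiTranslation[];
}
interface KaikkiEntry {
  word: string;
  pos: string;
  lang_code: string;
  senses?: KaikkiSense[];
}

interface ExtractedSense { gloss: string; translations: { lang: string; word: string }[]; }
interface ExtractedEntry { headword: string; pos: string; senses: ExtractedSense[]; }

// Assumed values; the commit message does not list the supported sets.
const SUPPORTED_POS = new Set(["noun", "verb", "adj", "adv"]);
const SOURCE_LANGS = new Set(["en"]);
const TARGET_LANGS = new Set(["de", "fr", "es"]);

// Parses one JSONL line and returns the filtered entry, or null if nothing survives.
function extractEntry(line: string): ExtractedEntry | null {
  const entry = JSON.parse(line) as KaikkiEntry;
  if (!SUPPORTED_POS.has(entry.pos) || !SOURCE_LANGS.has(entry.lang_code)) return null;

  const senses: ExtractedSense[] = [];
  for (const sense of entry.senses ?? []) {
    // Assumption: abbreviations are marked with a sense-level tag.
    if ((sense.tags ?? []).some((tag) => tag === "abbreviation")) continue;

    const translations = (sense.translations ?? [])
      .filter((t): t is { code: string; word: string } =>
        t.code !== undefined && t.word !== undefined && TARGET_LANGS.has(t.code))
      .map((t) => ({ lang: t.code, word: t.word }));

    // Skip senses with no translations in a supported language.
    if (translations.length === 0) continue;
    senses.push({ gloss: (sense.glosses ?? [""])[0] ?? "", translations });
  }
  return senses.length > 0 ? { headword: entry.word, pos: entry.pos, senses } : null;
}

// Hands out the next free sense_index per headword+pos, so a second JSONL line
// for the same headword and POS continues the numbering instead of restarting at 0.
class SenseIndexTracker {
  private next = new Map<string, number>();

  // Reserves `count` consecutive indices and returns the first one.
  reserve(headword: string, pos: string, count: number): number {
    const key = JSON.stringify([headword, pos]);
    const offset = this.next.get(key) ?? 0;
    this.next.set(key, offset + count);
    return offset;
  }
}
```

During import, each extracted entry would call `reserve(entry.headword, entry.pos, entry.senses.length)` and add the returned offset to its local sense positions before inserting rows.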
parent 963bff4eb8
commit 209d52f54b
17 changed files with 346 additions and 1055737 deletions
1
.gitignore
vendored
@@ -12,6 +12,7 @@ __pycache__/
data-pipeline/archive/
data-pipeline/stage-1-extract/output/
data-pipeline/stage-1-extract/sources/
data-pipeline/stage-2-annotate/output/
data-pipeline/stage-3-enrich/output/
data-pipeline/stage-4-merge/output/