extraction, comparison and merging scripts for english are done, final english.json exists

This commit is contained in:
lila 2026-04-08 17:50:25 +02:00
parent 3596f76492
commit 59152950d6
14 changed files with 206319 additions and 0 deletions

176 scripts/README.md Normal file

@@ -0,0 +1,176 @@
# CEFR Data Pipeline
This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
## Overview
The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.
## Pipeline Stages
### Stage 1: Extraction
Each source file is processed by a dedicated extractor script. The extractor reads the source-specific format, normalizes the data, filters for supported parts of speech, and outputs a standardized JSON file.
**Input:** Raw source files (JSON, CSV, XLS)
**Output:** `{source}-extracted.json` files (same directory as source)
**Normalization rules:**
- Words are lowercased and trimmed
- Part of speech is mapped to supported values (noun, verb)
- Entries with unsupported POS are skipped
- CEFR levels are validated against A1-C2
- Each record includes the source identifier for traceability
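The rules above can be sketched as a single normalization function (a minimal illustration with a hypothetical `normalize` helper, not the actual extractor code):

```python
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

def normalize(raw_word, raw_pos, raw_cefr, source):
    """Apply the normalization rules; return a record dict, or None if filtered."""
    word = str(raw_word).lower().strip()
    pos = str(raw_pos).lower().strip()
    cefr = str(raw_cefr).upper().strip()
    # Unsupported POS, invalid CEFR, or empty word => entry is skipped
    if not word or pos not in SUPPORTED_POS or cefr not in CEFR_LEVELS:
        return None
    return {"word": word, "pos": pos, "cefr": cefr, "source": source}
```

Each extractor implements this logic inline against its own column layout; the output records share this shape.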
**Location:** `extraction-scripts/english/`
**Scripts:**
- `extract-cefrj-csv.py`
- `extract-en_m3.py`
- `extract-octanove.py`
- `extract-random-json.py`
### Stage 2: Comparison
Before merging, sources are compared to identify agreements and conflicts. This stage is read-only and serves as a quality gate.
**Input:** All `{source}-extracted.json` files for a language
**Output:** Console report showing:
- Entry counts per source and CEFR level
- Overlap between sources (words appearing in multiple sources)
- Agreement rate (sources assigning the same CEFR level)
- Conflicts (same word/POS with different CEFR levels)
- Database coverage (how many extracted words exist in the database)
**Location:** `comparison-scripts/compare-english.py`
**Usage:**
```bash
cd scripts/
python comparison-scripts/compare-english.py
```
Conflicts are resolved in the next stage using source priority rules.
### Stage 3: Merge
Multiple extracted sources are merged into a single authoritative JSON file per language. When the same word/POS appears in multiple sources with different CEFR levels, the conflict is resolved using a predefined priority order.
**Input:** All `{source}-extracted.json` files for a language
**Output:** `{language}-merged.json` in `datafiles/`
**Merge rules:**
- Single source: use that source's CEFR level
- Multiple sources agree: use the agreed CEFR level
- Multiple sources conflict: use the level from the highest-priority source
**Difficulty derivation:**
Difficulty is not extracted from sources. It is derived from the final CEFR level:
- A1, A2 → easy
- B1, B2 → intermediate
- C1, C2 → hard
The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
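The derivation is a direct lookup, mirroring the `DIFFICULTY_MAP` used in `merge-scripts/merge-english-json.py`:

```python
DIFFICULTY_MAP = {
    "A1": "easy", "A2": "easy",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "hard", "C2": "hard",
}

def derive_difficulty(cefr: str) -> str:
    """Difficulty is derived from the final CEFR level, never extracted from sources."""
    return DIFFICULTY_MAP.get(cefr, "unknown")
```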
**Location:** `merge-scripts/merge-english-json.py`
**Usage:**
```bash
cd scripts/
python merge-scripts/merge-english-json.py
```
### Stage 4: Enrichment
The authoritative merged file is consumed by the database package (`packages/db`) during the seeding or update process. This stage is implemented in TypeScript and is not part of the Python scripts in this directory.
## File Organization
```
scripts/
├── comparison-scripts/
│ └── compare-english.py # Stage 2: compare extracted data
├── datafiles/
│ ├── english-merged.json # Stage 3 output (authoritative dataset)
│ ├── omw-noun.json
│ └── omw-verb.json
├── data-sources/
│ ├── english/
│ │ ├── cefrj.csv
│ │ ├── cefrj-extracted.json
│ │ ├── en_m3.xls
│ │ ├── en_m3-extracted.json
│ │ ├── octanove.csv
│ │ ├── octanove-extracted.json
│ │ ├── random.json
│ │ └── random-extracted.json
│ ├── french/ # (future)
│ ├── german/ # (future)
│ ├── italian/ # (future)
│ └── spanish/ # (future)
├── extraction-scripts/
│ └── english/
│ ├── extract-cefrj-csv.py
│ ├── extract-en_m3.py
│ ├── extract-octanove.py
│ └── extract-random-json.py
├── merge-scripts/
│ └── merge-english-json.py # Stage 3: merge into authority
├── extract-own-save-to-json.py # script to extract words from wordnet
├── requirements.txt
└── README.md # This file
```
Extracted files are co-located with their sources for easy traceability. Merged files live in `datafiles/`.
## Source Priority by Language
Source priority determines which CEFR level wins when sources conflict:
**English:**
1. en_m3
2. cefrj
3. octanove
4. random
**Italian:**
1. it_m3
2. italian
Priority is defined in `merge-scripts/merge-english-json.py` (the `PRIORITY_ORDER` list). Higher-priority sources override lower-priority sources when conflicts occur.
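Resolving a conflict reduces to a max over the priority index, using the same scheme as the merge script (hypothetical `resolve_conflict` helper for illustration):

```python
# Lowest to highest priority, matching PRIORITY_ORDER in merge-english-json.py
PRIORITY_ORDER = ["random", "octanove", "cefrj", "en_m3"]

def resolve_conflict(assignments: dict) -> str:
    """assignments maps source name -> CEFR level; the highest-priority source wins."""
    def priority(src: str) -> int:
        # Unknown sources get the lowest possible priority
        return PRIORITY_ORDER.index(src) if src in PRIORITY_ORDER else -1
    winner = max(assignments, key=priority)
    return assignments[winner]
```

For example, when `cefrj` and `en_m3` disagree, `en_m3` wins because it sits later in the priority list.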
## Data Flow Summary
```
Raw Source → Extracted JSON → Merged JSON → Database
(1) (2) (3) (4)
```
1. **Extract:** Transform source formats to normalized records
2. **Compare:** Validate source quality and surface conflicts
3. **Merge:** Resolve conflicts, derive difficulty, create authority
4. **Enrich:** Write to database (handled in packages/db)
## Adding New Sources
To add a new source:
1. Place the raw file in the appropriate `data-sources/{language}/` directory
2. Create an extractor script in `extraction-scripts/{language}/`
3. Run the extractor to generate `{source}-extracted.json`
4. Run comparison to assess coverage and conflicts
5. Update source priority in the merge configuration if needed
6. Run merge to regenerate the authoritative file
7. Run enrichment to update the database
## Constants and Constraints
The pipeline respects these constraints from the Glossa shared constants:
- **Supported languages:** en, it
- **Supported parts of speech:** noun, verb
- **CEFR levels:** A1, A2, B1, B2, C1, C2
- **Difficulty levels:** easy, intermediate, hard
Entries violating these constraints are filtered out during extraction.
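The filter amounts to a membership check against these constants (a minimal sketch using constants copied from this README, not the shared `@glossa/shared` package itself):

```python
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

def is_valid(entry: dict) -> bool:
    """True if the entry satisfies the shared POS and CEFR constraints."""
    return entry.get("pos") in SUPPORTED_POS and entry.get("cefr") in CEFR_LEVELS

# Entries failing the check are dropped during extraction:
entries = [
    {"word": "apple", "pos": "noun", "cefr": "A1"},
    {"word": "quickly", "pos": "adverb", "cefr": "B1"},
]
kept = [e for e in entries if is_valid(e)]
```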


@@ -0,0 +1,166 @@
#!/usr/bin/env python3
"""
CEFR Data Pipeline - Stage 2: English Comparison
Compares extracted JSON files for English and reports agreements and conflicts.
"""
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple
# Supported CEFR levels
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
def load_extracted_files(data_dir: Path) -> Dict[str, List[dict]]:
"""Load all *-extracted.json files from the English data directory."""
sources = {}
for file_path in data_dir.glob("*-extracted.json"):
source_name = file_path.stem.replace("-extracted", "")
with open(file_path, "r", encoding="utf-8") as f:
data = json.load(f)
if isinstance(data, list):
sources[source_name] = data
else:
print(f"Warning: {file_path} does not contain a list, skipping.")
return sources
def normalize_entry(entry: dict) -> Tuple[str, str]:
"""Return (word, pos) key for comparison."""
return entry["word"].lower().strip(), entry["pos"].lower().strip()
def compute_statistics(sources: Dict[str, List[dict]]) -> dict:
"""Compute overlap, agreement, and conflict statistics."""
# Per-source counts by CEFR level
source_counts = {}
for src, entries in sources.items():
cefr_counts = defaultdict(int)
for e in entries:
cefr = e.get("cefr", "UNKNOWN")
cefr_counts[cefr] += 1
source_counts[src] = dict(cefr_counts)
# Build word->pos->sources and CEFR assignments
word_map = defaultdict(lambda: defaultdict(dict))
for src, entries in sources.items():
for e in entries:
key = normalize_entry(e)
word_map[key][src] = e["cefr"]
# Compute overlaps, agreements, conflicts
total_entries = sum(len(e) for e in sources.values())
unique_words = len(word_map)
overlap_stats = defaultdict(int)
agreement_count = 0
conflict_count = 0
conflict_details = []
for key, src_cefr_map in word_map.items():
num_sources = len(src_cefr_map)
overlap_stats[num_sources] += 1
if num_sources > 1:
cefr_values = set(src_cefr_map.values())
if len(cefr_values) == 1:
agreement_count += 1
else:
conflict_count += 1
conflict_details.append(
{"word": key[0], "pos": key[1], "assignments": dict(src_cefr_map)}
)
return {
"source_counts": source_counts,
"total_entries": total_entries,
"unique_words": unique_words,
"overlap_distribution": dict(overlap_stats),
"agreements": agreement_count,
"conflicts": conflict_count,
"conflict_details": conflict_details,
}
def print_report(stats: dict, sources: Dict[str, List[dict]]):
"""Print formatted comparison report."""
print(f"\n{'=' * 60}")
print("CEFR COMPARISON REPORT - ENGLISH")
print(f"{'=' * 60}")
# Source entry counts
print("\n📊 ENTRIES PER SOURCE AND CEFR LEVEL")
print("-" * 50)
for src, counts in stats["source_counts"].items():
total = sum(counts.values())
print(f"\n{src}: {total} total entries")
for level in sorted(CEFR_LEVELS):
cnt = counts.get(level, 0)
if cnt > 0:
print(f" {level}: {cnt}")
# Show non-standard levels
for level, cnt in counts.items():
if level not in CEFR_LEVELS and level != "UNKNOWN":
print(f" {level}: {cnt} (non-standard)")
# Overlap statistics
print("\n🔄 OVERLAP BETWEEN SOURCES")
print("-" * 50)
print(f"Total unique (word, POS) combinations: {stats['unique_words']}")
print(f"Total entries across all sources: {stats['total_entries']}")
overlap = stats["overlap_distribution"]
for n_sources in sorted(overlap.keys()):
count = overlap[n_sources]
pct = (count / stats["unique_words"]) * 100
print(f"Words appearing in {n_sources} source(s): {count} ({pct:.1f}%)")
# Agreement and conflicts
print("\n⚖️ AGREEMENT / CONFLICT SUMMARY")
print("-" * 50)
print(f"Words with >1 source: {stats['agreements'] + stats['conflicts']}")
print(f" ✅ Agreements (same CEFR): {stats['agreements']}")
print(f" ❌ Conflicts (different CEFR): {stats['conflicts']}")
if stats["conflicts"] > 0:
agreement_rate = (
stats["agreements"] / (stats["agreements"] + stats["conflicts"])
) * 100
print(f" Agreement rate: {agreement_rate:.1f}%")
print("\n📋 CONFLICT DETAILS (first 10 shown):")
for i, conflict in enumerate(stats["conflict_details"][:10]):
print(f" {i + 1}. {conflict['word']} ({conflict['pos']})")
for src, cefr in conflict["assignments"].items():
print(f" {src}: {cefr}")
if len(stats["conflict_details"]) > 10:
print(f" ... and {len(stats['conflict_details']) - 10} more conflicts.")
print(f"\n{'=' * 60}\n")
def main():
# Determine paths
script_dir = Path(__file__).parent
data_dir = script_dir.parent / "data-sources" / "english"
if not data_dir.exists():
print(f"Error: English data directory not found: {data_dir}")
return
print(f"Loading extracted files from {data_dir}...")
sources = load_extracted_files(data_dir)
if not sources:
print("No extracted files found.")
return
print(f"Found sources: {', '.join(sources.keys())}")
stats = compute_statistics(sources)
print_report(stats, sources)
if __name__ == "__main__":
main()

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""
scripts/extraction-scripts/english/extract-en_m3.py
Extracts CEFR data from en_m3.xls (M3 wordlist).
"""
import json
from pathlib import Path
import xlrd
# Constants matching @glossa/shared
SUPPORTED_POS = ["noun", "verb"]
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
# POS mapping (case-insensitive)
POS_MAP = {
"noun": "noun",
"verb": "verb",
}
# Paths (relative to project root)
INPUT_FILE = Path("scripts/data-sources/english/en_m3.xls")
OUTPUT_FILE = Path("scripts/data-sources/english/en_m3-extracted.json")
def extract() -> None:
print(f"Reading: {INPUT_FILE}")
records = []
skipped_pos = 0
skipped_invalid_cefr = 0
skipped_empty_word = 0
total_rows = 0
wb = xlrd.open_workbook(INPUT_FILE)
ws = wb.sheet_by_index(0)
# Skip header row, start from row 1
for row_idx in range(1, ws.nrows):
total_rows += 1
# Unpack columns: ID number, Word, Part of Speech, CEFR, Points
word_raw = ws.cell_value(row_idx, 1)
pos_raw = ws.cell_value(row_idx, 2)
cefr_raw = ws.cell_value(row_idx, 3)
# Normalize POS (case-insensitive)
pos = str(pos_raw).lower().strip() if pos_raw else ""
if pos not in POS_MAP:
skipped_pos += 1
continue
pos = POS_MAP[pos]
# Normalize CEFR - handle smart quotes
cefr_str = str(cefr_raw).strip() if cefr_raw else ""
# Strip Unicode smart quotes (U+201C and U+201D)
cefr_str = cefr_str.strip("\u201c\u201d")
cefr = cefr_str.upper()
if cefr not in CEFR_LEVELS:
skipped_invalid_cefr += 1
continue
# Normalize word
word = str(word_raw).lower().strip() if word_raw else ""
if not word:
skipped_empty_word += 1
continue
record = {"word": word, "pos": pos, "cefr": cefr, "source": "en_m3"}
records.append(record)
# Write output
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
json.dump(records, f, indent=2, ensure_ascii=False)
# Stats
noun_count = sum(1 for r in records if r["pos"] == "noun")
verb_count = sum(1 for r in records if r["pos"] == "verb")
cefr_distribution = {}
for level in CEFR_LEVELS:
count = sum(1 for r in records if r["cefr"] == level)
if count > 0:
cefr_distribution[level] = count
print(f"\nTotal rows in XLS: {total_rows}")
print(f"Extracted: {len(records)} records")
print(f" - Nouns: {noun_count}")
print(f" - Verbs: {verb_count}")
print("\nCEFR distribution:")
for level in CEFR_LEVELS:
if level in cefr_distribution:
print(f" - {level}: {cefr_distribution[level]}")
print("\nSkipped:")
print(f" - Unsupported POS: {skipped_pos}")
print(f" - Invalid CEFR: {skipped_invalid_cefr}")
print(f" - Empty word: {skipped_empty_word}")
print(f"\nOutput: {OUTPUT_FILE}")
if __name__ == "__main__":
extract()


@@ -0,0 +1,90 @@
#!/usr/bin/env python3
"""
scripts/extraction-scripts/english/extract-octanove.py
Extracts CEFR data from octanove.csv (Octanove vocabulary profile).
Filters for supported POS (noun, verb).
Input: scripts/data-sources/english/octanove.csv
Output: scripts/data-sources/english/octanove-extracted.json
Output format (normalized):
[
{ "word": "example", "pos": "noun", "cefr": "C1", "source": "octanove" }
]
"""
import csv
import json
from pathlib import Path
# Constants matching @glossa/shared
SUPPORTED_POS = ["noun", "verb"]
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
# Paths (relative to project root)
INPUT_FILE = Path("scripts/data-sources/english/octanove.csv")
OUTPUT_FILE = Path("scripts/data-sources/english/octanove-extracted.json")
def extract() -> None:
print(f"Reading: {INPUT_FILE}")
records = []
skipped_pos = 0
skipped_invalid_cefr = 0
skipped_empty_word = 0
total_rows = 0
with open(INPUT_FILE, "r", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
total_rows += 1
# Filter: must have supported POS
pos = row.get("pos", "").lower().strip()
if pos not in SUPPORTED_POS:
skipped_pos += 1
continue
# Filter: must have valid CEFR level
cefr = row.get("CEFR", "").upper().strip()
if cefr not in CEFR_LEVELS:
skipped_invalid_cefr += 1
continue
# Normalize word
word = row.get("headword", "").lower().strip()
if not word:
skipped_empty_word += 1
continue
record = {"word": word, "pos": pos, "cefr": cefr, "source": "octanove"}
records.append(record)
# Write output
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
json.dump(records, f, indent=2, ensure_ascii=False)
# Stats
noun_count = sum(1 for r in records if r["pos"] == "noun")
verb_count = sum(1 for r in records if r["pos"] == "verb")
cefr_distribution = {}
for level in CEFR_LEVELS:
count = sum(1 for r in records if r["cefr"] == level)
if count > 0:
cefr_distribution[level] = count
print(f"\nTotal rows in CSV: {total_rows}")
print(f"Extracted: {len(records)} records")
print(f" - Nouns: {noun_count}")
print(f" - Verbs: {verb_count}")
print("\nCEFR distribution:")
for level in CEFR_LEVELS:
if level in cefr_distribution:
print(f" - {level}: {cefr_distribution[level]}")
print("\nSkipped:")
print(f" - Unsupported POS: {skipped_pos}")
print(f" - Invalid CEFR: {skipped_invalid_cefr}")
print(f" - Empty word: {skipped_empty_word}")
print(f"\nOutput: {OUTPUT_FILE}")
if __name__ == "__main__":
extract()


@@ -0,0 +1,159 @@
#!/usr/bin/env python3
"""
CEFR Data Pipeline - Stage 3: English Merge
Merges extracted JSON files for English into an authoritative dataset.
"""
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple
# Supported CEFR levels and difficulty mapping
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
DIFFICULTY_MAP = {
"A1": "easy",
"A2": "easy",
"B1": "intermediate",
"B2": "intermediate",
"C1": "hard",
"C2": "hard",
}
# Source priority order (from lowest to highest priority)
# Higher index = higher authority when conflicts occur
PRIORITY_ORDER = ["random", "octanove", "cefrj", "en_m3"]
def load_extracted_files(data_dir: Path) -> Dict[str, List[dict]]:
"""Load all *-extracted.json files from the English data directory."""
sources = {}
for file_path in data_dir.glob("*-extracted.json"):
source_name = file_path.stem.replace("-extracted", "")
with open(file_path, "r", encoding="utf-8") as f:
data = json.load(f)
if isinstance(data, list):
sources[source_name] = data
else:
print(f"Warning: {file_path} does not contain a list, skipping.")
return sources
def normalize_entry(entry: dict) -> Tuple[str, str]:
"""Return (word, pos) key for merging."""
return entry["word"].lower().strip(), entry["pos"].lower().strip()
def get_source_priority(source_name: str) -> int:
"""Return priority index for a source (higher = more authoritative)."""
try:
return PRIORITY_ORDER.index(source_name)
except ValueError:
# If source not in list, assign lowest priority
return -1
def merge_entries(sources: Dict[str, List[dict]]) -> List[dict]:
"""Merge entries from multiple sources, resolving conflicts by priority."""
grouped = defaultdict(list)
for src_name, entries in sources.items():
for entry in entries:
key = normalize_entry(entry)
grouped[key].append((src_name, entry["cefr"], entry))
merged = []
conflicts_resolved = 0
total_multi_source = 0
for (word, pos), src_entries in grouped.items():
if len(src_entries) == 1:
src_name, cefr, original = src_entries[0]
final_cefr = cefr
contributing_sources = [src_name]
else:
total_multi_source += 1
sorted_entries = sorted(
src_entries, key=lambda x: get_source_priority(x[0]), reverse=True
)
highest_src, highest_cefr, _ = sorted_entries[0]
all_cefrs = {e[1] for e in src_entries}
if len(all_cefrs) > 1:
conflicts_resolved += 1
final_cefr = highest_cefr
contributing_sources = [e[0] for e in src_entries]
difficulty = DIFFICULTY_MAP.get(final_cefr, "unknown")
merged.append(
{
"word": word,
"pos": pos,
"cefr": final_cefr,
"difficulty": difficulty,
"sources": sorted(contributing_sources),
}
)
print("Merge statistics:")
print(f" Total unique entries: {len(merged)}")
print(f" Entries with multiple sources: {total_multi_source}")
print(f" Conflicts resolved by priority: {conflicts_resolved}")
return merged
def print_summary(merged: List[dict]):
"""Print distribution of CEFR levels and difficulty in final dataset."""
cefr_counts = defaultdict(int)
diff_counts = defaultdict(int)
for entry in merged:
cefr_counts[entry["cefr"]] += 1
diff_counts[entry["difficulty"]] += 1
print("\n📊 Final CEFR distribution:")
for level in sorted(CEFR_LEVELS):
count = cefr_counts.get(level, 0)
if count:
print(f" {level}: {count}")
print("\n📊 Final difficulty distribution:")
for diff in ["easy", "intermediate", "hard"]:
count = diff_counts.get(diff, 0)
print(f" {diff}: {count}")
def main():
script_dir = Path(__file__).parent
data_dir = script_dir.parent / "data-sources" / "english"
output_dir = script_dir.parent / "datafiles"
output_file = output_dir / "english-merged.json"
if not data_dir.exists():
print(f"Error: English data directory not found: {data_dir}")
return
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Loading extracted files from {data_dir}...")
sources = load_extracted_files(data_dir)
if not sources:
print("No extracted files found.")
return
print(f"Found sources: {', '.join(sources.keys())}")
print(f"Priority order (lowest to highest): {PRIORITY_ORDER}")
merged = merge_entries(sources)
with open(output_file, "w", encoding="utf-8") as f:
json.dump(merged, f, indent=2, ensure_ascii=False)
print(f"\n✅ Merged dataset written to: {output_file}")
print_summary(merged)
if __name__ == "__main__":
main()


@@ -1 +1,2 @@
wn==1.1.0
openpyxl==3.1.5