{ "cells": [ { "cell_type": "markdown", "id": "b1a2c3d4", "metadata": {}, "source": [ "# Qwen2.5-7B Pretraining Verification\n", "\n", "This notebook verifies the pretraining capability of the **Ascend 910B CANN image** by running Qwen2.5-7B pretraining with MindSpeed-LLM.\n", "\n", "**Workflow:**\n", "1. Check the environment\n", "2. Prepare the pretraining dataset\n", "3. Clone the MindSpeed-LLM scripts\n", "4. Convert HF weights to Megatron weights\n", "5. Preprocess the data\n", "6. Start pretraining\n", "\n", "> Training parameters are set for verification mode (few iterations + short sequences). Increase `TRAIN_ITERS` and `SEQ_LENGTH` for production runs." ] }, { "cell_type": "markdown", "id": "c2d3e4f5", "metadata": {}, "source": [ "## 0. Parameter Configuration" ] }, { "cell_type": "code", "execution_count": null, "id": "d3e4f5a6", "metadata": {}, "outputs": [], "source": "import warnings\nwarnings.filterwarnings('ignore', category=DeprecationWarning)\nwarnings.filterwarnings('ignore', category=ImportWarning)\nwarnings.filterwarnings('ignore', category=UserWarning)\n\nfrom pathlib import Path\n\n# ===== Path configuration =====\nHF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen2.5-7B')\nWORK_DIR = Path('/opt/app-root/src/Qwen2.5-7B-work-dir')\nMINDSPEED_LLM_DIR = WORK_DIR / 'MindSpeed-LLM'\nDATA_DIR = WORK_DIR / 'pretrain_dataset'\nOUTPUT_DIR = WORK_DIR / 'output' / 'qwen25_7b_pretrained'\nLOGS_DIR = WORK_DIR / 'logs'\n\n# ===== Optional: real dataset path =====\nALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n\n# ===== Ascend environment scripts =====\nCANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\nATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n\n# ===== Parallel configuration (must match weight conversion) =====\nTP = 1 # Tensor parallelism\nPP = 4 # Pipeline parallelism; requires at least TP*PP=4 NPUs\n# Note: With TP=1, each PP stage has about 1.6-2.2B parameters. 
The AdamW optimizer states\n# (exp_avg + exp_avg_sq, fp32) alone take about 17.6 GiB, so weights, gradients, and optimizer states together exceed the 29 GiB of memory available on a 910B.\n# Enable --use-distributed-optimizer (ZeRO-1) during training to shard the optimizer states across the data-parallel (DP) ranks and reduce per-NPU memory usage.\n\n# ===== Weight conversion output (path includes parallel config to avoid reusing old weights after TP/PP changes) =====\nMCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen25_mcore_tp{TP}_pp{PP}'\n\n# ===== Training hyperparameters (verification mode) =====\nSEQ_LENGTH = 512 # Recommended production value: 4096\nTRAIN_ITERS = 50 # Recommended production value: 2000+\nMBS = 1\nLR = 1.25e-6\nMIN_LR = 1.25e-7\n\n# ===== Data preprocessing =====\nPROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\nDATA_PATH = str(DATA_DIR / 'alpaca_text_document') # preprocess_data.py appends _text_document to the output prefix\n\nprint('Configuration loaded')\nprint(f' Model: {HF_MODEL_DIR}')\nprint(f' Dataset: {ALPACA_PARQUET}' if ALPACA_PARQUET.exists() else ' Dataset: not found; sample data will be used')\nprint(f' TP={TP}, PP={PP}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')" }, { "cell_type": "markdown", "id": "e4f5a6b7", "metadata": {}, "source": [ "## Helper Functions" ] }, { "cell_type": "code", "execution_count": null, "id": "f5a6b7c8", "metadata": {}, "outputs": [], "source": [ "import os\n", "import subprocess\n", "\n", "_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning'\n", "\n", "def run_cmd(cmd, cwd=None, check=True):\n", " 'Run a bash command in the Ascend environment and stream output in real time'\n", " env_prefix = f'source {CANN_ENV} && source {ATB_ENV}'\n", " full_cmd = f'{env_prefix} && {cmd}'\n", " print(f'$ {cmd}\\n')\n", " run_env = os.environ.copy()\n", " run_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\n", " result = subprocess.run(\n", " ['bash', '-lc', full_cmd],\n", " cwd=str(cwd or WORK_DIR),\n", " text=True,\n", " env=run_env,\n", " )\n", " if check and result.returncode != 0:\n", " raise RuntimeError(f'Command failed with return code: {result.returncode}')\n", " return result\n", "\n", "print('Helper function defined: run_cmd()')" ] }, { "cell_type": "markdown", "id": "a6b7c8d9", "metadata": {}, "source": [ "## 1. 
Environment Check" ] }, { "cell_type": "code", "execution_count": null, "id": "b7c8d9e0", "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "with warnings.catch_warnings():\n", " warnings.simplefilter('ignore', DeprecationWarning)\n", " warnings.simplefilter('ignore', ImportWarning)\n", " warnings.simplefilter('ignore', UserWarning)\n", " import torch\n", " import torch_npu\n", "\n", "print('=' * 60)\n", "print('Environment check')\n", "print('=' * 60)\n", "\n", "# PyTorch & NPU\n", "print(f'PyTorch: {torch.__version__}')\n", "print(f'torch_npu: {torch_npu.__version__}')\n", "nproc = torch.npu.device_count()\n", "print(f'NPU count: {nproc}')\n", "for i in range(nproc):\n", " print(f' NPU {i}: {torch.npu.get_device_name(i)}')\n", "\n", "# MindSpeed\n", "with warnings.catch_warnings():\n", " warnings.simplefilter('ignore', DeprecationWarning)\n", " warnings.simplefilter('ignore', ImportWarning)\n", " warnings.simplefilter('ignore', UserWarning)\n", " import mindspeed\n", " import mindspeed_llm\n", "\n", "print('MindSpeed: installed')\n", "print('MindSpeed-LLM: installed')\n", "\n", "# Model files\n", "print(f'\\nModel directory: {HF_MODEL_DIR}')\n", "assert HF_MODEL_DIR.exists(), f'Model directory does not exist: {HF_MODEL_DIR}'\n", "model_files = sorted(HF_MODEL_DIR.glob('*'))\n", "for f in model_files[:5]:\n", " if f.is_file():\n", " print(f' {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\n", "if len(model_files) > 5:\n", " print(f' ... {len(model_files)} files in total')\n", "\n", "# Parallel configuration check\n", "assert nproc >= TP * PP, f'NPU count({nproc}) < TP*PP({TP*PP}); reduce PP'\n", "DP = nproc // (TP * PP)\n", "GBS = DP * MBS\n", "print(f'\\nParallel configuration: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\n", "\n", "assert torch.npu.is_available(), 'NPU is not available'\n", "print('\\nEnvironment check passed!')" ] }, { "cell_type": "markdown", "id": "c8d9e0f1", "metadata": {}, "source": [ "## 2. Prepare the Pretraining Dataset\n", "\n", "Pretraining uses raw text data. MindSpeed-LLM's `preprocess_data.py` supports `.parquet`, `.json`, `.jsonl`, `.txt`, and other formats.\n", "\n", "This example uses the Alpaca dataset in parquet format, which contains a `text` field. If you use another dataset, make sure it contains a `text` field or uses plain text format." ] }, { "cell_type": "code", "execution_count": null, "id": "d9e0f1a2", "metadata": {}, "outputs": [], "source": [ "import json\n", "import warnings\n", "\n", "DATA_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "if ALPACA_PARQUET.exists():\n", " print(f'Dataset is ready: {ALPACA_PARQUET.name}')\n", " with warnings.catch_warnings():\n", " warnings.simplefilter('ignore', DeprecationWarning)\n", " import pandas as pd\n", " df = pd.read_parquet(ALPACA_PARQUET)\n", " print(f'{len(df)} samples, columns: {list(df.columns)}')\n", " print('\\nSample examples:')\n", " for i, row in df.head(3).iterrows():\n", " text = str(row.get('text', ''))[:100]\n", " print(f' [{i}] {text}...')\n", " DATA_INPUT = str(ALPACA_PARQUET)\n", "else:\n", " print('Alpaca dataset not found. 
Creating sample text data\\n')\n", " sample_texts = [\n", " {'text': 'Natural language processing is an important branch of artificial intelligence that studies how computers understand and generate human language.'},\n", " {'text': 'Deep learning uses multilayer neural networks to learn hierarchical data representations and is widely used in computer vision and natural language processing.'},\n", " {'text': 'Python is a high-level programming language known for its concise, readable syntax and rich ecosystem.'},\n", " {'text': 'Machine learning is a core artificial intelligence technology that enables computers to learn and improve automatically from data.'},\n", " {'text': 'Ascend 910B is an artificial intelligence processor from Huawei designed for deep learning training and inference workloads.'},\n", " {'text': 'Pretraining is the first stage of training large language models and learns statistical patterns from massive text corpora.'},\n", " {'text': 'The Transformer architecture is a foundation of modern natural language processing and uses self-attention for parallel sequence modeling.'},\n", " {'text': 'Distributed training spreads large model training workloads across multiple compute devices.'},\n", " {'text': 'Tensor parallelism and pipeline parallelism are two common parallelism strategies for large model training.'},\n", " {'text': 'Gradient accumulation simulates large-batch training when memory is limited.'},\n", " ]\n", " SAMPLE_FILE = DATA_DIR / 'sample_pretrain.jsonl'\n", " with open(SAMPLE_FILE, 'w', encoding='utf-8') as f:\n", " for item in sample_texts:\n", " f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", " DATA_INPUT = str(SAMPLE_FILE)\n", " print(f'Sample data created: {SAMPLE_FILE}')\n", " print(f'{len(sample_texts)} samples')\n", "\n", "print(f'\\nData input path: {DATA_INPUT}')" ] }, { "cell_type": "markdown", "id": "e0f1a2b3", "metadata": {}, "source": [ "## 3. Clone MindSpeed-LLM\n", "\n", "The `mindspeed_llm` Python package is installed during image build, but the training scripts (`convert_ckpt.py`, `preprocess_data.py`, `pretrain_gpt.py`, and others) must run from the repository directory." ] }, { "cell_type": "code", "execution_count": null, "id": "f1a2b3c4", "metadata": {}, "outputs": [], "source": [ "if MINDSPEED_LLM_DIR.exists():\n", " print(f'Already exists: {MINDSPEED_LLM_DIR}')\n", "else:\n", " print('Cloning MindSpeed-LLM (shallow clone)...')\n", " run_cmd(f'git clone --depth 1 https://gitcode.com/ascend/MindSpeed-LLM.git {MINDSPEED_LLM_DIR}')\n", "\n", "# Verify required scripts\n", "scripts = [\n", " ('Weight conversion', 'convert_ckpt.py'),\n", " ('Data preprocessing', 'preprocess_data.py'),\n", " ('Pretraining', 'pretrain_gpt.py'),\n", "]\n", "for name, script in scripts:\n", " exists = (MINDSPEED_LLM_DIR / script).exists()\n", " print(f' [{name}] {script}: {\"OK\" if exists else \"MISSING\"}')\n", "\n", "assert all((MINDSPEED_LLM_DIR / s).exists() for _, s in scripts), 'Required scripts are missing'\n", "print('\\nScript check passed!')" ] }, { "cell_type": "markdown", "id": "a2b3c4d5", "metadata": {}, "source": [ "## 4. Convert HF Weights to Megatron Weights\n", "\n", "Convert HuggingFace-format weights to Megatron-Mcore format, split by TP/PP. The first conversion usually takes 5-10 minutes.\n", "\n", "Qwen2.5 is based on the LLaMA2 architecture with QKV bias, so the conversion uses `--model-type-hf llama2` plus `--add-qkv-bias`." 
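, "\n", "\n", "Before running the conversion, the next cell optionally cross-checks the architecture values hard-coded later in this notebook (28 layers, hidden size 3584, FFN size 18944, 28 attention heads, 4 KV heads) against the checkpoint's own `config.json`. It is a minimal sanity-check sketch that assumes the standard Hugging Face Qwen2-style field names such as `num_hidden_layers` and `hidden_size`; skip it if your checkpoint ships a different config layout.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c0d1e2f3", "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (assumes standard HF Qwen2-style config.json field names).\n", "# Catches a mismatched checkpoint before running the slow weight conversion below.\n", "import json\n", "\n", "_cfg_path = HF_MODEL_DIR / 'config.json'\n", "if _cfg_path.exists():\n", "    cfg = json.loads(_cfg_path.read_text())\n", "    expected = {\n", "        'num_hidden_layers': 28,\n", "        'hidden_size': 3584,\n", "        'intermediate_size': 18944,\n", "        'num_attention_heads': 28,\n", "        'num_key_value_heads': 4,\n", "    }\n", "    for key, want in expected.items():\n", "        got = cfg.get(key)\n", "        status = 'OK' if got == want else 'MISMATCH'\n", "        print(f'  [{status}] {key}: config={got}, notebook={want}')\n", "else:\n", "    print(f'config.json not found under {HF_MODEL_DIR}; skipping the check')"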
] }, { "cell_type": "code", "execution_count": null, "id": "b3c4d5e6", "metadata": {}, "outputs": [], "source": [ "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "# Check whether conversion already exists\n", "converted = any(MCORE_WEIGHTS_DIR.glob('iter_*'))\n", "\n", "if converted:\n", " print(f'Weights already exist; skipping conversion: {MCORE_WEIGHTS_DIR}')\n", " for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n", " print(f' {p.name}')\n", "else:\n", " convert_cmd = ' && '.join([\n", " f'cd {MINDSPEED_LLM_DIR}',\n", " 'python convert_ckpt.py'\n", " ' --use-mcore-models'\n", " ' --model-type GPT'\n", " ' --load-model-type hf'\n", " ' --save-model-type mg'\n", " f' --target-tensor-parallel-size {TP}'\n", " f' --target-pipeline-parallel-size {PP}'\n", " ' --add-qkv-bias'\n", " f' --load-dir {HF_MODEL_DIR}'\n", " f' --save-dir {MCORE_WEIGHTS_DIR}'\n", " f' --tokenizer-model {HF_MODEL_DIR / \"tokenizer.json\"}'\n", " ' --model-type-hf llama2'\n", " ' --params-dtype bf16',\n", " ])\n", " print('Running weight conversion (about 5-10 minutes)...')\n", " run_cmd(convert_cmd, cwd=MINDSPEED_LLM_DIR)\n", " print('Weight conversion completed!')\n", " for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n", " print(f' {p.name}')" ] }, { "cell_type": "markdown", "id": "c4d5e6f7", "metadata": {}, "source": [ "## 5. Data Preprocessing\n", "\n", "Convert text data into the binary format (`.bin` + `.idx`) required by MindSpeed-LLM pretraining.\n", "\n", "No handler needs to be specified for pretraining data processing. `preprocess_data.py` automatically extracts the `text` field and generates `alpaca_text_document.bin` and `alpaca_text_document.idx`." ] }, { "cell_type": "code", "execution_count": null, "id": "d5e6f7a8", "metadata": {}, "outputs": [], "source": [ "preprocess_cmd = ' && '.join([\n", " f'cd {MINDSPEED_LLM_DIR}',\n", " 'python preprocess_data.py'\n", " f' --input {DATA_INPUT}'\n", " f' --tokenizer-name-or-path {HF_MODEL_DIR}'\n", " f' --output-prefix {PROCESSED_DATA_PREFIX}'\n", " ' --tokenizer-type PretrainedFromHF'\n", " ' --workers 4'\n", " ' --log-interval 1000',\n", "])\n", "\n", "print('Running data preprocessing...')\n", "run_cmd(preprocess_cmd, cwd=MINDSPEED_LLM_DIR)\n", "\n", "# Verify outputs\n", "print('\\nPreprocessing outputs:')\n", "for f in sorted(DATA_DIR.glob('alpaca*')):\n", " print(f' {f.name} ({f.stat().st_size / 1024:.1f} KB)')\n", "\n", "assert (DATA_DIR / 'alpaca_text_document.bin').exists() or (DATA_DIR / 'alpaca_text_document.idx').exists(), \\\n", " f'Preprocessing outputs not found: {DATA_DIR / \"alpaca_text_document.*\"}'\n", "print('Data preprocessing completed!')" ] }, { "cell_type": "markdown", "id": "e6f7a8b9", "metadata": {}, "source": [ "## 6. Start Pretraining\n", "\n", "Run Qwen2.5-7B pretraining with MindSpeed-LLM. Training logs are streamed to the notebook in real time.\n", "\n", "> In verification mode, `TRAIN_ITERS=50`. 
For full pretraining, 2000+ iterations are recommended.\n", "\n", "**Qwen2.5-7B architecture parameters:**\n", "- 28 Transformer layers, hidden size 3584, FFN size 18944\n", "- 28 attention heads, 4 KV heads (GQA)\n", "- RoPE positional encoding, SwiGLU activation, RMSNorm, QKV bias" ] }, { "cell_type": "code", "execution_count": null, "id": "f7a8b9c0", "metadata": {}, "outputs": [], "source": "import torch\n\nnproc = torch.npu.device_count()\nDP = nproc // (TP * PP)\nGBS = DP * MBS\n\nLOGS_DIR.mkdir(parents=True, exist_ok=True)\nOUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n\n# Environment variables\nenv = ' && '.join([\n f'cd {MINDSPEED_LLM_DIR}',\n 'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n 'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n])\n\n# torchrun distributed arguments\ndistributed = ' '.join([\n 'torchrun',\n f'--nproc_per_node {nproc}',\n '--nnodes 1 --node_rank 0',\n '--master_addr localhost --master_port 6000',\n])\n\n# Model architecture (Qwen2.5-7B)\nmodel_args = ' '.join([\n '--use-mcore-models',\n f'--tensor-model-parallel-size {TP}',\n f'--pipeline-model-parallel-size {PP}',\n '--sequence-parallel --use-flash-attn',\n '--transformer-impl local',\n '--use-distributed-optimizer',\n '--num-layers 28 --hidden-size 3584 --num-attention-heads 28',\n '--ffn-hidden-size 18944 --max-position-embeddings 131072',\n f'--seq-length {SEQ_LENGTH}',\n '--make-vocab-size-divisible-by 1 --padded-vocab-size 152064',\n '--position-embedding-type rope --rotary-base 1000000 --use-rotary-position-embeddings',\n '--group-query-attention --num-query-groups 4',\n '--add-qkv-bias --disable-bias-linear',\n '--untie-embeddings-and-output-weights',\n '--swiglu --normalization RMSNorm --norm-epsilon 1e-6',\n])\n\n# Training hyperparameters\ntrain_args = ' '.join([\n f'--micro-batch-size {MBS} --global-batch-size {GBS}',\n f'--train-iters {TRAIN_ITERS}',\n '--lr-decay-style cosine --lr-warmup-fraction 0.01',\n '--init-method-std 0.01',\n f'--lr {LR} --min-lr {MIN_LR}',\n '--weight-decay 1e-1 --clip-grad 1.0',\n '--adam-beta1 0.9 --adam-beta2 0.95 --initial-loss-scale 4096',\n '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n '--no-masked-softmax-fusion --no-load-optim --no-load-rng',\n '--seed 42 --bf16',\n])\n\n# Activation recomputation: recompute forward activations during backward pass to trade compute for memory.\n# With PP=4, each stage has 7 layers, so recomputing all layers maximizes memory savings.\nrecompute_args = ' '.join([\n '--recompute-granularity full',\n '--recompute-method block',\n '--recompute-num-layers 7',\n])\n\n# Data and outputs\ndata_args = ' '.join([\n f'--data-path {DATA_PATH}',\n '--split 100,0,0',\n '--tokenizer-type PretrainedFromHF',\n f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n '--log-interval 1',\n f'--save-interval {TRAIN_ITERS}',\n f'--eval-interval {TRAIN_ITERS} --eval-iters 0',\n])\n\n# Load and save\noutput_args = ' '.join([\n f'--load {MCORE_WEIGHTS_DIR} --save {OUTPUT_DIR}',\n '--distributed-backend nccl',\n '--exit-on-missing-checkpoint',\n '--no-save-optim --no-save-rng',\n])\n\ncmd = f'{env} && {distributed} pretrain_gpt.py {model_args} {train_args} {recompute_args} {data_args} {output_args}'\n\nprint(f'Training configuration: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}')\nprint(f'GBS={GBS}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\nprint(f'Activation recomputation: full (7 layers per PP stage)')\nprint(f'\\nStarting pretraining...\\n')\nrun_cmd(cmd, cwd=MINDSPEED_LLM_DIR)\nprint(f'\\nPretraining completed! 
Weights saved to: {OUTPUT_DIR}')" }, { "cell_type": "markdown", "id": "a8b9c0d1", "metadata": {}, "source": [ "## Use a Real Dataset\n", "\n", "After verification succeeds, use a real dataset for full pretraining as follows:\n", "\n", "1. **Prepare the data**: place the text dataset inside the container\n", " - Supported formats: `.parquet`, `.json`, `.jsonl`, `.txt`\n", " - The data must contain a `text` field (parquet/json/jsonl) or one text segment per line (txt)\n", "\n", "2. **Adjust parameters**:\n", " - `SEQ_LENGTH = 4096` for production-length sequences (the verification run uses 512)\n", " - Increase `TRAIN_ITERS` to 2000+ depending on dataset size\n", " - Set `GBS` based on the NPU count and dataset size; a value larger than `DP * MBS` enables gradient accumulation\n", "\n", "3. **Save interval**: modify `--save-interval` in the training cell for periodic checkpoints\n", "\n", "4. **Weight conversion**: if TP or PP changes, rerun the weight conversion" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }