VM@VM$ deepspeed --num_gpus 1 run_zero_inference_gpu_first.py --kv-offload --disk-offload --offload-dir /home/sadmin/nvmePath
[2025-10-06 12:44:08,177] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-06 12:44:13,678] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
[2025-10-06 12:44:16,814] [WARNING] [runner.py:220:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-10-06 12:44:16,815] [INFO] [runner.py:610:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_zero_inference_gpu_first.py --kv-offload --disk-offload --offload-dir /home/sadmin/nvmePath
[2025-10-06 12:44:18,513] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-06 12:44:20,881] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
[2025-10-06 12:44:21,531] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0]}
[2025-10-06 12:44:21,531] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=1, node_rank=0
[2025-10-06 12:44:21,531] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2025-10-06 12:44:21,531] [INFO] [launch.py:164:main] dist_world_size=1
[2025-10-06 12:44:21,531] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0
[2025-10-06 12:44:21,539] [INFO] [launch.py:256:main] process 2508302 spawned with command: ['/usr/bin/python3', '-u', 'run_zero_inference_gpu_first.py', '--local_rank=0', '--kv-offload', '--disk-offload', '--offload-dir', '/home/sadmin/nvmePath']
[2025-10-06 12:44:23,691] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-10-06 12:44:26,937] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
CUDA_VISIBLE_DEVICES= 0
Loading tokenizer...
Loading model onto CPU (device_map={'': 'cpu'}) ...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 50.08it/s]
Model loaded on CPU.
Initializing a Gloo process group (CPU-capable) for safe broadcast...
Gloo process group initialized.
[2025-10-06 12:44:30,105] [INFO] [logging.py:107:log_dist] [Rank -1] DeepSpeed info: version=0.17.4, git-hash=unknown, git-branch=unknown
[2025-10-06 12:44:30,105] [INFO] [comm.py:821:init_distributed] cdb=None
[2025-10-06 12:44:30,108] [INFO] [config.py:684:__init__] Config mesh_device None world_size = 1
[2025-10-06 12:44:30,146] [INFO] [engine.py:1339:_configure_distributed_model] ********** distributed groups summary **********
         self.dp_world_size=1
         self.mp_world_size=1
         self.seq_dp_world_size=1
         self.sequence_parallel_size=1
***********************************************
[2025-10-06 12:44:30,435] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-10-06 12:44:30,440] [INFO] [logging.py:107:log_dist] [Rank 0] Creating ZeRO Offload
[2025-10-06 12:44:30,602] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-10-06 12:44:30,602] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2025-10-06 12:44:30,602] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.23 GB, percent = 2.9%
[2025-10-06 12:44:30,609] [INFO] [config.py:684:__init__] Config mesh_device None world_size = 1
Parameter Offload - Persistent parameters statistics: param_count = 161, numel = 1318912
[2025-10-06 12:50:36,130] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-10-06 12:50:36,138] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 1.96 GB         CA 1.96 GB         Max_CA 2 GB
[2025-10-06 12:50:36,140] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 158.23 GB, percent = 73.2%
[2025-10-06 12:50:36,143] [INFO] [config.py:954:print] DeepSpeedEngine configuration:
[2025-10-06 12:50:36,143] [INFO] [config.py:958:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2025-10-06 12:50:36,143] [INFO] [config.py:958:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-10-06 12:50:36,143] [INFO] [config.py:958:print]   amp_enabled .................. False
[2025-10-06 12:50:36,143] [INFO] [config.py:958:print]   amp_params ................... False
[2025-10-06 12:50:36,144] [INFO] [config.py:958:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   bfloat16_config .............. enabled=True immediate_grad_update=False check_grad_overflow=False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   checkpoint_config ............ {'tag_validation': 'WARN', 'checkpoint_serialization': True, 'writer': None}
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   checkpoint_parallel_write_pipeline  False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   checkpoint_tag_validation_enabled  True
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   checkpoint_tag_validation_fail  False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x769df6293ce0>
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   communication_data_type ...... None
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   compile_config ............... deepcompile=False free_activation=False offload_activation=False offload_opt_states=False double_buffer=True symmetric_memory=False debug_log=False offload_parameters=False sync_before_reduce=False sync_after_reduce=False sync_before_allgather=False sync_after_allgather=False keep_int_input_tensors=True keep_all_input_tensors=False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   curriculum_enabled_legacy .... False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   curriculum_params_legacy ..... False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   data_efficiency_enabled ...... False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   dataloader_drop_last ......... False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   disable_allgather ............ False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   dump_state ................... False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   eigenvalue_enabled ........... False
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   eigenvalue_gas_boundary_resolution  1
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-10-06 12:50:36,145] [INFO] [config.py:958:print]   eigenvalue_layer_num ......... 0
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   eigenvalue_max_iter .......... 100
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   eigenvalue_stability ......... 1e-06
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   eigenvalue_tol ............... 0.01
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   eigenvalue_verbose ........... False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   elasticity_enabled ........... False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   float16_config ............... enabled=False auto_cast=False loss_scale=0.0 initial_scale_power=16 loss_scale_window=1000 hysteresis=2 consecutive_hysteresis=False min_loss_scale=1 fp16_master_weights_and_grads=False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   global_rank .................. 0
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   grad_accum_dtype ............. None
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   gradient_accumulation_steps .. 1
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   gradient_clipping ............ 0.0
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   gradient_predivide_factor .... 1.0
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   graph_harvesting ............. False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   load_universal_checkpoint .... False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   memory_breakdown ............. False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   mics_hierarchial_params_gather  False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   mics_shard_size .............. -1
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   optimizer_legacy_fusion ...... False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   optimizer_name ............... None
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   optimizer_params ............. None
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   pld_enabled .................. False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   pld_params ................... False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   prescale_gradients ........... False
[2025-10-06 12:50:36,146] [INFO] [config.py:958:print]   scheduler_name ............... None
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   scheduler_params ............. None
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   seq_parallel_communication_data_type  torch.float32
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   sparse_attention ............. None
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   sparse_gradients_enabled ..... False
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   steps_per_print .............. None
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tp_overlap_comm=False tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   timers_config ................ enabled=True synchronized=True
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   torch_autocast_dtype ......... None
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   torch_autocast_enabled ....... False
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   torch_autocast_lower_precision_safe_modules  None
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   train_batch_size ............. 1
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   train_micro_batch_size_per_gpu  1
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   use_data_before_expert_parallel_  False
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   use_node_local_storage ....... False
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   wall_clock_breakdown ......... True
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   weight_quantization_config ... None
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   world_size ................... 1
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   zero_allow_untested_optimizer  False
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=1000000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   zero_enabled ................. True
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   zero_force_ds_cpu_optimizer .. True
[2025-10-06 12:50:36,147] [INFO] [config.py:958:print]   zero_optimization_stage ...... 3
[2025-10-06 12:50:36,147] [INFO] [config.py:944:print_user_config]   json = {
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true,
            "buffer_count": 5
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5.000000e+08,
        "stage3_prefetch_bucket_size": 1.000000e+09,
        "stage3_param_persistence_threshold": 1.000000e+05,
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "aio": {
        "block_size": 1.048576e+06,
        "queue_depth": 8,
        "single_submit": false,
        "overlap_events": true
    },
    "wall_clock_breakdown": true
}
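For reference, the user config printed above corresponds to a plain Python dict that would be passed to `deepspeed.initialize(...)`. The sketch below is reconstructed from the log, not taken from `run_zero_inference_gpu_first.py` (which is not shown), so the variable name is an assumption; only the values come from the printout.

```python
# ZeRO-Inference config reconstructed from the user-config JSON in the log.
# Variable name is hypothetical; values match the printed config exactly.
ds_config = {
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True, "buffer_count": 5},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": int(5e8),
        "stage3_prefetch_bucket_size": int(1e9),
        "stage3_param_persistence_threshold": int(1e5),
        "stage3_max_live_parameters": int(1e9),
        "stage3_max_reuse_distance": int(1e9),
    },
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
    "aio": {"block_size": 1048576, "queue_depth": 8,
            "single_submit": False, "overlap_events": True},
    "wall_clock_breakdown": True,
}

# Note: despite the --disk-offload / --offload-dir flags on the command line,
# the engine's zero_config above prints offload_param device='cpu',
# nvme_path=None -- i.e. parameters went to CPU RAM, not NVMe. For true NVMe
# offload, offload_param would need device="nvme" plus an nvme_path.
```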
Engine initialized. device: cuda:0
Before warmup: GPU allocated (GB): 0.000
Before warmup: GPU reserved  (GB): 2.103
/home/sadmin/.local/lib/python3.12/site-packages/transformers/generation/configuration_utils.py:631: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/home/sadmin/.local/lib/python3.12/site-packages/transformers/generation/configuration_utils.py:636: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
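The two UserWarnings above mean the model's default `generation_config` carries `temperature=0.6` and `top_p=0.9` while the call runs greedy decoding (`do_sample=False`), so those knobs are silently unused. A minimal way to make the call consistent is to pass matching kwargs to `generate()`; the dicts below are a sketch of the two consistent choices, not the script's actual arguments.

```python
# Option A: stay greedy -- unset the sampling-only knobs so transformers
# stops warning that temperature/top_p have no effect.
greedy_kwargs = {"max_new_tokens": 256, "do_sample": False,
                 "temperature": None, "top_p": None}

# Option B: actually sample, so temperature/top_p take effect.
sampling_kwargs = {"max_new_tokens": 256, "do_sample": True,
                   "temperature": 0.6, "top_p": 0.9}
```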
After warmup: GPU allocated (GB): 0.011
After warmup: GPU reserved  (GB): 7.273
Generating...
--- Decoded ---
system
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
You are a helpful assistant.user
Tell mahabharatha in 100 wordsassistant

The Mahabharata is an ancient Indian epic. The story revolves around the Pandavas (five brothers) and their struggle for the throne of Hastinapura against their cousins, the Kauravas. The Pandavas, aided by Lord Krishna, ultimately win the war after 18 days of battle. The epic explores themes of duty, morality, and the nature of reality. Key characters include Arjuna, Bhima, Yudhishthira, and Draupadi. The Mahabharata is a rich and complex tale of heroism, sacrifice, and the triumph of good over evil, with the Bhagavad Gita being a central part of the narrative.

--- Timing ---
Generation time: 831.68 sec for max_new_tokens=256
After generate: GPU allocated (GB): 0.011
After generate: GPU reserved  (GB): 7.323
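The timing above works out to roughly 0.3 tokens/s, which is in line with ZeRO-Inference streaming every layer's weights from CPU memory for each decoded token. The arithmetic:

```python
# Decode throughput implied by the log: 256 new tokens in 831.68 s.
max_new_tokens = 256
generation_time_s = 831.68

tokens_per_second = max_new_tokens / generation_time_s
seconds_per_token = generation_time_s / max_new_tokens

print(f"{tokens_per_second:.3f} tok/s")    # ≈ 0.308 tok/s
print(f"{seconds_per_token:.2f} s/token")  # ≈ 3.25 s/token
```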

--- Residency check (approx) ---
Param tensors on CUDA: 723
Param tensors on CPU : 0
Unknown/other        : 0
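The residency check reporting all 723 parameter tensors "on CUDA" is expected to be misleading under ZeRO-3: between forwards, each parameter's `p.data` is typically a zero-element placeholder on the compute device, while the partitioned payload lives on CPU as `p.ds_tensor` (an attribute DeepSpeed attaches to ZeRO-3-managed parameters; treating it as the real storage is an assumption here). A sketch of a check that counts the payload's device instead:

```python
# Device-residency helper aware of ZeRO-3 placeholder tensors.
# Assumption: DeepSpeed stores the partitioned payload as `p.ds_tensor`;
# when present, that (not p.data) reflects where the weights actually live.
def residency(params):
    counts = {"cuda": 0, "cpu": 0, "other": 0}
    for p in params:
        ds_tensor = getattr(p, "ds_tensor", None)
        payload = ds_tensor if ds_tensor is not None else p
        dev = str(payload.device)
        if dev.startswith("cuda"):
            counts["cuda"] += 1
        elif dev.startswith("cpu"):
            counts["cpu"] += 1
        else:
            counts["other"] += 1
    return counts
```

With this accounting, a CPU-offloaded ZeRO-3 model would report its parameters on CPU even though `p.device` says `cuda:0`.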
[2025-10-06 13:05:44,867] [INFO] [launch.py:351:main] Process 2508302 exits successfully.
