
Operate on large production systems in a manager-less big data environment using PySpark, Hadoop, Git, and Jenkins, with sampling, YAML-CSV-Python pipelines, and logistic regression for model evaluation.
Search Outlook and personal email across the inbox, folders, and network drives for tasks assigned to you. Check the last 24 hours or 7 days, and note any Jira or wiki references.
Learn to use Copilot in Visual Studio Code and PyCharm, compare the community and paid editions, and understand edge node deployment, where a bridge Ubuntu computer connects Windows clients to Hadoop.
Big Data Code Changes for the Full‑Stack Simulation Engine
Objectives and Key Tasks
Fix and refactor code for the Full‑Stack Simulation Engine to support different simulation runs.
Understand the end‑to‑end setup and architecture of the full‑stack simulation engine, including:
Source code
Configuration settings
Data inputs
PySpark pipelines
Grid and execution environments
Run and debug PySpark code both locally and in the target (shared or distributed) environment.
Fix and update YAML configuration files and the associated Python files, and clearly understand all inputs and parameters used in the YAML files.
Configure PyCharm for effective local development and testing, especially local testing using small sample datasets.
Modify and extend YAML parameters as required by new logic, experiments, or use cases.
Fix variable definitions, naming issues, and scope inconsistencies. This is the primary responsibility of the role.
Understand regression modeling concepts used in the simulation. The goal is to understand the larger picture, not to redesign models.
Understand variable definitions, dependencies, and end‑to‑end data flow within the simulation model. This is required to ensure correctness and reproducibility.
Technical Setup
Configure the local development environment in PyCharm.
Run PySpark jobs for simulation workloads.
Validate YAML‑driven configuration pipelines and ensure correct parameter propagation.
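A minimal sketch of one way to check parameter propagation, assuming the YAML file has already been loaded into a dictionary (for example with PyYAML's yaml.safe_load). The parameter names (sample_size, seed, input_path) and the REQUIRED_PARAMS table are hypothetical and are not taken from the engine's actual configuration.

```python
# Hypothetical required parameters and their expected types.
REQUIRED_PARAMS = {
    "sample_size": int,
    "seed": int,
    "input_path": str,
}


def validate_config(config):
    """Return a list of readable problems; an empty list means the config passed."""
    problems = []
    for name, expected_type in REQUIRED_PARAMS.items():
        if name not in config:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(config[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(config[name]).__name__}"
            )
    # Unknown keys often indicate a typo in the YAML file.
    for name in sorted(set(config) - set(REQUIRED_PARAMS)):
        problems.append(f"unknown parameter: {name}")
    return problems


# Stand-in for a loaded YAML file, with one wrong type and one misspelled key.
example_config = {"sample_size": "5000", "seed": 42, "input_pth": "data/sample.csv"}
config_problems = validate_config(example_config)
for problem in config_problems:
    print(problem)
```

Running a check like this before a job starts surfaces a quoted number or a misspelled key immediately, rather than partway through a PySpark run.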
Code & Configuration
Refactor full‑stack simulation engine code for clarity, correctness, and reusability.
Debug Python and YAML integration issues.
Correct variable definitions, parameter mappings, and configuration usage.
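One way to spot broken parameter mappings is to compare the keys the Python code reads against the keys the configuration defines. The sketch below assumes the code accesses parameters as config["name"]; that access pattern, the example source text, and the parameter names are all illustrative assumptions.

```python
import re

# Matches config["name"] or config['name'] lookups in Python source text.
CONFIG_LOOKUP = re.compile(r"""config\[\s*["']([^"']+)["']\s*\]""")


def find_undefined_params(source_text, defined_params):
    """Keys the code reads that the configuration does not define."""
    used = set(CONFIG_LOOKUP.findall(source_text))
    return sorted(used - set(defined_params))


def find_unused_params(source_text, defined_params):
    """Keys the configuration defines that the code never reads."""
    used = set(CONFIG_LOOKUP.findall(source_text))
    return sorted(set(defined_params) - used)


# Hypothetical source snippet containing one misspelled key.
example_source = '''
rows = load(config["input_path"])
sample = rows[: config["sample_size"]]
rate = config['learning_rte']
'''
defined = ["input_path", "sample_size", "learning_rate"]

undefined = find_undefined_params(example_source, defined)
unused = find_unused_params(example_source, defined)
print("undefined:", undefined)
print("unused:", unused)
```

A text scan like this only catches literal key lookups. Keys built dynamically or passed through helper functions would need a different approach.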
Modeling & Logic
Understand regression modeling techniques used in the simulations.
Analyze how variables impact model outputs across different runs.
Ensure correctness, consistency, and repeatability across simulation executions.
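Repeatability can be checked by fingerprinting each run's parameters and outputs and comparing the fingerprints across executions. The sketch below uses a toy run_simulation function as a stand-in for the real engine; the function, its parameters (seed, mean, stddev, n_draws), and the hashing scheme are assumptions for illustration only.

```python
import hashlib
import json
import random


def run_simulation(params):
    """Toy stand-in for a simulation run: draws seeded random values."""
    rng = random.Random(params["seed"])  # local generator, no shared global state
    return [
        round(rng.gauss(params["mean"], params["stddev"]), 6)
        for _ in range(params["n_draws"])
    ]


def fingerprint(params, outputs):
    """Hash parameters and outputs together so any difference changes the digest."""
    payload = json.dumps({"params": params, "outputs": outputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base_params = {"seed": 7, "mean": 0.0, "stddev": 1.0, "n_draws": 100}

first = fingerprint(base_params, run_simulation(base_params))
second = fingerprint(base_params, run_simulation(base_params))

# Changing one variable should produce a different fingerprint.
changed_params = dict(base_params, seed=8)
third = fingerprint(changed_params, run_simulation(changed_params))

print("repeatable:", first == second)
print("sensitive to seed:", first != third)
```

Passing the seed through the parameters, rather than relying on global random state, keeps each run self-describing: the same parameters should always give the same fingerprint.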
Configs for running notebooks on grids: different venvs, authentication types, and common errors
Setting up Git SSH on masters and the local system: git clone, git push, and common errors, such as a mismatched ticket name or ones resolved by a restart and soft reset
Making the code work locally first, then on master systems: run locally on a 5k sample, recreating the sample as needed
Changing code with Copilot to apply different filters in big data notebooks running on masters, which have access to HDFS
Writing notes in Markdown, also using Copilot
Handshaking CSVs between different repos: the CSVs act as a bridge, so the CSV definitions must also be understood
Understanding multithreading errors, which are hard to locate in a local PySpark setup
Starting afresh on master systems: git clone, authentication, and running the notebook
Cloning the repo locally, installing a venv, and then running unit tests on local data
Manual settings in PyCharm needed to make local unit tests work
Creating samples that work with different kinds of local tests, such as 5k local files
First make the current unit tests work, using settings in pytest.ini such as disabling coverage
Adding new variables for local tests
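For the local samples mentioned above, one possible approach is reservoir sampling, which draws a fixed-size random sample from a CSV without loading the whole file. This is a sketch under assumptions: the notes do not say how the 5k samples are actually produced, and the function names, column names, and in-memory example data here are hypothetical.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed):
    """Return k rows chosen uniformly from an iterable of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < k:
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = row
    return reservoir


def sample_csv(source, target, k, seed=0):
    """Copy the header plus a k-row random sample from source to target."""
    reader = csv.reader(source)
    header = next(reader)
    writer = csv.writer(target)
    writer.writerow(header)
    writer.writerows(reservoir_sample(reader, k, seed))


# In-memory stand-in for a large input file: a header plus 100 data rows.
big_csv = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(100))

out_a = io.StringIO()
sample_csv(io.StringIO(big_csv), out_a, k=5, seed=42)
out_b = io.StringIO()
sample_csv(io.StringIO(big_csv), out_b, k=5, seed=42)

sample_lines = out_a.getvalue().splitlines()
print(sample_lines)
```

With real files, source and target would be opened with open(path, newline="") and k set to 5000. Fixing the seed means the sample can be recreated exactly when needed.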