🎬 DramaDirector

Geometry-Guided Short Drama Generation with Storyboard Planning and Alignment Rewards ✨

🤗 Model Checkpoints: PillowTa1k/DramaDirector

DramaDirector is a research codebase for controllable short-drama generation. It decomposes drama creation into storyboard planning, text-visual reward learning, retrieval-aware policy optimization, first-frame generation, and downstream video synthesis.

🚨 Current Short-Drama Generation Challenges

❌ Weak long-horizon consistency - Characters, scenes, and story states can drift across shots.
❌ Unstructured generation - Plain text prompts often miss camera language, subject layout, dialogue, and duration.
❌ Visual-narrative mismatch - A shot may look plausible while failing to follow the intended plot beat.

💡 DramaDirector Solution

🎬 DramaDirector treats short-drama generation as a structured directing workflow: it first plans shot-level storyboards, then improves visual faithfulness with text-visual rewards, and finally packages the generated shots for first-frame and video synthesis. 🚀

📢 News

Codebase organized around preprocessing, reward modeling, SFT, GRPO, and video generation.
Released checkpoints are hosted on Hugging Face: PillowTa1k/DramaDirector.
The inference pipeline supports both standard Transformers/Unsloth generation and vLLM LoRA serving.

🏗️ Framework

DramaDirector combines storyboard schema supervision, text-visual alignment reward modeling, multi-objective policy optimization, retrieval, first-frame generation, and final video synthesis.

📂 Repository Structure

DramaDirector/
  preprocess/                 # Raw video processing, ASR, shot analysis, depth/pose assets, embeddings
  reward_model/               # Text-visual reward model, hard negative mining, grid search, training
  train/                      # Storyboard SFT and GRPO training
  generation/                 # Inference, prompt packaging, first-frame generation, video generation
  config.py                   # API keys and local data paths
  pyproject.toml              # Python dependencies managed by uv

⚙️ Installation

We recommend using uv to create a reproducible Python environment.

uv sync

For vLLM inference, install the optional inference dependency group:

uv sync --extra inference

Main dependencies include torch, transformers, trl, peft, unsloth, dashscope, openai, opencv-python, and optional vllm.

🔐 Configuration

Before running preprocessing, retrieval, reward training, or video generation, edit config.py.

DASHSCOPE_API_KEY = ""
DASHSCOPE_API_KEYS = []
RUNNINGHUB_API_KEY = ""

SAVED_DIR = PREPROCESS_ROOT / "saved"
PROCESSED_SPLIT_DIR = PREPROCESS_ROOT / "processed_split"
TEXT_EMB_DIR = PREPROCESS_ROOT / "text_emb"
DEPTH_EMB_DIR = PREPROCESS_ROOT / "depth_emb"
POSE_EMB_DIR = PREPROCESS_ROOT / "pose_emb"
EMB_BASE_DIR = PREPROCESS_ROOT

DASHSCOPE_API_KEY is used for plot analysis, embedding, retrieval checks, and reward-model text encoding. RUNNINGHUB_API_KEY is used by the downstream RunningHub first-frame and video generation utilities.
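
A small pre-flight check can catch an empty key before a long job starts. The sketch below is not part of the repository: find_unset_keys is a hypothetical helper, and it assumes that an empty or whitespace-only string means "not configured".

```python
from typing import Iterable, Mapping


def find_unset_keys(settings: Mapping[str, object], required: Iterable[str]) -> list[str]:
    """Return the names in `required` that are missing or empty in `settings`."""
    unset = []
    for name in required:
        value = settings.get(name)
        # Treat None, "", and whitespace-only strings as "not configured".
        if value is None or (isinstance(value, str) and not value.strip()):
            unset.append(name)
    return unset


# Stand-in values for illustration; the real ones live in config.py.
example_settings = {"DASHSCOPE_API_KEY": "", "RUNNINGHUB_API_KEY": "rh-example"}
unset_names = find_unset_keys(example_settings, ["DASHSCOPE_API_KEY", "RUNNINGHUB_API_KEY"])
```

With the real module, vars(config) could be passed as settings, and a non-empty result could be turned into an early error message.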

🎞️ Preprocessing

The preprocessing pipeline converts raw drama episodes into structured data:

raw video
  -> shot splitting and keyframes
  -> plot analysis
  -> ASR and transcript repair
  -> structured shot descriptions
  -> origin/depth/pose assets
  -> text/depth/pose embeddings

Recommended input layout:

videos/
  drama_a/
    1.mp4
    2.mp4
  drama_b/
    1.mp4

Run from the preprocess/ directory:

python extract_frames.py videos
python video_scene_analyzer.py videos
python transcribe.py saved
python fix_transcripts.py saved
python image_shot_analyzer.py saved
python check.py saved
python process_pipeline.py
python embed_texts.py
python embed_images.py

Or run the main chain:

bash prepare_dataset.sh

Depth and pose rendering require local copies of:

git clone https://github.com/DepthAnything/Depth-Anything-V2.git
git clone https://github.com/IDEA-Research/DWPose.git

Prepare the Depth-Anything-V2 checkpoint before running process_pipeline.py. The extract_frames.py script also expects the local shot_split module used by ShotSplitter. For ASR, transcribe.py builds on the external Fun-ASR project; follow the upstream instructions to set up the ASR runtime and model resources before running this stage.

🔥 Training

🧩 Reward Model

The reward model learns text-visual alignment over storyboard text, depth representations, and pose representations.

uv run python reward_model/train.py

🧠 Supervised Fine-Tuning

uv run python train/sft.py

The SFT stage renders storyboard tasks and trains a LoRA policy over structured storyboard outputs. This step is optional if you use the released LoRA checkpoint from PillowTa1k/DramaDirector. For full SFT reproduction, the script expects data/verl_storyboard_sft/train.parquet, data/verl_storyboard_sft/val.parquet, and the internal rendering / short-task packing helpers used during training.

🚀 GRPO Optimization

uv run python train/rl_grpo_train.py \
  --nproc_per_node 2 \
  --base_model_dir ./Qwen/Qwen3-8B \
  --sft_checkpoint_dir outputs/verl_storyboard_sft/checkpoint-426 \
  --train_data_path data/verl_storyboard_sft/grpo_seq_continue/train.json \
  --val_data_path data/verl_storyboard_sft/grpo_seq_continue/val.json \
  --dashscope_api_key YOUR_DASHSCOPE_KEY \
  --judge_api_key YOUR_JUDGE_KEY \
  --output_dir rl_grpo_output \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --num_generations 4 \
  --max_completion_length 4096 \
  --learning_rate 5e-5

The GRPO reward combines:

R = alpha * R_retrieval + beta * R_video_gen + gamma * R_format

R_retrieval: visual retrieval alignment reward.
R_video_gen: LLM-judged video generation readiness.
R_format: JSON schema and field-completeness reward.
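
The weighted sum above can be sketched in a few lines. This is an illustration of the formula only, not the training code: RewardWeights and combine_rewards are hypothetical names, and the weights shown are placeholders, since this README does not state the values used for alpha, beta, and gamma.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    alpha: float  # weight on R_retrieval
    beta: float   # weight on R_video_gen
    gamma: float  # weight on R_format


def combine_rewards(r_retrieval: float, r_video_gen: float, r_format: float,
                    w: RewardWeights) -> float:
    """Compute R = alpha * R_retrieval + beta * R_video_gen + gamma * R_format."""
    return w.alpha * r_retrieval + w.beta * r_video_gen + w.gamma * r_format


# Placeholder weights and scores for illustration only.
weights = RewardWeights(alpha=0.5, beta=0.3, gamma=0.2)
total = combine_rewards(r_retrieval=0.8, r_video_gen=0.6, r_format=1.0, w=weights)
```

Keeping the weights in one immutable object makes it easier to log them alongside each run when comparing reward mixes.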

For GRPO reproduction, make sure the base model, SFT checkpoint, reward checkpoint, and GRPO JSON splits match your local paths before launch.

🚀 Inference

The repository provides two inference paths:

generation/infer.py: standard Transformers/Unsloth LoRA inference.
generation/infer_vllm.py: vLLM batched LoRA inference.

Download the released LoRA adapter from PillowTa1k/DramaDirector, then pass its local directory to --checkpoint.

Recommended LLM decoding parameters:

temperature = 0.9
top_p = 0.9
repetition_penalty = 1.2

⚡ vLLM LoRA Inference

uv run python generation/infer_vllm.py \
  --checkpoint /path/to/final_rl_lora \
  --base_model ./Qwen/Qwen3-8B \
  --test_file generation/test.json \
  --output_file infer_results.json \
  --temperature 0.9 \
  --top_p 0.9 \
  --repetition_penalty 1.2 \
  --max_new_tokens 8192 \
  --batch_size 8 \
  --tensor_parallel_size 1 \
  --gpu_memory_utilization 0.9 \
  --max_model_len 32768

🧪 Transformers/Unsloth Inference

uv run python generation/infer.py \
  --checkpoint /path/to/final_rl_lora \
  --base_model ./Qwen/Qwen3-8B \
  --test_file generation/test.json \
  --output_file infer_results.json \
  --temperature 0.9 \
  --top_p 0.9 \
  --repetition_penalty 1.2 \
  --max_new_tokens 8192

The inference input file should contain a user field holding the plot summary, the character list, and the existing storyboard context. The generated storyboard is saved in the model_output field.
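
As a rough illustration of such an input record, the sketch below packs the three pieces into one user string. Only the user field name comes from this README; build_request, the section labels, and the way the pieces are joined are assumptions, so check generation/test.json for the format the scripts actually expect.

```python
import json


def build_request(plot_summary: str, characters: list[str], previous_shots: list[dict]) -> dict:
    """Pack the three inputs into a single record with a `user` field."""
    user_text = "\n\n".join([
        # Section labels below are hypothetical, not the repository's prompt format.
        "Plot summary:\n" + plot_summary,
        "Characters:\n" + "\n".join(f"- {name}" for name in characters),
        "Existing storyboard:\n" + json.dumps(previous_shots, ensure_ascii=False),
    ])
    return {"user": user_text}


record = build_request("A meets B at the station.", ["A", "B"], [])
```

A list of such records could then be written with json.dump, assuming the test file is a JSON array of objects.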

🎥 Video Generation

After inference, convert generated storyboard JSON into shot-level prompts:

uv run python generation/prepare_video_inputs.py \
  --input_glob "infer_results*.json" \
  --output_dir prepared_video_inputs

This exports prompt-oriented files under prepared_video_inputs/storyboards/, prepared_video_inputs/video_inputs/, and prepared_video_inputs/prompt_text/.

The first-frame generator expects a richer RunningHub input directory with shot controls and character prompts:

prepared_data/
  shot_count_5/
    sample_xxx/
      sample_manifest.json
      characters/character_reference_prompts.json
      shots/shot_xxx/
        metadata.json
        prompt.txt
        controls/01_depth.png
        controls/02_pose.png
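
Because the first-frame generator depends on this layout, a quick completeness check before launch can save a failed run. The sketch below is not a repository script: find_incomplete_shots is a hypothetical helper, and it assumes that the four per-shot files listed above are all required for every shot.

```python
from pathlib import Path

# Per-shot files taken from the layout above.
REQUIRED_SHOT_FILES = (
    "metadata.json",
    "prompt.txt",
    "controls/01_depth.png",
    "controls/02_pose.png",
)


def find_incomplete_shots(data_root: str) -> dict[str, list[str]]:
    """Return {shot directory: [missing files]} for every incomplete shot."""
    problems = {}
    # Expected layout: <data_root>/<shot_count_*>/<sample_*>/shots/<shot_*>/
    for shot_dir in sorted(Path(data_root).glob("*/*/shots/*")):
        if not shot_dir.is_dir():
            continue
        missing = [rel for rel in REQUIRED_SHOT_FILES if not (shot_dir / rel).is_file()]
        if missing:
            problems[shot_dir.relative_to(data_root).as_posix()] = missing
    return problems
```

An empty result from find_incomplete_shots("prepared_data") means every shot directory found has its metadata, prompt, and both control images; it does not check sample_manifest.json or the character prompts.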

Once that directory is prepared, generate first frames:

uv run python generation/runninghub_image_generation.py \
  --data-root prepared_data \
  --output-root runninghub_outputs

Finally, run image-to-video generation:

uv run python generation/run_runninghub2_video_generation.py \
  --assets-root runninghub_outputs \
  --prompt-source-template "prepared_video_inputs/video_inputs/infer_results" \
  --output-dir runninghub_video_generation_results

The complete generation flow is:

plot + characters + previous storyboard
  -> storyboard continuation
  -> cleaned shot prompts
  -> first-frame generation
  -> video generation
  -> final drama clips

📦 Checkpoints

Model weights are hosted on Hugging Face instead of stored in this repository:

PillowTa1k/DramaDirector

The release includes the reward model checkpoint and the final Qwen LoRA adapter trained for storyboard generation. Download the assets you need and pass their local paths to the corresponding training or inference scripts.

🧾 Output Schema

DramaDirector generates a JSON array of shot objects. Each shot follows this schema:

[
  {
    "index": 0,
    "shot_scale": "medium shot",
    "camera_angle": "eye-level",
    "camera_motion": "static",
    "subjects": [
      {
        "name": "character name",
        "gender": "gender",
        "clothing": "clothing description",
        "position": "position in frame",
        "action": "action description",
        "expression": "facial expression"
      }
    ],
    "background": "scene and environment",
    "description_narrative": "what happens in this shot",
    "dialogue": null,
    "speaker": null,
    "emotion": null,
    "duration": 2.5
  }
]

When an episode is complete, the model appends <END> after the JSON array.
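
Downstream tools need to separate the <END> marker from the JSON before parsing. The sketch below shows one way to do that and to check field completeness. It is not the repository's parser: parse_storyboard is a hypothetical helper, it assumes every field in the example above is required, and it assumes the output contains nothing besides the JSON array and the optional marker.

```python
import json

END_MARKER = "<END>"
# Field names copied from the schema example above.
REQUIRED_FIELDS = (
    "index", "shot_scale", "camera_angle", "camera_motion", "subjects",
    "background", "description_narrative", "dialogue", "speaker",
    "emotion", "duration",
)


def parse_storyboard(model_output: str) -> tuple[list[dict], bool]:
    """Return (shots, episode_complete) parsed from raw model output."""
    text = model_output.strip()
    episode_complete = text.endswith(END_MARKER)
    if episode_complete:
        # Drop the marker so that only the JSON array remains.
        text = text[: -len(END_MARKER)].rstrip()
    shots = json.loads(text)
    if not isinstance(shots, list):
        raise ValueError("expected a JSON array of shot objects")
    for position, shot in enumerate(shots):
        if not isinstance(shot, dict):
            raise ValueError(f"item {position} is not a JSON object")
        missing = [name for name in REQUIRED_FIELDS if name not in shot]
        if missing:
            raise ValueError(f"shot at position {position} is missing fields: {missing}")
    return shots, episode_complete
```

Fields such as dialogue, speaker, and emotion may be null, so the check tests only for the presence of each key, not for a non-null value.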

🌟 If this project helps your research, please consider giving DramaDirector a Star!

Thanks for visiting DramaDirector ✨
