Deblank

Important

🎉 This project originates from our ICSE'26 paper and has received the Distinguished Paper Award.

Deblank reduces the token count of source code by removing its optional formatting, which makes LLM processing more token-efficient. It acts as a bidirectional translation layer: it compresses code into a compact, unformatted version for the LLM and restores it to a human-readable format for developers.

📊 Why Deblank?

Programming languages are designed mainly for human readability, and formatting is an essential part of that. For LLMs, however, formatting is a barrier to token efficiency. Removing the optional formatting significantly reduces the token count and, most importantly, does not change the semantics of the code. Experiments on 10 different models, including DeepSeek-V3, Claude-3.7, and Gemini-1.5, show that removing formatting has a negligible impact on Pass@1 performance for Fill-in-the-Middle code completion tasks. See our ICSE'26 research paper for details.

Format removal achieved the following token reductions for source code in our experiments (measured with GPT-4o's tokenizer):

Language	Reduction
Java	33.7%
C#	26.2%
C++	33.9%
Python	9.4%
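
To get a feel for what these percentages mean for a prompt budget, the sketch below applies the reduction rates from the table to a token count. The function name and the 12,000-token prompt are hypothetical and chosen only for illustration; real savings depend on the code and on the tokenizer in use.

```python
# Reduction rates copied from the table above (GPT-4o tokenizer).
REDUCTION = {"java": 0.337, "c#": 0.262, "c++": 0.339, "python": 0.094}


def estimate_tokens_after(tokens_before: int, language: str) -> int:
    """Estimate the token count after format removal (illustrative only)."""
    rate = REDUCTION[language.lower()]
    return round(tokens_before * (1 - rate))


# Hypothetical 12,000-token Java prompt.
print(estimate_tokens_after(12_000, "Java"))  # prints 7956
```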

🚀 Features

Token-Efficient: Reduces tokens by ~30% for C-family languages and ~9% for Python without sacrificing model performance.
Semantically Safe: Guarantees your Abstract Syntax Tree (AST) remains completely unchanged. Only non-essential formatting (like whitespace and indentation) is stripped away.
Lossless Round-Tripping: Compresses code for the LLM and reformats the output back to industry-standard styles (PEP 8, Google Style) for humans.
Multi-Language Support: Currently supports Python, Java, C, C++, C#, JavaScript, TypeScript, and Go.
High Performance: Averages just ~76ms per transformation, making it fully optimized for real-time inference pipelines.
Seamless Integration: Get up and running quickly via a clean, straightforward REST API.
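
The "Semantically Safe" property can be illustrated with Python's standard `ast` module. This is not Deblank's implementation, only a minimal sketch of the idea: two versions of the same program that differ in optional whitespace parse to the same AST. The sample snippets and the `same_ast` helper are made up for this example.

```python
import ast

formatted = """
def add(a, b):
    total = a + b
    return total
"""

# Same program with optional whitespace removed. The indentation that
# Python's grammar requires is kept, reduced to a single space.
compact = "def add(a,b):\n total=a+b\n return total\n"


def same_ast(left: str, right: str) -> bool:
    """Return True when both sources parse to structurally identical ASTs."""
    # ast.dump omits line and column attributes by default, so only the
    # tree structure is compared.
    return ast.dump(ast.parse(left)) == ast.dump(ast.parse(right))


print(same_ast(formatted, compact))  # prints True
```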

🛠 Deployment

The core engine of Deblank is deployed as a service using Docker.

Before you begin, ensure you have Docker installed and running on your system. You can either pull the pre-built Docker image or build it from the source.

Step 1: Obtain the Image

Pull the image:

docker pull zhangcen456/deblank:latest

or

Build from source:

docker build -f dockerfile -t [image_name] .

Step 2: Run the Container

Once the image is ready, start the service by running:

docker run -d \
    -p [port_number]:5089 \
    -e ENABLE_GUESS_LANG=true \
    -e ENABLE_C_FAMILY=true \
    -e ENABLE_JS_TS=true \
    -e ENABLE_GO=true \
    [image_name]

The container is configured using environment variables to define supported features:

ENABLE_GUESS_LANG: Set to true to automatically infer the programming language when it is not specified in the request.
ENABLE_C_FAMILY: Set to true to enable support for Java, C, C++, and C#.
ENABLE_JS_TS: Set to true to enable support for JavaScript and TypeScript.
ENABLE_GO: Set to true to enable support for Go.
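
If the container is launched from a script, the command can be assembled from a set of feature switches. The sketch below only builds the argument list and does not start anything. The helper name and host port 8080 are hypothetical, and passing `false` to switch a feature off is an assumption: the text above only documents the value `true`.

```python
import shlex


def docker_run_command(image, host_port, *, guess_lang=True,
                       c_family=True, js_ts=True, go=True):
    """Build the `docker run` argument list for the Deblank container."""
    flags = {
        "ENABLE_GUESS_LANG": guess_lang,
        "ENABLE_C_FAMILY": c_family,
        "ENABLE_JS_TS": js_ts,
        "ENABLE_GO": go,
    }
    # The service listens on port 5089 inside the container.
    cmd = ["docker", "run", "-d", "-p", f"{host_port}:5089"]
    for name, enabled in flags.items():
        # Assumption: "false" disables a feature.
        cmd += ["-e", f"{name}={'true' if enabled else 'false'}"]
    cmd.append(image)
    return cmd


cmd = docker_run_command("zhangcen456/deblank:latest", 8080,
                         js_ts=False, go=False)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once Docker is available.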

📝 Usage

Once the container is started, you can interact with Deblank through HTTP POST requests at localhost:[port_number].

Two endpoints are provided, one for unformatting the input and one for formatting the output:

http://localhost:[port_number]/unformat_code
http://localhost:[port_number]/format_code
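
A small client for the two endpoints might look like the sketch below. It uses only the Python standard library. The helper names, the default host, and the timeout are assumptions made for this example and are not part of Deblank.

```python
import json
import urllib.request

ACTIONS = ("unformat_code", "format_code")


def endpoint_url(port: int, action: str, host: str = "localhost") -> str:
    """Return the URL of one of the two Deblank endpoints."""
    if action not in ACTIONS:
        raise ValueError(f"unknown endpoint: {action}")
    return f"http://{host}:{port}/{action}"


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Wrap a JSON payload in an HTTP POST request."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the payload and decode the JSON response."""
    request = build_request(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

In a typical round trip, code would go through `unformat_code` before it is sent to the LLM, and the model's output would go through `format_code` before it is shown to a developer.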

Input Format

For each endpoint, the input is a JSON object with the following fields:

input: the text to be transformed by Deblank. It can be pure code or natural-language text containing code blocks. For the latter, start_tag and end_tag are used to locate the code blocks.
mode: specifies how the input should be read. Set this to code for pure code and to mixed for text containing code blocks. Defaults to mixed.
language: the language of the code. If not specified, Deblank tries to infer the language from the code (this requires ENABLE_GUESS_LANG to be set to true).
repair_strategy: specifies the recovery behavior when formatting tools fail.
- none: Deblank will not attempt to fix syntax errors.
- on_failure: If the initial formatting attempt fails, Deblank will try to repair the code.
config: settings that define how code blocks are detected in mixed mode.
- language_tag: when set to true, Deblank looks for a language name immediately following the start_tag. If one is found, the detected language overrides the top-level language field.
- start_tag: the start tag of the code block. Defaults to ```
- end_tag: the end tag of the code block. Defaults to ```
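
The fields above can be assembled with a small helper like the one sketched here. The helper is hypothetical. Its defaults for `repair_strategy` and `language_tag` are choices made for this example, not documented service defaults; only the `mixed` default for mode and the triple-backtick tags come from the description above.

```python
FENCE = "`" * 3  # three backticks, the documented default for both tags

VALID_MODES = {"code", "mixed"}
VALID_REPAIR = {"none", "on_failure"}


def build_payload(text, mode="mixed", language=None, repair_strategy="none",
                  language_tag=False, start_tag=FENCE, end_tag=FENCE):
    """Assemble a request body for /unformat_code or /format_code."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if repair_strategy not in VALID_REPAIR:
        raise ValueError(
            f"repair_strategy must be one of {sorted(VALID_REPAIR)}")
    payload = {
        "input": text,
        "mode": mode,
        "language": language,  # None is serialized as JSON null
        "repair_strategy": repair_strategy,
    }
    # The config block only matters when code blocks must be located.
    if mode == "mixed":
        payload["config"] = {
            "language_tag": language_tag,
            "start_tag": start_tag,
            "end_tag": end_tag,
        }
    return payload


print(build_payload("x = 1", mode="code", language="python"))
```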

Output Format

The API returns a JSON object containing a segments list. Each segment represents a fragment of the input text, categorized by its type:

Text segments ("type": "text") represent natural language fragments, which are returned exactly as they appear in the input.
Code segments ("type": "code") represent code snippets and contain:

content: The processed code (formatted or unformatted based on the API used).
language: The programming language used for processing, either provided in the input or inferred from the code.
meta_info: Technical details regarding the transformation process:
- status: Indicates the processing outcome.
  - success: The code is successfully processed.
  - regex: The primary formatting tool fails, and Deblank falls back to a heuristic, regex-based transformation.
  - failed: The transformation could not be performed. The content remains in its original state.
- repair_attempted (optional): A boolean value that indicates whether Deblank attempted to auto-fix syntax errors to satisfy the formatting tool.
- original_error (optional): The raw error message returned by the underlying formatting tool if a failure or fallback occurs.
- tool (optional): The formatting tool used for this segment.
- Note: For some early failures (for example, unsupported language), optional fields may be absent.
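
Client code usually needs to tell clean results apart from fallbacks and failures. The sketch below groups code segments by their status. The helper name is hypothetical, and counting a segment with no status as failed is an assumption made here for safety, not documented behaviour. The sample response is a shortened, made-up instance of the structure described above.

```python
def group_code_segments(response: dict) -> dict:
    """Group code segments by meta_info.status; text segments are skipped."""
    groups = {"success": [], "regex": [], "failed": []}
    for segment in response.get("segments", []):
        if segment.get("type") != "code":
            continue
        meta = segment.get("meta_info") or {}
        # Assumption: a missing status is treated as a failure.
        status = meta.get("status", "failed")
        groups.setdefault(status, []).append(segment)
    return groups


sample = {
    "segments": [
        {"type": "text", "content": "Here is the solution:"},
        {
            "type": "code",
            "content": "int x=1;",
            "language": "Java",
            "meta_info": {"status": "success", "tool": "uncrustify"},
        },
        {
            "type": "code",
            "content": "???",
            "language": None,
            "meta_info": {"status": "failed"},
        },
    ]
}

groups = group_code_segments(sample)
print({status: len(items) for status, items in groups.items()})
# prints {'success': 1, 'regex': 0, 'failed': 1}
```

Segments in the `regex` group were transformed by the heuristic fallback, so a caller may want to review them more carefully than those in `success`.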

Examples

Example Usage 1 (Python): Pure Code Mode

import requests

url = "http://localhost:[port_number]/unformat_code"

payload = {
    "input": "public class HelloWorld {\n    public static void main(String[] args) {\n        System.out.println(\"Hello, World!\");\n    }\n}",
    "mode": "code",
    "language": "java",
    "repair_strategy": "on_failure"
}

response = requests.post(url, json=payload)
print(response.json())

The response will be:

{
  "segments": [
    {
      "type": "code",
      "content": "public class HelloWorld{public static void main(String[]args){System.out.println(\"Hello, World!\");}}",
      "language": "Java",
      "meta_info":{
        "status": "success",
        "repair_attempted": false,
        "original_error": null,
        "tool": "uncrustify"
      }
    }
  ],
  "response_time (ms)": 11.34
}

Example Usage 2 (Python): Mixed Mode

import requests

url = "http://localhost:[port_number]/unformat_code"

payload = {
    "input": "Here is the solution:```java\npublic class HelloWorld {\n    public static void main(String[] args) {\n        System.out.println(\"Hello, World!\");\n    }\n```",
    "mode": "mixed",
    "repair_strategy": "on_failure",
    "config": {
        "language_tag": True, 
        "start_tag":"```", 
        "end_tag":"```",
    }
}

response = requests.post(url, json=payload)
print(response.json())

The response will be:

{
  "segments":[
    {
      "type": "text",
      "content": "Here is the solution:"
    },
    {
      "type": "code",
      "content": "public class HelloWorld{public static void main(String[]args){System.out.println(\"Hello, World!\");}",
      "language": "Java",
      "meta_info":{
        "status": "success",
        "repair_attempted": true,
        "original_error": "do_source_file: Parsing: <source_file> as language JAVA\nindent_text(4321): size is 2\nindent_text(4323): File: <source_file>, open_line is 1, parent is NONE: Unmatched BRACE_OPEN",
        "tool": "uncrustify"
      }
    }
  ],
  "response_time (ms)": 18.91
}

Example Usage 3 (curl)

Request body (saved as input.json):

{
  "input": "function greet(name) {\n    console.log(\"Hello, \" + name);\n}",
  "mode": "code",
  "language": null,
  "repair_strategy": "on_failure"
}

Send the request using curl:

curl -X POST http://localhost:[port_number]/unformat_code \
  -H "Content-Type: application/json" \
  -d @input.json

The response will be:

{
  "segments": [
    {
      "type": "code",
      "content": "function greet(name){console.log(\"Hello, \"+name)}",
      "language": "JavaScript",
      "meta_info": {
        "status": "success",
        "repair_attempted": false,
        "original_error": null,
        "tool": "babel"
      }
    }
  ],
  "response_time (ms)": 487.64
}

📚 Citation

@article{pan2025hidden,
  title={The hidden cost of readability: How code formatting silently consumes your llm budget},
  author={Pan, Dangfeng and Sun, Zhensu and Zhang, Cenyuan and Lo, David and Du, Xiaoning},
  journal={arXiv preprint arXiv:2508.13666},
  year={2025}
}

🙏 Acknowledgements

Deblank depends on third-party open-source software, including the python:3.11-slim base image, Python packages (Tree-sitter, Tree-sitter-languages, Flask, TensorFlow, Guesslang, Yapf), and invokes Uncrustify, Node.js, Babel, and Go as external tools via CLI. See THIRD_PARTY_LICENSES/ for the full license texts.

