I got frustrated with SMILES, so I built one

Author: sangeet01

by sangeet01·Mar 14, 2026·1 point·0 comments

AI Analysis

●● Solid · Zero to One · Big Brain · Bold Bet

A Pāṇinian grammar model for molecular notation that targets SMILES's 35-year-old canonicalization and materials limitations.

Strengths

• Topological ring back-counting replaces fragile global labels with path-invariant notation
• Generative state machine validation catches invalid structures during parsing
• Extends beyond SMILES to handle alloys, polymers, organometallics, and quantum states

Weaknesses

• Zero stars and unproven adoption: chemists would need to actually switch from SMILES
• RDKit interop exists, but ecosystem integration remains to be demonstrated

Post Description

Hi HN, I'm an undergrad in Nepal.

For the last 35 years, computational chemistry and AI drug discovery have relied on SMILES to represent molecules. It was great for the 1980s, but today it is a massive bottleneck. It’s non-canonical, its stereochemistry parsing is fragile, and it completely breaks down when trying to represent organometallics, alloys, or polymers. To parse it reliably, you basically need a 300MB C++ dependency (RDKit) relying on decades of hard-coded heuristics.

I got frustrated and realized that representing matter isn't a graph theory problem—it’s a linguistics problem.

To fix it, I built SCRIPT (Structural Chemical Representation in Plain Text). I based the core parser on the generative linguistics of Pāṇini’s Sanskrit grammar. Instead of treating a molecule as a string of dumb nodes, SCRIPT treats it as a language of Roots, States (Vibhakti), and Relationships (Sandhi).

I just released V3 today for Pi Day.

How it works & what it fixes:

• Aromaticity without the mess: SMILES uses lowercase letters (c1ccccc1), which causes endless parsing ambiguity. SCRIPT uses an Anubandha (governance marker) on the ring closure. C1CCCCC&6: explicitly tells the parser that the last 6 atoms in the DFS path are resonant.

• Vāk Order Stereochemistry: In SCRIPT, chirality is intrinsically resolved using the Depth-First Search sequence order as the native coordinate frame, making it mathematically order-invariant.

• Organometallics & Materials: Because of the grammar design, SCRIPT natively supports Haptic bonds (*5), fractional alloys (Ti<~0.9>N<~0.1>), crystal phases ([[Rutile]] Ti(O)2), and stochastic polymers ({[CC]}n).

• RDKit-Independent: The core engine uses a pure Python Lark grammar. It catches 6-valent carbons during parsing, generates a 100% native round-trip, and hits 95.9% RDKit InChI parity without relying on RDKit's C++ backend.
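To make two of the ideas above concrete, here is a minimal Python sketch of (a) applying a back-count marker to the last n atoms of the written path, and (b) rejecting an over-valent atom while the structure is still being built. It is not SCRIPT's parser and uses no Lark grammar: the PathBuilder class, its method names, and the MAX_VALENCE table are hypothetical, made up purely for illustration, and the valence limits are a simplified assumption for neutral atoms.

```python
# Toy sketch only: none of these names come from the SCRIPT codebase.

# Simplified bond-order limits for neutral atoms (an assumption for this demo).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


class ValenceError(ValueError):
    """Raised as soon as an atom would exceed its allowed bond-order sum."""


class PathBuilder:
    """Collects atoms in the order they are written (the DFS path)."""

    def __init__(self):
        self.symbols = []      # element symbol per atom, in path order
        self.used = []         # bond-order sum per atom
        self.resonant = set()  # path indices flagged by a back-count marker

    def add_atom(self, symbol):
        self.symbols.append(symbol)
        self.used.append(0)
        return len(self.symbols) - 1

    def add_bond(self, i, j, order=1):
        # Validate while building, not after the whole string has been read.
        for k in (i, j):
            limit = MAX_VALENCE[self.symbols[k]]
            if self.used[k] + order > limit:
                raise ValenceError(
                    f"{self.symbols[k]} at path position {k} exceeds valence {limit}"
                )
        self.used[i] += order
        self.used[j] += order

    def mark_last_resonant(self, n):
        # Back-counting: the marker refers to the last n atoms on the path,
        # so it does not depend on any globally numbered ring label.
        if n > len(self.symbols):
            raise ValueError("back-count exceeds path length")
        start = len(self.symbols) - n
        self.resonant.update(range(start, len(self.symbols)))
        return start


# Two carbons of prefix, then six ring carbons flagged by a back-count of 6.
mol = PathBuilder()
atoms = [mol.add_atom("C") for _ in range(8)]
for a, b in zip(atoms, atoms[1:]):
    mol.add_bond(a, b)
ring_start = mol.mark_last_resonant(6)
mol.add_bond(atoms[-1], ring_start)  # close the ring onto the first counted atom

# A carbon with too many hydrogens fails on the fifth bond, mid-build.
bad = PathBuilder()
center = bad.add_atom("C")
try:
    for _ in range(6):
        bad.add_bond(center, bad.add_atom("H"))
    error_message = None
except ValenceError as exc:
    error_message = str(exc)
```

The point of the back-count is that the same marker value selects the ring whether or not a prefix precedes it; only the path length changes. The valence check fires at the offending bond rather than in a separate post-parse sanitization pass, which is the behaviour the bullet above describes.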

Examples:

Aspirin (SMILES): CC(=O)Oc1ccccc1C(=O)O (or many other valid strings)

Aspirin (SCRIPT): CC(=O)OC1=CC=CC=C1C(=O)O (Deterministic canonicalization)

Cisplatin: Pt<sqp>(Cl)2(NH3)2@ (Preserves square-planar geometry and cis-configuration)
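The "many other valid strings" point is easy to reproduce with a toy SMILES writer. The sketch below is plain Python with no chemistry library, and write_smiles is a hypothetical helper that only handles acyclic, single-bonded molecules with implicit hydrogens. It walks the same ethanol graph from each atom in turn and gets a different, equally valid SMILES string each time.

```python
def write_smiles(symbols, bonds, start):
    """Write SMILES for an acyclic, single-bonded molecule by DFS from `start`."""
    neighbors = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def visit(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        out = symbols[atom]
        # Every branch except the last is parenthesised;
        # the last child continues the main chain.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        if children:
            out += visit(children[-1], atom)
        return out

    return visit(start, None)


# Ethanol with implicit hydrogens: C(0) - C(1) - O(2)
symbols = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

variants = [write_smiles(symbols, bonds, start) for start in range(len(symbols))]
print(variants)  # ['CCO', 'C(C)O', 'OCC']
```

All three strings describe the same molecule. Picking one of them deterministically is the canonicalization problem, and in practice each toolkit solves it with its own atom-ranking rules.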

I'm just a daft undergrad splashing through code like a toddler (my wet-lab titrations are a mess, and yes, I've used my mouth to pipette). I would absolutely love your harshest technical feedback, especially from the parser nerds, chemoinformaticians, or anyone working in AI drug discovery. Happy to answer any questions about the grammar or the parser architecture!