Legal LLM Fine-Tuning Framework

Constructing the scaffolding for legal AI

Jurisdiction

Switzerland (Federal)

Primary Rules

Financial Services Act (FinSA, SR 950.1), Banking Act (BA, SR 952.0), Financial Market Supervision Act (FINMASA, SR 956.1), Financial Market Infrastructure Act (FinMIA, SR 958.1), Financial Institutions Act (FinIA, SR 954.1), Collective Investment Schemes Act (CISA, SR 951.31), Anti-Money Laundering Act (AMLA, SR 955.0) and Insurance Supervision Act (ISA, SR 961.01)

Target Audience

Everybody

Project Hash ID

#I044

Project Category Name

Colossus

This project establishes a framework for building legal language models trained on a structured corpus of law. It is developed initially for Swiss financial regulation, working from the eight core federal statutes and their implementing rules. The same pipeline (corpus assembly, structured data extraction, targeted fine-tuning and retrieval) applies to any legal domain where the source legislation is available in a structured format. The framework is designed to be reused.

Project Phases

Phase A: Prep

Seed corpus

The starting point is a list of the eight core Swiss financial market laws and their implementing regulations: FinSA, BA, FINMASA, FinMIA, FinIA, CISA, AMLA and ISA. Their structured XML files have already been sourced.

Scope mapping

Swiss financial laws reference each other constantly. Before any training data is generated, a script scans all source files and builds a complete list of every other law they cite. Each cited law is assigned a priority: laws that define terms used in the core corpus must be included in full; laws referenced only for procedural matters can be handled differently. The output is a sourcing checklist that drives all further corpus assembly.
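The scan described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes citations can be recognised by their SR number in plain text, and the function name, the two priority labels and the sample strings are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a citation is recognisable by its SR number, e.g. "SR 955.0".
SR_PATTERN = re.compile(r"\bSR\s+(\d{3}(?:\.\d+)*)")

def build_sourcing_checklist(sources, core_sr, definitional_sr):
    """List every cited law outside the core corpus with a count and a priority."""
    counts = Counter()
    for text in sources.values():
        for sr in SR_PATTERN.findall(text):
            if sr not in core_sr:
                counts[sr] += 1
    checklist = []
    for sr, n in counts.most_common():
        # Hypothetical rule: laws known to define core terms are sourced in full.
        priority = "full" if sr in definitional_sr else "procedural"
        checklist.append({"sr": sr, "citations": n, "priority": priority})
    return checklist

# Invented sample text, for illustration only.
sources = {
    "finsa.xml": "See SR 220 and SR 955.0; also SR 220.",
    "ba.xml": "Procedure follows SR 172.021.",
}
checklist = build_sourcing_checklist(
    sources, core_sr={"955.0", "950.1"}, definitional_sr={"220"}
)
```

In practice the set of definitional laws would itself come from inspecting how each citation is used, which this sketch takes as given.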

Parsing

Once the full corpus is assembled, each law is read article by article. The article text is extracted together with its key details: which law it belongs to, its article number, when it came into force, and which other articles it points to. Each article is stored as a self-contained record.
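A self-contained article record of this kind might look like the sketch below. The XML markup is a deliberately simplified, hypothetical stand-in (the real source schema will differ), and the `ArticleRecord` fields and the `parse_law` helper are assumed names chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class ArticleRecord:
    law: str
    article: str
    in_force: str
    text: str
    references: list = field(default_factory=list)

# Hypothetical simplified markup with placeholder content.
SAMPLE = """
<law abbr="ExampleAct" inForce="2020-01-01">
  <article num="3">
    <text>Definitions. See Art. 4 and Art. 5.</text>
  </article>
  <article num="4" inForce="2022-07-01">
    <text>Placeholder provision.</text>
  </article>
</law>
"""

# Assumption: internal references are written as "Art. <number>".
ART_REF = re.compile(r"Art\.\s*(\d+[a-z]?)")

def parse_law(xml_text):
    root = ET.fromstring(xml_text.strip())
    records = []
    for art in root.iter("article"):
        text = (art.findtext("text") or "").strip()
        records.append(ArticleRecord(
            law=root.get("abbr"),
            article=art.get("num"),
            # An article-level date overrides the law-level date when present.
            in_force=art.get("inForce", root.get("inForce")),
            text=text,
            references=ART_REF.findall(text),
        ))
    return records

records = parse_law(SAMPLE)
```

Keeping the law name and dates on every record means each record can be used downstream without looking anything up in its parent document.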

Phase B: Training Data

Training data (Definitions)

Swiss laws follow consistent drafting conventions for definitions. Three patterns are detected automatically: terms introduced with an italic label followed by a colon, enumerated concept families introduced with a heading phrase, and articles that define numeric thresholds. Each detected definition is turned into several differently phrased questions, with the correct answer drawn directly from the statutory text. This produces several hundred question-answer pairs.
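As a rough illustration of the first pattern (an italic label followed by a colon), the sketch below assumes italics survive as `<i>` tags in the extracted text and that list items are separated by semicolons. The templates, the `definition_pairs` function and the sample sentence are hypothetical.

```python
import re

# Assumption: the defined term is wrapped in <i>...</i> and followed by a colon;
# the definition runs up to the next semicolon.
DEF_PATTERN = re.compile(r"<i>(?P<term>[^<]+)</i>\s*:\s*(?P<definition>[^;]+)")

# Hypothetical question templates.
TEMPLATES = [
    "How does {law} define '{term}'?",
    "What is meant by '{term}' under {law}?",
    "According to Art. {article} {law}, what is a '{term}'?",
]

def definition_pairs(law, article, text):
    pairs = []
    for m in DEF_PATTERN.finditer(text):
        term = m.group("term").strip()
        answer = m.group("definition").strip().rstrip(".")
        for template in TEMPLATES:
            pairs.append({
                "question": template.format(law=law, article=article, term=term),
                # The answer quotes the statutory wording and cites its source.
                "answer": f"{answer} (Art. {article} {law})",
            })
    return pairs

# Invented placeholder text, not taken from any statute.
sample = ("<i>widget</i>: a placeholder item used for illustration; "
          "<i>gadget</i>: a second placeholder item.")
pairs = definition_pairs("ExampleAct", "2", sample)
```

Each detected term yields one pair per template, so two terms and three templates give six pairs here.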

Training data (Cross-references)

Every footnote in a Swiss law that points to another statute is a cross-reference. These are extracted and classified by type: references that define a term used in the source article, references that determine who or what falls within scope and references that point to procedural rules. Three question variants are generated per reference, and where the target law is already in the corpus, its article text is included in the answer. This produces further question-answer pairs and trains the model on the legislative connections between statutes rather than treating each law in isolation.
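The classification and question generation could be sketched along these lines. The keyword cues are a crude, hypothetical heuristic that would need tuning against real footnotes, and `CrossRef`, `classify` and `questions_for` are illustrative names, not part of any existing tool.

```python
from dataclasses import dataclass

@dataclass
class CrossRef:
    source_law: str
    source_article: str
    target_law: str
    target_article: str
    context: str  # sentence in which the footnote marker appears

# Hypothetical keyword cues per reference type; anything else counts as procedural.
CUES = {
    "definition": ("within the meaning of", "as defined in"),
    "scope": ("applies to", "subject to", "exempt"),
}

def classify(ref):
    lowered = ref.context.lower()
    for label, cues in CUES.items():
        if any(cue in lowered for cue in cues):
            return label
    return "procedural"

def questions_for(ref, corpus):
    src = f"Art. {ref.source_article} {ref.source_law}"
    tgt = f"Art. {ref.target_article} {ref.target_law}"
    answer = f"{src} refers to {tgt} ({classify(ref)} reference)."
    # Include the target text only when the target law is already in the corpus.
    target_text = corpus.get((ref.target_law, ref.target_article))
    if target_text:
        answer += f" Text of {tgt}: {target_text}"
    variants = [
        f"Which provision does {src} refer to?",
        f"Why does {src} cite {ref.target_law}?",
        f"Which other statute must be read together with {src}?",
    ]
    return [{"question": q, "answer": answer} for q in variants]

# Invented placeholder data.
ref = CrossRef("LawA", "10", "LawB", "2", "Banks within the meaning of LawB are covered.")
corpus = {("LawB", "2"): "Placeholder text."}
qa = questions_for(ref, corpus)
```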

Training data (Amendments)

Each law carries two kinds of temporal information. The document metadata records when the law was passed, when it came into force and which consolidated version is being read. Individual articles carry footnotes recording when they were modified, inserted or repealed, and by which amending law. Both sources are extracted and turned into questions about version dates and amendment history. This gives the model the ability to answer questions like "which version of this article applied before March 2024", a common need in compliance work.
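A question such as "which version applied before March 2024" reduces to a lookup in an ordered version history. The sketch below assumes each article's history has already been extracted as a date-sorted list; the data and the helper names are invented for illustration, and repeals are not handled.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history for one article: (effective date, text), sorted by date.
VERSIONS = [
    (date(2020, 1, 1), "Original wording."),
    (date(2023, 1, 1), "Wording after first amendment."),
    (date(2024, 3, 1), "Wording after second amendment."),
]

def version_on(versions, day):
    """Return the (effective_date, text) in force on `day`, or None if none applies yet."""
    dates = [d for d, _ in versions]
    i = bisect_right(dates, day)
    return versions[i - 1] if i else None

def amendment_question(law, article, versions, day):
    hit = version_on(versions, day)
    question = f"Which version of Art. {article} {law} applied on {day.isoformat()}?"
    if hit is None:
        answer = f"Art. {article} {law} was not yet in force on that date."
    else:
        answer = f"The version in force since {hit[0].isoformat()}: {hit[1]}"
    return {"question": question, "answer": answer}

qa = amendment_question("ExampleAct", "5", VERSIONS, date(2024, 2, 29))
```

Using `bisect_right` means a version counts as applicable on its own effective date, which is the convention assumed here.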

Training data (Manually curated examples)

The automated extraction tracks produce questions that follow directly from statutory structure. They do not cover the kinds of questions people actually ask: how two statutes interact in a specific fact pattern, what FINMA expects in practice beyond what the law text says, or where a common interpretation is wrong. A separate set of question-answer pairs will be written by hand to fill these gaps.

Evaluation set

A further set of question-answer pairs will be written and held back from training entirely. Its sole purpose will be to measure what the model has actually learned after fine-tuning.

Phase C: Fine-Tuning

Once the training and evaluation datasets are assembled, a base model will be fine-tuned on the full set of training examples. A parameter-efficient method will be used that adjusts only a small fraction of the model's weights rather than retraining it from scratch, keeping compute time and cost low. To prevent the model from losing its general language ability while gaining legal specialisation, a small proportion of general-purpose examples will be mixed into the training data.
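The mixing step can be illustrated with a short sketch. The 10% share used below is an arbitrary placeholder (the text only says "a small proportion"), and `mix_datasets` is a hypothetical helper rather than part of any training library.

```python
import random

def mix_datasets(legal, general, general_fraction=0.1, seed=0):
    """Return a shuffled list in which roughly `general_fraction` of items are general-purpose."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_legal + n_general) = general_fraction for n_general.
    n_general = round(len(legal) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general))
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible across runs
    mixed = list(legal) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed

# Invented placeholder examples.
legal = [{"source": "legal", "id": i} for i in range(90)]
general = [{"source": "general", "id": i} for i in range(50)]
mixed = mix_datasets(legal, general, general_fraction=0.1)
```

All legal examples are kept and general examples are sampled to reach the target share, so the legal data is never diluted by dropping items.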

Phase D: Retrieval-Augmented Generation

The corpus ingestion step that produces the parsed article records also populates a searchable vector store containing every article, paragraph and letter of every statute in the extended scope. At inference time, a question is matched against this index and the most relevant passages are passed to the model as explicit context alongside the question. The fine-tuned model provides the legal reasoning; the index provides the exact current statutory text. Because the two are separate, updating the index when a law is amended requires no retraining, only re-ingestion of the affected statute.
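The separation between index and model can be sketched with a toy in-memory index. A bag-of-words vector stands in for a learned embedding model here, and `ArticleIndex`, `build_prompt` and the sample provisions are hypothetical; a real deployment would use a proper vector store.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ArticleIndex:
    def __init__(self):
        self.entries = {}  # (law, article) -> (text, vector)

    def ingest(self, law, article, text):
        # Re-ingesting an existing key overwrites the old text; the model is untouched.
        self.entries[(law, article)] = (text, embed(text))

    def search(self, question, k=2):
        q = embed(question)
        scored = sorted(self.entries.items(), key=lambda kv: cosine(q, kv[1][1]), reverse=True)
        return [(key, text) for key, (text, _) in scored[:k]]

def build_prompt(question, passages):
    context = "\n".join(f"[Art. {art} {law}] {text}" for (law, art), text in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Invented placeholder provisions.
index = ArticleIndex()
index.ingest("ExampleAct", "1", "This Act governs the licensing of widget dealers.")
index.ingest("ExampleAct", "2", "Penalties for late filing of annual reports.")
# Simulate an amendment: only the affected article is re-ingested.
index.ingest("ExampleAct", "1", "This Act governs the licensing and supervision of widget dealers.")

question = "Who needs a licensing decision as a widget dealer?"
top = index.search(question, k=1)
prompt = build_prompt(question, top)
```

After the simulated amendment the index returns the new wording immediately, which is the property the paragraph above relies on.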

Phase E: Evaluation

The fine-tuned model will be tested on two axes simultaneously. The held-out evaluation set measures domain accuracy: whether the model cites the correct statute and article, whether the legal interpretation is sound and whether it handles edge cases and false premises correctly. A standard general-knowledge benchmark run in parallel checks that the model has not lost capability outside the legal domain. Citation precision will also be measured automatically by comparing the article references in the model's answers against those in the reference answers, giving a numerical score that tracks improvement across training runs.
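The automatic citation comparison might be implemented along the following lines. The citation format captured by the regular expression is an assumption, as is the decision to compare only the law and article number and to ignore paragraphs; recall is reported alongside precision as an optional extra that the text above does not require.

```python
import re

# Assumed citation format: "Art. 3 FinSA" or "Art. 3a para. 2 FinSA".
# Only the article number and the law abbreviation are compared.
CITATION = re.compile(r"Art\.\s*(\d+[a-z]?)(?:\s+para\.\s*\d+)?\s+([A-Z][A-Za-z]+)")

def citations(text):
    return {(law, art) for art, law in CITATION.findall(text)}

def citation_scores(model_answer, reference_answer):
    predicted, gold = citations(model_answer), citations(reference_answer)
    correct = len(predicted & gold)
    # Precision: share of the model's citations that appear in the reference answer.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of the reference citations that the model reproduced.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented answers, for illustration only.
precision, recall = citation_scores(
    "See Art. 3 FinSA and Art. 10 BA.",
    "The rule follows from Art. 3 FinSA and Art. 4 para. 1 FinSA.",
)
```

Averaging these per-question scores over the held-out set would give the single number tracked across training runs.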