Seed corpus
The starting point is the eight core Swiss financial market laws, together with their implementing regulations: FinSA, BA, FINMASA, FinMIA, FinIA, CISA, AMLA and ISA. Their structured XML files have been sourced.
Scope mapping
Swiss financial laws reference each other constantly. Before any training data is generated, a script scans all source files and builds a complete list of every other law they cite. Each cited law is assigned a priority: laws that define terms used in the core corpus must be included in full; laws referenced only for procedural matters can be handled differently. The output is a sourcing checklist that drives all further corpus assembly.
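As a rough illustration, the prioritisation step could be sketched as follows. It assumes the scan has already produced tuples of (citing law, cited law, kind of reference). The kind labels, the function name and the placeholder statutes LAW_X and LAW_Y are hypothetical and not part of the actual script.

```python
from collections import defaultdict

# The eight seed laws named above; any other cited law has to be sourced.
CORE_LAWS = {"FinSA", "BA", "FINMASA", "FinMIA", "FinIA", "CISA", "AMLA", "ISA"}


def build_sourcing_checklist(citations):
    """Turn (source_law, cited_law, kind) tuples into a prioritised checklist.

    kind is "definition" or "procedural" (hypothetical labels). A law cited
    at least once for a definition is marked for inclusion in full.
    """
    cited_by = defaultdict(set)
    defines_terms = set()
    for source_law, cited_law, kind in citations:
        if cited_law in CORE_LAWS:
            continue  # already part of the seed corpus
        cited_by[cited_law].add(source_law)
        if kind == "definition":
            defines_terms.add(cited_law)

    checklist = []
    for law in sorted(cited_by):
        checklist.append({
            "law": law,
            "priority": "full" if law in defines_terms else "procedural-only",
            "cited_by": sorted(cited_by[law]),
        })
    # Stable sort: laws needed in full come first, alphabetical order is kept.
    checklist.sort(key=lambda row: row["priority"] != "full")
    return checklist


# Placeholder input for illustration only: the citations are invented,
# and LAW_X and LAW_Y are not real statutes.
example = [
    ("FinSA", "LAW_X", "procedural"),
    ("BA", "LAW_X", "definition"),
    ("FinIA", "LAW_Y", "procedural"),
    ("FinSA", "FinIA", "definition"),
]
for row in build_sourcing_checklist(example):
    print(row)
```

A law moves to the "full" tier as soon as a single citation uses it for a definition, which matches the rule that term-defining laws must be included in full.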
Parsing
Once the full corpus is assembled, each law is read article by article. The article text is extracted together with its key details: which law it belongs to, its article number, when it came into force, and which other articles it points to. Each article is stored as a self-contained record.
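A minimal sketch of such a record and of the article-by-article pass is shown below. The XML layout (element and attribute names such as article, num, inForce and ref) is a simplified, hypothetical schema chosen for illustration; the real source files will differ and the extraction paths would need to be adapted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArticleRecord:
    law: str            # abbreviation of the law the article belongs to
    number: str         # article number as printed, e.g. "3" or "3a"
    in_force: str       # ISO date on which the article came into force
    text: str           # article text
    references: List[str] = field(default_factory=list)  # cited articles


# Hypothetical, simplified input; not the real file format.
SAMPLE = """<law abbr="LAW_X" inForce="2020-01-01">
  <article num="1" inForce="2022-07-01">
    <text>Placeholder text of article 1.</text>
    <ref target="LAW_X:3"/>
  </article>
  <article num="2">
    <text>Placeholder text of article 2.</text>
  </article>
</law>"""


def parse_law(xml_string):
    """Read one law and return a self-contained record per article."""
    root = ET.fromstring(xml_string)
    records = []
    for art in root.iter("article"):
        records.append(ArticleRecord(
            law=root.get("abbr"),
            number=art.get("num"),
            # Fall back to the law-level date if the article has none.
            in_force=art.get("inForce", root.get("inForce")),
            text=(art.findtext("text") or "").strip(),
            references=[r.get("target") for r in art.iter("ref")],
        ))
    return records


for record in parse_law(SAMPLE):
    print(record)
```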
Training data (Definitions)
Swiss laws follow consistent drafting conventions for definitions. Three patterns are detected automatically: terms introduced with an italic label followed by a colon, enumerated concept families introduced with a heading phrase, and articles that define numeric thresholds. Each detected definition is turned into several differently phrased questions, with the correct answer drawn directly from the statutory text. This produces several hundred question-answer pairs.
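The first of these patterns, an italic label followed by a colon, could be handled roughly as below. The sketch assumes the italic label survives extraction as an <i>...</i> tag, which may not match the real markup. The question templates and the function name are likewise illustrative.

```python
import re

# Assumption: italic terms appear as <i>term</i> directly before the colon.
DEF_PATTERN = re.compile(r"<i>(?P<term>[^<]+)</i>:\s*(?P<definition>.+)")

# Hypothetical phrasings; the real set of templates may be larger.
TEMPLATES = [
    "How does {law} define '{term}'?",
    "What does '{term}' mean under {law}?",
    "According to {law}, what is meant by '{term}'?",
]


def definition_pairs(law, line):
    """Return one question-answer pair per template, or [] if no definition."""
    match = DEF_PATTERN.search(line)
    if match is None:
        return []
    term = match.group("term").strip()
    # The answer is the statutory wording itself, minus the list punctuation.
    answer = match.group("definition").strip().rstrip(";")
    return [
        {"question": template.format(law=law, term=term), "answer": answer}
        for template in TEMPLATES
    ]


# Placeholder line, not a quotation from any statute.
line = "a. <i>placeholder term</i>: a placeholder definition taken from the statute;"
for pair in definition_pairs("LAW_X", line):
    print(pair)
```

Keeping the answer identical across all phrasings means every variant is grounded in the same statutory wording.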

Training data (Cross-references)
Every footnote in a Swiss law that points to another statute is a cross-reference. These are extracted and classified by type: references that define a term used in the source article, references that determine who or what falls within scope and references that point to procedural rules. Three question variants are generated per reference, and where the target law is already in the corpus, its article text is included in the answer. This produces further question-answer pairs and trains the model on the legislative connections between statutes rather than treating each law in isolation.
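The generation step might look something like the following sketch. The dictionary keys, the three question wordings and the role descriptions are assumptions made for illustration. The corpus is modelled as a plain mapping from (law, article) to article text.

```python
# Hypothetical wording for the three reference types described above.
ROLE = {
    "definition": "defines a term used in the source article",
    "scope": "determines who or what falls within scope",
    "procedural": "contains the applicable procedural rules",
}


def crossref_pairs(ref, corpus):
    """Build three question variants for one classified cross-reference.

    ref: dict with source_law, source_article, target_law, target_article, kind.
    corpus: dict mapping (law, article) to article text.
    """
    source = f"Art. {ref['source_article']} {ref['source_law']}"
    target = f"Art. {ref['target_article']} {ref['target_law']}"
    answer = f"{source} cites {target}, which {ROLE[ref['kind']]}."
    # Include the target text only when the target law is already in the corpus.
    target_text = corpus.get((ref["target_law"], ref["target_article"]))
    if target_text is not None:
        answer += f" {target} reads: {target_text}"
    questions = [
        f"Which provision does {source} refer to?",
        f"Why does {source} cite {target}?",
        f"What role does {target} play for {source}?",
    ]
    return [{"question": q, "answer": answer} for q in questions]


# Placeholder reference and corpus entry, not real statutory content.
ref = {
    "source_law": "LAW_X", "source_article": "4",
    "target_law": "LAW_Y", "target_article": "2",
    "kind": "definition",
}
corpus = {("LAW_Y", "2"): "Placeholder text of the target article."}
for pair in crossref_pairs(ref, corpus):
    print(pair)
```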
Training data (Amendments)
Each law carries two kinds of temporal information. The document metadata records when the law was passed, when it came into force and which consolidated version is being read. Individual articles carry footnotes recording when they were modified, inserted or repealed, and by which amending law. Both sources are extracted and turned into questions about version dates and amendment history. This gives the model the ability to answer questions like "which version of this article applied before March 2024", a common need in compliance work.
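Answering a "which version applied on date X" question reduces to a lookup in a sorted amendment history. The sketch below assumes each article's history has already been extracted as a date-sorted list of (effective date, label) pairs. The function name and the sample history are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def version_in_force(versions, on):
    """Return the label of the version applicable on the given date.

    versions: list of (effective_date, label), sorted by date ascending.
    Returns None if the article was not yet in force on that date.
    """
    effective_dates = [effective for effective, _ in versions]
    # Index of the first version that starts strictly after the query date.
    index = bisect_right(effective_dates, on)
    return versions[index - 1][1] if index else None


# Placeholder history, not taken from any real article.
history = [
    (date(2020, 1, 1), "original wording"),
    (date(2024, 3, 1), "wording as amended by a hypothetical amending act"),
]
print(version_in_force(history, date(2024, 2, 29)))
print(version_in_force(history, date(2024, 3, 1)))
```

A repeal can be modelled as one more entry in the history with a label such as "repealed", so the same lookup covers all three footnote types.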
Training data (Manually curated examples)
The automated extraction tracks produce questions that follow directly from statutory structure. They do not cover the kinds of questions practitioners actually ask: how two statutes interact in a specific fact pattern, what FINMA expects in practice beyond the text of the law, or where a common interpretation is wrong. A separate set of question-answer pairs will be written by hand to fill these gaps.

Evaluation set
A further set of question-answer pairs will be written and held back from training entirely. Its sole purpose is to measure what the model has actually learned after fine-tuning.
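One simple safeguard for keeping the evaluation set out of training is a duplicate check on normalised question text. The sketch below is only one possible approach and is not described in the plan itself. The function names are hypothetical, and it catches near-verbatim overlap only, not paraphrases.

```python
import re


def normalise(question):
    """Lower-case and collapse punctuation and whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()


def leaked_questions(train_pairs, eval_pairs):
    """Return evaluation questions that also appear in the training data."""
    seen = {normalise(pair["question"]) for pair in train_pairs}
    return [
        pair["question"]
        for pair in eval_pairs
        if normalise(pair["question"]) in seen
    ]


# Placeholder data for illustration.
train = [{"question": "What is a placeholder term?", "answer": "..."}]
held_out = [
    {"question": "what is a placeholder  term", "answer": "..."},
    {"question": "How do two placeholder laws interact?", "answer": "..."},
]
print(leaked_questions(train, held_out))
```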
Fine-tuning
Once the training and evaluation datasets are assembled, a base model will be fine-tuned on the full set of training examples. A parameter-efficient method will be used that adjusts only a small fraction of the model's weights rather than retraining it from scratch, keeping compute time and cost low. To prevent the model from losing its general language ability while gaining legal specialisation, a small proportion of general-purpose examples will be mixed into the training data.
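The mixing step can be sketched in a few lines. The 10% share used as the default here is an assumed value for illustration, since the plan only speaks of a small proportion. The function name and the fixed seed are also hypothetical.

```python
import random


def mix_in_general(legal, general, general_fraction=0.1, seed=0):
    """Add general-purpose examples so they form about general_fraction of the mix.

    All legal examples are kept; general examples are sampled without
    replacement, capped at however many are available.
    """
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    # Solve n / (len(legal) + n) = general_fraction for n.
    wanted = round(len(legal) * general_fraction / (1 - general_fraction))
    n_general = min(wanted, len(general))
    mixed = list(legal) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder examples standing in for real training records.
legal_examples = [("legal", i) for i in range(90)]
general_examples = [("general", i) for i in range(50)]
mixed = mix_in_general(legal_examples, general_examples)
print(len(mixed), sum(1 for kind, _ in mixed if kind == "general"))
```

Shuffling after mixing spreads the general-purpose examples across the run instead of clustering them at the end.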
