Tues May 14, 2024 11:45am PST
Show HN: EmuBert – the first open encoder model for Australian law
Hey HN, I'm excited to share one of my most ambitious projects yet, EmuBert.

EmuBert is the largest and most accurate open-source masked language model for Australian law.

EmuBert was trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, all drawn from the Open Australian Legal Corpus, the largest open-source database of Australian law. That makes it well suited for tasks like:
- text classification;
- name extraction;
- question answering;
- text similarity;
- semantic search; and
- text embedding.
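As a concrete illustration of the text-embedding use case, here is a minimal sketch of pulling a sentence embedding out of EmuBert with the Hugging Face `transformers` library by mean-pooling its final hidden states. The pooling strategy and the example sentence are my assumptions, not something the post prescribes; only the model id (`umarbutler/emubert`) comes from the links below.

```python
# Hedged sketch: sentence embeddings via mean pooling over EmuBert's
# last hidden states. Assumes `transformers` and PyTorch are installed;
# mean pooling is one common choice, not the author's stated method.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("umarbutler/emubert")
model = AutoModel.from_pretrained("umarbutler/emubert")

# Hypothetical input text, used purely for illustration.
inputs = tokenizer("The Governor-General assents to bills passed by Parliament.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Mean-pool only over real (non-padding) tokens.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # shape: (1, hidden_dim)
```

Embeddings produced this way can then be compared with cosine similarity for the text-similarity and semantic-search tasks listed above.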

Not only that: despite being trained only to guess missing words, EmuBert appears to have picked up real facts. It seems to know that Norfolk Island is an Australian territory (try the prompt 'Norfolk Island is an Australian <mask>.'); that it is Section 51 of the Constitution that grants Parliament the power to make laws for the peace, order, and good government of the Commonwealth ('Section <mask> of the Constitution grants the Australian Parliament the power to make laws for the peace, order, and good government of the Commonwealth.'); and that the representative of the monarch of Australia is the Governor-General ('The representative of the monarch of Australia is the <mask>-General.').
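The prompts above can be tried directly with the `transformers` fill-mask pipeline; a minimal sketch (assuming `transformers` and a PyTorch backend are installed, with the model id taken from the Hugging Face link below):

```python
# Sketch of querying EmuBert with one of the post's own prompts via the
# fill-mask pipeline; predictions come back ranked by score.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="umarbutler/emubert")
predictions = fill_mask("Norfolk Island is an Australian <mask>.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

Each prediction is a dict containing the candidate token (`token_str`), its probability (`score`), and the fully filled-in sentence (`sequence`).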

Finally, EmuBert achieves a perplexity of 2.05 on Open Australian Legal QA, the first open dataset of Australian legal questions and answers, outperforming all known state-of-the-art masked language models, including RoBERTa, BERT and Legal-BERT.
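For readers unfamiliar with the metric: for a masked language model, (pseudo-)perplexity is the exponential of the mean negative log-likelihood the model assigns to the held-out (masked) tokens, so lower is better and 2.05 means the model is, on average, about as uncertain as a choice between two equally likely tokens. A minimal sketch of the metric itself (this is the standard definition, not the author's evaluation code, which lives in the GitHub repo below):

```python
# Sketch of the (pseudo-)perplexity metric: exp of the mean
# negative log-likelihood over the masked tokens being predicted.
import math

def pseudo_perplexity(token_nlls: list[float]) -> float:
    """token_nlls: negative log-likelihoods, one per masked token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If the model assigns each masked token probability 1/2, the NLL per
# token is ln 2 and the perplexity is 2 (a coin flip per token):
print(pseudo_perplexity([math.log(2)] * 4))  # ≈ 2.0
```

By that intuition, EmuBert's 2.05 corresponds to roughly a two-way choice per masked token on legal text.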

You can check out EmuBert on Hugging Face here: https://huggingface.co/umarbutler/emubert

The code I used to create EmuBert is also openly available on GitHub: https://github.com/umarbutler/emubert-creator
