
Projects

BitNet on GPT

An attempt to implement the BitNet paper on GPT, built on top of NanoGPT in PyTorch. It also contains an implementation of "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits".

I have not yet been able to train it at a large scale.
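For context, the core idea of BitNet b1.58 is to replace nn.Linear with a layer whose weights are quantized to {-1, 0, +1} during the forward pass while full-precision latent weights are kept for the optimizer. Below is a minimal, illustrative sketch of such a layer in PyTorch; the class name, initialization, and the per-tensor activation quantization are my own simplifications, not the repository's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BitLinear158(nn.Module):
    """Illustrative 1.58-bit linear layer: ternary weights, 8-bit activations."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Latent full-precision weights are what the optimizer updates.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.norm = nn.LayerNorm(in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x)

        # Absmean quantization of the weights to {-1, 0, +1}.
        w = self.weight
        w_scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / w_scale).round().clamp(-1, 1)
        # Straight-through estimator: quantized values in the forward pass,
        # gradients flow to the full-precision latent weights.
        w_q = w + (w_q - w).detach()

        # Absmax quantization of activations to 8 bits (per tensor, for brevity).
        a_scale = 127.0 / x.abs().max().clamp(min=1e-5)
        x_q = (x * a_scale).round().clamp(-128, 127)
        x_q = x + (x_q - x).detach()

        # Rescale so the output matches the magnitude of the unquantized matmul.
        return F.linear(x_q, w_q) * (w_scale / a_scale)
```

In a NanoGPT-style model, a layer like this would stand in for the linear projections inside the attention and MLP blocks.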

My LazyVim Config

A perfectly curated Neovim config. Built with Neovim, LazyVim and Mason.

AI on Web

Running AI models on the web.

Using onnxruntime-web to run BERT for sentiment analysis in the browser.

This repository contains a simple example of using ONNX (Open Neural Network Exchange) with the microsoft/xtremedistil-l6-h256-uncased model. The model code is located in onnx/model.py, and the exported classifier is provided in both onnx/classifier.onnx and onnx/classifier_int8.onnx formats.

Model Information

  • Model Used: microsoft/xtremedistil-l6-h256-uncased
  • ONNX Model Location: onnx/model.py
  • Exported Classifier Models: onnx/classifier.onnx and onnx/classifier_int8.onnx
  • Colab Notebook: notebook
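As a rough illustration of how the two exported files could be produced, here is a hedged sketch of exporting the Hugging Face checkpoint to ONNX and then applying dynamic int8 quantization. The paths, num_labels, and opset version are assumptions for the example, and the real classifier would be fine-tuned on sentiment data before export; the repository's onnx/model.py may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from onnxruntime.quantization import quantize_dynamic, QuantType

name = "microsoft/xtremedistil-l6-h256-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
# Assumption: the classifier head is fine-tuned for sentiment before export;
# here we only load the architecture to show the export step itself.
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2, return_dict=False
)
model.eval()

# Dummy inputs trace the graph; dynamic axes keep batch/sequence flexible.
dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "onnx/classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

# Dynamic int8 quantization shrinks the weights for faster in-browser loading.
quantize_dynamic(
    "onnx/classifier.onnx", "onnx/classifier_int8.onnx", weight_type=QuantType.QInt8
)
```

The int8 file is the one you would typically serve to onnxruntime-web, since download size dominates startup time in the browser.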

transformer

Just another implementation of the Transformer model as introduced in the paper "Attention Is All You Need". This is a step-by-step walkthrough of building a transformer.

A tutorial project for understanding how the transformer works.
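The heart of any such walkthrough is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The sketch below is an illustrative multi-head self-attention module in PyTorch, not the repository's exact code; the names and shapes are my own.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head scaled dot-product self-attention."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x, mask=None):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split each projection into (batch, heads, time, head_dim).
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        # softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        # Re-merge the heads and project back to d_model.
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```

A full encoder block wraps this in residual connections, layer normalization, and a position-wise feed-forward network.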

trBPE: A Byte Pair Encoder tailored for Turkish

The current landscape of Large Language Models (LLMs) predominantly caters to the English language. This bias can be attributed to extensive training on English datasets and the efficacy of English tokenization. Notably, OpenAI's tokenizer for GPT-4 excels at contextualizing tokens along syllabic divisions, enhancing comprehension and generation capabilities.

However, for languages like Turkish this advantage diminishes, because tokenization becomes effectively random. To address this, this repository develops a BPE tokenizer tailored to Turkish, trained on rich Turkish-language datasets.

This was used by KomRade in the competition...

This is an attempt to replicate the methods outlined in this paper, with the following exceptions:

  • Non-agglutinative pieces are preceded by a space, and agglutinative pieces are not prefixed with #.
  • Tokenization is case-insensitive.
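
To make the idea concrete, here is a toy sketch of the BPE merge-learning loop, lowercasing words and marking word starts with a leading space in line with the exceptions above. It is illustrative only; the actual trBPE training code, corpus handling, and agglutination-aware rules live in the repository.

```python
from collections import Counter


def learn_bpe(corpus_words, num_merges):
    """Toy BPE merge learning: repeatedly merge the most frequent symbol pair."""
    # Start from character-level symbols; a leading space marks word starts,
    # so suffix pieces (common in agglutinative Turkish) need no '#' prefix.
    vocab = Counter(
        tuple(" " + word.lower()) for word in corpus_words  # case-insensitive
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Example: merges learned from a tiny Turkish word list (hypothetical data).
print(learn_bpe(["evler", "evlerde", "evde", "okullarda"], 10))
```

On real corpora the same loop runs for tens of thousands of merges, which is where a Turkish-specific training set pays off over an English-centric vocabulary.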