shidiq


Process: Day 110


Tokenization

is the process of breaking a text document or word sequence down into smaller units called tokens. Tokenization is a step in NLP to transform natural language into a format that ML models can understand.

A token limit is a restriction on the number of tokens an LLM can process in a single interaction.

Tokenization algorithms

  • Byte Pair Encoding (BPE)
  • WordPiece
  • SentencePiece
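As a rough illustration of how Byte Pair Encoding works, here is a sketch in plain Python (not a real tokenizer library): starting from characters, it repeatedly merges the most frequent adjacent pair of symbols into a new token.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the token sequence
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with the merged symbol
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("banana")   # start from individual characters
pair = most_frequent_pair(tokens)   # ('a', 'n') appears most often
tokens = merge_pair(tokens, pair)
print(tokens)   # ['b', 'an', 'an', 'a']
```

Real BPE learns these merge rules from a large corpus and stores them as a fixed vocabulary; at inference time the learned merges are replayed in order on new text.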

Embedding

is the process of converting a word into a vector that can be processed or manipulated within the model, capturing the nuances of language and its associations.

The softmax function takes a vector of numbers as input and produces another vector of the same dimension as output, with non-negative entries that sum to one, so the result can be read as a probability distribution.
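A minimal softmax in plain Python, subtracting the maximum before exponentiating (the standard trick to avoid numerical overflow):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # three non-negative values, largest input -> largest probability
print(sum(probs))   # ~1.0
```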

Embedding models:

  • Word Embeddings
  • Sentence Embeddings
  • Image Embeddings
  • Document Embeddings
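To see what "capturing associations" means in practice, here is a toy example with made-up 3-dimensional word vectors (real embeddings have hundreds of dimensions and are learned from data): related words end up with a higher cosine similarity.

```python
import math

# Toy 3-dimensional word embeddings (hypothetical values, for illustration only)
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```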

Transformers

refers to a specific type of neural network architecture, the “transformer model”, designed to process sequential data such as natural language sequences.

Transformers were first introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017 and have since become the foundation for a wide range of state-of-the-art NLP models.

A recurrent neural network is a class of neural networks that includes weighted connections within a layer (compared with traditional feed-forward networks, where connections feed only to subsequent layers).

A convolutional neural network is an extension of artificial neural networks (ANN) and is predominantly used for image recognition-based tasks.

The transformer model consists of two main parts: an encoder and a decoder. The encoder processes the input sequence and produces a sequence of hidden states, while the decoder takes this sequence as input and produces an output sequence. By training the model on a large corpus of data, the encoder learns to encode the text in a way that captures important information and patterns, while the decoder learns to generate output text that is semantically and grammatically correct.
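The mechanism that lets both encoder and decoder relate tokens to each other is scaled dot-product attention, softmax(QK^T / sqrt(d)) V. A bare-bones sketch in plain Python (real implementations use tensor libraries and learned projection matrices):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)            # how much each position is attended to
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

Q = [[1.0, 0.0]]                   # one query, aligned with the first key
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))          # output is pulled toward the first value vector
```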

Encoder

The encoder receives the input sequence and transforms it into a representation that captures its meaning. This input sequence can be a sentence, a paragraph, or any other sequential data.

An encoder is a component or module that processes input data and converts it into a structured representation that can be understood and used by the model.

Decoder

The decoder takes the encoded input sequence produced by the encoder and generates the output sequence. The decoding process starts with a special “start token” as the initial input. The decoder generates the output tokens one by one, with each token being conditioned on the previously generated tokens and the encoded input sequence.

A decoder is a component or module responsible for processing the structured encoding generated by the encoder and producing the output sequence.
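The token-by-token loop can be sketched with a hypothetical stand-in for the decoder: a toy lookup table that scores each candidate next token given the last one generated (real decoders compute these scores, the logits, from the whole prefix and the encoded input).

```python
# Hypothetical toy "model": scores for each candidate next token,
# a stand-in for real decoder logits.
VOCAB = ["<start>", "hello", "world", "<end>"]

def toy_logits(generated):
    last = generated[-1]
    table = {
        "<start>": [0.0, 3.0, 1.0, 0.1],   # after <start>, "hello" scores highest
        "hello":   [0.0, 0.1, 3.0, 1.0],   # after "hello", "world" scores highest
        "world":   [0.0, 0.1, 0.2, 3.0],   # after "world", "<end>" scores highest
    }
    return table[last]

def greedy_decode(max_len=10):
    generated = ["<start>"]                # start token seeds the loop
    for _ in range(max_len):
        logits = toy_logits(generated)
        next_token = VOCAB[logits.index(max(logits))]   # greedily pick the best
        if next_token == "<end>":
            break
        generated.append(next_token)
    return generated[1:]

print(greedy_decode())   # ['hello', 'world']
```

Greedy decoding always takes the single highest-scoring token; practical systems often use sampling or beam search over the same loop instead.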

Risks and ethical concerns

  • Bias and Fairness
  • Hallucinations (or falsehoods)
  • Ethical use
  • Lack of explainability
  • Data privacy
  • Environmental impact
  • Privacy and security concerns
  • Social manipulation
  • Regulatory and policy challenges

© 2026 Shidiq. All rights reserved.