
Friday, September 5, 2025

Natural Language Processing (NLP)

1) What is NLP? (Scope & Goals)

  • Definition: Field of AI that enables computers to understand, generate, and interact using human language.

  • Modalities: Text (documents, chat, code‑mixed social media) and Speech (ASR = speech→text, TTS = text→speech).

  • End goals: Information extraction, question answering, translation, summarization, dialogue systems, sentiment analysis, content moderation, retrieval‑augmented generation, etc.


2) End‑to‑End NLP Lifecycle

A. Problem framing → B. Data → C. Preprocessing → D. Linguistic processing → E. Feature/Embedding → F. Modeling & Training → G. Evaluation → H. Deployment → I. Monitoring & Iteration.

A compact flow: raw text → clean/normalize → tokenize → represent (TF‑IDF/embeddings) → model → evaluate → deploy → monitor.


3) Data & Corpus Management

  • Sourcing: Open corpora, web crawl, logs (with consent), domain documents, transcribed audio.

  • Licensing & Privacy: Respect copyright, PII redaction, consent for user data.

  • Annotation: Gold labels for tasks (e.g., sentiment, entities, intent). Use guidelines, inter‑annotator agreement (Cohen’s κ), adjudication; a small agreement check is sketched at the end of this section.

  • Splits: Train / Validation (dev) / Test. Avoid leakage; stratify by class; for time‑series, split chronologically.

  • Cleaning: Deduplicate, remove boilerplate, fix encoding, handle emojis/URL/markup, normalize Unicode (NFC/NFKC).

  • Class imbalance: Weighted loss, resampling, focal loss; data augmentation (back‑translation, synonym replacement, noise injection).

Note on multilingual/code‑mixed text: Use language ID, script detection, transliteration (e.g., Hinglish → Hindi/English), and tokenizers that support multiple scripts.
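
A minimal sketch of two of the checks above: Cohen’s κ for inter‑annotator agreement (scikit‑learn) and a chronological split for time‑ordered data. The toy labels and documents are illustrative placeholders for your own corpus.

# Hedged sketch: inter-annotator agreement + chronological split.
from sklearn.metrics import cohen_kappa_score

labels_a = ["pos", "neg", "pos", "neu", "neg", "pos"]   # annotator A
labels_b = ["pos", "neg", "neu", "neu", "neg", "pos"]   # annotator B
print("Cohen's kappa:", round(cohen_kappa_score(labels_a, labels_b), 2))

# Chronological split (avoid leaking the future into training); `docs` is a toy corpus.
docs = [{"text": "t1", "timestamp": "2025-01-05"},
        {"text": "t2", "timestamp": "2025-01-02"},
        {"text": "t3", "timestamp": "2025-01-09"},
        {"text": "t4", "timestamp": "2025-01-07"},
        {"text": "t5", "timestamp": "2025-01-11"}]
docs_sorted = sorted(docs, key=lambda d: d["timestamp"])
cut = int(0.8 * len(docs_sorted))
train_docs, test_docs = docs_sorted[:cut], docs_sorted[cut:]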


4) Text Preprocessing (Normalization Pipeline)

  1. Document segmentation: Split corpus into documents/sentences (rule‑based/ML models).

  2. Tokenization: Word/character/subword (BPE, WordPiece, SentencePiece) to handle OOV and morphology.

  3. Case‑folding & diacritics: Lowercasing where appropriate; be careful with NER, acronyms, and scripts where case is meaningful.

  4. Noise handling: Remove/transform URLs, mentions, hashtags, HTML, emojis (map to tokens), punctuation (task‑dependent).

  5. Spelling normalization & slang: Correction; expand contractions; normalize variants (colour/color). For social text, keep expressive tokens if predictive.

  6. Stop‑words: Optional removal; avoid for transformer models and tasks needing function words.

  7. Stemming vs Lemmatization:

    • Stemming: heuristic suffix chopping (e.g., compute, computer, computing → comput).

    • Lemmatization: vocabulary + morphology aware (e.g., better → good).
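
A quick contrast of the two in code: a minimal NLTK sketch (assumes the WordNet data has been downloaded via nltk.download).

# Minimal sketch: stemming vs lemmatization with NLTK.
# One-time setup: nltk.download('wordnet') (plus 'omw-1.4' on some NLTK versions).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for w in ["compute", "computer", "computing"]:
    print(w, "->", stemmer.stem(w))              # all three reduce to the stem "comput"

print(lemmatizer.lemmatize("better", pos="a"))   # "good" (adjective lemma via WordNet)
print(lemmatizer.lemmatize("running", pos="v"))  # "run"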


5) Linguistic Processing (Classical NLP)

  • POS Tagging: Assign word classes (NN, VB, JJ…). Tagsets: Penn Treebank, Universal Dependencies.

  • Chunking/Shallow Parsing: Group tokens (NP, VP, PP) using BIO tagging.

  • Named Entity Recognition (NER): Detect entities (PER, ORG, LOC, GPE, DATE, MONEY…).

  • Morphology: Lemmas, affixes, features (number, gender, case), especially for morphologically rich languages.

  • Syntactic Parsing:

    • Constituency: Build phrase‑structure trees.

    • Dependency: Head‑dependent arcs; useful for relation extraction.

  • Coreference Resolution: Link mentions that refer to the same entity (“Rahul… he…”).

  • Word Sense Disambiguation (WSD): Select the right sense for polysemous words (“bank” = river vs finance).

  • Semantic Role Labeling (SRL): Who did what to whom, when, where (predicate‑argument structure).

  • Discourse: Coherence relations across sentences (RST), topic segmentation.

These layers can be features for classical ML or learned implicitly by deep models.
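
Several of these layers are exposed directly by spaCy. A minimal sketch, assuming the small English model is installed (python -m spacy download en_core_web_sm); the entity labels in the comment are typical outputs, not guarantees.

# Minimal sketch: POS tags, dependency arcs, and named entities with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Rahul joined Google in Bangalore in March 2021.")

for tok in doc:
    print(tok.text, tok.pos_, tok.dep_, "<-", tok.head.text)   # POS + head-dependent arc
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g., Rahul/PERSON, Google/ORG, Bangalore/GPE, March 2021/DATE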


6) Feature Engineering & Representations

  • Bag of Words (BoW) / n‑grams: Counts or presence; simple, strong baselines.

  • TF‑IDF: Weighs rare but informative terms (a tiny worked example follows this list).
    Formula: TFIDF(t,d) = TF(t,d) × log( N / (1 + DF(t)) )

  • Distributional vectors:

    • Static embeddings: word2vec (CBOW/Skip‑gram), GloVe, fastText (subword aware).

    • Contextual embeddings: ELMo, BERT‑family (encoder), GPT‑family (decoder), T5/Marian (encoder‑decoder). Tokens’ vectors depend on context.

  • Sentence/Document embeddings: Pooling, Sentence‑BERT, averaging, CLS token.

  • Character/Subword features: Tackle misspellings, OOV, morphology.
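
A tiny worked example of the TF‑IDF formula above, in plain Python. Library implementations (e.g., scikit‑learn’s TfidfVectorizer) use slightly different smoothing, so treat this as illustrative of the idea rather than of any particular library.

# Illustrative sketch: TF-IDF exactly as defined above: TF(t,d) x log(N / (1 + DF(t))).
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))      # document frequency per term

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)                # normalized term frequency
    return tf * math.log(N / (1 + df[term]))

print(round(tfidf("sat", docs[0]), 3))  # appears in only one doc -> positive weight
print(round(tfidf("the", docs[0]), 3))  # appears in every doc -> downweighted (non-positive here)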


7) Modeling Paradigms

A) Classical ML

  • Classification: Naive Bayes, Logistic Regression, Linear SVM.

  • Sequence labeling: HMM, CRF; popular for POS/NER (BiLSTM‑CRF = hybrid).

  • Topic modeling: LDA for unsupervised themes.

B) Neural & Deep Learning

  • RNN/LSTM/GRU: Sequence modeling; BiLSTM for context both sides.

  • CNN for text: Local n‑gram features; strong for classification.

  • Attention: Focus on salient tokens. Scaled dot‑product attention (conceptually: query–key similarity → weights → value sum); a NumPy sketch follows this list.

  • Transformers: Self‑attention layers; train with:

    • Encoder‑only (e.g., BERT) → understanding tasks via masked language modeling + fine‑tuning.

    • Decoder‑only (e.g., GPT) → generation via next‑token prediction; prompting/few‑shot learning.

    • Encoder‑decoder (e.g., T5, Marian) → seq2seq tasks (translation, summarization).

  • Decoding strategies: Greedy, beam search, length penalty; stochastic: top‑k, nucleus (top‑p), temperature.

  • Multitask/Multilingual: Shared parameters, adapters; XLM‑R, mBERT.
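
The attention bullet above can be made concrete in a few lines of NumPy. A minimal sketch of scaled dot‑product attention (single head, no masking, random toy inputs):

# Minimal sketch: scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarity
    weights = softmax(scores, axis=-1)   # attention weights per query token
    return weights @ V                   # weighted sum of values

seq_len, d_k = 4, 8
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
print(attention(Q, K, V).shape)          # (4, 8): one contextualized vector per token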

C) Retrieval‑Augmented Generation (RAG)

  • Index domain documents → Embed → Retrieve top‑k → (Re)Rank → Generate grounded answer; improves factuality & freshness.
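
A minimal, self‑contained RAG sketch. Here embed() is a toy hashing bag‑of‑words stand‑in for a real embedding model and generate() is a stub for your LLM call; retrieval is plain cosine similarity.

# Hedged RAG sketch: index -> embed -> retrieve top-k -> generate a grounded answer.
import numpy as np

def embed(text, dim=64):                 # toy stand-in for a real embedding model
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def generate(prompt):                    # stub: replace with your LLM of choice
    return "LLM answer grounded in:\n" + prompt

docs = ["NLP covers text and speech.",
        "RAG retrieves documents before generating.",
        "BLEU and ROUGE evaluate generated text."]
doc_vecs = np.stack([embed(d) for d in docs])          # the "index"

def retrieve(query, k=2):
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

query = "How does RAG improve factuality?"
context = "\n".join(retrieve(query))
print(generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}"))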


8) Task Archetypes & What Changes in the Pipeline

  • Text Classification (sentiment/toxicity/intent): tokenization → vectorize → classifier.

  • Sequence Labeling (POS/NER/Chunking): BIO tagging; per‑token predictions, CRF layer often helpful.

  • Span Extraction / QA: Predict start/end indices over context.

  • Sequence‑to‑Sequence (MT, summarization, data‑to‑text): encoder‑decoder + attention; careful decoding.

  • Information Extraction: NER + relation extraction + event extraction.

  • Dialogue/Chatbots: NLU (intent, slots) + Policy + NLG; or end‑to‑end with LLM + tools.

  • Search/Retrieval: BM25 or dense retrievers (dual encoders, ColBERT); rerankers (cross‑encoders).

  • Speech:

    • ASR: audio → features (MFCC/log‑mels) → acoustic model (CTC/Transducer/attention) → language model → text.

    • TTS: text → phonemes → acoustic model (Tacotron/FastSpeech) → vocoder (WaveNet/HiFi‑GAN) → audio.


9) Training Workflow (Supervised Example)

  1. Define objective (e.g., F1 on minority class ≥ 0.80).

  2. Prepare data (split, balance, augment, label quality checks).

  3. Tokenizer/Vectorizer setup (TF‑IDF or subword model).

  4. Model selection (baseline NB/SVM → Transformer fine‑tune for lift).

  5. Optimization: Adam/AdamW; schedule (linear warmup/decay); batch size, max length. A Transformers fine‑tuning sketch follows this list.

  6. Regularization: Dropout, weight decay, early stopping, gradient clipping, mixout.

  7. Hyperparameter search: learning rate, epochs, class weights; use dev set.

  8. Reproducibility: Fix seeds, log configs, save checkpoints & tokenizer.
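
A hedged sketch of steps 3–6 with Hugging Face Transformers. Here train_ds and val_ds are placeholders for pre‑tokenized datasets with a labels column, and argument names follow recent library versions, so check yours (e.g., evaluation_strategy vs eval_strategy).

# Hedged sketch: fine-tuning a small transformer for 3-class sentiment.
# `train_ds` and `val_ds` are placeholders for tokenized datasets prepared upstream.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, EarlyStoppingCallback)

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,                  # typical fine-tuning LR (AdamW is the default optimizer)
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,                   # regularization
    warmup_ratio=0.1,                    # linear warmup, then decay
    evaluation_strategy="epoch",         # named eval_strategy in some newer versions
    save_strategy="epoch",
    load_best_model_at_end=True,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
trainer.train()
trainer.save_model("out/best")           # reproducibility: checkpoint + config
tokenizer.save_pretrained("out/best")    # always ship the exact tokenizer with the model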


10) Evaluation & Error Analysis

  • Classification: Accuracy, Precision/Recall/F1 (macro/micro), ROC‑AUC; confusion matrix.

  • Seq labeling: Token/Entity F1 (exact span match rules!).

  • QA (extractive): Exact Match (EM), F1 overlap.

  • Generation: BLEU/METEOR/TER for MT; ROUGE‑1/2/L for summarization; BERTScore, COMET; human eval (fluency, adequacy, factuality). A metric snippet follows this list.

  • Language modeling: Perplexity.

  • ASR: WER/CER.

  • Fairness & Safety: Group‑wise metrics, toxicity rates, stereotype tests, PII leakage.
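
A small sketch of two generation metrics mentioned above: sentence‑level BLEU via NLTK and ROUGE via the rouge-score package (assumed installed: pip install rouge-score). Scores on a single pair are only illustrative.

# Hedged sketch: BLEU and ROUGE for one hypothesis/reference pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference  = "the cat sat on the mat"
hypothesis = "the cat is on the mat"

bleu = sentence_bleu([reference.split()], hypothesis.split(),
                     smoothing_function=SmoothingFunction().method1)
print("BLEU:", round(bleu, 3))

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)     # score(target, prediction)
print("ROUGE-1 F1:", round(scores["rouge1"].fmeasure, 3))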

Error analysis loop: Sample failures → categorize (tokenization, OOV, long context, negation, sarcasm, code‑mixing, domain shift) → data/feature/model fixes → re‑test.


11) Deployment & MLOps for NLP

  • Packaging: Export model + tokenizer + config; quantize or distill for latency.

  • Serving: REST/gRPC; batching; streaming for ASR; caching hot prompts. A minimal FastAPI example follows this list.

  • Observability: Track throughput/latency, success rates, drift (embedding shift, vocabulary changes), hallucination/factuality for LLMs.

  • Guardrails: Input validation, language ID, PII redaction, profanity/toxicity filters, prompt shields, rate limits.

  • Retraining cadence: Active learning (human‑in‑the‑loop), weak supervision, feedback loops.
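
One way to wrap a packaged model behind a REST endpoint: a minimal FastAPI sketch, assuming the model and vectorizer were saved together as in the section 17 example (the filename is illustrative).

# Hedged sketch: serving a saved scikit-learn classifier + vectorizer with FastAPI.
# Run with: uvicorn serve:app --port 8000   (module name `serve` is illustrative)
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model, vectorizer = joblib.load("sentiment_svm.joblib")   # saved as one artifact at train time

class Query(BaseModel):
    text: str

@app.post("/predict")
def predict(q: Query):
    X = vectorizer.transform([q.text])
    margin = float(abs(model.decision_function(X)).max())  # crude confidence proxy
    return {"label": str(model.predict(X)[0]), "margin": margin}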


12) Worked Mini‑Pipelines (Concrete Examples)

A) Sentiment Classifier (Tweets/Reviews)

  1. Collect & label data (pos/neg/neutral) → split.

  2. Normalize (URLs, emojis → tokens), subword tokenize.

  3. Baseline TF‑IDF + Linear SVM; log F1.

  4. Fine‑tune a small transformer (e.g., DistilBERT) with class weights.

  5. Evaluate macro‑F1; inspect confusion cases (sarcasm, negation scope).

  6. Deploy with thresholding + abstain policy for low confidence.
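
Step 6's abstain policy can be as simple as a threshold on the linear model's decision margin; a hedged sketch (the threshold value is illustrative and should be tuned on the dev set):

# Hedged sketch: abstain when the SVM decision margin is low.
import numpy as np

THRESHOLD = 0.25   # illustrative; tune on validation data

def predict_or_abstain(model, vectorizer, text):
    X = vectorizer.transform([text])
    margin = float(np.max(np.abs(model.decision_function(X))))
    if margin < THRESHOLD:
        return "abstain"            # route to a human or a fallback flow
    return model.predict(X)[0]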

B) NER for Invoices (ORG, DATE, AMOUNT)

  1. Annotate spans with BIO scheme; handle currency formats.

  2. Train BiLSTM‑CRF or fine‑tune encoder‑only transformer.

  3. Post‑process with regex/validators (dates, currency sums).

  4. Evaluate span‑level F1; audit for privacy.
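
What the BIO scheme and the regex post‑processing might look like for a single invoice fragment (the tags and the pattern are illustrative, not a full spec):

# Illustrative sketch: BIO tags for an invoice fragment + a simple amount validator.
import re

tokens = ["Invoice", "from", "Acme",  "Corp",  ",", "total", "INR",      "1,250.00"]
bio    = ["O",       "O",    "B-ORG", "I-ORG", "O", "O",     "B-AMOUNT", "I-AMOUNT"]

AMOUNT_RE = re.compile(r"^\d{1,3}(,\d{3})*(\.\d{2})?$")   # e.g., 1,250.00

def validate_amount(span_tokens):
    # keep a predicted AMOUNT span only if its numeric part parses cleanly
    return any(AMOUNT_RE.match(t) for t in span_tokens)

print(validate_amount(["INR", "1,250.00"]))   # True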

C) Abstractive Summarization (News)

  1. Build paired (article, summary) dataset; length control.

  2. Fine‑tune encoder‑decoder; use coverage loss or contrastive reranking to reduce hallucination.

  3. Decode with beam search + length penalty; evaluate ROUGE & human judgments.
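
Step 3's decoding choices map directly onto generate() arguments in Hugging Face Transformers; a hedged sketch with an illustrative summarization checkpoint and a placeholder input article:

# Hedged sketch: beam search with a length penalty for abstractive summarization.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "facebook/bart-large-cnn"     # illustrative summarization checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

article_text = "Long news article text goes here ..."   # placeholder input
inputs = tokenizer(article_text, return_tensors="pt", truncation=True, max_length=1024)
ids = model.generate(**inputs,
                     num_beams=4,               # beam search
                     length_penalty=2.0,        # discourage overly short outputs
                     no_repeat_ngram_size=3,    # cheap repetition guard
                     max_new_tokens=128)
print(tokenizer.decode(ids[0], skip_special_tokens=True))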


13) Typical Pitfalls & Remedies

  • Tokenization mismatch: Always save and ship the exact tokenizer with the model.

  • Too much cleaning: Over‑aggressive stop‑word/punctuation removal can hurt.

  • Domain shift: Use domain adaptation, RAG, or continual fine‑tuning.

  • Class imbalance: Use weighted loss, focal loss, or data augmentation.

  • Long context: Use long‑context transformers, chunk + overlap (helper sketched after this list), or retrieval.

  • Sarcasm/Irony: Add specialized data, context windows, pragmatics cues.

  • Multilingual/code‑mix: Use multilingual encoders; transliteration; script‑aware tokenizers.
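
A tiny helper for the chunk‑and‑overlap remedy above; the sizes are illustrative, and the input is assumed to be a list of tokens produced by your tokenizer.

# Illustrative sketch: split a long token sequence into overlapping chunks.
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_with_overlap(list(range(1200)), chunk_size=512, overlap=64)
print([len(c) for c in chunks])   # [512, 512, 304]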


14) Tools & Libraries (by category)

  • Preprocessing/Classic NLP: NLTK, spaCy, Stanza.

  • Transformers & Training: Hugging Face Transformers/PEFT, PyTorch, TensorFlow, Keras, OpenNMT, Fairseq.

  • Tokenization: SentencePiece, Hugging Face Tokenizers.

  • Speech: Kaldi, ESPnet, wav2vec 2.0 toolchains, Coqui‑TTS.

  • Serving & MLOps: FastAPI, Triton Inference Server, ONNX Runtime, LangChain/LlamaIndex (RAG), MLflow/W&B.


15) Quick Revision Table

Stage | Key Outputs | Common Models/Methods | Metrics
Preprocess | tokens, cleaned text | normalization, tokenization, lemmatization | –
Linguistic | POS/NER/parse trees | CRF, BiLSTM‑CRF, parsers | F1, UAS/LAS
Vectorize | TF‑IDF/embeddings | word2vec, GloVe, BERT/GPT/T5 | –
Model | labels/spans/seqs | NB, SVM, LSTM, Transformer | Acc/F1/ROUGE/BLEU
Decode | final text/answers | beam, top‑k/top‑p | –
Evaluate | quality/fairness | task‑specific | task‑specific
Deploy | API/app | quantization, distillation | latency, throughput
Monitor | drift, safety | dashboards, A/B | error rates, drift

16) Exam Tips & Viva Pointers

  • Differentiate stemming vs lemmatization, constituency vs dependency, encoder vs decoder transformers.

  • Write the TF‑IDF formula and explain why IDF downweights frequent words.

  • For NER, mention BIO tagging and span‑level evaluation.

  • For MT/summarization, name BLEU/ROUGE and explain their intuition.

  • Be ready to sketch a full pipeline and justify each step for a chosen task.


17) Worked Example: Training a Simple Text Classifier (scikit‑learn)

# Inputs: labeled docs D = [(x_i, y_i), ...]; clean_normalize is your project-specific cleanup (section 4)
# Output: trained model M + fitted vectorizer (always shipped together)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import joblib

texts = [clean_normalize(x) for x, _ in D]
labels = [y for _, y in D]
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
Xtr = vectorizer.fit_transform(X_train)
Xva = vectorizer.transform(X_val)
M = LinearSVC(C=1.0, class_weight='balanced')
M.fit(Xtr, y_train)
print(classification_report(y_val, M.predict(Xva)))   # precision, recall, F1 per class
joblib.dump((M, vectorizer), 'sentiment_svm.joblib')  # ship model + vectorizer as one artifact

Final Takeaway

NLP systems succeed when data quality, tokenization/representation, and evaluation discipline are treated as first‑class citizens—not just the model. Pair strong baselines with well‑tuned transformers and a robust MLOps loop for production‑grade results.
