1) What is NLP? (Scope & Goals)
- Definition: Field of AI that enables computers to understand, generate, and interact using human language.
- Modalities: Text (documents, chat, code‑mixed social media) and Speech (ASR = speech→text, TTS = text→speech).
- End goals: Information extraction, question answering, translation, summarization, dialogue systems, sentiment analysis, content moderation, retrieval‑augmented generation, etc.
2) End‑to‑End NLP Lifecycle
A compact flow: A. Problem framing → B. Data → C. Preprocessing → D. Linguistic processing → E. Feature/Embedding → F. Modeling & Training → G. Evaluation → H. Deployment → I. Monitoring & Iteration.
3) Data & Corpus Management
- Sourcing: Open corpora, web crawl, logs (with consent), domain documents, transcribed audio.
- Licensing & Privacy: Respect copyright, redact PII, obtain consent for user data.
- Annotation: Gold labels for tasks (e.g., sentiment, entities, intent). Use guidelines, inter‑annotator agreement (Cohen’s κ), and adjudication.
- Splits: Train / Validation (dev) / Test. Avoid leakage; stratify by class; for time‑series, split chronologically (see the sketch after this section).
- Cleaning: Deduplicate, remove boilerplate, fix encoding, handle emojis/URLs/markup, normalize Unicode (NFC/NFKC).
- Class imbalance: Weighted loss, resampling, focal loss; data augmentation (back‑translation, synonym replacement, noise injection).
Note on multilingual/code‑mixed text: Use language ID, script detection, transliteration (e.g., Hinglish → Hindi/English), and tokenizers that support multiple scripts.
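A minimal sketch of two data-stage steps with scikit-learn: Cohen's κ for inter-annotator agreement and a stratified split that preserves class ratios. The labels, tiny corpus, and split ratios are illustrative, not a recipe.
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Inter-annotator agreement on the same 8 items (illustrative labels)
annotator_a = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu"]
annotator_b = ["pos", "neg", "pos", "pos", "neu", "pos", "neg", "neg"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Stratified split that preserves class ratios (tiny illustrative corpus)
texts = [f"doc {i}" for i in range(12)]
labels = ["pos", "neg", "neu"] * 4
train_x, rest_x, train_y, rest_y = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)
# Split `rest` the same way once more to obtain separate dev and test sets.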
4) Text Preprocessing (Normalization Pipeline)
- Document segmentation: Split the corpus into documents/sentences (rule‑based or ML models).
- Tokenization: Word/character/subword (BPE, WordPiece, SentencePiece) to handle OOV and morphology.
- Case‑folding & diacritics: Lowercase where appropriate; be careful with NER/acronyms and with scripts where case is meaningful.
- Noise handling: Remove/transform URLs, mentions, hashtags, HTML, emojis (map to tokens), punctuation (task‑dependent).
- Spelling normalization & slang: Correct spelling; expand contractions; normalize variants (colour/color). For social text, keep expressive tokens if predictive.
- Stop‑words: Optional removal; avoid it for transformer models and for tasks that need function words.
- Stemming vs Lemmatization (compared in the sketch below):
  - Stemming: heuristic suffix chopping (e.g., compute, computer, computing → comput).
  - Lemmatization: vocabulary + morphology aware (e.g., better → good).
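A minimal sketch of the stemming vs lemmatization contrast with NLTK (assumes the nltk package; the lemmatizer needs the WordNet data, downloaded below).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)      # lemmatizer needs the WordNet data
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["compute", "computer", "computing"]:
    print(word, "->", stemmer.stem(word))             # all reduce to "comput"

print(lemmatizer.lemmatize("better", pos="a"))        # "good" (adjective lemma)
print(lemmatizer.lemmatize("running", pos="v"))       # "run"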
5) Linguistic Processing (Classical NLP)
- POS Tagging: Assign word classes (NN, VB, JJ…). Tagsets: Penn Treebank, Universal Dependencies.
- Chunking/Shallow Parsing: Group tokens (NP, VP, PP) using BIO tagging.
- Named Entity Recognition (NER): Detect entities (PER, ORG, LOC, GPE, DATE, MONEY…).
- Morphology: Lemmas, affixes, features (number, gender, case), especially for morphologically rich languages.
- Syntactic Parsing:
  - Constituency: Build phrase‑structure trees.
  - Dependency: Head‑dependent arcs; useful for relation extraction.
- Coreference Resolution: Link mentions that refer to the same entity (“Rahul… he…”).
- Word Sense Disambiguation (WSD): Select the right sense for polysemous words (“bank” = river vs finance).
- Semantic Role Labeling (SRL): Who did what to whom, when, where (predicate‑argument structure).
- Discourse: Coherence relations across sentences (RST), topic segmentation.
These layers can be features for classical ML or learned implicitly by deep models.
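A minimal sketch of several of these layers with spaCy (assumes spaCy and its en_core_web_sm model are installed; the example sentence and entity labels are illustrative).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Rahul joined Google in Bangalore in 2021.")

for token in doc:
    # POS tag, lemma, and dependency relation to the token's syntactic head
    print(token.text, token.pos_, token.lemma_, token.dep_, "->", token.head.text)

for ent in doc.ents:
    print(ent.text, ent.label_)          # e.g., Rahul PERSON, Google ORG, Bangalore GPE, 2021 DATE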
6) Feature Engineering & Representations
- Bag of Words (BoW) / n‑grams: Counts or presence; simple, strong baselines.
- TF‑IDF: Weighs rare but informative terms.
  Formula: TFIDF(t, d) = TF(t, d) × log( N / (1 + DF(t)) ), where N is the number of documents and DF(t) is the number of documents containing term t.
- Distributional vectors:
  - Static embeddings: word2vec (CBOW/Skip‑gram), GloVe, fastText (subword aware).
  - Contextual embeddings: ELMo, BERT‑family (encoder), GPT‑family (decoder), T5/Marian (encoder‑decoder). Token vectors depend on context.
- Sentence/Document embeddings: Pooling, Sentence‑BERT, averaging, CLS token (see the pooling sketch below).
- Character/Subword features: Tackle misspellings, OOV, morphology.
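A minimal sketch of sentence embeddings by mean-pooling a pretrained encoder's token vectors (assumes the transformers and torch packages; bert-base-uncased is just an illustrative checkpoint).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.", "We sat on the river bank."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, hidden)
mask = batch["attention_mask"].unsqueeze(-1)           # zero out padding positions
embeddings = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling -> (batch, hidden)
print(embeddings.shape)                                # torch.Size([2, 768])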
7) Modeling Paradigms
A) Classical ML
- Classification: Naive Bayes, Logistic Regression, Linear SVM.
- Sequence labeling: HMM, CRF; popular for POS/NER (BiLSTM‑CRF = hybrid).
- Topic modeling: LDA for unsupervised themes.
B) Neural & Deep Learning
- RNN/LSTM/GRU: Sequence modeling; BiLSTM for context from both sides.
- CNN for text: Local n‑gram features; strong for classification.
- Attention: Focus on salient tokens. Scaled dot‑product attention (conceptually: query–key similarity → weights → value sum); see the sketch after this list.
- Transformers: Self‑attention layers; train with:
  - Encoder‑only (e.g., BERT) → understanding tasks via masked language modeling + fine‑tuning.
  - Decoder‑only (e.g., GPT) → generation via next‑token prediction; prompting/few‑shot learning.
  - Encoder‑decoder (e.g., T5, Marian) → seq2seq tasks (translation, summarization).
- Decoding strategies: Greedy, beam search, length penalty; stochastic: top‑k, nucleus (top‑p), temperature.
- Multitask/Multilingual: Shared parameters, adapters; XLM‑R, mBERT.
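A minimal NumPy sketch of scaled dot-product attention for a single head, without masking, to make the query–key–value flow concrete; shapes and values are illustrative.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query–key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                                       # weighted sum of values

Q = K = V = np.random.rand(4, 8)                             # 4 tokens, dimension 8
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8)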
C) Retrieval‑Augmented Generation (RAG)
- Index domain documents → Embed → Retrieve top‑k → (Re)Rank → Generate grounded answer; improves factuality & freshness.
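A minimal retrieval sketch for the RAG flow above, using TF-IDF cosine similarity as a stand-in for a dense retriever; the documents and prompt format are illustrative, and in practice you would use a neural embedder plus a vector index.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["Refund policy: items can be returned within 30 days.",
             "Shipping usually takes 3-5 business days.",
             "Gift cards are non-refundable."]
vectorizer = TfidfVectorizer().fit(documents)
doc_vecs = vectorizer.transform(documents)

query = "Can I return a gift card?"
scores = cosine_similarity(vectorizer.transform([query]), doc_vecs)[0]
top_k = scores.argsort()[::-1][:2]                     # indices of the top-k passages
context = "\n".join(documents[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` is then passed to a generator (LLM) to produce a grounded answer.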
8) Task Archetypes & What Changes in the Pipeline
- Text Classification (sentiment/toxicity/intent): tokenization → vectorize → classifier.
- Sequence Labeling (POS/NER/Chunking): BIO tagging; per‑token predictions; a CRF layer often helps (see the BIO sketch after this list).
- Span Extraction / QA: Predict start/end indices over the context.
- Sequence‑to‑Sequence (MT, summarization, data‑to‑text): encoder‑decoder + attention; careful decoding.
- Information Extraction: NER + relation extraction + event extraction.
- Dialogue/Chatbots: NLU (intent, slots) + Policy + NLG; or end‑to‑end with an LLM + tools.
- Search/Retrieval: BM25 or dense retrievers (dual encoders, ColBERT); rerankers (cross‑encoders).
- Speech:
  - ASR: audio → features (MFCC/log‑mels) → acoustic model (CTC/Transducer/attention) → language model → text.
  - TTS: text → phonemes → acoustic model (Tacotron/FastSpeech) → vocoder (WaveNet/HiFi‑GAN) → audio.
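A minimal sketch of producing BIO tags from labeled character spans; spans_to_bio is an illustrative helper, not a library function, and the tokens/spans are toy data.
def spans_to_bio(tokens, spans):
    # tokens: list of (text, start, end); spans: list of (start, end, label) entity spans
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= s_start and t_end <= s_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

tokens = [("Acme", 0, 4), ("Corp", 5, 9), ("owes", 10, 14), ("$500", 15, 19)]
print(spans_to_bio(tokens, [(0, 9, "ORG"), (15, 19, "AMOUNT")]))
# ['B-ORG', 'I-ORG', 'O', 'B-AMOUNT']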
9) Training Workflow (Supervised Example)
- Define the objective (e.g., F1 on the minority class ≥ 0.80).
- Prepare data (split, balance, augment, label‑quality checks).
- Tokenizer/Vectorizer setup (TF‑IDF or a subword model).
- Model selection (baseline NB/SVM → transformer fine‑tuning for lift).
- Optimization: Adam/AdamW; schedule (linear warmup/decay); batch size, max length (see the sketch after this list).
- Regularization: Dropout, weight decay, early stopping, gradient clipping, mixout.
- Hyperparameter search: learning rate, epochs, class weights; tune on the dev set.
- Reproducibility: Fix seeds, log configs, save checkpoints & tokenizer.
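A minimal sketch of the optimization recipe above: AdamW with a linear warmup/decay schedule and gradient clipping. The Linear layer and placeholder loss stand in for a real transformer and its task loss; all hyperparameters are illustrative.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 2)                       # stand-in for the model being fine-tuned
num_training_steps = 1000                             # = epochs * batches_per_epoch in a real run
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # 10% linear warmup, then linear decay
    num_training_steps=num_training_steps)

for step in range(num_training_steps):
    loss = model(torch.randn(8, 768)).pow(2).mean()   # placeholder loss; use the real task loss in practice
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()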
10) Evaluation & Error Analysis
- Classification: Accuracy, Precision/Recall/F1 (macro/micro), ROC‑AUC; confusion matrix (see the sketch below).
- Sequence labeling: Token/Entity F1 (exact span‑match rules!).
- QA (extractive): Exact Match (EM), F1 overlap.
- Generation: BLEU/METEOR/TER for MT; ROUGE‑1/2/L for summarization; BERTScore, COMET; human eval (fluency, adequacy, factuality).
- Language modeling: Perplexity.
- ASR: WER/CER.
- Fairness & Safety: Group‑wise metrics, toxicity rates, stereotype tests, PII leakage.
Error analysis loop: Sample failures → categorize (tokenization, OOV, long context, negation, sarcasm, code‑mixing, domain shift) → data/feature/model fixes → re‑test.
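A minimal classification-evaluation sketch with scikit-learn; the labels below are illustrative.
from sklearn.metrics import classification_report, confusion_matrix, f1_score

y_true = ["pos", "neg", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "neu", "neg", "neg"]

print(classification_report(y_true, y_pred))                  # per-class precision/recall/F1
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred, labels=["pos", "neu", "neg"]))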
11) Deployment & MLOps for NLP
- Packaging: Export model + tokenizer + config; quantize or distill for latency.
- Serving: REST/gRPC; batching; streaming for ASR; caching hot prompts (see the sketch after this list).
- Observability: Track throughput/latency, success rates, drift (embedding shift, vocabulary changes), hallucination/factuality for LLMs.
- Guardrails: Input validation, language ID, PII redaction, profanity/toxicity filters, prompt shields, rate limits.
- Retraining cadence: Active learning (human‑in‑the‑loop), weak supervision, feedback loops.
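A minimal FastAPI serving sketch, assuming a scikit-learn classifier and vectorizer were saved together with joblib; the file name and endpoint are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

model, vectorizer = joblib.load("sentiment_svm.joblib")   # illustrative path to a saved (model, vectorizer) pair
app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    features = vectorizer.transform([req.text])
    return {"label": str(model.predict(features)[0])}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000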
12) Worked Mini‑Pipelines (Concrete Examples)
A) Sentiment Classifier (Tweets/Reviews)
- Collect & label data (pos/neg/neutral) → split.
- Normalize (URLs, emojis → tokens), subword tokenize.
- Baseline: TF‑IDF + Linear SVM; log F1.
- Fine‑tune a small transformer (e.g., DistilBERT) with class weights.
- Evaluate macro‑F1; inspect confusion cases (sarcasm, negation scope).
- Deploy with thresholding + an abstain policy for low confidence (see the sketch below).
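A minimal sketch of the thresholding + abstain policy; the threshold value and labels are illustrative.
import numpy as np

def predict_or_abstain(probs, labels, threshold=0.7):
    # probs: class probabilities for one input; labels: class names in the same order
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "abstain"                       # route to a human or a fallback flow
    return labels[best]

print(predict_or_abstain(np.array([0.55, 0.30, 0.15]), ["pos", "neg", "neu"]))   # abstain
print(predict_or_abstain(np.array([0.85, 0.10, 0.05]), ["pos", "neg", "neu"]))   # pos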
B) NER for Invoices (ORG, DATE, AMOUNT)
- Annotate spans with the BIO scheme; handle currency formats.
- Train a BiLSTM‑CRF or fine‑tune an encoder‑only transformer.
- Post‑process with regex/validators (dates, currency sums); see the sketch below.
- Evaluate span‑level F1; audit for privacy.
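A minimal sketch of regex validators for post-processing extracted DATE/AMOUNT spans; the patterns are illustrative and far from exhaustive.
import re

DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")
AMOUNT_RE = re.compile(r"^[$€₹]?\s?\d{1,3}(,\d{3})*(\.\d{2})?$")

def validate_entity(text, label):
    if label == "DATE":
        return bool(DATE_RE.match(text))
    if label == "AMOUNT":
        return bool(AMOUNT_RE.match(text))
    return True                                   # accept other labels unchanged

print(validate_entity("12/03/2024", "DATE"))      # True
print(validate_entity("$1,250.00", "AMOUNT"))     # True
print(validate_entity("tomorrow", "DATE"))        # False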
C) Abstractive Summarization (News)
- Build paired (article, summary) dataset; length control.
- Fine‑tune encoder‑decoder; use coverage loss or contrastive reranking to reduce hallucination.
- Decode with beam search + length penalty; evaluate ROUGE & human judgments.
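A minimal sketch of beam-search decoding with the transformers generate API; the checkpoint name and decoding parameters are illustrative choices, not the only reasonable ones.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "sshleifer/distilbart-cnn-12-6"                # illustrative summarization checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

article = "Long news article text goes here ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    **inputs,
    num_beams=4, length_penalty=2.0,                  # beam search with a length penalty
    max_length=128, min_length=30, no_repeat_ngram_size=3)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))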
13) Typical Pitfalls & Remedies
- Tokenization mismatch: Always save and ship the exact tokenizer with the model.
- Too much cleaning: Over‑aggressive stop‑word/punctuation removal can hurt.
- Domain shift: Use domain adaptation, RAG, or continual fine‑tuning.
- Class imbalance: Use weighted loss, focal loss, or data augmentation.
- Long context: Use long‑context transformers, chunk + overlap, or retrieval.
- Sarcasm/Irony: Add specialized data, context windows, pragmatics cues.
- Multilingual/code‑mix: Use multilingual encoders; transliteration; script‑aware tokenizers.
14) Tools & Libraries (by category)
- Preprocessing/Classic NLP: NLTK, spaCy, Stanza.
- Transformers & Training: Hugging Face Transformers/PEFT, PyTorch, TensorFlow, Keras, OpenNMT, Fairseq.
- Tokenization: SentencePiece, Hugging Face Tokenizers.
- Speech: Kaldi, ESPnet, wav2vec 2.0 toolchains, Coqui‑TTS.
- Serving & MLOps: FastAPI, Triton Inference Server, ONNX Runtime, LangChain/LlamaIndex (RAG), MLflow/W&B.
15) Quick Revision Table
| Stage | Key Outputs | Common Models/Methods | Metrics |
|---|---|---|---|
| Preprocess | tokens, cleaned text | normalization, tokenization, lemmatization | — |
| Linguistic | POS/NER/parse trees | CRF, BiLSTM‑CRF, parsers | F1, UAS/LAS |
| Vectorize | TF‑IDF/embeddings | word2vec, GloVe, BERT/GPT/T5 | — |
| Model | labels/spans/seqs | NB, SVM, LSTM, Transformer | Acc/F1/ROUGE/BLEU |
| Decode | final text/answers | beam, top‑k/top‑p | — |
| Evaluate | quality/fairness | task‑specific | task‑specific |
| Deploy | API/app | quantization, distillation | latency, throughput |
| Monitor | drift, safety | dashboards, A/B | error rates, drift |
16) Exam Tips & Viva Pointers
- Differentiate stemming vs lemmatization, constituency vs dependency, encoder vs decoder transformers.
- Write the TF‑IDF formula and explain why IDF downweights frequent words.
- For NER, mention BIO tagging and span‑level evaluation.
- For MT/summarization, name BLEU/ROUGE and explain their intuition.
- Be ready to sketch a full pipeline and justify each step for a chosen task.
17) Code Sketch: Training a Simple Text Classifier (scikit‑learn)
# Inputs: labeled docs D = {(x_i, y_i)}
# Output: trained model M (plus its fitted vectorizer)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
import joblib

texts, labels = clean_normalize(D)                    # project-specific cleaning/normalization step
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
Xtr = vectorizer.fit_transform(X_train)
Xva = vectorizer.transform(X_val)
M = LinearSVC(C=1.0, class_weight="balanced")
M.fit(Xtr, y_train)
print(classification_report(y_val, M.predict(Xva)))   # precision, recall, F1 per class
joblib.dump((M, vectorizer), "sentiment_svm.joblib")  # ship the model and the exact vectorizer together
Final Takeaway
NLP systems succeed when data quality, tokenization/representation, and evaluation discipline are treated as first‑class citizens—not just the model. Pair strong baselines with well‑tuned transformers and a robust MLOps loop for production‑grade results.