1) What is NLP? (Scope & Goals)
- Definition: Field of AI that enables computers to understand, generate, and interact using human language.
- Modalities: Text (documents, chat, code‑mixed social media) and Speech (ASR = speech→text, TTS = text→speech).
- End goals: Information extraction, question answering, translation, summarization, dialogue systems, sentiment analysis, content moderation, retrieval‑augmented generation, etc.
2) End‑to‑End NLP Lifecycle
A compact flow: A. Problem framing → B. Data → C. Preprocessing → D. Linguistic processing → E. Feature/Embedding → F. Modeling & Training → G. Evaluation → H. Deployment → I. Monitoring & Iteration.
3) Data & Corpus Management
- Sourcing: Open corpora, web crawl, logs (with consent), domain documents, transcribed audio.
- Licensing & Privacy: Respect copyright, redact PII, obtain consent for user data.
- Annotation: Gold labels for tasks (e.g., sentiment, entities, intent). Use guidelines, inter‑annotator agreement (Cohen’s κ), and adjudication.
- Splits: Train / Validation (dev) / Test. Avoid leakage; stratify by class; for time‑series, split chronologically (see the sketch after this section).
- Cleaning: Deduplicate, remove boilerplate, fix encoding, handle emojis/URLs/markup, normalize Unicode (NFC/NFKC).
- Class imbalance: Weighted loss, resampling, focal loss; data augmentation (back‑translation, synonym replacement, noise injection).
Note on multilingual/code‑mixed text: Use language ID, script detection, transliteration (e.g., Hinglish → Hindi/English), and tokenizers that support multiple scripts.
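A minimal sketch of two data-stage steps with scikit-learn: Cohen's κ for inter-annotator agreement and a stratified split that preserves class ratios. The labels, tiny corpus, and split ratios are illustrative, not a recipe.
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Inter-annotator agreement on the same 8 items (illustrative labels)
annotator_a = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu"]
annotator_b = ["pos", "neg", "pos", "pos", "neu", "pos", "neg", "neg"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Stratified split that preserves class ratios (tiny illustrative corpus)
texts = [f"doc {i}" for i in range(12)]
labels = ["pos", "neg", "neu"] * 4
train_x, rest_x, train_y, rest_y = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)
# Split `rest` the same way once more to obtain separate dev and test sets.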
4) Text Preprocessing (Normalization Pipeline)
- Document segmentation: Split the corpus into documents/sentences (rule‑based or ML models).
- Tokenization: Word/character/subword (BPE, WordPiece, SentencePiece) to handle OOV and morphology.
- Case‑folding & diacritics: Lowercase where appropriate; be careful with NER/acronyms and with scripts where case is meaningful.
- Noise handling: Remove/transform URLs, mentions, hashtags, HTML, emojis (map to tokens), punctuation (task‑dependent).
- Spelling normalization & slang: Correct spelling; expand contractions; normalize variants (colour/color). For social text, keep expressive tokens if predictive.
- Stop‑words: Optional removal; avoid it for transformer models and for tasks that need function words.
- Stemming vs Lemmatization (compared in the sketch below):
  - Stemming: heuristic suffix chopping (e.g., compute, computer, computing → comput).
  - Lemmatization: vocabulary + morphology aware (e.g., better → good).
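A minimal sketch of the stemming vs lemmatization contrast with NLTK (assumes the nltk package; the lemmatizer needs the WordNet data, downloaded below).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)      # lemmatizer needs the WordNet data
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["compute", "computer", "computing"]:
    print(word, "->", stemmer.stem(word))             # all reduce to "comput"

print(lemmatizer.lemmatize("better", pos="a"))        # "good" (adjective lemma)
print(lemmatizer.lemmatize("running", pos="v"))       # "run"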
5) Linguistic Processing (Classical NLP)
- POS Tagging: Assign word classes (NN, VB, JJ…). Tagsets: Penn Treebank, Universal Dependencies.
- Chunking/Shallow Parsing: Group tokens (NP, VP, PP) using BIO tagging.
- Named Entity Recognition (NER): Detect entities (PER, ORG, LOC, GPE, DATE, MONEY…).
- Morphology: Lemmas, affixes, features (number, gender, case), especially for morphologically rich languages.
- Syntactic Parsing:
  - Constituency: Build phrase‑structure trees.
  - Dependency: Head‑dependent arcs; useful for relation extraction.
- Coreference Resolution: Link mentions that refer to the same entity (“Rahul… he…”).
- Word Sense Disambiguation (WSD): Select the right sense for polysemous words (“bank” = river vs finance).
- Semantic Role Labeling (SRL): Who did what to whom, when, where (predicate‑argument structure).
- Discourse: Coherence relations across sentences (RST), topic segmentation.
These layers can be features for classical ML or learned implicitly by deep models.
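A minimal sketch of several of these layers with spaCy (assumes spaCy and its en_core_web_sm model are installed; the example sentence and entity labels are illustrative).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Rahul joined Google in Bangalore in 2021.")

for token in doc:
    # POS tag, lemma, and dependency relation to the token's syntactic head
    print(token.text, token.pos_, token.lemma_, token.dep_, "->", token.head.text)

for ent in doc.ents:
    print(ent.text, ent.label_)          # e.g., Rahul PERSON, Google ORG, Bangalore GPE, 2021 DATE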
6) Feature Engineering & Representations
- Bag of Words (BoW) / n‑grams: Counts or presence; simple, strong baselines.
- TF‑IDF: Weighs rare but informative terms.
  Formula: TFIDF(t, d) = TF(t, d) × log( N / (1 + DF(t)) ), where N is the number of documents and DF(t) is the number of documents containing term t.
- Distributional vectors:
  - Static embeddings: word2vec (CBOW/Skip‑gram), GloVe, fastText (subword aware).
  - Contextual embeddings: ELMo, BERT‑family (encoder), GPT‑family (decoder), T5/Marian (encoder‑decoder). Token vectors depend on context.
- Sentence/Document embeddings: Pooling, Sentence‑BERT, averaging, CLS token (see the pooling sketch below).
- Character/Subword features: Tackle misspellings, OOV, morphology.
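A minimal sketch of sentence embeddings by mean-pooling a pretrained encoder's token vectors (assumes the transformers and torch packages; bert-base-uncased is just an illustrative checkpoint).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.", "We sat on the river bank."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, hidden)
mask = batch["attention_mask"].unsqueeze(-1)           # zero out padding positions
embeddings = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling -> (batch, hidden)
print(embeddings.shape)                                # torch.Size([2, 768])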
7) Modeling Paradigms
A) Classical ML
- Classification: Naive Bayes, Logistic Regression, Linear SVM.
- Sequence labeling: HMM, CRF; popular for POS/NER (BiLSTM‑CRF = hybrid).
- Topic modeling: LDA for unsupervised themes.
B) Neural & Deep Learning
- RNN/LSTM/GRU: Sequence modeling; BiLSTM for context from both sides.
- CNN for text: Local n‑gram features; strong for classification.
- Attention: Focus on salient tokens. Scaled dot‑product attention (conceptually: query–key similarity → weights → value sum); see the sketch after this list.
- Transformers: Self‑attention layers; train with:
  - Encoder‑only (e.g., BERT) → understanding tasks via masked language modeling + fine‑tuning.
  - Decoder‑only (e.g., GPT) → generation via next‑token prediction; prompting/few‑shot learning.
  - Encoder‑decoder (e.g., T5, Marian) → seq2seq tasks (translation, summarization).
- Decoding strategies: Greedy, beam search, length penalty; stochastic: top‑k, nucleus (top‑p), temperature.
- Multitask/Multilingual: Shared parameters, adapters; XLM‑R, mBERT.
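A minimal NumPy sketch of scaled dot-product attention for a single head, without masking, to make the query–key–value flow concrete; shapes and values are illustrative.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query–key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                                       # weighted sum of values

Q = K = V = np.random.rand(4, 8)                             # 4 tokens, dimension 8
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8)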
C) Retrieval‑Augmented Generation (RAG)
- Index domain documents → Embed → Retrieve top‑k → (Re)Rank → Generate grounded answer; improves factuality & freshness.
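A minimal retrieval sketch for the RAG flow above, using TF-IDF cosine similarity as a stand-in for a dense retriever; the documents and prompt format are illustrative, and in practice you would use a neural embedder plus a vector index.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["Refund policy: items can be returned within 30 days.",
             "Shipping usually takes 3-5 business days.",
             "Gift cards are non-refundable."]
vectorizer = TfidfVectorizer().fit(documents)
doc_vecs = vectorizer.transform(documents)

query = "Can I return a gift card?"
scores = cosine_similarity(vectorizer.transform([query]), doc_vecs)[0]
top_k = scores.argsort()[::-1][:2]                     # indices of the top-k passages
context = "\n".join(documents[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` is then passed to a generator (LLM) to produce a grounded answer.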
8) Task Archetypes & What Changes in the Pipeline
- Text Classification (sentiment/toxicity/intent): tokenization → vectorize → classifier.
- Sequence Labeling (POS/NER/Chunking): BIO tagging; per‑token predictions; a CRF layer often helps (see the BIO sketch after this list).
- Span Extraction / QA: Predict start/end indices over the context.
- Sequence‑to‑Sequence (MT, summarization, data‑to‑text): encoder‑decoder + attention; careful decoding.
- Information Extraction: NER + relation extraction + event extraction.
- Dialogue/Chatbots: NLU (intent, slots) + Policy + NLG; or end‑to‑end with an LLM + tools.
- Search/Retrieval: BM25 or dense retrievers (dual encoders, ColBERT); rerankers (cross‑encoders).
- Speech:
  - ASR: audio → features (MFCC/log‑mels) → acoustic model (CTC/Transducer/attention) → language model → text.
  - TTS: text → phonemes → acoustic model (Tacotron/FastSpeech) → vocoder (WaveNet/HiFi‑GAN) → audio.
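A minimal sketch of producing BIO tags from labeled character spans; spans_to_bio is an illustrative helper, not a library function, and the tokens/spans are toy data.
def spans_to_bio(tokens, spans):
    # tokens: list of (text, start, end); spans: list of (start, end, label) entity spans
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= s_start and t_end <= s_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

tokens = [("Acme", 0, 4), ("Corp", 5, 9), ("owes", 10, 14), ("$500", 15, 19)]
print(spans_to_bio(tokens, [(0, 9, "ORG"), (15, 19, "AMOUNT")]))
# ['B-ORG', 'I-ORG', 'O', 'B-AMOUNT']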
9) Training Workflow (Supervised Example)
- Define the objective (e.g., F1 on the minority class ≥ 0.80).
- Prepare data (split, balance, augment, label‑quality checks).
- Tokenizer/Vectorizer setup (TF‑IDF or a subword model).
- Model selection (baseline NB/SVM → transformer fine‑tuning for lift).
- Optimization: Adam/AdamW; schedule (linear warmup/decay); batch size, max length (see the sketch after this list).
- Regularization: Dropout, weight decay, early stopping, gradient clipping, mixout.
- Hyperparameter search: learning rate, epochs, class weights; tune on the dev set.
- Reproducibility: Fix seeds, log configs, save checkpoints & tokenizer.
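A minimal sketch of the optimization recipe above: AdamW with a linear warmup/decay schedule and gradient clipping. The Linear layer and placeholder loss stand in for a real transformer and its task loss; all hyperparameters are illustrative.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 2)                       # stand-in for the model being fine-tuned
num_training_steps = 1000                             # = epochs * batches_per_epoch in a real run
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # 10% linear warmup, then linear decay
    num_training_steps=num_training_steps)

for step in range(num_training_steps):
    loss = model(torch.randn(8, 768)).pow(2).mean()   # placeholder loss; use the real task loss in practice
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()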
10) Evaluation & Error Analysis
- Classification: Accuracy, Precision/Recall/F1 (macro/micro), ROC‑AUC; confusion matrix (see the sketch below).
- Sequence labeling: Token/Entity F1 (exact span‑match rules!).
- QA (extractive): Exact Match (EM), F1 overlap.
- Generation: BLEU/METEOR/TER for MT; ROUGE‑1/2/L for summarization; BERTScore, COMET; human eval (fluency, adequacy, factuality).
- Language modeling: Perplexity.
- ASR: WER/CER.
- Fairness & Safety: Group‑wise metrics, toxicity rates, stereotype tests, PII leakage.
Error analysis loop: Sample failures → categorize (tokenization, OOV, long context, negation, sarcasm, code‑mixing, domain shift) → data/feature/model fixes → re‑test.
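A minimal classification-evaluation sketch with scikit-learn; the labels below are illustrative.
from sklearn.metrics import classification_report, confusion_matrix, f1_score

y_true = ["pos", "neg", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "neu", "neg", "neg"]

print(classification_report(y_true, y_pred))                  # per-class precision/recall/F1
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred, labels=["pos", "neu", "neg"]))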
11) Deployment & MLOps for NLP
- Packaging: Export model + tokenizer + config; quantize or distill for latency.
- Serving: REST/gRPC; batching; streaming for ASR; caching hot prompts (see the sketch after this list).
- Observability: Track throughput/latency, success rates, drift (embedding shift, vocabulary changes), hallucination/factuality for LLMs.
- Guardrails: Input validation, language ID, PII redaction, profanity/toxicity filters, prompt shields, rate limits.
- Retraining cadence: Active learning (human‑in‑the‑loop), weak supervision, feedback loops.
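A minimal FastAPI serving sketch, assuming a scikit-learn classifier and vectorizer were saved together with joblib; the file name and endpoint are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

model, vectorizer = joblib.load("sentiment_svm.joblib")   # illustrative path to a saved (model, vectorizer) pair
app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    features = vectorizer.transform([req.text])
    return {"label": str(model.predict(features)[0])}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000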
12) Worked Mini‑Pipelines (Concrete Examples)
A) Sentiment Classifier (Tweets/Reviews)
- Collect & label data (pos/neg/neutral) → split.
- Normalize (URLs, emojis → tokens), subword tokenize.
- Baseline: TF‑IDF + Linear SVM; log F1.
- Fine‑tune a small transformer (e.g., DistilBERT) with class weights.
- Evaluate macro‑F1; inspect confusion cases (sarcasm, negation scope).
- Deploy with thresholding + an abstain policy for low confidence (see the sketch below).
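A minimal sketch of the thresholding + abstain policy; the threshold value and labels are illustrative.
import numpy as np

def predict_or_abstain(probs, labels, threshold=0.7):
    # probs: class probabilities for one input; labels: class names in the same order
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "abstain"                       # route to a human or a fallback flow
    return labels[best]

print(predict_or_abstain(np.array([0.55, 0.30, 0.15]), ["pos", "neg", "neu"]))   # abstain
print(predict_or_abstain(np.array([0.85, 0.10, 0.05]), ["pos", "neg", "neu"]))   # pos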
B) NER for Invoices (ORG, DATE, AMOUNT)
- Annotate spans with the BIO scheme; handle currency formats.
- Train a BiLSTM‑CRF or fine‑tune an encoder‑only transformer.
- Post‑process with regex/validators (dates, currency sums); see the sketch below.
- Evaluate span‑level F1; audit for privacy.
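A minimal sketch of regex validators for post-processing extracted DATE/AMOUNT spans; the patterns are illustrative and far from exhaustive.
import re

DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")
AMOUNT_RE = re.compile(r"^[$€₹]?\s?\d{1,3}(,\d{3})*(\.\d{2})?$")

def validate_entity(text, label):
    if label == "DATE":
        return bool(DATE_RE.match(text))
    if label == "AMOUNT":
        return bool(AMOUNT_RE.match(text))
    return True                                   # accept other labels unchanged

print(validate_entity("12/03/2024", "DATE"))      # True
print(validate_entity("$1,250.00", "AMOUNT"))     # True
print(validate_entity("tomorrow", "DATE"))        # False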
C) Abstractive Summarization (News)
- Build paired (article, summary) dataset; length control.
- Fine‑tune encoder‑decoder; use coverage loss or contrastive reranking to reduce hallucination.
- Decode with beam search + length penalty; evaluate ROUGE & human judgments.
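A minimal sketch of beam-search decoding with the transformers generate API; the checkpoint name and decoding parameters are illustrative choices, not the only reasonable ones.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "sshleifer/distilbart-cnn-12-6"                # illustrative summarization checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

article = "Long news article text goes here ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    **inputs,
    num_beams=4, length_penalty=2.0,                  # beam search with a length penalty
    max_length=128, min_length=30, no_repeat_ngram_size=3)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))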
13) Typical Pitfalls & Remedies
- Tokenization mismatch: Always save and ship the exact tokenizer with the model.
- Too much cleaning: Over‑aggressive stop‑word/punctuation removal can hurt.
- Domain shift: Use domain adaptation, RAG, or continual fine‑tuning.
- Class imbalance: Use weighted loss, focal loss, or data augmentation.
- Long context: Use long‑context transformers, chunk + overlap, or retrieval.
- Sarcasm/Irony: Add specialized data, context windows, pragmatics cues.
- Multilingual/code‑mix: Use multilingual encoders; transliteration; script‑aware tokenizers.
14) Tools & Libraries (by category)
- Preprocessing/Classic NLP: NLTK, spaCy, Stanza.
- Transformers & Training: Hugging Face Transformers/PEFT, PyTorch, TensorFlow, Keras, OpenNMT, Fairseq.
- Tokenization: SentencePiece, Hugging Face Tokenizers.
- Speech: Kaldi, ESPnet, wav2vec 2.0 toolchains, Coqui‑TTS.
- Serving & MLOps: FastAPI, Triton Inference Server, ONNX Runtime, LangChain/LlamaIndex (RAG), MLflow/W&B.
15) Quick Revision Table
| Stage | Key Outputs | Common Models/Methods | Metrics |
|---|---|---|---|
| Preprocess | tokens, cleaned text | normalization, tokenization, lemmatization | — |
| Linguistic | POS/NER/parse trees | CRF, BiLSTM‑CRF, parsers | F1, UAS/LAS |
| Vectorize | TF‑IDF/embeddings | word2vec, GloVe, BERT/GPT/T5 | — |
| Model | labels/spans/seqs | NB, SVM, LSTM, Transformer | Acc/F1/ROUGE/BLEU |
| Decode | final text/answers | beam, top‑k/top‑p | — |
| Evaluate | quality/fairness | task‑specific | task‑specific |
| Deploy | API/app | quantization, distillation | latency, throughput |
| Monitor | drift, safety | dashboards, A/B | error rates, drift |
16) Exam Tips & Viva Pointers
- Differentiate stemming vs lemmatization, constituency vs dependency, encoder vs decoder transformers.
- Write the TF‑IDF formula and explain why IDF downweights frequent words.
- For NER, mention BIO tagging and span‑level evaluation.
- For MT/summarization, name BLEU/ROUGE and explain their intuition.
- Be ready to sketch a full pipeline and justify each step for a chosen task.
17) Code Sketch: Training a Simple Text Classifier (scikit‑learn)
# Inputs: labeled docs D = {(x_i, y_i)}
# Output: trained model M (plus its fitted vectorizer)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
import joblib

texts, labels = clean_normalize(D)                    # project-specific cleaning/normalization step
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
Xtr = vectorizer.fit_transform(X_train)
Xva = vectorizer.transform(X_val)
M = LinearSVC(C=1.0, class_weight="balanced")
M.fit(Xtr, y_train)
print(classification_report(y_val, M.predict(Xva)))   # precision, recall, F1 per class
joblib.dump((M, vectorizer), "sentiment_svm.joblib")  # ship the model and the exact vectorizer together
Final Takeaway
NLP systems succeed when data quality, tokenization/representation, and evaluation discipline are treated as first‑class citizens—not just the model. Pair strong baselines with well‑tuned transformers and a robust MLOps loop for production‑grade results.