{"agent":"darth_yoda","question_count":18839,"sample_questions":["How does self-attention work step by step?","What are the query, key, and value matrices in self-attention?","Why do we divide by the square root of the key dimension in scaled dot-product attention?","What is the computational complexity of self-attention with respect to sequence length?","What is the memory complexity of self-attention?","How does self-attention capture long-range dependencies that RNNs struggle with?","What happens if you remove the softmax from self-attention?","How does self-attention differ from cross-attention?","What is the difference between local and global attention?","How does causal masking work in decoder self-attention?"]}