TaskList (First Draft) - Procedure to design a Search Engine Online
DomainStudy
Market Space
Players
Technologies
Legalities
Specifications
Freezing Specifications
Number of Sites, attributes required, database updation frequency, user operations etc
Pegging Benchmarks
Accuracy/Time/Space etc.
Platform/Architecture
Deciding
Configuring/Mounting/Installing/Deployment
Crawler
Understanding and Deciding Core Technology
Identifying and understanding various technology issues that affect performance/scalability etc
Understanding the various core technologies/approaches
Choosing the one which best meets our requirements
Understanding the Protocols
Understanding the various industry protocols and standards such as APIs/Exclusion Rules
Automating identification of Sources
Deciding the strategy to automate decision on which sites to crawl
Deciphering Permissions/Exclusions
Deciding the strategy of how to order crawl
Deciding the strategy to identify relevancy of a crawled website
Ranking Sources (Ranking.com, Alexa.com)
Deciding whether ranking should be fully automated or allow manual intervention/overriding (aim for fully automated)
Approach, Evaluation Matrix and Algorithm to decide rank
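The "Deciphering Permissions/Exclusions" step above can be sketched with Python's standard urllib.robotparser module. This is a minimal illustration, not a decided design: the sample robots.txt text, the "ExampleBot" user agent and the helper names build_parser and is_allowed are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, standing in for a file fetched from a site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the exclusion rules permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, is_allowed(rules, "ExampleBot", url))
```

Parsing already-downloaded text keeps the check separate from the fetching code, so the crawler can cache one parser per site and consult it before every request.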
Automating identification of Content
Deciding the strategy to identify which page to grab
Identification of Duplicate Pages
Deciding strategy to identify layout of required attributes
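One possible starting point for "Identification of Duplicate Pages" is a content fingerprint: hash the page text after normalizing whitespace and case, and remember which URL produced each hash. The sketch below assumes the page text has already been extracted; the names content_fingerprint and DuplicateFilter are hypothetical. It only catches pages that are identical after normalization, so near-duplicates would need a different technique (for example shingling), which is not shown here.

```python
import hashlib
import re
from typing import Dict, Optional


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers the first URL seen for each distinct page fingerprint."""

    def __init__(self) -> None:
        self._first_url_by_fingerprint: Dict[str, str] = {}

    def check(self, url: str, text: str) -> Optional[str]:
        """Return the URL of an earlier identical page, or None if new."""
        fingerprint = content_fingerprint(text)
        original = self._first_url_by_fingerprint.get(fingerprint)
        if original is None:
            self._first_url_by_fingerprint[fingerprint] = url
        return original


if __name__ == "__main__":
    pages = DuplicateFilter()
    print(pages.check("http://example.com/a", "Hello   World"))
    print(pages.check("http://example.com/b", "hello world\n"))
```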
Extracting Information
Deciding strategy to extract content from a page
Categorizing Information
Identifying which object belongs to which category
Site Submission Procedures (Yahoo, Google, MSN)
Deciding Protocols for site submission by respective owner
Updation
Deciding Order of updation based on inventory levels etc.
Deciding strategy on how/when to schedule updation
Implementation
FeedGatherer
Identifying and Understanding the Technology Issues
Understanding the Protocols
Understanding the various industry protocols and standards such as feed formats/time of release etc
Creating Feed Format
Sourcing Feeds
Deciding Feed Submission Process by Site Owners
Tackling Format Incompatibility Issues
Deciding Policy for Proactive Feed Gathering
Ranking Sources
Deciding whether it should be manual, automated or a mix
Approach, Evaluation Matrix and Algorithm to decide rank
Updation
Deciding Updation Philosophy (real-time from feeds, from a periodically refreshed database, or a mix)
Deciding Order of updation based on inventory levels etc.
Deciding strategy on how/when to schedule updation
Differential Updation Frequency
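The scheduling and "Differential Updation Frequency" items above could be prototyped with a priority queue in which each source carries its own refresh interval. This is a rough sketch under assumed names (ScheduledUpdate, UpdateScheduler) and with plain numbers standing in for timestamps; the actual ordering criteria, such as inventory levels, are still to be decided.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class ScheduledUpdate:
    due_at: float                        # only this field is used for ordering
    source: str = field(compare=False)
    interval: float = field(compare=False)


class UpdateScheduler:
    """Keeps sources ordered by the time their next refresh is due."""

    def __init__(self) -> None:
        self._queue: List[ScheduledUpdate] = []

    def add_source(self, source: str, interval: float, now: float = 0.0) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._queue, ScheduledUpdate(now + interval, source, interval))

    def pop_due(self, now: float) -> List[str]:
        """Return sources due at `now` and reschedule each one interval later."""
        due = []
        while self._queue and self._queue[0].due_at <= now:
            item = heapq.heappop(self._queue)
            due.append(item.source)
            heapq.heappush(
                self._queue,
                ScheduledUpdate(now + item.interval, item.source, item.interval),
            )
        return due


if __name__ == "__main__":
    scheduler = UpdateScheduler()
    scheduler.add_source("fast-feed", interval=10)
    scheduler.add_source("slow-feed", interval=60)
    for tick in (5, 10, 60):
        print(tick, scheduler.pop_due(tick))
```

Rescheduling from the current time rather than from the missed due time means a source that falls behind is refreshed once, not once per missed interval.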
Extracting Attributes
Categorizing Information
Identifying which object belongs to which category
Implementation
Database
Understanding and Deciding Core Technology
Identifying and Understanding the relevant technology issues
Choosing the base core technology which best meets our requirements
Design
Understanding the Technology Issues
Drawing the Schema
Populating
Populating Database with the extracted information
Updation
Deciding strategy for updating the database
Handling Queries during the updation process
Implementation
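As one illustration of the Database items above (schema, populating, and keeping queries consistent during updation), the sketch below uses Python's built-in sqlite3 module and applies each batch of extracted records inside a single transaction, so a failed batch leaves the previous data untouched. The table name listings, its columns and the helper names are assumptions for the example; the real schema and core technology are yet to be chosen.

```python
import sqlite3
from typing import Iterable, Optional, Tuple

# Hypothetical schema: one row per extracted item, keyed by its source URL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    url      TEXT PRIMARY KEY,
    title    TEXT NOT NULL,
    category TEXT NOT NULL
)
"""

Row = Tuple[str, Optional[str], str]


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def apply_batch(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Insert or replace a batch of rows atomically.

    The connection context manager commits if every statement succeeds
    and rolls back the whole batch if any statement raises.
    """
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO listings (url, title, category) VALUES (?, ?, ?)",
            rows,
        )


if __name__ == "__main__":
    db = open_db()
    apply_batch(db, [("http://example.com/1", "First item", "books")])
    apply_batch(db, [("http://example.com/1", "First item (revised)", "books")])
    print(db.execute("SELECT url, title FROM listings").fetchall())
```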
Searching
Methodology
Deciding methodology and algorithms
Handling Concurrency Issues
Intelligent Deciphering
Incorporating Intelligent Deciphering using Stemming and Phonetic support etc
Optimizations
Strategy for Caching recent/popular queries
Planning structured and generic queries
Implementation
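The Searching items above (stemming support and caching of recent/popular queries) can be tried out together in a small in-memory prototype. The sketch below is illustrative only: crude_stem is a deliberately naive suffix stripper, not a real stemming algorithm such as Porter's, phonetic matching is not shown, and the names build_index and Searcher are hypothetical. Caching uses functools.lru_cache from the standard library.

```python
from functools import lru_cache
from typing import Dict, Set, Tuple

SUFFIXES = ("ing", "ed", "s")  # naive list, for illustration only


def crude_stem(word: str) -> str:
    """Lowercase the word and strip one common suffix if enough stem remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each stemmed token to the set of document ids containing it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(crude_stem(token), set()).add(doc_id)
    return index


class Searcher:
    def __init__(self, docs: Dict[int, str]) -> None:
        self._index = build_index(docs)
        # Keep results of up to 1024 recently used queries.
        self.search = lru_cache(maxsize=1024)(self._search)

    def _search(self, query: str) -> Tuple[int, ...]:
        """Return ids of documents containing every stemmed query term."""
        terms = [crude_stem(t) for t in query.split()]
        if not terms:
            return ()
        hits = set.intersection(*(self._index.get(t, set()) for t in terms))
        return tuple(sorted(hits))


if __name__ == "__main__":
    searcher = Searcher({1: "crawling web pages", 2: "ranking pages", 3: "crawled sites"})
    print(searcher.search("crawl"))
    print(searcher.search("crawls pages"))
    print(searcher.search.cache_info())
```

If the index is rebuilt after a database update, the cached results would need to be discarded as well, for example by calling searcher.search.cache_clear().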
Website
Design
Conceptualizing processes/actions facilitated
Formalization (Flowcharts/Constraints/Specifications) of various processes/actions
Demarcating/Categorizing as Control or View
Backend(Control)
Algorithms for various process controllers (search engine ranking)
Frontend(View)
Layout/Structure/Map of the website
Drawing Views
Aesthetics (Graphics etc.)
Implementation
Testing
Identifying everything that needs to be tested
Conceptualizing TestCases
Implementation
Runs
Courtesy: Software Engineer