OpenSolaris
Collectives
Discussions
Documentation
Download
Source Browser
Free CD
Log-in
|
en
Project opengrok
:
Documentation
>
Story of OpenGrok
Top Menu
Show
:
Comments
Attachments
History
Information
Print
:
Print
Print preview
Export as PDF
Export as RTF
Export as HTML
Export as XAR
Wiki code for
Story of OpenGrok
Hide Line numbers
1: = The Story of OpenGrok - the Wicked Fast Source Browser 2: 3: Chandan B N, Dec 2005 4: 5: This blog entry will act as the --paper-- story of [[OpenGrok>>Project opengrok.WebHome]] the wicked fast source browser that powers the [[OpenSolaris source browser>>http://src.opensolaris.org/source/]]. 6: [[[[image:http://static.flickr.com/35/71852240_d42e0556fe_m.jpg||alt="OpenGrok on OpenSolaris.org"width="240"height="167"]]>>http://mediacast.sun.com/share/chandan/OpenGrok-screenshot.png]] 7: This would be explaining the technology behind it in more detail. It would be a living blog entry. i.e I would keep updating as and when there are updates. There would no other technical documentation, if you are fond of reading in postscript or PDF formats, I’ll try to make one available sooner. 8: 9: === The Revenge of the Binaries 10: 11: Being a security sentry of all Sun products, I keep a watch on reports of newly discovered security holes and then check if any Sun software is affected by them. Typically I would use [[cscope>>http://docs.sun.com/source/806-3567/cscope.html]] to find any fragments of code that I suspect to be vulnerable. In Solaris land, most of the ON(OS/Net) gate has [[cscope>>http://docs.sun.com/source/806-3567/cscope.html]] indexes pre-built nightly. Solaris is not just ON, it is a WOS (Wad Of Stuff); code and binaries flowing in from a dozen or so gates; and there are many more software products which I had no ready access to cscope indexes or sources. 12: 13: I gathered Solaris Install images, JES install images, Sun Cluster and N1 software and many other good and great things under Sun. I wrote a Perl script, that would go down recursively on files and directory content. It would extract as much textual information as it could. For example it would convert stream packages to directories, uncompresses and extract tar, zip, gz, bzip files or run [[dis(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9en?a=view]] on [[ELF (3)>>http://docs.sun.com/app/docs/doc/816-5172/6mbb7btp9?a=view]] files, [[strings(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9t8?a=view]] on binary files. All this textual information was stored in a separate directory. This was gigabytes of characters. The initial perl script was named rob.pl (**r**evenge **o**f the **b**inaries). It took more than a day to complete the process; but it worked. 14: 15: Then to be able to search through all the text files rob.pl generated, I needed a really good text search engine. I evaluated a number of them, most were meant for searching generic plain text or html documents. They either had a poor performance or did not give accurate results. [[Lucene>>http://lucene.apache.org/java]] won the race. It is not a search engine as such. It is a library to create an inverted index and search it. You can use it to build your own search engine to suite your own domain of files. Initially I just used the example search engine that came with Lucene, and was amazed at its speed. Then I looked at the Lucene website and found that [[Doug Cutting>>http://nutch.sourceforge.net/blog/cutting.html]] was the author of Lucene. To any student of information retrieval (aka study of search engines) it is a familiar name. I took it for granted that Lucene is the right software to use. 16: 17: === The Search Engines 18: 19: The search engine technology is no rocket science. It has been there, much before modern computers arrived. Take a book and turn the last few pages. Most likely you will find an "Index" section. That is an alphabetically sorted list of special terms and corresponding page numbers. If you are looking for some word in the whole book, (1) you would first look up the page numbers in the index, and then (2) look up the page and (3) scan through the page looking for your word. This is much faster than going page by page reading each line searching for your word. 20: 21: [[image:http://blogs.sun.com/roller/resources/chandan/ogstory-index.png]] 22: 23: All modern search engines to the same thing. For google, yahoo etc., the internet is a big book, each webpage is a page. They generate an Index, that for a given word gives you a list of pages that contain that word. It is called an inverted index since unlike seeing a document as a set of words, it sees as a set of documents for a given word. 24: 25: === The Program 26: 27: The good thing about Lucene, is that it does not understand document content. You will have to write analyzers for your own content. So you have the control and freedom to interpret different kinds of files the way you want. Lucene does a good job storing your interpretation and searching it. To interpret a variety of languages under common terms, I came up with a very simple idea which fits all programming languages and executables. Programs have "symbol definitions", "symbol references", "human readable text", "path" and may be "revision history". Be it java class files starting with 0xcafebabe or C programs, it is possible to extract definitions, symbols, text, path and history. 28: 29: [[image:http://blogs.sun.com/roller/resources/chandan/ogstory-program.png]] 30: 31: The initial version of OpenGrok was a perl script named rob.pl that extracted the above 5 streams and piped them to a lucene search engine. rob.pl had become more intelligent. It was now running each file through ctags and extracting definitions. It also parsed out program identifiers. It would run **dis(1)** on ELF files and extract labels and call statement symbols. 32: 33: I called it the //Universal Program Search Engine//. I was using this on my machine for quite some time. This system was used to confirm or deny existence of several vulnerabilities. For example I used it to confirm that no code in Solaris 7 was calling gzprintf() which was the cause of [[CVE-2003-0107>>http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2003-0107]] Now I could pinpoint affected areas in Solaris for each newly discovered security hole. 34: 35: === Perl to Java 36: 37: I choose Perl because it was very easy and quick to code. I could use its efficient data structures. It was really quick to prototype a design and make sure it actually worked. I realized choosing perl for a long term solution was a mistake. Perl is great for onetime //use and throw// type of applications. When I profiled the processes, java process was mostly waiting for perl to parse the text. Processing the entire program tree source and binaries took a almost half a day. After some profiling the perl code and some optimizations, I could reduce the time to about 8-9 hours. Perl was consuming too many compute cycles, despite my script being only couple of hundred lines. 38: 39: [[image:http://blogs.sun.com/roller/resources/chandan/ogstory-then.png]] 40: 41: Then I realized that I can write custom [[Analyzers>>http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html]] in Lucene that analyzed each type of program giving out token streams to the Lucene indexer. It was very efficient and much faster. It now takes about 20 minutes to both decompose the ON gate sources into 5 streams and then index them. 42: 43: [[image:http://blogs.sun.com/roller/resources/chandan/ogstory-now.png]] 44: 45: The other reason for choosing Java was that after the release of 1.5 it is much like Perl in terms of data structures (aka [[Collections>>http://java.sun.com/j2se/1.5.0/docs/guide/collections/overview.html]]) and ease of programming, but more efficient and faster. 46: 47: === OpenSolaris 48: 49: Initially a LXR based solution was in mind for hosting OpenSolaris code. Due to many issues with LXR, a web front end for cscope was also prototyped. Internally we use cscope extensively. **cscope** has a good feature that is [[missing>>http://sourceforge.net/tracker/index.php?func=detail&aid=742274&group_id=27350&atid=390120]] in many other code searching tools like LXR. When search results are shown, LXR dumps line number and gives no clue as to what matched. Where as cscope can show lines where a symbol or function is used. Ctags [[does not show>>http://ctags.sourceforge.net/faq.html#12]] symbol references, cscope does. Cscope works perfectly for a small project, but wasn’t scalable for something as big as OpenSolaris. 50: 51: Cscope had a number of minor dificiencies; its full text search is too slow. Giving the book analogy above, it is does something like linearly searching through each page looking for your words. It could not AND two search conditions, which was a severe limitation if you wanted to search only with in a hierarchy of source tree. To work around this limitation, we had pre-built cscope index in each major tree branch (like library or kernel code), which is very expensive and inefficient. Also it understood only C or C like languages, and could not search for definitions in Makefiles for example. 52: 53: At that time, this summarizes the state of the art in code lookup and version control web interfaces that were deployed for large open source projects (like [[http://lxr.mozilla.org/>>http://lxr.mozilla.org/]] or [[http://cvs.gnome.org/>>http://cvs.gnome.org/]]) 54: 55: |=Feature|= LXR |= ctags |= cscope |= CVSview/web 56: |Full text Search| Y| | #| 57: |Definition Search| #| Y| Y| 58: |Identifier Search| Y| | Y| 59: |Path search| Y| | Y| 60: |History Search| | | | 61: |Shows matching lines| | Y| Y| 62: |Hierarchical Search| | | | 63: |query syntax like **AND**, **OR**, **field:**| | | | 64: |Incremental update| | | | 65: |Syntax highlighting-Xref| Y| | | # 66: |Interface for SCM| | | | Y 67: |open source | | Y| Y| Y 68: |Usable URLs | Y| -| -| 69: |Individual file download| | -| -| Y 70: |Changes at directory level| | -| -| # 71: |Multi language support|#| Y| #| - 72: 73: Legend: 74: **Y** : Yes the feature is present 75: **#** : the feature may be partly present 76: **-** : not applicable 77: Note: OpenGrok provides all the above features. 78: 79: The Lucene based program search engine readily had all the missing search features above. It did the correct way of doing full text searches. It could identify definitions in any language that ctags understood. It could limit searches to a branch of the source tree (aka hierarchical search). It showed the matching lines for a search instead of just line numbers. And it did an incremental update of its index. 80: 81: At that time the only thing missing in OpenGrok was good color-coding of source and displaying version control information, which was not much difficult to implement. I evaluated a number of Lex like code generators and [[JFlex>>http://jflex.de/]] turned out be the fastest. Jflex is now used for tokenizing and hyper-text cross reference. 82: 83: Early this year it was deployed to host the OpenSolaris source code on [[opensolaris.org>>http://cvs.opensolaris.org/source/]] and accessible only to OpenSolaris pilot community members. Many things improved with the comments and feedback received from community members. By June 14th this year it was ready for broadcasting millions of lines of OpenSolaris source code to the world. Now some consider OpenGrok as one of the [[10 most important things about OpenSolaris>>http://www.opensolaris.org/jive/thread.jspa?messageID=11593]]. 84: 85: === "Make the common case faster and easier" 86: 87: Before being considered for OpenSolaris it was just a program written for my own use. If it is going to be deployed for OpenSolaris, and publicly available, there will be hundreds of people and robots hitting it all the time. It had to be secure, fast and efficient and well as usable. OpenGrok then aimed to be a complete integrated and usable solution to source code search and browsing. 88: 89: The secret behind OpenGrok is the principle "make the common case faster and easier". Call it Chandan’s law. (Since Google says I am the first one to say so!). It combines a fundamental principle in Computer Architecture (i.e "//make the common case fast//" (also known as [[Amdahl’s law>>http://en.wikipedia.org/wiki/Amdahl’s_law]]) with a fundamental principle in Human Computer Interaction (i.e "//make the common case easier//"). To build your software pick the best tools available like Lucene and Java. The result is what you now see on [[http://cvs.opensolaris.org>>http://cvs.opensolaris.org/source/]]. Believe me, it takes lot more effort to make things simple than to make things complicated. 90: 91: There are more smaller features and attention to detail that make big difference in usability and speed. 92: 93: === Refactoring 94: 95: The initial working prototype was available to OpenSolaris pilot community members. It also easily withstood the [[slashdotting>>http://en.wikipedia.org/wiki/Slashdotted]] on the day OpenSolaris was launched. Allen who owns the webservers said the CPU usage was quite low even when we had tens of thousands of hits on that day. That was a sigh of relief for me, my efforts in making fast, secure and scalable did not go waste. 96: 97: Then I started refactoring the code, to keep code organized, to help others extend the functionalities or easily add support for more language types and version control systems. I used [[Netbeans>>http://www.netbeans.org]] to do the refactoring. Netbeans has changed a lot since the very first time I used it, many years ago. It makes Java programing a more pleasant experience. I could easily move code around; safely delete unused methods and classes; rename things in more logical way and do much more. Refactoring took time, the core analysis engine changed beyond comparison. Once finished, it was open sourced under the name OpenGrok. 98: It is available for [[download here>>Project opengrok.WebHome]]. The logo has a opening flower bracket. Which is a common block start symbol in languages like C and Java. It has no closing bracket to signify open source. 99: [[[[image:Project opengrok.WebHome@logo.png||alt="logo.png"]]>>Project opengrok.WebHome]] 100: 101: === OpenGrok Internals 102: 103: Some details of OpenGrok internal workings can be found on [[OpenGrok Internals>>Project opengrok.internals]] page. 104: 105: === In Retrospect 106: 107: If people can not easily understand how a piece of software source code works, or if people cannot easily and accurately search through it, then it is as bad as any proprietary software or data formats which are difficult for general public to understand. 108: 109: OpenGrok in that sense is a true enabler for source code that is open source. It lets people easily and quickly find the source code, look at it, understand the history and changes made to the source. It makes a developers life easy. It certainly has made my life easy when I am looking for security holes in software. 110: 111: {{html}} 112: <script type="text/javascript"> 113: var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); 114: document.write(unescape("%3Cscript src=’" + gaJsHost + "google-analytics.com/ga.js’ type=’text/javascript’%3E%3C/script%3E")); 115: </script> 116: <script type="text/javascript"> 117: try { 118: var pageTracker = _gat._getTracker("UA-3665334-1"); 119: pageTracker._trackPageview(); 120: } catch(err) {}</script> 121: {{/html}}
Search
Project opengrok Pages
Discussions
Issues
Tentative Roadmap
Files / Downloads
Documentation
Supported Revision Control Systems
How to build OpenGrok from source
How to install OpenGrok
Internals
Story of OpenGrok
Collectives
Community Group
Academic and Research
Accessibility
Advocacy
Appliances
Approachability
Architecture Process and Tools
BrandZ
Chinese Users
Community Advisory Board
Databases
Desktop
Device Drivers
Distribution
Documentation
DTrace
Emerging Platforms
Fault Management
Games on OpenSolaris
HA Clusters
HPC Developer
Installation and Packaging
Internationalization and Localization
Laptop
Logical Domains
Modular Debugger (MDB)
Networking
NFS
Observability
OpenSolaris Governing Board (OGB)
OpenSolaris Printing
OS/Net (ON)
Performance
Power Management
PowerPC
Security
Service Management Facility (smf(5))
Software Porters
Solaris Volume Manager
Storage
Systems Administration Community Group
Testing
Tools Home
Unix File Systems (UFS)
Website Community
X Window System
Xen
ZFS
Zones
Project
ADSL Modem Enhancement
ARC Process Definition
ARM Platform Port
Automatic Data Migration
BIND Update
Bluetooth Stack & Drivers
Brocade FC HBA - Initiator
Brocade FC HBA - Target
Brussels - unified network link configuration
Caiman, Solaris Install Revisited
Celeste
Český portál
Chime Visualization Tool for DTrace
CIFS client for Solaris
CIFS Server
Clearview: Network Interface Coherence
Cluster Agent: Informix Dynamic Server
Cluster Agent: OpenSolaris Container
Cluster Agent: OpenSolaris xVM
Cluster Agent: Oracle E-Business Suite
Cluster agent: PostgreSQL
Cluster Agent: Samba
Cluster Agent: Tomcat
CMT
Coarse Data Flow Parallelism
Colorado: Open HA Cluster on OpenSolaris
Command Assistant
Common Array Manager
Companion - /opt/sfw: Free and Open Source software
COMSTAR: Common Multiprotocol SCSI Target
Content
Contest
CPU Observability
Credentials Process Groups
Crossbow: Network Virtualization and Resource Control
Crypto KMS Agent Toolkit
Cryptographic Framework
Data Migration Manager
Data Tethers
Deutsches Portal
Device Detection Tool
Device Driver Utility
Device Manager
Device Mapper
Direct Rendering Infrastructure & 3D drivers
DTrace Guide
Duckwater: Simplified name services management
Easy Tools
Emancipation
Emulex Fibre Channel Device Driver
Emulex Advanced Ethernet Device Driver
Enable/Enhance Solaris support for Intel Platform
Enhance the support of USB webcams
Enhanced SMF Profiles
Enhancements for AMD-based Platforms
Erlang DTrace Integration
Ethernet bridge module for Solaris
Evaluate Conary
Events Registry
Ext3 file system support
F/OSS Package Base
Facilitation
Fibre Channel over Ethernet
Fine Grained Access Policy (FGAP)
Fingerprint Authentication
Flexible Mandatory Access Control
Forensic Tools
Fully Open X Project
Fuse on Solaris
gcore
Generic Machine Check Architecture Improvements
Google SOC
HA-JBoss
HA-MySQL
Hadoop Live CD
Hitachi
HoneyComb Fixed Content Storage
HPC Stack
Image Packaging System
Improved Performance MIB
Indiana
Innovation Awards
Input Method
Intel Graphics
Interrupt Resource Management
IP Datapath Refactoring
IP over Infiniband
IPsec Tunnel Reform
iSCSI Extensions for Remote DMA (iSER)
iSNS Server
JeOS - Just enough Operating System
JKstat - a java binding for libkstat
Journaled File System (JFS)
K Desktop Environment
Kerberos
Kernel Sockets
Kernel SSL Enhancements
Key Management Framework
Korn Shell 93 integration/migration project
Labeled IPsec
LatencyTOP
Layer 2 Filtering
LDoms Manager
Lending
libMicro - portable microbenchmarks
Link Layer Discovery
Live Media: Technologies for distributions running from CD and other media
Locale Data
lofi compression and cryptography support
lx64 brand
Media Management System
Mega_sas
Mexico
MilaX minimal Live Distribution
MIPS Platform Port
Mozilla DTrace
MRSL.NONsharedDevice
Multi-lingual Glossary
Multi-pathing software (MPxIO)
Multiple disk sector size support
Multiple DOI
Muskoka: An open repository for OpenSolaris technical content
Navigator
Nemo: A Framework for High-Performance Networking
Network Auto-Magic
Network Data Management Protocol
Network MIBs
Network Storage
Network Time Protocol (NTP)
Nevada Globalization
New Design of 4over6 Mechanism Based on OpenSolaris
NFS RDMA transport update and performance analysis
NFS Server in non-Global Zones
NFS version 4.1 pNFS
NFSv4 namespace extensions
Nightingale: Port Songbird to OpenSolaris
NPort ID Virtualization (NPIV)
NUMA
Object Storage Device (OSD) support for Solaris
OHACGE Script Based Plug-in
ON/Nevada (ONNV) Project
Open Development Infrastructure
Open HA Cluster Utilities
Open Sound System
OpenGrok
OpenPegasus CIM Server
OpenRTI
OpenSolaris Busybox
OpenSolaris Desktop
OpenSolaris Hispano
OpenSolaris Security Audit
OpenSolaris support for the QEMU processor emulator: host and guest
PEF: Packet Event Framework
Performance Wrappers
Pkgfactory
Polski Portal
Portail Francophone
Portal Brasil
Portals
Power Management Usability Interfaces
Presto: Automatic Printing Configuration
Printable Many Page Solaris Manuals
Promise SuperTrak RAID HBA Driver
QLogic Converged Network Adapter GLDv3 NIC Driver
Quagga Routing Protocol Suite Integration
RAID Configuration Utility
RBridge (IETF TRILL) support
RDMA Offload Framework
Reno: Login Process Enhancements for Interop
Resource Management
s10brand
SAM/QFS
SCM Migration Project
SCSI RDMA Protocol
SDcard Drivers
Sensor Abstraction Layer
Session Initiation Protocol
SFW
Shell: bourne shell, korn shell, C shell, etc.
Sierra: Intel WiFi Chipsets Support
Simple Panels
SM-HBA Based SAS HBA Management
SMF Documentation
Solaris iSCSI Target
Solaris PowerPC Port
SourceJuicer
Sparks: name service switch/nscd enhancements
Squashfs
Star integration/migration project
Starfish
Starter Kit
Storage Power Management
Sun Security Toolkit
Sun StorageTek Availability Suite
Support for OpenFabrics User Verbs / API on OpenSolaris OS
Support gcc4/GCCfss in Solaris
Suspend/Resume
SVR4 Packaging
Systemz
Tamarack: Removable Media Enhancements in Solaris
Tesla: OpenSolaris Enhanced Power Management
Test Development
Tickless Kernel Architecture
TIPC
Trademarks
Trusted networking interface policy database for Trusted Extensions
Trusted Platform Module support
Use Case
Validated Execution Project
Virtual Console
Virtual Network Machines
Visual Panels
Visualization for HPC
Volo
VRRP: Virtual Router Redundancy Protocol Implementation
VSCAN service
Web Stack
Website
Winchester: Schema mapping and ID mapping for AD Interoperability
Wireless USB Support
Wireless Wide Area Network
X Consolidation
x86 Generic FMA Topology Enumerator
Xen Gate
Xfce: A lightweight desktop environment
ZFS Boot and Install
ZFS on disk encryption support
Zone Manager
Zone Statistics
Русский портал
البوابة العربية
भारतीय पोर्टल
中国门户
日本ポータル
한국 포탈
User Group
Adelaide
Argentina
Arizona
Atlanta
Baltimore-Washington
Bangalore
Bangkok
Bangladesh
Beijing
Bélem
Berlin
Bhimavaram
Bloomington
Campus Ambassadors
Capital Region
Cardiff
Charlotte
Chengdu
Chennai
Chihuahua
Chile
Cleveland
Colombia
Columbus
Connecticut
Cracow
Czech
Dallas/Ft. Worth
Danish
Delaware
Edinburgh
Egypt
Finland
Florida
Front Range
FuZhou
Great Lakes
Greece
Hangzhou
Hawaii
HeFei
Houston
Hyderabad
Indonesia
Irish
Israel
Italian
Jinan
Kabul
Kansas City
Latvia
London
Madurai
Manchester
Mato Grosso
Melbourne
Minas Gerais
Minnesota
Montreal
Moscow
Mumbai
Munich
NEA
Netherlands
New England
New York City
New Zealand
NIT Hamirpur
Noroeste
Oklahoma City
Osnabrück
Peru
Philadelphia
Piaski
Pittsburgh
Porto Alegre
Puget Sound
Pune
Queensland
Research Triangle Park
Romania
Russia
San Antonio
San Diego
San Francisco
São Paulo
Scottish
Serbia
Shanghai
Shenzhen
Silicon Valley
Singapore
Slovak
South African
Southern Connecticut
St. Louis
Sweden
Switzerland
Sydney
Szczecin
Taiwan
Tecum
Thames Valley
Tokyo
Toronto
Trondheim
Tulsa
Turkey
Ukraine
University of Melbourne
Vale do Paraíba
Vancouver
Venezuela
Welsh - Cymru
Wisconsin
Xi'an
Subsites
Code Reviews
Code Repositories
Package Search
Bugster
Bugzilla
Test Machines
Planet
Mailing Lists
Elections & Polls
ARC Case Logs
Source Juicer
Package Factory
User Authentication