Christina Lyu's CPT Project
Spiffy Query Set Reduction
I interned in the Engineering Practicum program at Google from May 21, 2018, to August 10, 2018.
The Engineering Practicum is a 12-week internship at Google designed specifically for freshmen and sophomores. Interns work in pairs and get help from hosts and co-hosts. It allows freshmen and sophomores to gain experience in the real industry with proper support and guidance from actual Google engineers. After two phone interviews and many rounds of project matching, I was placed on the Spiffy team.
My project was Query Set Reduction, supervised by the Google Spiffy team under Google Search. My podmate Cynoc and I worked on finding algorithms to reduce the size of the query set needed to test new search features.
Neither Cynoc nor I had any background with queries or the file formats used at Google. We started by learning the toolchains and understanding code that had already been submitted and was running in the team. We had some trouble collaborating at first because we didn't clearly divide our tasks, so we were doing the same things and writing the same code. Later, though, we each had different ideas about how to define redundant queries, so we went down different paths and, by the end of the internship, came up with two sets of algorithms that could remove 75% of the queries while still producing 95% of the results.
The graph below visualizes the filtering algorithms. The x-axis represents the number of queries, and the y-axis represents the percentage of results covered. The red and yellow dots represent the queries generated by my algorithm, and the dots below them represent the coverage achieved by randomly selected queries. For example, the first red point shows that the 5,000 queries generated by my algorithm cover around 90% of all the diffs, whereas 5,000 randomly selected queries cover only around 30% of the diffs. This shows that the queries generated by my algorithm are very efficient: we can run just those 5,000 queries and cover 90% of the diffs in only 10% of the original running time.
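The coverage idea behind this kind of filtering can be sketched as a greedy selection: repeatedly pick the query whose results surface the most not-yet-covered diffs, until a target coverage is reached. The sketch below is illustrative only, not the actual Spiffy code; the class and method names (`QueryFilter`, `selectQueries`) and the representation of diffs as integer IDs are assumptions.

```java
import java.util.*;

// Illustrative greedy coverage filter: choose a small subset of queries
// that together surface most of the diffs. Not the real Spiffy algorithm.
public class QueryFilter {

    // queryDiffs maps each query to the set of diff IDs it surfaces.
    // Returns queries picked greedily until targetCoverage of all diffs is covered.
    static List<String> selectQueries(Map<String, Set<Integer>> queryDiffs,
                                      double targetCoverage) {
        Set<Integer> allDiffs = new HashSet<>();
        for (Set<Integer> d : queryDiffs.values()) allDiffs.addAll(d);

        Set<Integer> covered = new HashSet<>();
        List<String> selected = new ArrayList<>();
        while (!allDiffs.isEmpty()
                && (double) covered.size() / allDiffs.size() < targetCoverage) {
            // Pick the query that adds the most new (uncovered) diffs.
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<Integer>> e : queryDiffs.entrySet()) {
                if (selected.contains(e.getKey())) continue;
                Set<Integer> gain = new HashSet<>(e.getValue());
                gain.removeAll(covered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = e.getKey();
                }
            }
            if (best == null) break; // no remaining query adds coverage
            selected.add(best);
            covered.addAll(queryDiffs.get(best));
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> q = new LinkedHashMap<>();
        q.put("q1", new HashSet<>(Arrays.asList(1, 2, 3)));
        q.put("q2", new HashSet<>(Arrays.asList(3, 4)));
        q.put("q3", new HashSet<>(Arrays.asList(1, 2)));
        System.out.println(selectQueries(q, 1.0)); // prints [q1, q2]
    }
}
```

Greedy selection like this is the standard approximation for set-cover-style problems, and it matches the behavior shown in the graph: a small, well-chosen fraction of the queries covers most diffs, while a random subset of the same size covers far less.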
Relationship with Smith
I was able to finish the internship with the knowledge from CSC 212 Java and Data Structures and CSC 231 Microprocessors and Assembly. Thanks to CSC 212, all my programming during the internship was done in Java, and I was able to juggle multiple data structures, including parallel collections. The C code we learned in Assembly allowed me to quickly adapt and read the C++ code in the codebase to understand what people in the team had already done.
- Developed an internal backend tool to reduce the run time of feature tests by filtering out superfluous queries
- Wrote customized interfaces and implementations to reduce the amount of code necessary
- Adapted pipelines to process big data with multiple workers in parallel
- Learned about and created protos and Markdown files
- Completed the full development process, including design doc writing, presentations, and teamwork
- Learned the Google toolchain and development tools and processes
- Attended a leadership talk series and a Java reading group