We focus on solving big data problems that impact the Intelligence Community and national security.
All of our completed work is available on GitHub.
Security & Machine Learning
Machine Learning Model Pipeline Security
Is your training data safe? Cyphercat demonstrates security vulnerabilities in machine learning pipelines and training data by attacking various model architectures. Attacks include model inversion (reconstructing an image from the training data) and membership inference (determining whether a specific piece of data is contained in the training set).
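The membership-inference idea can be illustrated with a toy threshold attack (a deliberate simplification with made-up confidence values, not Cyphercat's implementation): an overfit model tends to be unusually confident on samples it memorized during training, so high confidence hints at membership.

```python
def membership_inference(confidences, threshold=0.9):
    """Naive threshold attack: guess that samples on which the model is
    unusually confident were part of the training set."""
    return [c >= threshold for c in confidences]

# Hypothetical softmax confidences from a target model:
seen = [0.99, 0.97, 0.95]    # samples likely memorized during training
unseen = [0.55, 0.62, 0.48]  # samples the model probably never saw
guesses = membership_inference(seen + unseen)
```

Practical attacks refine this by training shadow models to learn what "confident" looks like, but the threshold version already shows why overfitting is a privacy risk.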
Open Source Speech Dataset
Transcription, Denoising, Speaker Separation, Speaker Identification
SRI International and Lab41, an In-Q-Tel lab, are proud to release the Voices Obscured in Complex Environmental Settings (VOiCES) corpus, a collaborative effort that brings speech data recorded in acoustically challenging, reverberant environments to researchers. Clean speech was recorded in rooms of different sizes, each with a distinct acoustic profile, while background noise played concurrently. These recordings provide audio data that better represent real-use scenarios. The intended purpose of this corpus is to promote acoustic research.
Speech Isolation using Deep Learning
Speech, Audio, Deep Learning, TensorFlow, Neural Networks
At cocktail parties it is often difficult to make out what someone is saying because several people are talking at once. Humans do a decent job of understanding anyway, in part because we have two ears that can determine the direction of a speaker. The same idea can be applied to microphones: using many microphones to resolve many speakers and isolate their speech signals. Current technologies use expensive microphone arrays, are limited in the environments they can operate in, or can isolate only a limited number of speakers. Magnolia proposes to use deep learning to break these constraints and isolate speech with commercial off-the-shelf (COTS) microphones in a variety of environmental conditions.
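Many deep-learning separation systems work by predicting a time-frequency mask over a spectrogram of the mixture. As a minimal sketch with toy numbers (not Magnolia's actual model), the "ideal binary mask" keeps each spectrogram bin where the target speaker is louder than the interference:

```python
import numpy as np

def ideal_binary_mask(target_mag, interference_mag):
    """Keep each time-frequency bin where the target speaker dominates."""
    return (target_mag > interference_mag).astype(float)

# Toy 2x3 magnitude spectrograms (rows = frequency bins, cols = frames):
target = np.array([[3.0, 0.1, 2.0],
                   [0.2, 4.0, 0.1]])
interference = np.array([[0.5, 2.0, 0.3],
                         [1.0, 0.2, 3.0]])
mixture = target + interference
mask = ideal_binary_mask(target, interference)
estimate = mixture * mask   # masked mixture approximates the target
```

A neural separator is trained to predict a mask like this from the mixture alone, without access to the clean target.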
Car Recognition using Deep Learning
Python, TensorFlow, Keras, Docker
Cars are ubiquitous in urban life. They are uniquely identifiable via their license plates, but unfortunately license plates are only visible from certain angles, and even then they are hard to read at a distance. Pelops will use deep-learning-based methods to automatically identify cars by their large-scale features: color, shape, light configuration, and so on. Pelops will also attempt to re-identify specific cars seen multiple times, enabling automatic pattern-of-life discovery.
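Re-identification systems of this kind typically embed each car image as a feature vector and match sightings by similarity. A rough sketch (hypothetical 4-dimensional embeddings and threshold; a real network would emit hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reidentify(query, gallery, threshold=0.9):
    """Return indices of gallery embeddings likely depicting the same car."""
    return [i for i, g in enumerate(gallery)
            if cosine_similarity(query, g) >= threshold]

# Hypothetical embeddings produced by a feature-extraction network:
red_sedan_a = np.array([0.9, 0.1, 0.4, 0.2])
red_sedan_b = np.array([0.85, 0.15, 0.42, 0.18])  # same car, new sighting
blue_truck = np.array([0.1, 0.9, 0.2, 0.7])
matches = reidentify(red_sedan_a, [red_sedan_b, blue_truck])
```

The hard part, which the deep model handles, is making embeddings of the same car land close together despite changes in angle, lighting, and distance.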
Recommending Code to Coders
Jupyter Notebooks, Docker, Spark, Mesos, Python
Software development and data science teams typically consolidate previous projects into a common repository, but fostering source-code reuse and algorithm discoverability remains a vexing challenge. Altair will apply collaborative-filtering and content-based-filtering recommender techniques from Lab41's previous Hermes challenge to galleries of Jupyter notebooks used by technical teams. The main goal is to identify similarities between user activity and among source-code segments so that a recommender system can predict a meaningful overlap between a user's needs and code in the repository that the user has not yet discovered.
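The collaborative-filtering side can be sketched on toy data (this is an illustration of the general technique, not Altair's implementation): score notebooks a user has not touched by how often they co-occur with the notebooks that user already uses.

```python
import numpy as np

# Rows = users, columns = notebooks; 1 means the user ran or starred it.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

def recommend(user, interactions):
    """Recommend the unseen item that co-occurs most with the user's items."""
    cooccurrence = interactions.T @ interactions   # item-item co-use counts
    np.fill_diagonal(cooccurrence, 0)              # ignore self-similarity
    scores = cooccurrence @ interactions[user]
    scores[interactions[user] > 0] = -1            # mask already-seen items
    return int(np.argmax(scores))
```

Content-based filtering would complement this by comparing the code inside the notebooks rather than usage patterns.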
Software Defined Network Situational Awareness
Software Defined Networking (SDN), GPUs, Python, Docker, Spark
This is a joint challenge between two IQT Labs: Lab41 and Cyber Reboot. Current software-defined networking offerings lack a tangible security emphasis, much less methods to enhance operational security. Without situational awareness and context, defending a network remains a difficult proposition. This challenge will use SDN and machine learning to determine what is on the network and what it is doing, helping sponsors leverage SDN to provide situational awareness and better defend their networks.
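Determining "what is on the network" often reduces to classifying devices from their traffic statistics. A minimal nearest-profile sketch (all profiles and feature values below are made up for illustration):

```python
# Toy traffic features per device role: (mean packet size, flows per minute)
profiles = {
    "printer": (300.0, 2.0),
    "workstation": (800.0, 40.0),
    "dns server": (120.0, 500.0),
}

def classify_device(features):
    """Assign a device the role whose traffic profile it sits closest to."""
    def dist(profile):
        return sum((a - b) ** 2 for a, b in zip(features, profile))
    return min(profiles, key=lambda name: dist(profiles[name]))

label = classify_device((750.0, 35.0))  # closest to the workstation profile
```

A production system would learn richer behavioral features from live flows rather than hand-coding profiles, and flag devices whose behavior drifts from their assigned role.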
Multimodal Joint Vector Representations
TensorFlow, Neon, Docker, Python
Current machine learning methods focus on classifying items into one of many classes. These techniques are often trained on one type of data (e.g., images) but ignore other information in the dataset (e.g., tags and metadata). The Attalos challenge focuses on building representations of images, text, and social networks that leverage all of this information together. Doing so will enable training classifiers that work across a variety of datasets.
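The core idea of a joint vector representation is that items from different modalities are projected into one shared space, where nearness implies relatedness. A toy sketch (the projection matrix and tag vectors below are invented; in practice both are learned jointly from data):

```python
import numpy as np

# Hypothetical tag embeddings in a shared 2-d space:
tag_vectors = {
    "cat": np.array([1.0, 0.0]),
    "car": np.array([0.0, 1.0]),
}

# Toy learned projection mapping a 3-d image feature into the tag space:
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

def nearest_tag(image_feature):
    """Project an image feature into the joint space and find the best tag."""
    joint = image_feature @ W
    joint = joint / np.linalg.norm(joint)
    return max(tag_vectors, key=lambda t: float(joint @ tag_vectors[t]))
```

Because text, images, and graph nodes all land in the same space, the same nearest-neighbor lookup supports tagging, retrieval, and cross-modal search.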
Visual Data Story Telling
Vega, Lyra, Cognitive and Perceptual Principles, Human Centered Design, Front-End Technologies
This challenge will employ various data visualization tools and user experience frameworks to construct cohesive data stories focused on communicating ripple-effect scenarios. Several event-based datasets from different disciplines will serve as the basis for data story development. Lab41 will develop an optimal front-end visualization development stack in which user experience is a driver. Our ultimate goal is to create a roadmap for how to approach visual data stories from technical considerations to user engagement.
Natural Language Processing & Text Classification
Python, TensorFlow, Docker, Neon
Pythia discovers new and useful information in large, dynamic document stores. It constructs systems to measure and locate new information in a document as it is ingested into a corpus, and explores predictive analytics that model document content to make existing structured metadata more informative.
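One simple baseline for "is this document new information?" is to compare its token set against every document already in the corpus; this bag-of-words sketch (not Pythia's actual pipeline) uses Jaccard similarity:

```python
def novelty_score(new_doc, corpus):
    """1 minus the max Jaccard similarity of the new document's token set
    against every existing document; scores near 1.0 mean mostly new."""
    new_tokens = set(new_doc.lower().split())
    def jaccard(doc):
        tokens = set(doc.lower().split())
        return len(new_tokens & tokens) / len(new_tokens | tokens)
    return 1.0 - max(jaccard(d) for d in corpus)

corpus = ["the satellite launched on schedule",
          "ground stations confirmed the satellite signal"]
```

Stronger approaches swap the token sets for learned document embeddings, but the thresholding logic stays the same.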
Scalable Security Log File Ingest and Analysis
Jupyter, Spark, Mesos, Python
The challenge will evaluate text-clustering machine learning algorithms and graph modeling for scalable system-log ingest and analysis. Lab41 will create a solution that can automatically identify and parse multiple log file formats, obviating the need to write a specialized parser for each new type. Our ultimate goal is to transform disparate text-based event-log content into a graph model for advanced analytics and reduced storage requirements.
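A common trick behind format-agnostic log clustering is to collapse variable fields (numbers, IPs, hex) into wildcards so that lines generated by the same format string group together. A minimal sketch of that idea (toy log lines, not the challenge's actual parser):

```python
import re
from collections import defaultdict

def template(line):
    """Collapse variable fields (IPs, hex values, numbers) into a wildcard
    so lines produced by the same format hash to the same key."""
    return re.sub(r"\b(?:\d{1,3}(?:\.\d{1,3}){3}|0x[0-9a-f]+|\d+)\b", "*", line)

def cluster(lines):
    groups = defaultdict(list)
    for line in lines:
        groups[template(line)].append(line)
    return groups

logs = [
    "accepted connection from 10.0.0.5 port 22",
    "accepted connection from 192.168.1.9 port 443",
    "disk usage at 91 percent",
]
groups = cluster(logs)
```

Each discovered template can then become a node or edge type in the graph model, with the extracted variable fields as attributes.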
Identifying Authorship From Images of Unstructured Handwriting
TensorFlow, Neon, Torch, Caffe, Theano
This challenge will evaluate the potential of neural networks to recognize authorship across a variety of unstructured handwriting images. The data will include different document types, paper textures, pens, and ranges of writer variation. We will implement state-of-the-art computer vision techniques that have shown promise in visual attention (which features to pay attention to) and sequential modeling (the order in which writers make pen strokes).
Recommender System Analysis
Jupyter, Spark, Mesos, Python
Hermes will compare the results of multiple recommender systems on a variety of datasets. These include common, conventional datasets traditionally used in recommender systems: movies, books, and news. The challenge will also explore programmatic datasets from GitHub and data from internal sources. Each dataset will be subjected to a variety of recommender systems so that we can compare and contrast a wide range of performance metrics.
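One of the standard metrics such comparisons rest on is precision at k: of the top k items a system recommends, how many did the user actually find relevant. A self-contained example with made-up movie IDs:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually liked."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

recommended = ["m3", "m7", "m1", "m9"]   # ranked output of a recommender
relevant = {"m1", "m3"}                  # items the user actually liked
score = precision_at_k(recommended, relevant, 3)   # 2 hits in the top 3
```

Comparing systems fairly means computing several such metrics (precision, recall, RMSE, coverage) across every dataset, since a system that wins on one metric often loses on another.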
Deep Learning Sentiment Analysis
Torch, Caffe, Theano, Pylearn2, Neon, Lua, Python, Docker, Spark, GPUs
This challenge will evaluate the feasibility of using architectures such as convolutional and recurrent neural networks to classify the positive, negative, or neutral sentiment of Twitter messages toward a specific topic. The ultimate goal is to help government sponsors better characterize opinions expressed toward topics and events of national-security importance.
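To see what the neural models are up against, here is the trivial lexicon-counting baseline they aim to beat (a deliberately simple technique, not the challenge's approach; the word lists are made up):

```python
def sentiment(tweet, positive, negative):
    """Label a tweet by counting lexicon hits: a crude baseline that neural
    models outperform by learning context, negation, and sarcasm."""
    tokens = tweet.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

positive = {"great", "love", "win"}
negative = {"awful", "hate", "fail"}
label = sentiment("love the great results", positive, negative)
```

Lexicon methods fail on phrases like "not great", which is exactly where sequence models such as RNNs earn their keep.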
Geo-Inference of Social Media Data
IPython, Spark, Mesos, Python
This challenge will employ various geospatial inference methods to determine the location of Twitter users. Lab41 will create and evaluate novel network-based and content-based approaches to inferring where users are based when they post a tweet.
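The network-based family of approaches exploits the observation that people tend to interact with others near them. A minimal majority-vote sketch over a toy follow graph (illustrative only, not the challenge's algorithm):

```python
from collections import Counter

def infer_location(user, follows, known_locations):
    """Guess an unlabeled user's location as the most common known location
    among the accounts they interact with."""
    votes = Counter(known_locations[f] for f in follows[user]
                    if f in known_locations)
    return votes.most_common(1)[0][0] if votes else None

follows = {"alice": ["bob", "carol", "dave"]}
known_locations = {"bob": "Denver", "carol": "Denver", "dave": "Austin"}
guess = infer_location("alice", follows, known_locations)
```

Iterating this propagation lets labels spread outward from the minority of users who geotag their tweets; content-based methods then add signals from the tweet text itself.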
Community Detection Evaluation
Python, igraph, SNAP, sklearn
Circulo is a Python framework for evaluating community detection algorithms. The framework calculates a variety of quantitative metrics on each resulting community, which can be used to draw conclusions about algorithm performance and efficacy.
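One representative metric of the kind such a framework computes is conductance, which measures how cleanly a community is separated from the rest of the graph (a from-scratch sketch on a toy edge list, not Circulo's own code):

```python
def conductance(edges, community):
    """Ratio of boundary-crossing edges to the community's edge volume;
    lower values indicate a better-separated community."""
    cut = sum(1 for u, v in edges if (u in community) != (v in community))
    volume = sum((u in community) + (v in community) for u, v in edges)
    return cut / volume

# Toy graph: a tight triangle {a, b, c} plus a tail d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
score = conductance(edges, {"a", "b", "c"})
```

Reporting several such metrics side by side (conductance, density, modularity) across algorithms is what lets the framework rank them fairly, since no single metric captures community quality alone.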
Streaming Updates to Graph Databases
Python, TitanDB
Lab41 conducted a market survey to assess the feature sets of existing open source graph databases and graph analytics platforms. We wanted to determine which would be most suitable for processing streaming updates to a large collection of graphs and triggering notifications when those updates cause certain conditions to be met or cease to be met.
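The trigger pattern being evaluated can be sketched independently of any particular database (toy condition and data; real systems register such predicates against the store's update stream):

```python
def process_stream(edge_stream, threshold=3):
    """Apply edge insertions one at a time and emit a notification the first
    time any vertex's degree reaches the threshold condition."""
    degree, alerts = {}, []
    for u, v in edge_stream:
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            if degree[node] == threshold:
                alerts.append(node)
    return alerts

stream = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
alerts = process_stream(stream)
```

The survey question is essentially which platform evaluates conditions like this incrementally on each update, rather than re-scanning the whole graph.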
Containerized Spark Analytics
Spark, Docker, IPython
This challenge explored how to deploy an Apache Spark cluster driven by IPython notebooks, running Docker containers for each component. By using IPython as the interface, we were able to perform a variety of data processing, machine learning, and visualization tasks using several data analysis tools and libraries.
Large-Scale Graph Visualization
Gephi, TinkerPop2
Rio enabled visualization of large-scale and streaming graphs. We employed Blueprints, an abstract specification for graphs, and Gephi, a prominent graph visualization package, to enable cross-interface interactions. By connecting the two, end users could use Gephi on Blueprints-enabled datastores such as the Titan distributed graph database.
Collaborative Graph Analytics
Titan Graph Database, GraphLab, JUNG Java Framework, Faunus Graph Engine, ElasticSearch, Rexster Graph Server, SpringMVC, AngularJS, Hadoop, HBase, BerkeleyDB
Dendrite illustrated how to use graph storage and analytics within a shared environment. Lab41 borrowed inspiration from distributed version control systems, such as Git, to provide a user interface for project management and collaboration around graph analytics.
File Anomaly Detection
Python, sklearn
Finding files of interest in large data collections is difficult for forensic analysts given the time and resources required. Redwood identified a subset of files from a larger collection by evaluating how strongly each file is associated with a known class.
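Scoring files against a known class can be illustrated with a centroid-distance sketch (invented two-dimensional features; this stands in for the general technique, not Redwood's specific model):

```python
def anomaly_scores(files, baseline):
    """Score each file by how far its feature vector sits from the centroid
    of a known-class baseline; larger scores are less associated with it."""
    dims = len(baseline[0])
    centroid = [sum(f[i] for f in baseline) / len(baseline) for i in range(dims)]
    def dist(f):
        return sum((a - b) ** 2 for a, b in zip(f, centroid)) ** 0.5
    return [dist(f) for f in files]

# Toy per-file features: (byte entropy, size in KB)
baseline = [(4.0, 100.0), (4.2, 110.0), (3.8, 90.0)]   # known-benign class
files = [(4.1, 105.0), (7.9, 900.0)]                   # second looks unusual
scores = anomaly_scores(files, baseline)
```

Ranking a collection by such scores lets an analyst triage the outliers first instead of examining every file.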
Interested in participating?
Join us on any of our in-progress projects, or our next challenge!
Have any interesting techniques or challenges we should consider?
We’d love to hear from you.
Get more insight into how we work and where you fit in.