The Media Cloud project is looking for a software engineer to focus on our data pipeline. We are an open source project that conducts primary research on the media ecosystem and helps others do their own, working with a corpus of over 1.5 billion news stories that grows by more than 700,000 daily. Pay is competitive, and while the initial contract will be for 4-6 months, we anticipate this role extending long-term.
In this role, you will:
* work on our server architecture, which collects and processes these stories and allows researchers to analyze them via an API; you will spend roughly half your time planning, designing, and building, and the other half maintaining and running the project's data pipeline;
* work with senior engineers to establish a technical vision for the project;
* contribute to and follow a technical roadmap to meet research needs and to complete grant deliverables;
* collaborate with other developers, designers, and system administrators in implementing the technical roadmap;
* accurately communicate project status internally and externally to our community of users;
* maintain, upgrade and build systems within an existing (rather large) codebase to collect, archive, and analyze content from online media;
* write code that can scale systems to handle ever-expanding data requirements.
Requirements:
* college degree or other domain-specific accreditation, preferably in a computer science or data science related field;
* at least two years' experience working as a software engineer on big data systems;
* programming fluency — Python required;
* some experience with Linux;
* demonstrated ability to design, build, test, and deploy robust code;
* demonstrated ability to iterate quickly through prototypes;
* demonstrated ability to use data to validate architectural decisions;
* ability to work productively in a virtual environment with remote team members all over the world;
* interest in working on issues related to hate-speech, democracy, gender, race, or health.
Preferred qualifications:
* experience implementing and maintaining a production ETL pipeline;
* experience scaling platforms to handle large data sets;
* experience writing web crawlers or API scrapers;
* experience writing, maintaining, and optimizing SQL queries against databases;
* experience working with PostgreSQL and Solr / Lucene in Ubuntu environments;
* experience working with text-based data systems (i.e., NLP);
* experience working in a modern dev / systems environment including Git and Docker.
Our upcoming technical roadmap includes ingesting new platforms into our data pipeline, analyzing images from news stories, and incorporating new sources of audience/readership data, as well as ongoing updates to improve the scalability, performance, and reliability of our existing pipeline.
We are a diverse and welcoming community of researchers and technologists who love to engage with hard questions about online media by using a combination of social, computer, and data sciences. You will work with all members of our small team, from senior faculty to junior developers, and thrive in an academic atmosphere that encourages experimentation, constant questioning, and validation at all levels of our platform.
Much of our substantive work focuses on issues of online hate-speech, race, democracy, and health. We strongly encourage women, people of color, and people of any sexual identity to apply.
Our entire team is remote, with members working all around the world.
We strive to make the interview process smooth and painless for both parties. Should you choose to submit your application for this position, you can expect the following:
1. Phone screen (30 mins);
2. Technical interview (1 hour);
3. Paid coding challenge (1 week to complete);
4. Team interviews (2 hours);
5. Final decision.